新建Mxnet任务

更新时间：2024-06-13

您可以新建一个Mxnet类型的任务。

前提条件

您已成功安装CCE AI Job Scheduler和CCE Deep Learning Frameworks Operator组件，否则云原生AI功能将无法使用。
若您是子用户，队列关联的用户中有您才能使用该队列新建任务。
安装组件CCE Deep Learning Frameworks Operator时，系统安装了Mxnet深度学习框架。

操作步骤

登录百度智能云官网，并进入管理控制台。
选择“产品服务 > 云原生 > 容器引擎 CCE”，单击进入容器引擎管理控制台。
单击左侧导航栏中的集群管理 > 集群列表。
在集群列表页面中，单击目标集群名称进入集群管理页面。
在集群管理页面单击云原生AI > 任务管理。
在任务管理页面单击新建任务。
在新建任务页面中，完成任务基本信息配置：

截屏2024-05-31 下午5.27.38.png

任务名称：自定义任务名称，支持小写字母、数字、以及-或.且开头与结尾必须是小写字母或者数字，长度 1-65。
命名空间：选择新建任务所在的命名空间。
选择队列：选择新建任务关联的队列。
任务优先级：选择任务对应的任务优先级。
允许超发：允许超发将使用任务抢占超发功能，需要先安装CCE AI Job Scheduler组件并升级到1.4.0及以上版本。
延迟容忍：系统将优先把任务或工作负载调度到集群碎片资源，以提高集群资源利用率，但可能对业务延迟行能造成影响。

完成代码基本信息配置：

截屏2024-05-31 下午5.32.15.png

代码配置类型：指定代码配置方式，目前支持“BOS文件”、“本地文件上传”、“git代码仓库“与“暂不配置”。
执行命令：指定代码的执行命令。

9.完成数据相关信息配置：

截屏2024-05-31 下午5.33.38.png

设置数据源：当前支持数据集、持久卷声明、选择临时路径和选择主机路径。选择数据集时列出所有可用的数据集，选择后会同时选择与数据集同名的持久卷声明；使用持久卷声明时直接选择即可。

10.点击“下一步”，进入容器相关配置。

11.完成任务类型相关信息配置：

截屏2024-05-31 下午6.06.41.png

选择框架：选择“ Mxnet ”。
训练方式：指定训练方式为“单机”或“分布式”。
选择角色：训练方式为“单机”时，只能选择“Woker”；训练方式为“分布式”时，可额外选择“PS”、“Chief”、“Evaluator”。

12.完成容器组相关信息配置，可以根据需要同时进行高级设置。

截屏2024-05-31 下午5.41.49.png

期望Pod数：指定容器组的Pod数目。
重启策略：指定容器组的重启策略，可选择的策略有“失败重启”或“从不重启”。
镜像地址：指定容器的镜像拉取地址，也可以直接点击“选择镜像”，选择需要使用的镜像。
镜像版本：指定镜像的版本，若不指定默认拉取latest版。
容器配额：指定容器的CPU、内存、GPU/NPU资源相关信息。
环境变量：填写变量名和变量值。
生命周期：包含启动命令、启动参数、启动后执行和停止前执行，可根据需要添加。

13.完成任务高级信息相关配置。

截屏2024-05-31 下午5.43.32.png

私有仓库凭证：若需要使用私有镜像仓库，请在此处添加对应镜像仓库的访问凭证。
Tensorboard：若需要任务可视化时，可开启Tensorboard功能，开启后需要指定“服务类型”与“ 训练日志读取路径”。
K8S标签：指定任务对应的K8S Label。
注释：指定任务对应的Annotation。

点击“完成”按钮，完成任务的新建。

Yaml创建任务示例

Plain Text

1apiVersion: "kubeflow.org/v1"
2kind: "MXJob"
3metadata:
4  name: "mxnet-job"
5spec:
6  jobMode: MXTrain
7  mxReplicaSpecs:
8    Scheduler:
9      replicas: 1
10      restartPolicy: Never
11      template:
12        metadata:
13          annotations:
14            sidecar.istio.io/inject: "false"
15            # if your libcuda.so.1 is in custom path, set the correct path with the following annotation
16            # kubernetes.io/baidu-cgpu.nvidia-driver-lib: /usr/lib64
17        spec:
18          schedulerName: volcano
19          containers:
20            - name: mxnet
21              image: registry.baidubce.com/cce-public/kubeflow/cce-public/mxjob/mxnet:gpu
22              resources:
23                limits:
24                  baidu.com/v100_32g_cgpu: "1"
25                  # for gpu core/memory isolation
26                  baidu.com/v100_32g_cgpu_core: 5
27                  baidu.com/v100_32g_cgpu_memory: "1"
28              # if gpu core isolation is enabled, set the following preStop hook for graceful shutdown.
29              # ${'`'}train_mnist.py${'`'} needs to be replaced with the name of your gpu process.
30              lifecycle:
31                preStop:
32                  exec:
33                    command: [
34                      "/bin/sh", "-c",
35                      "kill -10 ${'`'}ps -ef | grep train_mnist.py | grep -v grep | awk '{print $2}'${'`'} && sleep 1"
36                    ]
37    Server:
38      replicas: 1
39      restartPolicy: Never
40      template:
41        metadata:
42          annotations:
43            sidecar.istio.io/inject: "false"
44            # if your libcuda.so.1 is in custom path, set the correct path with the following annotation
45            # kubernetes.io/baidu-cgpu.nvidia-driver-lib: /usr/lib64
46        spec:
47          schedulerName: volcano
48          containers:
49            - name: mxnet
50              image: registry.baidubce.com/cce-public/kubeflow/cce-public/mxjob/mxnet:gpu
51              resources:
52                limits:
53                  baidu.com/v100_32g_cgpu: "1"
54                  # for gpu core/memory isolation
55                  baidu.com/v100_32g_cgpu_core: 5
56                  baidu.com/v100_32g_cgpu_memory: "1"
57              # if gpu core isolation is enabled, set the following preStop hook for graceful shutdown.
58              # ${'`'}train_mnist.py${'`'} needs to be replaced with the name of your gpu process.
59              lifecycle:
60                preStop:
61                  exec:
62                    command: [
63                      "/bin/sh", "-c",
64                      "kill -10 ${'`'}ps -ef | grep train_mnist.py | grep -v grep | awk '{print $2}'${'`'} && sleep 1"
65                    ]
66    Worker:
67      replicas: 1
68      restartPolicy: Never
69      template:
70        metadata:
71          annotations:
72            sidecar.istio.io/inject: "false"
73            # if your libcuda.so.1 is in custom path, set the correct path with the following annotation
74            # kubernetes.io/baidu-cgpu.nvidia-driver-lib: /usr/lib64
75        spec:
76          schedulerName: volcano
77          containers:
78            - name: mxnet
79              image: registry.baidubce.com/cce-public/kubeflow/cce-public/mxjob/mxnet:gpu
80              env:
81                # for gpu memory over request, set 0 to disable
82                - name: CGPU_MEM_ALLOCATOR_TYPE
83                  value: “1”
84              command: ["python"]
85              args: [
86                "/incubator-mxnet/example/image-classification/train_mnist.py",
87                "--num-epochs","10","--num-layers","2","--kv-store","dist_device_sync","--gpus","0"
88              ]
89              resources:
90                requests:
91                  cpu: 1
92                  memory: 1Gi
93                limits:
94                  baidu.com/v100_32g_cgpu: "1"
95                  # for gpu core/memory isolation
96                  baidu.com/v100_32g_cgpu_core: 20
97                  baidu.com/v100_32g_cgpu_memory: "4"
98              # if gpu core isolation is enabled, set the following preStop hook for graceful shutdown.
99              # ${'`'}train_mnist.py${'`'} needs to be replaced with the name of your gpu process.
100              lifecycle:
101                preStop:
102                  exec:
103                    command: [
104                      "/bin/sh", "-c",
105                      "kill -10 ${'`'}ps -ef | grep train_mnist.py | grep -v grep | awk '{print $2}'${'`'} && sleep 1"
106                    ]

新建Pytorch任务

新建PaddlePaddle任务

容器引擎 CCE

新建Mxnet任务

前提条件

操作步骤

Yaml创建任务示例