Task-based Modeling

Last updated: 2024-07-11 09:10:11

Namespace

Namespace = QCE/TI_TRAINTASK

Monitoring metrics

| Metric name | Description | Unit | Dimension | Statistical rules [period, statType] |
| --- | --- | --- | --- | --- |
| CfsClientDataReadBandwidth | CFS Turbo single-node server-side read bandwidth | KBytes/s | InstanceId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| CfsClientDataWriteBandwidth | CFS Turbo single-node server-side write bandwidth | KBytes/s | InstanceId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| CfsDataReadIoBytes | CFS server-side read bandwidth | KBytes/s | InstanceId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| CfsDataReadIoLatency | CFS read latency | ms | InstanceId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| CfsDataWriteIoBytes | CFS server-side write bandwidth | KBytes/s | InstanceId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| CfsDataWriteIoLatency | CFS write latency | ms | InstanceId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| CfsStrageUsageGb | CFS stored data capacity | GBytes | InstanceId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| Cpuutil | CPU utilization | % | TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| DcgmFiDevFbUsed | GPU memory used | MBytes | taskInsGpuNum | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| DcgmFiDevGpuUtil | GPU utilization | % | taskInsGpuNum | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| DcgmFiDevMemCopyUtil | GPU memory utilization | % | taskInsGpuNum | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| DiskIoUtil | Disk I/O utilization (ioutil) | % | InstanceId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| DiskIoWait | Disk I/O wait (iowait) | % | InstanceId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| DiskReadByte | Disk read bandwidth | MBytes/s | InstanceId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| DiskReadIops | Disk read IOPS | Count | InstanceId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| DiskUsageRadio | System disk partition utilization | % | InstanceId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| DiskWriteByte | Disk write bandwidth | MBytes/s | InstanceId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| DiskWriteIops | Disk write IOPS | Count | InstanceId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| Fp16EngineActivity | FP16 active time ratio | % | InstanceId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| Fp32EngineActivity | FP32 active time ratio | % | InstanceId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| Fp64EngineActivity | FP64 active time ratio | % | InstanceId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| GpuFp16EngineActivity | FP16 active time ratio | % | taskInsGpuNum | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| GpuFp32EngineActivity | FP32 active time ratio | % | taskInsGpuNum | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| GpuFp64EngineActivity | FP64 active time ratio | % | taskInsGpuNum | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| Gpumemutil | GPU memory utilization | % | TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| Gpumemvalue | GPU memory used | MBytes | TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| GpuNvlinkBandwidth | NVLink transfer rate | Bytes/s | taskInsGpuNum | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| GpuPcieBandwidth | PCIe bus transfer rate | Bytes/s | taskInsGpuNum | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| GpuRdmaInpkt | RDMA NIC inbound packets | pps | taskInsGpuNum | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| GpuRdmaIntraffic | RDMA NIC receive bandwidth | Mbps | taskInsGpuNum | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| GpuRdmaOutpkt | RDMA NIC outbound packets | pps | taskInsGpuNum | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| GpuRdmaOuttraffic | RDMA NIC send bandwidth | Mbps | taskInsGpuNum | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| GpuSmActivity | SM active time ratio | % | taskInsGpuNum | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| GpuTensorActivity | Tensor active time ratio | % | taskInsGpuNum | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| Gpuutil | GPU utilization | % | TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| GroupCpuUsage | CPU utilization | % | group_id | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] |
| GroupGpuUtil | GPU utilization | % | group_id | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] |
| GroupLanInTraffic | Private network inbound bandwidth | Mbps | group_id | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] |
| GroupLanOutTraffic | Private network outbound bandwidth | Mbps | group_id | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] |
| GroupMemUsage | Memory utilization | % | group_id | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] |
| GroupWanInTraffic | Public network inbound bandwidth | Mbps | group_id | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] |
| GroupWanOutratio | Public network bandwidth utilization | % | group_id | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] |
| GroupWanOutTraffic | Public network outbound bandwidth | Mbps | group_id | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] |
| InsCpuUsage | CPU utilization | % | instance_id | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] |
| InsGpuUtil | GPU utilization | % | instance_id | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] |
| InsLanInTraffic | Private network inbound bandwidth | Mbps | instance_id | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] |
| InsLanOutTraffic | Private network outbound bandwidth | Mbps | instance_id | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] |
| InsMemUsage | Memory utilization | % | instance_id | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] |
| Instancecpuutil | CPU utilization | % | InstanceId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| Instancegpumemutil | GPU memory utilization | % | InstanceId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| Instancegpumemvalue | GPU memory used | MBytes | InstanceId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| Instancegpuutil | GPU utilization | % | InstanceId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| Instancememutil | Memory utilization | % | InstanceId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| Instancememvalue | Memory used | MBytes | InstanceId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| InsWanInTraffic | Public network inbound bandwidth | Mbps | instance_id | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] |
| InsWanOutratio | Public network bandwidth utilization | % | instance_id | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] |
| InsWanOutTraffic | Public network outbound bandwidth | Mbps | instance_id | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] |
| Memutil | Memory utilization | % | TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| Memvalue | Memory used | MBytes | TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| NvlinkBandwidth | NVLink transfer rate | Bytes/s | InstanceId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| PcieBandwidth | PCIe bus transfer rate | Bytes/s | InstanceId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| RdmaInpkt | RDMA NIC inbound packets | pps | InstanceId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| RdmaIntraffic | RDMA NIC receive bandwidth | Mbps | InstanceId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| RdmaOutpkt | RDMA NIC outbound packets | pps | InstanceId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| RdmaOuttraffic | RDMA NIC send bandwidth | Mbps | InstanceId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| SmActivity | SM active time ratio | % | InstanceId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| TaskCfsClientDataReadBandwidth | CFS Turbo single-node server-side read bandwidth | KBytes/s | TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| TaskCfsClientDataWriteBandwidth | CFS Turbo single-node server-side write bandwidth | KBytes/s | TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| TaskCfsDataReadIoBytes | CFS server-side read bandwidth | KBytes/s | TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| TaskCfsDataReadIoLatency | CFS read latency | ms | TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| TaskCfsDataWriteIoBytes | CFS server-side write bandwidth | KBytes/s | TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| TaskCfsDataWriteIoLatency | CFS write latency | ms | TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| TaskCfsStrageUsageGb | CFS stored data capacity | GBytes | TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| TaskDiskIoUtil | Disk I/O utilization (ioutil) | % | TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| TaskDiskIoWait | Disk I/O wait (iowait) | % | TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| TaskDiskReadByte | Disk read bandwidth | MBytes/s | TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| TaskDiskReadIops | Disk read IOPS | Count | TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| TaskDiskUsageRadio | System disk partition utilization | % | TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| TaskDiskWriteByte | Disk write bandwidth | MBytes/s | TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| TaskDiskWriteIops | Disk write IOPS | Count | TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| TaskFp16EngineActivity | FP16 active time ratio | % | TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| TaskFp32EngineActivity | FP32 active time ratio | % | TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| TaskFp64EngineActivity | FP64 active time ratio | % | TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| TaskNvlinkBandwidth | NVLink transfer rate | Bytes/s | TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| TaskPcieBandwidth | PCIe bus transfer rate | Bytes/s | TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| TaskRdmaInpkt | RDMA NIC inbound packets | pps | TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| TaskRdmaIntraffic | RDMA NIC receive bandwidth | Mbps | TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| TaskRdmaOutpkt | RDMA NIC outbound packets | pps | TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| TaskRdmaOuttraffic | RDMA NIC send bandwidth | Mbps | TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| TaskSmActivity | SM active time ratio | % | TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| TaskTensorActivity | Tensor active time ratio | % | TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
| TensorActivity | Tensor active time ratio | % | InstanceId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
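
In the metric list above, the available statistical periods follow the dimension: metrics keyed by InstanceId, TaskId, or taskInsGpuNum list five periods (10s to 86400s), while metrics keyed by group_id or instance_id list only three (10s, 60s, 300s). The following Python sketch encodes that observation so a period can be checked before a query is sent. The helper name `is_supported_period` and the lookup table are illustrative assumptions, not part of any SDK.

```python
from typing import Dict, Tuple

# Periods in seconds, as listed in the metric list for each dimension.
LONG_PERIODS: Tuple[int, ...] = (10, 60, 300, 3600, 86400)
SHORT_PERIODS: Tuple[int, ...] = (10, 60, 300)

# Hypothetical lookup table derived from the metric list above.
PERIODS_BY_DIMENSION: Dict[str, Tuple[int, ...]] = {
    "InstanceId": LONG_PERIODS,
    "TaskId": LONG_PERIODS,
    "taskInsGpuNum": LONG_PERIODS,
    "group_id": SHORT_PERIODS,
    "instance_id": SHORT_PERIODS,
}


def is_supported_period(dimension: str, period_seconds: int) -> bool:
    """Return True if the period is listed for metrics of this dimension."""
    try:
        allowed = PERIODS_BY_DIMENSION[dimension]
    except KeyError:
        raise ValueError(f"unknown dimension: {dimension!r}") from None
    return period_seconds in allowed
```

For example, `is_supported_period("group_id", 3600)` is false because resource-group metrics such as GroupCpuUsage list no hourly period.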

Overview of parameters for each dimension

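| Parameter name | Dimension name | Dimension description | Format |
| --- | --- | --- | --- |
| Instances.N.Dimensions.0.Name | InstanceId | Training task instance ID | Enter the String-type dimension name: InstanceId |
| Instances.N.Dimensions.0.Value | InstanceId | Training task instance ID | Enter the specific instance ID, for example: train-9187850047592xxxxx-6zaq3zh9mvpc-master-0 |
| Instances.N.Dimensions.0.Name | TaskId | Training task ID | Enter the String-type dimension name: TaskId |
| Instances.N.Dimensions.0.Value | TaskId | Training task/notebook ID | Enter the specific task ID, for example: train-9187850047592xxxxx |
| Instances.N.Dimensions.0.Name | taskInsGpuNum | GPU card number used by the training task instance (GPU tasks only) | Enter the String-type dimension name: taskInsGpuNum |
| Instances.N.Dimensions.0.Value | taskInsGpuNum | GPU card number used by the training task instance (GPU tasks only) | Enter the training task instance ID joined with a GPU card number or avg, for example: train-9187850047592xxxxx-6zaq3zh9mvpc-master-0-0 or train-9187850047592xxxxx-6zaq3zh9mvpc-master-0-avg |
| Instances.N.Dimensions.0.Name | instance_id | ID of a resource in the resource group | Enter the String-type dimension name: instance_id |
| Instances.N.Dimensions.0.Value | instance_id | ID of a resource in the resource group | Enter the specific instance_id, for example: tins-dn8nkg82 |
| Instances.N.Dimensions.0.Name | group_id | Resource group ID | Enter the String-type dimension name: group_id |
| Instances.N.Dimensions.0.Value | group_id | Resource group ID | Enter the specific group_id, for example: trsg-8rpgr4k6 |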
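
The dimension parameters above can be assembled into a query body as in the following Python sketch. It assumes the target is the Tencent Cloud Monitor `GetMonitorData` action, whose JSON body nests `Instances` > `Dimensions` > `Name`/`Value` (the flattened form being `Instances.N.Dimensions.0.Name`); this page does not name the action, so treat that as an assumption. The helper names `gpu_dimension_value` and `build_query` are hypothetical, and the sketch only builds the payload without signing or sending it.

```python
import json
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, Union

NAMESPACE = "QCE/TI_TRAINTASK"  # namespace stated at the top of this page


def gpu_dimension_value(instance_id: str, gpu: Union[int, str]) -> str:
    """Join a training task instance ID with a GPU card number or 'avg'."""
    return f"{instance_id}-{gpu}"


def build_query(metric_name: str, dimension_name: str, dimension_value: str,
                period: int, start: datetime, end: datetime) -> Dict[str, Any]:
    """Build a request body for one metric and one dimension (not sent)."""
    return {
        "Namespace": NAMESPACE,
        "MetricName": metric_name,
        "Period": period,  # seconds; must be a period listed for the metric
        "StartTime": start.isoformat(),
        "EndTime": end.isoformat(),
        "Instances": [
            {"Dimensions": [{"Name": dimension_name, "Value": dimension_value}]}
        ],
    }


if __name__ == "__main__":
    tz = timezone(timedelta(hours=8))
    end = datetime(2024, 7, 11, 9, 0, tzinfo=tz)
    # Instance ID copied from the example in the table above.
    value = gpu_dimension_value(
        "train-9187850047592xxxxx-6zaq3zh9mvpc-master-0", "avg")
    body = build_query("DcgmFiDevGpuUtil", "taskInsGpuNum", value,
                       60, end - timedelta(hours=1), end)
    print(json.dumps(body, indent=2))
```

Passing a card number such as `0` instead of `"avg"` selects a single GPU, matching the two example values given for taskInsGpuNum.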