Namespace
Namespace = QCE/TI_TRAINTASK
Monitoring metrics
Metric | Display name | Description | Unit | Dimensions | Statistical rules [period, statType] |
CfsClientDataReadBandwidth | turocfs single-node server-side read bandwidth | turocfs single-node server-side read bandwidth | KBytes/s | AppId InstanceId SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
CfsClientDataWriteBandwidth | turocfs single-node server-side write bandwidth | turocfs single-node server-side write bandwidth | KBytes/s | AppId InstanceId SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
CfsDataReadIoBytes | CFS server-side read bandwidth | CFS server-side read bandwidth | KBytes/s | AppId InstanceId SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
CfsDataReadIoLatency | CFS read latency | CFS read latency | ms | AppId InstanceId SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
CfsDataWriteIoBytes | CFS server-side write bandwidth | CFS server-side write bandwidth | KBytes/s | AppId InstanceId SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
CfsDataWriteIoLatency | CFS write latency | CFS write latency | ms | AppId InstanceId SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
CfsStrageUsageGb | CFS stored data volume | CFS stored data volume | GBytes | AppId InstanceId SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
Cpuutil | CPU utilization | CPU utilization | % | AppId SubUin TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
DcgmFiDevFbUsed | GPU memory used | GPU memory used | MBytes | AppId InstanceGpuNum SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
DcgmFiDevGpuUtil | GPU utilization | GPU utilization | % | AppId InstanceGpuNum SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
DcgmFiDevMemCopyUtil | GPU memory utilization | GPU memory utilization | % | AppId InstanceGpuNum SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
DiskIoUtil | Disk ioutil | Disk ioutil | % | AppId InstanceId SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
DiskIoWait | Disk iowait | Disk iowait | % | AppId InstanceId SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
DiskReadByte | Disk read bandwidth | Disk read bandwidth | MBytes/s | AppId InstanceId SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
DiskReadIops | Disk read IOPS | Disk read IOPS | Count | AppId InstanceId SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
DiskUsageRadio | System disk partition utilization | System disk partition utilization | % | AppId InstanceId SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
DiskWriteByte | Disk write bandwidth | Disk write bandwidth | MBytes/s | AppId InstanceId SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
DiskWriteIops | Disk write IOPS | Disk write IOPS | Count | AppId InstanceId SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
Fp16EngineActivity | FP16 active time ratio | FP16 active time ratio | % | AppId InstanceId SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
Fp32EngineActivity | FP32 active time ratio | FP32 active time ratio | % | AppId InstanceId SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
Fp64EngineActivity | FP64 active time ratio | FP64 active time ratio | % | AppId InstanceId SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
GpuFp16EngineActivity | FP16 active time ratio | FP16 active time ratio | % | AppId InstanceGpuNum SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
GpuFp32EngineActivity | FP32 active time ratio | FP32 active time ratio | % | AppId InstanceGpuNum SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
GpuFp64EngineActivity | FP64 active time ratio | FP64 active time ratio | % | AppId InstanceGpuNum SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
Gpumemutil | GPU memory utilization | GPU memory utilization | % | AppId SubUin TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
Gpumemvalue | GPU memory used | GPU memory used | MBytes | AppId SubUin TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
GpuNvlinkBandwidth | NVLink transfer rate | NVLink transfer rate | Bytes/s | AppId InstanceGpuNum SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
GpuPcieBandwidth | PCIe bus transfer rate | PCIe bus transfer rate | Bytes/s | AppId InstanceGpuNum SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
GpuSmActivity | SM active time ratio | SM active time ratio | % | AppId InstanceGpuNum SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
GpuTensorActivity | Tensor active time ratio | Tensor active time ratio | % | AppId InstanceGpuNum SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
Gpuutil | GPU utilization | GPU utilization | % | AppId SubUin TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
Instancecpuutil | CPU utilization | CPU utilization | % | AppId InstanceId SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
Instancegpumemutil | GPU memory utilization | GPU memory utilization | % | AppId InstanceId SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
Instancegpumemvalue | GPU memory used | GPU memory used | MBytes | AppId InstanceId SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
Instancegpuutil | GPU utilization | GPU utilization | % | AppId InstanceId SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
Instancememutil | Memory utilization | Memory utilization | % | AppId InstanceId SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
Instancememvalue | Memory used | Memory used | MBytes | AppId InstanceId SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
Memutil | Memory utilization | Memory utilization | % | AppId SubUin TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
Memvalue | Memory used | Memory used | MBytes | AppId SubUin TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
NvlinkBandwidth | NVLink transfer rate | NVLink transfer rate | Bytes/s | AppId InstanceId SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
PcieBandwidth | PCIe bus transfer rate | PCIe bus transfer rate | Bytes/s | AppId InstanceId SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
RdmaInpkt | RDMA NIC inbound packets | RDMA NIC inbound packets | pps | AppId InstanceId SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
RdmaIntraffic | RDMA NIC receive bandwidth | RDMA NIC receive bandwidth | Mbps | AppId InstanceId SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
RdmaOutpkt | RDMA NIC outbound packets | RDMA NIC outbound packets | pps | AppId InstanceId SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
RdmaOuttraffic | RDMA NIC send bandwidth | RDMA NIC send bandwidth | Mbps | AppId InstanceId SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
SmActivity | SM active time ratio | SM active time ratio | % | AppId InstanceId SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
TaskCfsClientDataReadBandwidth | turocfs single-node server-side read bandwidth | turocfs single-node server-side read bandwidth | KBytes/s | AppId SubUin TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
TaskCfsClientDataWriteBandwidth | turocfs single-node server-side write bandwidth | turocfs single-node server-side write bandwidth | KBytes/s | AppId SubUin TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
TaskCfsDataReadIoBytes | CFS server-side read bandwidth | CFS server-side read bandwidth | KBytes/s | AppId SubUin TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
TaskCfsDataReadIoLatency | CFS read latency | CFS read latency | ms | AppId SubUin TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
TaskCfsDataWriteIoBytes | CFS server-side write bandwidth | CFS server-side write bandwidth | KBytes/s | AppId SubUin TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
TaskCfsDataWriteIoLatency | CFS write latency | CFS write latency | ms | AppId SubUin TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
TaskCfsStrageUsageGb | CFS stored data volume | CFS stored data volume | GBytes | AppId SubUin TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
TaskDiskIoUtil | Disk ioutil | Disk ioutil | % | AppId SubUin TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
TaskDiskIoWait | Disk iowait | Disk iowait | % | AppId SubUin TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
TaskDiskReadByte | Disk read bandwidth | Disk read bandwidth | MBytes/s | AppId SubUin TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
TaskDiskReadIops | Disk read IOPS | Disk read IOPS | Count | AppId SubUin TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
TaskDiskUsageRadio | System disk partition utilization | System disk partition utilization | % | AppId SubUin TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
TaskDiskWriteByte | Disk write bandwidth | Disk write bandwidth | MBytes/s | AppId SubUin TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
TaskDiskWriteIops | Disk write IOPS | Disk write IOPS | Count | AppId SubUin TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
TaskFp16EngineActivity | FP16 active time ratio | FP16 active time ratio | % | AppId SubUin TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
TaskFp32EngineActivity | FP32 active time ratio | FP32 active time ratio | % | AppId SubUin TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
TaskFp64EngineActivity | FP64 active time ratio | FP64 active time ratio | % | AppId SubUin TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
TaskNvlinkBandwidth | NVLink transfer rate | NVLink transfer rate | Bytes/s | AppId SubUin TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
TaskPcieBandwidth | PCIe bus transfer rate | PCIe bus transfer rate | Bytes/s | AppId SubUin TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
TaskRdmaInpkt | RDMA NIC inbound packets | RDMA NIC inbound packets | pps | AppId SubUin TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
TaskRdmaIntraffic | RDMA NIC receive bandwidth | RDMA NIC receive bandwidth | Mbps | AppId SubUin TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
TaskRdmaOutpkt | RDMA NIC outbound packets | RDMA NIC outbound packets | pps | AppId SubUin TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
TaskRdmaOuttraffic | RDMA NIC send bandwidth | RDMA NIC send bandwidth | Mbps | AppId SubUin TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
TaskSmActivity | SM active time ratio | SM active time ratio | % | AppId SubUin TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
TaskTensorActivity | Tensor active time ratio | Tensor active time ratio | % | AppId SubUin TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
TensorActivity | Tensor active time ratio | Tensor active time ratio | % | AppId InstanceId SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
GpuDecUtil | GPU decoder utilization | GPU decoder utilization | % | AppId InstanceId SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
GpuEncUtil | GPU encoder utilization | GPU encoder utilization | % | AppId InstanceId SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
GpuMemoryClock | GPU memory clock frequency | GPU memory clock frequency | s | AppId InstanceId SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
GpuMemoryFree | GPU free memory | GPU free memory | MBytes | AppId InstanceId SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
GpuMemoryUtil | GPU memory utilization | GPU memory utilization | % | AppId InstanceId SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
GpuNvlinkRxMb | NVLink data received | NVLink data received | Mbps | AppId InstanceId SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
GpuNvlinkTxMb | NVLink data sent | NVLink data sent | Mbps | AppId InstanceId SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
GpuPcieRxMb | PCIe data received | PCIe data received | Mbps | AppId InstanceId SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
GpuPcieTxMb | PCIe data sent | PCIe data sent | Mbps | AppId InstanceId SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
GpuSmClock | SM clock frequency | SM clock frequency | s | AppId InstanceId SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
TaskGpuDecUtil | GPU decoder utilization | GPU decoder utilization | % | AppId SubUin TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
TaskGpuEncUtil | GPU encoder utilization | GPU encoder utilization | % | AppId SubUin TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
TaskGpuMemoryClock | GPU memory clock frequency | GPU memory clock frequency | s | AppId SubUin TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
TaskGpuMemoryFree | GPU free memory | GPU free memory | MBytes | AppId SubUin TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
TaskGpuMemoryUtil | GPU memory bandwidth utilization | GPU memory bandwidth utilization | % | AppId SubUin TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
TaskGpuNvlinkRxMb | NVLink data received | NVLink data received | Mbps | AppId SubUin TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
TaskGpuNvlinkTxMb | NVLink data sent | NVLink data sent | Mbps | AppId SubUin TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
TaskGpuPcieRxMb | PCIe data received | PCIe data received | Mbps | AppId SubUin TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
TaskGpuPcieTxMb | PCIe data sent | PCIe data sent | Mbps | AppId SubUin TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
TaskGpuSmClock | SM clock frequency | SM clock frequency | s | AppId SubUin TaskId | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
TaskGpuDecUtilGpu | GPU decoder utilization | GPU decoder utilization | % | AppId InstanceGpuNum SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
TaskGpuEncUtilGpu | GPU encoder utilization | GPU encoder utilization | % | AppId InstanceGpuNum SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
TaskGpuMemoryClockGpu | GPU memory clock frequency | GPU memory clock frequency | s | AppId InstanceGpuNum SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
TaskGpuMemoryFreeGpu | GPU free memory | GPU free memory | MBytes | AppId InstanceGpuNum SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
TaskGpuMemoryUtilGpu | GPU memory bandwidth utilization | GPU memory bandwidth utilization | % | AppId InstanceGpuNum SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
TaskGpuNvlinkRxMbGpu | NVLink data received | NVLink data received | Mbps | AppId InstanceGpuNum SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
TaskGpuNvlinkTxMbGpu | NVLink data sent | NVLink data sent | Mbps | AppId InstanceGpuNum SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
TaskGpuPcieRxMbGpu | PCIe data received | PCIe data received | Mbps | AppId InstanceGpuNum SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
TaskGpuPcieTxMbGpu | PCIe data sent | PCIe data sent | Mbps | AppId InstanceGpuNum SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
TaskGpuSmClockGpu | SM clock frequency | SM clock frequency | s | AppId InstanceGpuNum SubUin | [ 10s, avg ] [ 60s, avg ] [ 300s, avg ] [ 3600s, avg ] [ 86400s, avg ] |
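Every metric in the table is reported at the same five periods with the avg statistic. As a rough illustration, the Python sketch below picks the finest listed period that keeps a query within a caller-chosen number of data points. The function name `pick_period` and the `max_points` budget are hypothetical helpers for this example and are not part of the monitoring API.

```python
# Periods (in seconds) listed in the table above; every metric uses statType avg.
SUPPORTED_PERIODS = (10, 60, 300, 3600, 86400)


def pick_period(span_seconds: int, max_points: int) -> int:
    """Return the finest supported period that yields at most max_points points.

    Falls back to the coarsest period when even that exceeds the budget.
    max_points is a hypothetical caller-side budget, not a documented API limit.
    """
    if span_seconds <= 0 or max_points <= 0:
        raise ValueError("span_seconds and max_points must be positive")
    for period in SUPPORTED_PERIODS:
        points = -(-span_seconds // period)  # ceiling division
        if points <= max_points:
            return period
    return SUPPORTED_PERIODS[-1]
```

For example, `pick_period(86400, 1440)` selects the 60s period for a one-day window, because the 10s period would produce more points than the budget allows.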
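Dimension parameter overview
Parameter name | Dimension name | Dimension description | Format |
Instances.N.Dimensions.0.Name | AppId | Dimension name for the account's APPID | Enter the String-type dimension name: AppId (obtained automatically in SDK calls, so it does not need to be passed) |
Instances.N.Dimensions.0.Value | AppId | The account's APPID | Enter the ID, for example: 1231231231 (obtained automatically in SDK calls, so it does not need to be passed) |
Instances.N.Dimensions.1.Name | SubUin | Dimension name for the sub-account ID | Enter the String-type dimension name: SubUin |
Instances.N.Dimensions.1.Value | SubUin | Sub-account ID | Enter the ID, for example: 100001231231 |
Instances.N.Dimensions.2.Name | InstanceId | Dimension name for the training task instance ID | Enter the String-type dimension name: InstanceId |
Instances.N.Dimensions.2.Value | InstanceId | Training task instance ID | Enter the specific instance ID, for example: train-9187850047592xxxxx-9ludoo1s1n9c-master-0 |
Instances.N.Dimensions.3.Name | InstanceGpuNum | Dimension name for the GPU card number used by the training task instance (whole-card GPU tasks only) | Enter the String-type dimension name: InstanceGpuNum |
Instances.N.Dimensions.3.Value | InstanceGpuNum | GPU card number used by the training task instance (whole-card GPU tasks only) | The training task instance ID joined with the GPU card number or with avg, for example: train-9187850047592xxxxx-9ludoo1s1n9c-master-0-0, train-9187850047592xxxxx-9ludoo1s1n9c-master-0-avg |
Instances.N.Dimensions.4.Name | TaskId | Dimension name for the training task ID | Enter the String-type dimension name: TaskId |
Instances.N.Dimensions.4.Value | TaskId | Training task ID | Enter the ID, for example: train-9187850047592xxxxx |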
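The metrics table uses three dimension combinations: instance level (AppId, InstanceId, SubUin), per-GPU level (AppId, InstanceGpuNum, SubUin), and task level (AppId, SubUin, TaskId). The following Python sketch shows one way to assemble each combination, including the InstanceGpuNum value, which joins the instance ID with a GPU card number or with avg. All function names here are hypothetical, and the `-` separator is inferred from the example values in the table.

```python
from typing import List, Optional, Tuple

Dimension = Tuple[str, str]  # (dimension name, dimension value)


def instance_gpu_num(instance_id: str, gpu_index: Optional[int] = None) -> str:
    """Join an instance ID with a GPU card number, or with 'avg' when none is given."""
    suffix = "avg" if gpu_index is None else str(gpu_index)
    return f"{instance_id}-{suffix}"


def instance_dimensions(app_id: str, sub_uin: str, instance_id: str) -> List[Dimension]:
    """Dimensions for instance-level metrics such as DiskIoUtil."""
    return [("AppId", app_id), ("InstanceId", instance_id), ("SubUin", sub_uin)]


def gpu_dimensions(app_id: str, sub_uin: str, instance_id: str,
                   gpu_index: Optional[int] = None) -> List[Dimension]:
    """Dimensions for per-GPU metrics such as DcgmFiDevGpuUtil."""
    value = instance_gpu_num(instance_id, gpu_index)
    return [("AppId", app_id), ("InstanceGpuNum", value), ("SubUin", sub_uin)]


def task_dimensions(app_id: str, sub_uin: str, task_id: str) -> List[Dimension]:
    """Dimensions for task-level metrics such as Cpuutil."""
    return [("AppId", app_id), ("SubUin", sub_uin), ("TaskId", task_id)]
```

Keeping the three combinations in separate helpers makes it harder to send, for instance, a TaskId with a metric that expects InstanceId.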
Input parameters
To query monitoring data for task-based modeling metrics, use the following values:
&Namespace=QCE/TI_TRAINTASK
&Instances.N.Dimensions.0.Name=AppId
&Instances.N.Dimensions.0.Value=the account's APPID
&Instances.N.Dimensions.1.Name=SubUin
&Instances.N.Dimensions.1.Value=the sub-account ID
&Instances.N.Dimensions.2.Name=InstanceId
&Instances.N.Dimensions.2.Value=the training task instance ID
&Instances.N.Dimensions.3.Name=InstanceGpuNum
&Instances.N.Dimensions.3.Value=the GPU card number used by the training task instance
&Instances.N.Dimensions.4.Name=TaskId
&Instances.N.Dimensions.4.Value=the training task ID
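The parameter list above can be generated from a list of (name, value) pairs. The Python sketch below flattens one instance's dimensions into `Instances.N.Dimensions.M.Name` and `Instances.N.Dimensions.M.Value` keys. It rests on several assumptions: `build_params` is a hypothetical helper, the `MetricName` and `Period` keys are not shown in this section and are assumed to be required by the query interface, and dimensions are numbered consecutively in the order given.

```python
from typing import Dict, List, Tuple
from urllib.parse import urlencode

NAMESPACE = "QCE/TI_TRAINTASK"


def build_params(metric_name: str, period: int,
                 dimensions: List[Tuple[str, str]],
                 instance_index: int = 0) -> Dict[str, str]:
    """Flatten one instance's dimensions into query parameters.

    MetricName and Period are assumed parameter names; only Namespace and
    the Instances.N.Dimensions.M.* keys appear in the section above.
    """
    params = {
        "Namespace": NAMESPACE,
        "MetricName": metric_name,  # assumption: not listed in this section
        "Period": str(period),      # assumption: not listed in this section
    }
    for m, (name, value) in enumerate(dimensions):
        prefix = f"Instances.{instance_index}.Dimensions.{m}"
        params[f"{prefix}.Name"] = name
        params[f"{prefix}.Value"] = value
    return params


if __name__ == "__main__":
    # Example values are the placeholders used in the dimension table.
    dims = [
        ("AppId", "1231231231"),
        ("SubUin", "100001231231"),
        ("InstanceId", "train-9187850047592xxxxx-9ludoo1s1n9c-master-0"),
    ]
    print(urlencode(build_params("DiskIoUtil", 60, dims)))
```

Signing, authentication, and the time range are left out on purpose; in practice an official SDK would normally handle those.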