监控指标概述
Goosefs 监控指标记录了客户端、集群、Master 节点、 Worker 节点、进程的运行状态。
1. 按照统计维度,监控指标可以分成以下四大类:
Gauge:记录一个事件的瞬时数值,该数值可增可减,一般用来反映系统运行状态。
Meter:统计事件频率,即固定时间周期内事件的发生次数,如每分钟、每5分钟,每15分钟内的请求数。
Counter:统计事件发生的总次数,与 Gauge 不同的点在于该数值只增不减,一般用来记录访问请求次数等数据。
Timer:统计事件的频率和分布情况,该指标可以认为 Histogram 指标和 Meter 指标的结合,Histogram 指标用于统计耗时分布,Meter 指标用于统计 QPS;如果需要同时统计频率和耗时的时候可以使用该指标。
注意
2. 按照数据采集来源,监控指标可以分为以下两大类:
集群状态指标:集群状态指标包括了 Cluster、Master 和 Worker 维度的监控数据,这些监控指标可以反应整个集群及集群下的每个节点的运行状态。这类指标的记录格式如下:
Master.[metricName].[tag1].[tag2]...[Worker_ip].[metricName].[tag1].[tag2]...
进程处理指标:进程处理指标用于详细记录集群中的请求次数、延迟、CPU 消耗、RPC 调用次数等信息,这类指标可以推送到第三方监控工具中展示。这类指标的记录格式如下:
[processType].[metricName].[tag1].[tag2]...[hostName]
Cluster 监控指标
GooseFS 集群化部署时,客户端和 Worker 节点会周期性地通过心跳将监控指标发送到 Master 节点,可以通过 goosefs.master.worker.heartbeat.interval 和 goosefs.user.metrics.heartbeat.interval 分别设置 Worker 节点和客户端监控指标的心跳周期。
Cluster 级别的监控指标列表如下:
指标名称 | 指标类型 | 指标描述 |
Cluster.BytesReadDomain | COUNTER | Total number of bytes read from GooseFS storage via domain socket reported by all workers |
Cluster.BytesReadDomainThroughput | GAUGE | Bytes read throughput from GooseFS storage via domain socket by all workers |
Cluster.BytesReadLocal | COUNTER | Total number of bytes short-circuit read from local storage by all clients |
Cluster.BytesReadLocalThroughput | GAUGE | Bytes throughput short-circuit read from local storage by all clients |
Cluster.BytesReadPerUfs | COUNTER | Total number of bytes read from a specific UFS by all workers |
Cluster.BytesReadRemote | COUNTER | Total number of bytes read from GooseFS storage or underlying UFS if data does not exist in GooseFS storage reported by all workers. This does not include short-circuit local reads and domain socket reads |
Cluster.BytesReadRemoteThroughput | GAUGE | Bytes read throughput from GooseFS storage or underlying UFS if data does not exist in GooseFS storage reported by all workers. This does not include short-circuit local reads and domain socket reads |
Cluster.BytesReadUfsAll | COUNTER | Total number of bytes read from a all GooseFS UFSes by all workers |
Cluster.BytesReadUfsThroughput | GAUGE | Bytes read throughput from all GooseFS UFSes by all workers |
Cluster.BytesWrittenDomain | COUNTER | Total number of bytes written to GooseFS storage via domain socket by all workers |
Cluster.BytesWrittenDomainThroughput | GAUGE | Throughput of bytes written to GooseFS storage via domain socket by all workers |
Cluster.BytesWrittenLocal | COUNTER | Total number of bytes short-circuit written to local storage by all clients |
Cluster.BytesWrittenLocalThroughput | GAUGE | Bytes throughput written to local storage by all clients |
Cluster.BytesWrittenPerUfs | COUNTER | Total number of bytes written to a specific GooseFS UFS by all workers |
Cluster.BytesWrittenRemote | COUNTER | Total number of bytes written to GooseFS storage in all workers or the underlying UFS. This does not include short-circuit local writes and domain socket writes. |
Cluster.BytesWrittenRemoteThroughput | GAUGE | Bytes write throughput to GooseFS storage in all workers or the underlying UFS. This does not include short-circuit local writes and domain socket writes. |
Cluster.BytesWrittenUfsAll | COUNTER | Total number of bytes written to all GooseFS UFSes by all workers |
Cluster.BytesWrittenUfsThroughput | GAUGE | Bytes write throughput to all GooseFS UFSes by all workers |
Cluster.CapacityFree | GAUGE | Total free bytes on all tiers, on all workers of GooseFS |
Cluster.CapacityTotal | GAUGE | Total capacity (in bytes) on all tiers, on all workers of GooseFS |
Cluster.CapacityUsed | GAUGE | Total used bytes on all tiers, on all workers of GooseFS |
Cluster.RootUfsCapacityFree | GAUGE | Free capacity of the GooseFS root UFS in bytes |
Cluster.RootUfsCapacityTotal | GAUGE | Total capacity of the GooseFS root UFS in bytes |
Cluster.RootUfsCapacityUsed | GAUGE | Used capacity of the GooseFS root UFS in bytes |
Cluster.Workers | GAUGE | Total number of active workers inside the cluster |
Master 监控指标
Master 监控指标有两大类,其一是默认指标,Master 运行过程中会默认记录这些指标;其二是动态监控指标。默认的 Master 监控指标如下:
指标名称 | 指标类型 | 指标描述 |
Master.CompleteFileOps | COUNTER | Total number of the CompleteFile operations |
Master.CreateDirectoryOps | COUNTER | Total number of the CreateDirectory operations |
Master.CreateFileOps | COUNTER | Total number of the CreateFile operations |
Master.DeletePathOps | COUNTER | Total number of the Delete operations |
Master.DirectoriesCreated | COUNTER | Total number of the succeed CreateDirectory operations |
Master.EdgeCacheEvictions | GAUGE | Total number of edges (inode metadata) that was evicted from cache. The edge cache is responsible for managing the mapping from (parentId, childName) to childId. |
Master.EdgeCacheHits | GAUGE | Total number of hits in the edge (inode metadata) cache. The edge cache is responsible for managing the mapping from (parentId, childName) to childId. |
Master.EdgeCacheLoadTimes | GAUGE | Total load times in the edge (inode metadata) cache. The edge cache is responsible for managing the mapping from (parentId, childName) to childId. |
Master.EdgeCacheMisses | GAUGE | Total number of misses in the edge (inode metadata) cache. The edge cache is responsible for managing the mapping from (parentId, childName) to childId. |
Master.EdgeCacheSize | GAUGE | Total number of edges (inode metadata) cached. The edge cache is responsible for managing the mapping from (parentId, childName) to childId. |
Master.EdgeLockPoolSize | GAUGE | The size of master edge lock pool |
Master.FileBlockInfosGot | COUNTER | Total number of succeed GetFileBlockInfo operations |
Master.FileInfosGot | COUNTER | Total number of the succeed GetFileInfo operations |
Master.FilesCompleted | COUNTER | Total number of the succeed CompleteFile operations |
Master.FilesCreated | COUNTER | Total number of the succeed CreateFile operations |
Master.FilesFreed | COUNTER | Total number of succeed FreeFile operations |
Master.FilesPersisted | COUNTER | Total number of successfully persisted files |
Master.FilesPinned | GAUGE | Total number of currently pinned files |
Master.FreeFileOps | COUNTER | Total number of FreeFile operations |
Master.GetFileBlockInfoOps | COUNTER | Total number of GetFileBlockInfo operations |
Master.GetFileInfoOps | COUNTER | Total number of the GetFileInfo operations |
Master.GetNewBlockOps | COUNTER | Total number of the GetNewBlock operations |
Master.InodeCacheEvictions | GAUGE | Total number of inodes that was evicted from the cache. |
Master.InodeCacheHits | GAUGE | Total number of hits in the inodes (inode metadata) cache. |
Master.InodeCacheLoadTimes | GAUGE | Total load times in the inodes (inode metadata) cache. |
Master.InodeCacheMisses | GAUGE | Total number of misses in the inodes (inode metadata) cache. |
Master.InodeCacheSize | GAUGE | Total number of inodes (inode metadata) cached. |
Master.InodeLockPoolSize | GAUGE | The size of master inode lock pool |
Master.JournalFlushFailure | COUNTER | Total number of failed journal flush |
Master.JournalFlushTimer | TIMER | The timer statistics of journal flush |
Master.JournalGainPrimacyTimer | TIMER | The timer statistics of journal gain primacy |
Master.LastBackupEntriesCount | GAUGE | The total number of entries written in the last leading master metadata backup |
Master.LastBackupRestoreCount | GAUGE | The total number of entries restored from backup when a leading master initializes its metadata |
Master.LastBackupRestoreTimeMs | GAUGE | The process time of the last restore from backup |
Master.LastBackupTimeMs | GAUGE | The process time of the last backup |
Master.ListingCacheSize | GAUGE | The size of master listing cache |
Master.MountOps | COUNTER | Total number of Mount operations |
Master.NewBlocksGot | COUNTER | Total number of the succeed GetNewBlock operations |
Master.PathsDeleted | COUNTER | Total number of the succeed Delete operations |
Master.PathsMounted | COUNTER | Total number of succeed Mount operations |
Master.PathsRenamed | COUNTER | Total number of succeed Rename operations |
Master.PathsUnmounted | COUNTER | Total number of succeed Unmount operations |
Master.RenamePathOps | COUNTER | Total number of Rename operations |
Master.SetAclOps | COUNTER | Total number of SetAcl operations |
Master.SetAttributeOps | COUNTER | Total number of SetAttribute operations |
Master.TotalPaths | GAUGE | Total number of files and directory in Goosefs namespace |
Master.UfsJournalCatchupTimer | TIMER | The timer statistics of journal catchup |
Master.UfsJournalFailureRecoverTimer | TIMER | The timer statistics of ufs journal failure recover |
Master.UfsJournalInitialReplayTimeMs | GAUGE | The process time of the ufs journal initial replay |
Master.UnmountOps | COUNTER | Total number of Unmount operations |
动态指标如下:
指标名称 | 指标类型 | 指标描述 |
Master.CapacityTotalTier | GAUGE | Total capacity in tier of the Goosefs file system in bytes |
Master.CapacityUsedTier | GAUGE | Used capacity in tier of the Goosefs file system in bytes |
Master.CapacityFreeTier | GAUGE | Free capacity in tier of the Goosefs file system in bytes |
Master.UfsSessionCount-Ufs: | COUNTER | The total number of currently opened UFS sessions to connect to the given |
Master..UFS:.UFS_TYPE:.User: | GAUGE | The details UFS rpc operation done by the current master |
Master.PerUfsOp.UFS: | TIMER | The aggregated number of UFS operation ran on UFS by leading master |
Master. | TIMER | The duration statistics of RPC calls exposed on leading master |
Worker 监控指标
Worker 监控指标有两大类,其一是默认指标,Worker 运行过程中会默认记录这些指标;其二是动态监控指标。默认的 Worker 监控指标如下:
指标名称 | 指标类型 | 指标描述 |
Worker.AsyncCacheDuplicateRequests | COUNTER | Total number of duplicated async cache request received by this worker |
Worker.AsyncCacheFailedBlocks | COUNTER | Total number of async cache failed blocks in this worker |
Worker.AsyncCacheRemoteBlocks | COUNTER | Total number of blocks that need to be async cached from remote source |
Worker.AsyncCacheRequests | COUNTER | Total number of async cache request received by this worker |
Worker.AsyncCacheSucceededBlocks | COUNTER | Total number of async cache succeeded blocks in this worker |
Worker.AsyncCacheUfsBlocks | COUNTER | Total number of blocks that need to be async cached from local source |
Worker.BlockRemoverBlocksToRemovedCount | COUNTER | The total number of blocks removed from this worker by asynchronous block remover. |
Worker.BlockRemoverRemovingBlocksSize | GAUGE | The size of blocks is removing from this worker by asynchronous block remover. |
Worker.BlockRemoverTryRemoveBlocksSize | GAUGE | The size of blocks to be removed from this worker by asynchronous block remover. |
Worker.BlockRemoverTryRemoveCount | COUNTER | The total number of blocks tried to be removed from this worker by asynchronous block remover. |
Worker.BlocksAccessed | COUNTER | Total number of times any one of the blocks in this worker is accessed. |
Worker.BlocksCached | GAUGE | Total number of blocks used for caching data in an goosefs worker |
Worker.BlocksCancelled | COUNTER | Total number of aborted temporary blocks in this worker. |
Worker.BlocksDeleted | COUNTER | Total number of deleted blocks in this worker by external requests. |
Worker.BlocksEvicted | COUNTER | Total number of evicted blocks in this worker. |
Worker.BlocksLost | COUNTER | Total number of lost blocks in this worker. |
Worker.BlocksPromoted | COUNTER | Total number of times any one of the blocks in this worker moved to a new tier. |
Worker.BytesReadDomain | COUNTER | Total number of bytes read from goosefs storage via domain socket by this worker |
Worker.BytesReadDomainThroughput | METER | Bytes read throughput from goosefs storage via domain socket by this worker |
Worker.BytesReadPerUfs | COUNTER | Total number of bytes read from a specific goosefs UFS by this worker |
Worker.BytesReadRemote | COUNTER | Total number of bytes read from goosefs storage managed by this worker and underlying UFS if data cannot be found in the goosefs storage. This does not include short-circuit local reads and domain socket reads. |
Worker.BytesReadRemoteThroughput | METER | Total number of bytes read from goosefs storage managed by this worker and underlying UFS if data cannot be found in the goosefs storage. This does not include short-circuit local reads and domain socket reads. |
Worker.BytesReadUfsThroughput | METER | Bytes read throughput from all goosefs UFSes by this worker |
Worker.BytesWrittenDomain | COUNTER | Total number of bytes written to goosefs storage via domain socket by this worker |
Worker.BytesWrittenDomainThroughput | METER | Throughput of bytes written to goosefs storage via domain socket by this worker |
Worker.BytesWrittenPerUfs | COUNTER | Total number of bytes written to a specific goosefs UFS by this worker |
Worker.BytesWrittenRemote | COUNTER | Total number of bytes written to goosefs storage or the underlying UFS by this worker. This does not include short-circuit local writes and domain socket writes. |
Worker.BytesWrittenRemoteThroughput | METER | Bytes write throughput to goosefs storage or the underlying UFS by this workerThis does not include short-circuit local writes and domain socket writes. |
Worker.BytesWrittenUfsThroughput | METER | Bytes write throughput to all goosefs UFSes by this worker |
Worker.CapacityFree | GAUGE | Total free bytes on all tiers of a specific goosefs worker |
Worker.CapacityTotal | GAUGE | Total capacity (in bytes) on all tiers of a specific goosefs worker |
Worker.CapacityUsed | GAUGE | Total used bytes on all tiers of a specific goosefs worker |
动态监控指标如下:
指标名称 | 指标类型 | 指标描述 |
Worker.UfsSessionCount-Ufs | GAUGE | The total number of currently opened UFS sessions to connect to the given |
Worker. | TIMER | The duration statistics of RPC calls exposed on workers |
客户端监控指标
客户端监控指标会通过指定的用户 ID 或者客户端的 IP 维度汇聚,如果已经在 GooseFS 中注册好用户 ID,则会优先按照用户 ID 维度汇聚。用户 ID 可以通过 goosefs.user.app.id 指定。客户端监控指标如下:
指标名称 | 指标类型 | 指标描述 |
Client.BytesReadLocal | COUNTER | Total number of bytes short-circuit read from local storage by this client |
Client.BytesReadLocalThroughput | METER | Bytes throughput short-circuit read from local storage by this client |
Client.BytesWrittenLocal | COUNTER | Total number of bytes short-circuit written to local storage by this client |
Client.BytesWrittenLocalThroughput | METER | Bytes throughput short-circuit written to local storage by this client |
Client.BytesWrittenUfs | COUNTER | Total number of bytes write to goosefs UFS by this client |
Client.CacheBytesEvicted | METER | Total number of bytes evicted from the client cache. |
Client.CacheBytesReadCache | METER | Total number of bytes read from the client cache. |
Client.CacheBytesReadExternal | METER | Total number of bytes read from external storage due to a cache miss on the client cache. |
Client.CacheBytesRequestedExternal | METER | Total number of bytes the user requested to read which resulted in a cache miss. This number may be smaller than Client.CacheBytesReadExternal due to chunk reads. |
Client.CacheBytesWrittenCache | METER | Total number of bytes written to the client cache. |
Client.CacheCleanupGetErrors | COUNTER | Number of failures when cleaning up a failed cache read. |
Client.CacheCleanupPutErrors | COUNTER | Number of failures when cleaning up a failed cache write. |
Client.CacheCreateErrors | COUNTER | Number of failures when creating a cache in the client cache. |
Client.CacheDeleteErrors | COUNTER | Number of failures when deleting cached data in the client cache. |
Client.CacheDeleteNonExistingPageErrors | COUNTER | Number of failures when deleting pages due to absence. |
Client.CacheDeleteNotReadyErrors | COUNTER | Number of failures when when cache is not ready to delete pages. |
Client.CacheDeleteStoreDeleteErrors | COUNTER | Number of failures when deleting pages due to failed delete in page stores. |
Client.CacheGetErrors | COUNTER | Number of failures when getting cached data in the client cache. |
Client.CacheGetNotReadyErrors | COUNTER | Number of failures when cache is not ready to get pages. |
Client.CacheGetStoreReadErrors | COUNTER | Number of failures when getting cached data in the client cache due to failed read from page stores. |
Client.CacheHitRate | GAUGE | Cache hit rate: (# bytes read from cache) / (# bytes requested). |
Client.CachePages | COUNTER | Total number of pages in the client cache. |
Client.CachePagesEvicted | METER | Total number of pages evicted from the client cache. |
Client.CachePutAsyncRejectionErrors | COUNTER | Number of failures when putting cached data in the client cache due to failed injection to async write queue. |
Client.CachePutBenignRacingErrors | COUNTER | Number of failures when adding pages due to racing eviction. This error is benign. |
Client.CachePutErrors | COUNTER | Number of failures when putting cached data in the client cache. |
Client.CachePutEvictionErrors | COUNTER | Number of failures when putting cached data in the client cache due to failed eviction. |
Client.CachePutInsufficientSpaceErrors | COUNTER | Number of failures when putting cached data in the client cache due to insufficient space made after eviction. |
Client.CachePutNotReadyErrors | COUNTER | Number of failures when cache is not ready to add pages. |
Client.CachePutStoreDeleteErrors | COUNTER | Number of failures when putting cached data in the client cache due to failed deletes in page store. |
Client.CachePutStoreWriteErrors | COUNTER | Number of failures when putting cached data in the client cache due to failed writes to page store. |
Client.CacheSpaceAvailable | GAUGE | Amount of bytes available in the client cache. |
Client.CacheSpaceUsed | GAUGE | Amount of bytes used by the client cache. |
Client.CacheSpaceUsedCount | COUNTER | Amount of bytes used by the client cache as a counter. |
Client.CacheState | COUNTER | State of the cache: 0 (NOT_IN_USE), 1 (READ_ONLY) and 2 (READ_WRITE) |
Client.CacheStoreDeleteTimeout | COUNTER | Number of timeouts when deleting pages from page store. |
Client.CacheStoreGetTimeout | COUNTER | Number of timeouts when reading pages from page store. |
Client.CacheStorePutTimeout | COUNTER | Number of timeouts when writing new pages to page store. |
Client.CacheStoreThreadsRejected | COUNTER | Number of rejection of I/O threads on submitting tasks to thread pool, likely due to unresponsive local file system. |
Client.CacheUnremovableFiles | COUNTER | Amount of bytes unusable managed by the client cache. |
进程监控指标
进程监控指标可以在 Cluster、Master 和 Worker 中收集并汇聚。主要包括 JVM 信息、垃圾回收信息、内存占用信息三大类。
JVM 信息包括以下内容:
指标名称 | 指标描述 |
name | The name of the JVM |
uptime | The uptime of the JVM |
vendor | The current JVM vendor |
垃圾回收监控指标包括以下内容:
指标名称 | 指标描述 |
PS-MarkSweep.count | Total number of mark and sweep |
PS-MarkSweep.time | The time used to mark and sweep |
PS-Scavenge.count | Total number of scavenge |
PS-Scavenge.time | The time used to scavenge |
内存占用监控指标记录了内存使用情况的概述和详情信息,部分监控指标如下:
指标名称 | 指标描述 |
total.committed | The amount of memory in bytes that is guaranteed to be available for use by the JVM |
total.init | The amount of the memory in bytes that is available for use by the JVM |
total.max | The maximum amount of memory in bytes that is available for use by the JVM |
total.used | The amount of memory currently used in bytes |
heap.committed | The amount of memory from heap area guaranteed to be available |
heap.init | The amount of memory from heap area available at initialization |
heap.max | The maximum amount of memory from heap area that is available |
heap.usage | The amount of memory from heap area currently used in GB |
heap.used | The amount of memory from heap area that has been used |
pools.Code-Cache.used | Used memory of collection usage from the pool from which memory is used for compilation and storage of native code |
pools.Compressed-Class-Space.used | Used memory of collection usage from the pool from which memory is use for class metadata |
pools.PS-Eden-Space.used | Used memory of collection usage from the pool from which memory is initially allocated for most objects |
pools.PS-Survivor-Space.used | Used memory of collection usage from the pool containing objects that have survived the garbage collection of the Eden spac |