前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >使用sql_exporter+Prometheus监控sql server

使用sql_exporter+Prometheus监控sql server

作者头像
保持热爱奔赴山海
发布2023-07-20 15:30:48
9638
发布2023-07-20 15:30:48
举报
文章被收录于专栏:饮水机管理员饮水机管理员

最近接管了一套sql server alwayson集群,需要加些监控和告警,简单研究了下github的方案,发现了这种sql_exporter这种更简单,扩展性也超级强。(另一种是通过nodejs的方案监控,费事点,可扩展性也不太好,这里直接pass掉)。

代码语言:javascript
复制
github地址:https://github.com/free/sql_exporter


wget https://github.com/free/sql_exporter/releases/download/0.5/sql_exporter-0.5.linux-amd64.tar.gz

解压文件

cp sql_exporter.yml mssql.yml

编辑 mssql.yml 文件,里面的data_source_name数据库连接方式换成自己的,需要注意如果有特殊字符需要编码下,具体可以查

https://www.w3schools.com/tags/ref_urlencode.ASP

前台启动

代码语言:javascript
复制
./sql_exporter -config.file "mssql.yml" -web.listen-address ":9399"

也可以加到systemd自启动

代码语言:javascript
复制
vim /lib/systemd/system/sql_exporter.service

[Unit]
Description=sql_exporter
[Service]
ExecStart=/opt/sql_exporter/sql_exporter -config.file /opt/sql_exporter/mssql.yml -web.listen-address 0.0.0.0:9399
Restart=on-failure
[Install]
WantedBy=multi-user.target

默认github上提供的监控指标比较少,我们可以自己写sql去完善下,下面是我改进后的

(我这里只监控了一些mssql层面的,它实际上还可以监控业务层面的,只要自己写sql逻辑即可)

cat mssql_standard.collector.yml  修改后的内容如下:

代码语言:javascript
复制
# A collector defining standard metrics for Microsoft SQL Server.
#
# It is required that the SQL Server user has the following permissions:
#
#   GRANT VIEW ANY DEFINITION TO
#   GRANT VIEW SERVER STATE TO
#
collector_name: mssql_standard

# Similar to global.min_interval, but applies to the queries defined by this collector only.
#min_interval: 0s

metrics:
  - metric_name: mssql_local_time_seconds
    type: gauge
    help: 'Local time in seconds since epoch (Unix time).'
    values: [unix_time]
    query: |
      SELECT DATEDIFF(second, '19700101', GETUTCDATE()) AS unix_time

  - metric_name: mssql_connections
    type: gauge
    help: 'Number of active connections.'
    key_labels:
      - db
    values: [count]
    query: |
      SELECT DB_NAME(sp.dbid) AS db, COUNT(sp.spid) AS count
      FROM sys.sysprocesses sp
      GROUP BY DB_NAME(sp.dbid)

  #
  # Collected from sys.dm_os_performance_counters
  #
  - metric_name: mssql_deadlocks
    type: counter
    help: 'Number of lock requests that resulted in a deadlock.'
    values: [cntr_value]
    query: |
      SELECT cntr_value
      FROM sys.dm_os_performance_counters WITH (NOLOCK)
      WHERE counter_name = 'Number of Deadlocks/sec' AND instance_name = '_Total'

  - metric_name: mssql_user_errors
    type: counter
    help: 'Number of user errors.'
    values: [cntr_value]
    query: |
      SELECT cntr_value
      FROM sys.dm_os_performance_counters WITH (NOLOCK)
      WHERE counter_name = 'Errors/sec' AND instance_name = 'User Errors'

  - metric_name: mssql_kill_connection_errors
    type: counter
    help: 'Number of severe errors that caused SQL Server to kill the connection.'
    values: [cntr_value]
    query: |
      SELECT cntr_value
      FROM sys.dm_os_performance_counters WITH (NOLOCK)
      WHERE counter_name = 'Errors/sec' AND instance_name = 'Kill Connection Errors'

  - metric_name: mssql_page_life_expectancy_seconds
    type: gauge
    help: 'The minimum number of seconds a page will stay in the buffer pool on this node without references.'
    values: [cntr_value]
    query: |
      SELECT top(1) cntr_value
      FROM sys.dm_os_performance_counters WITH (NOLOCK)
      WHERE counter_name = 'Page life expectancy'

  - metric_name: mssql_batch_requests
    type: counter
    help: 'Number of command batches received.'
    values: [cntr_value]
    query: |
      SELECT cntr_value
      FROM sys.dm_os_performance_counters WITH (NOLOCK)
      WHERE counter_name = 'Batch Requests/sec'

  - metric_name: mssql_log_growths
    type: counter
    help: 'Number of times the transaction log has been expanded, per database.'
    key_labels:
      - db
    values: [cntr_value]
    query: |
      SELECT rtrim(instance_name) AS db, cntr_value
      FROM sys.dm_os_performance_counters WITH (NOLOCK)
      WHERE counter_name = 'Log Growths' AND instance_name <> '_Total'

  #
  # Collected from sys.dm_io_virtual_file_stats
  #
  - metric_name: mssql_io_stall_seconds
    type: counter
    help: 'Stall time in seconds per database and I/O operation.'
    key_labels:
      - db
    value_label: operation
    values:
      - read
      - write
    query_ref: mssql_io_stall
  - metric_name: mssql_io_stall_total_seconds
    type: counter
    help: 'Total stall time in seconds per database.'
    key_labels:
      - db
    values:
      - io_stall
    query_ref: mssql_io_stall

  #
  # Collected from sys.dm_os_process_memory
  #
  - metric_name: mssql_resident_memory_bytes
    type: gauge
    help: 'SQL Server resident memory size (AKA working set).'
    values: [resident_memory_bytes]
    query_ref: mssql_process_memory

  - metric_name: mssql_virtual_memory_bytes
    type: gauge
    help: 'SQL Server committed virtual memory size.'
    values: [virtual_memory_bytes]
    query_ref: mssql_process_memory

  - metric_name: mssql_memory_utilization_percentage
    type: gauge
    help: 'The percentage of committed memory that is in the working set.'
    values: [memory_utilization_percentage]
    query_ref: mssql_process_memory

  - metric_name: mssql_page_fault_count
    type: counter
    help: 'The number of page faults that were incurred by the SQL Server process.'
    values: [page_fault_count]
    query_ref: mssql_process_memory

  #
  # Collected from sys.dm_os_sys_memory
  #
  - metric_name: mssql_os_memory
    type: gauge
    help: 'OS physical memory, used and available.'
    value_label: 'state'
    values: [used, available]
    query: |
      SELECT
        (total_physical_memory_kb - available_physical_memory_kb) * 1024 AS used,
        available_physical_memory_kb * 1024 AS available
      FROM sys.dm_os_sys_memory

  - metric_name: mssql_os_page_file
    type: gauge
    help: 'OS page file, used and available.'
    value_label: 'state'
    values: [used, available]
    query: |
      SELECT
        (total_page_file_kb - available_page_file_kb) * 1024 AS used,
        available_page_file_kb * 1024 AS available
      FROM sys.dm_os_sys_memory

  - metric_name: mssql_db_size
    type: gauge
    help: 'mssql dbname and dbsize.'
    values: [size_kb]
    key_labels: [db_name]
    query: |
      SELECT instance_name as db_name,sum(cntr_value) as size_kb FROM sys.dm_os_performance_counters
        WHERE instance_name !='_Total' and object_name LIKE '%Databases%'
        and (
        counter_name = 'Data File(s) Size (KB)'
        or 
        counter_name = 'Log File(s) Size (KB)'  
        )
        group by instance_name


  - metric_name: mssql_conn_by_user
    type: gauge
    help: 'mssql conn count by user'
    values: [session_count]
    key_labels: [login_name]
    query: |
      select login_name , count(session_id) as session_count from sys.dm_exec_sessions group by login_name

  - metric_name: mssql_conn_by_db
    type: gauge
    help: 'mssql conn count by db'
    values: [session_count]
    key_labels: [db_name]
    query: |
      SELECT DB_NAME(dbid) AS db_name,COUNT(dbid) AS session_count FROM sys.sysprocesses GROUP BY DB_NAME(dbid);

  - metric_name: mssql_tempdb_usage
    type: gauge
    help: 'mssql tempdb usage'
    value_label: 'type'
    values: [user_objects_kb,internal_objects_kb,version_store_kb,freespace_kb]
    query: |
      SELECT SUM ( user_object_reserved_page_count ) * 8 AS user_objects_kb,
            SUM ( internal_object_reserved_page_count ) * 8 AS internal_objects_kb,
            SUM ( version_store_reserved_page_count ) * 8 AS version_store_kb,
            SUM ( unallocated_extent_page_count ) * 8 AS freespace_kb 
        FROM
            tempdb.sys.dm_db_file_space_usage;

  - metric_name: mssql_memory_usage
    type: gauge
    help: 'mssql memory usage'
    value_label: 'type'
    values: [total_physical_memory_kb,available_page_file_kb,total_page_file_kb,system_cache_kb]
    query: |
      SELECT total_physical_memory_kb,available_page_file_kb,total_page_file_kb,system_cache_kb FROM sys.dm_os_sys_memory;

  - metric_name: mssql_alwayson_node_role
    type: gauge
    help: 'mssql alwayson node role'
    value_label: 'state'
    values: [role,operational_state,recovery_health,synchronization_health]
    query: |
      SELECT role,operational_state,recovery_health,synchronization_health
        FROM sys.dm_hadr_availability_replica_states r
        JOIN sys.availability_groups g ON r.group_id = g.group_id
        WHERE is_local=1

queries:
  # Populates `mssql_io_stall` and `mssql_io_stall_total`
  - query_name: mssql_io_stall
    query: |
      SELECT
        cast(DB_Name(a.database_id) as varchar) AS [db],
        sum(io_stall_read_ms) / 1000.0 AS [read],
        sum(io_stall_write_ms) / 1000.0 AS [write],
        sum(io_stall) / 1000.0 AS io_stall
      FROM
        sys.dm_io_virtual_file_stats(null, null) a
      INNER JOIN sys.master_files b ON a.database_id = b.database_id AND a.file_id = b.file_id
      GROUP BY a.database_id

  # Populates `mssql_resident_memory_bytes`, `mssql_virtual_memory_bytes`, `mssql_memory_utilization_percentage` and
  # `mssql_page_fault_count`.
  - query_name: mssql_process_memory
    query: |
      SELECT
        physical_memory_in_use_kb * 1024 AS resident_memory_bytes,
        virtual_address_space_committed_kb * 1024 AS virtual_memory_bytes,
        memory_utilization_percentage,
        page_fault_count
      FROM sys.dm_os_process_memory

Prometheus的target里面添加下

代码语言:javascript
复制
- job_name: 'MSSQL'
    static_configs:
    - targets: ['192.168.10.11:9399']
      labels:
        addr: '192.168.10.11'
        cluster: dba_test

prometheus里面添加告警规则:

代码语言:javascript
复制
groups:
- name: MSSQL告警规则
  rules:
    - alert: 检测到SQL Server 宕掉或者sql_exporter采集器宕掉
      expr:  up{job="MSSQL"} !=1
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: 检测到SQL Server 宕掉或者sql_exporter采集器宕掉  (实例 {{ $labels.addr }})
        description: "检测到SQL Server 宕掉或者sql_exporter采集器宕掉  (实例 {{ $labels.addr }})"

    - alert: 检测到SQL Server 单库体积过大
      expr:  mssql_db_size{job="MSSQL"} /1024/1024 > 500
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: 检测到SQL Server 单库体积过大 (实例 {{ $labels.addr }} 库名 {{$labels.db_name}})
        description: "检测到SQL Server 单库体积过大 (实例 {{ $labels.addr }} 库名 {{$labels.db_name}} 当前值 {{$value}} GB"

    - alert: 检测到SQL Server 单账号连接数过大
      expr:  mssql_conn_by_user{job="MSSQL"} > 1000
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: 检测到SQL Server 单账号连接数过大 (实例 {{ $labels.addr }} 账号 {{$labels.login_name}})
        description: "检测到SQL Server 单账号连接数过大 (实例 {{ $labels.addr }} 账号 {{$labels.login_name}} 当前值 {{$value}}"

    - alert: 检测到SQL Server 单库连接数过大
      expr:  mssql_conn_by_db{job="MSSQL"} > 200
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: 检测到SQL Server 单库连接数过大 (实例 {{ $labels.addr }} 库名 {{$labels.login_name}})
        description: "检测到SQL Server 单库连接数过大 (实例 {{ $labels.addr }} 库名 {{$labels.login_name}} 当前值 {{$value}}"

    - alert: 检测到SQL Server READ IO等待过大
      expr:  irate(mssql_io_stall_seconds{job="MSSQL",operation="read"}[1m])/60 > 20
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: 检测到SQL Server READ IO等待过大 (实例 {{ $labels.addr }} 库名 {{$labels.db}})
        description: "检测到SQL Server READ IO等待过大 (实例 {{ $labels.addr }} 库名 {{$labels.login_name}} 分钟数 {{$value}}"

    - alert: 检测到SQL Server READ IO等待过大
      expr:  irate(mssql_io_stall_seconds{job="MSSQL",operation="write"}[1m])/60 > 20
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: 检测到SQL Server WRITE IO等待过大 (实例 {{ $labels.addr }} 库名 {{$labels.db}})
        description: "检测到SQL Server WRITE IO等待过大 (实例 {{ $labels.addr }} 库名 {{$labels.login_name}} 分钟数 {{$value}}"

grafana看板,可以在grafana官方上搜一下,简单改改就行了。

最终的看板如下:

使用sql_exporter+Prometheus监控sql server_prometheus
使用sql_exporter+Prometheus监控sql server_prometheus
本文参与 腾讯云自媒体分享计划,分享自作者个人站点/博客。
原始发表:2023-07-05,如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 作者个人站点/博客 前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文参与 腾讯云自媒体分享计划  ,欢迎热爱写作的你一起参与!

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
相关产品与服务
云数据库 SQL Server
腾讯云数据库 SQL Server (TencentDB for SQL Server)是业界最常用的商用数据库之一,对基于 Windows 架构的应用程序具有完美的支持。TencentDB for SQL Server 拥有微软正版授权,可持续为用户提供最新的功能,避免未授权使用软件的风险。具有即开即用、稳定可靠、安全运行、弹性扩缩等特点。
领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档