
Q: How to debug and fix a spilo/patroni label update on Google Cloud that returns error code 4, Gateway Timeout

Stack Overflow user

Asked on 2022-09-18 11:00:12

1 answer · 186 views · 0 followers · 0 votes

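I'm using Zalando's postgres operator, and one of my postgres clusters is now down. I use connection poolers to connect to both the master and the replica. However, the replica connection pooler can't reach the replica pod because the replica service has no endpoints. I think the problem is the service's selector: it selects postgres pods labeled spilo-role=replica, but the pods don't carry that label. They should be labeled master and replica respectively.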
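A Kubernetes Service selects only pods whose labels contain every key/value pair in its selector. If the replica pod never receives `spilo-role: replica`, the replica service ends up with no endpoints. The minimal sketch below shows that equality-based matching. The selector and pod labels are hypothetical, modeled on the `application`, `cluster-name` and `spilo-role` labels that appear in this question, and were not read from the cluster:

```python
def selector_matches(selector: dict, labels: dict) -> bool:
    """Return True if every key/value in an equality-based selector is present in labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def matching_pods(selector: dict, pods: dict) -> list:
    """Return the names of pods whose labels satisfy the selector, sorted for stable output."""
    return sorted(name for name, labels in pods.items() if selector_matches(selector, labels))


# Hypothetical selector for the replica service (assumed, not copied from the cluster).
replica_selector = {
    "application": "spilo",
    "cluster-name": "acid-abc-db",
    "spilo-role": "replica",
}

# Hypothetical pod labels: acid-abc-db-1 lacks spilo-role because its label PATCH failed.
pods = {
    "acid-abc-db-0": {"application": "spilo", "cluster-name": "acid-abc-db", "spilo-role": "master"},
    "acid-abc-db-1": {"application": "spilo", "cluster-name": "acid-abc-db"},
}

if __name__ == "__main__":
    print(matching_pods(replica_selector, pods))  # [] -> the service has no endpoints
    # Same pods, but with the missing label applied (no mutation of the original dict).
    fixed = {**pods, "acid-abc-db-1": {**pods["acid-abc-db-1"], "spilo-role": "replica"}}
    print(matching_pods(replica_selector, fixed))  # ['acid-abc-db-1']
```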

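The cluster had been running for a month, and this incident started a few days ago. We are still in soft production with only a few low-traffic test customers, but we go into real production soon.

The operator and postgres logs don't show any errors that I can recognize. So I checked the logs in Google's Logs Explorer and found entries in the audit logs where Patroni calls the API to set the pod labels, and those calls fail with a 504 error. The error looks like a misconfiguration. The odd thing is that everything worked fine until now. I'm running out of ideas on how to debug this, so any guidance or help with debugging or fixing it is greatly appreciated.

Below is the audit log entry from the Google Cloud console Logs Explorer. It shows that the pod is authorized to update pod labels, but the update fails: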

{
  "protoPayload": {
    "@type": "type.googleapis.com/google.cloud.audit.AuditLog",
    "authenticationInfo": {
      "principalEmail": "system:serviceaccount:default:postgres-pod"
    },
    "authorizationInfo": [
      {
        "granted": true,
        "permission": "io.k8s.core.v1.pods.patch",
        "resource": "core/v1/namespaces/default/pods/acid-abc-db-1"
      }
    ],
    "methodName": "io.k8s.core.v1.pods.patch",
    "request": {
      "@type": "k8s.io/Patch",
      "metadata": {
        "annotations": {
          "status": "{\"conn_url\":\"postgres://10.52.3.36:5432/postgres\",\"api_url\":\"http://10.52.3.36:8008/patroni\",\"state\":\"running\",\"role\":\"replica\",\"version\":\"2.1.3\",\"xlog_location\":50331648,\"timeline\":1}"
        },
        "labels": {
          "spilo-role": "replica"
        },
        "name": "acid-abc-db-1",
        "namespace": "default"
      }
    },
    "requestMetadata": {
      "callerIp": "10.52.3.36",
      "callerSuppliedUserAgent": "Patroni/2.1.3 Python/3.6.9 Linux"
    },
    "resourceName": "core/v1/namespaces/default/pods/acid-ml-db-1",
    "response": {
      "@type": "core.k8s.io/v1.Status",
      "apiVersion": "v1",
      "code": 504,
      "details": {},
      "kind": "Status",
      "message": "Timeout: request did not complete within requested timeout - context canceled",
      "metadata": {},
      "reason": "Timeout",
      "status": "Failure"
    },
    "serviceName": "k8s.io",
    "status": {
      "code": 4,
      "message": "Gateway Timeout"
    }
  },
  "insertId": "b6e3cfe7-0125-4652-a77a-f44232198f8c",
  "resource": {
    "type": "k8s_cluster",
    "labels": {
      "project_id": "abc123",
      "cluster_name": "abc",
      "location": "asia-southeast1"
    }
  },
  "timestamp": "2022-09-18T09:21:05.017886Z",
  "labels": {
    "authorization.k8s.io/decision": "allow",
    "authorization.k8s.io/reason": "RBAC: allowed by ClusterRoleBinding \"postgres-pod\" of ClusterRole \"postgres-pod\" to ServiceAccount \"postgres-pod/default\""
  },
  "logName": "projects/ekyc-web-services/logs/cloudaudit.googleapis.com%2Factivity",
  "operation": {
    "id": "b6e3cfe7-0125-4652-a77a-f44232198f8c",
    "producer": "k8s.io",
    "first": true,
    "last": true
  },
  "receiveTimestamp": "2022-09-18T09:21:10.235550735Z"
}
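When there are many audit entries, it helps to pull out only the fields that matter here: the caller, the method, the target pod, the patched labels, and the HTTP response code. The minimal sketch below assumes the entries were exported from Logs Explorer as JSON objects. The field paths mirror the entry above, `SAMPLE` is a trimmed copy of it, and the helper names are hypothetical:

```python
import json


def summarize_entry(entry: dict) -> dict:
    """Extract the fields relevant to a pod-label PATCH from one audit log entry."""
    payload = entry.get("protoPayload", {})
    request_meta = payload.get("request", {}).get("metadata", {})
    response = payload.get("response", {})
    return {
        "timestamp": entry.get("timestamp"),
        "caller": payload.get("authenticationInfo", {}).get("principalEmail"),
        "user_agent": payload.get("requestMetadata", {}).get("callerSuppliedUserAgent"),
        "method": payload.get("methodName"),
        "pod": request_meta.get("name"),
        "labels": request_meta.get("labels", {}),
        "http_code": response.get("code"),
        "reason": response.get("reason"),
    }


def failed_label_patches(entries: list) -> list:
    """Keep only pods.patch calls that carried labels and returned a non-2xx code."""
    summaries = [summarize_entry(e) for e in entries]
    return [
        s for s in summaries
        if s["method"] == "io.k8s.core.v1.pods.patch"
        and s["labels"]
        and s["http_code"] is not None
        and not 200 <= s["http_code"] < 300
    ]


# A trimmed copy of the audit log entry shown above.
SAMPLE = json.loads("""
{
  "protoPayload": {
    "authenticationInfo": {"principalEmail": "system:serviceaccount:default:postgres-pod"},
    "methodName": "io.k8s.core.v1.pods.patch",
    "request": {"metadata": {"labels": {"spilo-role": "replica"}, "name": "acid-abc-db-1"}},
    "requestMetadata": {"callerSuppliedUserAgent": "Patroni/2.1.3 Python/3.6.9 Linux"},
    "response": {"code": 504, "reason": "Timeout"}
  },
  "timestamp": "2022-09-18T09:21:05.017886Z"
}
""")

if __name__ == "__main__":
    for row in failed_label_patches([SAMPLE]):
        print(row["timestamp"], row["pod"], row["labels"], row["http_code"], row["reason"])
```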

Normally `patronictl list` shows the State as running and shows the IP addresses in the Host column. Now both columns are empty:

+ Cluster: acid-abc-db (7144662354080374866) -+-----------+
| Member        | Host | Role    | State | TL | Lag in MB |
+---------------+------+---------+-------+----+-----------+
| acid-abc-db-0 |      | Leader  |       |    |           |
| acid-abc-db-1 |      | Replica |       |    |   unknown |
+---------------+------+---------+-------+----+-----------+
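Patroni on Kubernetes keeps each member's data (connection URL, API URL, state) in the pod's `status` annotation. That annotation is written by the same PATCH that timed out in the audit log above (see `annotations.status` in the request). Empty Host and State columns are therefore consistent with the failed patch. The same member data can be read from Patroni's REST API `GET /cluster` endpoint on port 8008. The minimal standard-library sketch below flags members that report no host or state. `SAMPLE_CLUSTER` is a hypothetical response shaped like the output above, and the helper names are illustrative:

```python
import json
import urllib.request


def fetch_cluster(api_url: str, timeout: float = 5.0) -> dict:
    """GET <api_url>/cluster from a Patroni REST API, e.g. http://<pod-ip>:8008."""
    with urllib.request.urlopen(f"{api_url.rstrip('/')}/cluster", timeout=timeout) as resp:
        return json.load(resp)


def members_missing_data(cluster: dict) -> list:
    """Return the names of members that report no host or no state."""
    return [
        m.get("name")
        for m in cluster.get("members", [])
        if not m.get("host") or not m.get("state")
    ]


# Hypothetical /cluster response shaped like the patronictl output above.
SAMPLE_CLUSTER = {
    "members": [
        {"name": "acid-abc-db-0", "role": "leader"},
        {"name": "acid-abc-db-1", "role": "replica"},
    ]
}

if __name__ == "__main__":
    print(members_missing_data(SAMPLE_CLUSTER))
    # Against a live pod (the address is a placeholder taken from the audit log):
    # print(members_missing_data(fetch_cluster("http://10.52.3.36:8008")))
```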

I also tried creating a brand-new cluster under a different name, and it gives me the same result.

Logs from the master pod acid-abc-db-0:

2022-09-18 10:18:45,881 - bootstrapping - INFO - Figuring out my environment (Google? AWS? Openstack? Local?)
2022-09-18 10:18:45,970 - bootstrapping - INFO - Looks like your running google
2022-09-18 10:18:47,087 - bootstrapping - INFO - Configuring bootstrap
2022-09-18 10:18:47,087 - bootstrapping - INFO - Configuring pgqd
2022-09-18 10:18:47,088 - bootstrapping - INFO - Configuring wal-e
2022-09-18 10:18:47,089 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_S3_PREFIX
2022-09-18 10:18:47,090 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_S3_PREFIX
2022-09-18 10:18:47,090 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_ACCESS_KEY_ID
2022-09-18 10:18:47,091 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_SECRET_ACCESS_KEY
2022-09-18 10:18:47,091 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_REGION
2022-09-18 10:18:47,091 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_S3_SSE
2022-09-18 10:18:47,092 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_DOWNLOAD_CONCURRENCY
2022-09-18 10:18:47,092 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_UPLOAD_CONCURRENCY
2022-09-18 10:18:47,093 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/USE_WALG_BACKUP
2022-09-18 10:18:47,093 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/USE_WALG_RESTORE
2022-09-18 10:18:47,093 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_LOG_DESTINATION
2022-09-18 10:18:47,094 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/PGPORT
2022-09-18 10:18:47,094 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/BACKUP_NUM_TO_RETAIN
2022-09-18 10:18:47,095 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/TMPDIR
2022-09-18 10:18:47,095 - bootstrapping - INFO - Configuring log
2022-09-18 10:18:47,095 - bootstrapping - INFO - Configuring patroni
2022-09-18 10:18:47,104 - bootstrapping - INFO - Writing to file /run/postgres.yabc
2022-09-18 10:18:47,105 - bootstrapping - INFO - Configuring pam-oauth2
2022-09-18 10:18:47,106 - bootstrapping - INFO - Writing to file /etc/pam.d/postgresql
2022-09-18 10:18:47,106 - bootstrapping - INFO - Configuring certificate
2022-09-18 10:18:47,107 - bootstrapping - INFO - Generating ssl self-signed certificate
2022-09-18 10:18:47,226 - bootstrapping - INFO - Configuring standby-cluster
2022-09-18 10:18:47,226 - bootstrapping - INFO - Configuring crontab
2022-09-18 10:18:47,227 - bootstrapping - INFO - Skipping creation of renice cron job due to lack of SYS_NICE capability
2022-09-18 10:18:47,242 - bootstrapping - INFO - Configuring pgbouncer
2022-09-18 10:18:47,242 - bootstrapping - INFO - No PGBOUNCER_CONFIGURATION was specified, skipping
2022-09-18 10:18:48,994 INFO: Selected new K8s API server endpoint https://172.16.0.2:443
2022-09-18 10:18:49,017 INFO: No PostgreSQL configuration items changed, nothing to reload.
2022-09-18 10:18:49,020 INFO: Lock owner: None; I am acid-abc-db-0
2022-09-18 10:18:54,082 INFO: trying to bootstrap a new cluster
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.UTF-8".
The default text search configuration will be set to "english".

Data page checksums are enabled.

fixing permissions on existing directory /home/postgres/pgdata/pgroot/data ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default time zone ... Etc/UTC
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
syncing data to disk ... ok

Success. You can now start the database server using:

/usr/lib/postgresql/14/bin/pg_ctl -D /home/postgres/pgdata/pgroot/data -l logfile start

2022-09-18 10:18:56,761 INFO: postmaster pid=92
/var/run/postgresql:5432 - no response
2022-09-18 10:18:56 UTC [92]: [1-1] 6326f090.5c 0     LOG:  Auto detecting pg_stat_kcache.linux_hz parameter...
2022-09-18 10:18:56 UTC [92]: [2-1] 6326f090.5c 0     LOG:  pg_stat_kcache.linux_hz is set to 500000
2022-09-18 10:18:56 UTC [92]: [3-1] 6326f090.5c 0     LOG:  redirecting log output to logging collector process
2022-09-18 10:18:56 UTC [92]: [4-1] 6326f090.5c 0     HINT:  Future log output will appear in directory "../pg_log".
/var/run/postgresql:5432 - accepting connections
/var/run/postgresql:5432 - accepting connections
2022-09-18 10:18:57,834 INFO: establishing a new patroni connection to the postgres cluster
2022-09-18 10:19:02,852 INFO: running post_bootstrap
DO
GRANT ROLE
DO
DO
CREATE EXTENSION
NOTICE:  version "1.1" of extension "pg_auth_mon" is already installed
ALTER EXTENSION
GRANT
CREATE EXTENSION
DO
NOTICE:  version "1.4" of extension "pg_cron" is already installed
ALTER EXTENSION
ALTER POLICY
REVOKE
GRANT
REVOKE
GRANT
ALTER POLICY
REVOKE
GRANT
CREATE FUNCTION
REVOKE
GRANT
REVOKE
GRANT
REVOKE
GRANT
REVOKE
GRANT
REVOKE
GRANT
REVOKE
GRANT
REVOKE
GRANT
REVOKE
GRANT
CREATE EXTENSION
DO
CREATE TABLE
GRANT
ALTER TABLE
ALTER TABLE
ALTER TABLE
CREATE FOREIGN TABLE
GRANT
CREATE VIEW
ALTER VIEW
GRANT
CREATE FOREIGN TABLE
GRANT
CREATE VIEW
ALTER VIEW
GRANT
CREATE FOREIGN TABLE
GRANT
CREATE VIEW
ALTER VIEW
GRANT
CREATE FOREIGN TABLE
GRANT
CREATE VIEW
ALTER VIEW
GRANT
CREATE FOREIGN TABLE
GRANT
CREATE VIEW
ALTER VIEW
GRANT
CREATE FOREIGN TABLE
GRANT
CREATE VIEW
ALTER VIEW
GRANT
CREATE FOREIGN TABLE
GRANT
CREATE VIEW
ALTER VIEW
GRANT
CREATE FOREIGN TABLE
GRANT
CREATE VIEW
ALTER VIEW
GRANT
RESET
SET
NOTICE:  schema "zmon_utils" does not exist, skipping
DROP SCHEMA
DO
NOTICE:  language "plpythonu" does not exist, skipping
DROP LANGUAGE
NOTICE:  function plpython_call_handler() does not exist, skipping
DROP FUNCTION
NOTICE:  function plpython_inline_handler(internal) does not exist, skipping
DROP FUNCTION
NOTICE:  function plpython_validator(oid) does not exist, skipping
DROP FUNCTION
CREATE SCHEMA
GRANT
SET
CREATE TYPE
CREATE FUNCTION
CREATE FUNCTION
GRANT
You are now connected to database "postgres" as user "postgres".
CREATE SCHEMA
GRANT
SET
CREATE FUNCTION
CREATE FUNCTION
REVOKE
GRANT
COMMENT
CREATE FUNCTION
REVOKE
GRANT
COMMENT
CREATE FUNCTION
REVOKE
GRANT
COMMENT
CREATE FUNCTION
REVOKE
GRANT
COMMENT
CREATE FUNCTION
REVOKE
GRANT
COMMENT
CREATE FUNCTION
REVOKE
GRANT
COMMENT
CREATE FUNCTION
REVOKE
GRANT
COMMENT
CREATE FUNCTION
REVOKE
GRANT
COMMENT
GRANT
RESET
CREATE EXTENSION
CREATE EXTENSION
CREATE EXTENSION
NOTICE:  version "3.0" of extension "set_user" is already installed
ALTER EXTENSION
GRANT
GRANT
GRANT
CREATE SCHEMA
GRANT
GRANT
SET
CREATE FUNCTION
REVOKE
GRANT
GRANT
CREATE VIEW
REVOKE
GRANT
GRANT
CREATE FUNCTION
REVOKE
GRANT
GRANT
CREATE VIEW
REVOKE
GRANT
GRANT
CREATE FUNCTION
REVOKE
GRANT
GRANT
CREATE VIEW
REVOKE
GRANT
GRANT
RESET
You are now connected to database "template1" as user "postgres".
CREATE SCHEMA
GRANT
SET
CREATE FUNCTION
CREATE FUNCTION
REVOKE
GRANT
COMMENT
CREATE FUNCTION
REVOKE
GRANT
COMMENT
CREATE FUNCTION
REVOKE
GRANT
COMMENT
CREATE FUNCTION
REVOKE
GRANT
COMMENT
CREATE FUNCTION
REVOKE
GRANT
COMMENT
CREATE FUNCTION
REVOKE
GRANT
COMMENT
CREATE FUNCTION
REVOKE
GRANT
COMMENT
CREATE FUNCTION
REVOKE
GRANT
COMMENT
GRANT
RESET
CREATE EXTENSION
CREATE EXTENSION
CREATE EXTENSION
NOTICE:  version "3.0" of extension "set_user" is already installed
ALTER EXTENSION
GRANT
GRANT
GRANT
CREATE SCHEMA
GRANT
GRANT
SET
CREATE FUNCTION
REVOKE
GRANT
GRANT
CREATE VIEW
REVOKE
GRANT
GRANT
CREATE FUNCTION
REVOKE
GRANT
GRANT
CREATE VIEW
REVOKE
GRANT
GRANT
CREATE FUNCTION
REVOKE
GRANT
GRANT
CREATE VIEW
REVOKE
GRANT
GRANT
RESET
2022-09-18 10:19:05,009 WARNING: Could not activate Linux watchdog device: "Can't open watchdog device: [Errno 2] No such file or directory: '/dev/watchdog'"
2022-09-18 10:19:10,054 INFO: initialized a new cluster
2022-09-18 10:19:15,087 INFO: no action. I am (acid-abc-db-0), the leader with the lock
2022-09-18 10:19:25,582 INFO: no action. I am (acid-abc-db-0), the leader with the lock
2022-09-18 10:19:35,601 INFO: no action. I am (acid-abc-db-0), the leader with the lock
2022-09-18 10:19:45,588 INFO: no action. I am (acid-abc-db-0), the leader with the lock
2022-09-18 10:19:47.662 - /scripts/postgres_backup.sh - I was called as: /scripts/postgres_backup.sh /home/postgres/pgdata/pgroot/data
2022-09-18 10:19:48.397 45 LOG Starting pgqd 3.3
2022-09-18 10:19:48.397 45 LOG auto-detecting dbs ...
2022-09-18 10:19:48.941 - /scripts/postgres_backup.sh - producing a new backup
INFO: 2022/09/18 10:19:49.036810 Selecting the latest backup as the base for the current delta backup...
INFO: 2022/09/18 10:19:49.091402 Calling pg_start_backup()
INFO: 2022/09/18 10:19:49.203073 Starting a new tar bundle
INFO: 2022/09/18 10:19:49.203129 Walking ...
INFO: 2022/09/18 10:19:49.203471 Starting part 1 ...
INFO: 2022/09/18 10:19:50.107584 Packing ...
INFO: 2022/09/18 10:19:50.109248 Finished writing part 1.
INFO: 2022/09/18 10:19:50.428312 Starting part 2 ...
INFO: 2022/09/18 10:19:50.428359 /global/pg_control
INFO: 2022/09/18 10:19:50.437376 Finished writing part 2.
INFO: 2022/09/18 10:19:50.439403 Calling pg_stop_backup()
INFO: 2022/09/18 10:19:51.470246 Starting part 3 ...
INFO: 2022/09/18 10:19:51.496912 backup_label
INFO: 2022/09/18 10:19:51.497397 tablespace_map
INFO: 2022/09/18 10:19:51.497645 Finished writing part 3.
INFO: 2022/09/18 10:19:51.632504 Wrote backup with name base_000000010000000000000002
2022-09-18 10:19:55,586 INFO: no action. I am (acid-abc-db-0), the leader with the lock
2022-09-18 10:20:05,587 INFO: no action. I am (acid-abc-db-0), the leader with the lock
2022-09-18 10:20:15,579 INFO: no action. I am (acid-abc-db-0), the leader with the lock
2022-09-18 10:20:18.427 45 LOG {ticks: 0, maint: 0, retry: 0}
2022-09-18 10:20:25,586 INFO: no action. I am (acid-abc-db-0), the leader with the lock
2022-09-18 10:20:35,578 INFO: no action. I am (acid-abc-db-0), the leader with the lock
2022-09-18 10:20:45,722 INFO: no action. I am (acid-abc-db-0), the leader with the lock
2022-09-18 10:20:48.469 45 LOG {ticks: 0, maint: 0, retry: 0}
2022-09-18 10:20:55,583 INFO: no action. I am (acid-abc-db-0), the leader with the lock
2022-09-18 10:21:05,587 INFO: no action. I am (acid-abc-db-0), the leader with the lock
2022-09-18 10:21:15,586 INFO: no action. I am (acid-abc-db-0), the leader with the lock
2022-09-18 10:21:18.470 45 LOG {ticks: 0, maint: 0, retry: 0}
2022-09-18 10:21:25,586 INFO: no action. I am (acid-abc-db-0), the leader with the lock
2022-09-18 10:21:35,590 INFO: no action. I am (acid-abc-db-0), the leader with the lock
2022-09-18 10:21:45,587 INFO: no action. I am (acid-abc-db-0), the leader with the lock
2022-09-18 10:21:48.501 45 LOG {ticks: 0, maint: 0, retry: 0}
2022-09-18 10:21:55,588 INFO: no action. I am (acid-abc-db-0), the leader with the lock
2022-09-18 10:22:05,589 INFO: no action. I am (acid-abc-db-0), the leader with the lock
2022-09-18 10:22:15,589 INFO: no action. I am (acid-abc-db-0), the leader with the lock
2022-09-18 10:22:18.532 45 LOG {ticks: 0, maint: 0, retry: 0}
2022-09-18 10:22:25,585 INFO: no action. I am (acid-abc-db-0), the leader with the lock
2022-09-18 10:22:35,589 INFO: no action. I am (acid-abc-db-0), the leader with the lock
2022-09-18 10:22:45,584 INFO: no action. I am (acid-abc-db-0), the leader with the lock
2022-09-18 10:22:48.580 45 LOG {ticks: 0, maint: 0, retry: 0}
2022-09-18 10:22:55,583 INFO: no action. I am (acid-abc-db-0), the leader with the lock
2022-09-18 10:23:05,600 INFO: no action. I am (acid-abc-db-0), the leader with the lock
2022-09-18 10:23:15,586 INFO: no action. I am (acid-abc-db-0), the leader with the lock
2022-09-18 10:23:18.572 45 LOG {ticks: 0, maint: 0, retry: 0}
2022-09-18 10:23:25,584 INFO: no action. I am (acid-abc-db-0), the leader with the lock
2022-09-18 10:23:35,591 INFO: no action. I am (acid-abc-db-0), the leader with the lock

Operator logs

# too long; almost all of the log lines are the operator
# creating resources, mostly at debug and info level,
# except for the pod label update error below
# ... more omitted

... level=error msg="failed to create cluster: pod labels error: still failing after 200 retries" cluster-name=default/acid-abc-db pkg=cluster worker=1
... level=error msg="could not create cluster: pod labels error: still failing after 200 retries" cluster-name=default/acid-abc-db pkg=controller worker=1

# ... more omitted
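The count in "still failing after 200 retries" lines up with the operator configuration further down. `pod_label_wait_timeout: 10m` divided by `resource_check_interval: 3s` gives 600 / 3 = 200. This suggests the operator polled for the expected pod labels every 3 seconds for 10 minutes and never saw them, which is consistent with Patroni's label PATCH timing out. That the operator derives its retry count exactly this way is an inference from the matching numbers, not something the logs state. The sketch below only does the arithmetic, and its duration parser handles just the simple `s`/`m`/`h` forms used in that config:

```python
import re

# Duration units used in the operator configuration (only the ones needed here).
_UNITS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text: str) -> int:
    """Parse a simple duration such as '10m' or '3s' into whole seconds."""
    match = re.fullmatch(r"(\d+)([smh])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = match.groups()
    return int(value) * _UNITS[unit]


def max_retries(timeout: str, interval: str) -> int:
    """Number of attempts that fit into timeout when retrying once per interval."""
    return parse_duration(timeout) // parse_duration(interval)


if __name__ == "__main__":
    # pod_label_wait_timeout and resource_check_interval from the operator configuration.
    print(max_retries("10m", "3s"))  # 200
```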

# /home/postgres/.config/patroni/patronictl.yaml

bootstrap:
  clone_with_wale:
    command: envdir "/run/etc/wal-e.d/env-clone-acid-abc-db" python3 /scripts/clone_with_wale.py --recovery-target-time=""
    recovery_conf:
      recovery_target_action: promote
      recovery_target_timeline: latest
      restore_command: envdir "/run/etc/wal-e.d/env-clone-acid-abc-db" timeout "0" /scripts/restore_command.sh "%f" "%p"
  dcs:
    loop_wait: 10
    maximum_lag_on_failover: 33554432
    postgresql:
      parameters:
        archive_mode: 'on'
        archive_timeout: 1800s
        autovacuum_analyze_scale_factor: 0.02
        autovacuum_max_workers: 5
        autovacuum_vacuum_scale_factor: 0.05
        checkpoint_completion_target: '0.9'
        default_statistics_target: '100'
        effective_io_concurrency: '200'
        hot_standby: 'on'
        log_autovacuum_min_duration: 0
        log_checkpoints: 'on'
        log_connections: 'on'
        log_disconnections: 'on'
        log_line_prefix: '%t [%p]: [%l-1] %c %x %d %u %a %h '
        log_lock_waits: 'on'
        log_min_duration_statement: 500
        log_statement: all
        log_temp_files: 0
        max_connections: '512'
        max_parallel_maintenance_workers: '2'
        max_parallel_workers: '32'
        max_parallel_workers_per_gather: '8'
        max_replication_slots: 10
        max_slot_wal_keep_size: 16GB
        max_standby_archive_delay: 0s
        max_standby_streaming_delay: 0s
        max_wal_senders: '16'
        max_wal_size: 4GB
        max_worker_processes: '256'
        min_wal_size: 1GB
        tcp_keepalives_idle: 900
        tcp_keepalives_interval: 100
        track_functions: all
        wal_compression: 'on'
        wal_level: hot_standby
        wal_log_hints: 'on'
      use_pg_rewind: true
      use_slots: true
    retry_timeout: 10
    synchronous_node_count: 1
    ttl: 30
  initdb:
  - auth-host: md5
  - auth-local: trust
  - data-checksums
  - encoding: UTF8
  - locale: en_US.UTF-8
  method: clone_with_wale
  post_init: /scripts/post_init.sh "zalandos"
  users:
    zalandos:
      options:
      - CREATEDB
      - NOLOGIN
      password: ''
kubernetes:
  bypass_api_service: true
  labels:
    application: spilo
  port: tcp://10.56.0.1:443
  port_443_tcp: tcp://10.56.0.1:443
  port_443_tcp_addr: 10.56.0.1
  port_443_tcp_port: '443'
  port_443_tcp_proto: tcp
  ports:
  - name: postgresql
    port: 5432
  role_label: spilo-role
  scope_label: cluster-name
  service_host: 10.56.0.1
  service_port: '443'
  service_port_https: '443'
  use_endpoints: true
postgresql:
  authentication:
    replication:
      password: xxx
      username: standby
    superuser:
      password: xxx
      username: postgres
  basebackup_fast_xlog:
    command: /scripts/basebackup.sh
    retries: 2
  bin_dir: /usr/lib/postgresql/14/bin
  callbacks:
    on_role_change: /scripts/on_role_change.sh zalandos true
  connect_address: 10.52.5.55:5432
  create_replica_method:
  - wal_e
  - basebackup_fast_xlog
  data_dir: /home/postgres/pgdata/pgroot/data
  listen: '*:5432'
  name: acid-abc-db-0
  parameters:
    archive_command: envdir "/run/etc/wal-e.d/env" wal-g wal-push "%p"
    bg_mon.history_buckets: 120
    bg_mon.listen_address: 0.0.0.0
    extwlist.custom_path: /scripts
    extwlist.extensions: btree_gin,btree_gist,citext,extra_window_functions,first_last_agg,hll,hstore,hypopg,intarray,ltree,pgcrypto,pgq,pgq_node,pg_trgm,postgres_fdw,tablefunc,uuid-ossp,timescaledb,pg_partman
    log_destination: csvlog
    log_directory: ../pg_log
    log_file_mode: '0644'
    log_filename: postgresql-%u.log
    log_rotation_age: 1d
    log_truncate_on_rotation: 'on'
    logging_collector: 'on'
    pg_stat_statements.track_utility: 'off'
    shared_buffers: 256MB
    shared_preload_libraries: bg_mon,pg_stat_statements,pgextwlist,pg_auth_mon,set_user,timescaledb,pg_cron,pg_stat_kcache
    ssl: 'on'
    ssl_cert_file: /run/certs/server.crt
    ssl_key_file: /run/certs/server.key
  pg_hba:
  - local   all             all                                   trust
  - hostssl all             +zalandos    127.0.0.1/32       pam
  - host    all             all                127.0.0.1/32       md5
  - hostssl all             +zalandos    ::1/128            pam
  - host    all             all                ::1/128            md5
  - local   replication     standby                    trust
  - hostssl replication     standby all                md5
  - hostnossl all           all                all                reject
  - hostssl all             +zalandos    all                pam
  - hostssl all             all                all                md5
  pgpass: /run/postgresql/pgpass
  recovery_conf:
    restore_command: envdir "/run/etc/wal-e.d/env" timeout "0" /scripts/restore_command.sh "%f" "%p"
  use_unix_socket: true
  use_unix_socket_repl: true
  wal_e:
    command: envdir /run/etc/wal-e.d/env bash /scripts/wale_restore.sh
    no_master: 1
    retries: 2
    threshold_backup_size_percentage: 30
    threshold_megabytes: 102400
restapi:
  connect_address: 10.52.5.55:8008
  listen: :8008
scope: acid-abc-db

Operator configuration

# mostly defaults;
# only changed the common pod secret
# for the backup credentials
---
apiVersion: acid.zalan.do/v1
configuration:
  aws_or_gcp:
    additional_secret_mount_path: /meta/credentials
    aws_region: ap-southeast-1
    enable_ebs_gp3_migration: false
    enable_ebs_gp3_migration_max_size: 1000
  connection_pooler:
    connection_pooler_default_cpu_limit: "1"
    connection_pooler_default_cpu_request: 500m
    connection_pooler_default_memory_limit: 100Mi
    connection_pooler_default_memory_request: 512Mi
    connection_pooler_image: registry.opensource.zalan.do/acid/pgbouncer:master-22
    connection_pooler_max_db_connections: 512
    connection_pooler_mode: transaction
    connection_pooler_number_of_instances: 2
    connection_pooler_schema: pooler
    connection_pooler_user: pooler
  debug:
    debug_logging: true
    enable_database_access: true
  docker_image: registry.opensource.zalan.do/acid/spilo-14:2.1-p5
  enable_crd_registration: true
  enable_crd_validation: true
  enable_lazy_spilo_upgrade: false
  enable_pgversion_env_var: true
  enable_shm_volume: true
  enable_spilo_wal_path_compat: false
  enable_team_id_clustername_prefix: false
  etcd_host: ""
  kubernetes:
    cluster_domain: cluster.local
    cluster_labels:
      application: spilo
    cluster_name_label: cluster-name
    enable_cross_namespace_secret: false
    enable_init_containers: true
    enable_pod_antiaffinity: true
    enable_pod_disruption_budget: true
    enable_sidecars: true
    master_pod_move_timeout: 20m
    oauth_token_secret_name: postgresql-operator
    pdb_name_format: postgres-{cluster}-pdb
    pod_antiaffinity_topology_key: kubernetes.io/hostname
    pod_environment_secret: postgres-common-secret
    pod_management_policy: ordered_ready
    pod_role_label: spilo-role
    pod_service_account_definition: ""
    pod_service_account_name: postgres-pod
    pod_service_account_role_binding_definition: ""
    pod_terminate_grace_period: 5m
    secret_name_template: '{username}.{cluster}.credentials.{tprkind}.{tprgroup}'
    spilo_allow_privilege_escalation: true
    spilo_privileged: false
    storage_resize_mode: pvc
  kubernetes_use_configmaps: false
  load_balancer:
    db_hosted_zone: db.example.com
    enable_master_load_balancer: false
    enable_master_pooler_load_balancer: false
    enable_replica_load_balancer: false
    enable_replica_pooler_load_balancer: false
    external_traffic_policy: Cluster
    master_dns_name_format: '{cluster}.{team}.{hostedzone}'
    replica_dns_name_format: '{cluster}-repl.{team}.{hostedzone}'
  logging_rest_api:
    api_port: 8080
    cluster_history_entries: 1000
    ring_log_lines: 100
  logical_backup:
    logical_backup_docker_image: registry.opensource.zalan.do/acid/logical-backup:v1.8.1
    logical_backup_job_prefix: logical-backup-
    logical_backup_provider: s3
    logical_backup_s3_bucket: my-bucket-url
    logical_backup_s3_sse: AES256
    logical_backup_schedule: 30 00 * * *
  major_version_upgrade:
    major_version_upgrade_mode: "off"
    minimal_major_version: "9.6"
    target_major_version: "14"
  max_instances: -1
  min_instances: -1
  postgres_pod_resources:
    default_cpu_limit: "1"
    default_cpu_request: 100m
    default_memory_limit: 500Mi
    default_memory_request: 100Mi
    min_cpu_limit: 250m
    min_memory_limit: 250Mi
  repair_period: 5m
  resync_period: 30m
  set_memory_request_to_limit: false
  teams_api:
    enable_admin_role_for_users: true
    enable_postgres_team_crd: true
    enable_postgres_team_crd_superusers: false
    enable_team_member_deprecation: false
    enable_team_superuser: false
    enable_teams_api: false
    pam_configuration: https://info.example.com/oauth2/tokeninfo?access_token= uid
      realm=/employees
    pam_role_name: zalandos
    protected_role_names:
    - admin
    - cron_admin
    role_deletion_suffix: _deleted
    team_admin_role: admin
    team_api_role_configuration:
      log_statement: all
    teams_api_url: https://teams.example.com/api/
  timeouts:
    patroni_api_check_interval: 1s
    patroni_api_check_timeout: 5s
    pod_deletion_wait_timeout: 10m
    pod_label_wait_timeout: 10m
    ready_wait_interval: 4s
    ready_wait_timeout: 30s
    resource_check_interval: 3s
    resource_check_timeout: 10m
  users:
    enable_password_rotation: false
    password_rotation_interval: 90
    password_rotation_user_retention: 180
    replication_username: standby
    super_username: postgres
  workers: 8
kind: OperatorConfiguration
metadata:
  name: postgresql-operator-default-configuration
  namespace: default

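Thank you for taking the time to read this, and thanks in advance for any guidance on debugging it.

Update 0

So I tried calling the pod patch manually with curl from inside a pod, and it worked as expected: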

curl --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
  --header "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" \
  https://${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}/api/v1/namespaces/default/pods/acid-abc-db-1 \
  -X PATCH \
  -H 'Content-Type: application/merge-patch+json' \
  -d '{"metadata": {"labels": {"spilo-role": "replica"}}}'

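After that, the replica endpoints became available and the connection pooler could connect to the replica. So why does Patroni's own patch call end in a gateway timeout? Also, `patronictl list` still doesn't show the expected result: Host is still empty and State does not show running.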
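The manual curl call can also be scripted, which makes it easy to set an explicit client timeout and see how long the API server takes to answer. Below is a minimal standard-library sketch of the same JSON merge patch. The function names are hypothetical, and the 10-second default timeout is an arbitrary choice for illustration:

```python
import json
import os
import ssl
import urllib.request

# Standard in-pod service account paths, the same ones the curl command uses.
SA_DIR = "/var/run/secrets/kubernetes.io/serviceaccount"


def build_label_patch(namespace: str, pod: str, labels: dict,
                      host: str, port: str, token: str) -> urllib.request.Request:
    """Build a JSON merge-patch request that sets labels on a pod."""
    url = f"https://{host}:{port}/api/v1/namespaces/{namespace}/pods/{pod}"
    body = json.dumps({"metadata": {"labels": labels}}).encode()
    req = urllib.request.Request(url, data=body, method="PATCH")
    req.add_header("Authorization", f"Bearer {token}")
    req.add_header("Content-Type", "application/merge-patch+json")
    return req


def patch_pod_labels(namespace: str, pod: str, labels: dict, timeout: float = 10.0) -> int:
    """Send the patch from inside a pod and return the HTTP status code.

    urlopen raises urllib.error.HTTPError for error responses such as a 504,
    and socket.timeout / URLError if no answer arrives within `timeout` seconds.
    """
    with open(os.path.join(SA_DIR, "token")) as f:
        token = f.read().strip()
    ctx = ssl.create_default_context(cafile=os.path.join(SA_DIR, "ca.crt"))
    req = build_label_patch(
        namespace, pod, labels,
        os.environ["KUBERNETES_SERVICE_HOST"], os.environ["KUBERNETES_SERVICE_PORT"], token,
    )
    with urllib.request.urlopen(req, context=ctx, timeout=timeout) as resp:
        return resp.status


if __name__ == "__main__":
    # Only meaningful inside a pod whose service account may patch pods.
    print(patch_pod_labels("default", "acid-abc-db-1", {"spilo-role": "replica"}))
```

Timing this call in a loop from the Patroni pod would show whether the API server is slow for every client or only for Patroni's requests.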

1 answer

Stack Overflow user

Accepted answer

Answered on 2022-10-11 12:47:48

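The solution turned out to be quite simple. I went through the Patroni docs and realized there are options for configuring DCS-related settings. The same request worked fine when I tested it with a plain curl call, so permissions shouldn't be the problem, and the 504 error was probably timeout-related. I looked through the docs to see whether the request timeout could be configured, which led me to try a few of the options described in the Patroni docs. I updated the postgresql k8s API object as follows: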

apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: acid-abc-db
  namespace: default
spec:
  # more omitted
  patroni:
    retry_timeout: 128 # the default is 10; raised to 128
  # more omitted

Now the cluster works normally again. Thanks, everyone, for taking the time to read my question, and sorry for the silly mistake.
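One thing to check after raising `retry_timeout`: Patroni's documentation describes the rule `loop_wait + 2 * retry_timeout <= ttl` for these three DCS settings. The patronictl.yaml above has `loop_wait: 10`, `retry_timeout: 10` and `ttl: 30`, so the rule holds exactly. With `retry_timeout: 128` and the other two values unchanged, it no longer holds. `ttl`, which like `retry_timeout` can be set in the manifest's `patroni` section, may therefore need raising as well. The small check below assumes the rule applies to your Patroni version:

```python
def timing_ok(ttl: int, loop_wait: int, retry_timeout: int) -> bool:
    """Check the Patroni rule loop_wait + 2 * retry_timeout <= ttl (all in seconds)."""
    return loop_wait + 2 * retry_timeout <= ttl


def min_ttl(loop_wait: int, retry_timeout: int) -> int:
    """Smallest ttl that satisfies the rule for the given loop_wait and retry_timeout."""
    return loop_wait + 2 * retry_timeout


if __name__ == "__main__":
    print(timing_ok(ttl=30, loop_wait=10, retry_timeout=10))   # True  (original values)
    print(timing_ok(ttl=30, loop_wait=10, retry_timeout=128))  # False (retry_timeout raised)
    print(min_ttl(loop_wait=10, retry_timeout=128))            # 266
```

The trade-off is that a larger `ttl` also lengthens the time before a failover starts when the leader really does disappear.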

Votes: 0

Page content provided by Stack Overflow; translation support by Tencent Cloud Xiaowei's IT-domain engine.

Original link:

https://stackoverflow.com/questions/73762173
