前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >Postgresql中的sync相关参数源码分析

Postgresql中的sync相关参数源码分析

作者头像
mingjie
发布2022-05-12 10:56:09
8620
发布2022-05-12 10:56:09
举报
文章被收录于专栏:Postgresql源码分析

注:本专栏所有分析以函数为主线,必要数据结构会带入讲解;数据库版本为Postgresql10.16。 注:如有讨论的需要请email to jackgo73@outlook.com

一、问题

Postgresql中常见的的sharebuffer配置为内存的25%,而mysql的bp常见配置为内存的75%,原因和刷盘方式不同有关。

常见模式(和配置有关)

  • pgsql:数据依赖OS CACHE,日志依赖OS CACHE
  • mysql:数据不依赖OS CACHE使用O_DIRECT直接刷盘,日志依赖OS CACHE

这里着重分析PG的几种sync参数的不同

二、参数

1 fsync

决定是否同步刷xlog,默认打开。

代码语言:javascript
复制
#fsync = on             # flush data to disk for crash safety
                    # (turning this off can cause
                    # unrecoverable data corruption)

fsync (boolean) If this parameter is on, the PostgreSQL server will try to make sure that updates are physically written to disk, by issuing fsync() system calls or various equivalent methods (see wal_sync_method). This ensures that the database cluster can recover to a consistent state after an operating system or hardware crash.

默认打开。

使用wal_sync_method配置的刷盘函数保证数据落盘。

While turning off fsync is often a performance benefit, this can result in unrecoverable data corruption in the event of a power failure or system crash. Thus it is only advisable to turn off fsync if you can easily recreate your entire database from external data. Examples of safe circumstances for turning off fsync include the initial loading of a new database cluster from a backup file, using a database cluster for processing a batch of data after which the database will be thrown away and recreated, or for a read-only database clone which gets recreated frequently and is not used for failover. High quality hardware alone is not a sufficient justification for turning off fsync. For reliable recovery when changing fsync off to on, it is necessary to force all modified buffers in the kernel to durable storage. This can be done while the cluster is shutdown or while fsync is on by running initdb --sync-only, running sync, unmounting the file system, or rebooting the server. In many situations, turning off synchronous_commit for noncritical transactions can provide much of the potential performance benefit of turning off fsync, without the attendant risks of data corruption. fsync can only be set in the postgresql.conf file or on the server command line. If you turn this parameter off, also consider turning off full_page_writes.

2 wal_sync_method

上述参数打开后,使用什么方式同步刷xlog。默认fdatasync。

代码语言:javascript
复制
#wal_sync_method = fsync        # the default is the first option
                    # supported by the operating system:
                    #   open_datasync
                    #   fdatasync (default on Linux)
                    #   fsync
                    #   fsync_writethrough
                    #   open_sync

wal_sync_method (enum) Method used for forcing WAL updates out to disk. If fsync is off then this setting is irrelevant, since WAL file updates will not be forced out at all. Possible values are:

  • open_datasync (write WAL files with open() option O_DSYNC)
  • fdatasync (call fdatasync() at each commit)
  • fsync (call fsync() at each commit)
  • fsync_writethrough (call fsync() at each commit, forcing write-through of any disk write cache)
  • open_sync (write WAL files with open() option O_SYNC)

The open_* options also use O_DIRECT if available. Not all of these choices are available on all platforms. The default is the first method in the above list that is supported by the platform, except that fdatasync is the default on Linux and FreeBSD. The default is not necessarily ideal; it might be necessary to change this setting or other aspects of your system configuration in order to create a crash-safe configuration or achieve optimal performance. These aspects are discussed in Section 29.1. This parameter can only be set in the postgresql.conf file or on the server command line.

如果系统支持,open_*会使用O_DIRECT;Linux会默认使用fdatasync;

open_datasync:open() O_DSYNC

fdatasync:fdatasync()

fsync:fsync()

fsync_writethrough:fsync()

open_sync:open() O_SYNC

源码位置:src/include/access/xlog.h

代码语言:javascript
复制
/* Sync methods */
#define SYNC_METHOD_FSYNC       0
#define SYNC_METHOD_FDATASYNC   1
#define SYNC_METHOD_OPEN        2   /* for O_SYNC */
#define SYNC_METHOD_FSYNC_WRITETHROUGH  3
#define SYNC_METHOD_OPEN_DSYNC  4   /* for O_DSYNC */
extern int  sync_method;

默认SYNC_METHOD_FDATASYNC

3 synchronous_commit

同步提交约束配置,默认on。

synchronous_commit (enum) Specifies how much WAL processing must complete before the database server returns a “success” indication to the client. Valid values are remote_applyon (the default), remote_writelocal, and off. If synchronous_standby_names is empty, the only meaningful settings are on and offremote_applyremote_write and local all provide the same local synchronization level as on. The local behavior of all non-off modes is to wait for local flush of WAL to disk.

所有非off的配置,都需要等fsync成功,事务才能返回。

In off mode, there is no waiting, so there can be a delay between when success is reported to the client and when the transaction is later guaranteed to be safe against a server crash. (The maximum delay is three times wal_writer_delay.) Unlike fsync, setting this parameter to off does not create any risk of database inconsistency: an operating system or database crash might result in some recent allegedly-committed transactions being lost, but the database state will be just the same as if those transactions had been aborted cleanly. So, turning synchronous_commit off can be a useful alternative when performance is more important than exact certainty about the durability of a transaction. For more discussion see Section 29.3.

事务真正提交 和 事务成功返回客户端 不是一致的! 中间可能最多差三倍的wal_writer_delay

一般把这个参数关了可以提升性能,为什么不关fsync呢?

因为这个参数关了之后,系统crash后最近的几条成功提交的事务会直接丢失,不会造成数据不一致。

而fsync关了之后,日志落盘完全没有保障了,提交了的事物可能一部分刷盘,一部分没有刷盘造成数据不一致。

If synchronous_standby_names is non-empty, synchronous_commit also controls whether transaction commits will wait for their WAL records to be processed on the standby server(s). When set to remote_apply, commits will wait until replies from the current synchronous standby(s) indicate they have received the commit record of the transaction and applied it, so that it has become visible to queries on the standby(s), and also written to durable storage on the standbys. This will cause much larger commit delays than previous settings since it waits for WAL replay. When set to on, commits wait until replies from the current synchronous standby(s) indicate they have received the commit record of the transaction and flushed it to durable storage. This ensures the transaction will not be lost unless both the primary and all synchronous standbys suffer corruption of their database storage. When set to remote_write, commits will wait until replies from the current synchronous standby(s) indicate they have received the commit record of the transaction and written it to their file systems. This setting ensures data preservation if a standby instance of PostgreSQL crashes, but not if the standby suffers an operating-system-level crash because the data has not necessarily reached durable storage on the standby. The setting local causes commits to wait for local flush to disk, but not for replication. This is usually not desirable when synchronous replication is in use, but is provided for completeness. This parameter can be changed at any time; the behavior for any one transaction is determined by the setting in effect when it commits. It is therefore possible, and useful, to have some transactions commit synchronously and others asynchronously. For example, to make a single multistatement transaction commit asynchronously when the default is the opposite, issue SET LOCAL synchronous_commit TO OFF within the transaction. Table 19.1 summarizes the capabilities of the synchronous_commit settings. Table 19.1. synchronous_commit Modes synchronous_commit settinglocal durable commitstandby durable commit after PG crashstandby durable commit after OS crashstandby query consistencyremote_apply••••on••• remote_write•• local• off

wal_writer_delay/wal_writer_flush_after

常用配置

wal_writer_delay = 10ms

wal_writer_flush_after = 0  # IO很好的机器,不需要考虑平滑调度, 否则建议128~256kB

wal_writer_delay (integer) Specifies how often the WAL writer flushes WAL, in time terms. After flushing WAL the writer sleeps for the length of time given by wal_writer_delay, unless woken up sooner by an asynchronously committing transaction. If the last flush happened less than wal_writer_delay ago and less than wal_writer_flush_after worth of WAL has been produced since, then WAL is only written to the operating system, not flushed to disk. If this value is specified without units, it is taken as milliseconds. The default value is 200 milliseconds (200ms). Note that on many systems, the effective resolution of sleep delays is 10 milliseconds; setting wal_writer_delay to a value that is not a multiple of 10 might have the same results as setting it to the next higher multiple of 10. This parameter can only be set in the postgresql.conf file or on the server command line.

不满足10ms的事务只写磁盘,不做flush。

wal_writer_flush_after (integer) Specifies how often the WAL writer flushes WAL, in volume terms. If the last flush happened less than wal_writer_delay ago and less than wal_writer_flush_after worth of WAL has been produced since, then WAL is only written to the operating system, not flushed to disk. If wal_writer_flush_after is set to 0 then WAL data is always flushed immediately. If this value is specified without units, it is taken as WAL blocks, that is XLOG_BLCKSZ bytes, typically 8kB. The default is 1MB. This parameter can only be set in the postgresql.conf file or on the server command line.

同上,从空间上触发刷盘。一般盘的io写日志无瓶颈的话,不需要使用这个参数。

除非发现刷xlog周期性的打满IO,配这个参数有奇效。

三、fsync相关源码

xlog文件创建。

XLogFileInit

代码语言:javascript
复制
int
XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
{
    ...
        fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method),
                           S_IRUSR | S_IWUSR);
    ...
}

get_sync_bit(sync_method)      sync_method = 1

代码语言:javascript
复制
...

/* Sync methods */
#define SYNC_METHOD_FSYNC       0
#define SYNC_METHOD_FDATASYNC   1
#define SYNC_METHOD_OPEN        2   /* for O_SYNC */
#define SYNC_METHOD_FSYNC_WRITETHROUGH  3
#define SYNC_METHOD_OPEN_DSYNC  4   /* for O_DSYNC */
extern int  sync_method;


...

static int
get_sync_bit(int method)
{
    ....
    if (!XLogIsNeeded() && !AmWalReceiverProcess())
        o_direct_flag = PG_O_DIRECT;
    ....

    switch (method)
    {
            /*
             * enum values for all sync options are defined even if they are
             * not supported on the current platform.  But if not, they are
             * not included in the enum option array, and therefore will never
             * be seen here.
             */
        case SYNC_METHOD_FSYNC:
        case SYNC_METHOD_FSYNC_WRITETHROUGH:
        case SYNC_METHOD_FDATASYNC:
            return 0;
#ifdef OPEN_SYNC_FLAG
        case SYNC_METHOD_OPEN:
            return OPEN_SYNC_FLAG | o_direct_flag;
#endif
#ifdef OPEN_DATASYNC_FLAG
        case SYNC_METHOD_OPEN_DSYNC:
            return OPEN_DATASYNC_FLAG | o_direct_flag;
#endif
        default:
            /* can't happen (unless we are out of sync with option array) */
            elog(ERROR, "unrecognized wal_sync_method: %d", method);
            return 0;           /* silence warning */
    }
}

默认函数返回0。

XLogFileOpen

xlog文件打开

代码语言:javascript
复制
    fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method),
                       S_IRUSR | S_IWUSR);

issue_xlog_fsync

触发一次同步刷盘

代码语言:javascript
复制
/*
 * Issue appropriate kind of fsync (if any) for an XLOG output file.
 *
 * 'fd' is a file descriptor for the XLOG file to be fsync'd.
 * 'log' and 'seg' are for error reporting purposes.
 */
void
issue_xlog_fsync(int fd, XLogSegNo segno)
{
    switch (sync_method)
    {
        case SYNC_METHOD_FSYNC:
            if (pg_fsync_no_writethrough(fd) != 0)
                ereport(PANIC,
                        (errcode_for_file_access(),
                         errmsg("could not fsync log file %s: %m",
                                XLogFileNameP(ThisTimeLineID, segno))));
            break;
#ifdef HAVE_FSYNC_WRITETHROUGH
        case SYNC_METHOD_FSYNC_WRITETHROUGH:
            if (pg_fsync_writethrough(fd) != 0)
                ereport(PANIC,
                        (errcode_for_file_access(),
                         errmsg("could not fsync write-through log file %s: %m",
                                XLogFileNameP(ThisTimeLineID, segno))));
            break;
#endif
#ifdef HAVE_FDATASYNC
        case SYNC_METHOD_FDATASYNC:
            if (pg_fdatasync(fd) != 0)
                ereport(PANIC,
                        (errcode_for_file_access(),
                         errmsg("could not fdatasync log file %s: %m",
                                XLogFileNameP(ThisTimeLineID, segno))));
            break;
#endif
        case SYNC_METHOD_OPEN:
        case SYNC_METHOD_OPEN_DSYNC:
            /* write synced it already */
            break;
        default:
            elog(PANIC, "unrecognized wal_sync_method: %d", sync_method);
            break;
    }
}

默认走fdatasync

代码语言:javascript
复制
/*
 * pg_fdatasync --- same as fdatasync except does nothing if enableFsync is off
 *
 * Not all platforms have fdatasync; treat as fsync if not available.
 */
int
pg_fdatasync(int fd)
{
    if (enableFsync)
    {
#ifdef HAVE_FDATASYNC
        return fdatasync(fd);
#else
        return fsync(fd);
#endif
    }
    else
        return 0;
}

HAVE_FDATASYNC宏由configure时配置。

OPEN参数含义

上述过程可以看到,默认情况open参数如下:

O_RDWR | PG_BINARY | 0

  • O_RDWR:打开方式 read/write
  • PG_BINARY:适配windows,默认为0。

S_IRUSR | S_IWUSR

  • S_IRUSR:读权限
  • S_IWUSR:写权限
本文参与 腾讯云自媒体同步曝光计划,分享自作者个人站点/博客。
原始发表:2021-05-17,如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 作者个人站点/博客 前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文参与 腾讯云自媒体同步曝光计划  ,欢迎热爱写作的你一起参与!

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
目录
  • 一、问题
  • 二、参数
    • 1 fsync
      • 2 wal_sync_method
        • 3 synchronous_commit
          • wal_writer_delay/wal_writer_flush_after
          • 三、fsync相关源码
            • XLogFileInit
              • XLogFileOpen
                • issue_xlog_fsync
                  • OPEN参数含义
                  相关产品与服务
                  数据库
                  云数据库为企业提供了完善的关系型数据库、非关系型数据库、分析型数据库和数据库生态工具。您可以通过产品选择和组合搭建,轻松实现高可靠、高可用性、高性能等数据库需求。云数据库服务也可大幅减少您的运维工作量,更专注于业务发展,让企业一站式享受数据上云及分布式架构的技术红利!
                  领券
                  问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档