PostgreSQL Source Code (5): Buffer Management

Study notes based on: https://www.interdb.jp/pg/pgsql08.html

Preface

Imagine the capabilities a simple buffer manager should have:

    • Request a free page to write into, mapped to a specific page on disk
    • When full, automatically evict a page and flush it to disk
    • After writing, choose between flushing immediately or lazily
    • Read a page that is already cached
    • Read a page from disk
    • When full, automatically evict and flush a page, then read in the one we need

Let's see how PostgreSQL implements this.

PostgreSQL's Implementation

1 TAG

{(16821, 16384, 37721), 1, 3}

This means:

  • tablespace = 16821
  • database = 16384
  • relation = 37721
  • forkNum = 1, the free space map fork
  • blockNum = 3, block 3 of that fork's file
typedef struct buftag
{
	RelFileNode rnode;			/* physical relation identifier */
  
    typedef struct RelFileNode
    {
	    Oid			spcNode;		/* tablespace */
	    Oid			dbNode;			/* database */
	    Oid			relNode;		/* relation */
    } RelFileNode;
  
	ForkNumber	forkNum;    // main fork = 0, free space map = 1, visibility map = 2
	BlockNumber blockNum;		/* blknum relative to begin of reln */
} BufferTag;
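
As a sketch of how the tag above would be constructed in code, using the INIT_BUFFERTAG macro from buf_internals.h (the OIDs are just the ones from the example; fork 1 is FSM_FORKNUM, the free space map):

	BufferTag	tag;
	RelFileNode rnode = {.spcNode = 16821, .dbNode = 16384, .relNode = 37721};

	/* fork 1 = FSM_FORKNUM; block 3 of that fork's file */
	INIT_BUFFERTAG(tag, rnode, FSM_FORKNUM, 3);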

2 Structure

A three-layer structure (the middle layer is logical):

  • The buffer table maps a tag straight to a buffer id via the hash table
  • Before a buffer slot can be used, the buffer id is used to locate the descriptor and check the slot's metadata
  • So logically there is a descriptor-array layer in the middle
2.1 Buffer Table

The structure to focus on here: it is the core entry point, the mapping from tag to buffer id.

  • key
    • BufferTag
      • RelFileNode rnode
      • ForkNumber forkNum
      • BlockNumber blockNum
  • value
    • BufferLookupEnt
      • BufferTag key
      • int id
  • A partitioned hash table, so the locking granularity is finer.
void
InitBufTable(int size)
{
	HASHCTL		info;

	/* assume no locking is needed yet */

	/* BufferTag maps to Buffer */
	info.keysize = sizeof(BufferTag);
	info.entrysize = sizeof(BufferLookupEnt);
	info.num_partitions = NUM_BUFFER_PARTITIONS;

	SharedBufHash = ShmemInitHash("Shared Buffer Lookup Table",
								  size, size,
								  &info,
								  HASH_ELEM | HASH_BLOBS | HASH_PARTITION);
}
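
A minimal sketch of how this table gets consulted (the same pattern appears in BufferAlloc later in this article): compute the tag's hash, take the partition lock in shared mode, then look up the buffer id. rnode/forkNum/blockNum are assumed to be in scope:

	BufferTag	tag;
	uint32		hash;
	LWLock	   *partitionLock;
	int			buf_id;

	INIT_BUFFERTAG(tag, rnode, forkNum, blockNum);
	hash = BufTableHashCode(&tag);
	partitionLock = BufMappingPartitionLock(hash);

	LWLockAcquire(partitionLock, LW_SHARED);
	buf_id = BufTableLookup(&tag, hash);	/* >= 0 means the page is cached */
	LWLockRelease(partitionLock);
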
2.2 Buffer Descriptor

This is also an array, initialized here:

void
InitBufferPool(void)
{
	bool		foundBufs,
				foundDescs,
				foundIOLocks,
				foundBufCkpt;

	/* Align descriptors to a cacheline boundary. */
	BufferDescriptors = (BufferDescPadded *)
		ShmemInitStruct("Buffer Descriptors",
						NBuffers * sizeof(BufferDescPadded),
						&foundDescs);
...

The initialized element type is BufferDescPadded; one descriptor per buffer:

typedef union BufferDescPadded
{
	BufferDesc	bufferdesc;
	char		pad[BUFFERDESC_PAD_TO_SIZE];
} BufferDescPadded;

The definition:

/*
 *	BufferDesc -- shared descriptor/state data for a single shared buffer.
 *
 * Note: Buffer header lock (BM_LOCKED flag) must be held to examine or change
【The buffer header lock (BM_LOCKED) must be held to examine or change tag, state and wait_backend_pid】

 * the tag, state or wait_backend_pid fields.  In general, buffer header lock
 * is a spinlock which is combined with flags, refcount and usagecount into
 * single atomic variable.  This layout allow us to do some operations in a
 * single atomic operation, without actually acquiring and releasing spinlock;
 * for instance, increase or decrease refcount.  buf_id field never changes
 * after initialization, so does not need locking.  freeNext is protected by
 * the buffer_strategy_lock not buffer header lock.  The LWLock can take care
 * of itself.  The buffer header lock is *not* used to control access to the
 * data in the buffer!
 *
 * It's assumed that nobody changes the state field while buffer header lock
 * is held.  Thus buffer header lock holder can do complex updates of the
 * state variable in single write, simultaneously with lock release (cleaning
 * BM_LOCKED flag).  On the other hand, updating of state without holding
 * buffer header lock is restricted to CAS, which insure that BM_LOCKED flag
 * is not set.  Atomic increment/decrement, OR/AND etc. are not allowed.
 *
 【While BM_LOCKED (the buffer header lock) is held, nobody else may update state】
 【So the lock holder can fold complex state updates into a single write, released together with the lock】
 
 * An exception is that if we have the buffer pinned, its tag can't change
 * underneath us, so we can examine the tag without locking the buffer header.
 * Also, in places we do one-time reads of the flags without bothering to
 * lock the buffer header; this is generally for situations where we don't
 * expect the flag bit being tested to be changing.
 *
 * We can't physically remove items from a disk page if another backend has
 * the buffer pinned.  Hence, a backend may need to wait for all other pins
 * to go away.  This is signaled by storing its own PID into
 * wait_backend_pid and setting flag bit BM_PIN_COUNT_WAITER.  At present,
 * there can be only one such waiter per buffer.
 *
 * We use this same struct for local buffer headers, but the locks are not
 * used and not all of the flag bits are useful either. To avoid unnecessary
 * overhead, manipulations of the state field should be done without actual
 * atomic operations (i.e. only pg_atomic_read_u32() and
 * pg_atomic_unlocked_write_u32()).
 *
 * Be careful to avoid increasing the size of the struct when adding or
 * reordering members.  Keeping it below 64 bytes (the most common CPU
 * cache line size) is fairly important for performance.
 */
typedef struct BufferDesc
{
	BufferTag	tag;			/* ID of page contained in buffer */
	int			buf_id;			/* buffer's index number (from 0) */

	/* state of the tag, containing flags, refcount and usagecount */
	pg_atomic_uint32 state;

	int			wait_backend_pid;	/* backend PID of pin-count waiter */
	int			freeNext;		/* link in freelist chain */

	LWLock		content_lock;	/* to lock access to buffer contents */
} BufferDesc;
  • tag/buf_id: covered above
  • state
    • flags
      • dirty bit: indicates whether the stored page is dirty.
      • valid bit: the page is readable. (1) Valid: the slot holds data and the corresponding descriptor holds data, so it can be read. (2) Invalid: the descriptor holds no data, or a page replacement is in progress.
      • io_in_progress bit: whether the buffer manager is currently reading/writing the associated page from/to storage; in other words, whether a single process holds this descriptor's io_in_progress_lock.
    • refcount
      • The number of processes accessing the current page, also called the pin count. Accessing a page requires pin count++; when done, pin count--.
      • pin count = 0 is called unpinned; nonzero is pinned.
    • usagecount: how many times the page has been accessed since it was loaded; used by the clock sweep algorithm.
  • freeNext: the next free buffer; a free list threaded through the array

【Three states a descriptor can describe (a small classification sketch follows the list)】

  • Empty
    • When the corresponding buffer pool slot stores no page (refcount and usage_count are both 0), the descriptor is empty.
  • Pinned
    • When the corresponding slot stores a page and some PostgreSQL process is accessing it (refcount and usage_count >= 1), the descriptor is pinned.
  • Unpinned
    • When the corresponding slot stores a page but no PostgreSQL process is accessing it (usage_count >= 1 but refcount is 0), the descriptor is unpinned.
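
To make the three states concrete, here is a hypothetical helper (not in PostgreSQL itself) that classifies a descriptor using the BUF_STATE_GET_REFCOUNT / BUF_STATE_GET_USAGECOUNT macros shown later in this article:

	/* illustrative only: classify a descriptor from its packed state word */
	static const char *
	buf_desc_state(uint32 buf_state)
	{
		if (BUF_STATE_GET_REFCOUNT(buf_state) > 0)
			return "pinned";	/* page stored and currently accessed */
		if (BUF_STATE_GET_USAGECOUNT(buf_state) > 0)
			return "unpinned";	/* page stored, nobody accessing it */
		return "empty";			/* no page stored in the slot */
	}
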
2.3 Buffer Descriptor Logical Layer

The BufferDescriptors array is initialized, with the freelist threaded by buf->freeNext = i + 1:

...
BufferDescPadded *BufferDescriptors;
...
  
  
void
InitBufferPool(void)
{
	bool		foundBufs,
				foundDescs,
				foundIOLocks,
				foundBufCkpt;

	/* Align descriptors to a cacheline boundary. */
	BufferDescriptors = (BufferDescPadded *)
		ShmemInitStruct("Buffer Descriptors",
						NBuffers * sizeof(BufferDescPadded),
						&foundDescs);
...
...
		/*
		 * Initialize all the buffer headers.
		 */
		for (i = 0; i < NBuffers; i++)
		{
			BufferDesc *buf = GetBufferDescriptor(i);

			CLEAR_BUFFERTAG(buf->tag);

			pg_atomic_init_u32(&buf->state, 0);
			buf->wait_backend_pid = 0;

			buf->buf_id = i;

			/*
			 * Initially link all the buffers together as unused. Subsequent
			 * management of this list is done by freelist.c.
			 */
			buf->freeNext = i + 1;

			LWLockInitialize(BufferDescriptorGetContentLock(buf),
							 LWTRANCHE_BUFFER_CONTENT);

			LWLockInitialize(BufferDescriptorGetIOLock(buf),
							 LWTRANCHE_BUFFER_IO_IN_PROGRESS);
		}
...
...

Loading the first page:

  1. Take a free descriptor from the freelist and pin it (refcount++, usage_count++)
  2. Add a tag : buffer_id entry to the buffer table
  3. Read the page contents from storage into memory
  4. Update the metadata in the descriptor (a condensed code sketch follows)
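
A condensed, heavily simplified sketch of that path as it runs through ReadBuffer_common/BufferAlloc (real function names, but with all locking, retries and error handling stripped out):

	/* (1) take a free/victim descriptor; returned with header spinlock held */
	buf = StrategyGetBuffer(strategy, &buf_state);
	PinBuffer_Locked(buf);					/* pin it, release the spinlock */

	/* (2) publish the new tag -> buffer_id mapping in the buffer table */
	BufTableInsert(&newTag, newHash, buf->buf_id);

	/* (3) read the block from storage into the slot's 8K page */
	smgrread(smgr, forkNum, blockNum, (char *) BufHdrGetBlock(buf));

	/* (4) update the descriptor's metadata: mark the buffer valid */
	TerminateBufferIO(buf, false, BM_VALID);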

Once a descriptor has been used it is not returned to the freelist again, unless:

  • the table or index is dropped
  • the database is dropped
  • the table or index is emptied by VACUUM FULL
2.4 Buffer Pool

A region of memory of size 8K * NBuffers:

	BufferBlocks = (char *)
		ShmemInitStruct("Buffer Blocks",
						NBuffers * (Size) BLCKSZ, &foundBufs);
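
Given a (1-based) buffer number, the 8K page it owns is located by plain pointer arithmetic into this region; this is essentially what the shared-buffer branch of the real BufferGetBlock macro in bufmgr.h does (local buffers and assertions omitted):

	/* simplified from bufmgr.h: buffer numbers start at 1, hence the -1 */
	#define MyBufferGetBlock(buffer) \
		((Block) (BufferBlocks + ((Size) ((buffer) - 1)) * BLCKSZ))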

3 Locks

All of these locks live in shared memory.

3.1 Buffer Table Locks

BufMappingLock

The partition locks of the hash table, taken in shared (s) or exclusive (e) mode.

3.2 Descriptor Locks

content_lock

A lightweight lock for reading and writing the PAGE, taken in shared (s) or exclusive (e) mode; a usage sketch follows the list below.

  • Exclusive mode shows up in these cases:
    • inserting into a page, or modifying a tuple's t_xmin/t_xmax fields
    • physically removing tuples or compacting the page's free space (vacuum)
    • freezing tuples within the page
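
The content lock is normally taken through the LockBuffer API. A typical read path looks like this sketch (real bufmgr functions; rel and blkno are assumed to be in scope):

	Buffer		buf;
	Page		page;

	buf = ReadBuffer(rel, blkno);			/* pins the buffer */
	LockBuffer(buf, BUFFER_LOCK_SHARE);		/* content_lock in s mode */

	page = BufferGetPage(buf);
	/* ... examine tuples on the page ... */

	UnlockReleaseBuffer(buf);				/* drop content_lock, then unpin */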

io_in_progress_lock

Used to wait for a PAGE's I/O to finish: while a process loads/writes page data from/to storage, it holds the descriptor's io_in_progress lock exclusively.

spinlock (now the BM_LOCKED flag bit)

The spinlock is taken when the descriptor's flags and other fields are modified.

For example, pinning:

LockBufHdr(bufferdesc);    /* Acquire a spinlock */
bufferdesc->refcount++;
bufferdesc->usage_count++;
UnlockBufHdr(bufferdesc); /* Release the spinlock */

For example, setting the dirty bit to 1:

#define BM_DIRTY             (1 << 0)    /* data needs writing */
#define BM_VALID             (1 << 1)    /* data is valid */
#define BM_TAG_VALID         (1 << 2)    /* tag is assigned */
#define BM_IO_IN_PROGRESS    (1 << 3)    /* read or write in progress */
#define BM_JUST_DIRTIED      (1 << 5)    /* dirtied since write started */

LockBufHdr(bufferdesc);
bufferdesc->flags |= BM_DIRTY;
UnlockBufHdr(bufferdesc);

4 Eviction Strategies

Four strategies:

typedef enum BufferAccessStrategyType
{
	BAS_NORMAL,					/* Normal random access */
	BAS_BULKREAD,				/* Large read-only scan (hint bit updates are
								 * ok) */
	BAS_BULKWRITE,				/* Large multi-block write (e.g. COPY IN) */
	BAS_VACUUM					/* VACUUM */
} BufferAccessStrategyType;

| BufferAccessStrategyType | Use case | Replacement algorithm |
| --- | --- | --- |
| BAS_NORMAL | normal random access | clock sweep |
| BAS_BULKREAD | bulk reads | ring buffer, ring size 256 * 1024 / BLCKSZ |
| BAS_BULKWRITE | bulk writes (e.g. COPY IN) | ring buffer, ring size 16 * 1024 * 1024 / BLCKSZ |
| BAS_VACUUM | VACUUM worker | ring buffer, ring size 256 * 1024 / BLCKSZ |
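
With the default 8K BLCKSZ these work out to 32 buffers for BAS_BULKREAD and BAS_VACUUM and 2048 buffers for BAS_BULKWRITE. The sizes come from GetAccessStrategy in freelist.c, lightly abridged here:

	BufferAccessStrategy
	GetAccessStrategy(BufferAccessStrategyType btype)
	{
		int			ring_size;

		switch (btype)
		{
			case BAS_NORMAL:
				return NULL;		/* no ring: plain clock sweep */
			case BAS_BULKREAD:
				ring_size = 256 * 1024 / BLCKSZ;
				break;
			case BAS_BULKWRITE:
				ring_size = 16 * 1024 * 1024 / BLCKSZ;
				break;
			case BAS_VACUUM:
				ring_size = 256 * 1024 / BLCKSZ;
				break;
		}
		/* ... allocate and return a strategy object with ring_size slots ... */
	}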

Clock sweep algorithm
/*
 * The shared freelist control information.
 */
typedef struct
{
	/* Spinlock: protects the values below */
  
  // spinlock protecting the members below
	slock_t		buffer_strategy_lock;

	/*
	 * Clock sweep hand: index of next buffer to consider grabbing. Note that
	 * this isn't a concrete buffer - we only ever increase the value. So, to
	 * get an actual buffer, it needs to be used modulo NBuffers.
	 */
  
  // next sweep position
	pg_atomic_uint32 nextVictimBuffer;

  // head of the free buffer list
	int			firstFreeBuffer;	/* Head of list of unused buffers */
  
  // tail of the free buffer list
	int			lastFreeBuffer; /* Tail of list of unused buffers */

	/*
	 * NOTE: lastFreeBuffer is undefined when firstFreeBuffer is -1 (that is,
	 * when the list is empty)
	 */

	/*
	 * Statistics.  These counters should be wide enough that they can't
	 * overflow during a single bgwriter cycle.
	 */
  
  // number of complete passes over the array
	uint32		completePasses; /* Complete cycles of the clock sweep */
	pg_atomic_uint32 numBufferAllocs;	/* Buffers allocated since last reset */

	/*
	 * Bgworker process to be notified upon activity or -1 if none. See
	 * StrategyNotifyBgWriter.
	 */
	int			bgwprocno;
} BufferStrategyControl;

Each sweep resumes polling from the previous position, checking each buffer's refcount and usagecount:

  1. If refcount and usagecount are both zero, return that buffer immediately.
  2. If refcount is zero but usagecount is not, decrement usagecount by 1 and move on to the next buffer.
  3. If refcount is nonzero, move on to the next buffer.

The clock sweep keeps looping until it finds a buffer whose refcount and usagecount are both zero (if an entire pass finds every buffer pinned, StrategyGetBuffer gives up with an error instead; see the source below).
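
A self-contained toy version of the loop (hypothetical types; the real code works on packed BufferDesc state words and locks each header, and errors out after a full pass of pinned buffers instead of spinning forever):

	/* toy clock sweep: find a victim with refcount == usagecount == 0 */
	typedef struct { int refcount; int usagecount; } ToyBuf;

	static int
	toy_clock_sweep(ToyBuf *bufs, int nbuffers, int *hand)
	{
		for (;;)
		{
			int		victim = *hand % nbuffers;
			ToyBuf *b = &bufs[victim];

			*hand += 1;
			if (b->refcount == 0)
			{
				if (b->usagecount == 0)
					return victim;		/* found an evictable buffer */
				b->usagecount--;		/* one less chance next pass */
			}
			/* pinned buffers are skipped untouched */
		}
	}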

Free list

  • To speed up finding a free buffer, PostgreSQL keeps them on a linked list.
  • The head and tail of the list are given by the firstFreeBuffer and lastFreeBuffer members of BufferStrategyControl.
  • The list nodes are the BufferDesc structs themselves; the freeNext member points to the next node.
  • Allocation pops from the head of the list; a buffer returned to the list is pushed back onto the head as well (see StrategyFreeBuffer below).
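
The push side is StrategyFreeBuffer in freelist.c, quoted here slightly trimmed: the descriptor is linked back in at the head, under buffer_strategy_lock:

	void
	StrategyFreeBuffer(BufferDesc *buf)
	{
		SpinLockAcquire(&StrategyControl->buffer_strategy_lock);

		/* don't corrupt the list if the buffer is already in it */
		if (buf->freeNext == FREENEXT_NOT_IN_LIST)
		{
			buf->freeNext = StrategyControl->firstFreeBuffer;
			if (buf->freeNext < 0)
				StrategyControl->lastFreeBuffer = buf->buf_id;
			StrategyControl->firstFreeBuffer = buf->buf_id;
		}

		SpinLockRelease(&StrategyControl->buffer_strategy_lock);
	}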

Reference counters

The low 18 bits of state record the reference count in shared memory.

typedef struct BufferDesc
{
	BufferTag	tag;			/* ID of page contained in buffer */
	int			buf_id;			/* buffer's index number (from 0) */

	/* state of the tag, containing flags, refcount and usagecount */
	pg_atomic_uint32 state;

	int			wait_backend_pid;	/* backend PID of pin-count waiter */
	int			freeNext;		/* link in freelist chain */

	LWLock		content_lock;	/* to lock access to buffer contents */
} BufferDesc;

/*
 * Buffer state is a single 32-bit variable where following data is combined.
 *
 * - 18 bits refcount 
【each backend also tracks its own pins in a two-level private cache (array + hash table); this shared field holds the combined result】
 * - 4 bits usage count
 * - 10 bits of flags
 *
 * Combining these values allows to perform some operations without locking
 * the buffer header, by modifying them together with a CAS loop.
 *
 * The definition of buffer state components is below.
 */
#define BUF_REFCOUNT_ONE 1
#define BUF_REFCOUNT_MASK ((1U << 18) - 1)
#define BUF_USAGECOUNT_MASK 0x003C0000U
#define BUF_USAGECOUNT_ONE (1U << 18)
#define BUF_USAGECOUNT_SHIFT 18
#define BUF_FLAG_MASK 0xFFC00000U

/* Get refcount and usagecount from buffer state */
#define BUF_STATE_GET_REFCOUNT(state) ((state) & BUF_REFCOUNT_MASK)
#define BUF_STATE_GET_USAGECOUNT(state) (((state) & BUF_USAGECOUNT_MASK) >> BUF_USAGECOUNT_SHIFT)
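
A standalone toy decoding of one such packed state word, using the same masks and shifts (the value itself is made up for illustration):

	#include <stdint.h>
	#include <stdio.h>

	int
	main(void)
	{
		/* hypothetical state: refcount = 2, usagecount = 5, two flag bits */
		uint32_t state = 2 | (5u << 18) | (1u << 22) | (1u << 23);

		printf("refcount   = %u\n", state & ((1u << 18) - 1));
		printf("usagecount = %u\n", (state & 0x003C0000u) >> 18);
		printf("flags      = 0x%08x\n", state & 0xFFC00000u);
		return 0;
	}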

The real per-backend reference counters live here, one entry per tracked buffer:

typedef struct PrivateRefCountEntry
{
	Buffer		buffer;
	int32		refcount;
} PrivateRefCountEntry;

To find a given buffer's refcount quickly, the PrivateRefCountArray serves as the first-level cache, with a hash table as the second level.

Note: this cache is private to each backend!
/*
 * Backend-Private refcount management:
 *
 * Each buffer also has a private refcount that keeps track of the number of
 * times the buffer is pinned in the current process.  This is so that the
 * shared refcount needs to be modified only once if a buffer is pinned more
 * than once by an individual backend.  It's also used to check that no buffers
 * are still pinned at the end of transactions and when exiting.
【Each backend records in private memory how many times it has pinned each buffer it uses】
【Function 1: if this backend pins a buffer several times, the shared refcount only has to be touched once】
【Function 2: also used to check that no buffers are still pinned at transaction end and on process exit】
 *
 * To avoid - as we used to - requiring an array with NBuffers entries to keep
 * track of local buffers, we use a small sequentially searched array
 * (PrivateRefCountArray) and an overflow hash table (PrivateRefCountHash) to
 * keep track of backend local pins.
 *
【To avoid needing an array of NBuffers entries to track local pins, a small 8-entry array plus an overflow hash table is used】

 * Until no more than REFCOUNT_ARRAY_ENTRIES buffers are pinned at once, all
 * refcounts are kept track of in the array; after that, new array entries
 * displace old ones into the hash table. That way a frequently used entry
 * can't get "stuck" in the hashtable while infrequent ones clog the array.
 *
 * Note that in most scenarios the number of pinned buffers will not exceed
 * REFCOUNT_ARRAY_ENTRIES.
【Until more than 8 buffers are pinned at once, all refcounts are tracked in the array; after that, new entries displace old ones into the hash table】
【That way frequently used entries stay in the array while the infrequent ones sit in the hash table】

 *
 *
 * To enter a buffer into the refcount tracking mechanism first reserve a free
 * entry using ReservePrivateRefCountEntry() and then later, if necessary,
 * fill it with NewPrivateRefCountEntry(). That split lets us avoid doing
 * memory allocations in NewPrivateRefCountEntry() which can be important
 * because in some scenarios it's called with a spinlock held...
【To use this tracking mechanism, first reserve a free array slot with ReservePrivateRefCountEntry】
【Then, when actually needed, fill the slot with NewPrivateRefCountEntry】
【Why the split? To avoid memory allocation inside NewPrivateRefCountEntry, since it is sometimes called while a spinlock is held】
 */

// [level-1 cache]: 8 entries
static struct PrivateRefCountEntry PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES];
// clock hand used to pick which array entry to displace
static uint32 PrivateRefCountClock = 0;
// points to a reserved free slot in the array
static PrivateRefCountEntry *ReservedRefCountEntry = NULL;



// [level-2 cache]: key = buffer id, value = PrivateRefCountEntry
static HTAB *PrivateRefCountHash = NULL;
// number of entries currently in the hash table
static int32 PrivateRefCountOverflowed = 0;

...
void
InitBufferPoolAccess(void)
{
	HASHCTL		hash_ctl;

	memset(&PrivateRefCountArray, 0, sizeof(PrivateRefCountArray));

	MemSet(&hash_ctl, 0, sizeof(hash_ctl));
	hash_ctl.keysize = sizeof(int32);
	hash_ctl.entrysize = sizeof(PrivateRefCountEntry);

	PrivateRefCountHash = hash_create("PrivateRefCount", 100, &hash_ctl,
									  HASH_ELEM | HASH_BLOBS);
}

ReservePrivateRefCountEntry

(1) Reserves an entry in the first-level array in its initial state (buffer = InvalidBuffer, refcount = 0).

(2) If the array is full, moves the entry at the PrivateRefCountClock position into the hash table, then clears that slot and uses it.

(3) Note this function returns nothing; it only maintains ReservedRefCountEntry, making the pointer refer to a free entry for storing a local refcount.

static void
ReservePrivateRefCountEntry(void)
{
	/* Already reserved (or freed), nothing to do */
	if (ReservedRefCountEntry != NULL)
		return;

	/*
	 * First search for a free entry the array, that'll be sufficient in the
	 * majority of cases.
	 */
	{
		int			i;

		for (i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++)
		{
			PrivateRefCountEntry *res;

			res = &PrivateRefCountArray[i];

			if (res->buffer == InvalidBuffer)
			{
				ReservedRefCountEntry = res;
				return;
			}
		}
	}

	/*
	 * No luck. All array entries are full. Move one array entry into the hash
	 * table.
	 */
	{
		/*
		 * Move entry from the current clock position in the array into the
		 * hashtable. Use that slot.
		 */
		PrivateRefCountEntry *hashent;
		bool		found;

		/* select victim slot */
		ReservedRefCountEntry =
			&PrivateRefCountArray[PrivateRefCountClock++ % REFCOUNT_ARRAY_ENTRIES];

		/* Better be used, otherwise we shouldn't get here. */
		Assert(ReservedRefCountEntry->buffer != InvalidBuffer);

		/* enter victim array entry into hashtable */
		hashent = hash_search(PrivateRefCountHash,
							  (void *) &(ReservedRefCountEntry->buffer),
							  HASH_ENTER,
							  &found);
		Assert(!found);
		hashent->refcount = ReservedRefCountEntry->refcount;

		/* clear the now free array slot */
		ReservedRefCountEntry->buffer = InvalidBuffer;
		ReservedRefCountEntry->refcount = 0;

		PrivateRefCountOverflowed++;
	}
}

NewPrivateRefCountEntry

(1) Fills in the buffer id.

(2) Returns the PrivateRefCountEntry, with the new refcount = 0.

static PrivateRefCountEntry *
NewPrivateRefCountEntry(Buffer buffer)
{
	PrivateRefCountEntry *res;

	/* only allowed to be called when a reservation has been made */
	Assert(ReservedRefCountEntry != NULL);

	/* use up the reserved entry */
	res = ReservedRefCountEntry;
	ReservedRefCountEntry = NULL;

	/* and fill it */
	res->buffer = buffer;
	res->refcount = 0;

	return res;
}

GetPrivateRefCountEntry

(1) Takes a buffer id and looks up its PrivateRefCountEntry.

(2) If the buffer is already in the array, returns a pointer to the array element (buffer, refcount) directly.

(3) If not in the array, searches the hash table; returns NULL if it is not there either.

(4) If found in the hash table: when do_move == true, clear out an array slot and move the hash entry into the array; otherwise return the found (buffer, refcount) entry as-is.

static PrivateRefCountEntry *
GetPrivateRefCountEntry(Buffer buffer, bool do_move)
{
	PrivateRefCountEntry *res;
	int			i;

	Assert(BufferIsValid(buffer));
	Assert(!BufferIsLocal(buffer));

	/*
	 * First search for references in the array, that'll be sufficient in the
	 * majority of cases.
	 */
	for (i = 0; i < REFCOUNT_ARRAY_ENTRIES; i++)
	{
		res = &PrivateRefCountArray[i];

		if (res->buffer == buffer)
			return res;
	}

	/*
	 * By here we know that the buffer, if already pinned, isn't residing in
	 * the array.
	 *
	 * Only look up the buffer in the hashtable if we've previously overflowed
	 * into it.
	 */
	if (PrivateRefCountOverflowed == 0)
		return NULL;

	res = hash_search(PrivateRefCountHash,
					  (void *) &buffer,
					  HASH_FIND,
					  NULL);

	if (res == NULL)
		return NULL;
	else if (!do_move)
	{
		/* caller doesn't want us to move the hash entry into the array */
		return res;
	}
	else
	{
		/* move buffer from hashtable into the free array slot */
		bool		found;
		PrivateRefCountEntry *free;

		/* Ensure there's a free array slot */
		ReservePrivateRefCountEntry();

		/* Use up the reserved slot */
		Assert(ReservedRefCountEntry != NULL);
		free = ReservedRefCountEntry;
		ReservedRefCountEntry = NULL;
		Assert(free->buffer == InvalidBuffer);

		/* and fill it */
		free->buffer = buffer;
		free->refcount = res->refcount;

		/* delete from hashtable */
		hash_search(PrivateRefCountHash,
					(void *) &buffer,
					HASH_REMOVE,
					&found);
		Assert(found);
		Assert(PrivateRefCountOverflowed > 0);
		PrivateRefCountOverflowed--;

		return free;
	}
}

5 SRC

ReadBufferExtended
/*
 * ReadBufferExtended -- returns a buffer containing the requested
 *		block of the requested relation.  If the blknum
 *		requested is P_NEW, extend the relation file and
 *		allocate a new block.  (Caller is responsible for
 *		ensuring that only one backend tries to extend a
 *		relation at the same time!)
【Returns the requested PAGE; if blknum == P_NEW, extends the relation file, allocating a new page and reading it into memory】

 *
 * Returns: the buffer number for the buffer containing
 *		the block read.  The returned buffer has been pinned.
 *		Does not return on error --- elog's instead.
 *
【Returns the requested, usable page; note that the page has already been PINned】

 * Assume when this function is called, that reln has been opened already.
 *
 * In RBM_NORMAL mode, the page is read from disk, and the page header is
 * validated.  An error is thrown if the page header is not valid.  (But
 * note that an all-zero page is considered "valid"; see PageIsVerified().)
 
 【RBM_NORMAL: the page is read from disk and its page header is validated】
 
 * RBM_ZERO_ON_ERROR is like the normal mode, but if the page header is not
 * valid, the page is zeroed instead of throwing an error. This is intended
 * for non-critical data, where the caller is prepared to repair errors.
 *
 【RBM_ZERO_ON_ERROR: if page header validation fails, the page is simply zeroed instead of raising an error; intended for non-critical data】
 
 * In RBM_ZERO_AND_LOCK mode, if the page isn't in buffer cache already, it's
 * filled with zeros instead of reading it from disk.  Useful when the caller
 * is going to fill the page from scratch, since this saves I/O and avoids
 * unnecessary failure if the page-on-disk has corrupt page headers.
 * The page is returned locked to ensure that the caller has a chance to
 * initialize the page before it's made visible to others.
 * Caution: do not use this mode to read a page that is beyond the relation's
 * current physical EOF; that is likely to cause problems in md.c when
 * the page is modified and written out. P_NEW is OK, though.
 
 【RBM_ZERO_AND_LOCK, the fast mode: if the page is not already in the buffer pool, it is filled with zeros rather than read from disk. The page is returned locked so nobody can read it before it is initialized】
 
 * RBM_ZERO_AND_CLEANUP_LOCK is the same as RBM_ZERO_AND_LOCK, but acquires
 * a cleanup-strength lock on the page.
 *
 * RBM_NORMAL_NO_LOG mode is treated the same as RBM_NORMAL here.
 *
 * If strategy is not NULL, a nondefault buffer access strategy is used.
 * See buffer/README for details.
 */
Buffer
ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
				   ReadBufferMode mode, BufferAccessStrategy strategy)
{
	bool		hit;
	Buffer		buf;

	/* Open it at the smgr level if not already done */
	RelationOpenSmgr(reln);

Here mdopen is used to open the physical file, recording the opened vfd in reln's md_seg_fds:

1. Fetch the recorded file from reln: return &reln->md_seg_fds[forknum][0]; if it is not there, it has to be opened through the VFD layer.

2. Open the file with PathNameOpenFile and record the VFD in md_seg_fds.

	/*
	 * Reject attempts to read non-local temporary relations; we would be
	 * likely to get wrong data since we have no visibility into the owning
	 * session's local buffers.
	 */
	if (RELATION_IS_OTHER_TEMP(reln))
		ereport(ERROR,
				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
				 errmsg("cannot access temporary tables of other sessions")));

	/*
	 * Read the buffer, and update pgstat counters to reflect a cache hit or
	 * miss.
	 */
	pgstat_count_buffer_read(reln);
	buf = ReadBuffer_common(reln->rd_smgr, reln->rd_rel->relpersistence,
							forkNum, blockNum, mode, strategy, &hit);
	if (hit)
		pgstat_count_buffer_hit(reln);
	return buf;
}
ReadBuffer_common
/*
 * ReadBuffer_common -- common logic for all ReadBuffer variants
 *
 * *hit is set to true if the request was satisfied from shared buffer cache.
 */
static Buffer
ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
				  BlockNumber blockNum, ReadBufferMode mode,
				  BufferAccessStrategy strategy, bool *hit)
{
...
...
		bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
							 strategy, &found);
...
...
}
BufferAlloc
/*
 * BufferAlloc -- subroutine for ReadBuffer.  Handles lookup of a shared
 *		buffer.  If no buffer exists already, selects a replacement
 *		victim and evicts the old page, but does NOT read in new page.
 *
【Find a buffer for the page; if none exists, pick a victim and evict it】

 * "strategy" can be a buffer replacement strategy object, or NULL for
 * the default strategy.  The selected buffer's usage_count is advanced when
 * using the default strategy, but otherwise possibly not (see PinBuffer).
 *
 * The returned buffer is pinned and is already marked as holding the
 * desired page.  If it already did have the desired page, *foundPtr is
 * set TRUE.  Otherwise, *foundPtr is set FALSE and the buffer is marked
 * as IO_IN_PROGRESS; ReadBuffer will now need to do I/O to fill it.
 *
【The returned buffer is pinned and already marked as holding the desired page】
【If the page was already in the buffer pool, *foundPtr is set to true and the buffer is returned directly】
【Otherwise *foundPtr is set to false and the buffer is marked IO_IN_PROGRESS; the caller still has to do the I/O to fill it】

 * *foundPtr is actually redundant with the buffer's BM_VALID flag, but
 * we keep it for simplicity in ReadBuffer.
 *
 * No locks are held either at entry or exit.
 */
static BufferDesc *
BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
			BlockNumber blockNum,
			BufferAccessStrategy strategy,
			bool *foundPtr)
{
	BufferTag	newTag;			/* identity of requested block */
	uint32		newHash;		/* hash value for newTag */
	LWLock	   *newPartitionLock;	/* buffer partition lock for it */
	BufferTag	oldTag;			/* previous identity of selected buffer */
	uint32		oldHash;		/* hash value for oldTag */
	LWLock	   *oldPartitionLock;	/* buffer partition lock for it */
	uint32		oldFlags;
	int			buf_id;
	BufferDesc *buf;
	bool		valid;
	uint32		buf_state;

	/* create a tag so we can lookup the buffer */
	INIT_BUFFERTAG(newTag, smgr->smgr_rnode.node, forkNum, blockNum);

	/* determine its hash code and partition lock ID */
	newHash = BufTableHashCode(&newTag);

Take the hash value modulo NUM_BUFFER_PARTITIONS (128), then fetch the lock from the MainLWLockArray:

#define BufTableHashPartition(hashcode) ((hashcode) % NUM_BUFFER_PARTITIONS)

#define BufMappingPartitionLock(hashcode) (&MainLWLockArray[BUFFER_MAPPING_LWLOCK_OFFSET + BufTableHashPartition(hashcode)].lock)

	newPartitionLock = BufMappingPartitionLock(newHash);

	/* see if the block is in the buffer pool already */
	LWLockAcquire(newPartitionLock, LW_SHARED);
	buf_id = BufTableLookup(&newTag, newHash);

【tag --> buf_id in the hash table; a hit means the page is already in the buffer pool】
	if (buf_id >= 0)
	{
		/*
		 * Found it.  Now, pin the buffer so no one can steal it from the
		 * buffer pool, and check to see if the correct data has been loaded
		 * into the buffer.
		 */
【Found it! Pin the buffer right away】
		buf = GetBufferDescriptor(buf_id);

【This function is dissected below; the refcount is bumped both in the shared descriptor and in the local cache】
		valid = PinBuffer(buf, strategy);

		/* Can release the mapping lock as soon as we've pinned it */
【Once it is PINned, the lock can be released】
		LWLockRelease(newPartitionLock);

		*foundPtr = TRUE;

【Found in the hash table, but after pinning the page turns out to be invalid and has to be read in again】
		if (!valid)
		{
			/*
			 * We can only get here if (a) someone else is still reading in
			 * the page, or (b) a previous read attempt failed.  We have to
			 * wait for any active read attempt to finish, and then set up our
			 * own read attempt if the page is still not BM_VALID.
			 * StartBufferIO does it all.
			 */
			if (StartBufferIO(buf, true))
			{
				/*
				 * If we get here, previous attempts to read the buffer must
				 * have failed ... but we shall bravely try again.
				 */
				*foundPtr = FALSE;
			}
		}

		return buf;
	}

【Reaching here means the tag was not in the hash table: the page is not cached】
	/*
	 * Didn't find it in the buffer pool.  We'll have to initialize a new
	 * buffer.  Remember to unlock the mapping lock while doing the work.
	 */
【A new buffer has to be initialized; release the lock first, then initialize】
	LWLockRelease(newPartitionLock);

	/* Loop here in case we have to try another victim buffer */
	for (;;)
	{
		/*
		 * Ensure, while the spinlock's not yet held, that there's a free
		 * refcount entry.
		 */
【Take a (buffer, refcount) slot out of the private first-level cache array】
【If the array is full, swap one entry out to the second-level hash table, then take the freed slot】
		ReservePrivateRefCountEntry();

		/*
		 * Select a victim buffer.  The buffer is returned with its header
		 * spinlock still held!
		 */
【Find a buffer slot, returning the descriptor and buf_state; the freelist is tried first, then clock sweep eviction】
【This function is expanded below】
		buf = StrategyGetBuffer(strategy, &buf_state);

		Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);

		/* Must copy buffer flags while we still hold the spinlock */
		oldFlags = buf_state & BUF_FLAG_MASK;

		/* Pin the buffer and then release the buffer spinlock */
【The refcount is incremented in both the shared buffer descriptor and the local cache】
		PinBuffer_Locked(buf);

		/*
		 * If the buffer was dirty, try to write it out.  There is a race
		 * condition here, in that someone might dirty it after we released it
		 * above, or even while we are writing it out (since our share-lock
		 * won't prevent hint-bit updates).  We will recheck the dirty bit
		 * after re-locking the buffer header.
		 */
【The victim needs to be flushed】
		if (oldFlags & BM_DIRTY)
		{
			/*
			 * We need a share-lock on the buffer contents to write it out
			 * (else we might write invalid data, eg because someone else is
			 * compacting the page contents while we write).  We must use a
			 * conditional lock acquisition here to avoid deadlock.  Even
			 * though the buffer was not pinned (and therefore surely not
			 * locked) when StrategyGetBuffer returned it, someone else could
			 * have pinned and exclusive-locked it by the time we get here. If
			 * we try to get the lock unconditionally, we'd block waiting for
			 * them; if they later block waiting for us, deadlock ensues.
			 * (This has been observed to happen when two backends are both
			 * trying to split btree index pages, and the second one just
			 * happens to be trying to split the page the first one got from
			 * StrategyGetBuffer.)
			 */
			if (LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
										 LW_SHARED))
			{
				/*
				 * If using a nondefault strategy, and writing the buffer
				 * would require a WAL flush, let the strategy decide whether
				 * to go ahead and write/reuse the buffer or to choose another
				 * victim.  We need lock to inspect the page LSN, so this
				 * can't be done inside StrategyGetBuffer.
				 */
				if (strategy != NULL)
				{
					XLogRecPtr	lsn;

					/* Read the LSN while holding buffer header lock */
					buf_state = LockBufHdr(buf);
					lsn = BufferGetLSN(buf);
					UnlockBufHdr(buf, buf_state);

					if (XLogNeedsFlush(lsn) &&
						StrategyRejectBuffer(strategy, buf))
					{
						/* Drop lock/pin and loop around for another buffer */
						LWLockRelease(BufferDescriptorGetContentLock(buf));
						UnpinBuffer(buf, true);
						continue;
					}
				}
				/* OK, do the I/O */
				TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
														  smgr->smgr_rnode.node.spcNode,
														  smgr->smgr_rnode.node.dbNode,
														  smgr->smgr_rnode.node.relNode);

				FlushBuffer(buf, NULL);
				LWLockRelease(BufferDescriptorGetContentLock(buf));

				ScheduleBufferTagForWriteback(&BackendWritebackContext,
											  &buf->tag);

				TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_DONE(forkNum, blockNum,
														 smgr->smgr_rnode.node.spcNode,
														 smgr->smgr_rnode.node.dbNode,
														 smgr->smgr_rnode.node.relNode);
			}
			else
			{
				/*
				 * Someone else has locked the buffer, so give it up and loop
				 * back to get another one.
				 */
				UnpinBuffer(buf, true);
				continue;
			}
		}
【The chosen block needed no flushing, or has been flushed above】

【If the block's old tag was BM_TAG_VALID, the mapping has to be re-entered in the hash table】
		/*
		 * To change the association of a valid buffer, we'll need to have
		 * exclusive lock on both the old and new mapping partitions.
		 */
		if (oldFlags & BM_TAG_VALID)
		{
			/*
			 * Need to compute the old tag's hashcode and partition lock ID.
			 * XXX is it worth storing the hashcode in BufferDesc so we need
			 * not recompute it here?  Probably not.
			 */
			oldTag = buf->tag;
			oldHash = BufTableHashCode(&oldTag);
			oldPartitionLock = BufMappingPartitionLock(oldHash);

			/*
			 * Must lock the lower-numbered partition first to avoid
			 * deadlocks.
			 */
			if (oldPartitionLock < newPartitionLock)
			{
				LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
			}
			else if (oldPartitionLock > newPartitionLock)
			{
				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
				LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
			}
			else
			{
				/* only one partition, only one lock */
				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
			}
		}
		else
【Otherwise the old TAG is invalid; the past can be ignored, only the new partition needs locking】
		{
			/* if it wasn't valid, we need only the new partition */
			LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
			/* remember we have no old-partition lock or tag */
			oldPartitionLock = NULL;
			/* this just keeps the compiler quiet about uninit variables */
			oldHash = 0;
		}

		/*
		 * Try to make a hashtable entry for the buffer under its new tag.
		 * This could fail because while we were writing someone else
		 * allocated another buffer for the same block we want to read in.
		 * Note that we have not yet removed the hashtable entry for the old
		 * tag.
		 */
【Insert the new TAG into the hash table】
		buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);

		if (buf_id >= 0)
		{
			/*
			 * Got a collision. Someone has already done what we were about to
			 * do. We'll just handle this as if it were found in the buffer
			 * pool in the first place.  First, give up the buffer we were
			 * planning to use.
			 */
			UnpinBuffer(buf, true);

			/* Can give up that buffer's mapping partition lock now */
			if (oldPartitionLock != NULL &&
				oldPartitionLock != newPartitionLock)
				LWLockRelease(oldPartitionLock);

			/* remaining code should match code at top of routine */

			buf = GetBufferDescriptor(buf_id);

			valid = PinBuffer(buf, strategy);

			/* Can release the mapping lock as soon as we've pinned it */
			LWLockRelease(newPartitionLock);

			*foundPtr = TRUE;

			if (!valid)
			{
				/*
				 * We can only get here if (a) someone else is still reading
				 * in the page, or (b) a previous read attempt failed.  We
				 * have to wait for any active read attempt to finish, and
				 * then set up our own read attempt if the page is still not
				 * BM_VALID.  StartBufferIO does it all.
				 */
				if (StartBufferIO(buf, true))
				{
					/*
					 * If we get here, previous attempts to read the buffer
					 * must have failed ... but we shall bravely try again.
					 */
					*foundPtr = FALSE;
				}
			}

			return buf;
		}

		/*
		 * Need to lock the buffer header too in order to change its tag.
		 */
		buf_state = LockBufHdr(buf);

		/*
		 * Somebody could have pinned or re-dirtied the buffer while we were
		 * doing the I/O and making the new hashtable entry.  If so, we can't
		 * recycle this buffer; we must undo everything we've done and start
		 * over with a new victim buffer.
		 */
		oldFlags = buf_state & BUF_FLAG_MASK;
		if (BUF_STATE_GET_REFCOUNT(buf_state) == 1 && !(oldFlags & BM_DIRTY))
			break;

		UnlockBufHdr(buf, buf_state);
		BufTableDelete(&newTag, newHash);
		if (oldPartitionLock != NULL &&
			oldPartitionLock != newPartitionLock)
			LWLockRelease(oldPartitionLock);
		LWLockRelease(newPartitionLock);
		UnpinBuffer(buf, true);
	}

	/*
	 * Okay, it's finally safe to rename the buffer.
	 *
	 * Clearing BM_VALID here is necessary, clearing the dirtybits is just
	 * paranoia.  We also reset the usage_count since any recency of use of
	 * the old content is no longer relevant.  (The usage_count starts out at
	 * 1 so that the buffer can survive one clock-sweep pass.)
	 *
	 * Make sure BM_PERMANENT is set for buffers that must be written at every
	 * checkpoint.  Unlogged buffers only need to be written at shutdown
	 * checkpoints, except for their "init" forks, which need to be treated
	 * just like permanent relations.
	 */
	buf->tag = newTag;
	buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED |
				   BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT |
				   BUF_USAGECOUNT_MASK);
	if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == INIT_FORKNUM)
		buf_state |= BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
	else
		buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;

	UnlockBufHdr(buf, buf_state);

【If the old TAG was valid, it must be deleted from the hash table】
	if (oldPartitionLock != NULL)
	{
		BufTableDelete(&oldTag, oldHash);
		if (oldPartitionLock != newPartitionLock)
			LWLockRelease(oldPartitionLock);
	}

	LWLockRelease(newPartitionLock);

	/*
	 * Buffer contents are currently invalid.  Try to get the io_in_progress
	 * lock.  If StartBufferIO returns false, then someone else managed to
	 * read it before we did, so there's nothing left for BufferAlloc() to do.
	 */
	if (StartBufferIO(buf, true))
		*foundPtr = FALSE;
	else
		*foundPtr = TRUE;

	return buf;
}
PinBuffer

Summary:

  1. Use the buffer id to check in the local cache whether the buffer is already pinned
  2. If already pinned, just increment the local refcount
  3. If not pinned locally, update state in the shared descriptor (refcount++, and usage_count++ capped at 5), then increment the local refcount
  4. Note: after pinning, the page data is not necessarily usable; the return value is (buf_state & BM_VALID) != 0
static bool
PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
{
	Buffer		b = BufferDescriptorGetBuffer(buf);

Here b is the descriptor's buf_id plus 1; buf_id counts from 0.

#define BufferDescriptorGetBuffer(bdesc) ((bdesc)->buf_id + 1)

	bool		result;
	PrivateRefCountEntry *ref;
【Look up (buffer, refcount) in the two-level cache: the array first, then the hash table】
	ref = GetPrivateRefCountEntry(b, true);

	if (ref == NULL)
	{
		uint32		buf_state;
		uint32		old_buf_state;
【Not found: reserve a slot in the array; if all 8 are full, kick one entry out to the hash table】
		ReservePrivateRefCountEntry();
    
【Fill b into the array slot】
		ref = NewPrivateRefCountEntry(b);
    
【Update buf->state】
		old_buf_state = pg_atomic_read_u32(&buf->state);
		for (;;)
		{
 
【If the header is locked, spin until it is unlocked and fetch the fresh state】
			if (old_buf_state & BM_LOCKED)
				old_buf_state = WaitBufHdrUnlocked(buf);

			buf_state = old_buf_state;

			/* increase refcount */
			buf_state += BUF_REFCOUNT_ONE;

 【strategy == NULL means the normal clock sweep path; otherwise a ring buffer is being used for bulk reads/writes】
			if (strategy == NULL)
			{
				/* Default case: increase usagecount unless already max. */
				if (BUF_STATE_GET_USAGECOUNT(buf_state) < BM_MAX_USAGE_COUNT)
					buf_state += BUF_USAGECOUNT_ONE;
			}
			else
			{
				/*
				 * Ring buffers shouldn't evict others from pool.  Thus we
				 * don't make usagecount more than 1.
				 */
				if (BUF_STATE_GET_USAGECOUNT(buf_state) == 0)
					buf_state += BUF_USAGECOUNT_ONE;
			}

			if (pg_atomic_compare_exchange_u32(&buf->state, &old_buf_state,
											   buf_state))
			{
				result = (buf_state & BM_VALID) != 0;
				break;
			}
		}
	}
	else
【Already pinned: only the local refcount needs incrementing; however many times one process pins, the shared descriptor counts it as one】
	{
		/* If we previously pinned the buffer, it must surely be valid */
		result = true;
	}

	ref->refcount++;
	Assert(ref->refcount > 0);
	ResourceOwnerRememberBuffer(CurrentResourceOwner, b);
	return result;
}
StrategyGetBuffer
/*
 * StrategyGetBuffer
 *
 *	Called by the bufmgr to get the next candidate buffer to use in
 *	BufferAlloc(). The only hard requirement BufferAlloc() has is that
 *	the selected buffer must not currently be pinned by anyone.
 *
 *	strategy is a BufferAccessStrategy object, or NULL for default strategy.
 *
 *	To ensure that no one else can pin the buffer before we do, we must
 *	return the buffer with the buffer header spinlock still held.
 */
BufferDesc *
StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state)
{
	BufferDesc *buf;
	int			bgwprocno;
	int			trycounter;
	uint32		local_buf_state;	/* to avoid repeated (de-)referencing */

	/*
	 * If given a strategy object, see whether it can select a buffer. We
	 * assume strategy objects don't need buffer_strategy_lock.
	 */

【If strategy is non-NULL we go through the ring buffer; the strategy object carries the ring's private state】
	if (strategy != NULL)
	{
		buf = GetBufferFromRing(strategy, buf_state);
		if (buf != NULL)
			return buf;
	}

	/*
	 * If asked, we need to waken the bgwriter. Since we don't want to rely on
	 * a spinlock for this we force a read from shared memory once, and then
	 * set the latch based on that value. We need to go through that length
	 * because otherwise bgprocno might be reset while/after we check because
	 * the compiler might just reread from memory.
	 *
	 * This can possibly set the latch of the wrong process if the bgwriter
	 * dies in the wrong moment. But since PGPROC->procLatch is never
	 * deallocated the worst consequence of that is that we set the latch of
	 * some arbitrary process.
	 */

【StrategyControl is the core data structure of the clock sweep, introduced above】
  // (gdb) p	*StrategyControl
// $6 = {buffer_strategy_lock = 0 '\000', nextVictimBuffer = {value = 0}, firstFreeBuffer = 324, lastFreeBuffer = 16383, completePasses = 0, numBufferAllocs = {value = 0}, bgwprocno = 113}
	bgwprocno = INT_ACCESS_ONCE(StrategyControl->bgwprocno);
	if (bgwprocno != -1)
	{
		/* reset bgwprocno first, before setting the latch */
		StrategyControl->bgwprocno = -1;

		/*
		 * Not acquiring ProcArrayLock here which is slightly icky. It's
		 * actually fine because procLatch isn't ever freed, so we just can
		 * potentially set the wrong process' (or no process') latch.
		 */
		SetLatch(&ProcGlobal->allProcs[bgwprocno].procLatch);
	}

	/*
	 * We count buffer allocation requests so that the bgwriter can estimate
	 * the rate of buffer consumption.  Note that buffers recycled by a
	 * strategy object are intentionally not counted here.
	 */
	pg_atomic_fetch_add_u32(&StrategyControl->numBufferAllocs, 1);

	/*
	 * First check, without acquiring the lock, whether there's buffers in the
	 * freelist. Since we otherwise don't require the spinlock in every
	 * StrategyGetBuffer() invocation, it'd be sad to acquire it here -
	 * uselessly in most cases. That obviously leaves a race where a buffer is
	 * put on the freelist but we don't see the store yet - but that's pretty
	 * harmless, it'll just get used during the next buffer acquisition.
	 *
	 * If there's buffers on the freelist, acquire the spinlock to pop one
	 * buffer of the freelist. Then check whether that buffer is usable and
	 * repeat if not.
	 *
	 * Note that the freeNext fields are considered to be protected by the
	 * buffer_strategy_lock not the individual buffer spinlocks, so it's OK to
	 * manipulate them without holding the spinlock.
	 */
	if (StrategyControl->firstFreeBuffer >= 0)
	{
		while (true)
		{
			/* Acquire the spinlock to remove element from the freelist */
			SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
      
【A race window: firstFreeBuffer was >= 0 above, but by the time the lock is held the freelist may already have been drained】
			if (StrategyControl->firstFreeBuffer < 0)
			{
				SpinLockRelease(&StrategyControl->buffer_strategy_lock);
				break;
			}

			buf = GetBufferDescriptor(StrategyControl->firstFreeBuffer);
			Assert(buf->freeNext != FREENEXT_NOT_IN_LIST);

【The first descriptor leaves the freelist: 1. the head pointer advances 2. the removed descriptor's freeNext is invalidated】
			/* Unconditionally remove buffer from freelist */
			StrategyControl->firstFreeBuffer = buf->freeNext;
			buf->freeNext = FREENEXT_NOT_IN_LIST;

			/*
			 * Release the lock so someone else can access the freelist while
			 * we check out this buffer.
			 */
			SpinLockRelease(&StrategyControl->buffer_strategy_lock);

			/*
			 * If the buffer is pinned or has a nonzero usage_count, we cannot
			 * use it; discard it and retry.  (This can only happen if VACUUM
			 * put a valid buffer in the freelist and then someone else used
			 * it before we got to it.  It's probably impossible altogether as
			 * of 8.3, but we'd better check anyway.)
			 */
【Fetch the state and lock the header (BM_LOCKED)】
			local_buf_state = LockBufHdr(buf);

【This is unlikely to fail; it is essentially a sanity check】
【The only way it fails: VACUUM put a valid buffer on the freelist and someone else grabbed and used it before we got here】
			if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0
				&& BUF_STATE_GET_USAGECOUNT(local_buf_state) == 0)
			{
				if (strategy != NULL)
					AddBufferToRing(strategy, buf);
				*buf_state = local_buf_state;
				return buf;
			}
			UnlockBufHdr(buf, local_buf_state);

		}
	}
  
【Reaching here means the freelist had nothing usable; eviction is needed】

	/* Nothing on the freelist, so run the "clock sweep" algorithm */
	trycounter = NBuffers;
	for (;;)
	{
【ClockSweepTick fetches StrategyControl->nextVictimBuffer】
【The big blob of code exists only to keep that increment atomic】

【The loop below is straightforward: sweep all buffers looking for one that is not pinned】
【Return one with usage_count == 0; every visited buffer's usage_count is decremented, so if the first pass finds none, at most 5 passes will】
【usage_count is capped at 5, which keeps the counts from spiraling upward】
		buf = GetBufferDescriptor(ClockSweepTick());

		/*
		 * If the buffer is pinned or has a nonzero usage_count, we cannot use
		 * it; decrement the usage_count (unless pinned) and keep scanning.
		 */
		local_buf_state = LockBufHdr(buf);

		if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0)
		{
			if (BUF_STATE_GET_USAGECOUNT(local_buf_state) != 0)
			{
				local_buf_state -= BUF_USAGECOUNT_ONE;

				trycounter = NBuffers;
			}
			else
			{
				/* Found a usable buffer */
				if (strategy != NULL)
					AddBufferToRing(strategy, buf);
				*buf_state = local_buf_state;
				return buf;
			}
		}
		else if (--trycounter == 0)
		{
			/*
			 * We've scanned all the buffers without making any state changes,
			 * so all the buffers are pinned (or were when we looked at them).
			 * We could hope that someone will free one eventually, but it's
			 * probably better to fail than to risk getting stuck in an
			 * infinite loop.
			 */
			UnlockBufHdr(buf, local_buf_state);
			elog(ERROR, "no unpinned buffers available");
		}
		UnlockBufHdr(buf, local_buf_state);
	}
}