Postgresql源码（30）Postgresql索引基础B-linked-tree

mingjie

发布于 2022-07-14 13:44:39

4750

发布于 2022-07-14 13:44:39

文章被收录于专栏：Postgresql源码分析Postgresql源码分析

阅读顺序

《Postgresql源码（30）Postgresql索引基础B-linked-tree》

《Postgresql源码（31）Btree索引相关系统表和整体结构》

《Postgresql源码（32）Btree索引分裂前后结构差异对比》

《Postgresql源码（33）Btree索引读——整体流程&_bt_first》

《Postgresql源码（34）Btree索引读——_bt_first搜索部分分析》

《Postgresql源码（36）Btree索引读——_bt_next搜索部分分析》

从B树到B+、B*再到B-linked-tree的一些学习总结。

1 B树用阶（order）定义

Every node has at most m children.
Every non-leaf node (except root) has at least ⌈m/2⌉ child nodes.
The root has at least two children if it is not a leaf node.
A non-leaf node with k children contains k − 1 keys.
All leaves appear in the same level and carry no information.
B树除叶子和跟节点，其余节点的孩子数在[ceil(m / 2), m]之间。
所有节点都保存key，叶子节点存在并且也会保存key。
B树要求每个节点至少半满（算法导论：bstar-tree要求2/3满）。
有n棵子树的结点中含有n-1个key。

注意：n阶B树主要限制了孩子的数量，key的数量通过第五条来限制，例如5阶B树某个节点最多有5个孩子，那么该节点只能保存4个key

https://en.wikipedia.org/wiki/B-tree

3阶B树的生长过程：A B Tree insertion example with each iteration. The nodes of this B tree have at most 3 children (Knuth order 3).

5阶B树

2 B树高度

高度等于=log┌m┐(N+1/2)

N：B树中key的数量。
m：阶，表示每个节点至少有m/2个孩子，至多有m个孩子。

树的阶越高，每层存的key数量越多，树的高度约低。

3 B+树

m阶B+树在B树基础上增加要求：

所有非叶子节点都是索引，不保存数据，只保存每个孩子节点的最大值或最小值。
所有叶子节点保存全部key，并形成linkedlist。

B+的优点：

非叶子不保存数据，可以保存大量索引数据，只需要IO叶子，IO次数降低。
遍历、区间查询效率大幅度提高。
查询稳定，每次查询经历的节点树相同。

B+树：

4 B*-tree

m阶B*树在B+树的基础上增加要求：

~~非根和非叶子结点增加指向兄弟的指针。~~（这是B-linked-tree）
非叶子结点关键字个数至少为floor ((2m-1)/3)，即块的最低使用率为2/3（代替B+树的1/2）。

数据结构example：

struct node {
  
    // key of N-1 nodes
    int key[N - 1];
      
    // Child array of 'N' length
    struct node* child[N];
      
    // To state whether a leaf or not; if node 
    // is a leaf, isleaf=1 else isleaf=0
    int isleaf;
      
    // Counts the number of filled keys in a node
    int n;
      
    // Keeps track of the parent node
    struct node* parent;
};

wiki:

The B* tree balances more neighboring internal nodes to keep the internal nodes more densely packed.2 This variant ensures non-root nodes are at least 2/3 full instead of 1/2.13 As the most costly part of operation of inserting the node in B-tree is splitting the node, B*-trees are created to postpone splitting operation as long as they can.14 To maintain this, instead of immediately splitting up a node when it gets full, its keys are shared with a node next to it. This spill operation is less costly to do than split, because it requires only shifting the keys between existing nodes, not allocating memory for a new one.14 For inserting, first it is checked whether the node has some free space in it, and if so, the new key is just inserted in the node. However, if the node is full (it has m − 1 keys, where m is the order of the tree as maximum number of pointers to subtrees from one node), it needs to be checked whether the right sibling exists and has some free space. If the right sibling has j < m − 1 keys, then keys are redistributed between the two sibling nodes as evenly as possible. For this purpose, m - 1 keys from the current node, the new key inserted, one key from the parent node and j keys from the sibling node are seen as an ordered array of m + j + 1 keys. The array becomes split by half, so that ⌊(m + j + 1)/2⌋ lowest keys stay in the current node, the next (middle) key is inserted in the parent and the rest go to the right sibling.14 The situation when right sibling is full, and left isn’t is analogous.14 When both the sibling nodes are full, then the two nodes (current node and a sibling) are split into three and one more key is shifted up the tree, to the parent node.14 If the parent is full, then spill/split operation propagates towards the root node.14 Deleting nodes is somewhat more complex than inserting however.

wiki

B* 树平衡更多相邻的内部节点，以保持内部节点更密集。

【优点】2此变体确保非根节点至少 2/3 满而不是 1/2。 13由于在 B-tree 中插入节点的操作成本最高的部分是拆分节点，因此创建 B*-trees 以尽可能推迟拆分操作。 14为了保持这一点，不是在节点满时立即拆分节点，而是与旁边的节点共享key（把自己的key挪到右兄弟）。这种溢出操作比拆分成本更低，因为它只需要在现有节点之间移动键，而不是为新节点分配内存。
【未满】 14对于插入，首先检查节点中是否有一些空闲空间，如果有，则将新key插入节点中。但是，如果节点已满（它有 m - 1 个键，其中 m 是树的order（阶），作为从一个节点指向子树的最大指针数，就是最大孩子数），则需要检查右兄弟是否存在并且有一些空闲空间。
【满了】如果右兄弟节点有 j < m - 1 个键，则键在两个兄弟节点之间尽可能均匀地重新分配。为此，来自当前节点的 m - 1 个键、插入的新键、来自父节点的一个键和来自兄弟节点的 j 个键被视为 m + j + 1 个键的有序数组。数组被分成两半，因此 ⌊(m + j + 1)/2⌋ 最低的键留在当前节点中，下一个（中间）键插入父节点，其余的进入右兄弟节点。 14 （新插入的key可能最终出现在三个位置中的任何一个。）右兄弟已满而左兄弟未满的情况类似。 14当两个兄弟节点都已满时，这两个节点（当前节点和一个兄弟节点）被分成三个，另外一个键沿树向上移动到父节点。 14如果父节点已满，则溢出/拆分操作向根节点传播。 14然而，删除节点比插入要复杂一些。

B* 和 B+相比更饱满了：

5 B+树分裂、B*树分裂

B+树的分裂
- 当一个结点满时，分配一个新的结点，并将原结点中1/2的数据复制到新结点，最后在父结点中增加新结点的指针；B+树的分裂只影响原结点和父结点，而不会影响兄弟结点，所以它不需要指向兄弟的指针。
B*树的分裂
- （总结一）使用 B* 树优于 B+称为“二到三”拆分，拆分后每个节点中的最小key数不是1/2（B+），而是2/3，使数据更加紧凑。缺点是删除操作复杂。
- （总结二）当一个结点满时，如果它的下一个兄弟结点未满，那么将一部分数据移到兄弟结点中，再在原结点插入关键字，最后修改父结点中兄弟结点的关键字（因为兄弟结点的关键字范围改变了）；如果兄弟也满了，则在原结点与兄弟结点之间增加新结点，并各复制1/3的数据到新结点，最后在父结点增加新结点的指针。

所以，B*树分配新结点的概率比B+树要低，空间使用率更高。

6 B*分裂实例

假设m=4阶B*树，node最多有4个孩子，所以node保存3个key就满了。

6.1 情况一：当前节点满了，右sibling满了

节点{1,2,4}、节点{7,8,9}都已经满了。现在需要插入5。

将{1,2,4}, 5 , {6}, {7,8,9}进行排序
新建三个新的node：p、q、r
1. p：前(2m – 2)/3个的key保存在p中， (2m – 2)/3位置的key保存在parent1中
2. q：接着(2m – 1)/3个key保存在q中，(2m – 1)/3位置的key保存在parent2中
3. r：剩下(2m)/3个key保存在r中
p、q的parent指针更新为parent1，r的parent指针更新为parent2
parent中新增parent2
parent1、parent2维护child指针指向p、q、r

6.1 情况二：当前节点满了，右sibling有空间

节点{1,2,4}、节点{7,8}都已经满了。现在需要插入3。

将当前节点的最后一个元素移动到父节点。
将右兄弟节点中的所有键向右移动，并插入前一个父节点。
将3插入当前节点合适的位置。

9 B-Linked-Tree

9.1 数据结构

《Efficient Locking for Concurrent Operations on B-Trees》

The B-link-tree is a B*-tree modified by adding a single “link” pointer field to each node. This link field points to the next node at the same level of the tree as the current node, except that the link pointer of the rightmost node on a level is a null pointer.

在B*的基础上增加了指向右兄弟。

The purpose of the link pointer is to provide an additional method for reaching a node. When a node is split because of data overflow, a single node is replaced by two new nod&. The link pointer of the first new node points to the second node; the link pointer of the second node contains the old contents of the link pointer field of the first node. Usually, the first new node occupies the same physical page on the disk as the old single node. The intent of this scheme is that the two nodes, since they are joined by a link pointer, are functionally essentially the same as a single node until the proper pointer from their father can be added. The precise search and insertion algorithms for Blink-trees are given in the next two sections.

增加link pointer的目的：提供到达节点的另一种方法。当一个节点因为数据溢出被分裂时，单个节点被两个新的节点&替换。第一个新节点的链接指针指向第二个节点；第二个节点的链接指针包含第一个节点的链接指针字段的旧内容。通常新旧节点在磁盘上占用相同的物理页面。

该方案的目的：两个节点因为它们通过链接指针连接，在功能上基本上与单个节点相同，直到可以添加来自其父节点的正确指针。

Link pointers have the advantage that they are introduced simultaneously with the splitting of the node. Therefore, the link pointer serves as a “temporary fix” that allows correct concurrent operation, even before all of the usual tree pointers are changed for a new (split) node. If the search key exceeds the highest value in a node (as indicated by the high key), it indicates that the tree structure has been changed, and that the twin node should be accessed using the link pointer. While this is slightly less efficient (we need to do an extra disk read to follow a link pointer), it is still a correct method of reaching a leaf node. The link pointers should be used relatively infrequently, since the splitting of a node is an excep- tional case.

新增的指针可以在分裂的时候添加，这是他的优点。因此，链接指针可以充当对btree的临时修复，允许正确的并发操作，甚至在所有常用树指针都更改为新（拆分）节点之前。如果搜索关键字超过了节点中的最大值（如高关键字所示），则表明树结构已更改，应使用链接指针访问孪生节点。虽然这效率稍低（我们需要执行额外的磁盘读取以跟随链接指针），但它仍然是到达叶节点的正确方法。链接指针应该相对不经常使用，因为节点的分裂是一种例外情况。

9.2 查询

To search for a value, u, in the tree, the search process begins at the root and proceeds by comparing u with the values in each node in a path down the tree. In each node, the comparisons produce a pointer to follow from that node, whether to the next level, or to a leaf (record) node. If the search process examines a node and finds that the maximum value given by that node is less than u, then it infers some change has taken place in the current node that had not been indicated in the father at the time the father was examined by the search. The current node must have been split into two (or more) new nodes. The search must then rectify the error in its position in the tree by following the link pointer of the newly split node instead of by following a son pointer as it would ordinarily do. The search process eventually reaches the leaf node in which u must reside if it exists. Either this node contains u, or it does not contain u and the maximum value of the node exceeds u. Therefore, the algorithm correctly determines whether u exists in the tree.

为了在树中搜索值 u，搜索过程从根开始，然后通过将 u 与树下路径中每个节点中的值进行比较来进行。在每个节点中，比较产生一个从该节点跟随的指针，到下一级或叶（记录）节点。
如果搜索过程检查一个节点并发现该节点给出的最大值小于 u，那么它推断在当前节点中发生了一些变化（父节点不知道）。当前节点一定已经拆分了，拆分为两个（或更多）新节点。搜索必须通过遵循新拆分节点的链接指针而不是像通常那样搜索子指针来纠正其在树中位置的错误。

总结差异点：如果扫描过程中，根据父节点的指针拿到一个子节点，然后发现当前节点的**high_key<u**，说明这个节点已经分裂过了，但是没有维护父指针，这时候需要沿着链接指针找到右兄弟继续查询。这就是B-linked-tree上面定义中提到的功能，分裂时可以延迟维护父指针。直到扫描到了在维护。

9.3 插入

To insert a value, u, in the tree, we perform operations similar to that for search above. Beginning at the root, we scan down the tree to the leaf node that should contain the value u. We also keep track of the rightmost node that we examined at each level during the descent through the tree. This descent through the tree constitutes a search for the proper place to insert u (which is, say, node a). The insertion of the value u into the leaf node may necessitate splitting the node (in the case where it was unsafe). In this case, we split the node (as shown in Figure 8), replacing node a by nodes u’（写错了应该是a’） (a new version of a which is written on the same disk page) and b’. The nodes a’ and b’ have the same contents as a, with the addition of the value u. We then proceed back up the tree (using our “remembered” list of nodes through which we searched) to insert entries for the new node (b’) and for the new high key of a’ in the parent of the leaf node. This node, too, may need to be split. If so, we backtrack up the tree, splitting nodes and inserting new pointers into their parents, stopping when we reach a safe node-one that does not need to be split. In all cases, we lock a node before modifying it.

要在树中插入一个值 u，需要执行搜索操作。从根开始向下扫描树到应该包含值 u 的叶节点。我们还跟踪在树的递归过程中每个级别检查的最右边的节点。通过树的这种下降搜索拿到 u 的适当位置（例如，节点 a）。
将值 u 插入叶节点可能需要拆分节点（在不安全的情况下）。在这种情况下，我们拆分节点（如图 8 所示），将节点 a 替换为节点 a’（写入同一磁盘页面的 a 的新版本）和 b’。节点 a’ 和 b’ 与 a 具有相同的内容，但增加了值 u。
然后我们继续返回树（使用我们搜索过的“记忆”节点列表）在叶节点的父节点中插入新节点 (b’) 和 a’ 的新HIGHKEY的条目。
该节点也可能需要拆分。如果是这样，我们回溯树，分裂节点并将新指针插入它们的父节点，当我们到达一个不需要分裂的安全节点时停止。在所有情况下，我们都会在修改节点之前锁定它。

8 B* implementation

// CPP program to implement B* tree
#include <bits/stdc++.h>
using namespace std;

// This can be changed to any value -
// it is the order of the B* Tree
#define N 4

struct node {

	// key of N-1 nodes
	int key[N - 1];
	
	// Child array of 'N' length
	struct node* child[N];
	
	// To state whether a leaf or not; if node
	// is a leaf, isleaf=1 else isleaf=0
	int isleaf;
	
	// Counts the number of filled keys in a node
	int n;
	
	// Keeps track of the parent node
	struct node* parent;
};

// This function searches for the leaf
// into which to insert element 'k'
struct node* searchforleaf(struct node* root, int k,
					struct node* parent, int chindex)
{
	if (root) {

		// If the passed root is a leaf node, then
		// k can be inserted in this node itself
		if (root->isleaf == 1)
			return root;
			
		// If the passed root is not a leaf node,
		// implying there are one or more children
		else {
			int i;
			
		/*If passed root's initial key is itself g
			reater than the element to be inserted,
			we need to insert to a new leaf left of the root*/
			if (k < root->key[0])
				root = searchforleaf(root->child[0], k, root, 0);
				
			else
			{
				// Find the first key whose value is greater
				// than the insertion value
				// and insert into child of that key
				for (i = 0; i < root->n; i++)
					if (root->key[i] > k)
						root = searchforleaf(root->child[i], k, root, i);
						
				// If all the keys are less than the insertion
				// key value, insert to the right of last key
				if (root->key[i - 1] < k)
					root = searchforleaf(root->child[i], k, root, i);
			}
		}
	}
	else {
		
		// If the passed root is NULL (there is no such
		// child node to search), then create a new leaf
		// node in that location
		struct node* newleaf = new struct node;
		newleaf->isleaf = 1;
		newleaf->n = 0;
		parent->child[chindex] = newleaf;
		newleaf->parent = parent;
		return newleaf;
	}
}

struct node* insert(struct node* root, int k)
{
	if (root) {
		struct node* p = searchforleaf(root, k, NULL, 0);
		struct node* q = NULL;
		int e = k;
		
		// If the leaf node is empty, simply
		// add the element and return
		for (int e = k; p; p = p->parent) {
			if (p->n == 0) {
				p->key[0] = e;
				p->n = 1;
				return root;
			}
			// If number of filled keys is less than maximum
			if (p->n < N - 1) {
				int i;
				for (i = 0; i < p->n; i++) {
					if (p->key[i] > e) {
						for (int j = p->n - 1; j >= i; j--)
							p->key[j + 1] = p->key[j];
						break;
					}
				}
				p->key[i] = e;
				p->n = p->n + 1;
				return root;
			}
			
			// If number of filled keys is equal to maximum
			// and it's not root and there is space in the parent
			if (p->n == N - 1 && p->parent && p->parent->n < N) {
				int m;
				for (int i = 0; i < p->parent->n; i++)
					if (p->parent->child[i] == p) {
						m = i;
						break;
					}
					
				// If right sibling is possible
				if (m + 1 <= N - 1)
				{
					// q is the right sibling
					q = p->parent->child[m + 1];
					
					if (q) {
						
						// If right sibling is full
						if (q->n == N - 1) {
							struct node* r = new struct node;
							int* z = new int[((2 * N) / 3)];
							int parent1, parent2;
							int* marray = new int[2 * N];
							int i;
							for (i = 0; i < p->n; i++)
								marray[i] = p->key[i];
							int fege = i;
							marray[i] = e;
							marray[i + 1] = p->parent->key[m];
							for (int j = i + 2; j < ((i + 2) + (q->n)); j++)
								marray[j] = q->key[j - (i + 2)];
								
							// marray=bubblesort(marray, 2*N)
							// a more rigorous implementation will
							// sort these elements

							// Put first (2*N-2)/3 elements into keys of p
							for (int i = 0; i < (2 * N - 2) / 3; i++)
								p->key[i] = marray[i];
							parent1 = marray[(2 * N - 2) / 3];

							// Put next (2*N-1)/3 elements into keys of q
							for (int j = ((2 * N - 2) / 3) + 1; j < (4 * N) / 3; j++)
								q->key[j - ((2 * N - 2) / 3 + 1)] = marray[j];
							parent2 = marray[(4 * N) / 3];

							// Put last (2*N)/3 elements into keys of r
							for (int f = ((4 * N) / 3 + 1); f < 2 * N; f++)
								r->key[f - ((4 * N) / 3 + 1)] = marray[f];

							// Because m=0 and m=1 are children of the same key,
							// a special case is made for them
							if (m == 0 || m == 1) {
								p->parent->key[0] = parent1;
								p->parent->key[1] = parent2;
								p->parent->child[0] = p;
								p->parent->child[1] = q;
								p->parent->child[2] = r;
								return root;
							}

							else {
								p->parent->key[m - 1] = parent1;
								p->parent->key[m] = parent2;
								p->parent->child[m - 1] = p;
								p->parent->child[m] = q;
								p->parent->child[m + 1] = r;
								return root;
							}
						}
					}
					else // If right sibling is not full
					{
						int put;
						if (m == 0 || m == 1)
							put = p->parent->key[0];
						else
							put = p->parent->key[m - 1];
						for (int j = (q->n) - 1; j >= 1; j--)
							q->key[j + 1] = q->key[j];
						q->key[0] = put;
						p->parent->key[m == 0 ? m : m - 1] = p->key[p->n - 1];
					}
				}
			}
		}

		/*Cases of root splitting, etc. are omitted
		as this implementation is just to demonstrate
		the two-three split operation*/
	}
	else
	{
		// Create new node if root is NULL
		struct node* root = new struct node;
		root->key[0] = k;
		root->isleaf = 1;
		root->n = 1;
		root->parent = NULL;
	}
}

// Driver code
int main()
{
	/* Consider the following tree that has been obtained
	from some root split:
				6			
				/ \			
			1 2 4 7 8 9
				
			We wish to add 5. This makes the B*-tree:
				4 7			
				/ \ \		
			1 2 5 6 8 9
			
		Contrast this with the equivalent B-tree, in which
		some nodes are less than half full
			
			4 6
			/ \ \
			1 2 5 7 8 9
			
			*/

	// Start with an empty root
	struct node* root = NULL;
	// Insert 6
	root = insert(root, 6);
	
	// Insert 1, 2, 4 to the left of 6
	root->child[0] = insert(root->child[0], 1);
	root->child[0] = insert(root->child[0], 2);
	root->child[0] = insert(root->child[0], 4);
	root->child[0]->parent = root;
	
	
	// Insert 7, 8, 9 to the right of 6
	root->child[1] = insert(root->child[1], 7);
	root->child[1] = insert(root->child[1], 8);
	root->child[1] = insert(root->child[1], 9);
	root->child[1]->parent = root;
	
	
	cout << "Original tree: " << endl;
	for (int i = 0; i < root->n; i++)
		cout << root->key[i] << " ";
	cout << endl;
	for (int i = 0; i < 2; i++) {
		cout << root->child[i]->key[0] << " ";
		cout << root->child[i]->key[1] << " ";
		cout << root->child[i]->key[2] << " ";
	}
	cout << endl;
	
	
	cout << "After adding 5: " << endl;

	// Inserting element '5':

	root->child[0] = insert(root->child[0], 5);

	// Printing nodes

	for (int i = 0; i <= root->n; i++)
		cout << root->key[i] << " ";
	cout << endl;
	for (int i = 0; i < N - 1; i++) {
		cout << root->child[i]->key[0] << " ";
		cout << root->child[i]->key[1] << " ";
	}
	
	return 0;
}