专栏首页Sth interesting, as a DBAWL#7277: InnoDB: Bulk Load for Create Index

WL#7277: InnoDB: Bulk Load for Create Index



Currently we create an index by repeatedly inserting records. It's very slow

when we have a large collection of records. Bulk loading can be done much

more efficiently.

We sort the records using index key and then insert (exploiting sorted order of

the records). The basic idea of bulk load is we builds an index from bottom up

(also known as sorted index build).

Index entries first go into the leaf page, and when the leaf page is filled up

we add the corresponding search-key entry in to the parent non-leaf node and

start filling next leaf page. Filling records into leaf page is fast, and we

have much less splitting in the building process. Also this approach help in

maintaining good locality and page once pinned and unpinned can be flushed to

disk as it would not be pinned again for build phase.

Problem Statement:

Speed up index build by bulking load the data.

When we build or rebuild an index, we usually have three phases.

Phase-1: Generate Initial Runs

Scan cluster index, generate index entries and add it to sort buffer. When

sort buffer is full, we sort the entries, and write/append the buffer to a

temporary intermediate file.

Phase-2: Merge Sort

Now we have one or more runs in the file, so we run merge sort to get complete

file sorted (in turn all the entries are now sorted).

Phase-1 and Phase-2 is a common external sort alogrithm.

Phase-3: Index Build

Once we have all index entries in sorted order, entries are inserted into B-tree

using normal insert apis. Obivious optimization here should have been using

sorted build. This WL addresses this requirement and enables sorted build path

instead of normal insert apis.

Let's understand current implementation in InnoDB:

  • STEP-1: Optimistic insert - Open a B-tree cursor to find the insert position, and try to insert the tuple in the page. If insert fails (page can't accomodate new record as page has already reached its capacity) try pessimistic insert.
  • STEP-2: Pessimistic insert - Open a B-tree cursor and do necessary split & merge to find a proper space for the tuple.

Current algorithm has following shortcomings:

  • For each entry, position to insert is searched which is costly process even though each search is bounded by log(n) complexity.
  • Current algorithm is top-down which would result in exhaustive split and merge.

New proposed algorithm try to address these issues using bottom-up build

technique and thus avoids/reduces split and merge. (For compressed tables splits

will occur.)

High Level Approach Description:

  • Bottom-up Index BuildSince tuples are ordered, we allocate a leaf page and append tuples to it. Once it is full (fill factor is dictated by external configuration variable), we append a node pointer of the page to its parent page and same process is then followed for all the non-leaf page which in turn can lead to insert upto root. Next tuple would then go to new page following same semantics as described above.

We hold references to the rightmost pages at each level in the B-tree.

All inserts goes to these pages (including node-pointer inserted to non-leaf).

If a page has no space left for a new tuple, we allocate a new sibling page

releasing reference to the old page after updating the non-leaf pages as

needed. The newly pinned siblings will now act as rightmost page and default

location for upcoming new tuples.

  • Redo Logging DisabledREDO logging is selectively turned off and instead checkpoint is done to ensure that build index can withstand crash/failure. Checkpoint will force writing up of all the dirty pages to disk. During the progress of bulk load, we signal page cleaner to flush dirty pages periodically, so the checkpoint could be short and fast.

(Note: page-cleaner thread by default would do it but would trigger the action

only if non-dirty pages fall below some set threshold. In this case we would

like to flush the pages immediately reducing overhead of checkpoint and also

parallelize IO and CPU activities.)

There are 2 main action that needs redo logging:

  1. Page-Allocation: REDO logging is turned-on as usual for this action except if complete table is being rebuild.
  2. Page-Modification: REDO logging is turned-off for this. On crash there is no need to restore page content as anyways index is half-build and will not be updated in metadata.
  3. Fill Factor

The fill-factor value determines percentage of space on page to be filled with

data, reserving the remainder on each page as free space for future growth. For

example, specifying a fill-factor value of 80 means that 20 percent of each page

will be left empty. We don't keep exactly 20 percent free space, and the actual

free percentage varies, greater than 20, or less than 20. It would be

interpreted as hint and not a hard assurance.

Fill factor applies to both leaf-level page and non-leaf page, but doesn't apply

to external page for TEXT/BLOB in B-tree. Fill-factor concept is newly being

introduced and use of it is currently restricted only for sorted build.

It is controlled using a global configuration parameter named


Agreeable limits: 10 <= fill-factor <= 100

Note: it could be better that we support fill factor per index in syntax. e.g.

CREATE INDEX ... FILLFACTOR=80. We need a separate WL to address it cooperating

with other teams. As Marko mentioned, fill factor should be an attribute in

dd.indexes.options string.

  • Compressed TableIn existing approach record is inserted to both uncompressed and compressed page. When the modification log of the compressed page is filled up, we recompress the compressed page, and split the page if compression fails.

In bulk load, we append records only to uncompressed pages. when a uncompressed

page is filled up, we compress the page. If compression fails, we split the page

into two pages, and recompress until the compression succeeds.

We keep certain padding space in a page, which is determined by adaptive


  • Fulltext IndexCurrent implementation use SQL to insert a record into fulltext index. We will use bulk load to speedup fulltext index build as well.



0 条评论
登录 后参与评论


  • 听原作者为你深度解读InnoDB批量建索引原理

  • MySQL8.03 RC 已发布

    The MySQL 8.0.3 Release Candidate is available

  • MySQL8.03 RC 已发布

    MySQL开发团队非常高兴地宣布,第一个8.0 RC版本8.0.3现已可在dev.mysql.com下载(相对于8.0.2,8.0.1和8.0.0,8.0.3添...

  • MySQL InnoDB Lock(二)

    MySQL InnoDB Lock主要从5个部分介绍,这篇文章承接 上一篇 ,会详细介绍后3部分。 ---- 数据库数据一致性 InnoDB事物一致级别 Inn...

  • OCP真题精简版--你值得拥有

  • MySQL每秒57万的写入,带你飞~


  • MySQL 每秒 570000 的写入,如何实现?


  • MySQL 每秒 570000 的写入,如何实现?


  • MySQL 每秒 570000 的写入,如何实现?


  • MySQL每秒57万的写入,带你飞~


  • 太牛了,MySQL每秒写入57万条数据


  • MySQL 8.0.21 GA!重点解读

    MySQL 8.0.21 版本已于昨日发布(dev.mysql.com),开始对一些术语如 Master / Slave 等做了替换。下面是来自官方团队对此版本...

  • 浅谈MySQL自增锁


  • 如何在数据库中高效实现订座功能?

  • MySQL 每秒 570000 的写入,如何实现?


  • 我在MySQL的那些年(一)

    | 作者 赖铮(Allen Lai)前MySQL官方团队成员,专注数据库内核开发近二十年,先后就职于达梦,Teradata,北大方正以及MySQL InnoD...

    腾讯云数据库 TencentDB
  • salt-api return mysql返回的使用,记录操作日志

  • 自增主键,三类插入测验答案,在这里。

    《三类插入与自增键的关系》一文,基本解答了《自增键四道测验题》,仍有水友要求贴答案,原理都解释了,copy语句执行下,真的难么? 画外音:你们赢了,我还是贴一下...

  • docker安装服务

    2、docker run --name mynginx -d nginx: 运行nginx实例