Currently we create an index by repeatedly inserting records, which is very slow
for a large collection of records. Bulk loading can be done much faster: we sort
the records by the index key and then insert them, exploiting their sorted
order. The basic idea of bulk load is to build the index from the bottom up
(also known as a sorted index build).
Index entries first go into a leaf page; when the leaf page fills up, we add the
corresponding search-key entry into the parent non-leaf node and start filling
the next leaf page. Filling records into leaf pages is fast, and the build
incurs far fewer page splits. This approach also maintains good locality: once a
page has been pinned and unpinned, it can be flushed to disk, since it will not
be pinned again during the build phase.
Speed up index build by bulk loading the data.
When we build or rebuild an index, we usually have three phases.
Phase-1: Generate Initial Runs
Scan the clustered index, generate index entries, and add them to a sort buffer.
When the sort buffer is full, we sort the entries and write/append the buffer to
a temporary intermediate file as a run.
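The run-generation phase above can be sketched as follows. This is an
illustrative model, not InnoDB's actual code: `generate_runs` and
`sort_buffer_size` are hypothetical names, and the "write to a temporary file"
step is modeled by collecting the sorted runs in a list.

```python
# Hypothetical sketch of Phase-1 (run generation). A full sort buffer is
# sorted and spilled as one run; here the spill is modeled as a list append.
def generate_runs(entries, sort_buffer_size):
    """Split the input into sorted runs, as if spilling a full sort buffer."""
    runs = []
    buffer = []
    for entry in entries:
        buffer.append(entry)
        if len(buffer) == sort_buffer_size:  # sort buffer is full
            runs.append(sorted(buffer))      # sort, then "write" the run
            buffer = []
    if buffer:
        runs.append(sorted(buffer))          # flush the final partial run
    return runs
```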
Phase-2: Merge Sort
Now we have one or more runs in the file, so we run a merge sort to get the
complete file sorted (in turn, all the entries are now sorted).
Phase-1 and Phase-2 together are a standard external sort algorithm.
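The merge step of Phase-2 can be sketched with a k-way merge. This is a
simplified stand-in, not the InnoDB implementation: `heapq.merge` streams the
sorted runs without materializing them all at once, which mirrors merging runs
that live in a temporary file.

```python
import heapq

# Sketch of Phase-2: a k-way merge of the sorted runs produced in Phase-1.
def merge_runs(runs):
    """Merge already-sorted runs into one fully sorted sequence."""
    return list(heapq.merge(*runs))
```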
Phase-3: Index Build
Once we have all index entries in sorted order, they are inserted into the
B-tree using the normal insert APIs. The obvious optimization here is to use a
sorted build instead. This WL addresses that requirement and enables the sorted
build path in place of the normal insert APIs.
Let's understand the current implementation in InnoDB.
The current algorithm has the following shortcomings:
The newly proposed algorithm tries to address these issues using a bottom-up
build technique, and thus avoids/reduces splits and merges. (For compressed tables splits
High Level Approach Description:
We hold references to the rightmost page at each level of the B-tree.
All inserts go to these pages (including node pointers inserted into non-leaf
pages). If a page has no space left for a new tuple, we allocate a new sibling
page and release the reference to the old page after updating the non-leaf pages
as needed. The newly pinned sibling then acts as the rightmost page and the
default location for upcoming tuples.
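The rightmost-page-per-level approach above can be sketched as follows. This is
a toy model under stated assumptions, not InnoDB code: pages are plain lists,
`PAGE_CAPACITY` is an invented constant, and the first page at each level gets
no node pointer (a simplification of real B-tree node pointers).

```python
# Illustrative sketch of the bottom-up (sorted) build: only the rightmost
# page at each level is open; a full page is closed, a new sibling is
# allocated, and the sibling's first key is pushed as a node pointer into
# the level above.
PAGE_CAPACITY = 4  # hypothetical number of entries per page

def bulk_build(sorted_entries):
    levels = [[[]]]  # levels[0] is the leaf level; each level is a list of pages

    def insert(level, key):
        if level == len(levels):
            levels.append([[]])          # grow the tree by one level
        page = levels[level][-1]         # rightmost page at this level
        if len(page) == PAGE_CAPACITY:   # no space: allocate a new sibling...
            levels[level].append([key])
            insert(level + 1, key)       # ...and push its node pointer upward
        else:
            page.append(key)

    for key in sorted_entries:
        insert(0, key)
    return levels
```

Because the input is sorted, inserts only ever touch the rightmost page at each
level, so no page ever splits after it has been closed.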
(Note: the page-cleaner thread would do this by default, but it triggers the
action only when the number of clean pages falls below a set threshold. In this
case we would like to flush the pages immediately, reducing checkpoint overhead
and also parallelizing IO and CPU activity.)
There are two main actions that need redo logging:
The fill-factor value determines the percentage of space on a page to be filled
with data, reserving the remainder of each page as free space for future growth.
For example, a fill-factor value of 80 means that 20 percent of each page will
be left empty. We don't keep exactly 20 percent free; the actual free percentage
varies, and may be greater or less than 20. The value is interpreted as a hint,
not a hard guarantee.
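A worked example of the arithmetic, assuming a 16 KiB page (InnoDB's default
page size) and a fill factor of 80; the variable names are illustrative only.

```python
# Fill-factor hint: the build tries to fill 80% of each page and leave
# the remaining 20% free for future growth (a hint, not a guarantee).
PAGE_SIZE = 16 * 1024  # assumed 16 KiB page
fill_factor = 80

usable = PAGE_SIZE * fill_factor // 100  # bytes the build aims to fill
reserved = PAGE_SIZE - usable            # bytes left free for later inserts
```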
Fill factor applies to both leaf-level and non-leaf pages, but does not apply to
external pages for TEXT/BLOB columns in the B-tree. The fill-factor concept is
newly introduced, and its use is currently restricted to sorted builds.
It is controlled using a global configuration parameter named
Agreeable limits: 10 <= fill-factor <= 100
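Enforcing the agreed limits could look like the following; `clamp_fill_factor`
is a hypothetical helper, not an existing InnoDB function.

```python
def clamp_fill_factor(value):
    # Enforce the agreed limits: 10 <= fill-factor <= 100.
    return max(10, min(100, value))
```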
Note: it may be better to support a per-index fill factor in syntax, e.g.
CREATE INDEX ... FILLFACTOR=80. We need a separate WL to address this, in
cooperation with other teams. As Marko mentioned, fill factor should be an attribute in
In bulk load, we append records only to uncompressed pages. When an uncompressed
page is filled up, we compress it. If compression fails, we split the page into
two pages and recompress, repeating until compression succeeds.
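The compress-then-split loop above can be sketched as a recursive halving. This
is a hedged model, not the real code path: `try_compress` is a stand-in
predicate, whereas actual InnoDB compression depends on the zip page size and
record contents.

```python
# Sketch of the compressed-table path: if a page fails to compress, split
# it into two halves and retry on each until every piece compresses.
def compress_or_split(page, try_compress):
    """Return a list of pages that each compress successfully."""
    # A single-record page cannot be split further, so accept it as-is
    # (a simplification to keep the sketch terminating).
    if len(page) <= 1 or try_compress(page):
        return [page]
    mid = len(page) // 2
    return (compress_or_split(page[:mid], try_compress) +
            compress_or_split(page[mid:], try_compress))
```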
We keep a certain amount of padding space in each page, which is determined by adaptive