The ‘B’ in B-Tree – Indexing in SQL Server

Some people advocate we should count by twelve rather than ten. This is because you can easily count the twelve segments of your fingers by using your thumb as a pointer. With that method, you can count the number of indexes in SQL Server on one hand.

The 12 SQL Server Indexes

That’s right, twelve different types of indexes. Many database administrators do not have experience working with all of them. However, you will never be a master of your craft unless you understand all the tools in your toolbox.

In this blog series I will; describe each type of index, point out their best use cases, and give a detailed explanation of their physical structure. Before I serve the entrée I would like to discuss the foundation of the B-Tree index and heaps.

What is a B-Tree index?

On more than one occasion I have been told that the B-Tree index is the index structure which was designed after the A-Tree index. Another common misconception is that it stands for Binary-Tree. As logical as those may seem, they are false. The ‘B’ in B-Tree does not actually have any specific meaning. Check out Ed McCreight’s explanation here (16:08) where he admits that the name discussion was never settled.

In its most basic form, the B-Tree index is a hierarchy of data pages (page structures lightly touched on in the next post of this series). The lowest level is called the leaf level, the highest level is the index root, and all levels in between are the intermediate levels. This structure is an improvement over the Binary Tree index because its balanced nature greatly improved the performance of maintenance operations such as, INSERT, DELETE, and UPDATE.

Basic B-tree Structure

Rudolf Bayer and Ed McCreight, the creators of the original B-Tree index, assumed that, “…the index itself is so voluminous that only rather small parts of it can be kept in main store at one time.” Their solution to this performance problem is the B-Tree index which supports index seeks and can be maintained with minimal overhead. An index seek is an operation that begins by reading the root page and then works its way down the levels by checking to see if the requested key is less than or greater than the keys stored in each page. This minimizes the amount of pages which need to be read in order for the data retrieval to occur. Reading the full index is much less efficient and considered an index scan.

In the last 45 years, the B-Tree index has been improved. SQL Server uses a B+-Tree index not the traditional B-Tree index. There are two major differences.

  • In a B+-Tree index, all of the data is in the leaf level pages. Only key values and pointers exist in the root and intermediate levels.
  • In a B+-Tree index, there are pointers which point to the next and last page in the level.

B-tree Structure with Read-ahead

With each page pointing to the page ahead and behind you can do range scans without having to travel up and down the index levels. Take this scenario…

You query an index with the predicate, WHERE id >= 8 AND id < 25. Once the engine has searched down to the leaf page for the key value of 8, an assumption can be made that the index is sorted. Therefore, with the next page pointers, the engine can simply move through the leaf level until it reaches the last key value which is less than 25. Without these pointers, the engine would have to travel up and down the index levels for each page which would be far less efficient.

History of the B-Tree

Rudolf Bayer and Ed McCreight are the creators of the B-Tree index. There was a lot of focus on index structures during the 1960s. The B-Tree index was first published, however, in their paper Organization and Maintenance of Large Ordered Indices (earlier version, Mathematical and Information Sciences Report No. 20) in 1970.

B-Tree timeline

In the 1970s, the B-Tree became the de facto standard for indexing file systems and database management systems. The 1970’s was also an age of excitement and enthusiasm which brought about multiple variations of the B-Tree index. As mentioned, SQL Server uses the B+-Tree index. The various types are explained, compared, and contrasted in Douglas Comer’s paper, The Ubiquitous B-Tree (1979).

In the 1980s, advancements in concurrency and data access patterns were realized. Concurrency questions revolved around issues of data durability and crash recovery. When should we log index maintenance operations? When should page splits and merging be logged and persisted to durable storage? Those questions produced a number of methods with the Algorithm for Recovery and Isolation Exploiting Semantics (ARIES) being the most successful.

As the B-Tree evolved and fundamental issues were resolved, the 1990s brought us new, specialized, B-Tree indexes with better features and improvements in input/output (I/O) performance. One example of the high performance indexes is Log-Structured Merge-Trees (LSM-Trees). A LSM-Tree performs better performance by writing in batches rather than real-time. It maintains more than one tree. One of them, T0, is smaller and fits 100% in memory. The second tree, T1, remains on disk. Writes do not flush to disk until a size threshold is crossed in T0. This Yahoo! Research paper explains a particular flavor (bLSM-Tree) and compares the performance of it to the B-Tree and basic LSM-Tree.

The new features that I referred to include Snapshots, Shadowing, and Cloning. Each of these are variations of the B-Tree index which allow for leaf pages, sub-trees, and even root pages to be duplicated to support multiple versioned copies while maintaining good performance. These features and the copy-on-write (COW) concept which supports them is explained in detail in Ohad Rodeh’s paper, B-trees, Shadowing, and Clones.

Next time

In the next post of this series I will cover heaps. A heap is a structure of table data which does not have an index. Learning about heaps will be important for understanding indexes.

This next post will be the last post which does not focus on a specific SQL Server index type but it will be more in-line with the advertised format.

  1. Describe each type of index.
  2. Point out their best use cases.
  3. Give a detailed explanation of their physical structure.

References

This article has 1 comment

  1. […] clustered index is a B+-Tree as described in the first post of this series. It is a hierarchical tree which has a root level, zero to many intermediate levels, […]

Leave a Reply