In the last post of this blog series, I discussed what a B-Tree index was and briefly explained its history.
Before we dive into the indexes which SQL Server uses, it is important to set the foundation of data structures. The most basic table that you can create is a heap. A heap is an unsorted structure of pages which are not linked to each other.
When should you use heaps?
The best practice recommendation is to avoid SQL Server heaps, in most cases. Kimberly Tripp explains some of the reasons and debate here but I will cover the high-level. Heaps, being the most basic storage structure, is lacking features which improve performance for normal (OLTP) workloads. Some argue that heaps are better for extract, transformation, and loading (ETL) operations because there are less pages to maintain during write operations (UPDATE/INSERT/DELETE). Specifically, heaps can be faster than a B-tree index if page splits occur and many levels of the index must be updated. In addition, the row identifier (RID), discussed more below, is only 8 bytes. This can perform better than a clustered index with a lot of key columns because non-clustered indexes must store the clustered key or RID for each indexed row. Typically I recommend using a clustered index (more on clustered indexes as we continue this series) rather than a heap but, if you do use them, use them only when you have proven that they will be more efficient than a clustered index. After discussing a heap’s structure, I will outline the performance concerns of using heaps.
A heap is a grouping of unsorted pages which are not linked. Page anatomy is out of scope for this series since all types of indexed and non-indexed tables use the same page structure but I do encourage you to check out here and here to learn more.
A heap is comprised of one or more index allocation map (IAM) pages which point to the data pages which make up the heap. The only exception to this is when you have a row which has been updated and could not fit in its page anymore. In that case, you get a forwarding pointer to the row which has been moved to an existing page with space or a new page. It is possible for you to produce a chain of forwarding records if the row continues to need relocation by further operations.
Image reference: MSDN
A heap has one row in sys.partitions per partition and its index_id will equal zero. In this record, the first_iam_page points to the first of the index allocation map (IAM) pages. An IAM page maps pages for each allocation unit and manages 4 GB chunks of the table. Allocation units include:
- IN_ROW_DATA allocation unit
- LOB_DATA allocation unit
- ROW_OVERFLOW_DATA allocation unit
The IAM page contains the standard 96-byte header followed by the IAM header. The IAM header begins with eight slots for mixed extent allocations and then an 8000-byte bit-map to locate uniform extents allocated to the table.
A mixed extent is one which contains pages from more than one table. This is also referred to as a shared extent. The purpose of a mixed extent is to save space with tables which are smaller than 64 KB. A uniform extent is one where all pages in the extent belong to that single table.
Image Reference: The Extent
To investigate our IAM page, first we must locate it.
DBCC IND ('demo','dbo.demo',1)
With DBCC IND, we have 17 data pages and 1 IAM page. IAM pages are not self-referencing, therefore the IAMFID (IAM file Id) and IAMPID (IAM page Id) are NULL. This means that our IAM page is in file Id (PageFID) 1 and is page Id (PagePID) 297. DBCC PAGE will allow us to view our IAM page.
DBCC TRACEON(3604); DBCC PAGE ('demo',1,297,3);
In YELLOW there are eight single page allocations in our mixed extent. In ORANGE the uniform extents are displayed in ranges.
Performing various data manipulation language (DML) operations on a heap has these affects.
- INSERT – New rows can be placed in the first available page with sufficient space. So whenever a new row is inserted it will likely be added to the last page.
- UPDATE – Rows can remain on the same page if it fits in the page after the update, if not it will be removed from the current page and placed on the first available page with sufficient space and a forwarding pointer is written in its original page.
- DELETE – Data is not overwritten, space is just flagged as available for reuse. This can cause your table to consume significantly more disk space than is necessary.
- SELECT – A table scan on the entire table will need to be read for most queries. The exception to this is when there is a suitable non-clustered index available. We will discuss non-clustered indexes later in this series.
Without any supporting index, a heap requires a full table scan for any query. This is because the physical structure is un-sorted and there is no page linking to support range scans.
SET STATISTICS IO ON DBCC IND ('demo','dbo.demo',1) SELECT COUNT(*) [RowCount] FROM dbo.demo
Repeating our DBCC IND command from above reminds us that we have 18 pages for this table. 17 data pages and 1 IAM page.
--Removed DBCC IND results because it is already displayed above. --18 row result, indicates 18 pages. (18 row(s) affected) DBCC execution completed. If DBCC printed error messages, contact your system administrator. RowCount ----------- 5031 (1 row(s) affected) Table 'Demo'. Scan count 1, logical reads 17, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. (1 row(s) affected)
Note the 17 logical reads from our query. This is because the IAM page is not counted in the logical reads metric and we performed a full scan of the table with 18 pages.
Writes and forward pointers
As mentioned above, INSERTs and DELETEs are fairly straight forward. An INSERT occurs on the first page with available space even if that means the creation of a new page. DELETEs are performed by de-allocating the space the row was occupying.
UPDATEs are where things go downhill for heaps. When a fixed length data type is updated the update will occur in-place. If a variable length data type is updated and the length has increased, there is a chance that the row will no longer fit in the same page. When the row no longer fits, a forwarding pointer is put into the space where the original row was and the row is relocated to a new page. This process has a negative impact on read performance. Let’s demonstrate.
Originally all of our records in the dbo.demo table contained VARCHARs with 9 character strings.
SELECT TOP 5 * FROM dbo.demo
To create some forward pointers we will update all of our rows and increase the length of the varLen column.
update dbo.demo set varLen = replicate('a',100)
We now have 95 pages in our heap because all of our rows have been relocated to new pages with forward pointers left behind.
DBCC IND ('demo','dbo.demo',1)
As you can imagine, with heaps requiring full table scans, we will need to read a lot more pages for the same amount of records now.
SET STATISTICS IO ON SELECT COUNT(*) [RowCount] FROM dbo.demo
(1 row(s) affected) Table 'Demo'. Scan count 1, logical reads 4405, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Our reads increased drastically because the original pages had to be read and then the forward pointers had to be followed. The performance impact and waste of space in memory and on disk is why we need clustered indexes.
The heap data structure is not an index and is often the worst performing data structure for your tables in SQL Server. We covered it today as a foundation of understanding which will be useful as we discuss indexes. In the next part of this series, clustered indexes, a B+-tree structure for table data, will be covered.