Clustered Indexes v/s Non-Clustered Indexes

In this post, I would like to give a small overview of Clustered and Non-Clustered Indexes.

DISCLAIMER: I am an Oracle employee, and the views/opinions expressed in the below article are purely my own and do not express the views of my employer.


Let’s start with similarities:

Similarities:

  • Both Clustered and Non-Clustered indexes are types of index structures for a database table.
  • The fundamental reason behind creating them is still the same — speeding up the query performance as deemed necessary by the client applications using the database.
  • As far as I know, both are created and managed using the same fundamental database indexing structure — our very own disk based B/B+ Trees.
  • So the act of looking up or probing the index structure using the index key is same for both clustered and non-clustered indexes — get to the root block, appropriate branch block(s), and finally down to the leaf block where the data of interest is sitting.
  • Both clustered and non-clustered index can be UNIQUE or NON-UNIQUE.

Differences:

The following two things differentiate both the index types at a high level:

  • What exactly is the final outcome of the index lookup operation ?
  • Any impact on the fundamental organization of data (rows) in the base table the user has created the index on ?

Non-Clustered Indexes:

  • A non-clustered index structure exists as a separate first class object in the database. The base table T and any non-clustered index created by the user on T exist as separate structures in the database.
  • This type of index does not alter or enforce any ordering of rows in the table.
  • In other words, rows of the table are not really stored in any sorted order. Such types of tables are also known as Heap Tables in the database terminology.
  • However, index entries in index blocks are always sorted on the index key.
  • This type of index operates like a back-pointer. The leaf block of non-clustered index structure store a mapping from the index key to the corresponding row in table T that the index key points to. This mapping is stored in the form of row locators (or whatever it is called in respective databases).
  • The main thing to understand is that this row locator helps the database in quickly getting to the corresponding row. Usually the database parses the locator and uses each piece of distinct information stored in it to get to the row in a disk block.
    • <index key, row locator>. This is what is stored in the leaf block of a non-clustered index.
  • The result of lookup in a non-clustered index is not the actual row data or column values. The only column values stored in the index are the values for the index key columns. Apart from this there is a row locator which is then used by the database to fetch the actual row to return all/subset of column data requested by the client.
  • This also tells there is an extra level of indirection in the use of non-clustered index. The way to get to the row works like this:
    • Use the index key and probe the index. In a 3-level index (root + branch + leaf), this would mean doing 3 logical I/Os on the buffer pool of database to see if the disk blocks are in database buffer cache or not.
    • Read the corresponding row locator from index leaf block and use this to fetch the actual row data from another disk block — note that this is the actual data block and is different from index block. So this will be an additional I/O.
  • Because this index structure does not really impact the base table T’s structure in any way, we can have multiple non-clustered indexes on a given table on different columns. This should obviously be driven by workload characteristics that describe what kinds of queries are most expected and accordingly we optimize them by creating indexes.
  • Non-Clustered Indexes are also referred to as Secondary Indexes.

Clustered Indexes:

  • A very important fact: clustered index does _not_ exist as a separate object/structure in the database. In other words, the clustered index is actually the table itself. That is why such table can also be called as Indexed Organized Tables (IOTs).
  • The index leaf block in a clustered index will obviously store the index key just like we have in a non-clustered index. But in addition to that, it will store the entire row data — values of all the non-key columns as well.
  • The non-key column data is co-located (aka clustered) with index key in index blocks. So, there is no need to store a row locator or a pointer to row in index blocks and follow that to retrieve the values of non-key columns. This is why we can say that  “index is the table” and “table is the index”. This is the most important thing to remember about clustered indexes.
  • Clustered index will alter the way data(rows) are stored on disk. All the rows will be stored (in index blocks) in sorted manner on the key columns (clustering key) used to create the clustered index. But in a non-clustered index there is no guarantee on the organization of actual data rows in data blocks. Only the index entries (index key, row locator) are sorted on the index key.
  • Because a clustered index stores the rows in sorted order on the clustering key, it is logically not possible to have multiple clustered indexes on any table.
  • This is completely different from a combination of a HEAP Table (rows stored in arbitrary order) with one or more non-clustered (or secondary) indexes on it.

Comparison

Clustered index saves an additional I/O that we have to do in non-clustered index to fetch the row data. This is because the complete row data resides in a clustered index leaf block whereas only the row locator is stored in a non-clustered index leaf block. So we save an additional I/O to get to the row.

This suggests that point lookups using clustered index will usually be faster than non-clustered index. This is true for both UNIQUE and NON-UNIQUE index type.

Similarly in case of range scan, clustered index will be faster. The performance gap between clustered and non-clustered index can be very significant if the scan happens to retrieve large amounts of data.

Let’s take the following query which requires a range scan:

SELECT COL1, COL2, COL3 FROM T WHERE SALARY>100000 AND SALARY<800000;

As with any index structure on SALARY column, the index will be probed to get to the leaf block. But what happens after this can really make a difference:

  • In a non-clustered index we will retrieve row locators (as many as fall in the predicate range) and do additional work to fetch the actual row data from another block using the row locator. The corresponding rows might be present in random physical locations on disk as no care is taken to physically store them together and minimize disk seek latency. In the worst case, each row can be in a different data block and we may see a huge increase in overall elapsed time due to disk seek cost.
  • However with a clustered index, we get the advantage of the row really sitting along with the index key and we retrieve the actual row(s) from index block itself. We essentially  see the benefits of“clustering” the data.

Let’s take another example to demonstrate clustering:

Consider Quora database having an ANSWERS table with following definition:

<AnswerID, Timestamp, AuthorName, Topic, AnswerData, Upvotes, CommentData>.  Similarly there is an AUTHORS table describing information about an author  <AuthorID, AuthorName, JoinDate, AnswerCount>.

If the following query is very frequently run, then creating a clustered index on “AuthorName” column in ANSWERS table would be beneficial.

SELECT AnswerData, Upvotes, Topic, AnswerID, CommentData FROM ANSWERS WHERE AuthorName=<blah>;

Because the ANSWERS table is organized as a clustered index (sorted on AuthorName), all the answers for a particular author are physically stored together (aka clustered) on disk. This will make the query run much faster as opposed to running the query with a non-clustered index on AuthorName column on ANSWERS table.

Key UPDATEs will require movement of entries in index blocks. This is true for both clustered and non-clustered indexes because index entries are sorted on the index key. However, key UPDATEs can get costly in clustered index. This is because the entire row will have to be moved (row movement) to maintain the sorted order of rows on clustering key. This is not the case with non-clustered index because it does not physically store the row data together with index entries. So there is really no row movement involved in non-clustered indexes. For this reason, it is usually recommended to create clustered index on Primary Key column since the columns values are unlikely to change.

Similarly, INSERTs are likely to be costly with clustered indexes. This is again due to row movement to maintain the sorted order in index blocks. Again, this is not the case with non-clustered indexes because there is no requirement of maintaining a deterministic order on the physical organization of rows. Rows are stored in random order (HEAP Table).

If the table is stored as a clustered index (which is really the actual table) and there are one or more non-clustered indexes, then row locators will _not_ be stored in leaf blocks of non-clustered index. Instead they will store clustering key (clustered index key). To be very specific:

  • Non-Clustered index leaf block entries in absence of a clustered index are of type: <non-clustered index key, row locator>.
  • Non-Clustered index leaf block entries in presence of a clustered index are of type: <non-clustered index key, clustered index key>.

This change in non-clustered index in the presence of clustered index has its advantages and disadvantages.

Disadvantages:

  • Lookup on non-clustered index becomes costly. Let’s say there is a clustered index on column A and non-clustered index on column B. Once the non-clustered index is probed on key value of column B, we would have gotten the value of clustered index key (column A) from leaf block. This index key value of column A is then used to probe the clustered index (which is again a B-Tree traversal) and fetch the corresponding row from leaf block of clustered index.
  • Every time the clustering key is updated, a corresponding update is required on non-clustered index as well because it stores the clustering key.
  • Wide/Fat clustering key will increase the size of non-clustered index and also the size of each entry in index leaf block. Thus fewer entries will be packed per block and this will increase the disk I/O — more number of index pages/blocks will have to be read.

Advantages:

  • Clustered Index requires the row data to be physically stored in sorted order (on the clustering key). An INSERT in a table (which is literally the clustered index here) has to maintain the invariant of “sorted order”. Therefore, it is very likely that rows will have to be moved around (row movement within a block or across blocks) to accommodate the new insert. If non-clustered index was storing the row locator, then every such INSERT will require an update to non-clustered index as well because the row locator would have changed due to row movement.
  • But because non-clustered index stores the clustered index key instead of row locator, no such update is required due to row movement.

In a followup post, I will compare these index types with respect to respective use cases.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

Blog at WordPress.com.

Up ↑

%d bloggers like this: