Traditional query processing algorithms are based on the “iterator” or “tuple-at-a-time” model, in which a single tuple at a time flows up through the query plan tree from one operator to another. Each operator typically exposes a next() method that produces one tuple or record, which the caller operator (its parent in the execution plan tree) then consumes as an input record.
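To make this concrete, here is a minimal sketch of the iterator model. The operator names (Scan, Filter) and the next() protocol are illustrative, not tied to any particular engine:

```python
# A minimal sketch of the tuple-at-a-time (iterator) model.
# Each operator pulls one tuple per next() call from its child.

class Scan:
    """Leaf operator: yields one tuple per next() call."""
    def __init__(self, rows):
        self._it = iter(rows)

    def next(self):
        return next(self._it, None)  # None signals end-of-stream

class Filter:
    """Pulls tuples from its child; emits those matching a predicate."""
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate

    def next(self):
        while (row := self.child.next()) is not None:
            if self.predicate(row):
                return row
        return None

# Each output row costs at least one function call per operator in the plan.
plan = Filter(Scan([(1, "a"), (2, "b"), (3, "c")]), lambda r: r[0] >= 2)
rows = []
while (row := plan.next()) is not None:
    rows.append(row)
# rows == [(2, "b"), (3, "c")]
```

Note how every tuple that reaches the top of the plan has paid a chain of virtual next() calls, which is exactly the overhead discussed below.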
The tuple-at-a-time processing model has several disadvantages when it comes to building high-performance SQL execution engines for analytical queries.
- Excessive function calls – the tuple-at-a-time model incurs heavy overhead from the sheer number of function calls made throughout query execution. As a result, it is difficult to write efficient query interpreters around this model. For some TPC-H queries, this overhead can degrade performance by an order of magnitude.
- Defeats the purpose of using columnar format – columnar format (in-memory or on-disk) has become the de facto format for building efficient analytical query engines to accelerate data warehouse or business intelligence style queries. Query processing algorithms written around the tuple-at-a-time model are unable to take advantage of columnar format since they work on tuples as opposed to columns. Thus the tuple-at-a-time model misses out on several fundamental techniques that can be employed to run analytical queries efficiently on columnar data.
The vectorized query processing model is a significant departure from the tuple-at-a-time model, with two key differences:
- Column-oriented processing – write query processing algorithms that operate on columns for as long as possible in the execution plan. In other words, refrain from materializing tuples until late in the plan, when it is absolutely necessary (e.g., when projecting the result set back to the user).
- Push column vectors through the query plan tree – instead of passing a single tuple from one operator to another, pass a set of columns each containing a fixed number of records. This is typically referred to as a “block of vectors” or “batch of vectors”, and it is the unit of work fed into an operator and the unit of result produced by an operator.
- Please don’t misunderstand this as “multiple tuples”. Forming tuples requires stitching together values from different columns to form rows, which defeats column-oriented processing. A batch consists of multiple records, but they are still passed as a set of columns, each column in its own data structure, generally known as a vector in columnar databases.
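The contrast with the iterator model can be sketched as follows. This is an illustrative toy, with plain Python lists standing in for the contiguous buffers a real engine (e.g., one built on Apache Arrow) would use; the names VectorScan and vector_filter are made up for this example:

```python
# A minimal sketch of vectorized processing: operators exchange a
# "batch" of column vectors rather than one tuple at a time.

BATCH_SIZE = 4  # real engines typically use on the order of 1K-64K rows

class VectorScan:
    """Yields batches: a dict mapping column name -> vector of values."""
    def __init__(self, columns, batch_size=BATCH_SIZE):
        self.columns, self.batch_size = columns, batch_size

    def batches(self):
        n = len(next(iter(self.columns.values())))
        for start in range(0, n, self.batch_size):
            yield {name: vec[start:start + self.batch_size]
                   for name, vec in self.columns.items()}

def vector_filter(batch, column, predicate):
    """Evaluate the predicate over one column in a tight loop, then
    apply the resulting selection to every column in the batch."""
    sel = [i for i, v in enumerate(batch[column]) if predicate(v)]
    return {name: [vec[i] for i in sel] for name, vec in batch.items()}

data = {"id": [1, 2, 3, 4, 5, 6],
        "city": ["SF", "NY", "SF", "LA", "SF", "NY"]}
out = [vector_filter(b, "city", lambda c: c == "SF")
       for b in VectorScan(data).batches()]
matched_ids = [i for b in out for i in b["id"]]
# matched_ids == [1, 3, 5]
```

The predicate touches only the "city" vector; no row is ever stitched together, and the per-batch function-call cost is amortized over all records in the batch.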
Some advantages associated with vectorized processing:
- Efficient interpretation – by working on a batch of vectors (as opposed to a tuple at a time) across the operators in the execution plan, the per-tuple interpretation overhead is amortized over the entire batch, improving overall query execution time.
- Better cache locality and efficient utilization of CPU cache – we can quickly loop through the tightly packed values of a column and do the necessary processing: predicate evaluation, arithmetic computations, etc. Cache lines are filled with related values from the same column, as opposed to heterogeneous values from multiple columns of a tuple, some of which the query may not even touch.
- Better chance of native optimizations by the compiler – tight, loop-based vectorized algorithms are good candidates for automatic compiler optimizations such as loop unrolling and auto-vectorization.
- Leverage hardware acceleration – well-aligned column data in densely packed arrays is amenable to acceleration using SIMD instructions. Common operations like FILTER, SUM, MIN, and MAX can be accelerated by an order of magnitude by exploiting the data-level parallelism of SIMD instructions.
- Directly operate on compressed columnar data – columnar format allows us to encode column values with lightweight compression schemes (dictionary encoding, RLE, etc.) that trade some compression ratio for better query performance. These schemes allow query processing code to work directly (in some cases) on compressed column values without having to decompress everything upfront. For example:
- A FILTER on a variable-width column can be executed efficiently if the column is dictionary encoded: we rewrite the FILTER in terms of dictionary codes, which are fixed-width and amenable to SIMD processing.
- RLE (run-length encoding) can be leveraged to do quick arithmetic on a set of column values without having to load and process each column value individually.
By executing query processing logic directly on compressed columnar data, we save significant CPU cycles, which can further improve query performance.
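Both examples above can be sketched in a few lines. This is a simplified illustration (the encodings and variable names are hypothetical, not any specific engine's implementation), but it shows why no decompression is needed:

```python
# Sketch: operating directly on compressed column data.

# 1) FILTER on a dictionary-encoded string column. The predicate is
#    rewritten once against the dictionary; the scan then compares only
#    fixed-width integer codes, never decoding a string per row.
dictionary = ["berlin", "london", "paris"]   # distinct column values
codes = [2, 0, 2, 1, 2, 0]                   # one small int per row

target_code = dictionary.index("paris")      # rewrite predicate once
selection = [i for i, c in enumerate(codes) if c == target_code]
# selection == [0, 2, 4]  (row positions where city == "paris")

# 2) SUM over a run-length-encoded integer column. Each run contributes
#    value * run_length, so we touch one entry per run, not per row.
rle_runs = [(10, 4), (7, 2), (10, 3)]        # (value, run_length) pairs
total = sum(value * length for value, length in rle_runs)
# total == 10*4 + 7*2 + 10*3 == 84
```

In the dictionary case the inner loop is a fixed-width integer comparison, which is exactly the shape of loop that SIMD FILTER kernels accelerate; in the RLE case the amount of work scales with the number of runs rather than the number of rows.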
I recently gave a talk at Strata Data Conference on Vectorized Processing using Apache Arrow and how Dremio has built a high performance execution engine on top of Apache Arrow. Please see the following links to view the presentation and learn more about Dremio and Apache Arrow.
Please join us at the first Apache Arrow meetup in the Bay Area on May 15th in San Francisco.