Speed Up Query Execution Using Vectorization

Posted on 2021-08-15

Normal vs Vectorized Instructions

Vectorized Queries

Vectorized execution operates on batches of rows at a time instead of individual rows.

This speeds up queries by reducing the number of method calls, improving cache efficiency, and potentially enabling CPU SIMD instructions.
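To make the "fewer method calls" point concrete, here is a minimal sketch in Python. It is not Druid code: the function names are hypothetical, and the built-in `sum` stands in for a vectorized aggregator that handles a whole batch in one call. Both functions return the total along with the number of aggregation calls they made.

```python
from typing import Iterator, List, Tuple


def sum_row_at_a_time(values: List[int]) -> Tuple[int, int]:
    """Aggregate one row per call; returns (total, number_of_calls)."""
    total = 0
    calls = 0
    for value in values:
        total += value  # stands in for one aggregate() call per row
        calls += 1
    return total, calls


def batches(values: List[int], vector_size: int) -> Iterator[List[int]]:
    """Split the rows into consecutive batches of at most vector_size rows."""
    for start in range(0, len(values), vector_size):
        yield values[start:start + vector_size]


def sum_vectorized(values: List[int], vector_size: int = 512) -> Tuple[int, int]:
    """Aggregate one batch per call; returns (total, number_of_calls)."""
    total = 0
    calls = 0
    for batch in batches(values, vector_size):
        total += sum(batch)  # stands in for one aggregate() call per batch
        calls += 1
    return total, calls
```

Both functions produce the same total, but the batched version makes roughly `len(values) / vector_size` calls instead of one per row. In a real engine the per-batch loop is also what gives the CPU a chance to use SIMD instructions.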

Scope

Supported Vectorized Query Types

As of the Druid 0.20.2 release, the following query types can be vectorized:

GroupBy
Timeseries

In particular, vectorization currently has the following requirements:

All query-level filters must either
- be able to run on bitmap indexes, or
- offer vectorized row-matchers.
  These include “selector”, “bound”, “in”, “like”, “regex”, “search”, “and”, “or”, and “not”
All filters in filtered aggregators must offer vectorized row-matchers
All aggregators must offer vectorized implementations. These include:

"count", "doubleSum", "floatSum", "longSum",
"longMin", "longMax", "doubleMin", "doubleMax",
"floatMin", "floatMax", "longAny", "doubleAny",
"floatAny", "stringAny", "hyperUnique", "filtered",
"approxHistogram", "approxHistogramFold", and "fixedBucketsHistogram"

All virtual columns must offer vectorized implementations
For GroupBy:
- All dimension specs must be “default” (no extraction functions or filtered dimension specs)
- No multi-value dimensions
For Timeseries:
- No “descending” order
Only immutable segments (not real-time)
Only table datasources (not joins, subqueries, lookups, or inline datasources)
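The requirements above can be turned into a rough pre-flight check on a native query's JSON. The sketch below is an illustration only, not Druid's actual eligibility logic: the function names are hypothetical, and it is deliberately conservative (a filter outside the listed set might still run on bitmap indexes). It also cannot see segment-level facts such as multi-value dimensions or real-time segments, and it skips virtual columns. It assumes a dimension spec given as a plain string, or as an object without a "type", counts as "default".

```python
from typing import Any, Dict, List, Optional

QUERY_TYPES = {"groupBy", "timeseries"}
FILTERS = {"selector", "bound", "in", "like", "regex", "search", "and", "or", "not"}
AGGREGATORS = {
    "count", "doubleSum", "floatSum", "longSum",
    "longMin", "longMax", "doubleMin", "doubleMax",
    "floatMin", "floatMax", "longAny", "doubleAny",
    "floatAny", "stringAny", "hyperUnique", "filtered",
    "approxHistogram", "approxHistogramFold", "fixedBucketsHistogram",
}


def filter_problems(flt: Optional[Dict[str, Any]]) -> List[str]:
    """Recursively collect filter types that are outside the listed set."""
    if flt is None:
        return []
    ftype = flt.get("type")
    problems = []
    if ftype not in FILTERS:
        problems.append(f"filter type {ftype!r} is not in the listed set")
    if ftype in ("and", "or"):
        children = flt.get("fields", [])
    elif ftype == "not" and "field" in flt:
        children = [flt["field"]]
    else:
        children = []
    for child in children:
        problems.extend(filter_problems(child))
    return problems


def vectorization_problems(query: Dict[str, Any]) -> List[str]:
    """Return reasons the query may not vectorize; an empty list means none found."""
    problems = []
    qtype = query.get("queryType")
    if qtype not in QUERY_TYPES:
        problems.append(f"query type {qtype!r} cannot be vectorized")
    source = query.get("dataSource")
    if isinstance(source, dict) and source.get("type") != "table":
        problems.append("only table datasources are supported")
    problems.extend(filter_problems(query.get("filter")))
    for agg in query.get("aggregations", []):
        atype = agg.get("type")
        if atype not in AGGREGATORS:
            problems.append(f"aggregator type {atype!r} is not in the listed set")
        if atype == "filtered":
            # Filtered aggregators wrap a filter and an inner aggregator.
            problems.extend(filter_problems(agg.get("filter")))
            inner = agg.get("aggregator", {}).get("type")
            if inner not in AGGREGATORS:
                problems.append(f"inner aggregator type {inner!r} is not in the listed set")
    if qtype == "groupBy":
        for dim in query.get("dimensions", []):
            if isinstance(dim, dict) and dim.get("type", "default") != "default":
                problems.append(f"dimension spec type {dim.get('type')!r} is not 'default'")
    if qtype == "timeseries" and query.get("descending"):
        problems.append("timeseries queries in descending order are not supported")
    return problems
```

An empty result only means that none of the listed restrictions were hit. Whether Druid actually vectorizes the query is still decided per segment at execution time.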

Unsupported Query Types

As of the Druid 0.20.2 release, the following query types cannot be vectorized:

TopN
Scan
Select
Search

Vectorization parameters

property	default	description
vectorize	true	Enables or disables vectorized query execution.
vectorSize	512	Sets the row batching size for a particular query. This will override druid.query.default.context.vectorSize if it’s set.
vectorizeVirtualColumns	false	Enables or disables vectorized query processing of queries with virtual columns, layered on top of vectorize (vectorize must also be set to true for a query to utilize vectorization). Possible values are false (disabled), true (enabled if possible, disabled otherwise, on a per-segment basis), and force (enabled, and groupBy or timeseries queries with virtual columns that cannot be vectorized will fail). The “force” setting is meant to aid in testing, and is not generally useful in production. This will override druid.query.default.context.vectorizeVirtualColumns if it’s set.
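The override behaviour described in the table can be pictured as a three-layer merge. This is a sketch of the precedence only, not how Druid resolves context internally: `effective_context` is a hypothetical helper, and the built-in defaults are the ones listed in the table above.

```python
from typing import Any, Dict

# Defaults from the table above, used when nothing else is configured.
BUILT_IN_DEFAULTS: Dict[str, Any] = {
    "vectorize": True,
    "vectorSize": 512,
    "vectorizeVirtualColumns": False,
}


def effective_context(cluster_defaults: Dict[str, Any],
                      query_context: Dict[str, Any]) -> Dict[str, Any]:
    """Merge settings so that later layers win.

    cluster_defaults: values set via druid.query.default.context.* properties.
    query_context:    values passed in the "context" of a single query.
    """
    resolved = dict(BUILT_IN_DEFAULTS)
    resolved.update(cluster_defaults)  # cluster-wide config overrides built-ins
    resolved.update(query_context)     # per-query context overrides both
    return resolved
```

For example, with `druid.query.default.context.vectorSize=1024` set cluster-wide, a query that passes `"vectorSize": 256` in its context runs with 256, while a query that passes nothing runs with 1024.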

How to Enable Vectorization

Vectorization can be enabled in two ways:

By setting the params in common config (this will enable query vectorization for all valid queries):

druid.query.default.context.vectorize=true
druid.query.default.context.vectorSize=1024
druid.query.default.context.vectorizeVirtualColumns=true

By passing the vectorization parameters in the query context, for example:

...
"context": {
  "vectorize": true,
  "vectorSize": 512,
  "vectorizeVirtualColumns": false
},
...
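For a complete picture, here is a minimal sketch that builds a native timeseries query carrying these context parameters and posts it to a Broker. Several things are assumptions for illustration: the datasource name and interval in the usage note, the `count` aggregation, the helper names, and the Broker address `http://localhost:8082` (adjust it to your cluster). The `/druid/v2` path is Druid's native query endpoint.

```python
import json
import urllib.request
from typing import Any, Dict


def build_timeseries_query(datasource: str, interval: str,
                           vectorize: bool = True,
                           vector_size: int = 512) -> Dict[str, Any]:
    """Build a native timeseries query with vectorization settings in its context."""
    return {
        "queryType": "timeseries",
        "dataSource": datasource,
        "intervals": [interval],
        "granularity": "all",
        "aggregations": [{"type": "count", "name": "rows"}],
        "context": {
            "vectorize": vectorize,
            "vectorSize": vector_size,
            "vectorizeVirtualColumns": False,
        },
    }


def post_query(query: Dict[str, Any],
               broker_url: str = "http://localhost:8082/druid/v2") -> Any:
    """POST the query as JSON to the Broker and return the decoded response."""
    request = urllib.request.Request(
        broker_url,
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call such as `post_query(build_timeseries_query("wikipedia", "2021-01-01/2021-02-01"))` would run the query with vectorization enabled for that request only, regardless of the cluster-wide defaults.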

References

Vectorization parameters
[PROPOSAL] Query vectorization #7093
Release Notes
Vectorization support for expression virtual columns
More vectorization support for aggregators
Vectorization benchmarks