Speed Up Query Execution Using Vectorization
Vectorized Queries
Vectorized execution operates on batches of rows at a time instead of individual rows.
This speeds up queries by reducing the number of method calls, improving cache efficiency, and potentially enabling CPU SIMD instructions.
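As a rough illustration of the batching idea (this is not Druid's actual Java implementation), the sketch below sums a column two ways: one aggregator call per row, and one aggregator call per batch. The names `sum_row_at_a_time`, `sum_vectorized`, and `CallCounter` are hypothetical and exist only for this example.

```python
from typing import Callable, List


class CallCounter:
    """Counts how many times the (simulated) aggregator is invoked."""

    def __init__(self) -> None:
        self.count = 0

    def __call__(self) -> None:
        self.count += 1


def sum_row_at_a_time(values: List[int], on_call: Callable[[], None]) -> int:
    """Aggregate by invoking the aggregator once per row."""
    total = 0
    for value in values:
        on_call()  # one call per row
        total += value
    return total


def sum_vectorized(values: List[int], vector_size: int, on_call: Callable[[], None]) -> int:
    """Aggregate by invoking the aggregator once per batch of rows."""
    total = 0
    for start in range(0, len(values), vector_size):
        on_call()  # one call per batch
        total += sum(values[start:start + vector_size])
    return total


if __name__ == "__main__":
    rows = list(range(10_000))
    per_row, per_batch = CallCounter(), CallCounter()
    assert sum_row_at_a_time(rows, per_row) == sum_vectorized(rows, 512, per_batch)
    print(per_row.count, per_batch.count)
```

With 10,000 rows and a batch size of 512, both functions return the same total, but the batched version makes 20 calls instead of 10,000.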
Scope
Supported Vectorized Query Types
As of the Druid 0.20.2 release, the following query types can be vectorized:
- GroupBy
- Timeseries
In particular, vectorization currently has the following requirements:
- All query-level filters must either be able to run on bitmap indexes or offer vectorized row-matchers. These include "selector", "bound", "in", "like", "regex", "search", "and", "or", and "not".
- All filters in filtered aggregators must offer vectorized row-matchers.
- All aggregators must offer vectorized implementations. These include "count", "doubleSum", "floatSum", "longSum", "longMin", "longMax", "doubleMin", "doubleMax", "floatMin", "floatMax", "longAny", "doubleAny", "floatAny", "stringAny", "hyperUnique", "filtered", "approxHistogram", "approxHistogramFold", and "fixedBucketsHistogram".
- All virtual columns must offer vectorized implementations.
- For GroupBy: all dimension specs must be "default" (no extraction functions or filtered dimension specs), and there must be no multi-value dimensions.
- For Timeseries: no "descending" order.
- Only immutable segments (not real-time).
- Only table datasources (not joins, subqueries, lookups, or inline datasources).
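The query-level requirements above can be approximated with a small pre-flight check over a native query expressed as a Python dict. This is a conservative sketch, not a Druid API: the function names are hypothetical, and a filter outside the listed set may still qualify if it can run on bitmap indexes. Segment-level conditions (immutable segments, multi-value dimensions) and virtual-column support cannot be determined from the query alone, so they are not checked here.

```python
from typing import Any, Dict, List, Optional

# Names taken from the requirements listed above.
QUERY_TYPES = {"groupBy", "timeseries"}
FILTERS = {"selector", "bound", "in", "like", "regex", "search", "and", "or", "not"}
AGGREGATORS = {
    "count", "doubleSum", "floatSum", "longSum",
    "longMin", "longMax", "doubleMin", "doubleMax",
    "floatMin", "floatMax", "longAny", "doubleAny",
    "floatAny", "stringAny", "hyperUnique", "filtered",
    "approxHistogram", "approxHistogramFold", "fixedBucketsHistogram",
}


def filter_problems(flt: Optional[Dict[str, Any]], where: str) -> List[str]:
    """Recursively flag filter types that are not in the listed set."""
    if flt is None:
        return []
    problems = []
    ftype = flt.get("type")
    if ftype not in FILTERS:
        problems.append(f"{where}: filter '{ftype}' is not in the listed set")
    for child in flt.get("fields", []):  # children of "and" / "or"
        problems += filter_problems(child, where)
    if ftype == "not":  # single child of "not"
        problems += filter_problems(flt.get("field"), where)
    return problems


def aggregator_problems(agg: Dict[str, Any]) -> List[str]:
    """Flag aggregators, and filters inside "filtered" ones, outside the listed sets."""
    atype = agg.get("type")
    if atype not in AGGREGATORS:
        return [f"aggregator '{atype}' is not in the listed set"]
    if atype == "filtered":
        inner = agg.get("aggregator") or {}
        return (filter_problems(agg.get("filter"), "filtered aggregator")
                + aggregator_problems(inner))
    return []


def vectorization_problems(query: Dict[str, Any]) -> List[str]:
    """Return reasons the query may not vectorize; an empty list means none were found."""
    problems = []
    qtype = query.get("queryType")
    if qtype not in QUERY_TYPES:
        problems.append(f"query type '{qtype}' cannot be vectorized")
    ds = query.get("dataSource")
    if isinstance(ds, dict) and ds.get("type") != "table":
        problems.append(f"datasource type '{ds.get('type')}' is not a table")
    problems += filter_problems(query.get("filter"), "query filter")
    for agg in query.get("aggregations", []):
        problems += aggregator_problems(agg)
    if qtype == "groupBy":
        for dim in query.get("dimensions", []):
            if isinstance(dim, dict) and (
                dim.get("type", "default") != "default" or dim.get("extractionFn")
            ):
                problems.append(
                    f"dimension spec {dim.get('dimension')!r} is not a plain default spec"
                )
    if qtype == "timeseries" and query.get("descending"):
        problems.append("descending order is not supported")
    return problems
```

An empty result only means that none of the listed query-level restrictions were hit; Druid still decides per segment whether a query actually runs vectorized.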
Unsupported Query Types
As of the Druid 0.20.2 release, the following query types cannot be vectorized:
- TopN
- Scan
- Select
- Search
Vectorization parameters
property | default | description |
---|---|---|
vectorize | true | Enables or disables vectorized query execution. |
vectorSize | 512 | Sets the row batching size for a particular query. This will override druid.query.default.context.vectorSize if it’s set. |
vectorizeVirtualColumns | false | Enables or disables vectorized query processing of queries with virtual columns, layered on top of vectorize (vectorize must also be set to true for a query to utilize vectorization). Possible values are false (disabled), true (enabled if possible, disabled otherwise, on a per-segment basis), and force (enabled, and groupBy or timeseries queries with virtual columns that cannot be vectorized will fail). The “force” setting is meant to aid in testing, and is not generally useful in production. This will override druid.query.default.context.vectorizeVirtualColumns if it’s set. |
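The precedence described in the table (a value in the query context overrides the matching `druid.query.default.context.*` property, which in turn overrides the built-in default) can be modelled as below. This is only a sketch of the documented behaviour, not Druid's internal resolution logic. The helper names are hypothetical, and applying the same precedence to `vectorize` is an assumption, since the table states it explicitly only for the other two parameters.

```python
from typing import Any, Dict

# Defaults as listed in the table above.
TABLE_DEFAULTS: Dict[str, Any] = {
    "vectorize": True,
    "vectorSize": 512,
    "vectorizeVirtualColumns": False,
}
PROPERTY_PREFIX = "druid.query.default.context."


def parse_property(raw: str) -> Any:
    """Turn a properties-file string into a bool or int; other text (e.g. "force") is kept."""
    lowered = raw.strip().lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    if lowered.isdigit():
        return int(lowered)
    return raw.strip()


def effective_settings(properties: Dict[str, str], query_context: Dict[str, Any]) -> Dict[str, Any]:
    """Resolve each parameter: query context first, then cluster default, then table default."""
    resolved = dict(TABLE_DEFAULTS)
    for name in TABLE_DEFAULTS:
        prop_key = PROPERTY_PREFIX + name
        if prop_key in properties:
            resolved[name] = parse_property(properties[prop_key])
        if name in query_context:  # the query context wins when present
            resolved[name] = query_context[name]
    return resolved


if __name__ == "__main__":
    cluster = {"druid.query.default.context.vectorSize": "1024"}
    print(effective_settings(cluster, {}))
    print(effective_settings(cluster, {"vectorSize": 256}))
```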
How to Enable Vectorization
Vectorization can be enabled in two ways:

1. By setting the params in common config (this will enable query vectorization for all valid queries):

    druid.query.default.context.vectorize=true
    druid.query.default.context.vectorSize=1024
    druid.query.default.context.vectorizeVirtualColumns=true

2. By passing the vector params in query context, e.g.:

    ...
    "context": {
      "vectorize": true,
      "vectorSize": 512,
      "vectorizeVirtualColumns": false
    },
    ...
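For the per-query approach, a client can attach the context programmatically before submitting the query. The sketch below builds a native timeseries query as a Python dict and adds the three vectorization keys. The helper `with_vectorization`, the datasource name `wikipedia`, and the interval are hypothetical values chosen for illustration.

```python
import json
from typing import Any, Dict


def with_vectorization(
    query: Dict[str, Any],
    vectorize: bool = True,
    vector_size: int = 512,
    vectorize_virtual_columns: Any = False,
) -> Dict[str, Any]:
    """Return a copy of `query` with the vectorization keys set in its context."""
    context = dict(query.get("context", {}))  # keep any existing context keys
    context.update(
        {
            "vectorize": vectorize,
            "vectorSize": vector_size,
            "vectorizeVirtualColumns": vectorize_virtual_columns,
        }
    )
    updated = dict(query)  # shallow copy so the caller's dict is not modified
    updated["context"] = context
    return updated


# Hypothetical example query; the datasource and interval are placeholders.
query = {
    "queryType": "timeseries",
    "dataSource": "wikipedia",
    "granularity": "hour",
    "intervals": ["2021-01-01/2021-01-02"],
    "aggregations": [{"type": "count", "name": "rows"}],
}

print(json.dumps(with_vectorization(query), indent=2))
```

The printed JSON is the request body for a native query; values set here take precedence over the cluster-wide defaults for that query only.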