Speed Up Query Execution Using Vectorization

Normal vs Vectorized Instructions

Vectorized Queries

Vectorized execution operates on batches of rows at a time instead of individual rows.

This can speed up queries by reducing the number of method calls, improving cache efficiency, and potentially enabling CPU SIMD instructions.
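The difference in call counts can be illustrated with a small sketch. This is not Druid's engine, only a toy model: the function names and the 2,000-row input are made up for illustration, and Python will not show the cache or SIMD benefits, only the reduction in per-row calls.

```python
from typing import List, Tuple


def sum_row_at_a_time(rows: List[int]) -> Tuple[int, int]:
    """Aggregate one row per step; returns (total, number of aggregation steps)."""
    total = 0
    steps = 0
    for value in rows:
        total += value  # one aggregation step per row
        steps += 1
    return total, steps


def sum_vectorized(rows: List[int], vector_size: int = 512) -> Tuple[int, int]:
    """Aggregate one batch per step; returns (total, number of aggregation steps)."""
    total = 0
    steps = 0
    for start in range(0, len(rows), vector_size):
        batch = rows[start:start + vector_size]  # up to vector_size rows
        total += sum(batch)  # one aggregation step per batch
        steps += 1
    return total, steps


rows = list(range(2000))  # hypothetical column of 2,000 long values
row_total, row_steps = sum_row_at_a_time(rows)
vec_total, vec_steps = sum_vectorized(rows, vector_size=512)
print(row_steps, vec_steps)
```

Both functions produce the same total, but the batched version takes one step per 512 rows instead of one per row.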

Scope

Supported Vectorized Query Types

As of the Druid 0.20.2 release, the following query types can be vectorized:

  1. GroupBy
  2. Timeseries

In particular, vectorization currently has the following requirements:

  • All query-level filters must either be able to run on bitmap indexes or offer vectorized row-matchers. These include “selector”, “bound”, “in”, “like”, “regex”, “search”, “and”, “or”, and “not”
  • All filters in filtered aggregators must offer vectorized row-matchers
  • All aggregators must offer vectorized implementations. These include:

    "count", "doubleSum", "floatSum", "longSum",
    "longMin", "longMax", "doubleMin", "doubleMax",
    "floatMin", "floatMax", "longAny", "doubleAny",
    "floatAny", "stringAny", "hyperUnique", "filtered",
    "approxHistogram", "approxHistogramFold", and "fixedBucketsHistogram"

  • All virtual columns must offer vectorized implementations
  • For GroupBy:
    • All dimension specs must be “default” (no extraction functions or filtered dimension specs)
    • No multi-value dimensions
  • For Timeseries:
    • No “descending” order
  • Only immutable segments (not real-time)
  • Only table datasources
    • not joins, subqueries, lookups, or inline datasources
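As a rough aid, the checklist above can be turned into a client-side pre-check on a native query before it is sent. The sketch below is an assumption-laden illustration, not Druid's own eligibility logic: the function names are hypothetical, the type sets are copied from the lists above, and it cannot see segment-level facts (multi-value dimensions, real-time segments, virtual column support, or filters that qualify only through bitmap indexes).

```python
from typing import Any, Dict, List, Optional

QUERY_TYPES = {"groupBy", "timeseries"}
FILTER_TYPES = {"selector", "bound", "in", "like", "regex", "search", "and", "or", "not"}
AGGREGATOR_TYPES = {
    "count", "doubleSum", "floatSum", "longSum", "longMin", "longMax",
    "doubleMin", "doubleMax", "floatMin", "floatMax", "longAny", "doubleAny",
    "floatAny", "stringAny", "hyperUnique", "filtered",
    "approxHistogram", "approxHistogramFold", "fixedBucketsHistogram",
}


def filter_blockers(flt: Optional[Dict[str, Any]]) -> List[str]:
    """Recursively collect filter types that are not in the listed set."""
    if flt is None:
        return []
    blockers = []
    ftype = flt.get("type")
    if ftype not in FILTER_TYPES:
        blockers.append(f"filter type '{ftype}' is not listed as vectorizable")
    if ftype in ("and", "or"):
        for child in flt.get("fields", []):
            blockers.extend(filter_blockers(child))
    elif ftype == "not":
        blockers.extend(filter_blockers(flt.get("field")))
    return blockers


def aggregator_blockers(agg: Dict[str, Any]) -> List[str]:
    """Check one aggregator, descending into 'filtered' aggregators."""
    atype = agg.get("type")
    if atype not in AGGREGATOR_TYPES:
        return [f"aggregator type '{atype}' is not listed as vectorizable"]
    if atype == "filtered":
        return filter_blockers(agg.get("filter")) + aggregator_blockers(agg.get("aggregator", {}))
    return []


def vectorization_blockers(query: Dict[str, Any]) -> List[str]:
    """Return reasons (possibly empty) why the query may not be vectorized."""
    blockers = []
    qtype = query.get("queryType")
    if qtype not in QUERY_TYPES:
        blockers.append(f"query type '{qtype}' cannot be vectorized")
    source = query.get("dataSource")
    if isinstance(source, dict) and source.get("type") != "table":
        blockers.append("only table datasources are supported")
    blockers.extend(filter_blockers(query.get("filter")))
    for agg in query.get("aggregations", []):
        blockers.extend(aggregator_blockers(agg))
    if qtype == "timeseries" and query.get("descending") is True:
        blockers.append("timeseries queries must not use descending order")
    if qtype == "groupBy":
        for dim in query.get("dimensions", []):
            if isinstance(dim, dict) and (dim.get("type", "default") != "default" or "extractionFn" in dim):
                blockers.append("groupBy dimension specs must be plain 'default' specs")
    return blockers


# Hypothetical query against a made-up "wikipedia" table.
ok_query = {
    "queryType": "timeseries",
    "dataSource": "wikipedia",
    "granularity": "all",
    "intervals": ["2020-01-01/2020-01-02"],
    "filter": {"type": "and", "fields": [
        {"type": "selector", "dimension": "channel", "value": "#en"},
        {"type": "not", "field": {"type": "in", "dimension": "page", "values": ["Main"]}},
    ]},
    "aggregations": [{"type": "longSum", "name": "added", "fieldName": "added"}],
}
bad_query = dict(
    ok_query,
    descending=True,
    aggregations=[{"type": "cardinality", "name": "users", "fields": ["user"]}],
)
print(vectorization_blockers(ok_query))
print(vectorization_blockers(bad_query))
```

An empty list only means that none of the statically visible rules were violated. Druid makes the actual decision per segment at query time.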

Unsupported Query Types

As of the Druid 0.20.2 release, the following query types cannot be vectorized:

  1. TopN
  2. Scan
  3. Select
  4. Search

Vectorization parameters

| Property | Default | Description |
| --- | --- | --- |
| vectorize | true | Enables or disables vectorized query execution. |
| vectorSize | 512 | Sets the row batching size for a particular query. This will override druid.query.default.context.vectorSize if it’s set. |
| vectorizeVirtualColumns | false | Enables or disables vectorized query processing of queries with virtual columns, layered on top of vectorize (vectorize must also be set to true for a query to utilize vectorization). Possible values are false (disabled), true (enabled if possible, disabled otherwise, on a per-segment basis), and force (enabled, and groupBy or timeseries queries with virtual columns that cannot be vectorized will fail). The “force” setting is meant to aid in testing, and is not generally useful in production. This will override druid.query.default.context.vectorizeVirtualColumns if it’s set. |

How to Enable Vectorization

Vectorization can be enabled in two ways:

  1. By setting the parameters in the common config (this enables query vectorization for all eligible queries):

    druid.query.default.context.vectorize=true
    druid.query.default.context.vectorSize=1024
    druid.query.default.context.vectorizeVirtualColumns=true
  2. By passing the vectorization parameters in the query context, e.g.:

    ...
    "context": {
      "vectorize": true,
      "vectorSize": 512,
      "vectorizeVirtualColumns": false
    },
    ...
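When queries are built programmatically, the context keys can be merged in just before the query is serialized. The following is a minimal sketch under stated assumptions: `with_vectorization` is a hypothetical helper, and the "wikipedia" datasource and interval are placeholders. Sending the resulting JSON to the Broker is left out.

```python
import json
from typing import Any, Dict, Union


def with_vectorization(
    query: Dict[str, Any],
    vectorize: Union[bool, str] = True,
    vector_size: int = 512,
    vectorize_virtual_columns: Union[bool, str] = False,
) -> Dict[str, Any]:
    """Return a copy of the query with the vectorization keys set in its context.

    Existing context entries are preserved; the input query is not modified.
    """
    merged = dict(query)
    context = dict(query.get("context", {}))
    context.update({
        "vectorize": vectorize,
        "vectorSize": vector_size,
        "vectorizeVirtualColumns": vectorize_virtual_columns,
    })
    merged["context"] = context
    return merged


# Placeholder timeseries query; datasource and interval are made up.
query = {
    "queryType": "timeseries",
    "dataSource": "wikipedia",
    "granularity": "all",
    "intervals": ["2020-01-01/2020-01-02"],
    "aggregations": [{"type": "count", "name": "rows"}],
}

body = json.dumps(with_vectorization(query, vector_size=1024), indent=2)
print(body)
```

Values set this way apply to that query only and take precedence over the `druid.query.default.context.*` defaults.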
