Extensible Data Skipping

Read the IEEE Big Data 2020 paper - Extensible Data Skipping

Xskipper places an abstraction layer between index creation and query evaluation on one side and the metadata store on the other. Only the metadata store implementation is specific to each store.

This abstraction operates in two main areas:

Indexing Flow - abstract metadata is generated during index creation and refresh by analyzing a DataFrame.

[Figure: Indexing Flow]
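The indexing flow can be illustrated with a minimal sketch. It is not Xskipper's API: `MinMaxMetaData` and `index_objects` are hypothetical names, and plain lists of dicts stand in for the DataFrame read from each object. The sketch assumes a min/max index, which records the value range of one column per object.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class MinMaxMetaData:
    """Hypothetical abstract metadata: the value range of one column in one object."""
    column: str
    min_value: Any
    max_value: Any


def index_objects(objects: Dict[str, List[dict]], column: str) -> Dict[str, MinMaxMetaData]:
    """Analyze each object's rows and produce store-independent metadata."""
    metadata = {}
    for name, rows in objects.items():
        values = [row[column] for row in rows if row.get(column) is not None]
        if values:  # an object without metadata can never be skipped
            metadata[name] = MinMaxMetaData(column, min(values), max(values))
    return metadata


# Two objects, each represented by its rows (a stand-in for a DataFrame).
objects = {
    "part-0.parquet": [{"temp": 10}, {"temp": 25}],
    "part-1.parquet": [{"temp": 40}, {"temp": 55}],
}
index = index_objects(objects, "temp")
```

Nothing in the result refers to a particular metadata store; it only describes what was learned about each object.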

Query Evaluation Flow - abstract metadata clauses are generated by filters that analyze the Catalyst (Spark optimizer) expression tree for pushdown predicates, and identify subtrees that can be mapped to a metadata Clause.

[Figure: Query Evaluation Flow]
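A filter of this kind can be sketched as a recursive walk over an expression tree. The node classes below (`GreaterThan`, `And`, `Other`) are simplified stand-ins for Catalyst expressions, and the clause classes and `min_max_filter` are hypothetical names rather than Xskipper's own. The sketch assumes a min/max index, so `column > value` maps to a clause on the stored maximum.

```python
from dataclasses import dataclass
from typing import Any, Optional


# Simplified stand-ins for Catalyst expression nodes.
@dataclass(frozen=True)
class GreaterThan:
    column: str
    value: Any


@dataclass(frozen=True)
class And:
    left: Any
    right: Any


@dataclass(frozen=True)
class Other:
    """Any expression this filter does not understand."""
    description: str


# Hypothetical abstract clauses, independent of any metadata store.
@dataclass(frozen=True)
class MaxGreaterThanClause:
    column: str
    value: Any


@dataclass(frozen=True)
class AndClause:
    left: Any
    right: Any


def min_max_filter(expr: Any) -> Optional[Any]:
    """Map a subtree to a clause, or return None if it cannot be mapped."""
    if isinstance(expr, GreaterThan):
        # An object can only match `column > value` if its maximum exceeds value.
        return MaxGreaterThanClause(expr.column, expr.value)
    if isinstance(expr, And):
        left = min_max_filter(expr.left)
        right = min_max_filter(expr.right)
        if left is not None and right is not None:
            return AndClause(left, right)
        # Dropping one side of a conjunction only weakens the clause,
        # so no relevant object is skipped.
        return left if left is not None else right
    return None


expr = And(GreaterThan("temp", 30), Other("udf(city) = 'x'"))
clause = min_max_filter(expr)  # MaxGreaterThanClause(column='temp', value=30)
```

The unmapped `Other` subtree is simply ignored: the resulting clause is weaker than the original predicate, which keeps skipping safe at the cost of reading some objects unnecessarily.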

These abstract structures (MetaData and Clauses) are then translated to a representation that matches a specific metadata store. Metadata Clauses can then be applied to MetaDataTypes in an efficient manner.
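As a rough illustration of this translation step, the sketch below assumes a hypothetical metadata store that keeps one flat row per object, with `<column>_min` and `<column>_max` fields. All names (`MinMaxMetaData`, `MaxGreaterThanClause`, `translate_metadata`, `translate_clause`) are illustrative and not part of Xskipper's API.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class MinMaxMetaData:
    """Hypothetical abstract metadata for one column of one object."""
    column: str
    min_value: Any
    max_value: Any


@dataclass(frozen=True)
class MaxGreaterThanClause:
    """Hypothetical abstract clause: the object's maximum must exceed value."""
    column: str
    value: Any


def translate_metadata(md: MinMaxMetaData) -> Dict[str, Any]:
    """Translate abstract metadata into this store's row layout."""
    return {f"{md.column}_min": md.min_value, f"{md.column}_max": md.max_value}


def translate_clause(clause: MaxGreaterThanClause) -> Callable[[Dict[str, Any]], bool]:
    """Translate an abstract clause into a predicate over this store's rows."""
    key = f"{clause.column}_max"
    return lambda row: row[key] > clause.value


# The assumed store: object name -> translated metadata row.
store = {
    "part-0.parquet": translate_metadata(MinMaxMetaData("temp", 10, 25)),
    "part-1.parquet": translate_metadata(MinMaxMetaData("temp", 40, 55)),
}

predicate = translate_clause(MaxGreaterThanClause("temp", 30))
required = sorted(name for name, row in store.items() if predicate(row))
# required == ["part-1.parquet"]; part-0.parquet can be skipped
```

A different store would supply its own pair of translation functions, while the abstract metadata and clauses stay unchanged.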