Skip to content

Indexing

For every column in the object, Xskipper can collect summary metadata. This metadata is used during query evaluation to skip over objects which have no relevant data.

Default Indexes

The following indexes are supported out of the box:

Index type Description Applicable to predicates in WHERE clauses Column types
MinMax Stores minimum and maximum values for a column <,<=,=,>=,> All types except for complex types. See Supported Spark SQL data types.
ValueList Stores the list of unique values for the column =,IN,LIKE All types except for complex types. See Supported Spark SQL data types.
BloomFilter Uses a bloom filter to test for set membership =,IN Byte, String, Long, Integer, Short

In order to add an index using the IndexBuilder to specify the required indexes, for example:

# create Xskipper instance for the sample dataset
xskipper = Xskipper(spark, dataset_location)

# remove index if exists
if xskipper.isIndexed():
    xskipper.dropIndex()

xskipper.indexBuilder() \
        .addMinMaxIndex("temp") \
        .addValueListIndex("city") \
        .addBloomFilterIndex("vid") \
        .build(reader) \
        .show(10, False)
// create Xskipper instance for the sample dataset
val xskipper = new Xskipper(spark, dataset_location)

// remove existing index if needed
if (xskipper.isIndexed()) {
  xskipper.dropIndex()
}

xskipper.indexBuilder()
        .addMinMaxIndex("temp")
        .addValueListIndex("city")
        .addBloomFilterIndex("vid")
        .build(reader)
        .show(false)

By default, the indexes are stored as parquet files stored in storage Each parquet file with row per each object in the dataset.

For more information about the parquet metadatastore see here.

Plugins

Xskipper supports adding new indexes using a pluggable system.
For instructions on how to create a new plugin see here.

Supported plugins

Currently the following plugins are supported (in addition to the built-in indexes: MinMax, ValueList and BloomFilter):

  • Regex Plugin - An index which enables to save a value list for a given regex.

Setting up a plugin

In order to use a plugin you first need to register the needed classes.
For example, for the Regex Plugin:

from xskipper import Registration

Registration.addMetadataFilterFactory(spark, 'io.xskipper.plugins.regex.filter.RegexValueListMetaDataFilterFactory')
# Add IndexFactory
Registration.addIndexFactory(spark, 'io.xskipper.plugins.regex.index.RegexIndexFactory')
# Add MetaDataTranslator
Registration.addMetaDataTranslator(spark, 'io.xskipper.plugins.regex.parquet.RegexValueListMetaDataTranslator')
# Add ClauseTranslator
Registration.addClauseTranslator(spark, 'io.xskipper.plugins.regex.parquet.RegexValueListClauseTranslator')
import io.xskipper._
import io.xskipper.plugins.regex.filter.RegexValueListMetaDataFilterFactory
import io.xskipper.plugins.regex.index.RegexIndexFactory
import io.xskipper.plugins.regex.parquet.{RegexValueListClauseTranslator, RegexValueListMetaDataTranslator}

// registering the filter factories for user metadataFilters
Registration.addIndexFactory(RegexIndexFactory)
Registration.addMetadataFilterFactory(RegexValueListMetaDataFilterFactory)
Registration.addClauseTranslator(RegexValueListClauseTranslator)
Registration.addMetaDataTranslator(RegexValueListMetaDataTranslator)

Index building

In order to build an index you can use the addCustomIndex API.

For example for the Regex Plugin:

    xskipper = Xskipper(spark, dataset_path)

    # adding the index using the custom index API
    xskipper.indexBuilder() \
        .addCustomIndex("io.xskipper.plugins.regex.index.RegexValueListIndex", ["log_line"],
                        {"io.xskipper.plugins.regex.pattern.r0": ".* .* .* (.*): .*"}) \
        .build(reader) \
        .show(10, False)
import io.xskipper.plugins.regex.implicits._

// index the dataset
val xskipper = new Xskipper(spark, dataset_path)


xskipper
  .indexBuilder()
  // using the implicit method defined in the plugin implicits
  .addRegexValueListIndex("log_line", Seq(".* .* .* (.*): .*"))
  // equivalent
  //.addCustomIndex(RegexValueListIndex("log_line", Seq(".* .* .* (.*): .*")))
  .build(reader).show(false)

Creating you own plugin

In order to create your own plugin see here.