Indexing

Xskipper can collect summary metadata for every column of each object in a dataset. During query evaluation, this metadata is used to skip objects that contain no relevant data.
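
To make the idea of skipping concrete, here is a minimal sketch in plain Python. It does not use the Xskipper API; the `ObjectStats` class, the object names, and the `temp` values are all hypothetical, and it only assumes that the summary metadata is the per-object minimum and maximum of one column.

```python
from dataclasses import dataclass


@dataclass
class ObjectStats:
    """Hypothetical per-object summary for one numeric column."""
    name: str
    min_value: float
    max_value: float


def objects_to_read(stats, threshold):
    """Return the names of objects that may hold rows with value > threshold."""
    # An object can be skipped when even its maximum fails the predicate.
    return [s.name for s in stats if s.max_value > threshold]


# Made-up metadata for a `temp` column, one entry per object.
stats = [
    ObjectStats("part-0.parquet", min_value=-5.0, max_value=12.0),
    ObjectStats("part-1.parquet", min_value=10.0, max_value=31.0),
    ObjectStats("part-2.parquet", min_value=0.5, max_value=18.5),
]

# Query: SELECT ... WHERE temp > 20
relevant = objects_to_read(stats, 20)
print(relevant)  # ['part-1.parquet']
```

Only the metadata is consulted to decide which objects to read; the skipped objects are never opened.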

Default Indexes

The following indexes are supported out of the box:

Index type	Description	Applicable to predicates in WHERE clauses	Column types
MinMax	Stores minimum and maximum values for a column	<,<=,=,>=,>	All types except for complex types. See Supported Spark SQL data types.
ValueList	Stores the list of unique values for the column	=,IN,LIKE	All types except for complex types. See Supported Spark SQL data types.
BloomFilter	Uses a bloom filter to test for set membership	=,IN	Byte, String, Long, Integer, Short
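
As an illustration of why a Bloom filter suits `=` and `IN` predicates, the following sketch builds a tiny Bloom filter per object in plain Python. It is not Xskipper's implementation: the `TinyBloomFilter` class, its sizing parameters, and the sample `vid` values are assumptions made for this example only.

```python
import hashlib


class TinyBloomFilter:
    """Illustrative Bloom filter; not Xskipper's implementation."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit array packed into a single int

    def _positions(self, value):
        # Derive one bit position per hash by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


# Hypothetical objects and the `vid` values each one holds.
objects = {
    "part-0.parquet": ["v-1", "v-2", "v-3"],
    "part-1.parquet": ["v-4", "v-5"],
}

# Build one filter per object.
filters = {}
for name, values in objects.items():
    bf = TinyBloomFilter()
    for v in values:
        bf.add(v)
    filters[name] = bf

# Query: WHERE vid = 'v-4'. Objects whose filter answers False can be skipped.
candidates = [name for name, bf in filters.items() if bf.might_contain("v-4")]
print(candidates)
```

A Bloom filter never reports a stored value as absent, so skipping based on it is safe; it may occasionally report an absent value as present, which only costs an unnecessary read.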

To add an index, use the IndexBuilder to specify the required indexes. For example:

Python:

from xskipper import Xskipper

# create Xskipper instance for the sample dataset
xskipper = Xskipper(spark, dataset_location)

# remove index if exists
if xskipper.isIndexed():
    xskipper.dropIndex()

# `reader` is assumed to be a Spark DataFrameReader configured for the dataset
xskipper.indexBuilder() \
        .addMinMaxIndex("temp") \
        .addValueListIndex("city") \
        .addBloomFilterIndex("vid") \
        .build(reader) \
        .show(10, False)

Scala:

import io.xskipper._

// create Xskipper instance for the sample dataset
val xskipper = new Xskipper(spark, dataset_location)

// remove existing index if needed
if (xskipper.isIndexed()) {
  xskipper.dropIndex()
}

// `reader` is assumed to be a Spark DataFrameReader configured for the dataset
xskipper.indexBuilder()
        .addMinMaxIndex("temp")
        .addValueListIndex("city")
        .addBloomFilterIndex("vid")
        .build(reader)
        .show(false)

By default, the indexes are stored as Parquet files. Each Parquet file contains one row for each object in the dataset.

For more information about the Parquet metadatastore, see here.

Plugins

Xskipper supports adding new indexes using a pluggable system.
For instructions on how to create a new plugin see here.

Supported plugins

Currently the following plugins are supported (in addition to the built-in indexes: MinMax, ValueList and BloomFilter):

Regex Plugin - An index that stores a value list for a given regex.

Setting up a plugin

To use a plugin, first register the classes it requires.
For example, for the Regex Plugin:

Python:

from xskipper import Registration

# Add MetadataFilterFactory
Registration.addMetadataFilterFactory(spark, 'io.xskipper.plugins.regex.filter.RegexValueListMetaDataFilterFactory')
# Add IndexFactory
Registration.addIndexFactory(spark, 'io.xskipper.plugins.regex.index.RegexIndexFactory')
# Add MetaDataTranslator
Registration.addMetaDataTranslator(spark, 'io.xskipper.plugins.regex.parquet.RegexValueListMetaDataTranslator')
# Add ClauseTranslator
Registration.addClauseTranslator(spark, 'io.xskipper.plugins.regex.parquet.RegexValueListClauseTranslator')

Scala:

import io.xskipper._
import io.xskipper.plugins.regex.filter.RegexValueListMetaDataFilterFactory
import io.xskipper.plugins.regex.index.RegexIndexFactory
import io.xskipper.plugins.regex.parquet.{RegexValueListClauseTranslator, RegexValueListMetaDataTranslator}

// register the plugin's index factory, metadata filter factory, and translators
Registration.addIndexFactory(RegexIndexFactory)
Registration.addMetadataFilterFactory(RegexValueListMetaDataFilterFactory)
Registration.addClauseTranslator(RegexValueListClauseTranslator)
Registration.addMetaDataTranslator(RegexValueListMetaDataTranslator)

Index building

To build an index provided by a plugin, use the addCustomIndex API.

For example, for the Regex Plugin:

Python:

xskipper = Xskipper(spark, dataset_path)

# adding the index using the custom index API
xskipper.indexBuilder() \
    .addCustomIndex("io.xskipper.plugins.regex.index.RegexValueListIndex", ["log_line"],
                    {"io.xskipper.plugins.regex.pattern.r0": ".* .* .* (.*): .*"}) \
    .build(reader) \
    .show(10, False)

Scala:

import io.xskipper.plugins.regex.implicits._

// index the dataset
val xskipper = new Xskipper(spark, dataset_path)

xskipper
  .indexBuilder()
  // using the implicit method defined in the plugin implicits
  .addRegexValueListIndex("log_line", Seq(".* .* .* (.*): .*"))
  // equivalent
  //.addCustomIndex(RegexValueListIndex("log_line", Seq(".* .* .* (.*): .*")))
  .build(reader).show(false)
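
To show what a value list for a regex might look like, the sketch below applies the pattern from the example above to a few made-up log lines using Python's standard `re` module. It does not call the plugin; the helper `regex_value_list`, the sample log lines, and the assumption that the index keeps the distinct values of the first capture group are all illustrative.

```python
import re

# Same pattern as in the indexing example; the capture group is the text
# between the third space and the first ": " of the log line.
PATTERN = re.compile(r".* .* .* (.*): .*")


def regex_value_list(lines, pattern=PATTERN):
    """Collect the distinct values captured by group 1 over all lines."""
    values = set()
    for line in lines:
        m = pattern.match(line)
        if m:
            values.add(m.group(1))
    return values


# Hypothetical contents of the `log_line` column in two objects.
object_a = [
    "2020-01-01 10:00:00 INFO scheduler: Starting job 1",
    "2020-01-01 10:00:01 WARN storage: Disk usage high",
    "2020-01-01 10:00:02 INFO scheduler: Finished job 1",
]
object_b = [
    "2020-01-02 09:00:00 ERROR network: Connection reset",
]

metadata = {
    "object_a": regex_value_list(object_a),
    "object_b": regex_value_list(object_b),
}

# A query that extracts the same group and compares it to 'network'
# only needs the objects whose value list contains 'network'.
to_read = sorted(name for name, values in metadata.items() if "network" in values)
print(to_read)  # ['object_b']
```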

Creating your own plugin

To create your own plugin, see here.