Welcome to Xskipper’s documentation!

Xskipper

class xskipper.xskipper.Xskipper(sparkSession, uri, metadataStoreManagerClassName=None)

Main class for programmatically interacting with xskipper

Parameters
  • sparkSession – SparkSession object

  • uri – the URI of the dataset / the identifier of the hive table on which the index is defined

  • metadataStoreManagerClassName – fully qualified name of MetadataStoreManager to be used

static clearStats(sparkSession)

Clears the stats for all active MetadataHandle instances in the active MetadataStoreManager. Should be called before each query to make sure the aggregated stats are cleared.

Parameters

sparkSession – SparkSession object

describeIndex(dataFrameReader=None)

Describes the indexes on the URI (for non table URI)

Parameters

dataFrameReader – if uri is a table identifier, a DataFrameReader instance to enable reading the URI as a DataFrame

Returns

dataFrame object containing information about the index

Raises

XskipperException – if index cannot be described

static disable(sparkSession)

Disable xskipper by disabling the rules

Parameters

sparkSession – SparkSession object

dropIndex()

Deletes the index

Raises

XskipperException – if index cannot be removed
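A minimal sketch of dropping an index only when one exists, using isIndexed() and dropIndex() as documented here. The helper name drop_index_if_exists is hypothetical and not part of xskipper; the argument is assumed to be an Xskipper instance.

```python
def drop_index_if_exists(xskipper):
    """Drop the index behind `xskipper` if its URI is indexed.

    `xskipper` is assumed to be an Xskipper instance, for example one
    created with Xskipper(spark, uri). Returns True if an index was dropped.
    """
    if not xskipper.isIndexed():
        return False
    # dropIndex() may raise XskipperException if the index cannot be removed.
    xskipper.dropIndex()
    return True
```

Checking isIndexed() first keeps the call safe to repeat, for example at the start of a job that always rebuilds its index.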

static enable(sparkSession)

Enable xskipper by adding the necessary rules

Parameters

sparkSession – SparkSession object

static get(sparkSession, key)

Retrieves the value associated with the given key in the configuration

Parameters
  • sparkSession – SparkSession object

  • key – the key to lookup

Returns

the value associated with the key or None if the key doesn’t exist

static getConf(sparkSession)

Returns a map of all configurations currently set

Parameters

sparkSession – SparkSession object

Returns

a map of the values currently set in the configuration

static getLatestQueryAggregatedStats(sparkSession)

Gets the aggregated latest query skipping stats for all active MetadataHandle instances in the current default MetadataStoreManager. In order to get reliable results it is assumed that either clearStats or clearActiveMetadataHandles was called before running the query.

If the API was called on a URI without an index, or was called without running a query, the returned DataFrame structure is: status, reason with status=FAILED.

If the query cannot be skipped because of one of the following:

  1. Dataset has no indexed files

  2. No query to the metadata store can be generated - either because a predicate cannot be used in skipping (possibly due to a missing metadata filter) or because the abstract query could not be translated

the returned DataFrame structure is: status, isSkippable, skipped_Bytes, skipped_Objs, total_Bytes, total_Objs with status=SUCCESS, isSkippable=false and all other values set to -1.

Otherwise the DataFrame structure is the same as above, with isSkippable=true and the relevant stats.

Parameters

sparkSession – SparkSession object

Returns

dataFrame object containing information about latest query stats
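The clear-then-query order described above can be sketched as a small wrapper. This is only an illustration: run_with_aggregated_stats and the run_query callable are hypothetical names, and xskipper_cls is assumed to be the Xskipper class itself, since only its static methods are used.

```python
def run_with_aggregated_stats(xskipper_cls, spark, run_query):
    """Run `run_query()` and return (query_result, stats_dataframe).

    `xskipper_cls` is assumed to be the Xskipper class; `run_query` is a
    hypothetical zero-argument callable that executes the query.
    """
    # Clear stale stats first so the aggregate reflects only this query.
    xskipper_cls.clearStats(spark)
    result = run_query()
    stats = xskipper_cls.getLatestQueryAggregatedStats(spark)
    return result, stats
```

Because Spark evaluates DataFrames lazily, run_query should trigger an action (for example a count()) so that a query has actually run before the stats are read.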

getLatestQueryStats()

Return latest query skipping statistics for this xskipper instance

If the API was called on a URI without an index, or was called without running a query, the returned DataFrame structure is: status, reason with status=FAILED.

If the query cannot be skipped because of one of the following:

  1. Dataset has no indexed files

  2. No query to the metadata store can be generated - either because a predicate cannot be used in skipping (possibly due to a missing metadata filter) or because the abstract query could not be translated

the returned DataFrame structure is: status, isSkippable, skipped_Bytes, skipped_Objs, total_Bytes, total_Objs with status=SUCCESS, isSkippable=false and all other values set to -1.

Otherwise the DataFrame structure is the same as above, with isSkippable=true and the relevant stats.

Returns

object containing information about latest query stats
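The three result shapes described above can be told apart with a small helper. This is a sketch, not part of xskipper: summarize_skipping_stats is a hypothetical name, and it assumes one stats row has already been converted to a plain dict keyed by the column names listed above. The exact value types (boolean versus string for isSkippable) are an assumption, so the check accepts both.

```python
def summarize_skipping_stats(row):
    """Summarize one stats row given as a dict of column name -> value."""
    if row.get("status") != "SUCCESS":
        # FAILED rows carry only `status` and `reason`.
        return "no stats: " + str(row.get("reason"))
    # Accept either a boolean True or the string "true" (assumption).
    if str(row.get("isSkippable")).lower() != "true":
        return "query not skippable"
    total = row["total_Objs"]
    skipped = row["skipped_Objs"]
    ratio = skipped / total if total > 0 else 0.0
    return "skipped {} of {} objects ({:.0%})".format(skipped, total, ratio)
```

With PySpark, such a dict could be obtained from the first row of the returned DataFrame, for example with first().asDict().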

indexBuilder()

Helper class for setting and building an index

Returns

IndexBuilder instance

static installExceptionWrapper()

Install a wrapper to wrap JVM XskipperException as python XskipperException

static isEnabled(sparkSession)

Checks whether xskipper is enabled

Parameters

sparkSession – SparkSession object

Returns

true if xskipper is enabled
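enable, disable and isEnabled can be combined so that skipping is switched on for one block of code and the previous state is restored afterwards. The sketch below is an assumption-laden illustration: skipping_enabled is a hypothetical helper, and xskipper_cls is assumed to be the Xskipper class.

```python
from contextlib import contextmanager


@contextmanager
def skipping_enabled(xskipper_cls, spark):
    """Enable xskipper for the duration of a `with` block."""
    was_enabled = xskipper_cls.isEnabled(spark)
    if not was_enabled:
        xskipper_cls.enable(spark)
    try:
        yield
    finally:
        # Only disable if this block was the one that enabled it.
        if not was_enabled:
            xskipper_cls.disable(spark)
```

Queries issued inside `with skipping_enabled(Xskipper, spark):` would run with the skipping rules active, while a session that already had xskipper enabled is left enabled on exit.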

isIndexed()

Checks if the URI is indexed

Returns

true if the URI is indexed

static listIndexes(sparkSession)

Returns information about the indexed datasets

Parameters

sparkSession – SparkSession object

Returns

dataFrame object containing information about the indexed datasets under the configured base path

refreshIndex(dataFrameReader=None)

Refresh the index

Parameters

dataFrameReader – if uri is a table identifier, a DataFrameReader instance to enable reading the URI as a DataFrame. Note: the reader is assumed to have all of the parameters configured. reader.load(Seq(<path>)) will be used by the indexing code to read each object separately.

Returns

dataFrame object containing statistics about the refresh operation

Raises

XskipperException – if index cannot be refreshed
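A sketch of handling the documented failure case of refreshIndex. try_refresh is a hypothetical helper; the exception type is passed in explicitly and is expected to be xskipper.xskipper.XskipperException. Whether the JVM-side exception surfaces as that Python class without installExceptionWrapper having been applied is an assumption not confirmed here.

```python
def try_refresh(xskipper, exception_type, reader=None):
    """Refresh the index and report failure instead of raising.

    Returns (stats, None) on success or (None, error) if `exception_type`
    was raised. `exception_type` is expected to be XskipperException.
    """
    try:
        return xskipper.refreshIndex(reader), None
    except exception_type as err:
        return None, err
```

Returning the error rather than swallowing it leaves the caller free to log it or fall back to a full rebuild.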

static reset(sparkSession)

Resets all xskipper settings by:

  1. disabling filtering

  2. clearing all MetadataHandle-s in the default MetadataStoreManager

  3. resetting the JVM wide configuration

Parameters

sparkSession – SparkSession object

static set(sparkSession, key, value)

Sets a specific key in the JVM wide configuration

Parameters
  • sparkSession – SparkSession object

  • key – the key to set

  • value – the value associated with the key

static setConf(sparkSession, params)

Updates JVM wide xskipper parameters (Only given parameters will be updated)

Parameters
  • sparkSession – SparkSession object

  • params – a map of parameters to be set

setParams(params)

Update instance specific MetadataHandle parameters

Parameters

params – a map of parameters to be set

static unset(sparkSession, key)

Removes a key from the configuration

Parameters
  • sparkSession – SparkSession object

  • key – the key to remove
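get, set and unset are enough to change a configuration key temporarily and restore it afterwards. The following is a sketch under stated assumptions: temporary_setting is a hypothetical helper, xskipper_cls is assumed to be the Xskipper class, and it relies on get returning None for a missing key as documented above.

```python
from contextlib import contextmanager


@contextmanager
def temporary_setting(xskipper_cls, spark, key, value):
    """Set `key` to `value` inside a `with` block, then restore it."""
    previous = xskipper_cls.get(spark, key)
    xskipper_cls.set(spark, key, value)
    try:
        yield
    finally:
        if previous is None:
            # The key did not exist before, so remove it again.
            xskipper_cls.unset(spark, key)
        else:
            xskipper_cls.set(spark, key, previous)
```

Since the configuration is JVM wide, the change is visible to everything in the same JVM while the block runs, not only to the calling code.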

exception xskipper.xskipper.XskipperException(desc, stackTrace)

xskipper.xskipper.get_wrapped_function(f)

IndexBuilder

class xskipper.indexbuilder.IndexBuilder(spark, uri, xskipper)

Helper class for building indexes

Parameters
  • spark – SparkSession object

  • uri – the URI of the dataset / the identifier of the hive table on which the index is defined

  • xskipper – the xskipper instance associated with this IndexBuilder

addBloomFilterIndex(col, keyMetadata=None)

Adds a BloomFilter index on the given column

Parameters
  • col – the column to add the index on

  • keyMetadata – optional key metadata

addCustomIndex(indexClass, cols, params, keyMetadata=None)

Adds a Custom index on the given columns

Parameters
  • cols – a sequence of cols

  • params – a map of index specific parameters

  • keyMetadata – optional key metadata

addMinMaxIndex(col, keyMetadata=None)

Adds a MinMax index for the given column

Parameters
  • col – the column to add the index on

  • keyMetadata – optional key metadata

addValueListIndex(col, keyMetadata=None)

Adds a ValueList index on the given column

Parameters
  • col – the column to add the index on

  • keyMetadata – optional key metadata

build(reader=None)

Builds the index

Parameters

reader – if uri in the xskipper instance is a table identifier, a DataFrameReader instance to enable reading the URI as a DataFrame. Note: the reader is assumed to have all of the parameters configured. reader.load(Seq(<path>)) will be used by the indexing code to read each object separately.

Returns

dataFrame object containing statistics about the build operation
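Putting the IndexBuilder methods together, a build might look like the sketch below. build_index and its keyword arguments are hypothetical. The code assumes that each add...Index call configures the builder in place, which is why it does not rely on the methods' return values.

```python
def build_index(xskipper, reader=None, minmax=(), value_list=(), bloom=()):
    """Build an index on the given columns and return the build statistics.

    `xskipper` is assumed to be an Xskipper instance; the column names in
    `minmax`, `value_list` and `bloom` are supplied by the caller.
    """
    builder = xskipper.indexBuilder()
    for col in minmax:
        builder.addMinMaxIndex(col)
    for col in value_list:
        builder.addValueListIndex(col)
    for col in bloom:
        builder.addBloomFilterIndex(col)
    # `reader` is the DataFrameReader described under build().
    return builder.build(reader)
```

A call such as `build_index(xs, reader, minmax=["temp"], bloom=["vid"])` uses illustrative column names only; they are not taken from this reference.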

Registration

class xskipper.registration.Registration

Helper class for registering Factories and Translators for xskipper plugins

static addClauseTranslator(sparkSession, clauseTranslator)

Adds a ClauseTranslator to the start of the ClauseTranslator-s list

Parameters
  • sparkSession – SparkSession object

  • clauseTranslator – fully qualified name of ClauseTranslator to be used

static addIndexFactory(sparkSession, indexFactory)

Adds an IndexFactory to the start of the IndexFactory-s list

Parameters
  • sparkSession – SparkSession object

  • indexFactory – fully qualified name of IndexFactory to be used

static addMetaDataTranslator(sparkSession, metadataTranslator)

Adds a MetadataTranslator to the start of the MetadataTranslator-s list

Parameters
  • sparkSession – SparkSession object

  • metadataTranslator – fully qualified name of MetadataTranslator to be used

static addMetadataFilterFactory(sparkSession, filterFactory)

Adds a MetadataFilterFactory to the end of the MetadataFilterFactory-s list

Parameters
  • sparkSession – SparkSession object

  • filterFactory – fully qualified name of MetadataFilterFactory to be used
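A plugin typically ships several of these classes, so registering them can be grouped in one place. The sketch below is illustrative only: register_plugin is a hypothetical helper, registration_cls is assumed to be the Registration class, and every class name passed in is a fully qualified name chosen by the caller.

```python
def register_plugin(registration_cls, spark, index_factory=None,
                    metadata_translator=None, clause_translator=None,
                    filter_factory=None):
    """Register whichever plugin classes are given.

    Returns the list of fully qualified names that were registered.
    """
    registered = []
    if index_factory is not None:
        registration_cls.addIndexFactory(spark, index_factory)
        registered.append(index_factory)
    if metadata_translator is not None:
        # Note the capital D in the documented method name.
        registration_cls.addMetaDataTranslator(spark, metadata_translator)
        registered.append(metadata_translator)
    if clause_translator is not None:
        registration_cls.addClauseTranslator(spark, clause_translator)
        registered.append(clause_translator)
    if filter_factory is not None:
        registration_cls.addMetadataFilterFactory(spark, filter_factory)
        registered.append(filter_factory)
    return registered
```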

static getActiveMetadataStoreManagerType(sparkSession)

Returns the fully qualified name of the active MetadataStoreManagerType

Parameters

sparkSession – SparkSession object

Returns

fully qualified name of the active MetadataStoreManagerType

static setActiveMetadataStoreManager(sparkSession, metadataStoreManager)

Sets the currently active MetadataStoreManager

Parameters
  • sparkSession – SparkSession object

  • metadataStoreManager – fully qualified name of MetadataStoreManagerType to be used
