Welcome to Xskipper’s documentation!

Xskipper

class xskipper.xskipper.Xskipper(sparkSession, uri, metadataStoreManagerClassName=None)

Main class for programmatically interacting with xskipper

Parameters

sparkSession – SparkSession object
uri – the URI of the dataset / the identifier of the hive table on which the index is defined
metadataStoreManagerClassName – fully qualified name of MetadataStoreManager to be used

static clearStats(sparkSession)

Clears the stats for all active MetadataHandle instances in the active MetadataStoreManager. Should be called before each query to make sure the aggregated stats are cleared.

Parameters: sparkSession – SparkSession object

describeIndex(dataFrameReader=None)

Describes the indexes on the URI (for non table URI)

Parameters: dataFrameReader – if uri is a table identifier a DataFrameReader instance to enable reading the URI as a DataFrame
Returns: dataFrame object containing information about the index
Raises: XskipperException – if index cannot be described

static disable(sparkSession)

Disable xskipper by disabling the rules

Parameters: sparkSession – SparkSession object

dropIndex()

Deletes the index

Raises: XskipperException – if index cannot be removed

static enable(sparkSession)

Enable xskipper by adding the necessary rules

Parameters: sparkSession – SparkSession object

static get(sparkSession, key)

Retrieves the value associated with the given key in the configuration

Parameters

sparkSession – SparkSession object
key – the key to lookup

Returns

the value associated with the key or None if the key doesn’t exist

static getConf(sparkSession)

Returns a map of all configurations currently set

Parameters: sparkSession – SparkSession object
Returns: a map of the values currently set in the configuration

static getLatestQueryAggregatedStats(sparkSession)

Gets the aggregated latest query skipping stats for all active MetadataHandle instances in the current default MetadataStoreManager. To get reliable results, it is assumed that either clearStats or clearActiveMetadataHandles was called before running the query.

If the API was called on a URI without an index, or was called without running a query, the returned DataFrame structure is: status, reason with status=FAILED.

If the query cannot be skipped for one of the following reasons:

Dataset has no indexed files
No query to the metadata store can be generated, either because a predicate cannot be used in skipping (or possibly because a metadata filter is missing) or because the abstract query failed to translate

then the returned DataFrame structure is: status, isSkippable, skipped_Bytes, skipped_Objs, total_Bytes, total_Objs with status=SUCCESS, isSkippable=false and all other values set to -1.

Otherwise the DataFrame structure is the same as above, with isSkippable=true and the relevant stats.

Parameters: sparkSession – SparkSession object
Returns: dataFrame object containing information about latest query stats
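The sequence described above (enable xskipper, clear the stats, run the query, then read the aggregate) can be sketched as a small helper. This is a minimal sketch and not part of xskipper: run_and_collect_stats and run_query are hypothetical names, and the Xskipper class is passed in as xskipper_cls so that only the static methods documented on this page are used.

```python
def run_and_collect_stats(xskipper_cls, spark, run_query):
    """Run one query with skipping enabled and return (result, stats).

    xskipper_cls is expected to be xskipper.xskipper.Xskipper (passed in as an
    argument, which is an illustrative choice). run_query is a hypothetical
    callable that receives the SparkSession and executes the query.
    """
    if not xskipper_cls.isEnabled(spark):
        xskipper_cls.enable(spark)
    # Clear stale numbers so the aggregate reflects only this query.
    xskipper_cls.clearStats(spark)
    result = run_query(spark)
    stats = xskipper_cls.getLatestQueryAggregatedStats(spark)
    return result, stats
```

With a live session this might be called as run_and_collect_stats(Xskipper, spark, lambda s: s.sql("SELECT * FROM my_table WHERE temp > 30").collect()), where my_table and temp are placeholder names.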

getLatestQueryStats()

Returns the latest query skipping statistics for this xskipper instance.

If the API was called on a URI without an index, or was called without running a query, the returned DataFrame structure is: status, reason with status=FAILED.

If the query cannot be skipped for one of the following reasons:

Dataset has no indexed files
No query to the metadata store can be generated, either because a predicate cannot be used in skipping (or possibly because a metadata filter is missing) or because the abstract query failed to translate

then the returned DataFrame structure is: status, isSkippable, skipped_Bytes, skipped_Objs, total_Bytes, total_Objs with status=SUCCESS, isSkippable=false and all other values set to -1.

Otherwise the DataFrame structure is the same as above, with isSkippable=true and the relevant stats.

Returns: object containing information about latest query stats
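The three possible result shapes can be told apart with a small pure-Python helper. The sketch below is illustrative only: summarize_skipping is a hypothetical name, it takes a single stats row as a plain dict (for example from PySpark's Row.asDict()), and the exact column value types (boolean versus string for isSkippable, numeric versus string for the counters) are assumptions, so the helper normalizes them.

```python
def summarize_skipping(row):
    """Summarize one stats row given as a plain dict."""
    if row.get("status") != "SUCCESS":
        # Shape 1: status, reason (no index, or no query was run).
        return "no stats: " + str(row.get("reason"))
    # Accept a boolean or the strings "true"/"false"; the type is an assumption.
    skippable = str(row.get("isSkippable")).lower() == "true"
    if not skippable:
        # Shape 2: SUCCESS but not skippable, all counters are -1.
        return "query not skippable"
    # Shape 3: SUCCESS and skippable, counters carry real values.
    total = int(row["total_Objs"])
    skipped = int(row["skipped_Objs"])
    pct = 100.0 * skipped / total if total > 0 else 0.0
    return "skipped {} of {} objects ({:.1f}%)".format(skipped, total, pct)
```

Checking status first and isSkippable second mirrors the order in which the cases are documented, and it avoids computing a ratio from the -1 placeholder values.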

indexBuilder()

Returns a helper for configuring and building an index

Returns: IndexBuilder instance

static installExceptionWrapper()

Install a wrapper to wrap JVM XskipperException as Python XskipperException

static isEnabled(sparkSession)

Checks whether xskipper is enabled

Parameters: sparkSession – SparkSession object
Returns: true if xskipper is enabled

isIndexed()

Checks if the URI is indexed

Returns: true if the URI is indexed

static listIndexes(sparkSession)

Returns information about the indexed datasets

Parameters: sparkSession – SparkSession object
Returns: dataFrame object containing information about the indexed datasets under the configured base path

refreshIndex(dataFrameReader=None)

Refreshes the index

Parameters: dataFrameReader – if uri is a table identifier a DataFrameReader instance to enable reading the URI as a DataFrame. Note: the reader is assumed to have all of the parameters configured. reader.load(Seq(<path>)) will be used by the indexing code to read each object separately
Returns: dataFrame object containing statistics about the refresh operation
Raises: XskipperException – if index cannot be refreshed
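A common pattern is to refresh only when an index already exists. The following is a minimal sketch under that assumption: refresh_if_indexed is a hypothetical helper, xs stands for an Xskipper instance, and the reader is passed through unchanged without making any claim about when it is required.

```python
def refresh_if_indexed(xs, reader=None):
    """Refresh the index behind xs when one exists.

    Returns the DataFrame of refresh statistics, or None when the URI is
    not indexed. xs is expected to be an Xskipper instance.
    """
    if not xs.isIndexed():
        return None
    # refreshIndex accepts dataFrameReader=None, so reader can be forwarded as is.
    return xs.refreshIndex(reader)
```

Because refreshIndex may raise XskipperException, callers that need to continue after a failed refresh would wrap this call in their own try/except.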

static reset(sparkSession)

Resets all xskipper settings by:

disabling filtering
clearing all MetadataHandle-s in the default MetadataStoreManager
resetting the JVM wide configuration

Parameters: sparkSession – SparkSession object

static set(sparkSession, key, value)

Sets a specific key in the JVM wide configuration

Parameters

sparkSession – SparkSession object
key – the key to set
value – the value associated with the key

static setConf(sparkSession, params)

Updates JVM wide xskipper parameters (Only given parameters will be updated)

Parameters

sparkSession – SparkSession object
params – a map of parameters to be set

setParams(params)

Update instance specific MetadataHandle parameters

Parameters: params – a map of parameters to be set

static unset(sparkSession, key)

Removes a key from the configuration

Parameters

sparkSession – SparkSession object
key – the key to remove
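The get, set and unset methods can be combined to change a configuration key temporarily and restore it afterwards. This is a minimal sketch, not an xskipper feature: temporary_setting is a hypothetical name, and the Xskipper class is passed in as xskipper_cls so that only the documented static methods are called. It relies on get returning None for a missing key, as stated above.

```python
from contextlib import contextmanager


@contextmanager
def temporary_setting(xskipper_cls, spark, key, value):
    """Set key to value inside a with-block, then restore the old state."""
    previous = xskipper_cls.get(spark, key)  # None if the key does not exist
    xskipper_cls.set(spark, key, value)
    try:
        yield
    finally:
        if previous is None:
            # The key was absent before, so remove it again.
            xskipper_cls.unset(spark, key)
        else:
            xskipper_cls.set(spark, key, previous)
```

Since the configuration is JVM wide, this sketch is not safe when several threads change the same key concurrently.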

exception xskipper.xskipper.XskipperException(desc, stackTrace)

xskipper.xskipper.get_wrapped_function(f)

IndexBuilder

class xskipper.indexbuilder.IndexBuilder(spark, uri, xskipper)

Helper class for building indexes

Parameters

spark – SparkSession object
uri – the URI of the dataset / the identifier of the hive table on which the index is defined
xskipper – the xskipper instance associated with this IndexBuilder

addBloomFilterIndex(col, keyMetadata=None)

Adds a BloomFilter index on the given column

Parameters

col – the column to add the index on
keyMetadata – optional key metadata

addCustomIndex(indexClass, cols, params, keyMetadata=None)

Adds a Custom index on the given columns

Parameters

indexClass – the custom index class to use
cols – a sequence of columns to add the index on
params – a map of index specific parameters
keyMetadata – optional key metadata

addMinMaxIndex(col, keyMetadata=None)

Adds a MinMax index for the given column

Parameters

col – the column to add the index on
keyMetadata – optional key metadata

addValueListIndex(col, keyMetadata=None)

Adds a ValueList index on the given column

Parameters

col – the column to add the index on
keyMetadata – optional key metadata

build(reader=None)

Builds the index

Parameters: reader – if uri in the xskipper instance is a table identifier a DataFrameReader instance to enable reading the URI as a DataFrame. Note: the reader is assumed to have all of the parameters configured. reader.load(Seq(<path>)) will be used by the indexing code to read each object separately
Returns: dataFrame object containing statistics about the build operation
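Putting the IndexBuilder methods together, an index definition can be assembled and built in one helper. This is a sketch under stated assumptions: build_indexes is a hypothetical name, xs stands for an Xskipper instance, and each add* call is assumed to register the index on the builder in place (the return values of the add* methods are not relied on).

```python
def build_indexes(xs, minmax_cols=(), valuelist_cols=(), bloom_cols=(), reader=None):
    """Add the requested indexes to a fresh IndexBuilder and build them.

    Returns the DataFrame of build statistics produced by build().
    """
    builder = xs.indexBuilder()
    for col in minmax_cols:
        builder.addMinMaxIndex(col)
    for col in valuelist_cols:
        builder.addValueListIndex(col)
    for col in bloom_cols:
        builder.addBloomFilterIndex(col)
    # build accepts reader=None, so the reader can be forwarded as is.
    return builder.build(reader)
```

For example, build_indexes(xs, minmax_cols=["temp"], valuelist_cols=["city"], reader=spark.read.format("parquet")) would request a MinMax index on temp and a ValueList index on city, where the column names are placeholders.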

Registration

class xskipper.registration.Registration

Helper class for registering Factories and Translators for xskipper plugins

static addClauseTranslator(sparkSession, clauseTranslator)

Adds a ClauseTranslator to the start of the ClauseTranslator-s list

Parameters

sparkSession – SparkSession object
clauseTranslator – fully qualified name of ClauseTranslator to be used

static addIndexFactory(sparkSession, indexFactory)

Adds an IndexFactory to the start of the IndexFactory-s list

Parameters

sparkSession – SparkSession object
indexFactory – fully qualified name of IndexFactory to be used

static addMetaDataTranslator(sparkSession, metadataTranslator)

Adds a MetadataTranslator to the start of the MetadataTranslator-s list

Parameters

sparkSession – SparkSession object
metadataTranslator – fully qualified name of MetadataTranslator to be used

static addMetadataFilterFactory(sparkSession, filterFactory)

Adds a MetadataFilterFactory to the end of the MetadataFilterFactory-s list

Parameters

sparkSession – SparkSession object
filterFactory – fully qualified name of MetadataFilterFactory to be used
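A plugin typically needs all four components registered together. The helper below is a hedged sketch: register_plugin is a hypothetical name, the Registration class is passed in as registration_cls, and the fully qualified class names given by the caller are placeholders for whatever the plugin actually provides.

```python
def register_plugin(registration_cls, spark, index_factory, metadata_translator,
                    clause_translator, filter_factory):
    """Register the four components of one plugin by fully qualified name.

    registration_cls is expected to be xskipper.registration.Registration.
    Only the static methods documented above are called.
    """
    registration_cls.addIndexFactory(spark, index_factory)
    registration_cls.addMetaDataTranslator(spark, metadata_translator)
    registration_cls.addClauseTranslator(spark, clause_translator)
    registration_cls.addMetadataFilterFactory(spark, filter_factory)
```

A call might look like register_plugin(Registration, spark, "com.example.MyIndexFactory", "com.example.MyMetadataTranslator", "com.example.MyClauseTranslator", "com.example.MyFilterFactory"), where every com.example name is invented for illustration.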

static getActiveMetadataStoreManagerType(sparkSession)

Returns the fully qualified name of the active MetadataStoreManagerType

Parameters: sparkSession – SparkSession object
Returns: fully qualified name of the active MetadataStoreManagerType

static setActiveMetadataStoreManager(sparkSession, metadataStoreManager)

Sets the currently active MetadataStoreManager

Parameters

sparkSession – SparkSession object
metadataStoreManager – fully qualified name of MetadataStoreManagerType to be used
