Welcome to Xskipper’s documentation!
Xskipper
-
class
xskipper.xskipper.
Xskipper
(sparkSession, uri, metadataStoreManagerClassName=None) Main class for programmatically interacting with xskipper
- Parameters
sparkSession – SparkSession object
uri – the URI of the dataset / the identifier of the hive table on which the index is defined
metadataStoreManagerClassName – fully qualified name of MetadataStoreManager to be used
-
static
clearStats
(sparkSession) Clears the stats for all active MetadataHandle instances in the active MetadataStoreManager. Should be called before each query to make sure the aggregated stats are cleared.
- Parameters
sparkSession – SparkSession object
-
describeIndex
(dataFrameReader=None) Describes the indexes on the URI (for non-table URI)
- Parameters
dataFrameReader – if uri is a table identifier, a DataFrameReader instance to enable reading the URI as a DataFrame
- Returns
dataFrame object containing information about the index
- Raises
XskipperException – if the index cannot be described
-
static
disable
(sparkSession) Disables xskipper by disabling the rules
- Parameters
sparkSession – SparkSession object
-
dropIndex
() Deletes the index
- Raises
XskipperException – if index cannot be removed
-
static
enable
(sparkSession) Enables xskipper by adding the necessary rules
- Parameters
sparkSession – SparkSession object
-
static
get
(sparkSession, key) Retrieves the value associated with the given key in the configuration
- Parameters
sparkSession – SparkSession object
key – the key to lookup
- Returns
the value associated with the key or None if the key doesn’t exist
-
static
getConf
(sparkSession) Returns a map of all configurations currently set
- Parameters
sparkSession – SparkSession object
- Returns
a map of the values currently set in the configuration
-
static
getLatestQueryAggregatedStats
(sparkSession) Gets the aggregated latest query skipping stats for all active MetadataHandle instances in the current default MetadataStoreManager. To get reliable results, it is assumed that either clearStats or clearActiveMetadataHandles was called before running the query.
If the API was called on a URI without an index, or without running a query, the returned DataFrame structure is: status, reason, with status=FAILED.
If the query cannot be skipped for one of the following reasons:
- Dataset has no indexed files
- No query to the metadata store can be generated, either because a predicate cannot be used in skipping (possibly due to a missing metadata filter) or because translating the abstract query failed
the returned DataFrame structure is: status, isSkippable, skipped_Bytes, skipped_Objs, total_Bytes, total_Objs, with status=SUCCESS, isSkippable=false and all other values set to -1.
Otherwise the DataFrame structure is the same as above, with isSkippable=true and the relevant stats.
- Parameters
sparkSession – SparkSession object
- Returns
dataFrame object containing information about latest query stats
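The call order described for getLatestQueryAggregatedStats (clear the stats, run the query, then read the aggregated stats) can be sketched as follows. This is a minimal sketch rather than the library's own example: `run_query_with_stats` and its parameter names are hypothetical, and the Xskipper class is passed in as `xskipper_cls` (in practice presumably `xskipper.xskipper.Xskipper`) so the snippet does not assume an installed package.

```python
def run_query_with_stats(xskipper_cls, spark, sql_text):
    """Run one SQL query and return (rows, aggregated skipping stats).

    xskipper_cls is assumed to expose the static methods documented above:
    isEnabled, enable, clearStats and getLatestQueryAggregatedStats.
    """
    # Make sure the skipping rules are active before the query runs.
    if not xskipper_cls.isEnabled(spark):
        xskipper_cls.enable(spark)
    # Clear earlier stats so the result describes only this query.
    xskipper_cls.clearStats(spark)
    rows = spark.sql(sql_text).collect()
    stats = xskipper_cls.getLatestQueryAggregatedStats(spark)
    return rows, stats
```

Clearing the stats immediately before the query keeps results from earlier queries out of the aggregate.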
-
getLatestQueryStats
() Returns the latest query skipping statistics for this xskipper instance.
If the API was called on a URI without an index, or without running a query, the returned DataFrame structure is: status, reason, with status=FAILED.
If the query cannot be skipped for one of the following reasons:
- Dataset has no indexed files
- No query to the metadata store can be generated, either because a predicate cannot be used in skipping (possibly due to a missing metadata filter) or because translating the abstract query failed
the returned DataFrame structure is: status, isSkippable, skipped_Bytes, skipped_Objs, total_Bytes, total_Objs, with status=SUCCESS, isSkippable=false and all other values set to -1.
Otherwise the DataFrame structure is the same as above, with isSkippable=true and the relevant stats.
- Returns
object containing information about latest query stats
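The three result shapes described for getLatestQueryStats can be told apart with a small helper. The sketch below works on a plain dict so that it runs without Spark. `summarize_skipping_stats` is a hypothetical name, and it assumes `isSkippable` arrives as a boolean, which the description above does not state.

```python
def summarize_skipping_stats(row):
    """Turn one stats row (given as a dict) into a short summary string."""
    # Shape 1: no index, or no query was run -> status, reason
    if row.get("status") == "FAILED":
        return "no stats: " + str(row.get("reason"))
    # Shape 2: query could not be skipped -> all counters are -1
    if not row.get("isSkippable"):
        return "query was not skippable"
    # Shape 3: skipping took place -> counters hold real values
    total = row["total_Objs"]
    skipped = row["skipped_Objs"]
    pct = 100.0 * skipped / total if total else 0.0
    return "skipped {} of {} objects ({:.1f}%)".format(skipped, total, pct)
```

With PySpark, a row dict could be obtained from a DataFrame via `first().asDict()`, assuming the returned object is a regular DataFrame with at least one row.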
-
indexBuilder
() Helper class for setting and building an index
- Returns
IndexBuilder instance
-
static
installExceptionWrapper
() Installs a wrapper that wraps the JVM XskipperException as a Python XskipperException
-
static
isEnabled
(sparkSession) Checks whether xskipper is enabled
- Parameters
sparkSession – SparkSession object
- Returns
true if xskipper is enabled
-
isIndexed
() Checks if the URI is indexed
- Returns
true if the URI is indexed
-
static
listIndexes
(sparkSession) Returns information about the indexed datasets
- Parameters
sparkSession – SparkSession object
- Returns
dataFrame object containing information about the indexed datasets under the configured base path
-
refreshIndex
(dataFrameReader=None) Refreshes the index
- Parameters
dataFrameReader – if uri is a table identifier, a DataFrameReader instance to enable reading the URI as a DataFrame. Note: the reader is assumed to have all of its parameters configured; reader.load(Seq(<path>)) will be used by the indexing code to read each object separately.
- Returns
dataFrame object containing statistics about the refresh operation
- Raises
XskipperException – if index cannot be refreshed
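Since refreshIndex raises XskipperException when the index cannot be refreshed, one option is to check isIndexed first. The helper below is a minimal sketch under that assumption; `refresh_if_indexed` is a hypothetical name, and `xskipper` stands for an instance of the Xskipper class documented above.

```python
def refresh_if_indexed(xskipper, reader=None):
    """Refresh the index only when the URI already has one.

    Returns the refresh statistics, or None when there is no index.
    """
    if not xskipper.isIndexed():
        return None
    # reader is forwarded as the dataFrameReader argument.
    return xskipper.refreshIndex(reader)
```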
-
static
reset
(sparkSession) Resets all xskipper settings by:
disabling filtering
clearing all MetadataHandle instances in the default MetadataStoreManager
resetting the JVM-wide configuration
- Parameters
sparkSession – SparkSession object
-
static
set
(sparkSession, key, value) Sets a specific key in the JVM-wide configuration
- Parameters
sparkSession – SparkSession object
key – the key to set
value – the value associated with the key
-
static
setConf
(sparkSession, params) Updates JVM-wide xskipper parameters (only the given parameters are updated)
- Parameters
sparkSession – SparkSession object
params – a map of parameters to be set
-
setParams
(params) Updates instance-specific MetadataHandle parameters
- Parameters
params – a map of parameters to be set
-
static
unset
(sparkSession, key) Removes a key from the configuration
- Parameters
sparkSession – SparkSession object
key – the key to remove
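Because set and unset change JVM-wide configuration, a temporary override should restore the previous state afterwards. The context manager below is a sketch of that idea built only on the documented get, set and unset calls. `temporary_setting` is a hypothetical helper, and the Xskipper class is passed in as `xskipper_cls` so the snippet runs without the package installed.

```python
from contextlib import contextmanager


@contextmanager
def temporary_setting(xskipper_cls, spark, key, value):
    """Set a configuration key for the duration of a with-block."""
    # get returns None when the key does not exist (see get above).
    previous = xskipper_cls.get(spark, key)
    xskipper_cls.set(spark, key, value)
    try:
        yield
    finally:
        if previous is None:
            # The key was absent before, so remove it again.
            xskipper_cls.unset(spark, key)
        else:
            xskipper_cls.set(spark, key, previous)
```

The `finally` clause restores the earlier value even if the body of the with-block raises.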
-
exception
xskipper.xskipper.
XskipperException
(desc, stackTrace)
-
xskipper.xskipper.
get_wrapped_function
(f)
IndexBuilder
-
class
xskipper.indexbuilder.
IndexBuilder
(spark, uri, xskipper) Helper class for building indexes
- Parameters
spark – SparkSession object
uri – the URI of the dataset / the identifier of the hive table on which the index is defined
xskipper – the xskipper instance associated with this IndexBuilder
-
addBloomFilterIndex
(col, keyMetadata=None) Adds a BloomFilter index on the given column
- Parameters
col – the column to add the index on
keyMetadata – optional key metadata
-
addCustomIndex
(indexClass, cols, params, keyMetadata=None) Adds a custom index on the given columns
- Parameters
cols – a sequence of columns
params – a map of index-specific parameters
keyMetadata – optional key metadata
-
addMinMaxIndex
(col, keyMetadata=None) Adds a MinMax index on the given column
- Parameters
col – the column to add the index on
keyMetadata – optional key metadata
-
addValueListIndex
(col, keyMetadata=None) Adds a ValueList index on the given column
- Parameters
col – the column to add the index on
keyMetadata – optional key metadata
-
build
(reader=None) Builds the index
- Parameters
reader – if uri in the xskipper instance is a table identifier, a DataFrameReader instance to enable reading the URI as a DataFrame. Note: the reader is assumed to have all of its parameters configured; reader.load(Seq(<path>)) will be used by the indexing code to read each object separately.
- Returns
dataFrame object containing statistics about the build operation
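The IndexBuilder methods above can be combined into a single build step, as in the sketch below. `build_basic_index` and its parameters are hypothetical names. The sketch assumes each add* call registers the index on the builder it is called on, which this reference does not state explicitly.

```python
def build_basic_index(xskipper, minmax_cols, valuelist_cols, reader=None):
    """Add MinMax and ValueList indexes on the given columns, then build.

    xskipper is an Xskipper instance; the return value is whatever
    IndexBuilder.build returns (documented as build statistics).
    """
    builder = xskipper.indexBuilder()
    for col in minmax_cols:
        builder.addMinMaxIndex(col)
    for col in valuelist_cols:
        builder.addValueListIndex(col)
    # reader is forwarded unchanged; None matches the documented default.
    return builder.build(reader)
```

In a PySpark session the reader might be, for example, `spark.read.format("parquet")`, with any further options set before it is passed in.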
Registration
-
class
xskipper.registration.
Registration
Helper class for registering factories and translators for xskipper plugins
-
static
addClauseTranslator
(sparkSession, clauseTranslator) Adds a ClauseTranslator to the start of the ClauseTranslator-s list
- Parameters
sparkSession – SparkSession object
clauseTranslator – fully qualified name of ClauseTranslator to be used
-
static
addIndexFactory
(sparkSession, indexFactory) Adds an IndexFactory to the start of the IndexFactory-s list
- Parameters
sparkSession – SparkSession object
indexFactory – fully qualified name of IndexFactory to be used
-
static
addMetaDataTranslator
(sparkSession, metadataTranslator) Adds a MetadataTranslator to the start of the MetadataTranslator-s list
- Parameters
sparkSession – SparkSession object
metadataTranslator – fully qualified name of MetadataTranslator to be used
-
static
addMetadataFilterFactory
(sparkSession, filterFactory) Adds a MetadataFilterFactory to the end of the MetadataFilterFactory-s list
- Parameters
sparkSession – SparkSession object
filterFactory – fully qualified name of MetadataFilterFactory to be used
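A plugin usually ships several of these classes together, so the four add* calls can be grouped in one helper. The sketch below is an illustration only: `register_plugin` is a hypothetical name, the class names passed to it are placeholders chosen by the caller, and the Registration class is passed in as `registration_cls` so the snippet runs without the package installed.

```python
def register_plugin(registration_cls, spark, index_factory=None,
                    filter_factory=None, metadata_translator=None,
                    clause_translator=None):
    """Register whichever plugin classes were given.

    Each argument is a fully qualified class name (a string).
    Returns the list of (kind, name) pairs that were registered.
    """
    registered = []
    if index_factory:
        registration_cls.addIndexFactory(spark, index_factory)
        registered.append(("indexFactory", index_factory))
    if filter_factory:
        registration_cls.addMetadataFilterFactory(spark, filter_factory)
        registered.append(("filterFactory", filter_factory))
    if metadata_translator:
        registration_cls.addMetaDataTranslator(spark, metadata_translator)
        registered.append(("metadataTranslator", metadata_translator))
    if clause_translator:
        registration_cls.addClauseTranslator(spark, clause_translator)
        registered.append(("clauseTranslator", clause_translator))
    return registered
```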
-
static
getActiveMetadataStoreManagerType
(sparkSession) Returns the fully qualified name of the active MetadataStoreManagerType
- Parameters
sparkSession – SparkSession object
- Returns
fully qualified name of the active MetadataStoreManagerType
-
static
setActiveMetadataStoreManager
(sparkSession, metadataStoreManager) Sets the currently active MetadataStoreManager
- Parameters
sparkSession – SparkSession object
metadataStoreManager – fully qualified name of MetadataStoreManagerType to be used
-
static