utils.expansion

This module contains methods to decode layer data from the strings given by certain features.

Many features in the Perovskite Database contain information about the device layers encoded using patterns:

Cell_stack_sequence: “SLG | FTO | TiO2-c | TiO2-mp | Perovskite | Spiro-MeOTAD | Au”
ETL_deposition_synthesis_atmosphere: “Vacuum | Vacuum >> Vacuum”

The decoding methods are used to expand this layer data into individual features. The Depth Threshold, one of the preprocessing hyperparameters, is used to restrict the sparsity of these features.

From the Perovskite database reference document:

### The vertical bar, i.e. (‘ | ‘)

If a field contains data for more than one layer, the data belonging to the different layers is separated by a vertical bar with a space on both sides, i.e. (‘ | ‘). Layers are sorted left to right with the substrate first, i.e. to the left.

### The semicolon, i.e. (‘ ; ‘)

If several materials, solvents, gases, etc. occur in one layer or during one reaction step, e.g. A and B, they are listed in alphabetic order and separated with semicolons, as in (A; B).

### The double forward angle bracket, i.e. (‘ >> ‘)

When a layer in a stack is deposited, there may be more than one reaction step involved. If that is the case, the information concerning the different reaction steps, e.g. A and B, is separated by a double forward angle bracket with one blank space on both sides, as in (‘A >> B‘).
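The three delimiters can be illustrated with a small standalone sketch. It is not the module's implementation: the helper name `split_encoded` is hypothetical, and the nesting order (layers, then reaction steps, then items) is an assumption taken from the reference text above. The module's own tree may nest the levels differently.

```python
def split_encoded(value, delims=("|", ">>", ";")):
    """Recursively split an encoded string on each delimiter in turn.

    Hypothetical helper: the nesting order (layers, then reaction steps,
    then items) is an assumption made for this sketch.
    """
    if not delims:
        # No delimiters left: this is a single item, so trim the padding.
        return value.strip()
    first, rest = delims[0], delims[1:]
    return [split_encoded(part, rest) for part in value.split(first)]


stack = "SLG | FTO | TiO2-c | TiO2-mp | Perovskite | Spiro-MeOTAD | Au"
atmosphere = "Vacuum | Vacuum >> Vacuum"

# Each layer holds a list of reaction steps; each step holds a list of items.
stack_layers = split_encoded(stack)
atmosphere_layers = split_encoded(atmosphere)
```

Here the second layer of `atmosphere_layers` contains two reaction steps, while every layer of `stack_layers` contains a single step with a single item.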

Functions

`expand_data`(name, data[, percent, verbosity])	Expands the encoded data from a column of data.
`expand_dataset`(data[, features, percent, ...])	Expands the encoded data from a dataset.
`expand_sort`(data, ref)	Expands a dataset and sorts a reference dictionary of the features by sparsity.
`extract_features`(seq[, delim])	Converts a sequence of encoded data into a list of features.
`feature_counts`(children[, count_none])	Finds the maximum layer number for each row of data and sums along the rows.
`feature_tree`(name, data[, parent, iter, ...])	Generates a tree of features from a column of encoded data.
`filter_children`(children, percent)	Filters children within a given percentile.
`generate_children`(data, delim)	Generates the child features for a column of data.
`generate_name`(leaf[, name])	Recursively generates a name for a leaf node.
`leaf_matrix`(root)	Generates a dataframe of the data from the leaf nodes of a tree.
`percent_index`(arr[, percent])	Finds the index within which a given percentage of the data falls.

utils.expansion.expand_data(name: str, data, percent: float = 1.0, verbosity: int = 0)

Expands the encoded data from a column of data.

Parameters:

name (str) – Name of the feature.
data (series) – The column of data.
percent (float, optional) – The percentile threshold. Defaults to 1.0.
verbosity (int, optional) – The verbosity level. Defaults to 0.

Returns:

The expanded data.

Return type:

dataframe

utils.expansion.expand_dataset(data, features=None, percent: float = 1.0, verbosity: int = 0)

Expands the encoded data from a dataset.

Iterates over each column to expand the data.

Parameters:

data (dataframe) – The dataset.
features (list, optional) – The list of features to expand. Defaults to None.
percent (float, optional) – The percentile threshold. Defaults to 1.0.
verbosity (int, optional) – The verbosity level. Defaults to 0.

Returns:

The expanded dataset.

Return type:

dataframe

utils.expansion.expand_sort(data, ref)

Expands a dataset and sorts a reference dictionary of the features by sparsity.

Detects and expands valid encoded features and passes the rest.

Parameters:

data (dataframe) – The dataset.
ref (dataframe) – The reference dictionary for the database.

Returns:

The expanded dataset and the sorted reference dictionary of features.

Return type:

dataframe, dict

utils.expansion.extract_features(seq, delim=';')

Converts a sequence of encoded data into a list of features.

Parameters:

seq (str) – The sequence of encoded data.
delim (str, optional) – The delimiter used to separate features (the feature-layer delimiter). Defaults to ‘;’.

Returns:

The list of features.

Return type:

list

utils.expansion.feature_counts(children, count_none=True)

Finds the maximum layer number for each row of data and sums along the rows.

In other words, it counts how many rows reach each maximum layer number.

Parameters:

children (dataframe) – The child features.
count_none (bool, optional) – Counts the 0th layer as a layer. Defaults to True.

Returns:

The counts for each layer.

Return type:

array
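The counting idea can be sketched without the module. This is an illustration only: `feature_counts_sketch` is a hypothetical name, the child features are modelled as a list of rows padded on the right with `None` instead of a dataframe, and the `count_none` option is not modelled because its exact behaviour is not described here.

```python
def feature_counts_sketch(children):
    """Tally rows by how many child features they populate.

    Hypothetical stand-in for a dataframe: `children` is a list of rows,
    each padded on the right with None to a common width.
    """
    width = len(children[0]) if children else 0
    counts = [0] * (width + 1)
    for row in children:
        # With right-padding, the populated count is the row's maximum depth.
        populated = sum(cell is not None for cell in row)
        counts[populated] += 1
    return counts


children = [
    ["SLG", "FTO", "TiO2-c"],
    ["SLG", "ITO", None],
    ["SLG", "FTO", None],
    [None, None, None],
]
counts = feature_counts_sketch(children)
```

In this sketch `counts[k]` is the number of rows whose deepest populated child is the k-th one, with `counts[0]` holding the rows that have no data at all.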

utils.expansion.feature_tree(name, data, parent=None, iter=0, max_iter=3, is_only=False, percent=1.0)

Generates a tree of features from a column of encoded data.

Parameters:

name (str) – The name of the feature.
data (series) – The column of data.
parent (Node, optional) – The parent node. Defaults to None.
iter (int, optional) – The current iteration. Defaults to 0.
max_iter (int, optional) – The maximum number of iterations. Defaults to 3, i.e. len(DEPTH_NAMES).
is_only (bool, optional) – If the current node is an only child. Defaults to False.
percent (float, optional) – The percentile threshold. Defaults to 1.0.

Returns:

The root node of the tree.

Return type:

Node

utils.expansion.filter_children(children, percent)

Filters children within a given percentile.

Parameters:

children (dataframe) – The child features.
percent (float) – The percentile threshold.

Returns:

The filtered child features.

Return type:

dataframe

utils.expansion.generate_children(data, delim)

Generates the child features for a column of data.

Each column of the generated dataframe is a child feature.

Parameters:

data (series) – The column of data.
delim (str) – The delimiter used to separate features.

Returns:

The child features.

Return type:

dataframe
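As a rough picture of what "each column is a child feature" means, the sketch below splits every value in a column on one delimiter and pads the shorter rows. It is an assumption-laden illustration, not the module's code: `generate_children_sketch` is a hypothetical name, plain lists stand in for the series and dataframe, and treating non-string values as missing is a choice made for the example.

```python
def generate_children_sketch(column, delim="|"):
    """Split each value on `delim` and pad rows to a common width.

    Hypothetical helper: returns a list of rows, where position i in a
    row plays the role of child feature (column) i.
    """
    rows = []
    for value in column:
        if isinstance(value, str):
            rows.append([part.strip() for part in value.split(delim)])
        else:
            # Assumption: non-string entries (e.g. missing data) have no children.
            rows.append([])
    width = max((len(row) for row in rows), default=0)
    return [row + [None] * (width - len(row)) for row in rows]


column = ["SLG | FTO | TiO2-c", "SLG | ITO", None]
children = generate_children_sketch(column)
```

The later child positions are populated by fewer rows, which is the sparsity that the percentile threshold is meant to limit.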

utils.expansion.generate_name(leaf, name=None)

Recursively generates a name for a leaf node.

Steps through the nodes to name the feature at the leaf node based on its depth. Examples of names at different depths are:

Depth 0: OriginalFeature_Layer_0
Depth 1: OriginalFeature_Layer_1_Feature_0
Depth 2: OriginalFeature_Layer_1_Feature_2_Deposition_1

Parameters:

leaf (Node) – The leaf node.
name (str, optional) – The name to set. Defaults to None.

Returns:

The name of the leaf node.

Return type:

str
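The naming scheme shown in the examples can be reproduced with a minimal sketch. The `SketchNode` class and the `DEPTH_NAMES` tuple below are hypothetical stand-ins inferred from those example names; the module's actual Node type and constants may differ.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed from the example names above; not confirmed by the module source.
DEPTH_NAMES = ("Layer", "Feature", "Deposition")


@dataclass
class SketchNode:
    """Hypothetical tree node: the root carries a name, children an index."""
    index: int = 0
    parent: Optional["SketchNode"] = None
    name: str = ""


def depth_of(node):
    """Number of edges between `node` and the root."""
    return 0 if node.parent is None else 1 + depth_of(node.parent)


def generate_name_sketch(node):
    """Recursively build a name from the root down to `node`."""
    if node.parent is None:
        return node.name
    level = DEPTH_NAMES[depth_of(node) - 1]
    return f"{generate_name_sketch(node.parent)}_{level}_{node.index}"


root = SketchNode(name="OriginalFeature")
layer = SketchNode(index=1, parent=root)
feature = SketchNode(index=2, parent=layer)
deposition = SketchNode(index=1, parent=feature)
leaf_name = generate_name_sketch(deposition)
```

Each recursion step appends one `<level>_<index>` segment, so the length of the name reflects the depth of the leaf.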

utils.expansion.leaf_matrix(root)

Generates a dataframe of the data from the leaf nodes of a tree.

The columns are labeled with the node names.

Parameters:

root (Node) – The root node of the tree.

Returns:

The dataframe of the data.

Return type:

dataframe

utils.expansion.percent_index(arr, percent=0.95)

Finds the index within which a given percentage of the data falls.

Parameters:

arr (array) – The bitarray which describes the sparsity of the data.
percent (float, optional) – The percentile threshold. Defaults to 0.95.

Returns:

The index within which the given percentage of the data falls.

Return type:

int
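One plausible reading of this function is a cumulative-share cutoff, sketched below. This is an interpretation, not the module's implementation: `percent_index_sketch` is a hypothetical name, and the sketch assumes `arr` holds one count per position (for example, how many rows reach each depth).

```python
def percent_index_sketch(counts, percent=0.95):
    """Return the first index whose cumulative share reaches `percent`.

    Hypothetical interpretation: `counts` is assumed to hold one
    non-negative count per position.
    """
    total = sum(counts)
    if total == 0:
        return 0
    running = 0
    for index, count in enumerate(counts):
        running += count
        if running / total >= percent:
            return index
    # Only reached when percent > 1.0; fall back to the last index.
    return len(counts) - 1


counts = [60, 30, 8, 2]
cutoff = percent_index_sketch(counts, percent=0.95)
```

With these counts the cumulative shares are 0.60, 0.90, 0.98 and 1.00, so the 0.95 threshold is first reached at the third position.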