utils.dataset

Functions

generate_dataset(target[, database, save])

Generates an expanded dataset for a given target feature.

generate_groups(target, group[, database])

Generates a list of groups for a given target feature and group feature.

generate_reference(refs)

Generates a database reference.

has_dataset(target[, in_path])

Checks if a dataset exists for a given target feature.

has_reference([path])

Checks if the database reference exists.

load_dataset(target[, in_path])

Loads the data and metadata for a given target feature.

load_reference([in_path])

Loads the database reference.

save_dataset(data, features, target[, out_path])

Saves the data and metadata for a given target feature.

save_reference(ref[, out_path])

Saves a database reference.

Classes

DataSet(target[, group_by])

Stores the expanded perovskite data and metadata for a given target feature.

class utils.dataset.DataSet(target, group_by=None)

Stores the expanded perovskite data and metadata for a given target feature.

target

Name of the target feature.

Type:

str

data

The expanded data.

Type:

dataframe

reference

The database reference.

Type:

dict

all_features

The full set of features for the expanded data.

Type:

dict

features

A reduced set of features generated during preprocessing.

Type:

dict

groups

A list of groups for the target feature.

Type:

list

collect_features()

Collects the features in the reduced set of features into a single list.

Returns:

The list of features.

Return type:

list
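A minimal sketch of what collecting features could look like, assuming the reduced features dict maps section names to lists of feature names. That layout is an assumption; the actual structure is not documented here. The helper collect_features_sketch and the sample section and feature names are hypothetical.

```python
def collect_features_sketch(features):
    """Flatten a {section: [feature, ...]} dict into one list, keeping order."""
    collected = []
    for section_features in features.values():
        collected.extend(section_features)
    return collected


# Hypothetical sections and feature names, for illustration only.
example_features = {
    "cell": ["cell_area", "cell_architecture"],
    "perovskite": ["perovskite_band_gap"],
}
flat = collect_features_sketch(example_features)
```

The flattened list can then be used to select columns from the data.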

get_Xy()

Returns the reduced data and the target series.

Returns:

The reduced data and the target series.

Return type:

tuple of (dataframe, series)
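The two return values are meant to be unpacked together. The sketch below illustrates that split with a plain dict of columns standing in for the dataframe; get_xy_sketch and the column names are hypothetical, and the real method operates on the stored data rather than on arguments.

```python
def get_xy_sketch(table, feature_names, target):
    """Split a {column: [values]} table into a feature table X and a target column y."""
    X = {name: table[name] for name in feature_names}
    y = table[target]
    return X, y


# Hypothetical columns; "efficiency" plays the role of the target feature.
table = {
    "thickness": [300, 450, 500],
    "anneal_temp": [100, 120, 150],
    "efficiency": [15.2, 17.8, 16.1],
}
X, y = get_xy_sketch(table, ["thickness", "anneal_temp"], "efficiency")
```

Here X holds only the selected feature columns and y holds the target column.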

get_dataset(database=<utils.database.PerovskiteData object>, save: bool = True)

Loads the data and metadata for a given target feature.

Stores the data and metadata in the class attributes.

Parameters:
  • database (PerovskiteData, optional) – An instance of the perovskite database. Defaults to DATABASE.

  • save (bool, optional) – Whether to save the dataset. Defaults to True.

Returns:

None

preprocess(threshold=None, exclude_sections=[], exclude_cols=[])

Preprocesses the data.

If no preprocessed dataset has been generated for the target yet, it is generated and saved for future use. Otherwise, the previously generated file is loaded and returned instead.

Parameters:
  • threshold (float, optional) – The sparsity threshold. Defaults to None.

  • exclude_sections (list, optional) – The list of sections to exclude. Defaults to [].

  • exclude_cols (list, optional) – The list of columns to exclude. Defaults to [].

Returns:

The preprocessed data and the target series.

Return type:

tuple of (dataframe, series)
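The generate-once, load-afterwards behaviour described for preprocess can be sketched as a small file cache. This is only an illustration under stated assumptions: load_or_generate, the "<target>.json" file name, and the JSON format are hypothetical and say nothing about how the module actually stores its files.

```python
import json
import tempfile
from pathlib import Path


def load_or_generate(target, cache_dir, generate):
    """Return cached data for `target`, generating and saving it on first use."""
    cache_file = Path(cache_dir) / f"{target}.json"  # hypothetical file name
    if cache_file.exists():
        # Seen target: reuse the previously generated file.
        return json.loads(cache_file.read_text())
    # Unseen target: generate, then save for future calls.
    data = generate(target)
    cache_file.write_text(json.dumps(data))
    return data


calls = []


def fake_generate(target):
    """Stand-in for the expensive generation step; records each call."""
    calls.append(target)
    return {"target": target, "rows": [1, 2, 3]}


with tempfile.TemporaryDirectory() as tmp:
    first = load_or_generate("example_target", tmp, fake_generate)
    second = load_or_generate("example_target", tmp, fake_generate)
```

The second call returns the same data without invoking the generator again.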

prune_by_sparsity(threshold)

Prunes the reduced set of features by sparsity.

Parameters:

threshold (float) – The sparsity threshold.

Returns:

None
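As a rough illustration of sparsity-based pruning, the sketch below treats a feature's sparsity as its fraction of missing values and keeps features at or below the threshold. Both that definition and the direction of the comparison are assumptions, since the entry above does not specify them; sparsity and prune_by_sparsity_sketch are hypothetical helpers, and a dict of columns stands in for the dataframe.

```python
def sparsity(values):
    """Fraction of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)


def prune_by_sparsity_sketch(columns, threshold):
    """Keep only columns whose missing fraction is at most `threshold`."""
    return {
        name: values
        for name, values in columns.items()
        if sparsity(values) <= threshold
    }


columns = {
    "dense": [1.0, 2.0, 3.0, 4.0],      # no missing values
    "half": [1.0, None, 3.0, None],     # half missing
    "sparse": [None, None, None, 4.0],  # three quarters missing
}
kept = prune_by_sparsity_sketch(columns, threshold=0.5)
```

With a threshold of 0.5, only the column that is three quarters empty is dropped.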

remove(sections=[], features=[])

Removes both sections and features from the reduced set of features.

Parameters:
  • sections (list, optional) – The list of sections to remove. Defaults to [].

  • features (list, optional) – The list of features to remove. Defaults to [].

Returns:

None

remove_features(features)

Removes features from the reduced set of features.

Parameters:

features (list) – The list of features to remove.

Returns:

None

remove_sections(sections)

Removes an entire section of features from the reduced set of features.

Parameters:

sections (list) – The list of sections to remove.

Returns:

None

reset_features()

Resets the reduced set of features to the full set of features.

Returns:

None
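The relationship between the full and reduced feature sets can be sketched with a small stand-in class. It assumes both sets are dicts mapping section names to lists of feature names, which is not confirmed by this reference; FeatureSetSketch and the sample names are hypothetical.

```python
import copy


class FeatureSetSketch:
    """Hypothetical stand-in tracking a full and a reduced feature dict."""

    def __init__(self, all_features):
        self.all_features = all_features
        # Work on a copy so the full set is never modified.
        self.features = copy.deepcopy(all_features)

    def remove_sections(self, sections):
        """Drop whole sections; unknown section names are ignored."""
        for section in sections:
            self.features.pop(section, None)

    def remove_features(self, features):
        """Drop individual features from every remaining section."""
        to_drop = set(features)
        self.features = {
            section: [name for name in names if name not in to_drop]
            for section, names in self.features.items()
        }

    def reset_features(self):
        """Restore the reduced set to the full set."""
        self.features = copy.deepcopy(self.all_features)


fs = FeatureSetSketch({"cell": ["area", "stack"], "ref": ["doi"]})
fs.remove_sections(["ref"])
fs.remove_features(["stack"])
reduced = copy.deepcopy(fs.features)
fs.reset_features()
```

Copying the full set on construction and on reset keeps removals from leaking into all_features.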

utils.dataset.generate_dataset(target, database=<utils.database.PerovskiteData object>, save=True)

Generates an expanded dataset for a given target feature.

Parameters:
  • target (str) – Name of the target feature.

  • database (PerovskiteData, optional) – An instance of the perovskite database. Defaults to DATABASE.

  • save (bool, optional) – Whether to save the dataset. Defaults to True.

Returns:

The data, the data features, and the database reference.

Return type:

tuple of (dataframe, dict, dict)

utils.dataset.generate_groups(target, group, database=<utils.database.PerovskiteData object>)

Generates a list of groups for a given target feature and group feature.

Parameters:
  • target (str) – Name of the target feature.

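  • group (str) – Name of the feature to group by.

  • database (PerovskiteData, optional) – An instance of the perovskite database. Defaults to DATABASE.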

Returns:

The list of groups.

Return type:

list
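One plausible reading of the groups list is one label per row, with rows that share a value of the group feature receiving the same label, as used for grouped cross-validation. The sketch below assumes that reading; generate_groups_sketch and the sample values are hypothetical, and the role the target plays in the real function is not shown.

```python
def generate_groups_sketch(group_values):
    """Map each row's group value to an integer label, in order of first appearance."""
    labels = {}
    # setdefault returns the existing label, or stores and returns the next free one.
    return [labels.setdefault(value, len(labels)) for value in group_values]


# Hypothetical group feature: the publication each row came from.
row_sources = ["paper_a", "paper_b", "paper_a", "paper_c", "paper_b"]
groups = generate_groups_sketch(row_sources)
```

Rows from the same source end up with the same integer label, so a grouped splitter can keep them in the same fold.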

utils.dataset.generate_reference(refs)

Generates a database reference.

Parameters:

refs (dataframe) – An expanded representation of the database reference.

Returns:

The reference.

Return type:

dict

utils.dataset.has_dataset(target, in_path=EXPAND_DIR)

Checks if a dataset exists for a given target feature.

Parameters:
  • target (str) – Name of the target feature.

  • in_path (str, optional) – Path to the directory containing the dataset. Defaults to EXPAND_DIR.

Returns:

True if the dataset exists, False otherwise.

Return type:

bool
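An existence check of this kind usually reduces to testing for a file on disk. The sketch below assumes one "<target>.csv" file per target inside the directory; that naming scheme and the helper has_dataset_sketch are hypothetical (the actual file structure is described in the README.md in ./data).

```python
import tempfile
from pathlib import Path


def has_dataset_sketch(target, in_path):
    """Return True if a file for `target` exists in `in_path`."""
    # Hypothetical naming scheme: one "<target>.csv" file per target.
    return (Path(in_path) / f"{target}.csv").is_file()


with tempfile.TemporaryDirectory() as tmp:
    missing = has_dataset_sketch("example_target", tmp)
    (Path(tmp) / "example_target.csv").write_text("a,b\n1,2\n")
    present = has_dataset_sketch("example_target", tmp)
```

A check like this lets a caller decide whether to load an existing dataset or generate a new one.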

utils.dataset.has_reference(path=EXPAND_DIR)

Checks if the database reference exists.

Parameters:

path (str, optional) – Path to the directory containing the reference. Defaults to EXPAND_DIR.

Returns:

True if the reference exists, False otherwise.

Return type:

bool

utils.dataset.load_dataset(target: str, in_path=EXPAND_DIR)

Loads the data and metadata for a given target feature.

Parameters:
  • target (str) – Name of the target feature.

  • in_path (str, optional) – Path to the directory to load the dataset. Defaults to EXPAND_DIR.

Returns:

The data and the features.

Return type:

tuple of (dataframe, dict)

utils.dataset.load_reference(in_path=EXPAND_DIR)

Loads the database reference.

Parameters:

in_path (str, optional) – Path to the directory to load the reference. Defaults to EXPAND_DIR.

Returns:

The reference.

Return type:

dict

utils.dataset.save_dataset(data, features: dict, target: str, out_path=EXPAND_DIR)

Saves the data and metadata for a given target feature.

See the README.md in ./data for more information about the file structure.

Parameters:
  • data (dataframe) – The data.

  • features (dict) – The features.

  • target (str) – Name of the target feature.

  • out_path (str, optional) – Path to the directory to save the dataset. Defaults to EXPAND_DIR.

Returns:

None
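The save and load functions form a round trip: what save_dataset writes for a target, load_dataset reads back as data plus features. The sketch below illustrates that pairing with the standard library only. The file names, the CSV and JSON formats, and the use of a list of row dicts in place of a dataframe are all assumptions; save_dataset_sketch and load_dataset_sketch are hypothetical helpers.

```python
import csv
import json
import tempfile
from pathlib import Path


def save_dataset_sketch(rows, features, target, out_path):
    """Write rows to "<target>.csv" and features to "<target>_features.json"."""
    out_dir = Path(out_path)
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(out_dir / f"{target}.csv", "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
    (out_dir / f"{target}_features.json").write_text(json.dumps(features))


def load_dataset_sketch(target, in_path):
    """Read back the rows and features written by save_dataset_sketch."""
    in_dir = Path(in_path)
    with open(in_dir / f"{target}.csv", newline="") as handle:
        rows = list(csv.DictReader(handle))
    features = json.loads((in_dir / f"{target}_features.json").read_text())
    return rows, features


# CSV stores everything as text, so string values keep the round trip exact.
rows = [{"area": "0.1", "score": "15.2"}, {"area": "0.2", "score": "17.8"}]
features = {"cell": ["area"]}

with tempfile.TemporaryDirectory() as tmp:
    save_dataset_sketch(rows, features, "example_target", tmp)
    loaded_rows, loaded_features = load_dataset_sketch("example_target", tmp)
```

Keeping the data and the feature metadata in separate files per target mirrors the data-plus-metadata split described above, though the real layout may differ.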

utils.dataset.save_reference(ref, out_path=EXPAND_DIR)

Saves a database reference.

See the README.md in ./data for more information about the file structure.

Parameters:
  • ref (dict) – The reference.

  • out_path (str, optional) – Path to the directory to save the reference. Defaults to EXPAND_DIR.

Returns:

None