astrodata.preml.processors package

Submodules

astrodata.preml.processors.MissingImputator module

class astrodata.preml.processors.MissingImputator.MissingImputator(categorical_columns=None, numerical_columns=None, artifact_path=None)

Bases: PremlProcessor

Missing-value imputer for handling missing data in datasets.

This class imputes missing values in numerical columns using the mean and in categorical columns using the mode. It supports saving and loading imputation artifacts for reuse.

process(preml)

Imputes missing values in the dataset.

This method imputes missing values in numerical columns using the mean and in categorical columns using the mode. If an artifact path is provided, it loads the imputation artifact and applies it to the test features. Otherwise, it fits new imputers on the training features, transforms both training and test features, and saves the artifact for reuse.

Parameters:
  • preml (Premldata) – The data to be processed.

  • artifact (Optional[str]) – Path to a saved imputation artifact.

Returns:

The processed data with imputed values.

Return type:

Premldata
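The fit-on-train, apply-to-test flow described above can be sketched with the standard library alone. This is a minimal illustration, not the MissingImputator implementation: the helper names fit_imputer and apply_imputer, the row-of-dicts layout, and the column names are all assumptions made for the example.

```python
from statistics import mean, mode


def fit_imputer(rows, numerical_columns, categorical_columns):
    """Learn one fill value per column from the training rows only."""
    fill = {}
    for col in numerical_columns:
        # Mean of the observed (non-missing) values.
        fill[col] = mean(r[col] for r in rows if r[col] is not None)
    for col in categorical_columns:
        # Most frequent observed value.
        fill[col] = mode(r[col] for r in rows if r[col] is not None)
    return fill


def apply_imputer(rows, fill):
    """Return copies of the rows with missing values replaced."""
    return [
        {col: (fill[col] if col in fill and val is None else val)
         for col, val in r.items()}
        for r in rows
    ]


# Hypothetical data: None marks a missing value.
train = [
    {"mag": 10.0, "band": "g"},
    {"mag": None, "band": "g"},
    {"mag": 14.0, "band": None},
    {"mag": 12.0, "band": "r"},
]
test = [{"mag": None, "band": None}]

fill_values = fit_imputer(train, ["mag"], ["band"])  # plays the role of the artifact
train_imputed = apply_imputer(train, fill_values)
test_imputed = apply_imputer(test, fill_values)
```

The fill values are computed from the training rows only, so the test rows never influence the statistics used to fill them.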

astrodata.preml.processors.Ohe module

class astrodata.preml.processors.Ohe.OHE(categorical_columns=None, numerical_columns=None, artifact_path=None)

Bases: PremlProcessor

OneHotEncoder (OHE) processor for encoding categorical features.

This class provides functionality to one-hot encode categorical columns in a dataset, while optionally retaining numerical columns. It supports saving and loading encoding artifacts using pickle for reuse.

process(preml)

One-hot encodes categorical features in the data.

This method encodes categorical columns in the dataset using one-hot encoding. If an artifact path is provided, it loads the encoding artifact and applies it to the test features. Otherwise, it fits a new encoder on the training features, transforms both training and test features, and saves the artifact for reuse.

Parameters:
  • preml (Premldata) – The data to be processed.

  • artifact (Optional[str]) – Path to a saved one-hot encoding artifact.

Returns:

The processed data with one-hot encoded features.

Return type:

Premldata
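As a rough sketch of the behaviour described above, the example below fits a one-hot encoding on training rows, round-trips the fitted categories through pickle, and applies the restored artifact to test rows. It is not the OHE implementation: fit_encoder, apply_encoder, the row-of-dicts layout, and the choice to encode unseen categories as all zeros are assumptions made for the example.

```python
import pickle


def fit_encoder(rows, categorical_columns):
    """Record the sorted categories seen in each column of the training rows."""
    return {col: sorted({r[col] for r in rows}) for col in categorical_columns}


def apply_encoder(rows, categories):
    """Replace each categorical column with one 0/1 column per known category."""
    encoded = []
    for r in rows:
        # Columns that are not encoded (for example numerical ones) pass through.
        out = {k: v for k, v in r.items() if k not in categories}
        for col, values in categories.items():
            for value in values:
                out[f"{col}_{value}"] = 1 if r[col] == value else 0
        encoded.append(out)
    return encoded


# Hypothetical data; "i" appears only in the test rows.
train = [{"band": "g", "mag": 10.0}, {"band": "r", "mag": 12.0}]
test = [{"band": "r", "mag": 11.0}, {"band": "i", "mag": 13.0}]

categories = fit_encoder(train, ["band"])
artifact_bytes = pickle.dumps(categories)  # what a saved artifact might hold
restored = pickle.loads(artifact_bytes)

train_encoded = apply_encoder(train, categories)
test_encoded = apply_encoder(test, restored)
```

Because the categories come from the training rows, training and test outputs share the same columns. How the actual processor treats categories that appear only in the test features is not stated here.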

astrodata.preml.processors.Standardizer module

class astrodata.preml.processors.Standardizer.Standardizer(numerical_columns=None, artifact_path=None, save_path=None)

Bases: PremlProcessor

Standardizer for scaling numerical features.

This class provides functionality to standardize numerical columns in a dataset by scaling them to have a mean of 0 and a standard deviation of 1. It supports saving and loading scaling artifacts for reuse.

process(preml)

Standardizes numerical features in the dataset.

This method scales numerical columns to have a mean of 0 and a standard deviation of 1. If an artifact path is provided, it loads the scaling artifact and applies it to the test features. Otherwise, it fits a new scaler on the training features, transforms both training and test features, and saves the artifact for reuse.

Parameters:

preml (Premldata) – The data to be processed.

Returns:

The processed data with standardized numerical features.

Return type:

Premldata
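The scaling step can be illustrated as follows: each numerical column is shifted by the training mean and divided by the training standard deviation, and the same parameters are then reused on the test features. This is a sketch under assumptions, not the Standardizer implementation: fit_scaler and apply_scaler are hypothetical helpers, and the use of the population standard deviation is a choice made for the example.

```python
from statistics import fmean, pstdev


def fit_scaler(rows, numerical_columns):
    """Compute (mean, std) per column from the training rows only."""
    params = {}
    for col in numerical_columns:
        values = [r[col] for r in rows]
        # Fall back to 1.0 for a constant column to avoid dividing by zero.
        params[col] = (fmean(values), pstdev(values) or 1.0)
    return params


def apply_scaler(rows, params):
    """Return copies of the rows with the fitted columns standardized."""
    return [
        {col: ((val - params[col][0]) / params[col][1] if col in params else val)
         for col, val in r.items()}
        for r in rows
    ]


# Hypothetical data with one numerical and one untouched column.
train = [
    {"flux": 2.0, "band": "g"},
    {"flux": 4.0, "band": "r"},
    {"flux": 6.0, "band": "g"},
]
test = [{"flux": 4.0, "band": "r"}]

scaler_params = fit_scaler(train, ["flux"])  # plays the role of the artifact
train_scaled = apply_scaler(train, scaler_params)
test_scaled = apply_scaler(test, scaler_params)
```

After scaling, the training column has mean 0 and standard deviation 1. The test column generally does not, because it is scaled with the training parameters rather than its own.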

astrodata.preml.processors.TrainTestSplitter module

class astrodata.preml.processors.TrainTestSplitter.TrainTestSplitter(**kwargs)

Bases: PremlProcessor

Processor to convert ProcessedData to Premldata.

This processor splits the input ProcessedData into training, testing, and optionally validation sets according to the configuration provided. It supports specifying target columns, test size, random state, and validation split. The output is a Premldata object containing the split datasets and metadata.

process(data, **kwargs)

Converts a ProcessedData object to a Premldata object.

This method splits the input ProcessedData into training, testing, and optionally validation sets using scikit-learn’s train_test_split. The configuration determines the target columns, test size, random state, and validation split. The resulting Premldata object contains the split features, targets, and metadata.

Parameters:

data (ProcessedData) – The input processed data to be split.

Returns:

The resulting Premldata object containing the split datasets.

Return type:

Premldata
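To make the splitting step concrete, the sketch below shuffles row indices with a seeded generator, holds out a fraction of the rows for testing, and then separates target columns from feature columns. It deliberately uses the standard library in place of scikit-learn's train_test_split so that it runs without dependencies. The helpers split_rows and separate, the row-of-dicts layout, and the column names are assumptions made for the example, not part of TrainTestSplitter.

```python
import random


def split_rows(rows, test_size=0.25, random_state=None):
    """Shuffle row indices with a seeded generator and hold out a test fraction."""
    indices = list(range(len(rows)))
    random.Random(random_state).shuffle(indices)
    n_test = max(1, round(len(rows) * test_size))
    test_idx = set(indices[:n_test])
    train = [r for i, r in enumerate(rows) if i not in test_idx]
    test = [r for i, r in enumerate(rows) if i in test_idx]
    return train, test


def separate(rows, targets):
    """Split each row into feature columns and target columns."""
    features = [{k: v for k, v in r.items() if k not in targets} for r in rows]
    labels = [{k: r[k] for k in targets} for r in rows]
    return features, labels


# Hypothetical data: two feature columns and one target column.
rows = [{"ra": float(i), "dec": float(-i), "label": i % 2} for i in range(8)]

train_rows, test_rows = split_rows(rows, test_size=0.25, random_state=42)
X_train, y_train = separate(train_rows, ["label"])
X_test, y_test = separate(test_rows, ["label"])
```

Fixing random_state makes the split reproducible. A validation set could be carved out the same way by calling split_rows a second time on the training rows.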

astrodata.preml.processors.base module

class astrodata.preml.processors.base.PremlProcessor(artifact_path=None, **kwargs)

Bases: ABC

An abstract base class for preml processors.

Subclasses must implement the process method to define how the input Premldata is processed.

process(preml: Premldata) -> Premldata: Abstract method to process the input Premldata and return a new Premldata object.

load_artifact(path)

Loads an artifact from a specified path.

Parameters:

path (str) – The path from which the artifact is loaded.

abstractmethod process(preml, artifact=None, **kwargs)

Processes the input Premldata and returns a new Premldata object.

Return type:

Premldata

save_artifact(artifact)

Saves an artifact to a specified path.

Parameters:

artifact (Any) – The artifact to be saved, which can be any object.
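The pattern this base class describes (an abstract process method plus helpers for saving and loading artifacts) might look roughly like the sketch below. It is not the PremlProcessor source: SketchProcessor, MeanCenterer, the save_path attribute, the use of pickle in the helpers, and the plain dict used in place of Premldata are all assumptions made for the example.

```python
import os
import pickle
import tempfile
from abc import ABC, abstractmethod


class SketchProcessor(ABC):
    """Hypothetical stand-in for a processor base class."""

    def __init__(self, artifact_path=None, save_path=None):
        self.artifact_path = artifact_path  # where to load a fitted artifact from
        self.save_path = save_path          # assumed: where to write a new one

    def save_artifact(self, artifact):
        with open(self.save_path, "wb") as fh:
            pickle.dump(artifact, fh)

    def load_artifact(self, path):
        with open(path, "rb") as fh:
            return pickle.load(fh)

    @abstractmethod
    def process(self, data):
        """Subclasses define how the data is transformed."""


class MeanCenterer(SketchProcessor):
    """Toy subclass: subtracts the training mean from the values."""

    def process(self, data):
        if self.artifact_path is not None:
            # Reuse a previously fitted artifact and apply it to the test values.
            center = self.load_artifact(self.artifact_path)
            train = data["train"]
        else:
            # Fit on the training values, transform them, and save the artifact.
            center = sum(data["train"]) / len(data["train"])
            self.save_artifact(center)
            train = [x - center for x in data["train"]]
        test = [x - center for x in data["test"]]
        return {"train": train, "test": test}


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "center.pkl")
    fitted = MeanCenterer(save_path=path).process(
        {"train": [1.0, 3.0], "test": [4.0]}
    )
    reused = MeanCenterer(artifact_path=path).process(
        {"train": [], "test": [10.0]}
    )
```

The first call fits and saves the artifact; the second loads it and applies the stored mean to new test values without refitting.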

Module contents