processors

The astrodata.preml.processors module provides a framework to perform advanced preprocessing and transformation steps on your data after it has been loaded and split into training, testing, and (optionally) validation sets. Similar to the astrodata.data.processors module, it is built around an extensible interface that allows users to compose preprocessing workflows chaining multiple operations together.

Abstract Class

The core of the astrodata.preml.processors module is the PremlProcessor abstract base class, which defines the interface for all preml processors. Each processor is responsible for transforming a Premldata object, and can optionally save or load artifacts (such as fitted encoders or imputers) to ensure reproducibility and consistency across different runs.

How to Use

Creating a Premlprocessor

from astrodata.preml import Premldata, PremlProcessor

class CustomProcessor(PremlProcessor):
    def process(self, preml: Premldata) -> Premldata:
        # Transform the input Premldata and return the result.
  • Subclasses must implement the process method, which takes a Premldata object and returns a transformed Premldata.

  • Artifacts (such as fitted parameters) can be saved in the process method with PremlProcessor.save_artifact(), and a processor can be initialized with an existing artifact.

  • Processors can be chained together in a pipeline for complex preprocessing workflows.

Using built-in Processors

The astrodata.preml.processors module provides several built-in processors for common preprocessing tasks:

TrainTestSplitter

Splits your dataset into training, testing, and optionally validation sets. You can specify target columns, test size, random state, and validation split. The output is a Premldata object containing the split datasets and metadata.

from astrodata.preml import TrainTestSplitter

splitter = TrainTestSplitter(
    targets=["target_column"],
    test_size=0.2,
    random_state=42,
    validation={"enabled": True, "size": 0.1},
)

OHE (OneHotEncoder)

One-hot encodes specified categorical columns, optionally retaining numerical columns. Artifacts can be saved for reproducibility.

from astrodata.preml import OHE

encoder = OHE(
    categorical_columns=["cat1", "cat2"],
    numerical_columns=["num1", "num2"],
)

MissingImputator

Imputes missing values in numerical columns (using the mean) and categorical columns (using the most frequent value).

from astrodata.preml import MissingImputator

imputer = MissingImputator(
    numerical_columns=["num1", "num2"],
    categorical_columns=["cat1", "cat2"],
)

Standardizer

Standardizes numerical columns to have zero mean and unit variance. Useful for scaling features before machine learning.

from astrodata.preml import Standardizer

scaler = Standardizer(
    numerical_columns=["num1", "num2"],
)