processors¶
The astrodata.preml.processors
module provides a framework to perform advanced preprocessing and transformation steps on your data after it has been loaded and split into training, testing, and (optionally) validation sets. Similar to the astrodata.data.processors
module, it is built around an extensible interface that lets users compose preprocessing workflows by chaining multiple operations together.
Abstract Class¶
The core of the astrodata.preml.processors
module is the PremlProcessor
abstract base class, which defines the interface for all preml processors. Each processor is responsible for transforming a Premldata
object, and can optionally save or load artifacts (such as fitted encoders or imputers) to ensure reproducibility and consistency across different runs.
How to Use¶
Creating a PremlProcessor¶
from astrodata.preml import Premldata, PremlProcessor

class CustomProcessor(PremlProcessor):
    def process(self, preml: Premldata) -> Premldata:
        # Transform the input Premldata and return the result.
        # Placeholder: returns the input unchanged.
        return preml
Subclasses must implement the process method, which takes a Premldata object and returns a transformed Premldata.
Artifacts (such as fitted parameters) can be saved in the process method with PremlProcessor.save_artifact(), and a processor can be initialized with an existing artifact.
Processors can be chained together in a pipeline for complex preprocessing workflows.
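The chaining idea can be sketched without the library. Everything in the block below (`ToyData`, `Shift`, `Scale`, `run_chain`) is a hypothetical stand-in and not part of astrodata's API; it only mirrors the shape described above, where each step's `process` method receives the output of the previous step.

```python
from dataclasses import dataclass


@dataclass
class ToyData:
    """Hypothetical stand-in for Premldata: train and test feature values."""
    train: list
    test: list


class Shift:
    """Toy processor that adds a constant to every value."""

    def __init__(self, offset):
        self.offset = offset

    def process(self, data):
        return ToyData(
            train=[x + self.offset for x in data.train],
            test=[x + self.offset for x in data.test],
        )


class Scale:
    """Toy processor that multiplies every value by a constant."""

    def __init__(self, factor):
        self.factor = factor

    def process(self, data):
        return ToyData(
            train=[x * self.factor for x in data.train],
            test=[x * self.factor for x in data.test],
        )


def run_chain(processors, data):
    """Apply each processor in order, feeding each result to the next step."""
    for processor in processors:
        data = processor.process(data)
    return data


result = run_chain([Shift(1.0), Scale(2.0)], ToyData(train=[1.0, 2.0], test=[3.0]))
print(result)
```

Because every step takes and returns the same kind of object, steps can be reordered or swapped without changing the code that runs the chain.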
Using built-in Processors¶
The astrodata.preml.processors
module provides several built-in processors for common preprocessing tasks:
TrainTestSplitter¶
Splits your dataset into training, testing, and optionally validation sets. You can specify target columns, test size, random state, and validation split. The output is a Premldata
object containing the split datasets and metadata.
from astrodata.preml import TrainTestSplitter
splitter = TrainTestSplitter(
    targets=["target_column"],
    test_size=0.2,
    random_state=42,
    validation={"enabled": True, "size": 0.1},
)
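How `test_size` and the validation `size` translate into row counts is not spelled out here. The sketch below is a plain-Python illustration, not the TrainTestSplitter implementation. It assumes both values are fractions and that the validation fraction is taken from the rows left after the test split; the library may define this differently. The helper `split_indices` is hypothetical.

```python
import random


def split_indices(n_rows, test_size, val_size, seed):
    """Return (train, val, test) index lists for a dataset with n_rows rows."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable

    n_test = round(n_rows * test_size)
    test = indices[:n_test]
    remaining = indices[n_test:]

    # Assumption: the validation fraction applies to the non-test rows.
    n_val = round(len(remaining) * val_size)
    val = remaining[:n_val]
    train = remaining[n_val:]
    return train, val, test


train, val, test = split_indices(n_rows=100, test_size=0.2, val_size=0.1, seed=42)
print(len(train), len(val), len(test))
```

Fixing the seed plays the same role as `random_state`: the same rows land in the same subsets on every run.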
OHE (OneHotEncoder)¶
One-hot encodes specified categorical columns, optionally retaining numerical columns. Artifacts can be saved for reproducibility.
from astrodata.preml import OHE
encoder = OHE(
    categorical_columns=["cat1", "cat2"],
    numerical_columns=["num1", "num2"],
)
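For readers unfamiliar with one-hot encoding, here is a minimal plain-Python sketch of the idea. It does not use the OHE class; `fit_categories` and `one_hot` are hypothetical helpers. The category list learned from the training values is the kind of fitted state that a saved artifact could hold, which is an assumption about what the artifact contains.

```python
def fit_categories(values):
    """Learn the sorted list of distinct categories from training values."""
    return sorted(set(values))


def one_hot(values, categories):
    """Encode each value as a 0/1 row with one slot per known category.

    Values that were not seen during fitting become an all-zero row.
    """
    return [[1 if value == cat else 0 for cat in categories] for value in values]


train_colors = ["red", "blue", "red", "green"]
test_colors = ["green", "purple"]

categories = fit_categories(train_colors)
train_encoded = one_hot(train_colors, categories)
# The test set reuses the categories learned from the training set.
test_encoded = one_hot(test_colors, categories)

print(categories)
print(train_encoded)
print(test_encoded)
```

Reusing the training categories keeps the encoded columns identical across subsets, even when the test set contains a value the training set never had.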
MissingImputator¶
Imputes missing values in numerical columns (using the mean) and categorical columns (using the most frequent value).
from astrodata.preml import MissingImputator
imputer = MissingImputator(
    numerical_columns=["num1", "num2"],
    categorical_columns=["cat1", "cat2"],
)
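The two fill strategies named above can be illustrated in plain Python. This is a sketch of the technique, not the MissingImputator code: `fit_fill_values` and `fill` are hypothetical helpers, and missing entries are represented as `None` purely for the example.

```python
from collections import Counter
from statistics import mean


def fit_fill_values(numeric, categorical):
    """Compute fill values from the observed (non-None) training entries."""
    num_fill = mean(x for x in numeric if x is not None)
    counts = Counter(c for c in categorical if c is not None)
    cat_fill = counts.most_common(1)[0][0]
    return num_fill, cat_fill


def fill(values, fill_value):
    """Replace every None with the given fill value."""
    return [fill_value if v is None else v for v in values]


train_num = [1.0, None, 3.0, 5.0]
train_cat = ["a", "b", None, "a"]

num_fill, cat_fill = fit_fill_values(train_num, train_cat)
filled_num = fill(train_num, num_fill)
filled_cat = fill(train_cat, cat_fill)

print(filled_num)
print(filled_cat)
```

Computing the fill values once and reusing them on other subsets avoids leaking test-set statistics into preprocessing.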
Standardizer¶
Standardizes numerical columns to have zero mean and unit variance. Useful for scaling features before machine learning.
from astrodata.preml import Standardizer
scaler = Standardizer(
    numerical_columns=["num1", "num2"],
)
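As a reminder of what standardization computes, the sketch below applies z = (x - mean) / std to a small list. It is not the Standardizer implementation: `fit_scaler` and `standardize` are hypothetical helpers, and the use of the population standard deviation is an assumption, since the text does not say which estimator the library uses.

```python
from statistics import fmean, pstdev


def fit_scaler(values):
    """Return the mean and population standard deviation of the training values."""
    return fmean(values), pstdev(values)


def standardize(values, mu, sigma):
    """Shift by the mean and divide by the standard deviation."""
    return [(v - mu) / sigma for v in values]


train_values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
test_values = [5.0, 11.0]

mu, sigma = fit_scaler(train_values)
train_scaled = standardize(train_values, mu, sigma)
# The test values are scaled with the statistics learned from the training values.
test_scaled = standardize(test_values, mu, sigma)

print(mu, sigma)
print(test_scaled)
```

After scaling, the training values have zero mean and unit variance; the test values generally will not, because they are scaled with the training statistics.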