# processors

The `astrodata.preml.processors` module provides a framework for advanced preprocessing and transformation steps on your data after it has been loaded and split into training, testing, and (optionally) validation sets. Like the `astrodata.data.processors` module, it is built around an extensible interface that lets users compose preprocessing workflows by chaining multiple operations together.

## Abstract Class

The core of the `astrodata.preml.processors` module is the `PremlProcessor` abstract base class, which defines the interface for all preml processors. Each processor is responsible for transforming a `Premldata` object and can optionally save or load artifacts (such as fitted encoders or imputers) to ensure reproducibility and consistency across runs.

## How to Use

### Creating a PremlProcessor

```python
from astrodata.preml import Premldata, PremlProcessor


class CustomProcessor(PremlProcessor):
    def process(self, preml: Premldata) -> Premldata:
        # Transform the input Premldata and return the result.
        ...
```

- Subclasses must implement the `process` method, which takes a `Premldata` object and returns a transformed `Premldata`.
- Artifacts (such as fitted parameters) can be saved inside `process` with `PremlProcessor.save_artifact()`, and a processor can be initialized with an existing artifact.
- Processors can be chained together in a pipeline for complex preprocessing workflows (see "Chaining Processors" below).

### Using Built-in Processors

The `astrodata.preml.processors` module provides several built-in processors for common preprocessing tasks:

#### TrainTestSplitter

Splits your dataset into training, testing, and optionally validation sets. You can specify target columns, test size, random state, and validation split. The output is a `Premldata` object containing the split datasets and metadata.

```python
from astrodata.preml import TrainTestSplitter

splitter = TrainTestSplitter(
    targets=["target_column"],
    test_size=0.2,
    random_state=42,
    validation={"enabled": True, "size": 0.1},
)
```

#### OHE (OneHotEncoder)

One-hot encodes the specified categorical columns, optionally retaining numerical columns. Artifacts can be saved for reproducibility.

```python
from astrodata.preml import OHE

encoder = OHE(
    categorical_columns=["cat1", "cat2"],
    numerical_columns=["num1", "num2"],
)
```

#### MissingImputator

Imputes missing values in numerical columns (using the mean) and categorical columns (using the most frequent value).

```python
from astrodata.preml import MissingImputator

imputer = MissingImputator(
    numerical_columns=["num1", "num2"],
    categorical_columns=["cat1", "cat2"],
)
```

#### Standardizer

Standardizes numerical columns to zero mean and unit variance. Useful for scaling features before machine learning.

```python
from astrodata.preml import Standardizer

scaler = Standardizer(
    numerical_columns=["num1", "num2"],
)
```
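
### Chaining Processors

Since every processor takes a `Premldata` object and returns a transformed `Premldata`, the built-in processors can be applied one after another. The snippet below is a minimal sketch of that pattern; it assumes `preml` is a `Premldata` object produced by an earlier step (for example, the output of a `TrainTestSplitter` run) and uses placeholder column names. If the library provides a dedicated pipeline helper, that may be a more convenient alternative.

```python
from astrodata.preml import MissingImputator, OHE, Standardizer

# Assumption: `preml` is an existing Premldata object, e.g. produced by
# TrainTestSplitter; the column names below are placeholders.
imputer = MissingImputator(
    numerical_columns=["num1", "num2"],
    categorical_columns=["cat1", "cat2"],
)
encoder = OHE(
    categorical_columns=["cat1", "cat2"],
    numerical_columns=["num1", "num2"],
)
scaler = Standardizer(numerical_columns=["num1", "num2"])

# Each processor's process() takes a Premldata and returns a transformed
# Premldata, so the steps can simply be applied in sequence.
for processor in (imputer, encoder, scaler):
    preml = processor.process(preml)
```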