PremlPipeline

The PremlPipeline class in the astrodata.preml module orchestrates machine learning preprocessing through a configurable sequence of processors. It prepares data for supervised ML tasks with built-in operations such as train/test splitting, encoding, and missing-value imputation, and it supports custom steps via subclasses of the PremlProcessor class.

Overview

The pipeline consists of:

  • Processors: A list of preprocessing steps (e.g., splitting, encoding, imputing) applied in order.

  • Configuration: Processors and their parameters can be defined in code or in a YAML config file. Code-defined processors take precedence. Refer to the Configuration documentation for more details.

One constraint of the PremlPipeline is that it requires a TrainTestSplitter processor to be included in the pipeline: splitting the dataset into training and testing sets is a prerequisite for the downstream supervised-learning steps.
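As a sketch, a YAML config declaring the same processors used in the example below might look like the following. The key names and layout here are illustrative assumptions, not the actual schema; consult the Configuration documentation for the authoritative format.

```yaml
# Hypothetical config sketch -- key names are assumptions, not the real schema.
preml:
  TrainTestSplitter:
    targets: ["target"]
    test_size: 0.2
    random_state: 42
  MissingImputator:
    categorical_columns: ["feature2"]
    numerical_columns: ["feature1", "feature3"]
  OHE:
    categorical_columns: ["feature2"]
    numerical_columns: ["feature1", "feature3"]
```

Remember that when the same processor is defined both in code and in the config file, the code-defined version takes precedence.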

Example Usage

from astrodata.data import ProcessedData
from astrodata.preml import OHE, MissingImputator, PremlPipeline, TrainTestSplitter

# Previous steps of the pipeline...
processed_data = ProcessedData(...)

# Define processors
tts = TrainTestSplitter(targets=["target"], test_size=0.2, random_state=42)
ohe = OHE(categorical_columns=["feature2"], numerical_columns=["feature1", "feature3"])
imputer = MissingImputator(categorical_columns=["feature2"], numerical_columns=["feature1", "feature3"])

# Create the pipeline
preml_pipeline = PremlPipeline(
    config_path="example_config.yaml",
    processors=[tts, imputer, ohe],
)

# Run the pipeline
preml_data = preml_pipeline.run(processed_data, dump_output=False)
print(preml_data.train_features.head())
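As noted above, custom preprocessing steps can be added by subclassing PremlProcessor. The following self-contained sketch illustrates the pattern only; it stubs a minimal stand-in base class, and the hook name, signature, and data container are assumptions, so check the actual PremlProcessor interface before subclassing.

```python
# Self-contained sketch of the subclassing pattern. The stub base class below
# stands in for astrodata.preml.PremlProcessor, whose real interface may differ.
class PremlProcessor:
    """Stand-in base class; the hook name and signature are assumptions."""
    def process(self, data):
        raise NotImplementedError

class ColumnDropper(PremlProcessor):
    """Hypothetical custom step that removes unwanted columns."""
    def __init__(self, columns):
        self.columns = set(columns)

    def process(self, data):
        # 'data' is modeled here as a mapping of column name -> values;
        # the real pipeline passes its own data container instead.
        return {name: values for name, values in data.items()
                if name not in self.columns}

dropper = ColumnDropper(columns=["feature2"])
cleaned = dropper.process({"feature1": [1, 2], "feature2": [3, 4]})
print(sorted(cleaned))  # only "feature1" remains
```

A processor written this way could then be passed alongside the built-in steps in the `processors` list when constructing the pipeline.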