PremlPipeline
The PremlPipeline class in the astrodata.preml module orchestrates machine learning preprocessing by running a configurable sequence of processors. It prepares data for supervised ML tasks with built-in operations such as splitting, encoding, and imputing missing values, and it supports custom processors written as subclasses of the PremlProcessor class.
Overview
The pipeline consists of:
Processors: A list of preprocessing steps (e.g., splitting, encoding, imputing) applied in order.
Configuration: Processors and their parameters can be defined in code or in a YAML config file. Code-defined processors take precedence. Refer to the Configuration documentation for more details.
The PremlPipeline requires that a TrainTestSplitter processor be included in the pipeline. This processor splits the dataset into training and testing sets, which the rest of the supervised ML workflow depends on.
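The ordered-processor idea and the splitter requirement can be illustrated with a small self-contained sketch. This is not astrodata's implementation: the names Processor, Splitter, FillMissing, and Pipeline, the process method, and the dict-based data layout are all hypothetical stand-ins chosen for illustration, and the real PremlProcessor interface may differ.

```python
from abc import ABC, abstractmethod


class Processor(ABC):
    """Hypothetical stand-in for a preprocessing step (not astrodata's PremlProcessor)."""

    @abstractmethod
    def process(self, data: dict) -> dict:
        """Return a transformed copy of ``data``."""


class Splitter(Processor):
    """Hypothetical splitter: the last ``test_size`` fraction of rows becomes the test set."""

    def __init__(self, test_size: float = 0.2):
        self.test_size = test_size

    def process(self, data: dict) -> dict:
        rows = data["rows"]
        n_test = round(len(rows) * self.test_size)
        cut = len(rows) - n_test
        return {**data, "train": rows[:cut], "test": rows[cut:]}


class FillMissing(Processor):
    """Hypothetical imputer: replaces None with a constant in both splits."""

    def __init__(self, fill_value=0):
        self.fill_value = fill_value

    def process(self, data: dict) -> dict:
        def fill(rows):
            return [self.fill_value if value is None else value for value in rows]

        return {**data, "train": fill(data["train"]), "test": fill(data["test"])}


class Pipeline:
    """Hypothetical pipeline: applies processors in list order and requires a Splitter."""

    def __init__(self, processors):
        if not any(isinstance(p, Splitter) for p in processors):
            raise ValueError("pipeline requires a Splitter processor")
        self.processors = list(processors)

    def run(self, data: dict) -> dict:
        # Each processor receives the output of the previous one.
        for processor in self.processors:
            data = processor.process(data)
        return data


pipeline = Pipeline([Splitter(test_size=0.2), FillMissing(fill_value=0)])
result = pipeline.run({"rows": [1, None, 3, 4, 5, 6, 7, 8, None, 10]})
print(result["train"], result["test"])
```

In this sketch the splitter runs first so that later steps see separate train and test sets. Building a Pipeline without a Splitter raises a ValueError at construction time, which mirrors the constraint described above.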
Example Usage
from astrodata.data import ProcessedData
from astrodata.preml import OHE, MissingImputator, PremlPipeline, TrainTestSplitter
# Previous steps of the pipeline...
processed_data = ProcessedData(...)
# Define processors
tts = TrainTestSplitter(targets=["target"], test_size=0.2, random_state=42)
ohe = OHE(categorical_columns=["feature2"], numerical_columns=["feature1", "feature3"])
imputer = MissingImputator(categorical_columns=["feature2"], numerical_columns=["feature1", "feature3"])
# Create the pipeline
preml_pipeline = PremlPipeline(
    config_path="example_config.yaml",
    processors=[tts, imputer, ohe],
)
# Run the pipeline
preml_data = preml_pipeline.run(processed_data, dump_output=False)
print(preml_data.train_features.head())