# PremlPipeline

The `PremlPipeline` class in the `astrodata.preml` module orchestrates machine learning preprocessing steps using a configurable sequence of processors. It prepares your data for supervised ML tasks by handling preconfigured operations such as splitting, encoding, and imputing missing values; custom processors are also supported by subclassing the `PremlProcessor` class (a sketch appears at the end of this section).

## Overview

The pipeline consists of:

- **Processors**: A list of preprocessing steps (e.g., splitting, encoding, imputing) applied in order.
- **Configuration**: Processors and their parameters can be defined in code or in a YAML config file; code-defined processors take precedence. Refer to the [Configuration]() documentation for more details. A hypothetical config sketch follows the example below.

One constraint of `PremlPipeline` is that it requires a `TrainTestSplitter` processor to be included in the pipeline: splitting the dataset into training and testing sets is a prerequisite for the supervised ML workflows the pipeline targets.

## Example Usage

```python
from astrodata.data import ProcessedData
from astrodata.preml import OHE, MissingImputator, PremlPipeline, TrainTestSplitter

# Previous steps of the pipeline...
processed_data = ProcessedData(...)

# Define processors
tts = TrainTestSplitter(targets=["target"], test_size=0.2, random_state=42)
ohe = OHE(categorical_columns=["feature2"], numerical_columns=["feature1", "feature3"])
imputer = MissingImputator(categorical_columns=["feature2"], numerical_columns=["feature1", "feature3"])

# Create the pipeline
preml_pipeline = PremlPipeline(
    config_path="example_config.yaml",
    processors=[tts, imputer, ohe],
)

# Run the pipeline
preml_data = preml_pipeline.run(processed_data, dump_output=False)
print(preml_data.train_features.head())
```
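The example above passes `config_path="example_config.yaml"`. The actual schema is defined in the [Configuration]() documentation; as a rough, hypothetical sketch (the keys and layout here are illustrative assumptions, not the library's confirmed format), such a file might mirror the processors defined in code:

```yaml
# example_config.yaml -- hypothetical layout; consult the Configuration docs
# for the real schema. Code-defined processors take precedence over entries here.
processors:
  TrainTestSplitter:
    targets: ["target"]
    test_size: 0.2
    random_state: 42
  MissingImputator:
    categorical_columns: ["feature2"]
    numerical_columns: ["feature1", "feature3"]
  OHE:
    categorical_columns: ["feature2"]
    numerical_columns: ["feature1", "feature3"]
```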
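As noted in the introduction, custom steps can be added by subclassing `PremlProcessor`. The base class's exact interface is not shown here, so the following is a minimal sketch under the assumption that a processor implements a single transform-style hook; the method name (`process`) and its signature are illustrative assumptions, not the library's confirmed API:

```python
import numpy as np

from astrodata.preml import PremlProcessor


class LogTransform(PremlProcessor):
    """Hypothetical custom processor that applies log1p to selected columns."""

    def __init__(self, columns):
        # Assumption: the base class does not require constructor arguments;
        # check PremlProcessor's definition for its actual requirements.
        self.columns = columns

    def process(self, data):
        # Assumed hook name: the real PremlProcessor base class may expect a
        # different method (e.g., transform/apply); verify against its API.
        for col in self.columns:
            data[col] = np.log1p(data[col])
        return data
```

A processor like this could then be appended to the `processors` list alongside the built-in steps when constructing the `PremlPipeline`.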