Overview

The astrodata.preml submodule performs advanced preprocessing and transformation steps on data that has already been loaded and split into training, testing, and (optionally) validation sets. The initial data pipeline handles loading and basic preprocessing; preml then prepares the resulting datasets for machine learning by applying transformations such as one-hot encoding, missing value imputation, and other feature engineering steps.

The core concept of preml is a flexible, reproducible way to chain multiple preprocessing operations through the PremlPipeline class. The pipeline applies the same transformations, in the same order, to your train, test, and validation splits, so the splits stay consistent with one another.
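
To make "applied consistently across splits" concrete, here is a minimal sketch in plain Python. It does not use the astrodata API: the OneHotSketch class and the toy rows are hypothetical stand-ins, and the choice to learn the categories from the training split only is an assumption about how such a step is usually kept consistent, not a statement about what preml does internally.

```python
# Hypothetical stand-in for a one-hot encoding step; NOT the astrodata.preml API.
class OneHotSketch:
    def __init__(self, column):
        self.column = column
        self.categories = []

    def fit(self, rows):
        # Assumption: categories are learned from the training split only.
        self.categories = sorted({row[self.column] for row in rows})
        return self

    def transform(self, rows):
        # Replace the categorical column with one 0/1 column per known category.
        encoded = []
        for row in rows:
            new_row = {k: v for k, v in row.items() if k != self.column}
            for cat in self.categories:
                new_row[f"{self.column}_{cat}"] = int(row[self.column] == cat)
            encoded.append(new_row)
        return encoded


# Toy data (illustrative values only).
train = [{"band": "g", "mag": 18.2}, {"band": "r", "mag": 17.9}]
test = [{"band": "r", "mag": 19.1}, {"band": "i", "mag": 20.4}]

encoder = OneHotSketch("band").fit(train)
train_enc = encoder.transform(train)
test_enc = encoder.transform(test)

# Both splits now share the same columns, even though "i" never appeared in train.
print(sorted(train_enc[0]) == sorted(test_enc[1]))
```

In this sketch a category that is absent from the training split encodes as all zeros in the test split, which is one common way to keep the column layout identical across splits.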

Typical usage involves:

1. Defining a TrainTestSplitter to split your processed data.
2. Specifying additional processors (e.g., OHE for categorical encoding, MissingImputator for handling missing values).
3. Running the PremlPipeline to apply these transformations in sequence.
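
The three steps above can be sketched as a split, a list of processors, and a loop that runs them in order. This is a plain-Python illustration under stated assumptions, not the astrodata API: split_rows, MeanImputerSketch, and PipelineSketch are hypothetical names, and the constructor arguments of the real TrainTestSplitter, MissingImputator, and PremlPipeline classes are not shown here because the overview does not specify them.

```python
import random


def split_rows(rows, test_fraction=0.25, seed=0):
    """Hypothetical splitter: shuffle reproducibly, then cut off a test portion."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


class MeanImputerSketch:
    """Hypothetical imputer: fills None with the training-split mean of a column."""

    def __init__(self, column):
        self.column = column
        self.fill_value = None

    def fit(self, rows):
        values = [r[self.column] for r in rows if r[self.column] is not None]
        self.fill_value = sum(values) / len(values)
        return self

    def transform(self, rows):
        return [
            {**r, self.column: self.fill_value if r[self.column] is None else r[self.column]}
            for r in rows
        ]


class PipelineSketch:
    """Hypothetical pipeline: fit each processor on train, apply it to both splits."""

    def __init__(self, processors):
        self.processors = processors

    def run(self, train, test):
        for proc in self.processors:
            proc.fit(train)
            train, test = proc.transform(train), proc.transform(test)
        return train, test


# Toy data (illustrative values only); each column has two missing entries.
rows = [
    {"mag": 18.2, "redshift": 0.10},
    {"mag": None, "redshift": 0.32},
    {"mag": 17.9, "redshift": None},
    {"mag": 19.4, "redshift": 0.05},
    {"mag": 20.1, "redshift": 0.44},
    {"mag": None, "redshift": 0.21},
    {"mag": 18.8, "redshift": None},
    {"mag": 19.0, "redshift": 0.17},
]

train, test = split_rows(rows)                      # step 1: split
steps = [MeanImputerSketch("mag"),                  # step 2: choose processors
         MeanImputerSketch("redshift")]
train_out, test_out = PipelineSketch(steps).run(train, test)  # step 3: run in order
```

Keeping the processors in an explicit list is what makes the sequence easy to reorder, extend, or record alongside an experiment.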

This approach allows you to easily configure, extend, and track your preprocessing steps, ensuring your data is ready for downstream machine learning workflows.