Overview
The astrodata.preml submodule is designed to perform advanced preprocessing and transformation steps on your data after it has been loaded and split into training, testing, and (optionally) validation sets. While the initial data pipeline handles loading and basic preprocessing, preml focuses on preparing your datasets for machine learning tasks by applying transformations such as one-hot encoding, missing value imputation, and other feature engineering steps.
The core concept of preml is to provide a flexible and reproducible way to chain together multiple preprocessing operations using the PremlPipeline class. This pipeline ensures that all transformations are applied consistently across your train, test, and validation splits, maintaining data integrity.
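To make "applied consistently across splits" concrete, here is a minimal sketch in plain Python. It does not use the astrodata API: SketchOneHot and the sample rows are hypothetical stand-ins, and the only assumption is the general pattern of learning a transformation from the training split and reusing it unchanged on the other splits.

```python
# Illustrative stand-in, NOT the astrodata API: a tiny one-hot encoder that
# learns its categories from the training split only.
class SketchOneHot:
    def __init__(self, column):
        self.column = column
        self.categories = []

    def fit(self, rows):
        # Learn the category vocabulary from the training rows.
        self.categories = sorted({row[self.column] for row in rows})
        return self

    def transform(self, rows):
        # Encode any split using the columns learned during fit().
        out = []
        for row in rows:
            encoded = {k: v for k, v in row.items() if k != self.column}
            for cat in self.categories:
                encoded[f"{self.column}_{cat}"] = int(row[self.column] == cat)
            out.append(encoded)
        return out


# Hypothetical sample data; "i" appears only in the test split.
train = [{"band": "g", "mag": 18.2}, {"band": "r", "mag": 17.9}]
test = [{"band": "r", "mag": 19.1}, {"band": "i", "mag": 20.3}]

encoder = SketchOneHot("band").fit(train)
train_enc = encoder.transform(train)
test_enc = encoder.transform(test)

# Both splits share the same set of output columns.
print(sorted(train_enc[0]) == sorted(test_enc[0]))
```

Because the encoder is fitted once on the training rows, a category that appears only in the test split is encoded as all zeros instead of creating an extra column, so the feature layout stays identical across splits.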
Typical usage involves:
- Defining a TrainTestSplitter to split your processed data.
- Specifying additional processors (e.g., OHE for categorical encoding, MissingImputator for handling missing values).
- Running the PremlPipeline to apply these transformations in sequence.
This approach allows you to easily configure, extend, and track your preprocessing steps, ensuring your data is ready for downstream machine learning workflows.
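The split, configure, run workflow described above can be sketched in plain Python as follows. This is an illustration of the pattern only: sketch_split, SketchMeanImputer, and SketchPipeline are hypothetical names invented for this example, and they do not reproduce the constructors or signatures of TrainTestSplitter, MissingImputator, or PremlPipeline. Mean imputation is likewise an assumed strategy chosen for brevity.

```python
import random

# Hypothetical stand-ins for illustration only; they do not reproduce the
# astrodata classes or their signatures.


def sketch_split(rows, test_size=0.25, seed=0):
    """Shuffle rows reproducibly and split them into train and test lists."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


class SketchMeanImputer:
    """Replace None in one column with the mean seen in the training split."""

    def __init__(self, column):
        self.column = column
        self.fill_value = None

    def fit(self, rows):
        # Assumes the training split has at least one observed value.
        observed = [r[self.column] for r in rows if r[self.column] is not None]
        self.fill_value = sum(observed) / len(observed)
        return self

    def transform(self, rows):
        col = self.column
        return [
            {**r, col: self.fill_value if r[col] is None else r[col]}
            for r in rows
        ]


class SketchPipeline:
    """Fit each processor on train, then apply it to both splits, in order."""

    def __init__(self, processors):
        self.processors = processors

    def run(self, train, test):
        for proc in self.processors:
            proc.fit(train)
            train, test = proc.transform(train), proc.transform(test)
        return train, test


# Hypothetical sample data with one missing value per column.
rows = [
    {"mag": 18.0, "z": 0.1},
    {"mag": None, "z": 0.3},
    {"mag": 20.0, "z": None},
    {"mag": 19.0, "z": 0.2},
]

# Step 1: split the processed data.
train, test = sketch_split(rows, test_size=0.25, seed=0)
# Step 2: specify the processors to chain.
pipeline = SketchPipeline([SketchMeanImputer("mag"), SketchMeanImputer("z")])
# Step 3: run the pipeline over both splits.
train_out, test_out = pipeline.run(train, test)
```

Fixing the seed keeps the split reproducible, and fitting each processor on the training split only means the test split never influences the imputed values.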