processors¶
The `astrodata.data.processors` module provides a framework for transforming data within the astrodata workflow. It is built around an extensible interface that lets users compose preprocessing workflows by chaining multiple operations together, so custom processors can be applied to data in a modular way.
Abstract Class¶
`AbstractProcessor` is the abstract base class for all processors. Subclasses must implement:

`process(raw: RawData) -> RawData`
: Applies a transformation to the input `RawData` and returns the result.
This interface ensures that all processors can be chained together in a pipeline, regardless of their specific function.
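As a rough sketch of that contract (illustrative only; the real base class lives in `astrodata.data`, and the details shown here are assumptions):

```python
from abc import ABC, abstractmethod

from astrodata.data import RawData


class AbstractProcessor(ABC):
    """Base class for all processors: one transformation per subclass."""

    @abstractmethod
    def process(self, raw: RawData) -> RawData:
        """Apply a transformation to the input RawData and return the result."""
        ...
```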
How to Use¶
Creating a Processor¶
To define a custom processor, subclass `AbstractProcessor` and implement the `process` method:
```python
from astrodata.data import RawData, AbstractProcessor


class FeatureAdder(AbstractProcessor):
    def process(self, raw: RawData) -> RawData:
        # Add a derived column summing two existing feature columns.
        raw.data["feature_sum"] = raw.data["feature1"] + raw.data["feature2"]
        return raw
```
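A quick, hypothetical usage sketch (it assumes `RawData` can be constructed from a pandas DataFrame exposed as its `.data` attribute, which matches how `FeatureAdder` indexes it above):

```python
import pandas as pd

from astrodata.data import RawData

# Hypothetical construction; assumes RawData wraps a DataFrame as .data.
raw = RawData(pd.DataFrame({"feature1": [1, 2], "feature2": [10, 20]}))

processed = FeatureAdder().process(raw)
print(processed.data["feature_sum"].tolist())  # [11, 22]
```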
Using Built-in Processors¶
The `astrodata.data.processors.common` module includes simple, ready-to-use processors that showcase the functionality. These can be used directly in your data pipelines:

`NormalizeAndSplit`
: Normalizes data by subtracting the mean and dividing by the standard deviation.

`DropDuplicates`
: Removes duplicate rows from the dataset.
Example usage:
```python
from astrodata.data import NormalizeAndSplit, DropDuplicates

processors = [NormalizeAndSplit(), DropDuplicates(), FeatureAdder()]
```
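The list itself is plain Python; each processor's output feeds the next. A minimal sketch of sequential application, using a hypothetical helper (`DataPipeline` performs the equivalent chaining internally):

```python
def run_processors(raw, processors):
    # Apply each processor in order; the output of one becomes
    # the input of the next.
    for processor in processors:
        raw = processor.process(raw)
    return raw


clean = run_processors(raw, processors)
```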
Extensibility¶
To add new preprocessing steps, simply create a new processor by subclassing `AbstractProcessor`. Processors can be combined in any order, allowing for flexible and reusable data transformations.
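For example, a hypothetical `DropMissing` processor (assuming, as above, that `raw.data` behaves like a pandas DataFrame) slots in alongside the built-ins:

```python
class DropMissing(AbstractProcessor):
    """Hypothetical processor: removes rows that contain missing values."""

    def process(self, raw: RawData) -> RawData:
        raw.data = raw.data.dropna()
        return raw


processors = [DropMissing(), NormalizeAndSplit(), FeatureAdder()]
```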
Hint

Processors are applied in sequence by the `DataPipeline`, so order matters! Ensure that the sequence of processors is logical for your data transformation needs.
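For instance, with the processors above, running `NormalizeAndSplit` before `FeatureAdder` computes `feature_sum` from already-normalized columns, while the reverse order would normalize `feature_sum` along with everything else (assuming column-wise normalization); the two results differ:

```python
# Sum computed over normalized columns:
processors = [NormalizeAndSplit(), FeatureAdder()]

# feature_sum computed first, then normalized with the other columns:
# processors = [FeatureAdder(), NormalizeAndSplit()]
```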