astrodata.preml package¶

Submodules¶

astrodata.preml.pipeline module¶

class astrodata.preml.pipeline.PremlPipeline(config_path, processors=None)¶

Bases: object

Pipeline for processing data using a configurable sequence of processors.

Features:

Requires either a config_path or a processors list (not both None).
Merges processors from config and argument, with argument processors taking priority.
Ensures the first processor is a TrainTestSplitter.
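
The merge and ordering rules listed above can be sketched in plain Python. This is a minimal illustration and not the library's implementation: the `Processor`, `TrainTestSplitter`, and `Scaler` classes and the `merge_processors` helper are hypothetical stand-ins, and two details are assumptions, namely matching processors by class name and raising an error (rather than reordering) when the first processor is not a splitter.

```python
class Processor:
    """Hypothetical stand-in for a PremlProcessor."""

    def __init__(self, **params):
        self.params = params


class TrainTestSplitter(Processor):
    """Hypothetical stand-in for the splitter that must come first."""


class Scaler(Processor):
    """Hypothetical stand-in for any later processing step."""


def merge_processors(from_config, from_argument):
    """Merge two processor lists; on a clash the argument instance wins."""
    if not from_config and not from_argument:
        raise ValueError("Provide a config_path or a processors list.")
    # Assumption: processors are matched by class name.
    merged = {type(p).__name__: p for p in (from_config or [])}
    # Updating an existing key keeps its position but replaces the instance.
    merged.update({type(p).__name__: p for p in (from_argument or [])})
    ordered = list(merged.values())
    # Assumption: a wrong first processor is reported as an error.
    if not isinstance(ordered[0], TrainTestSplitter):
        raise ValueError("The first processor must be a TrainTestSplitter.")
    return ordered


config_processors = [TrainTestSplitter(test_size=0.2), Scaler(method="standard")]
argument_processors = [Scaler(method="minmax")]
merged = merge_processors(config_processors, argument_processors)
```

Here `merged` keeps the splitter from the configuration in first position, while the `Scaler` passed as an argument replaces the one defined in the configuration.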

Parameters:

config_path (str) – Path to the configuration file.
processors (list[PremlProcessor], optional) – List of processor instances.

run(processeddata: ProcessedData) -> Premldata: Executes the pipeline, applying processors in order and returning the final Premldata.

run(processeddata, dump_output=True)¶

Executes the data pipeline by applying each processor in sequence.

Parameters:

processeddata (ProcessedData) – The input processed data.
dump_output (bool) – Whether to dump the output to disk.

Returns:

The final processed data object.

Return type:

Premldata
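
To make the sequential behaviour of run concrete, the sketch below applies a list of callables one after another and optionally writes the result to disk. It is a dependency-free illustration under stated assumptions: `run_pipeline`, `split`, `double_train`, and the `output_path` parameter are hypothetical, plain dictionaries stand in for ProcessedData and Premldata, and JSON is used for the dump only to keep the example self-contained.

```python
import json
import tempfile
from pathlib import Path


def run_pipeline(data, processors, dump_output=True, output_path=None):
    """Apply each processor to the output of the previous one."""
    for process in processors:
        data = process(data)
    if dump_output and output_path is not None:
        # Stand-in for dumping the output; JSON keeps the sketch stdlib-only.
        Path(output_path).write_text(json.dumps(data))
    return data


def split(data):
    """Toy splitter: the first three rows become the training set."""
    return {"train": data["rows"][:3], "test": data["rows"][3:]}


def double_train(data):
    """Toy transform that runs after the split and touches only the training rows."""
    return {**data, "train": [2 * x for x in data["train"]]}


with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "preml.json"
    result = run_pipeline(
        {"rows": [1, 2, 3, 4]}, [split, double_train], output_path=out_file
    )
    dumped = json.loads(out_file.read_text())
```

Because the splitter runs first, every later step sees separate training and test sets, which is why the pipeline insists on that ordering.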

astrodata.preml.schemas module¶

class astrodata.preml.schemas.Premldata(**data)¶

Bases: BaseModel

Represents processed data after transformations.

train_features¶

Training features.

Type: pd.DataFrame

val_features¶

Validation features, if available.

Type: Optional[pd.DataFrame]

test_features¶

Test features.

Type: pd.DataFrame

train_targets¶

Training targets.

Type: pd.DataFrame | pd.Series

val_targets¶

Validation targets, if available.

Type: Optional[pd.DataFrame | pd.Series]

test_targets¶

Test targets.

Type: pd.DataFrame | pd.Series

metadata¶

Additional metadata about the processed data.

Type: Optional[dict]

class Config¶

Bases: object

arbitrary_types_allowed = True¶

dump_parquet(path)¶

Dumps the processed data to a Parquet file.

Parameters:

path (Path) – The file path to save the Parquet file.

dump_supervised_ML_format()¶

Returns the data split into training and testing sets.

Returns:

A tuple containing the training and testing features and targets.

Return type:

tuple
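
The documentation does not state the order of the returned tuple, so the sketch below only illustrates how a caller might unpack it. The order used here (train features, test features, train targets, test targets) is an assumption borrowed from the common scikit-learn convention. `SplitData` and `supervised_format` are hypothetical stand-ins built on a dataclass, with lists in place of DataFrames.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SplitData:
    """Hypothetical stand-in for Premldata; lists replace DataFrames."""

    train_features: list
    test_features: list
    train_targets: list
    test_targets: list
    val_features: Optional[list] = None
    val_targets: Optional[list] = None

    def supervised_format(self):
        # Assumed order: features before targets, train before test.
        return (
            self.train_features,
            self.test_features,
            self.train_targets,
            self.test_targets,
        )


data = SplitData(
    train_features=[[0.1, 1.0], [0.4, 2.0], [0.9, 3.0]],
    test_features=[[0.5, 2.5]],
    train_targets=[0, 0, 1],
    test_targets=[1],
)
X_train, X_test, y_train, y_test = data.supervised_format()
```

Check the actual return order against the installed version before relying on positional unpacking.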

metadata: Optional[dict]¶

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}¶

Configuration for the model. It should be a dictionary conforming to pydantic.config.ConfigDict.

test_features: DataFrame¶

test_targets: DataFrame | Series¶

train_features: DataFrame¶

train_targets: DataFrame | Series¶

val_features: Optional[DataFrame]¶

val_targets: Union[DataFrame, Series, None]¶

astrodata.preml.utils module¶

astrodata.preml.utils.instantiate_processors(config, ignore_unknown=True, defaults=None)¶

Given a config dict, returns a dict mapping processor names to instances. Validates processor names, catches instantiation errors, and allows for defaults.

Parameters:

config (dict) – The ‘preml’ section of the configuration.
ignore_unknown (bool) – If True, unknown processors are ignored. If False, a ValueError is raised.
defaults (dict) – Optional default parameters for processors.

Returns:

Dictionary mapping processor names to their instances.

Return type:

dict

Raises:

ValueError – If an unknown processor is found and ignore_unknown is False.
RuntimeError – If processor instantiation fails.
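
The behaviour described for instantiate_processors can be approximated with a small registry lookup. This is a sketch under assumptions, not the library's code: the `REGISTRY` dictionary, the `Scaler` and `Imputer` classes, and the `instantiate` function are hypothetical, and both the config and the defaults are assumed to map processor names to keyword-argument dictionaries.

```python
class Scaler:
    """Hypothetical processor with an optional parameter."""

    def __init__(self, method="standard"):
        self.method = method


class Imputer:
    """Hypothetical processor with a required parameter."""

    def __init__(self, strategy):
        self.strategy = strategy


# Hypothetical registry of known processor names.
REGISTRY = {"Scaler": Scaler, "Imputer": Imputer}


def instantiate(config, ignore_unknown=True, defaults=None):
    """Map each known processor name in config to a new instance."""
    defaults = defaults or {}
    instances = {}
    for name, params in config.items():
        if name not in REGISTRY:
            if ignore_unknown:
                continue
            raise ValueError(f"Unknown processor: {name}")
        # Explicit config values override the defaults for that processor.
        kwargs = {**defaults.get(name, {}), **(params or {})}
        try:
            instances[name] = REGISTRY[name](**kwargs)
        except Exception as exc:
            raise RuntimeError(f"Could not instantiate {name}") from exc
    return instances


processors = instantiate(
    {"Scaler": {"method": "minmax"}, "Imputer": None, "Mystery": {}},
    defaults={"Imputer": {"strategy": "median"}},
)
```

With ignore_unknown left at True, the unrecognised "Mystery" entry is skipped silently. Passing ignore_unknown=False turns the same entry into a ValueError, which is usually preferable when validating a configuration file.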
