astrodata.preml package¶
Subpackages¶
Submodules¶
astrodata.preml.pipeline module¶
- class astrodata.preml.pipeline.PremlPipeline(config_path, processors=None)¶
Bases:
object
Pipeline for processing data using a configurable sequence of processors.
- Features:
Requires a config_path, a processors list, or both; at least one must be provided.
Merges processors from config and argument, with argument processors taking priority.
Ensures the first processor is a TrainTestSplitter.
- Parameters:
config_path (str) – Path to the configuration file.
processors (list[PremlProcessor], optional) – List of processor instances.
- run(processeddata: ProcessedData) -> Premldata: Executes the pipeline, applying processors in order and returning the final Premldata.
- run(processeddata, dump_output=True)¶
Executes the data pipeline by applying each processor in sequence.
- Parameters:
processeddata (ProcessedData) – The input processed data.
dump_output (bool) – Whether to dump the output to disk.
- Returns:
The final processed data object.
- Return type:
Premldata
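The merge-and-run behaviour described for PremlPipeline can be sketched with plain Python stand-ins. This is a minimal illustration, not the astrodata implementation: the `Processor`, `TrainTestSplitter`, and `Scaler` classes, the `merge_processors` and `run_pipeline` helpers, and the dict-based data are all hypothetical. It also assumes that processors are matched by class name when merging and that a missing leading splitter raises an error, neither of which the documentation above states.

```python
from __future__ import annotations

from typing import Any


class Processor:
    """Hypothetical stand-in for a PremlProcessor."""

    def process(self, data: dict[str, Any]) -> dict[str, Any]:
        raise NotImplementedError


class TrainTestSplitter(Processor):
    """Toy splitter: the first `train_fraction` of the rows becomes the train set."""

    def __init__(self, train_fraction: float = 0.75) -> None:
        self.train_fraction = train_fraction

    def process(self, data: dict[str, Any]) -> dict[str, Any]:
        rows = data["rows"]
        cut = int(len(rows) * self.train_fraction)
        return {"train": rows[:cut], "test": rows[cut:]}


class Scaler(Processor):
    """Toy processor: multiplies every value in every split by `factor`."""

    def __init__(self, factor: float = 1.0) -> None:
        self.factor = factor

    def process(self, data: dict[str, Any]) -> dict[str, Any]:
        return {name: [x * self.factor for x in values] for name, values in data.items()}


def merge_processors(from_config: list[Processor], from_args: list[Processor]) -> list[Processor]:
    """Merge two processor lists; argument processors win (assumed: matched by class name)."""
    merged = {type(p).__name__: p for p in from_config}
    for p in from_args:
        # Re-assigning an existing key keeps its position; new keys are appended.
        merged[type(p).__name__] = p
    return list(merged.values())


def run_pipeline(processors: list[Processor], data: dict[str, Any]) -> dict[str, Any]:
    """Apply each processor in order, requiring a splitter in first position."""
    if not processors or not isinstance(processors[0], TrainTestSplitter):
        raise ValueError("the first processor must be a TrainTestSplitter")
    for processor in processors:
        data = processor.process(data)
    return data


config_processors = [TrainTestSplitter(0.5), Scaler(2.0)]
arg_processors = [Scaler(10.0)]  # overrides the Scaler that came from the config
processors = merge_processors(config_processors, arg_processors)
result = run_pipeline(processors, {"rows": [1, 2, 3, 4]})
```

Here the `Scaler` passed as an argument replaces the configured one while keeping its position after the splitter, so each split is scaled by 10 rather than 2.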
astrodata.preml.schemas module¶
- class astrodata.preml.schemas.Premldata(**data)¶
Bases:
BaseModel
Represents processed data after transformations.
- train_features¶
Training features.
- Type:
pd.DataFrame
- val_features¶
Validation features, if available.
- Type:
Optional[pd.DataFrame]
- test_features¶
Test features.
- Type:
pd.DataFrame
- train_targets¶
Training targets.
- Type:
pd.DataFrame | pd.Series
- val_targets¶
Validation targets, if available.
- Type:
Optional[pd.DataFrame | pd.Series]
- test_targets¶
Test targets.
- Type:
pd.DataFrame | pd.Series
- metadata¶
Additional metadata about the processed data.
- Type:
Optional[dict]
- dump_parquet(path)¶
Dumps the processed data to a Parquet file.
- Parameters:
path (Path) – The file path to save the Parquet file.
- dump_supervised_ML_format()¶
Returns the data split into training and testing sets.
- Returns:
A tuple containing the training and testing features and targets.
- Return type:
tuple
- metadata: Optional[dict]¶
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}¶
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- test_features: DataFrame¶
- test_targets: DataFrame | Series¶
- train_features: DataFrame¶
- train_targets: DataFrame | Series¶
- val_features: Optional[DataFrame]¶
- val_targets: Union[DataFrame, Series, None]¶
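The shape of a Premldata-like container, and of the tuple returned by a method such as dump_supervised_ML_format, can be illustrated with a standard-library dataclass. This is only a sketch: `SplitData` and `supervised_format` are hypothetical names, plain lists stand in for the pandas objects, and the tuple order (train features, test features, train targets, test targets) is an assumption modelled on scikit-learn's `train_test_split`. The documentation above does not specify the order.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class SplitData:
    """Hypothetical stand-in for Premldata, using lists instead of DataFrames."""

    train_features: list
    test_features: list
    train_targets: list
    test_targets: list
    val_features: Optional[list] = None  # validation split is optional
    val_targets: Optional[list] = None
    metadata: Optional[dict] = None

    def supervised_format(self) -> tuple:
        """Return (X_train, X_test, y_train, y_test); this ordering is assumed."""
        return (
            self.train_features,
            self.test_features,
            self.train_targets,
            self.test_targets,
        )


data = SplitData(
    train_features=[[0.1, 0.2], [0.3, 0.4]],
    test_features=[[0.5, 0.6]],
    train_targets=[0, 1],
    test_targets=[1],
    metadata={"source": "example"},
)
X_train, X_test, y_train, y_test = data.supervised_format()
```

In this sketch the optional validation fields default to None and are left out of the returned tuple.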
astrodata.preml.utils module¶
- astrodata.preml.utils.instantiate_processors(config, ignore_unknown=True, defaults=None)¶
Given a config dict, returns a dict mapping processor names to instances. Validates processor names, catches instantiation errors, and allows for defaults.
- Parameters:
config (dict) – The ‘preml’ section of the configuration.
ignore_unknown (bool) – If True, unknown processors are ignored. If False, a ValueError is raised.
defaults (dict) – Optional default parameters for processors.
- Returns:
Dictionary mapping processor names to their instances.
- Return type:
dict
- Raises:
ValueError – If an unknown processor is found and ignore_unknown is False.
RuntimeError – If processor instantiation fails.
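The contract described for instantiate_processors (name validation, defaults, and the two error types) can be sketched with a small registry. This is an illustration under stated assumptions, not the library's code: the `REGISTRY` mapping, the `Scaler` and `Clipper` classes, and the `instantiate` function are hypothetical. It further assumes that the config maps each processor name to a dict of keyword arguments, that `defaults` has the same shape, and that config values override defaults.

```python
from __future__ import annotations

from typing import Any, Optional


class Scaler:
    """Hypothetical processor with one optional parameter."""

    def __init__(self, factor: float = 1.0) -> None:
        self.factor = factor


class Clipper:
    """Hypothetical processor with two required parameters."""

    def __init__(self, lower: float, upper: float) -> None:
        self.lower = lower
        self.upper = upper


# Hypothetical registry of known processor names.
REGISTRY: dict[str, type] = {"Scaler": Scaler, "Clipper": Clipper}


def instantiate(
    config: dict[str, Optional[dict[str, Any]]],
    ignore_unknown: bool = True,
    defaults: Optional[dict[str, dict[str, Any]]] = None,
) -> dict[str, Any]:
    """Build one instance per configured processor name."""
    defaults = defaults or {}
    instances: dict[str, Any] = {}
    for name, params in config.items():
        cls = REGISTRY.get(name)
        if cls is None:
            if ignore_unknown:
                continue  # silently skip names that are not registered
            raise ValueError(f"unknown processor: {name!r}")
        # Defaults first, then config values, so the config takes priority.
        kwargs = {**defaults.get(name, {}), **(params or {})}
        try:
            instances[name] = cls(**kwargs)
        except Exception as exc:
            raise RuntimeError(f"could not instantiate {name!r}: {exc}") from exc
    return instances


procs = instantiate(
    {"Scaler": {"factor": 2.0}, "Clipper": {"upper": 5.0}, "Mystery": {}},
    defaults={"Scaler": {"factor": 9.0}, "Clipper": {"lower": 0.0}},
)
```

In this call `Mystery` is skipped because `ignore_unknown` defaults to True, `Scaler` takes its factor from the config rather than the defaults, and `Clipper` combines `lower` from the defaults with `upper` from the config.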