DVC
What is DVC and Why Use It?
DVC (Data Version Control) is a tool for versioning large files and datasets. In the context of AstroData, DVC tracks and versions your data outputs (such as processed datasets or intermediate files), so that your experiments are reproducible and your data stays synchronized with your code.
Using DVC within AstroData allows you to:
Version control your data alongside your code.
Reproduce experiments by tracking exactly which data was used.
Share data and results easily with collaborators.
Configuring DVC in AstroData
To enable data versioning with DVC, configure the data section of your YAML configuration file:
Enable Data Tracking
Set data.enable: true in your config.
Specify Data Paths
List the files or folders you want to track under data.paths, for example the raw data files you input into your pipeline.
Hint: Files (artifacts and data) produced by AstroData will be automatically tracked if you have dump_output enabled in your pipeline configuration.
Set the DVC Remote
Provide the full path to your DVC remote storage with data.remote. This is where your data will be pushed.
Enable Data Push
Set data.push: true to automatically push tracked data to the remote after each run.
Example configuration:
data:
  enable: true
  paths:
    - "files_to_version"
  remote: "/path/to/remote/"  # Must be full path
  push: true
When you run your data or preml pipeline with AstroData and have dump_output enabled, the outputs will be saved and tracked by DVC according to this configuration.
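Because data.remote must be a full path, it can help to sanity-check the data section before starting a run. The following is a minimal sketch, not part of AstroData: the function check_data_config is hypothetical, it takes the configuration as an already-parsed dictionary, and it assumes POSIX-style paths like the one in the example above. Whether AstroData performs similar checks itself, or applies defaults for missing keys, is not stated here.

```python
from pathlib import PurePosixPath


def check_data_config(config: dict) -> list:
    """Return a list of problems found in the `data` section (empty if none).

    Hypothetical helper for illustration; AstroData does not necessarily
    validate its configuration this way.
    """
    data = config.get("data")
    if not isinstance(data, dict):
        return ["missing `data` section"]

    problems = []

    # data.enable switches DVC tracking on or off.
    if not isinstance(data.get("enable"), bool):
        problems.append("data.enable must be true or false")

    # data.paths lists the files or folders to track.
    paths = data.get("paths", [])
    if not isinstance(paths, list) or not all(isinstance(p, str) for p in paths):
        problems.append("data.paths must be a list of strings")

    # data.remote must be a full path (assumed POSIX-style here).
    remote = data.get("remote")
    if not isinstance(remote, str) or not PurePosixPath(remote).is_absolute():
        problems.append("data.remote must be a full (absolute) path")

    # data.push is assumed optional in this sketch and defaults to False.
    if not isinstance(data.get("push", False), bool):
        problems.append("data.push must be true or false")

    return problems


# The example configuration from this page, as a parsed dictionary.
example = {
    "data": {
        "enable": True,
        "paths": ["files_to_version"],
        "remote": "/path/to/remote/",
        "push": True,
    }
}

print(check_data_config(example))  # prints []
```

A relative remote such as "remote/" would be reported as a problem, which is the mistake the "Must be full path" comment warns about.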
Note: DVC operations are handled automatically by AstroData when configured. You can still use DVC from the command line as usual.