DVC

What is DVC and Why Use It?

DVC (Data Version Control) is a tool for versioning large data files, datasets, and files in general. In the context of AstroData, DVC helps you track and version your data outputs (such as processed datasets or intermediate files), ensuring that your experiments are reproducible and your data is synchronized with your code.

Using DVC functionalitites within AstroData allows you to:

  • Version control your data alongside your code.

  • Reproduce experiments by tracking exactly which data was used.

  • Share data and results easily with collaborators.

Configuring DVC in AstroData

To enable data versioning with DVC, you need to configure the relevant section in your YAML file:

  1. Enable Data Tracking
    Set data.enable: true in your config.

  2. Specify Data Paths
    List the files or folders you want to track under data.paths. For example, raw data files you input into your pipeline.

Hint

Files (artifacts and data) produced by AstroData will be automatically tracked if you have dump_output enabled in your pipeline configuration.

  1. Set the DVC Remote
    Provide the full path to your DVC remote storage with data.remote. This is where your data will be pushed.

  2. Enable Data Push
    Set data.push: true to automatically push tracked data to the remote after each run.

Example configuration:

data:
  enable: true
  paths:
    - "files_to_version"
  remote: "/path/to/remote/" # Must be full path
  push: true

When you run your data or preml pipeline with AstroData and have dump_output enabled, the outputs will be saved and tracked by DVC according to this configuration.

Note: DVC operations are handled automatically by AstroData when configured. You can still use DVC from the command line as usual.