Share data across jobs with artifacts

1 Background

Now that we’ve covered how to add multiple steps to a single job, how to run multiple jobs within a single workflow, and how to trigger a job based on the successful completion of another, we can build on these with additional features. If you’re using multiple jobs within a workflow, you will likely need to pass data from one job to the next. One option is to commit these data files to the repo and then pull them into the next job during the checkout step, but you may either a) not want to keep intermediate or raw data in your repo, or b) want to limit the number of files loaded onto the runner in the next job (especially if your full repo could exceed the disk storage of the runner). A helpful alternative for passing data across jobs and storing it is the use of artifacts. Artifacts can be retained for between 1 and 90 days and provide a useful solution for temporary storage of data between jobs.

Warning: Limits on artifact storage

As briefly mentioned on the Basics page, your total monthly artifact storage space is dictated by the type of GitHub account you have (Table 1). If your workflows generate enough artifacts over the course of a month, this quota could be exhausted before it resets and disrupt your runs; see this blog post for a potential solution that periodically cleans up old artifacts (a rough sketch of that idea follows the table below).


Table 1: Monthly artifact storage limits by GitHub plan.
Plan               Storage
Free               500 MB
Pro                2 GB
Enterprise Cloud   50 GB
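
To give a sense of what that kind of cleanup can look like, below is a minimal sketch of a separate scheduled workflow that lists artifacts through the GitHub REST API (via the gh CLI, which is preinstalled on GitHub-hosted runners) and deletes any older than a chosen cutoff. The file name, the weekly schedule, and the 7-day cutoff are all assumptions, and the job needs the actions: write permission on the default GITHUB_TOKEN; treat this as a starting point rather than the blog post’s exact solution.

cleanup_artifacts.yml
name: Clean up old artifacts

on:
  schedule:
    - cron: '0 3 * * 0'  #assumed schedule: weekly on Sundays at 3 am UTC
  workflow_dispatch:

jobs:
  cleanup:
    runs-on: ubuntu-latest
    permissions:
      actions: write  #required for the GITHUB_TOKEN to delete artifacts
    steps:
    - name: Delete artifacts older than 7 days
      env:
        GH_TOKEN: ${{ github.token }}
      run: |
        cutoff=$(date -d '7 days ago' +%s)
        gh api "repos/${{ github.repository }}/actions/artifacts" --paginate \
          --jq '.artifacts[] | select(.expired == false) | [.id, .created_at] | @tsv' |
        while IFS=$'\t' read -r id created; do
          if [ "$(date -d "$created" +%s)" -lt "$cutoff" ]; then
            gh api --method DELETE "repos/${{ github.repository }}/actions/artifacts/$id"
          fi
        done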


Artifacts are used within a workflow by specifying which file(s) or directory you’d like to store, uploading them to GitHub using an action, and then downloading these files in the next job using another action. We’ll cover these steps in greater detail over the next couple of sections.

2 Upload artifact

Once the job has performed all of its steps of interest, you will add one more step that uses the upload-artifact action. This action includes a number of options that can be specified by the user or left at their defaults. Below is an example showing how it may be used:

artifact_example.yml
name: Share data with artifacts

on:
  schedule:
    - cron: '0 12 * * *'  #run daily at 8 am EDT (UTC-04:00)
  workflow_dispatch:

jobs:  
  dl_sst: 
    runs-on: ubuntu-latest
    steps:  
    - name: Check out repository  
      uses: actions/checkout@v5  

    - name: Install R
      uses: r-lib/actions/setup-r@v2  
    
    - name: Install R packages
      uses: r-lib/actions/setup-r-dependencies@v2
      with:
        cache: always
        packages: |  # <1>
          any::glue
          any::terra
      
    - name: Download and export SST
      shell: Rscript {0}
      run: source("Complex_GHAs/R/download_export_sst.R")  # <2>
    
    - name: Upload SST artifact
      uses: actions/upload-artifact@v4  # <3>
      with:
        name: erddap_sst  # <4>
        path: mab_sst.tif  # <5>
        retention-days: 1  # <6>
1. Only need a couple of packages for this job
2. Short R script to download and export the SST data
3. Action for uploading an artifact
4. The name we want to give the artifact object
5. The path to the file (or directory) that we’d like to store in the artifact
6. The number of days we want this artifact to be retained by GitHub (default is 90 days)


If this workflow YAML runs successfully, it will complete by storing the mab_sst.tif file with our data in the erddap_sst artifact. With an artifact now available for later use, we can then download this object in subsequent jobs.
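
As an aside, the path input of upload-artifact isn’t limited to a single file: it can also point to a directory, use glob patterns, or list multiple entries on separate lines. The snippet below is only a hypothetical illustration (the output/ directory and log file aren’t part of this example’s workflow):

    - name: Upload multiple SST outputs
      uses: actions/upload-artifact@v4
      with:
        name: erddap_sst
        path: |
          output/*.tif
          logs/sst_download.log
        retention-days: 1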

3 Download artifact

Now we will do the converse of the previous job: the artifact will be downloaded toward the beginning of this next job, right after checking out the repo. This is performed using the download-artifact action, which also includes a number of options that may be specified. Its use is demonstrated in the example below:

artifact_example.yml

  mean_sst:  # <1>
    runs-on: ubuntu-latest
    needs: dl_sst  # <2>
    steps:  
    - name: Check out repository  
      uses: actions/checkout@v5  
    
    - name: Download artifact w SST data
      uses: actions/download-artifact@v5  # <3>
      with:
        name: erddap_sst  # <4>
        path: .  # <5>

    - name: Install geospatial dependencies (e.g., GDAL)
      run: |
        sudo apt-get update
        sudo apt-get install -y libgdal-dev gdal-bin
    
    - name: Install xarray and other deps from existing reqs file
      run: pip install -r requirements.txt  # <6>
    
    - name: Calculate mean SST
      run: python Complex_GHAs/R/summarize_sst.py  # <7>
1. Name given to the second job
2. Syntax for specifying that the mean_sst job needs to wait for the dl_sst job to complete successfully before it starts
3. Action for downloading an artifact
4. Name of the artifact we defined in the previous job
5. (optional) The path where you’d like the artifact’s files to be placed. In this case, I’m using the period (.) syntax to refer to the current directory (which is also the root dir)
6. Command to install xarray and the other Python packages needed to handle the SST raster file
7. Command to run the Python script that calculates mean SST from the raster file
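
One related option worth noting: with v4 of download-artifact, omitting the name input downloads every artifact from the workflow run, each into its own sub-directory, and setting merge-multiple: true places their contents into a single directory instead (I’m assuming this behavior carries over unchanged to v5). A hypothetical step for a workflow that produced several upstream artifacts might look like:

    - name: Download all artifacts from earlier jobs
      uses: actions/download-artifact@v5
      with:
        path: data/
        merge-multiple: true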

4 Putting it all together

Now that we know how to both upload and download artifacts, as well as how to choose which files are stored in an artifact and where that artifact is downloaded to on the runner, we could expand this out to any number of interconnected jobs within a single workflow. For this simpler example, we focused on just two jobs (a sketch of how a third job could be chained on follows the full workflow below). The full, single workflow YAML would therefore look like this:

artifact_example.yml
name: Share data with artifacts

on:
  schedule:
    - cron: '0 12 * * *'  #run daily at 8 am EDT (UTC-04:00)
  workflow_dispatch:

jobs:  
  dl_sst: 
    runs-on: ubuntu-latest
    steps:  
    - name: Check out repository  
      uses: actions/checkout@v5  

    - name: Install R
      uses: r-lib/actions/setup-r@v2  
    
    - name: Install R packages
      uses: r-lib/actions/setup-r-dependencies@v2
      with:
        cache: always
        packages: |
          any::glue
          any::terra
      
    - name: Download and export SST
      shell: Rscript {0}
      run: source("Complex_GHAs/R/download_export_sst.R")
      
    - name: Upload SST artifact
      uses: actions/upload-artifact@v4
      with:
        name: erddap_sst
        path: mab_sst.tif
        retention-days: 1
 
 

  mean_sst: 
    runs-on: ubuntu-latest
    needs: dl_sst
    steps:  
    - name: Check out repository  
      uses: actions/checkout@v5  
    
    - name: Download artifact w SST data
      uses: actions/download-artifact@v5
      with:
        name: erddap_sst
        path: .

    - name: Install geospatial dependencies (e.g., GDAL)
      run: |
        sudo apt-get update
        sudo apt-get install -y libgdal-dev gdal-bin
    
    - name: Install xarray and other deps from existing reqs file
      run:  pip install -r requirements.txt
    
    - name: Calculate mean SST
      run: python Complex_GHAs/R/summarize_sst.py
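
If we did want to chain on additional jobs, the same pattern repeats: have the upstream job upload an artifact, then use needs and download-artifact in the downstream job. The sketch below is purely illustrative: the plot_sst job name, the sst_summary artifact, and the plotting script are all hypothetical, and would require the mean_sst job to upload its own artifact first.

  plot_sst:
    runs-on: ubuntu-latest
    needs: mean_sst  #wait for the mean SST job to finish successfully
    steps:
    - name: Check out repository
      uses: actions/checkout@v5

    - name: Download artifact w summarized SST
      uses: actions/download-artifact@v5
      with:
        name: sst_summary  #hypothetical artifact that mean_sst would first need to upload
        path: .

    - name: Plot summarized SST
      run: python Complex_GHAs/R/plot_sst.py  #hypothetical plotting script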

5 Takeaways

In this section, we covered another example of how data can be shared across the jobs of a single workflow. This can be helpful for reducing the number of files committed to your repo, thereby keeping it free of intermediate files that aren’t needed. It also provides a useful mechanism for limiting a workflow to the minimum set of files it needs when a repo is so large that a normal checkout step would overload the runner.