Share data across jobs with artifacts
1 Background
Now that we’ve covered how to add multiple steps to a single job, how to run multiple jobs within a single workflow, and how to trigger a job based on the successful completion of another, we can build on these with additional features that may be helpful. If you’re using multiple jobs within a workflow, you will likely need to pass data from one job to the next. One option is to commit these data files to the repo and then pull them into the next job during the checkout step, but you may either a) not want to keep these intermediate or raw data files in your repo, or b) want to limit the number of files loaded onto the runner during the next job (especially if your full repo size may exceed the disk storage of the runner). A helpful alternative for passing data across jobs and storing it temporarily is the use of artifacts. Artifacts can be retained for between 1 and 90 days, which makes them a useful solution for short-term storage of data shared between jobs.
As briefly mentioned on the Basics page, your total monthly available artifact storage space is dictated by the type of GitHub account you have. This could constrain how long your workflow can keep running over the course of a month before the storage quota resets; see this blog post about a potential solution for cleaning up artifacts to prevent running into this problem and disrupting your workflow (a rough sketch of such a cleanup job follows the table below).
| Plan | Storage |
|---|---|
| Free | 500 MB |
| Pro | 2 GB |
| Enterprise Cloud | 50 GB |
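The linked post describes one approach in detail; purely as a hedged illustration (not that post's exact code), a scheduled cleanup job along these lines could delete stale artifacts through the GitHub REST API via the preinstalled gh CLI. The 2-day cutoff and the job name cleanup_artifacts are arbitrary choices for this sketch.

```yaml
# Hypothetical cleanup job: removes artifacts older than ~2 days so they
# don't accumulate against the account's storage quota.
cleanup_artifacts:
  runs-on: ubuntu-latest
  permissions:
    actions: write    # required to delete artifacts through the REST API
  steps:
    - name: Delete artifacts older than 2 days
      env:
        GH_TOKEN: ${{ github.token }}
      run: |
        gh api "repos/${{ github.repository }}/actions/artifacts" --paginate \
          --jq '.artifacts[] | select(.created_at < (now - 2*86400 | todate)) | .id' |
        while read -r id; do
          gh api -X DELETE "repos/${{ github.repository }}/actions/artifacts/$id"
        done
```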
Artifacts are used within a workflow by specifying which file(s) or directory you’d like to store, uploading them to GitHub using an action, and then downloading these files in the next job using another action. We’ll cover these steps in greater detail over the next couple of sections.
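Before walking through the details, here is a bare-bones sketch of that upload/download pattern; the job names, file name, and echo step are placeholders, and the real example in the following sections fills in each piece.

```yaml
jobs:
  make_data:
    runs-on: ubuntu-latest
    steps:
      - run: echo "some output" > results.csv   # any step(s) that create the file
      - uses: actions/upload-artifact@v4
        with:
          name: my-data
          path: results.csv
  use_data:
    runs-on: ubuntu-latest
    needs: make_data                             # wait for the producing job
    steps:
      - uses: actions/download-artifact@v5
        with:
          name: my-data                          # pulls results.csv onto this runner
```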
2 Upload artifact
Once the job you’ve run in the workflow has performed all of the steps of interest, you will add one more step that uses the upload-artifact action. This action includes a number of different options that can be specified by the user or left at their defaults. Below is an example showing how this action may be used:
artifact_example.yml
name: Share data with artifacts
on:
  schedule:
    - cron: '0 12 * * *' # run daily at 8 am EDT (UTC-04:00)
  workflow_dispatch:

jobs:
  dl_sst:
    runs-on: ubuntu-latest
    steps:
      - name: Check out repository
        uses: actions/checkout@v5
      - name: Install R
        uses: r-lib/actions/setup-r@v2
      - name: Install R packages
        uses: r-lib/actions/setup-r-dependencies@v2
        with:
          packages: |   # <1>
            any::glue
            any::terra
      - name: Download and export SST
        shell: Rscript {0}
        run: source("Complex_GHAs/R/download_export_sst.R")   # <2>
      - name: Upload SST artifact
        uses: actions/upload-artifact@v4   # <3>
        with:
          name: erddap_sst   # <4>
          path: mab_sst.tif   # <5>
          retention-days: 1   # <6>
1. Only need a couple of packages for this job
2. Short R script to download and export SST data
3. Action for uploading an artifact
4. The name we want to give the artifact object
5. The path to the file (or directory) that we’d like to store in the artifact (a variation with multiple paths is sketched after this list)
6. The number of days we want this artifact to be retained by GitHub (default is 90 days)
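As point 5 notes, path is not limited to a single file: the upload-artifact action also accepts a directory, glob patterns, or several paths listed on separate lines. A hypothetical variation (the output/ and logs/ paths are made up for illustration):

```yaml
- name: Upload all exported rasters and the run log
  uses: actions/upload-artifact@v4
  with:
    name: erddap_sst
    path: |
      output/*.tif
      logs/run_log.txt
    retention-days: 1
```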
If this workflow YAML runs successfully, it will complete by storing the mab_sst.tif file with our data in the erddap_sst artifact. With an artifact now available for later use, we can then download this object in subsequent jobs.
3 Download artifact
Now we will do the converse of the previous job: the artifact will be downloaded toward the beginning of this next job, right after checking out the repo. This is performed using the download-artifact action, which also includes a number of different options that may be specified during use. Use of this action is demonstrated in the example below:
artifact_example.yml
  mean_sst:   # <1>
    runs-on: ubuntu-latest
    needs: dl_sst   # <2>
    steps:
      - name: Check out repository
        uses: actions/checkout@v5
      - name: Download artifact w SST data
        uses: actions/download-artifact@v5   # <3>
        with:
          name: erddap_sst   # <4>
          path: .   # <5>
      - name: Install geospatial dependencies (e.g., GDAL)
        run: |
          sudo apt-get update
          sudo apt-get install -y libgdal-dev gdal-bin
      - name: Install xarray and other deps from existing reqs file
        run: pip install -r requirements.txt   # <6>
      - name: Calculate mean SST
        run: python Complex_GHAs/R/summarize_sst.py   # <7>
1. Name given to the second job
2. Syntax for specifying that the mean_sst job needs to wait for the dl_sst job to successfully complete before this job starts
3. Action for downloading an artifact
4. Name of the artifact we defined in the previous job
5. (optional) Specify the path where you’d like the files from the artifact to be added. In this case, I’m using the period (.) syntax to refer to the current directory (which is also the root dir); a variation that downloads every artifact from the run is sketched after this list
6. Command to install xarray and other Python packages for handling netCDF files
7. Command to run the Python script that calculates mean SST from the raster file
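Related to points 4 and 5, the download side also has more flexible forms. As a hedged sketch (the data/ directory is an arbitrary choice here), omitting name tells download-artifact to fetch every artifact from the run, and recent versions of the action place each artifact in its own subdirectory under path:

```yaml
- name: Download all artifacts from this run
  uses: actions/download-artifact@v5
  with:
    path: data/   # each artifact ends up in its own subfolder under data/
```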
4 Putting it all together
Now that we know how to upload and download artifacts, how to choose which files are stored within an artifact, and where that artifact is downloaded to on the runner, we could expand this out to any number of interconnected jobs within a single workflow. For this simpler example, we focused on just two jobs. The full, single workflow YAML would therefore look like this:
artifact_example.yml
name: Share data with artifacts
on:
  schedule:
    - cron: '0 12 * * *' # run daily at 8 am EDT (UTC-04:00)
  workflow_dispatch:

jobs:
  dl_sst:
    runs-on: ubuntu-latest
    steps:
      - name: Check out repository
        uses: actions/checkout@v5
      - name: Install R
        uses: r-lib/actions/setup-r@v2
      - name: Install R packages
        uses: r-lib/actions/setup-r-dependencies@v2
        with:
          cache: always
          packages: |
            any::glue
            any::terra
      - name: Download and export SST
        shell: Rscript {0}
        run: source("Complex_GHAs/R/download_export_sst.R")
      - name: Upload SST artifact
        uses: actions/upload-artifact@v4
        with:
          name: erddap_sst
          path: mab_sst.nc
          retention-days: 1

  mean_sst:
    runs-on: ubuntu-latest
    needs: dl_sst
    steps:
      - name: Check out repository
        uses: actions/checkout@v5
      - name: Download artifact w SST data
        uses: actions/download-artifact@v5
        with:
          name: erddap_sst
          path: .
      - name: Install geospatial dependencies (e.g., GDAL)
        run: |
          sudo apt-get update
          sudo apt-get install -y libgdal-dev gdal-bin
      - name: Install xarray and other deps from existing reqs file
        run: pip install -r requirements.txt
      - name: Calculate mean SST
        run: python Complex_GHAs/R/summarize_sst.py
5 Takeaways
In this section, we covered another example of how data can be shared across the jobs of a single workflow. This can be helpful for reducing the number of files committed to your repo, thereby keeping it decluttered of intermediate files that aren’t needed. It also provides a useful mechanism for specifying the minimum set of files needed within a workflow when a very large repo would overload the runner during a normal checkout step.