Run multiple jobs in a single workflow

1 Background

The next logical extension of running a set of scripts and producing new outputs within a job is to string together multiple jobs that represent different steps of your workflow. While we just covered an example in the previous section of running multiple scripts within a single step of a job, there may be instances where it makes more sense to distribute certain scripts across separate jobs (similar to how code is organized into separate scripts).

For example, you may have a set of R/Python/MATLAB scripts (in one or more languages) that each perform a major task within your workflow:

flowchart LR
    A(Download raw data) --> B(Process data)
    B --> C(Make model predictions)
    C --> D(Create output products)
    D --> E(Update report or website)
    style A fill:#F0F8FF,stroke:#008ECC,stroke-width:2px,font-size:16pt
    style B fill:#F0F8FF,stroke:#008ECC,stroke-width:2px,font-size:16pt
    style C fill:#F0F8FF,stroke:#008ECC,stroke-width:2px,font-size:16pt
    style D fill:#F0F8FF,stroke:#008ECC,stroke-width:2px,font-size:16pt
    style E fill:#F0F8FF,stroke:#008ECC,stroke-width:2px,font-size:16pt
    linkStyle 0,1,2,3 stroke:grey
However, certain scripts may have different computational needs that require switching among different runners; you may want some scripts to run only after the successful completion of one or more initial scripts; or you may want multiple types of jobs running concurrently to speed up the workflow. In these cases, it may make sense to use multiple jobs within a single workflow. Users will still need to decide whether to run actions as a separate job or as a separate workflow, as covered in the section on events. In the following examples, we'll cover how to 1) run multiple jobs concurrently, and 2) run jobs sequentially.
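To make the distinction concrete, below is a minimal sketch (with hypothetical job IDs and placeholder echo steps, not part of the examples that follow) showing both patterns in one workflow: jobs without a needs key start concurrently by default, while adding needs makes a job wait for others to finish.

name: Concurrent and sequential jobs (sketch)

on:
  workflow_dispatch:

jobs:
  job_a:  # starts immediately
    runs-on: ubuntu-latest
    steps:
      - run: echo "Job A"
  job_b:  # also starts immediately, running concurrently with job_a
    runs-on: windows-latest
    steps:
      - run: echo "Job B"
  job_c:  # waits until both job_a and job_b complete successfully
    runs-on: ubuntu-latest
    needs: [job_a, job_b]
    steps:
      - run: echo "Job C"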
2 Concurrent jobs
In this example, let's say we're interested in obtaining environmental data from different sources (e.g., CMEMS, ROMS) that will serve as input for different ecological or ecosystem models. Since we may need multiple environmental variables from each data source, it makes sense to use separate scripts to access these data. Additionally, I have previously experienced issues trying to access ROMS data via OPeNDAP on an Ubuntu runner, but these issues go away when using a Windows runner. So let's run two separate jobs concurrently using two types of runners:
concurrent_jobs.yml
name: Run jobs concurrently

on:
  workflow_dispatch:

jobs:
  download_cmems:
    runs-on: ubuntu-latest
    steps:
      - name: Check out repository
        uses: actions/checkout@v5
      - name: Install Conda
        uses: conda-incubator/setup-miniconda@v3
        with:
          auto-update-conda: true
          channels: conda-forge,defaults
          python-version: 3.12
      - name: Install Copernicus Marine Toolbox
        shell: bash -el {0}
        run: |
          conda install -c conda-forge copernicusmarine
          conda install scipy
      - name: Install R
        uses: r-lib/actions/setup-r@v2
      - name: Install R packages
        uses: r-lib/actions/setup-r-dependencies@v2
        with:
          cache: always
          packages: |
            any::glue
      - name: Download CMEMS data
        env:
          COPERNICUSMARINE_SERVICE_USERNAME: ${{ secrets.COPERNICUSMARINE_SERVICE_USERNAME }}
          COPERNICUSMARINE_SERVICE_PASSWORD: ${{ secrets.COPERNICUSMARINE_SERVICE_PASSWORD }}
        shell: Rscript {0}
        run: |
          source("Complex_GHAs/src/acquire_cmems.R")

  download_roms:
    runs-on: windows-latest
    env:
      GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
    steps:
      - name: Check out repository
        uses: actions/checkout@v5
      - name: Install R
        uses: r-lib/actions/setup-r@v2
      - name: Install R packages
        uses: r-lib/actions/setup-r-dependencies@v2
        with:
          cache: always
          packages: |
            any::glue
            any::ncdf4
            any::terra
            any::lubridate
      - name: Download ROMS data
        shell: Rscript {0}
        run: |
          source("Complex_GHAs/src/acquire_roms.R")
As you can see in the workflow YAML, we really haven't done anything new other than adding a second job to the workflow. So in this example, we have separate jobs named download_cmems and download_roms that each run on a different runner (Ubuntu and Windows). Running these jobs concurrently makes sense since they are not dependent on one another. This may not be the case for all workflows, however.
3 Sequential jobs
If we wanted to download environmental data and then transform or summarize them, the jobs would need to run sequentially. Luckily, this new workflow pattern only requires the addition of a single line to the YAML. The example below shows how to download sea surface temperature data from ERDDAP, push this netCDF file to the repo, and then summarize these data:
sequential_jobs.yml
name: Run jobs sequentially

on:
  workflow_dispatch:

jobs:
  dl_sst:
    runs-on: ubuntu-latest
    permissions:
      contents: write
    env:
      GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
    steps:
      - name: Check out repository
        uses: actions/checkout@v5
      - name: Install Conda
        uses: conda-incubator/setup-miniconda@v3
        with:
          auto-update-conda: true
          python-version: 3.12
      - name: Install Python packages
        run: pip install -r requirements.txt
      - name: Download and export SST
        run: python Complex_GHAs/src/download_export_sst2.py
      - name: Commit and Push Changes
        run: |
          git config --global user.name "${{ github.actor }}"
          git config --global user.email "${{ github.actor }}@users.noreply.github.com"
          git add .
          git commit -m 'Added new ERDDAP SST file'
          git push

  mean_sst:
    runs-on: ubuntu-latest
    needs: dl_sst  # (1)
    steps:
      - name: Check out repository
        uses: actions/checkout@v5
      - name: Install R
        uses: r-lib/actions/setup-r@v2
      - name: Install R packages
        uses: r-lib/actions/setup-r-dependencies@v2
        with:
          cache: always
          packages: |
            any::ncdf4
            any::terra
      - name: Calculate mean SST
        shell: Rscript {0}
        run: source("Complex_GHAs/src/summarize_sst2.R")

1. Using the needs argument at the same indentation level as runs-on and steps, we specify the name of the job (i.e., dl_sst) that this new job (i.e., mean_sst) depends on.
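Beyond ordering jobs, needs also exposes a needs context that can pass small values between sequential jobs via job outputs. As a minimal sketch (the dl step ID, nc_file output, and file name below are hypothetical, not part of the example above):

jobs:
  dl_sst:
    runs-on: ubuntu-latest
    outputs:
      nc_file: ${{ steps.dl.outputs.nc_file }}  # expose a step output at the job level
    steps:
      - name: Download SST
        id: dl
        run: echo "nc_file=sst_latest.nc" >> "$GITHUB_OUTPUT"  # hypothetical file name
  mean_sst:
    runs-on: ubuntu-latest
    needs: dl_sst
    steps:
      - name: Use the value produced by the first job
        run: echo "Summarizing ${{ needs.dl_sst.outputs.nc_file }}"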
Although both of these jobs use the same Ubuntu runner, they separate different components of the workflow, making it more modular. This is particularly useful for debugging issues with workflows and generally makes them easier to follow compared to running all scripts within a single step.
4 Takeaways
In this section, we covered how to extend a workflow from running scripts in a single job to running multiple jobs concurrently or sequentially. Separating the tasks of a workflow into separate jobs instead of steps can make GitHub Actions workflows more modular and easier to update and debug over time. In the example on running sequential jobs, we demonstrated how files could be pushed to the repo in Job 1 and then read into Job 2 during the checkout step. Alternatively, we may not want to push any intermediate files to the repo, but instead use another method of bundling certain files from Job 1 to be used in Job 2. This method of creating artifacts to be used across jobs will be covered in the next section.
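As a brief preview of that approach, a minimal sketch (with hypothetical job IDs, artifact name, and file name) using the actions/upload-artifact and actions/download-artifact actions might look like:

jobs:
  job_one:
    runs-on: ubuntu-latest
    steps:
      - name: Create a file to share
        run: echo "example" > shared_output.txt  # stand-in for a real data file
      - name: Bundle the file as an artifact instead of committing it
        uses: actions/upload-artifact@v4
        with:
          name: shared-output  # hypothetical artifact name
          path: shared_output.txt
  job_two:
    runs-on: ubuntu-latest
    needs: job_one
    steps:
      - name: Retrieve the artifact produced by job_one
        uses: actions/download-artifact@v4
        with:
          name: shared-output
      - name: Use the shared file
        run: cat shared_output.txt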