```mermaid
flowchart LR
  A(Download raw data) --> B(Process data)
  B --> C(Make model predictions)
  C --> D(Create output products)
  D --> E(Update report or website)

  style A fill:#F0F8FF,stroke:#008ECC,stroke-width:2px,font-size:16pt
  style B fill:#F0F8FF,stroke:#008ECC,stroke-width:2px,font-size:16pt
  style C fill:#F0F8FF,stroke:#008ECC,stroke-width:2px,font-size:16pt
  style D fill:#F0F8FF,stroke:#008ECC,stroke-width:2px,font-size:16pt
  style E fill:#F0F8FF,stroke:#008ECC,stroke-width:2px,font-size:16pt
```
Run multiple jobs in a single workflow
1 Background
The next logical extension of running a set of scripts and producing new outputs within a job is to string together multiple `jobs` that represent different steps of your workflow. While the previous section covered an example that ran multiple scripts within a single `step` of a `job`, there may be instances where it makes more sense to distribute certain scripts across separate `jobs` (similar to how code is organized into separate scripts).
For example, you may have a set of R/Python/MATLAB scripts (in one or more languages) that each perform a major task within your workflow, such as downloading raw data, processing data, making model predictions, creating output products, and updating a report or website. However, certain scripts may have different computational needs that require switching among different runners, some scripts may need to run only after the successful completion of one or more initial scripts, or you may want multiple types of jobs running concurrently to speed up the workflow. In these cases, it may make sense to use multiple `jobs` within a single workflow. Note that users will need to decide whether to run actions within a separate `job` or as a separate workflow, as covered in the section on events. In the following examples, we’ll cover how to 1) run multiple `jobs` concurrently, and 2) run `jobs` sequentially.
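Before diving into the full examples, here is a bare-bones sketch of both patterns (the job names and `echo` commands are placeholders, not part of the examples that follow). Jobs listed under `jobs:` run concurrently by default; adding `needs` makes a job wait for others:

```yaml
name: Minimal multi-job sketch

on:
  workflow_dispatch:

jobs:
  job_a:
    runs-on: ubuntu-latest
    steps:
      - run: echo "job_a starts immediately"

  job_b:
    runs-on: ubuntu-latest
    steps:
      - run: echo "job_b starts immediately, in parallel with job_a"

  job_c:
    runs-on: ubuntu-latest
    needs: [job_a, job_b]  # wait for both jobs to succeed first
    steps:
      - run: echo "job_c runs only after job_a and job_b succeed"
```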
2 Concurrent jobs
In this example, let’s say we’re interested in obtaining environmental data from different sources (e.g., CMEMS, ROMS) that will serve as input for different ecological or ecosystem models. Since we may need multiple environmental variables from each data source, it makes sense to use different scripts to access these data. Additionally, I have previously experienced issues trying to access ROMS data via OPeNDAP on an `ubuntu` runner, but these issues go away when using a `windows` runner. So let’s run two separate jobs concurrently using two types of runners:
concurrent_jobs.yml
```yaml
name: Run jobs concurrently

on:
  workflow_dispatch:

jobs:
  download_cmems:
    runs-on: ubuntu-latest
    steps:
      - name: Check out repository
        uses: actions/checkout@v5

      - name: Install Conda
        uses: conda-incubator/setup-miniconda@v3
        with:
          auto-update-conda: true
          channels: conda-forge,defaults
          python-version: 3.12

      - name: Install Copernicus Marine Toolbox
        shell: bash -el {0}
        run: |
          conda install -c conda-forge copernicusmarine
          conda install scipy

      - name: Install R
        uses: r-lib/actions/setup-r@v2

      - name: Install R packages
        uses: r-lib/actions/setup-r-dependencies@v2
        with:
          cache: always
          packages: |
            any::glue

      - name: Download CMEMS data
        env:
          COPERNICUSMARINE_SERVICE_USERNAME: ${{ secrets.COPERNICUSMARINE_SERVICE_USERNAME }}
          COPERNICUSMARINE_SERVICE_PASSWORD: ${{ secrets.COPERNICUSMARINE_SERVICE_PASSWORD }}
        shell: Rscript {0}
        run: |
          source("Complex_GHAs/src/acquire_cmems.R")

  download_roms:
    runs-on: windows-latest
    env:
      GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
    steps:
      - name: Check out repository
        uses: actions/checkout@v5

      - name: Install R
        uses: r-lib/actions/setup-r@v2

      - name: Install R packages
        uses: r-lib/actions/setup-r-dependencies@v2
        with:
          cache: always
          packages: |
            any::glue
            any::ncdf4
            any::terra
            any::lubridate

      - name: Download ROMS data
        shell: Rscript {0}
        run: |
          source("Complex_GHAs/src/acquire_roms.R")
```
As you can see in the workflow YAML, we really haven’t done anything new other than adding a second job to the workflow. In this example, we have separate jobs named `download_cmems` and `download_roms` that each run on a different runner (`ubuntu` and `windows`, respectively). Running these jobs concurrently makes sense since they are not dependent on one another. This may not be the case for all workflows, however.
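As an aside, if two jobs shared the same steps and differed only in their runner, a matrix `strategy` could express the same concurrency more compactly. Here, `download_cmems` and `download_roms` run different scripts with different dependencies, so separate jobs are the clearer choice, but a hypothetical matrix sketch (the job name and `echo` command are placeholders) would look like:

```yaml
jobs:
  download_data:
    strategy:
      matrix:
        os: [ubuntu-latest, windows-latest]
    runs-on: ${{ matrix.os }}  # one concurrent job per matrix entry
    steps:
      - uses: actions/checkout@v5
      - name: Download data
        run: echo "Downloading on ${{ matrix.os }}"
```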
3 Sequential jobs
If we wanted to download environmental data and then transform or summarize them, this would require that the `jobs` run sequentially. Luckily, this new workflow pattern only requires the addition of a single line of code to the YAML. The example below shows how to download sea surface temperature data from ERDDAP, push this netCDF file to the repo, and then summarize these data:
sequential_jobs.yml
```yaml
name: Run jobs sequentially

on:
  workflow_dispatch:

jobs:
  dl_sst:
    runs-on: ubuntu-latest
    permissions:
      contents: write
    env:
      GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
    steps:
      - name: Check out repository
        uses: actions/checkout@v5

      - name: Install Conda
        uses: conda-incubator/setup-miniconda@v3
        with:
          auto-update-conda: true
          python-version: 3.12

      - name: Install Python packages
        run: pip install -r requirements.txt

      - name: Download and export SST
        run: python Complex_GHAs/src/download_export_sst2.py

      - name: Commit and Push Changes
        run: |
          git config --global user.name "${{ github.actor }}"
          git config --global user.email "${{ github.actor }}@users.noreply.github.com"
          git add .
          git commit -m 'Added new ERDDAP SST file'
          git push

  mean_sst:
    runs-on: ubuntu-latest
    needs: dl_sst
    steps:
      - name: Check out repository
        uses: actions/checkout@v5

      - name: Install R
        uses: r-lib/actions/setup-r@v2

      - name: Install R packages
        uses: r-lib/actions/setup-r-dependencies@v2
        with:
          cache: always
          packages: |
            any::ncdf4
            any::terra

      - name: Calculate mean SST
        shell: Rscript {0}
        run: source("Complex_GHAs/src/summarize_sst2.R")
```
Using the `needs` argument at the same indent level as `runs-on` and `steps`, we specify the name of the `job` (i.e., `dl_sst`) that this new `job` (i.e., `mean_sst`) depends on.
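Two additional behaviors of `needs` are worth noting: a job can depend on several jobs by passing a list, and by default a dependent job is skipped when any job it needs fails, unless you override its `if` condition. A sketch (the `report` job and its `echo` command are hypothetical additions, not part of the workflow above):

```yaml
jobs:
  report:
    runs-on: ubuntu-latest
    needs: [dl_sst, mean_sst]  # waits for both jobs
    # By default this job is skipped if either needed job fails;
    # uncomment the next line to run it regardless of their outcome:
    # if: ${{ always() }}
    steps:
      - run: echo "All upstream jobs finished"
```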
Although both of these `jobs` use the same `ubuntu` runner, separating different components of the workflow into jobs modularizes how it runs. This is particularly useful for debugging issues with workflows, and it makes them generally easier to follow compared to running all scripts within a single `step`.
4 Takeaways
In this section, we covered how to extend a workflow from running scripts in a single job to running multiple jobs concurrently or sequentially. Separating the tasks of a workflow into separate `jobs` instead of `steps` can make GitHub Actions workflows more modular and easier to update and debug over time. In the example on running sequential `jobs`, we demonstrated how files could be pushed to the repo in Job 1 and then read into Job 2 during the `checkout` step. Alternatively, we may not want to push any intermediate files to the repo, but instead use another method of bundling certain files from Job 1 for use in Job 2. This method of creating `artifacts` to be used across `jobs` will be covered in the next section.