Workflow Introduction
Description & purpose: This Notebook explains what a workflow is (in the context of the EODH platform) and provides information on the technology surrounding the workflows.
Author(s): Alastair Graham
Date created: 2024-11-08
Date last modified: 2024-11-08
Licence: This file is licensed under Creative Commons Attribution-ShareAlike 4.0 International. Any included code is released using the BSD-2-Clause license.
Copyright (c) , All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
What is a Workflow?
One of the provisioned services made available to users of the EODH is the Workflow Runner (WR). More information about the WR is provided later on this page, but the fundamental question that requires answering is…
In the context of the EODH, what is a workflow?
Using the terminology of the EODH, a workflow is a file written in CWL (the Common Workflow Language) that creates a data processing chain. However, this CWL file only provides the orchestration of a wider set of tools that are brought together as an entity known as an Earth Observation Application Package (EOAP).
What is the Workflow Runner?
The WR is a required component of the Hub, designed to interpret CWL files and ensure that the algorithms wrapped within them are sensibly scaled across the available infrastructure. In the case of the EODH, the infrastructure is a Kubernetes cluster running on an AWS backend, but this is not a hard requirement.
The WR itself is a piece of software called ADES. If you want to understand the internal components of the WR then the ADES Design Document is available here: https://eoepca.github.io/proc-ades/master/.
The use case is that EOAPs can be transferred from platform to platform, e.g. they could be developed on EODH and run on EOEPCA, or the other way round. Platforms run by DLR and NASA also utilise EOAPs, and processing algorithms can be shared between them.
The ADES software uses the ZOO-Project as the main framework for exposing the OGC compliant web services. The ZOO-kernel powering the web services is included in the software package.
The ADES functions are designed to perform the processing and chaining function on a Kubernetes cluster using the Calrissian Tool. Calrissian uses CWL over Kubernetes and enables the implementation of each step in a workflow as a container. It provides simple, flexible mechanisms for specifying constraints between the steps in a workflow and artifact management for linking the output of any step as an input to subsequent steps.
What is CWL?
The Common Workflow Language (CWL) is an open standard designed for defining and executing data analysis workflows. It provides a formal way to describe the individual steps in a workflow, the inputs and outputs of each step, and how those steps are connected. CWL is platform-independent and focuses on portability, reproducibility, and scalability, enabling workflows to be executed in various environments, from personal computers to large cloud infrastructures.
Why CWL?
By fostering collaboration and standardisation, CWL plays a crucial role in advancing research, and is particularly used in fields such as bioinformatics and climate science.
Scripting workflows
While shell scripts or other code scripts (e.g. Python) can meet the needs of data processing workflows, using a formal workflow language such as CWL brings additional benefits such as abstraction and improved scalability and portability.
Computational workflows explicitly create a divide between a user's dataflow and the computational details which underpin the chain of tools. The dataflow is described by the workflow, and the tool implementation is specified by descriptors that remove the workflow complexity. Workflow managers such as cwltool (see below) help with the automation, monitoring and tracking of a dataflow. By producing computational workflows in a standardised format, and publishing them (alongside any data) with open access, the workflows become more FAIR (Findable, Accessible, Interoperable, and Reusable). The Common Workflow Language (CWL) standard has been developed to standardise workflow needs across different thematic areas.
Execution sequence
The generic execution sequence of a CWL process (including Workflows and CommandLineTools) is as follows.
- Load an input object.
- Load, process and validate a CWL document.
- If there are multiple process objects (due to $graph) then choose the process with the id of “#main” or “main”.
- Validate the input object against the inputs schema for the process.
- Perform any further setup required by the specific process type.
- Execute the process.
- Capture results of process execution into the output object.
- Validate the output object against the outputs schema for the process.
- Report the output object to the process caller.
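To make the sequence above concrete, here is a minimal, hypothetical CommandLineTool (the file name hello.cwl and its contents are illustrative only, and are not part of the workflow built below):

```yaml
# hello.cwl -- a minimal CommandLineTool that echoes a message
cwlVersion: v1.0
class: CommandLineTool
baseCommand: echo
inputs:
  message:
    type: string
    inputBinding:
      position: 1
outputs: []
```

Running cwltool hello.cwl --message "hello" walks through exactly the steps listed: the document and input object are loaded and validated, echo is executed, and the (empty) output object is reported to the caller.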
A simple example to introduce CWL
Context
Note: This workflow has been designed and tested using cwltool on a local machine.
The first thing to do when designing a workflow is understand the context of what is desired, and how that may need to be referred to in the workflow. For this example workflow, we will take a list of Sentinel-2 ARD images, clip them to an area of interest, and stack them. The flow will look like the following:
```mermaid
graph LR;
    get_data -- S2_ARD --> clip -- clipped --> stack -- stacked --> Output;
```
The next thing to do is access the data to be used in the Workflow. In this case we will download two bands of a Sentinel-2 image held on AWS. We will use the curl tool to do this, saving each accessed image as B0$.tif (where $ is the band number):

```shell
curl -o B04.tif https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/53/H/PA/2021/7/S2B_53HPA_20210723_0_L2A/B04.tif
curl -o B03.tif https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/53/H/PA/2021/7/S2B_53HPA_20210723_0_L2A/B03.tif
```
The commands that we will use in the workflow are all available through GDAL.
Clip the image
We will use gdal_translate to clip the larger image to a smaller, more manageable dataset. The gdal command that we can test, and that we will need to replicate in CWL, is:

```shell
gdal_translate -projwin ULX ULY LRX LRY -projwin_srs EPSG:4326 B04.tif B04_clipped.tif
```
where UL refers to the upper-left and LR to the lower-right X and Y coordinates.
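To make the corner ordering concrete, the following Python sketch (the function name is illustrative, not part of the workflow files) shows how a "minx,miny,maxx,maxy" bounding-box string maps onto the ULX ULY LRX LRY order that -projwin expects; the CWL tool description later performs the same index shuffle with JavaScript valueFrom expressions:

```python
def projwin_from_bbox(bbox: str) -> list[str]:
    """Map a "minx,miny,maxx,maxy" bbox string onto gdal_translate's
    -projwin order: ULX (minx), ULY (maxy), LRX (maxx), LRY (miny)."""
    minx, miny, maxx, maxy = bbox.split(",")
    return [minx, maxy, maxx, miny]

# The area of interest used later in this notebook:
print(projwin_from_bbox("136.659,-35.96,136.923,-35.791"))
# ['136.659', '-35.791', '136.923', '-35.96']
```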
Stack the clips
Similarly, we will use gdal_merge.py to construct the stacked image from the clipped images. This can be tested using the following command:

```shell
gdal_merge.py -separate B04_clipped.tif B03_clipped.tif -o stacked.tif
```
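If it helps to see the command assembled programmatically, this short Python sketch (illustrative only, not part of the workflow files) builds the gdal_merge.py argument list; -separate writes each input as its own band rather than mosaicking them:

```python
def merge_command(clipped: list[str], output: str = "stacked.tif") -> list[str]:
    """Build a gdal_merge.py invocation that stacks each input file
    as a separate band (-separate) into one output file (-o)."""
    return ["gdal_merge.py", "-separate", *clipped, "-o", output]

print(" ".join(merge_command(["B04_clipped.tif", "B03_clipped.tif"])))
# gdal_merge.py -separate B04_clipped.tif B03_clipped.tif -o stacked.tif
```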
Building the Workflow
Required files
There are three main files that are required to construct a CWL Workflow. These are:
- Dockerfile or existing online container
- CWL file
- YAML file
It may be that other files e.g. a .sh script or a Python script are also needed, depending on how bespoke and/or complex the desired workflow is.
cwltool
To run CWL workflows you will need a CWL runner. The most commonly used (locally) is cwltool, which is maintained by the CWL community. cwltool is the reference executor for Common Workflow Language standards and supports everything in the current CWL specification. cwltool can be installed using pip or variants of conda. More information can be found here and here, or via ReadTheDocs.
Containers
For the purposes of this example, we will be pulling the GDAL container from the OSGeo repository (see here).
NOTE: There are a number of different images that can be accessed. To use the .py tools available through GDAL, the 'GDAL Python' image is required.
If we wanted to, we could also build our own bespoke image using a Dockerfile and then run that. This is often done when data processing scripts need to be copied into the container.
We will also be using Podman as our container software. Podman is a drop-in replacement for Docker but does require the --podman argument in the cwltool command. If you are using Windows, or if you are more familiar with Docker, then Docker is the default containerisation method.
CWL files
For this example we require a CWL CommandLine file for both the clipping and stacking components of the workflow. We will also need a CWL Workflow file to bring these together and run the entire process. The next block of code outlines the overall Workflow file.
Note: This example is based on the example found here. Some errors were found in the original CWL files and the version presented here has been tested and is known to work on a local Linux (Debian based) system.
```yaml
cwlVersion: v1.0
class: Workflow
id: main
label: Sentinel-2 clipping and stacking
doc: "This workflow creates a stacked composite. File name: composite.cwl"

requirements:
  - class: ScatterFeatureRequirement

inputs:
  geotiff:
    doc: list of geotifs
    type: File[]
  bbox:
    doc: area of interest as a bounding box
    type: string
  epsg:
    doc: EPSG code
    type: string
    default: "EPSG:4326"

outputs:
  rgb:
    outputSource: node_concatenate/composite
    type: File

steps:
  node_translate:
    run: gdal-translate.cwl
    in:
      geotiff: geotiff
      bbox: bbox
      epsg: epsg
    out:
      - clipped_tif
    scatter: geotiff
    scatterMethod: dotproduct
  node_concatenate:
    run: concatenate2.cwl
    in:
      tifs:
        source: node_translate/clipped_tif
    out:
      - composite
```
From this example, we can see that we require two CommandLine CWL files: gdal-translate.cwl and concatenate2.cwl. Let’s deal with these in order.
```yaml
cwlVersion: v1.0
class: CommandLineTool
doc: This runs GDAL Translate to clip an image to bbox corner coordinates.

requirements:
  InlineJavascriptRequirement: {}
  DockerRequirement:
    dockerPull: ghcr.io/osgeo/gdal:ubuntu-small-latest

baseCommand: gdal_translate

arguments:
  - -projwin
  - valueFrom: ${ return inputs.bbox.split(",")[0]; }
  - valueFrom: ${ return inputs.bbox.split(",")[3]; }
  - valueFrom: ${ return inputs.bbox.split(",")[2]; }
  - valueFrom: ${ return inputs.bbox.split(",")[1]; }
  - valueFrom: ${ return inputs.geotiff.basename.replace(".tif", "") + "_clipped.tif"; }
    position: 8

inputs:
  geotiff:
    type: File
    inputBinding:
      position: 7
  bbox:
    type: string
  epsg:
    type: string
    default: "EPSG:4326"
    inputBinding:
      position: 6
      prefix: -projwin_srs
      separate: true

outputs:
  clipped_tif:
    type: File
    outputBinding:
      glob: '*_clipped.tif'
```
```yaml
cwlVersion: v1.0
class: CommandLineTool
doc: This runs GDAL Merge to stack images together.

requirements:
  InlineJavascriptRequirement: {}
  DockerRequirement:
    dockerPull: ghcr.io/osgeo/gdal:ubuntu-small-latest

baseCommand: gdal_merge.py

# gdal_merge.py -separate 1.tif 2.tif 3.tif -o rgb.tif
arguments:
  - -separate
  - valueFrom: ${ return inputs.tifs; }
  - -o
  - composite.tif

inputs:
  tifs:
    type: File[]

outputs:
  composite:
    type: File
    outputBinding:
      glob: '*.tif'
```
NOTE: YAML generally doesn’t play well with tabs as whitespace, so it is best practice to use spaces for indentation.
Running the Workflow
Now that we have our commandline CWL component files, and the Workflow CWL file that brings the tools together, we need to specify the input parameters. This is done using a parameters.yml file (the name of the file can be anything you want). The contents should follow the layout that we will be using:

```yaml
bbox: "136.659,-35.96,136.923,-35.791"
geotiff:
  - { "class": "File", "path": "../B04.tif" }
  - { "class": "File", "path": "../B03.tif" }
epsg: "EPSG:4326"
```
You will need to change the path parameter to match the location of your input files.
Now we run it with the command:

```shell
cwltool --podman composite.cwl composite-params.yml
```
Note: remember that if you are using Docker then you do not need the --podman argument.
Outputs
This workflow takes a couple of minutes to run, during which time the executed commands and their runtime messages are displayed on the command line. Once the workflow completes, the output file will be found in the directory from where the workflow was run. Intermediate files that are not specified in the out block in the workflow are automatically deleted.
The output .tif file can now be opened in QGIS or similar software to check that the output is as expected (in this case a two-band image covering the clipped area of interest within the extent of the input files).
Tips
You can pass --leave-tmpdir to the cwltool command. This is often helpful for working out whether the outputs from a step are what you think they should be.
Another good (non-spatial) tutorial can be found here: https://andrewjesaitis.com/posts/2017-02-06-cwl-tutorial/