
Observation-Management-Service/ewms-task-management-service


ewms-task-management-service v1

A Task Management Service for EWMS

The TMS is the central component responsible for communication between the WMS and an HTCondor pool. It runs on an HTCondor Access Point (AP). This service:

  • Starts condor clusters for new taskforces (1:1), see taskforce.
  • Stops condor clusters (condor_rm) when necessary.
  • Watches condor clusters, snapshots taskforce-level stats, and relays information to the WMS.

Overview

In short, the TMS receives its instructions from the Workflow Management Service (WMS).

Starting and Stopping Taskforces/Clusters

Internally, the service makes routine calls to the WMS to determine whether to start or stop clusters for specific taskforces.
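The start/stop cycle described above can be sketched as a single polling pass. This is a minimal illustration, not the service's actual code: the `Taskforce` fields, the `run_once` function, and the four callables (two standing in for WMS requests, two standing in for cluster submission and `condor_rm`) are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Taskforce:
    """Hypothetical, simplified view of a taskforce."""
    taskforce_uuid: str
    cluster_id: Optional[int] = None  # set once a condor cluster exists


def run_once(
    next_to_start: Callable[[], Optional[Taskforce]],  # stands in for a WMS call
    next_to_stop: Callable[[], Optional[Taskforce]],   # stands in for a WMS call
    submit: Callable[[Taskforce], int],                # stands in for cluster submission
    remove: Callable[[int], None],                     # stands in for condor_rm
) -> List[str]:
    """One polling pass: ask the WMS what is pending, then act on it."""
    actions = []

    tf = next_to_start()
    if tf is not None:
        # One new cluster per taskforce (1:1).
        tf.cluster_id = submit(tf)
        actions.append(f"started {tf.taskforce_uuid} as cluster {tf.cluster_id}")

    tf = next_to_stop()
    if tf is not None and tf.cluster_id is not None:
        remove(tf.cluster_id)
        actions.append(f"stopped {tf.taskforce_uuid} (cluster {tf.cluster_id})")

    return actions
```

Passing the WMS and HTCondor interactions in as callables keeps the sketch runnable without either system; a real service would call this kind of function repeatedly on a timer.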

Watching the Job Event Logs

Concurrently, the service reads the job event logs and sends the WMS updates for each taskforce found in them. Taskforces share a job event log if they start on the same day. A new file is created as needed, and files are deleted after a period of inactivity.
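The per-day sharing and inactivity cleanup can be illustrated roughly as follows. The file-name pattern, the function names, and the use of file modification time as the measure of inactivity are assumptions made for this sketch; the real naming scheme and timeout are not stated here.

```python
import time
from datetime import date
from pathlib import Path
from typing import List, Optional

LOG_SUFFIX = ".tms.jel"  # hypothetical file-name suffix


def job_event_log_path(log_dir: Path, start_day: date) -> Path:
    """Taskforces that start on the same day map to the same file."""
    return log_dir / f"{start_day.isoformat()}{LOG_SUFFIX}"


def delete_inactive_logs(
    log_dir: Path, max_idle_seconds: float, now: Optional[float] = None
) -> List[Path]:
    """Delete logs not modified within max_idle_seconds; return what was deleted."""
    now = time.time() if now is None else now
    deleted = []
    for path in sorted(log_dir.glob(f"*{LOG_SUFFIX}")):
        if now - path.stat().st_mtime > max_idle_seconds:
            path.unlink()
            deleted.append(path)
    return deleted
```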

To keep the TMS stateless, snapshotted taskforce updates are re-sent to the WMS whenever the TMS restarts; the WMS handles these appropriately.
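One way to see why re-sending after a restart is harmless: if a snapshot is derived purely from the contents of the log, replaying the same log always yields the same snapshot. The sketch below assumes a simplified event shape, `(cluster_id, proc_id, status)`, and a hypothetical `snapshot` function; the actual event format and the statistics the TMS collects may differ.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Event = Tuple[int, int, str]  # (cluster_id, proc_id, status), a simplified shape


def snapshot(events: Iterable[Event]) -> Dict[int, Dict[str, int]]:
    """Count jobs per status for each cluster, using each job's latest status."""
    latest = {}
    for cluster_id, proc_id, status in events:
        # Later events for the same job overwrite earlier ones.
        latest[(cluster_id, proc_id)] = status

    counts: Dict[int, Counter] = {}
    for (cluster_id, _proc_id), status in latest.items():
        counts.setdefault(cluster_id, Counter())[status] += 1
    return {cluster_id: dict(c) for cluster_id, c in counts.items()}
```

Because the function keeps no state between calls, rebuilding the snapshot from the start of the log after a restart gives the same result as before the restart.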

How to Build

The image-publish.yml GitHub Actions workflow publishes this package as an Apptainer image in CVMFS when a new release is made.

How to Run

In production, the TMS runs on an HTCondor Access Point (AP) using systemd. Files for this are in tms-prod/ and tms-dev/; additional helper scripts are in resources/systemd/.

Whichever systemd variant you choose, an envfile is required. The file for tms-prod looks something like this (minus the redactions):

EWMS_ADDRESS="https://ewms-prod.icecube.aq"
EWMS_CLIENT_ID="ewms-tms-prod"
EWMS_CLIENT_SECRET="XXXX"
EWMS_TOKEN_URL="https://keycloak.icecube.wisc.edu/auth/realms/IceCube"

JOB_EVENT_LOG_DIR="/.../tms-prod/jobs"

TMS_ENV_VARS_AND_VALS_ADD_TO_PILOT="_EWMS_PILOT_APPTAINER_BUILD_WORKDIR=/srv/var_tmp/"
TMS_WATCHER_INTERVAL="15"
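Before starting the service, it can help to check that an envfile of this shape parses and contains the expected keys. The following is a small sketch of such a check: `parse_envfile`, `missing_keys`, and the choice of which keys count as required are assumptions for illustration, not part of the TMS.

```python
from typing import Dict, Iterable, List

# Assumed to be required for this example; taken from the sample file above.
REQUIRED_KEYS = (
    "EWMS_ADDRESS",
    "EWMS_CLIENT_ID",
    "EWMS_CLIENT_SECRET",
    "EWMS_TOKEN_URL",
    "JOB_EVENT_LOG_DIR",
)


def parse_envfile(text: str) -> Dict[str, str]:
    """Parse simple KEY="value" lines, skipping blanks and '#' comments."""
    env = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so values may themselves contain '='.
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a KEY=VALUE line: {raw!r}")
        value = value.strip()
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]  # drop one pair of surrounding quotes
        env[key.strip()] = value
    return env


def missing_keys(env: Dict[str, str], required: Iterable[str] = REQUIRED_KEYS) -> List[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in required if not env.get(key)]
```

Splitting on the first `=` matters for values such as `TMS_ENV_VARS_AND_VALS_ADD_TO_PILOT`, whose value is itself a `NAME=value` pair.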

How to Update in Production

Use the helper script, update_tms_image_symlink.sh, to roll out a new TMS version on an HTCondor Access Point (AP) using systemd:

ewms@sub-2 ~/resources/systemd/tms-dev $ ./update_tms_image_symlink.sh v1.2.3
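The script's internals are not shown here. Judging only by its name, it repoints a symlink at the image for the requested version; under that assumption, the general technique of swapping a symlink atomically can be sketched as below. The `repoint_symlink` function and its paths are hypothetical and do not describe the actual script.

```python
import os
from pathlib import Path


def repoint_symlink(link: Path, target: Path) -> None:
    """Point `link` at `target` without a window in which `link` is missing."""
    tmp = link.with_name(link.name + ".tmp")
    if tmp.is_symlink() or tmp.exists():
        tmp.unlink()  # clear leftovers from an interrupted earlier run
    os.symlink(target, tmp)
    # Renaming over the old link is atomic on POSIX filesystems.
    os.replace(tmp, link)
```

Creating the new link under a temporary name and renaming it into place means a process that resolves the link at any moment sees either the old target or the new one.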

EWMS Glossary Applied to the TMS

Workflow

Does not exist within the TMS. Compare to WMS.

Task

A task is not a first-order object in the TMS. However, each taskforce holds a reference to a container, arguments, environment variables, etc. Collectively, these comprise a task. Compare to WMS.

Task Directive

Does not exist within the TMS. Compare to WMS.

Taskforce

The taskforce is the primary object within the TMS. It is associated with one condor cluster. See Taskforce's cluster_id.
Compare to WMS.

Cluster

The cluster is the realization of a taskforce within an HTCondor pool. The two are mapped 1:1 and are nearly synonymous at a high level.

However, the term "cluster" is used exclusively within the context of an HTCondor pool, the job event log, and debugging. Unlike the taskforce, the cluster is not relevant in the broader EWMS context.


About

EWMS's Task Management Service (TMS): The HTCondor Interface
