A Task Management Service for EWMS
The TMS is the central component responsible for communication between the WMS and an HTCondor pool. It runs on an HTCondor Access Point (AP). This service:
- Starts condor clusters for new taskforces (1:1), see taskforce.
- Stops condor clusters (condor_rm) when necessary.
- Watches condor clusters, snapshots taskforce-level stats, and relays information to the WMS.
In short, the TMS receives its instructions from the Workflow Management Service (WMS).
Internally, the service makes routine calls to the WMS to determine whether to start or stop clusters for specific taskforces.
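The start/stop polling described above can be sketched roughly as follows. This is a minimal illustration, not the TMS's actual implementation: the `Instruction` type, the `poll_once` function, and the three callables are hypothetical names introduced here, and the real WMS request and HTCondor calls are deliberately left out.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Instruction:
    """Hypothetical shape of one instruction received from the WMS."""

    action: str  # "start" or "stop"
    taskforce_uuid: str


def poll_once(
    fetch_next: Callable[[], Optional[Instruction]],
    start_cluster: Callable[[str], None],
    stop_cluster: Callable[[str], None],
) -> Optional[str]:
    """Ask the WMS for one instruction and act on it.

    Returns the action taken, or None if the WMS had nothing to do.
    """
    instruction = fetch_next()  # assumed to wrap a request to the WMS
    if instruction is None:
        return None
    if instruction.action == "start":
        # One taskforce maps to one condor cluster (1:1).
        start_cluster(instruction.taskforce_uuid)
    elif instruction.action == "stop":
        # In the real service this would amount to a condor_rm.
        stop_cluster(instruction.taskforce_uuid)
    else:
        raise ValueError(f"unknown action: {instruction.action!r}")
    return instruction.action
```

A routine loop would call `poll_once` repeatedly with a sleep between calls. Passing the WMS and HTCondor operations in as callables keeps the decision logic testable without a live pool.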
Concurrently, the service sends updates to the WMS for each taskforce in a job event log. Taskforces share a job event log if they start on the same day. A new file is created as needed, and files are deleted after a period of inactivity.
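As a rough sketch of the job event log handling described above: taskforces that start on the same day resolve to the same file, and files that have been inactive for too long are removed. The file naming scheme (`tms-<date>.log`), the function names, and the use of file modification time as the inactivity signal are all assumptions made for illustration; the TMS may do this differently.

```python
from datetime import date, datetime, timedelta
from pathlib import Path


def job_event_log_path(log_dir: Path, start_day: date) -> Path:
    """Return the log file shared by all taskforces started on `start_day`."""
    # Hypothetical naming scheme: one file per calendar day.
    return log_dir / f"tms-{start_day.isoformat()}.log"


def is_stale(last_modified: datetime, now: datetime, max_idle: timedelta) -> bool:
    """Return True if a file has been inactive for longer than `max_idle`."""
    return now - last_modified > max_idle


def delete_stale_logs(log_dir: Path, now: datetime, max_idle: timedelta) -> list[Path]:
    """Delete inactive log files in `log_dir` and return the deleted paths.

    `now` is expected to be a naive local-time datetime, matching
    `datetime.fromtimestamp`.
    """
    deleted = []
    for path in sorted(log_dir.glob("tms-*.log")):
        last_modified = datetime.fromtimestamp(path.stat().st_mtime)
        if is_stale(last_modified, now, max_idle):
            path.unlink()
            deleted.append(path)
    return deleted
```

Here `log_dir` would correspond to the directory configured for job event logs, and `max_idle` to whatever inactivity period the deployment tolerates.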
To keep the service stateless, when the TMS restarts, previously snapshotted taskforce updates are re-sent to the WMS, which handles these appropriately.
The image-publish.yml GitHub Actions workflow publishes this package as an Apptainer image in CVMFS when a new release is made.
In production, the TMS runs on an HTCondor Access Point (AP) using systemd. Files for this are in tms-prod/ and tms-dev/, as well as additional helper scripts in resources/systemd/.
Whichever systemd variant you choose, an envfile is required. The file for tms-prod looks something like this (with secrets redacted):
EWMS_ADDRESS="https://ewms-prod.icecube.aq"
EWMS_CLIENT_ID="ewms-tms-prod"
EWMS_CLIENT_SECRET="XXXX"
EWMS_TOKEN_URL="https://keycloak.icecube.wisc.edu/auth/realms/IceCube"
JOB_EVENT_LOG_DIR="/.../tms-prod/jobs"
TMS_ENV_VARS_AND_VALS_ADD_TO_PILOT="_EWMS_PILOT_APPTAINER_BUILD_WORKDIR=/srv/var_tmp/"
TMS_WATCHER_INTERVAL="15"
Use the helper script, update_tms_image_symlink.sh, to roll out a new TMS version on an HTCondor Access Point (AP) using systemd:
ewms@sub-2 ~/resources/systemd/tms-dev $ ./update_tms_image_symlink.sh v1.2.3
Does not exist within the TMS. Compare to WMS.
A task is not a first-order object in the TMS. However, each taskforce holds a reference to a container, arguments, environment variables, etc. Collectively, these comprise a task. Compare to WMS.
Does not exist within the TMS. Compare to WMS.
The taskforce is the primary object within the TMS. It is associated with one condor cluster. See Taskforce's cluster_id.
Compare to WMS.
The cluster is the realization of a taskforce within an HTCondor pool. The two are mapped 1:1 and are nearly synonymous at a high level.
However, the term "cluster" is used exclusively within the context of an HTCondor pool, the job event log, and debugging. Unlike the taskforce, the cluster is not relevant in the broader EWMS context.