[Epic] Extend Amun feature wise to be a testing platform for containers in a cluster #630

@fridex

Description

Is your feature request related to a problem? Please describe.

As a developer, I would like to know how applications deployed into the cluster behave from different perspectives so that I can observe what is happening to my application based on statistics produced by the service.

As a data scientist, I would like to have a unified report that can be analyzed from different points of view so that I can spot possible issues with the container image deployed in the cluster. To support this, I would like to be able to reuse Jupyter notebooks that automatically load the reports produced by the service.

As we already have the deployment and core features of Amun in place, this is more about abstracting out some features and eventually providing more, such as GPU utilization for the container when it runs in a cluster.

Workflow:

  1. A user submits an Amun inspection with a pre-built container image, together with the configuration to be respected (node placement in the cluster, GPU requirements, CPU requirements, ...)
  2. Amun runs the application in the way the user requests (e.g. runs the training phase of a machine learning model)
  3. Amun captures runtime statistics of the application (CPU utilization, process statistics from the process control block (PCB), GPU utilization, networking, ...) and reports them as a JSON document
  4. A prepared Jupyter notebook is used to automatically visualize the statistics for users
  5. Users can spot issues, discrepancies, or other runtime characteristics in the report and use them to analyze the application's behavior in the cluster
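
To make the workflow above more concrete, here is a minimal Python sketch of how a submission and the resulting JSON report could fit together. Everything in it is an assumption for illustration only: `InspectionRequest`, `build_report`, `load_report`, the field names, the image name, and the metric names are hypothetical and do not reflect Amun's actual API or report schema.

```python
import json
import statistics
from dataclasses import asdict, dataclass, field


@dataclass
class InspectionRequest:
    """Hypothetical shape of a step 1 submission (not Amun's real schema)."""

    image: str                      # pre-built container image to run
    command: list                   # how the user wants the application run
    cpu: str = "1"                  # CPU requirement
    memory: str = "1Gi"             # memory requirement
    gpu: int = 0                    # number of GPUs requested
    node_selector: dict = field(default_factory=dict)  # node placement


def build_report(request, samples):
    """Step 3: aggregate sampled runtime metrics into a JSON-serializable report."""
    metrics = {}
    # Collect every metric name seen in any sample, in a stable order.
    for name in sorted({key for sample in samples for key in sample}):
        values = [sample[name] for sample in samples if name in sample]
        metrics[name] = {
            "min": min(values),
            "max": max(values),
            "mean": statistics.fmean(values),
        }
    return {
        "request": asdict(request),
        "sample_count": len(samples),
        "metrics": metrics,
    }


def load_report(text):
    """Step 4: what a notebook cell could do before plotting the statistics."""
    return json.loads(text)


if __name__ == "__main__":
    request = InspectionRequest(
        image="quay.io/example/train:latest",  # hypothetical image name
        command=["python3", "train.py"],
        gpu=1,
        node_selector={"example.com/gpu": "true"},  # hypothetical node label
    )
    # Hand-written samples stand in for statistics captured at runtime.
    samples = [
        {"cpu_percent": 40.0, "gpu_percent": 10.0},
        {"cpu_percent": 80.0, "gpu_percent": 30.0},
    ]
    print(json.dumps(build_report(request, samples), indent=2))
```

Keeping the original request inside the report would let a notebook compare runs with different CPU, GPU, or placement settings without consulting a second data source; whether the real report should embed it is an open design question.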

Metadata

    Labels

    area/amun: Issues or PRs related to Amun
    area/knowledge-graph: Issues or PRs related to Knowledge Graph
    kind/feature: Categorizes issue or PR as related to a new feature.
    lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.
    priority/backlog: Higher priority than priority/awaiting-more-evidence.
    sig/user-experience: Issues or PRs related to the User Experience of our Services, Tools, and Libraries.
    triage/needs-information: Indicates an issue needs more information in order to work on it.
