Description
Is your feature request related to a problem? Please describe.
As a developer, I would like to know how applications deployed into the cluster behave from several perspectives, so that I can observe what is happening to my application based on statistics produced by the service.
As a data scientist, I would like a unified report that can be analyzed from different points of view so that I can spot possible issues with the container image deployed in the cluster. To support this, I would like to be able to reuse Jupyter notebooks that automatically load reports produced by the service.
As we already have the deployment and core features of Amun in place, this is more about abstracting out some features and eventually providing more, such as GPU utilization for the container when it runs in a cluster.
Workflow:
- User submits an Amun inspection with a pre-built container image, respecting the supplied configuration (node placement in the cluster, GPU requirements, CPU requirements, ...)
- Amun runs the application the way the user requests (e.g. runs the training phase of a machine learning model)
- Amun captures runtime statistics of the application (CPU utilization, process statistics from the process control block (PCB), GPU utilization, networking, ...) and reports them as JSON
- A prepared Jupyter notebook is used to automatically visualize the statistics for users
- Users can spot issues, discrepancies, or other runtime characteristics from the report to analyze the application behavior in the cluster
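
The capture and report steps of this workflow could look roughly like the sketch below. It is only an illustration and not Amun's implementation or its report schema: `run_and_report`, `load_report`, and every field name in the report are hypothetical. It uses only the Python standard library and assumes a Unix-like host, since the `resource` module is not available on Windows. GPU and networking statistics are left out.

```python
import json
import resource
import subprocess
import sys
import time


def run_and_report(command):
    """Run `command` as a child process and return a JSON report (hypothetical schema)."""
    # Snapshot the resource usage of already-finished children so we can take a delta.
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    started = time.perf_counter()
    completed = subprocess.run(command, capture_output=True, text=True)
    elapsed = time.perf_counter() - started
    after = resource.getrusage(resource.RUSAGE_CHILDREN)

    report = {
        "command": command,
        "exit_code": completed.returncode,
        "wall_time_seconds": elapsed,
        "cpu_user_seconds": after.ru_utime - before.ru_utime,
        "cpu_system_seconds": after.ru_stime - before.ru_stime,
        # Peak resident set size over all waited-for children, not a delta.
        # Reported in kilobytes on Linux and in bytes on macOS.
        "max_rss": after.ru_maxrss,
        "stdout": completed.stdout,
    }
    return json.dumps(report, indent=2)


def load_report(raw):
    """Parse a JSON report, as a notebook cell might do before plotting it."""
    return json.loads(raw)


# Stand-in for "the application": a short CPU-bound Python one-liner.
report_json = run_and_report([sys.executable, "-c", "print(sum(range(1_000_000)))"])
report = load_report(report_json)
print(report["exit_code"], report["wall_time_seconds"])
```

A notebook could call something like `load_report` on each stored report and hand the resulting dictionary to whatever plotting library it uses. Keeping the report as plain JSON means the same file can serve both the developer and the data scientist use cases.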