
AAI3001 Term 1 Project: AI Exercise Video Classifier

Team Members

Name                         Student ID
Sidharth Vinod               2400635
Lim Bing Xian                2401649
Tan Yu Xuan                  2400653
Boo Wai Yee Terry            2402445
Haris Bin Ahmad Rithaudeen   2403053

Table of Contents

  • Overview
  • Exercise Classes
  • Demo
  • Project Poster
  • Quick Start
  • Training Models (Optional)
  • Project Structure
  • Dataset
  • Model Architectures
  • Training Details
  • Results
  • Future Improvements
  • References

Overview

This project implements and compares three deep learning architectures for video classification, identifying gym exercises from video input. The system classifies five exercises by analyzing both spatial features (body posture) and temporal dynamics (movement patterns).

Key Features:

  • Real-time exercise classification from video
  • Comparison of 2D CNN, 3D CNN, and Transformer-based approaches
  • Interactive Streamlit web application
  • Achieves 94.12% test accuracy with the VideoMAE model

Exercise Classes

The classifier recognizes the following five exercises:

  1. Bicep Curls
  2. Push Ups
  3. Squats
  4. Shoulder Press
  5. Lateral Raises

Demo

Click the link below to watch our demo video showcasing the Streamlit application in action:

Watch Demo

An intelligent video classification system that identifies 5 different gym exercises using deep learning models trained on video data.

Project Poster

Overview of the AI Exercise Video Classifier project, including model architectures, dataset, and results.

Poster PDF (Optional)

Download the full poster PDF

For a high-resolution version of our poster, click to download.

Quick Start

Prerequisites

  • Python 3.8+
  • Git with Git LFS installed
  • 8GB+ RAM
  • For custom training, we recommend using Google Colab for GPU access.
  • PyTorch 2.6.0+cu126 (CUDA 12.6)

Installation

  1. Clone the repository with Git LFS

    git lfs install
    git clone https://github.com/sidharthvinod24/AAI3001-Project
    cd AAI3001-Project
    git lfs pull
  2. Create a virtual environment

    python -m venv env
    
    # On Windows
    env\Scripts\activate
    
    # On macOS/Linux
    source env/bin/activate
  3. Install dependencies

    pip install -r requirements.txt
  4. Launch the Streamlit application

    streamlit run main.py

The application will open in your default browser. Upload a video of any supported exercise to get predictions!
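
For reference, here is a minimal sketch of what the upload-and-predict flow inside a Streamlit app like main.py could look like. The checkpoint directory, label order, and frame-sampling helper below are assumptions for illustration, not the project's actual code.

    # Hedged sketch of a Streamlit inference flow (not the actual main.py).
    import tempfile

    import cv2
    import numpy as np
    import streamlit as st
    import torch
    from transformers import VideoMAEForVideoClassification, VideoMAEImageProcessor

    LABELS = ["Bicep Curls", "Push Ups", "Squats", "Shoulder Press", "Lateral Raises"]  # assumed order

    @st.cache_resource
    def load_model():
        # Assumes both the processor and the model were saved under pretrained_model/videomae/.
        processor = VideoMAEImageProcessor.from_pretrained("pretrained_model/videomae")
        model = VideoMAEForVideoClassification.from_pretrained("pretrained_model/videomae")
        model.eval()
        return processor, model

    def sample_frames(path, num_frames=16):
        # Uniformly sample RGB frames from the uploaded video.
        cap = cv2.VideoCapture(path)
        total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
        frames = []
        for i in indices:
            cap.set(cv2.CAP_PROP_POS_FRAMES, int(i))
            ok, frame = cap.read()
            if ok:
                frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        cap.release()
        return frames

    st.title("AI Exercise Video Classifier")
    upload = st.file_uploader("Upload an exercise video", type=["mp4"])
    if upload is not None:
        with tempfile.NamedTemporaryFile(suffix=".mp4", delete=False) as tmp:
            tmp.write(upload.read())
        processor, model = load_model()
        inputs = processor(sample_frames(tmp.name), return_tensors="pt")
        with torch.no_grad():
            probs = model(**inputs).logits.softmax(dim=-1)[0]
        st.write({LABELS[i]: f"{p:.1%}" for i, p in enumerate(probs.tolist())})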

Training Models (Optional)

To train models from scratch:

  1. Upload the project folder to Google Drive
  2. Open main.ipynb in Google Colab
  3. Select a GPU runtime (Runtime → Change runtime type → GPU)
  4. Update file paths to match your Google Drive structure
  5. Run all cells

Project Structure

AAI3001-Project/
├── main.py                         # Streamlit application
├── main.ipynb                      # Model training notebook
├── requirements.txt                # Python dependencies
├── README.md                       # Project documentation
├── processed/                      # Preprocessed videos (224×224, no audio)
├── pretrained_model/               # Saved model weights
│   ├── videomae/                   # VideoMAE model
│   ├── resnet18_model.pth          # ResNet18 weights
│   └── pretrained_resnet3d.pth     # R3D-18 weights
├── old_exercises/                  # Original Kaggle dataset
├── new_exercises/                  # Augmented dataset
└── assets/                         # Project assets such as poster

Dataset

Data Sources

Primary Dataset: Kaggle Gym Workout Exercises

  • Original dataset containing 20+ exercise types
  • Selected 5 exercise classes for this project

Custom Dataset:

  • Team-recorded videos to address class imbalance
  • Added supplementary examples for minority classes

Preprocessing:

  • Resized to 224×224 resolution
  • Audio removed
  • Format: MP4
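
As a rough illustration of these steps, the sketch below resizes every frame to 224×224 and writes a video-only MP4 with OpenCV; the project's actual pipeline lives in main.ipynb and may differ, and the example paths are hypothetical.

    # Hedged sketch of the 224x224 / no-audio preprocessing (assumed OpenCV-based).
    import cv2

    def preprocess_video(src_path: str, dst_path: str, size: int = 224) -> None:
        # Resize every frame to size x size and write an MP4 without an audio track.
        cap = cv2.VideoCapture(src_path)
        fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if FPS metadata is missing
        writer = cv2.VideoWriter(dst_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (size, size))
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            writer.write(cv2.resize(frame, (size, size)))  # VideoWriter emits video-only output
        cap.release()
        writer.release()

    # Hypothetical usage:
    # preprocess_video("old_exercises/squats/clip_01.mp4", "processed/squats/clip_01.mp4")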

Model Architectures

1. VideoMAE (masked autoencoder-based transformer)

VideoMAE is a transformer-based masked autoencoder designed for video understanding. It captures both spatial features (appearance, body posture) and temporal dynamics (motion across frames) in a unified manner.

Specific Changes

  • Custom Model Definition: Used pretrained VideoMAE [1], adapted for exercise classification by freezing most encoder layers and training only the last 2; the output layer was adjusted to 5 classes (see the sketch below).
  • Data Augmentation: Resizing, random cropping, horizontal flips, small random rotations, and color jittering.
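
A minimal sketch of this adaptation, assuming the MCG-NJU/videomae-base checkpoint and the Hugging Face transformers API; the exact checkpoint and freezing scheme used in main.ipynb may differ.

    # Hedged sketch: freeze most of VideoMAE, fine-tune the last 2 encoder blocks + a 5-class head.
    from transformers import VideoMAEForVideoClassification

    model = VideoMAEForVideoClassification.from_pretrained(
        "MCG-NJU/videomae-base",       # assumed base checkpoint
        num_labels=5,                  # replaces the output layer with a 5-class head
        ignore_mismatched_sizes=True,
    )

    for param in model.parameters():                  # freeze everything first
        param.requires_grad = False
    for block in model.videomae.encoder.layer[-2:]:   # unfreeze the last 2 encoder layers
        for param in block.parameters():
            param.requires_grad = True
    for param in model.classifier.parameters():       # the new classification head stays trainable
        param.requires_grad = True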

2. ResNet18 (2D CNN with frame averaging)

ResNet18 serves as our 2D convolutional baseline; frame averaging is used to aggregate features from sampled frames.

Specific Changes

  • Custom Model Definition: Pretrained ResNet18 [2] with frozen layers; the FC layer was replaced with 2 linear layers + ReLU + Dropout, and the output was adjusted to 5 classes (see the sketch below).
  • Data Augmentation: Random resized crops, horizontal flips, color jitter, grayscale conversion, Gaussian blur.
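
A minimal sketch of this baseline using torchvision's pretrained ResNet18; the hidden size and dropout rate are assumptions, and frame averaging is applied over per-frame features.

    # Hedged sketch: frozen ResNet18 backbone, new 2-layer head, frame averaging over time.
    import torch
    import torch.nn as nn
    from torchvision import models

    class FrameAveragingResNet(nn.Module):
        def __init__(self, num_classes: int = 5, hidden: int = 256, dropout: float = 0.5):
            super().__init__()
            backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
            for param in backbone.parameters():    # freeze the pretrained layers
                param.requires_grad = False
            in_features = backbone.fc.in_features
            backbone.fc = nn.Identity()            # keep the 512-d per-frame features
            self.backbone = backbone
            self.head = nn.Sequential(             # 2 linear layers + ReLU + Dropout
                nn.Linear(in_features, hidden),
                nn.ReLU(),
                nn.Dropout(dropout),
                nn.Linear(hidden, num_classes),
            )

        def forward(self, clips: torch.Tensor) -> torch.Tensor:
            # clips: (batch, frames, 3, 224, 224) -> average per-frame features over time
            b, t, c, h, w = clips.shape
            feats = self.backbone(clips.reshape(b * t, c, h, w))
            return self.head(feats.reshape(b, t, -1).mean(dim=1))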

3. R3D_18 (3D Convolutional ResNet)

R3D-18 extends 2D convolutions into the temporal dimension to model motion patterns across frames.

Specific Changes

  • Custom Model Definition: Pretrained R3D-18 [3] with frozen layers and a custom classification head with ReLU and dropout; the output was adjusted to 5 classes (see the sketch below).
  • Data Augmentation: Resizing, random cropping, horizontal flips, small rotations, color jitter.
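
A minimal sketch of this setup with torchvision's Kinetics-400 pretrained r3d_18; the hidden size and dropout rate are assumptions.

    # Hedged sketch: frozen R3D-18 backbone with a trainable ReLU + Dropout classification head.
    import torch.nn as nn
    from torchvision.models.video import R3D_18_Weights, r3d_18

    def build_r3d18(num_classes: int = 5, hidden: int = 256, dropout: float = 0.5) -> nn.Module:
        model = r3d_18(weights=R3D_18_Weights.DEFAULT)  # Kinetics-400 pretrained
        for param in model.parameters():                # freeze the 3D-conv backbone
            param.requires_grad = False
        model.fc = nn.Sequential(                       # trainable custom head
            nn.Linear(model.fc.in_features, hidden),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden, num_classes),
        )
        return model

    # r3d_18 expects clips shaped (batch, channels, frames, height, width).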

Training Details

VideoMAE

Parameter             Value
Epochs                5
Optimizer             AdamW (weight decay: 1e-4)
Loss Function         CrossEntropyLoss (label smoothing: 0.1)
Learning Rate         5e-5
Scheduler             Cosine with warmup
Batch Size            4
Frames per Video      16
Training Framework    HuggingFace Trainer API
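
A hedged sketch of how these hyperparameters could be passed to the HuggingFace Trainer API; the dataset objects, metric function, and warmup ratio below are placeholders and assumptions, not the project's verified configuration.

    # Hedged sketch of the VideoMAE fine-tuning setup (AdamW is the Trainer's default optimizer).
    from transformers import Trainer, TrainingArguments

    args = TrainingArguments(
        output_dir="pretrained_model/videomae",
        num_train_epochs=5,
        per_device_train_batch_size=4,
        learning_rate=5e-5,
        weight_decay=1e-4,
        lr_scheduler_type="cosine",
        warmup_ratio=0.1,              # assumed warmup fraction
        label_smoothing_factor=0.1,    # CrossEntropyLoss with label smoothing
        remove_unused_columns=False,   # keep the pixel_values produced by the video processor
    )

    trainer = Trainer(
        model=model,                      # the adapted VideoMAE model
        args=args,
        train_dataset=train_dataset,      # placeholder: yields pixel_values + labels
        eval_dataset=val_dataset,         # placeholder
        compute_metrics=compute_metrics,  # placeholder: accuracy / F1 from predictions
    )
    trainer.train()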

ResNet18

Parameter             Value
Epochs                100 (with early stopping)
Optimizer             AdamW (weight decay: 1e-4)
Loss Function         CrossEntropyLoss (label smoothing: 0.1)
Learning Rate         1e-4
Scheduler             ReduceLROnPlateau (patience: 5)
Batch Size            8
Frames per Video      25

R3D-18

Parameter             Value
Epochs                100 (with early stopping)
Optimizer             Adam (weight decay: 1e-4)
Loss Function         CrossEntropyLoss
Learning Rate         1e-4
Scheduler             ReduceLROnPlateau (patience: 5)
Batch Size            4
Frames per Video      16
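
Both CNN models share the same loop structure; the sketch below follows the ResNet18 configuration (R3D-18 would swap in Adam and plain CrossEntropyLoss). The dataloaders, early-stopping patience, and checkpoint name are placeholders and assumptions.

    # Hedged sketch of the CNN training loop with ReduceLROnPlateau and early stopping.
    import torch
    import torch.nn as nn

    def train(model, train_loader, val_loader, device, epochs=100, lr=1e-4, patience=10):
        model.to(device)
        criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
        optimizer = torch.optim.AdamW(
            [p for p in model.parameters() if p.requires_grad], lr=lr, weight_decay=1e-4
        )
        scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=5)
        best_loss, bad_epochs = float("inf"), 0

        for epoch in range(epochs):
            model.train()
            for clips, labels in train_loader:
                clips, labels = clips.to(device), labels.to(device)
                optimizer.zero_grad()
                loss = criterion(model(clips), labels)
                loss.backward()
                optimizer.step()

            # Validation loss drives both the LR scheduler and early stopping.
            model.eval()
            val_loss = 0.0
            with torch.no_grad():
                for clips, labels in val_loader:
                    clips, labels = clips.to(device), labels.to(device)
                    val_loss += criterion(model(clips), labels).item()
            val_loss /= max(len(val_loader), 1)
            scheduler.step(val_loss)

            if val_loss < best_loss:
                best_loss, bad_epochs = val_loss, 0
                torch.save(model.state_dict(), "best_model.pth")  # assumed checkpoint name
            else:
                bad_epochs += 1
                if bad_epochs >= patience:  # early stopping
                    break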

Results

Test Set Performance

Model       Accuracy   F1-Score   Parameters (trainable)
VideoMAE    94.12%     94.14%     ~86M (~5M fine-tuned)
R3D-18      85.29%     85.49%     ~33M (~2M fine-tuned)
ResNet18    82.35%     82.31%     ~11M (~1M fine-tuned)

Analysis

VideoMAE significantly outperforms both CNN-based models due to:

  • Rich pretrained representations from masked autoencoding on large-scale video datasets
  • Superior temporal modeling through self-attention mechanisms
  • Better generalization even with limited training data

R3D-18 improves over ResNet18 by:

  • Native temporal feature extraction via 3D convolutions
  • Direct motion pattern learning across consecutive frames

ResNet18 provides a solid baseline but:

  • Frame averaging loses temporal information
  • Limited ability to model motion dynamics

Future Improvements

  • Increase dataset size.
  • Apply advanced data augmentation (e.g., motion warping).
  • Use adaptive frame sampling.

References

  1. VideoMAE Paper
  2. ResNet18 Paper
  3. R3D-18 Paper
  4. Fine-tuning for Video Classification with 🤗 Transformers

License: MIT License. See the LICENSE file for details.

Contact: For questions or collaboration, please open an issue on GitHub.