RallyClip

Free, open-source tennis match segmentation.
Runs entirely on your machine.

GitHub ↗

What It Does

RallyClip takes a full tennis match recording and extracts only the points, removing the dead time between rallies. Drop in a 2-hour match video, get back a condensed video containing just the action, plus an optional CSV with timestamps for each point.

It works through a pipeline of computer vision and deep learning: court detection, pose estimation with YOLOv8, feature engineering, and a bidirectional LSTM that predicts frame-by-frame whether play is happening. The result is a segmented video with only the rallies, ready for review.

Motivation

Watching yourself play is one of the best ways to get better at competing in tennis. However, only ~25% of a recorded match is spent "in-point"; the rest is dead time. When you're going through your footage, it's a pain to scrub through all that dead time to get to the actual points.

Software like SwingVision offers exactly this. However, SwingVision's free tier caps match segmentation at 2 hours per month, pushing you toward their expensive subscription. While SwingVision bundles many more analytics, match segmentation remains one of its most popular and useful features.

I built RallyClip (there aren't many tennis/computer-vision naming schemes, so cut me some slack on the name) to solve exactly this problem. I believe match segmentation should be free for anyone to use, which is why RallyClip is open-source and runs locally.

To avoid the costs of cloud compute, you run RallyClip directly on your computer. There are trade-offs: this kind of video inference is computationally expensive, so running it on a laptop may take some time. Here are the metrics from running it on my MacBook Pro M2; note that GPU acceleration significantly improves performance:

Model         GPU (MPS)             CPU                 Speedup
YOLO nano     ~0.8× video length    ~3× video length    ~3.75×
YOLO small    ~1.5× video length    ~7× video length    ~4.7×

In other words, GPU acceleration provides ~4× gains, so if you have CUDA or MPS, enable it.
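If you're unsure what your machine supports, a small helper like the following can check for a GPU backend. This is a hedged sketch, not part of RallyClip's CLI; the `pick_device` name is my own, and it assumes PyTorch's standard `cuda`/`mps` availability checks:

```python
def pick_device():
    """Prefer CUDA, then Apple MPS, else CPU.

    Falls back to "cpu" gracefully if PyTorch is not installed.
    """
    try:
        import torch
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"
```

On an M-series Mac with PyTorch installed this returns `"mps"`, which is the backend the table above was measured with.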

How It Works

1. Court Detection

Detects the court lines using classical CV methods, then applies heuristics based on line angles to identify the near baseline and the singles and doubles sidelines. From those, it builds a court mask covering the region where the players can be.

This lets us keep only the bounding boxes of players inside the playable area: if there are other players or people on nearby courts, their pose data never reaches our model.
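The mask check itself boils down to a point-in-polygon test on each detection's centroid. A minimal pure-Python sketch of the idea (the actual pipeline presumably uses a rasterized mask; the function names and polygon format here are illustrative):

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: is (x, y) inside polygon (a list of (px, py) vertices)?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Count crossings of a horizontal ray from (x, y) with each edge.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def filter_boxes(boxes, court_polygon):
    """Keep boxes (x1, y1, x2, y2) whose centroid lies inside the court mask."""
    kept = []
    for (x1, y1, x2, y2) in boxes:
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        if point_in_polygon(cx, cy, court_polygon):
            kept.append((x1, y1, x2, y2))
    return kept
```

A detection centered on a neighboring court simply never makes it into the feature pipeline.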

2. Pose Extraction

Runs YOLOv8 pose models on video downsampled to 15 fps, extracting candidate bounding boxes and keypoint (joint) estimates for the players.
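Downsampling to 15 fps amounts to keeping a subset of frame indices. A sketch of one way to select them (this stride logic is my assumption about the approach, not RallyClip's exact code):

```python
def frames_to_keep(num_frames, src_fps, target_fps=15.0):
    """Indices of frames to keep when downsampling src_fps video to ~target_fps."""
    if target_fps >= src_fps:
        return list(range(num_frames))
    stride = src_fps / target_fps  # e.g. 30 fps -> 15 fps gives stride 2.0
    kept, next_keep = [], 0.0
    for i in range(num_frames):
        if i >= next_keep:
            kept.append(i)
            next_keep += stride
    return kept
```

Halving a 30 fps video this way keeps every other frame, cutting pose-model inference cost roughly in half.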

3. Preprocessing & Feature Engineering

Preprocesses the data: filters out bounding boxes whose centroids fall outside the court mask, then engineers features such as limb lengths and joint velocities and accelerations.
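The velocity and acceleration features can be derived from keypoint positions with finite differences. A minimal sketch for a single joint coordinate track (pure Python; the real feature set is richer, and the helper name is my own):

```python
def finite_diff(series, fps=15.0):
    """First-order finite difference of a 1-D series, scaled to units per second."""
    return [(b - a) * fps for a, b in zip(series, series[1:])]

# x-coordinates of one joint over 4 consecutive frames at 15 fps
xs = [0.0, 1.0, 3.0, 6.0]
velocity = finite_diff(xs)            # per-second velocity between frames
acceleration = finite_diff(velocity)  # second difference of position
```

The same differencing applies per keypoint and per axis; limb lengths are just distances between keypoint pairs within a frame.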

4. LSTM Inference

Splits the feature sequence into overlapping 20-second chunks and feeds them into a trained bidirectional LSTM, which outputs frame-by-frame probabilities of being in-point vs. out-of-point.
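The chunk-and-average scheme can be sketched as follows: slide overlapping windows over the frames, score each window with the model, and average the per-frame probabilities wherever windows overlap. The window/stride values and the `probs_fn` callback are illustrative stand-ins for the real LSTM:

```python
def predict_with_overlap(probs_fn, num_frames, window, stride):
    """Score overlapping chunks and average per-frame probabilities in overlaps.

    probs_fn(start, end) stands in for the LSTM: it returns one probability
    per frame in [start, end).
    """
    sums = [0.0] * num_frames
    counts = [0] * num_frames
    start = 0
    while start < num_frames:
        end = min(start + window, num_frames)
        for i, p in zip(range(start, end), probs_fn(start, end)):
            sums[i] += p
            counts[i] += 1
        if end == num_frames:
            break
        start += stride
    return [s / c for s, c in zip(sums, counts)]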

5. Output Generation

Averages the overlapping predictions, applies smoothing and hysteresis filtering to get discrete point start/end times, and writes these "in-point" chunks to the output video.
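Hysteresis here means using two thresholds: a point opens only when the probability rises above a high threshold and closes only when it drops below a low one, which suppresses flicker around a single cutoff. A sketch of that filtering step (threshold values are illustrative, not RallyClip's defaults):

```python
def hysteresis_segments(probs, high=0.7, low=0.3):
    """Convert per-frame in-point probabilities into (start, end) frame segments."""
    segments, start, in_point = [], None, False
    for i, p in enumerate(probs):
        if not in_point and p >= high:
            in_point, start = True, i     # open a segment on the high threshold
        elif in_point and p <= low:
            segments.append((start, i))   # close it only on the low threshold
            in_point = False
    if in_point:                          # segment still open at end of video
        segments.append((start, len(probs)))
    return segments
```

Mid-range wobbles (e.g. a brief dip to 0.5 during a rally) don't split a point, because closing requires crossing the low threshold.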

Challenges

This project had some decently nontrivial challenges, chief among them finding a data modality with a consistent signal and building a model computationally efficient enough to run locally.

Choosing the Data Modality

Firstly, I had to decide on a data modality to train my models on.

Raw video/images: Pure image/video input is computationally infeasible. Training a video transformer or CNN to extract relevant features, even on single frames without temporal convolutions, would require far more data than I could reasonably obtain.

Sound: Using sound fails for several reasons: (1) With multiple courts in a video, it's very difficult to tell whether a racket-on-ball sound came from the relevant match or a neighboring one. (2) Not all shots produce an audible sound; not every player hits hard enough to register. (3) Audio can be noisy and inconsistent, and some recordings have no sound at all.

Ball tracking: Ball tracking is doable, but it fails under occlusion, and when the ball blends into the court it can be very hard to see.

Player poses: I finally decided on using player pose data. A point being in play requires the players to move and swing their rackets in distinct patterns. Furthermore, as players will almost certainly be visible in frame during points, they will provide a consistent signal. Lastly, pose extraction is computationally cheap: the YOLOv8-pose models I use are 3.1 and 11.6 million parameters, which result in fast feature extraction.

Training Data

Lastly, there are no publicly available datasets for this task, so I manually annotated ~8 hours of matches to train my first models on. I'm quite proud that I built a functioning model from such a small amount of data, proof that with clever engineering you can approach the results of brute-force big-data approaches.

Limitations

If court detection fails and there are more people on screen than the 2 players, the model can break down: the pose detector picks up the wrong people and the output becomes unreliable. This is rare, occurring mainly in low-light conditions where classical CV methods fail to detect edges, or where the camera angle is terrible.

Furthermore, the model can fail to generalize to some videos, likely due to overfitting or insufficient coverage of the distribution of match videos. Performance should improve in this regard as I collect more data.

Future Improvements

  • Model distillation: The YOLO nano pose model is currently the most computationally feasible, but at the cost of some accuracy compared to YOLO small. I plan to fine-tune the nano pose model on the outputs of the YOLO large model to close this gap while keeping computation efficient.
  • More data: I will expand and augment the dataset, with more hand-labelled data plus model outputs used to bootstrap additional training data.
  • Attention mechanisms: As my data increases, I will switch to LSTM with attention, which will improve performance, as it can learn which parts of the 20s window are most relevant with respect to each other.
  • Rally detection: I will add a rally detection model. My current model doesn't perform as well on this because it has learned that a service or return motion is associated with the start of a point. I'll simply train a model with the same pipeline on rally data instead of match data.
  • Fully installable app: Make RallyClip a standalone desktop app that requires no terminal commands. Fully non-technical friendly, just download and run.
  • Cloud compute option: A pay-per-minute cloud service for fast and cheap segmentation. For those who don't want to run it locally or need faster processing.

If you have any potential improvements, please PR them! I'm super open to feedback and help.

Installation

Requires Python 3.10+. 8GB+ RAM recommended. GPU optional (MPS/CUDA supported).

Using UV (recommended)

UV is a fast Python package manager.

# Install UV (if you don't have it)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone and install
git clone https://github.com/iroblesrazzaq/RallyClip.git
cd RallyClip
uv sync

# Run with UV
uv run rallyclip gui
uv run rallyclip --video "match.mp4"

Using pip

# Clone and install with pip
git clone https://github.com/iroblesrazzaq/RallyClip.git
cd RallyClip
python -m venv .venv && source .venv/bin/activate
pip install .

Models are included in the repo. YOLO weights auto-download on first run.

Usage

If installed with UV, prefix commands with uv run. If installed with pip, run directly after activating your venv.

GUI

Launch the browser-based interface:

rallyclip gui          # pip install
uv run rallyclip gui   # uv sync

Drag and drop your video, adjust settings if needed, and download the result. The GUI provides real-time progress tracking and advanced settings.

CLI

# Basic usage (only video path required)
rallyclip --video "match.mp4"

# With CSV output
rallyclip --video "match.mp4" --write-csv

# Custom output directory
rallyclip --video "match.mp4" --output-dir "./processed"

# Use config file for all options
rallyclip --config config.toml

See the README for full CLI options and config file documentation.