Transcript-Driven Video Bad-Take Auto-Editor

Carlos Gutierrez at Beast Consulting

Internal tooling · Working prototype with a desktop GUI, in personal use · Updated 2026-04-25

Summary

A desktop tool that opens a consumer video editor's project file, parses the on-screen captions back into a transcript, detects repeated phrases (the tell-tale signal of a re-take where the talent restarted a sentence), and proposes which segments to cut. The editor reviews the proposed cuts in a side-by-side panel, accepts or rejects each one, and the tool writes the modified project file back so the editor opens with the bad takes already removed.

The Problem

Self-shot, single-camera content has a specific editing problem: every time the talent fluffs a sentence and restarts, the caption track ends up with two near-identical copies of the same line. Finding all of those manually in a 30-minute project takes longer than the actual edit. The savings target is straightforward — if a tool can flag the duplicates with high precision, the editor only reviews what's flagged instead of scrubbing the whole transcript.

The wrinkle: the consumer video editor in question stores its project in a JSON file with caption text linked to media segments via internal IDs and microsecond timestamps. There's no public API. And on some installs the file is encrypted, so the tool can't always parse it directly.

The Approach

The system is built as two modes that share the same downstream pipeline. Mode A parses the project JSON directly — fast, deterministic, exact timestamps. Mode B falls back to OCR on the editor's UI plus screen automation when the file is encrypted — slower, less precise, but always works. Both modes produce the same intermediate transcript shape, which is what the duplicate detector and the project writer both consume.

Mode A: project file (JSON)  ─┐
                              ├──→  Normalized transcript
Mode B: OCR + UI automation  ─┘            ↓
                                    4-pass duplicate detector
                                            ↓
                                    Reviewable cut proposals (GUI)
                                            ↓
                                    Project writer  →  modified JSON
                                                       (editor opens with
                                                        cuts already in)
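Both modes converge on one intermediate shape before anything downstream runs. A minimal sketch of what that normalized transcript could look like — the field names here are illustrative, not the project's actual schema:

```python
from dataclasses import dataclass

@dataclass
class Line:
    """One caption line in the normalized transcript (fields illustrative)."""
    text: str         # caption text, whitespace-normalized
    start_us: int     # start timestamp in microseconds
    end_us: int       # end timestamp in microseconds
    segment_id: str   # internal ID of the media segment this caption links to
    source: str       # "json" (Mode A) or "ocr" (Mode B)

# Both modes emit a list[Line]; the duplicate detector and the project
# writer consume exactly this shape, so neither needs to know which
# mode produced it.
transcript = [
    Line("welcome back to the channel", 0, 2_400_000, "seg-001", "json"),
    Line("welcome back to the channel", 2_500_000, 4_900_000, "seg-002", "json"),
]
```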

Duplicate detection runs in 4 passes of decreasing strictness: exact-match, near-exact (whitespace + punctuation tolerant), fuzzy phrase match with edit distance, then sliding-window similarity. Each pass tags candidates with a confidence score. The GUI surfaces high-confidence cuts at the top, low-confidence cuts collapsed at the bottom — editor decides on each.
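The pass cascade can be sketched roughly like this, assuming simple stand-in heuristics for each pass (Python's `difflib` fills in for whatever edit-distance routine the real detector uses, and the fourth sliding-window pass is omitted for brevity):

```python
import difflib
import re

def _norm(s):
    # near-exact pass: lowercase, strip punctuation, collapse whitespace
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", s.lower())).strip()

def detect_duplicates(lines):
    """Compare each caption line against the one before it; the earlier
    near-identical copy is the bad take. Returns (index, confidence, pass)
    tuples, highest confidence first, mirroring the GUI's ordering."""
    candidates = []
    for i in range(1, len(lines)):
        a, b = lines[i - 1], lines[i]
        if a == b:
            candidates.append((i - 1, 1.0, "exact"))
        elif _norm(a) == _norm(b):
            candidates.append((i - 1, 0.9, "near-exact"))
        else:
            # fuzzy pass: similarity ratio approximates edit distance
            ratio = difflib.SequenceMatcher(None, _norm(a), _norm(b)).ratio()
            if ratio >= 0.8:
                candidates.append((i - 1, round(0.5 + 0.4 * ratio, 2), "fuzzy"))
    return sorted(candidates, key=lambda c: -c[1])
```

Note the design choice this makes visible: each pass tags its own confidence, so the merge step is just a sort, and the GUI can collapse everything below a threshold without re-running detection.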

What I Built

  • Project file reader (Mode A) — parses the editor's draft JSON, walks materials.texts → tracks.segments via material_id, normalizes microsecond timestamps, emits a clean transcript shape
  • OCR + UI fallback (Mode B) — automates the editor on native Windows, captures caption text from the timeline, reconstructs timestamps from playhead position
  • 4-pass duplicate detector — exact → near-exact → fuzzy phrase → sliding-window, each producing scored candidates that get merged and ranked
  • Review GUI — dark-mode desktop app with a transcript viewer on the left, proposed cuts with accept/reject on the right, keyboard-driven for speed
  • Project writer — applies accepted cuts back into the project JSON, preserves all the rest of the project state (effects, transitions, audio levels)
  • Native Windows runner — must run on native Windows Python because Mode B drives the GUI of a Windows desktop app; WSL Python can't reach the Windows process
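The Mode A walk above can be sketched as follows. The `materials`/`texts`/`tracks`/`segments`/`material_id` names come straight from the write-up; every other field name (`content`, `start`, `duration`, `id`) is a guess for illustration:

```python
def read_transcript(project):
    """Mode A sketch: join materials.texts to tracks.segments via
    material_id and normalize microsecond timestamps to seconds.
    `project` is the parsed draft JSON; field names beyond those
    mentioned in the case study are assumptions."""
    # index caption materials by their internal ID
    texts = {m["id"]: m.get("content", "") for m in project["materials"]["texts"]}
    transcript = []
    for track in project["tracks"]:
        for seg in track["segments"]:
            text = texts.get(seg["material_id"])
            if text is None:
                continue  # segment references a non-caption material
            start_us = seg["start"]  # microseconds in the draft file
            transcript.append({
                "text": text,
                "start_s": start_us / 1_000_000,
                "end_s": (start_us + seg["duration"]) / 1_000_000,
                "segment_id": seg["id"],
            })
    transcript.sort(key=lambda line: line["start_s"])
    return transcript
```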

Engineering Highlights

  • Two-mode design protects against the unknown. I could have shipped Mode A only, but the moment a user hit an encrypted project the tool would have been useless. Layering OCR on top — even slow, imperfect OCR — makes the tool work for every project, which is the only thing the user actually cares about.
  • 4-pass detection over a single fuzzy match. A single fuzzy-match algorithm tuned for "find duplicates" produces too many false positives at high sensitivity and too many false negatives at low. Four passes with confidence scores, surfaced in priority order, lets the editor work top-down — fast for obvious cuts, careful for ambiguous ones.
  • Project file as ground truth. The tool never modifies the actual video media. It only edits the editor's project JSON. If the tool gets something wrong, the editor reopens, undo works as expected, no media is touched. Fail-safe by design.
  • WSL vs Windows Python boundary documented. Mode B can't run from WSL because it needs to drive a Windows GUI process. That constraint is documented in the project rules so future contributors don't waste an afternoon hitting it.
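The fail-safe property falls out of how narrow the write path is. A minimal sketch of what that write path could look like — `apply_cuts` and the backup naming are hypothetical, but the key idea (back up first, filter one list, touch nothing else) is the one described above:

```python
import json
import shutil
import time

def apply_cuts(path, accepted_segment_ids):
    """Sketch of a fail-safe write path: back up the draft JSON, then
    drop only the segments the editor accepted for cutting. Effects,
    transitions, and audio levels pass through untouched because the
    JSON is filtered in place, never rebuilt. Media files are never
    opened, let alone modified."""
    backup = f"{path}.bak.{int(time.time())}"
    shutil.copy2(path, backup)
    with open(path, encoding="utf-8") as f:
        project = json.load(f)
    for track in project["tracks"]:
        track["segments"] = [
            s for s in track["segments"] if s["id"] not in accepted_segment_ids
        ]
    with open(path, "w", encoding="utf-8") as f:
        json.dump(project, f, ensure_ascii=False)
    return backup  # reopening this file restores the pre-cut state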

Outcome

In personal use for content production. Cuts an editing pass that used to take an afternoon down to a focused review session of a few minutes. The editor accepts or rejects each cut explicitly; no destructive change ever goes through unreviewed.

Tech footprint

  • Frontend — customtkinter desktop app, dark mode, keyboard-driven review panel
  • Core — Python project reader + duplicate detector + project writer
  • Mode A — JSON parsing of the editor's draft file (materials/tracks/segments graph)
  • Mode B — OCR + pyautogui desktop automation for encrypted projects
  • Runtime constraint — native Windows Python required (not WSL) for Mode B
  • No external services — entirely local, no AI vendor calls, no cloud dependency
