The dummy backend that saved our lesson recordings
The cheapest way to record video calls at scale is to not record video.
We learned this the hard way at Lingawa, where we run about 1,500 live tutoring lessons per week between native-speaker tutors and diaspora kids learning Yoruba and Igbo. Every lesson happens on our self-hosted Jitsi instance. Every lesson needs to be reviewable after the fact, both for safeguarding (we're connecting paid adults with minors in private video calls, so there's a real responsibility to know what was said) and for tutor quality (the tutor growth team needs to spot-check whether the tutor was actually teaching or just chatting).
The straightforward way to solve this is Jibri, the official Jitsi recording component. One Jibri instance per concurrent recording, each one a 4-8GB RAM VM running headless Chromium. With 10-20 concurrent lessons at peak, that's 10-20 VMs. Estimated $500 to $1,000 per month, just for the recording layer.
That number was a non-starter. So we pivoted: audio only, recorded server-side by Jigasi, transcribed in batches on a single CPU VM running Faster Whisper. Total infrastructure cost came in at around $50-80 per month. About a 10x reduction on a load-bearing piece of business infrastructure.
This post is about what happened next: two weeks of bugs, each one deeper than the last. It's also about a 50-line Python file called dummy_vosk.py that ended up being the most important component in the whole pipeline.
What was supposed to happen
The architecture, on paper, was clean. Three pieces:
- Jigasi joins every lesson as a hidden participant and records the mixed audio to a WAV file.
- An upload script on the Jitsi VM converts the WAV to FLAC (smaller, lossless) and SCPs it to a separate Whisper VM.
- The Whisper VM transcribes in batch, uploads the JSON transcript to GCS, and triggers a Cloud Function that does two-tier flagging: regex against a list of solicitation terms (fast and free), and Claude Haiku for anything ambiguous (around $10-15 per month at our volume).
Jitsi VM               Whisper           Cloud Fn
┌──────────────┐      ┌─────────┐      ┌────────┐
│Lesson happens│      │         │      │        │
│Jigasi → WAV  │FLAC→ │Faster   │JSON→ │T1 regex│
│Upload → SCP  │ SCP  │Whisper  │ GCS  │T2 Haiku│→ Chat
│              │      │large-v3 │      └────────┘
│dummy_vosk.py │      │int8     │
│keeps Jigasi  │      └─────────┘
│alive         │
└──────────────┘
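The upload step in the middle is deliberately dumb: convert, copy, done. A sketch of its shape — the SSH alias and remote inbox path here are hypothetical, not our actual values:

```python
from pathlib import Path

# Hypothetical names, for illustration only.
WHISPER_HOST = "whisper-vm"
REMOTE_INBOX = "/srv/whisper/inbox"

def upload_commands(wav: Path) -> list[list[str]]:
    """Build the two commands run per recording: ffmpeg (WAV -> FLAC,
    lossless but smaller) and scp of the FLAC to the Whisper VM."""
    flac = wav.with_suffix(".flac")
    return [
        ["ffmpeg", "-y", "-i", str(wav), "-c:a", "flac", str(flac)],
        ["scp", str(flac), f"{WHISPER_HOST}:{REMOTE_INBOX}/"],
    ]
```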
The flagging side worked from day one. The regex tier catches the obvious stuff like phone numbers, "WhatsApp", and "add me on", and Haiku handles the edge cases where context matters, like a Yoruba vocabulary lesson that legitimately mentions WhatsApp. On the first batch of 15 transcripts, five triggered alerts and three were confirmed violations. That part of the system was not the problem.
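Tier 1 is nothing more exotic than a pattern list. A minimal sketch of the idea — these patterns are illustrative, not our production list:

```python
import re

# Illustrative tier-1 patterns; the real list is longer and tuned.
TIER1_PATTERNS = [
    re.compile(r"\+?\d[\d\s().-]{7,}\d"),        # phone-number-ish digit runs
    re.compile(r"\bwhatsapp\b", re.IGNORECASE),
    re.compile(r"\badd me on\b", re.IGNORECASE),
]

def tier1_flag(transcript: str) -> list[str]:
    """Return tier-1 matches; any hit escalates the transcript to tier 2."""
    hits = []
    for pattern in TIER1_PATTERNS:
        hits.extend(m.group(0) for m in pattern.finditer(transcript))
    return hits
```

Anything this layer flags goes to Haiku for a context-aware second opinion, which is what rescues the legitimate "we covered the word for WhatsApp today" cases.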
The recording side was the problem. For two weeks, we couldn't get Jigasi to record a single complete lesson.
Bug 1: Jigasi won't record audio without a transcription backend
The first thing I tried was the obvious thing. Set RECORD_AUDIO=true in Jigasi's config, set SAVE_JSON=false and SAVE_TXT=false because we didn't want Jigasi's transcription output (it uses Vosk by default, which has zero Yoruba support and is the wrong tool for our use case), and let it record.
No WAV files appeared. Jigasi joined conferences, then immediately left. The logs showed it was trying to start a Google Cloud transcription session, failing because we hadn't configured Google credentials, and exiting cleanly within a couple of seconds.
So I installed Vosk as the transcription backend, gave it a model file, started its websocket server. Jigasi stopped crashing on join. But it still wasn't producing WAV files.
This took two more rounds of investigation to figure out. Reading Jigasi's source code, the maybeStartRecording() method that actually writes the WAV file is only called inside a loop over "transcript publisher promises". With both SAVE_JSON=false and SAVE_TXT=false, zero publishers are registered, the promise list is empty, and the recording loop never executes. The recording feature is gated on at least one transcription publisher being active.
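In Python terms, the gate looks roughly like this. This is a paraphrase of the control flow I read in the Java; the class and method names are illustrative, not Jigasi's actual API:

```python
# Paraphrase of Jigasi's publisher gate, not its actual code.
class PublisherPromise:
    """Stand-in for a transcript publisher promise."""
    def __init__(self):
        self.recording = False

    def start_wav_writer(self):
        self.recording = True


def maybe_start_recording(promises, record_audio=True):
    # The WAV writer only starts inside this loop. With SAVE_JSON=false
    # and SAVE_TXT=false, no publishers register, the list is empty, and
    # RECORD_AUDIO=true silently does nothing.
    for promise in promises:
        if record_audio:
            promise.start_wav_writer()
    return any(p.recording for p in promises)
```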
The fix was to set SAVE_JSON=true, then have the upload script delete Jigasi's JSON files after extracting the room name and participant metadata.
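For reference, the property combination that ended up working, in Jigasi's sip-communicator.properties. The key names below are from Jigasi's transcription documentation; double-check them against your Jigasi version:

```
# Record mixed audio; structurally requires at least one active publisher.
org.jitsi.jigasi.transcription.RECORD_AUDIO=true
# Keep one publisher enabled purely to satisfy the gate; the upload script
# deletes the JSON after extracting room and participant metadata.
org.jitsi.jigasi.transcription.SAVE_JSON=true
org.jitsi.jigasi.transcription.SAVE_TXT=false
```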
The lesson here was small but real. When you set RECORD_AUDIO=true, you'd expect that to be sufficient. It is not. Jigasi's recording pipeline is structurally entangled with its transcription pipeline, and you can't run one without the other. That entanglement was about to become much more expensive.
Bug 2: The Vosk container is now using 400% CPU
With recording finally working, Jigasi was joining conferences and writing WAV files. Briefly. Within a few hours, the Jitsi VM started showing signs of distress. JVB stress hit 98%, threatening the live lessons that were actively running on the same machine.
The culprit was Vosk. The container I'd installed purely as a no-op transcription backend, to satisfy Jigasi's structural requirement and not to actually do anything with, was processing real-time speech recognition across every concurrent session. 402% CPU. 5.95GB of RAM. It was doing exactly what it was designed to do, which was the problem. I'd given it a real model file, so it was transcribing.
The fix was easy in retrospect. Vosk doesn't need to actually transcribe anything. Jigasi just needs something on the other end of the websocket that accepts audio frames and returns valid JSON. So I wrote a 50-line Python file that did exactly that and nothing more:
#!/usr/bin/env python3
"""
Minimal dummy Vosk-compatible websocket server.

Accepts Jigasi's audio stream over websocket and returns empty Vosk-format
JSON responses so that Jigasi's transcription pipeline stays active and
RECORD_AUDIO continues to produce WAV files.

Actual transcription is handled by a separate Whisper VM; this server
only exists to keep the websocket connection alive.

Listens on ws://localhost:2700 (same port as real Vosk).
"""
import asyncio
import json
import logging
import signal
import sys

import websockets

LOG = logging.getLogger("dummy-vosk")


async def handle_client(websocket):
    remote = websocket.remote_address
    LOG.info("Connection from %s", remote)
    try:
        async for message in websocket:
            if isinstance(message, str):
                # Text frames are Vosk control messages (config / eof).
                try:
                    data = json.loads(message)
                except json.JSONDecodeError:
                    continue
                if "config" in data:
                    LOG.debug("Config from %s: %s", remote, data)
                elif data.get("eof"):
                    LOG.info("EOF from %s", remote)
                    await websocket.send(json.dumps({"text": ""}))
                    break
            else:
                # Binary frames are audio; answer with an empty partial
                # result so Jigasi believes transcription is alive.
                await websocket.send(json.dumps({"partial": ""}))
    except websockets.exceptions.ConnectionClosed:
        LOG.info("Connection closed: %s", remote)
    finally:
        LOG.info("Disconnected: %s", remote)


async def main():
    logging.basicConfig(level=logging.INFO, stream=sys.stdout)
    loop = asyncio.get_running_loop()
    stop = loop.create_future()
    # Shut down cleanly on SIGTERM/SIGINT so restarts are graceful.
    for sig in (signal.SIGTERM, signal.SIGINT):
        loop.add_signal_handler(sig, stop.set_result, None)
    async with websockets.serve(
        handle_client, "localhost", 2700,
        ping_interval=20, ping_timeout=60,
    ):
        LOG.info("Dummy Vosk server listening on ws://localhost:2700")
        await stop


if __name__ == "__main__":
    asyncio.run(main())

22MB of RAM. Less than 1% CPU. Same port (2700), same protocol, no Jigasi config changes needed. I stopped the Vosk container and started the Python script.
CPU usage dropped to baseline immediately. JVB stress went back to single digits. Jigasi continued recording.
Bug 3: Jigasi leaves conferences when the lobby is enabled
This is where things got harder.
Recordings were now being created, but they were the wrong shape: 96% of them were 200-300KB, which is two or three seconds of audio. Jigasi was joining the conference, then leaving within 6-20 seconds.
I spent a day chasing the wrong hypotheses. Maybe ICE was failing because Jigasi and JVB were co-located on the same machine. Maybe the conference was falling back to P2P mode and bypassing JVB entirely. Maybe Vosk's websocket was timing out (it was a real possibility, even with the dummy server). Each one took an hour or two to rule out.
The actual cause was in Jigasi's Java source code, in a method called processChatRoomMemberLeft(). When the conference has lobby enabled and "single moderator mode" disabled, any time a member leaves and the room contains only Jigasi plus Jicofo, Jigasi calls stop() and exits. The reasoning, from the code, is that if there are no real users left, there's nothing to record.
The problem is that "member leaves" fires on any transient roster change. Tutor refreshes their browser. Student joins from the lobby (which is technically a roster transition). Any presence drop. Every time the roster blipped, Jigasi assumed the conference was over and exited.
Our Prosody config had muc_room_default_lobby = true. That setting is good for the actual lessons (we want students to wait in a lobby until the tutor admits them), but it was catastrophic for Jigasi's recording loop.
The fix was to advertise singleModeratorEnabled=true via Prosody's disco-info response so Jigasi would skip the leave check entirely. This required writing a custom Prosody module, which is a story for another post. For now, all that matters is that I eventually shipped a module that worked, and Jigasi started staying in conferences for the full lesson duration.
Bug 4: Jigasi is leaking threads
Within a few hours of the lobby fix landing, recordings started looking wrong again. Not in the same way as before. Lessons were being captured, but each lesson produced 20 to 30 separate WAV fragments instead of one continuous file.
The Jigasi process metrics told the story. Thread count had climbed from 55 (idle baseline) to 2,646. RSS had grown from 355MB to 9.6GB. The RTP sockets were unstable. Jigasi was disconnecting from conferences every 30 to 60 seconds, then reconnecting.
The leak source turned out to be the dummy Vosk server I'd written in bug 2. Every time Jigasi opens a recording session, it opens a websocket connection to the transcription backend per participant audio stream. When the RTP socket fails and Jigasi rejoins, it opens new websocket connections, but the old ones don't close. The dummy server I'd written had a 60-second pong timeout, which was generous enough to keep dead connections alive indefinitely. Each leaked session was holding around four threads. With four participants times two threads per stream times 25 reconnections per room, you can do the math.
The dummy backend that had saved us in bug 2 was now actively breaking us in bug 4. The root cause wasn't really the dummy server, it was the interaction between Jigasi's reconnection logic and any backend that didn't aggressively reap dead connections. But the dummy was what I had control over, so the fix went there.
A second version of the script (dummy_vosk_v2.py) added four things: a 70-minute hard cap on connection lifetime, a 10-second pong timeout (down from 60), a 100-connection cap with oldest-eviction, and an active reaper thread that swept dead connections every 30 seconds. None of this would matter for a real transcription backend running on its own machine. All of it mattered for a no-op stub running alongside Jigasi on the same VM.
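The interesting part of v2 is the bookkeeping, not the websocket plumbing. Here's a standalone sketch of that policy — a connection cap with oldest-first eviction plus a lifetime-based reaper. Class and method names are mine for illustration; this is not the actual dummy_vosk_v2.py:

```python
import time

class ConnectionRegistry:
    """Tracks connection ages so the server can enforce v2's safety rails:
    a hard lifetime cap (reap) and a max-connection cap with oldest-first
    eviction (register). Illustrative sketch, not the production code."""

    def __init__(self, max_conns=100, max_lifetime_s=70 * 60):
        self.max_conns = max_conns
        self.max_lifetime_s = max_lifetime_s
        self._opened_at = {}  # conn_id -> monotonic timestamp

    def register(self, conn_id, now=None):
        """Track a new connection; return ids evicted to stay under the cap."""
        now = time.monotonic() if now is None else now
        evicted = []
        if len(self._opened_at) >= self.max_conns:
            oldest = min(self._opened_at, key=self._opened_at.get)
            del self._opened_at[oldest]
            evicted.append(oldest)
        self._opened_at[conn_id] = now
        return evicted

    def unregister(self, conn_id):
        self._opened_at.pop(conn_id, None)

    def reap(self, now=None):
        """Return (and forget) every connection past the hard lifetime cap.
        In the real server, a periodic task closes the returned websockets."""
        now = time.monotonic() if now is None else now
        expired = [c for c, t in self._opened_at.items()
                   if now - t > self.max_lifetime_s]
        for c in expired:
            del self._opened_at[c]
        return expired
```

In the server itself, `register` runs at the top of the connection handler, `unregister` in its `finally`, and a background task calls `reap` every 30 seconds and closes whatever it returns.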
I deployed v2 at 02:59 UTC via a self-deleting cron job timed to run one minute before the existing 03:00 UTC Jigasi restart. The cron had automatic rollback to v1 if the new server failed to start. (No live lessons run between 03:00 and 06:00 UTC, which is why I scheduled it there.) It started cleanly. Jigasi restarted at 03:00 against the v2 stub.
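For the curious, a self-deleting cron entry has roughly this shape (path and script name hypothetical). The trick is that the deploy script's last step pipes `crontab -l` through `grep -v` of its own name back into `crontab -`, so the job can never fire twice:

```
# One-shot deploy at 02:59 UTC; the script removes this line itself.
59 2 * * * /opt/scripts/deploy_dummy_vosk_v2.sh
```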
Idle thread count dropped from 194 to 51. A 73% reduction. Recordings became single continuous files. 60-minute lessons captured intact.
What this all came down to
Two weeks of work, four bugs, and the most important component in the system ended up being a 50-line Python file that does almost nothing. The actual numbers, before and after the leak fix:
- Thread count: 7,048 peak to 51 idle baseline
- File descriptors: 10,334 peak to 296 idle baseline
- RSS: 9.6GB peak to 364MB idle baseline
- WAV fragments per lesson: 20-30 to 1
- Pipeline cost: estimated $500-1,000/month (Jibri) to $50-80/month (Whisper VM + dummy stub)
The reason this took two weeks instead of two days is that each layer of abstraction in Jigasi hid a coupling I didn't expect. Setting RECORD_AUDIO=true should be sufficient. It isn't, because there's a transcription publisher gate. The transcription backend should be replaceable. It is, but only if your replacement handles connection lifecycle aggressively, because Jigasi assumes it doesn't have to. The lobby check should be configurable. It is, but only via a disco-info field that requires writing a custom Prosody module to set.
None of these were Jigasi's fault, exactly. They were the natural consequences of a tool that's designed to do one thing (transcribe in real time, full stack) being repurposed to do something subtly different (record audio for batch transcription elsewhere). Every couple of layers, an assumption that held in the original use case stopped holding in mine.
The general lesson, if there is one, is that "real-time" and "batch" are different cost classes by an order of magnitude, and most real-time tools have hidden assumptions baked in that quietly become bugs when you try to use them for batch. The pipeline we ended up with works because it sidesteps the real-time path entirely. Audio is captured during the lesson, but transcription happens at 15x real-time on a single CPU VM, hours later, when nothing is on fire.
What I'd do differently
Two things, both genuinely "I should have known better."
The first is that I should have written the v2 dummy server on day one. The real Vosk container was never the right component for our architecture, even briefly. A no-op stub with aggressive timeouts is what the system needed, and writing it would have prevented bugs 2 and 4 entirely.
The second is that I almost made this worse by trying to use pyannote for speaker diarization. I installed it, confirmed it ran, then disabled it via a feature flag because it processes at 0.2x real-time on CPU. A 60-minute lesson would take five hours just to diarize. With 215 lessons per day, the queue would have grown without bound. A 30-second benchmark on a single audio file would have caught this, and I didn't run one. I just assumed that since Faster Whisper ran fast on CPU, pyannote would too. It doesn't. Speaker embedding models are a different compute profile entirely.
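The benchmark I skipped is thirty seconds of arithmetic. Using the numbers from this post (215 lessons/day, roughly 60 minutes each, Faster Whisper at ~15x real-time, pyannote at ~0.2x on CPU):

```python
LESSONS_PER_DAY = 215
MINUTES_PER_LESSON = 60
audio_minutes = LESSONS_PER_DAY * MINUTES_PER_LESSON  # 12,900 min of audio/day

# Compute minutes needed per day = audio minutes / processing speed.
whisper_minutes = audio_minutes / 15    # 860 min (~14.3 h): one CPU VM keeps up
pyannote_minutes = audio_minutes / 0.2  # 64,500 min (~45 days): queue grows forever
```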
There's a related debugging story I'll cover in a follow-up post: the silent-failure Prosody module that took several days to debug because Lua hooks fail in absolute silence when you call them with the wrong API. That one's its own thing. For now, the lesson recordings are working, the pipeline costs $50-80 per month, and the dummy backend is doing the most important job in the system, which is nothing.