We built Serveka because meeting bot infrastructure is needlessly painful
By Serveka Team
Building a meeting bot from scratch means Playwright automation, WebRTC audio capture, Deepgram WebSocket management, VM orchestration, self-deleting infrastructure, and six months of your life.
We didn't set out to build meeting infrastructure. We set out to add a recording feature to an internal tool. It was supposed to take two days.
Three months later, we had a Frankenstein system stitched together from a Playwright bot, a Linux audio loopback driver, a WebSocket connection to Deepgram, a VM manager, and a cleanup daemon that ran on a cron job. The recording feature worked. But we'd written about 4,000 lines of infrastructure code to get there — code that had nothing to do with the product we were building.
We eventually shipped it. Then we talked to other teams building AI meeting products. They had all done the same thing. Every single one.
The five layers nobody warns you about
When you decide to add meeting bot functionality to a product, you imagine one integration. You get five, and they're all load-bearing.
Layer 1: Browser automation
Meeting platforms don't offer open APIs for bot participation. Google Meet and Zoom have SDKs, but those are designed for building client apps, not for programmatically joining a meeting as a participant. So you use Playwright or Puppeteer to drive a real browser.
This sounds simple until you hit waiting rooms, admission dialogs, "you've been removed" pop-ups, meeting passwords, SSO-gated orgs, and the delightful variety of ways each platform renders a full-screen notification that breaks your bot's DOM interaction. You end up writing a state machine that handles roughly 14 distinct bot lifecycle states before you can reliably call a bot "in the meeting."
And this is per platform. Zoom and Google Meet behave differently. Teams behaves differently again.
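To make the lifecycle state machine concrete, here is a minimal sketch. The state names and the transition table are illustrative assumptions, not the states we actually track, and a real bot would have more of them per platform.

```python
from enum import Enum, auto


class BotState(Enum):
    # Hypothetical states, chosen only for illustration.
    LAUNCHING = auto()
    JOINING = auto()
    WAITING_ROOM = auto()
    IN_MEETING = auto()
    REMOVED = auto()
    ENDED = auto()
    FAILED = auto()


# Allowed transitions. Anything not listed is treated as a bug.
TRANSITIONS = {
    BotState.LAUNCHING: {BotState.JOINING, BotState.FAILED},
    BotState.JOINING: {BotState.WAITING_ROOM, BotState.IN_MEETING, BotState.FAILED},
    BotState.WAITING_ROOM: {BotState.IN_MEETING, BotState.REMOVED, BotState.FAILED},
    BotState.IN_MEETING: {BotState.REMOVED, BotState.ENDED, BotState.FAILED},
    BotState.REMOVED: set(),
    BotState.ENDED: set(),
    BotState.FAILED: set(),
}


class BotLifecycle:
    """Tracks one bot's state and rejects transitions the table does not allow."""

    def __init__(self):
        self.state = BotState.LAUNCHING
        self.history = [self.state]

    def advance(self, new_state):
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(
                f"illegal transition {self.state.name} -> {new_state.name}"
            )
        self.state = new_state
        self.history.append(new_state)

    @property
    def in_meeting(self):
        return self.state is BotState.IN_MEETING
```

Keeping the transitions in one table means each platform-specific quirk becomes a new row rather than another conditional buried in the automation code.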
Layer 2: Audio capture
The browser automation gets your bot into the meeting. Now you need audio. On Linux — which is where you're running this, because that's the only sensible choice for containerized bots — you set up a virtual audio device with ALSA and PulseAudio, route the browser's audio output through it, and tap that stream with an audio capture library.
ALSA configuration is not a pleasant experience. Getting a virtual sink to stay alive across container restarts, handle sample rate mismatches, and not produce clicks and pops in the transcript took us about a week of solid work. This is time spent on audio driver configuration, not on your product.
If you also want the bot to speak, for example to read back summaries or prompt participants, you need to go the other direction: text-to-speech output, injected back into the virtual audio device, played through the browser, and heard by participants. That's a separate integration.
Layer 3: Real-time transcription
Deepgram gives you a WebSocket. You stream PCM audio in, and transcription segments come back as JSON. The integration itself is not hard. Managing the WebSocket lifecycle is.
WebSockets drop. The meeting network is noisy. Deepgram's connection times out after periods of silence. You need reconnect logic with exponential backoff, segment buffering so you don't lose transcription during reconnects, deduplication because segments sometimes arrive twice, and a way to correlate transcript timestamps with meeting timestamps. You also need to handle the "finalized" vs "interim" segment distinction or your transcript will be full of duplicated partial sentences.
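Two of the pieces above, reconnect backoff and interim-versus-final handling with deduplication, can be sketched without any network code. This is a simplified illustration under stated assumptions: the `(start, end, text, is_final)` segment shape and the choice of "full jitter" backoff are ours for the example, not a description of any particular client library.

```python
import random


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random.random):
    """Full-jitter exponential backoff.

    Returns a random delay in [0, min(cap, base * 2**attempt)] seconds.
    """
    return rng() * min(cap, base * (2 ** attempt))


class TranscriptAssembler:
    """Keeps each finalized segment once and only the latest interim text."""

    def __init__(self):
        self.final_segments = []
        self.interim_text = ""
        self._seen = set()

    def add(self, start, end, text, is_final):
        if not is_final:
            # Interim results are replaced by later ones, never appended.
            self.interim_text = text
            return
        key = (round(start, 2), round(end, 2), text)
        if key in self._seen:
            return  # duplicate delivery, e.g. replayed after a reconnect
        self._seen.add(key)
        self.final_segments.append((start, end, text))
        self.interim_text = ""

    def text(self):
        # Sort by start time so late-arriving segments land in order.
        return " ".join(t for _, _, t in sorted(self.final_segments))
```

A caller would sleep for `backoff_delay(attempt)` before each reconnect attempt and reset `attempt` to zero once a connection has stayed healthy for a while.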
If you want speaker diarization — knowing which participant said what — you need to request it explicitly, handle the speaker label rotation that happens when Deepgram's model resets, and map Deepgram's generic "Speaker 1" labels back to the actual participant names you're tracking from the browser DOM.
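One possible way to do that mapping is to match each generic speaker label to the participant whose on-screen "active speaker" intervals overlap its segments the most. The sketch below assumes you already have both lists with timestamps on the same clock; the tuple shapes are hypothetical, and it does not handle label rotation, which you could approximate by re-running the mapping for each connection.

```python
from collections import defaultdict


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def map_speakers(segments, active_intervals):
    """Map diarization labels to participant names by total overlap.

    segments: list of (speaker_label, start, end) from the transcript
    active_intervals: list of (participant_name, start, end) observed in the DOM
    """
    totals = defaultdict(lambda: defaultdict(float))
    for label, s_start, s_end in segments:
        for name, a_start, a_end in active_intervals:
            shared = overlap(s_start, s_end, a_start, a_end)
            if shared > 0:
                totals[label][name] += shared
    # Pick the participant with the largest accumulated overlap per label.
    return {label: max(names, key=names.get) for label, names in totals.items()}
```

Labels with no overlapping interval are simply absent from the result, so the caller can fall back to the generic label.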
Layer 4: VM orchestration
You can't run meeting bots in a shared process. A browser crash in one bot would take down others. A misbehaving audio driver would corrupt audio for everyone. Each bot needs to run in its own isolated environment.
So you build a VM manager. It assigns a dedicated VM to each bot join request, injects the right credentials and meeting URL, monitors the VM's health, handles the case where the meeting never actually starts, and tears down the VM when the meeting ends. Provisioning a fresh VM takes 90–120 seconds, which is too slow for meetings that have already started, so you maintain a pool of pre-warmed VMs ready to receive the next join request.
The pool needs to be sized correctly (too small and you have capacity problems, too large and you're burning money on idle VMs), refreshed when VMs get stale, and distributed across zones for reliability.
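A rough starting point for sizing is to cover the join requests that arrive while one replacement VM is still provisioning, plus some headroom. The function below is a back-of-the-envelope sketch; the headroom factor and the defaults are assumptions to tune against real traffic, not recommended values.

```python
import math


def warm_pool_target(joins_per_minute, provision_seconds=120.0,
                     headroom=1.5, minimum=1):
    """Estimate how many warm VMs to keep ready.

    joins_per_minute * provision_seconds / 60 is the number of joins expected
    during one provisioning cycle; headroom pads that for bursts.
    """
    in_flight = joins_per_minute * provision_seconds / 60.0
    return max(minimum, math.ceil(in_flight * headroom))
```

With 6 joins per minute and a 120-second provisioning time, this asks for 18 warm VMs: 12 to cover one provisioning cycle, times the 1.5 headroom factor.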
Layer 5: Self-deleting infrastructure
Meeting bots exist for the duration of a meeting and then they're done. The VM needs to shut itself down. The recordings need to be moved to storage. The transcript needs to be finalized and delivered. The webhook events need to be fired. All of this needs to happen reliably, even if the VM loses its network connection in the middle of a meeting.
If any of this fails silently, you end up with leaked VMs running up your cloud bill, partial recordings, lost transcripts, or missing webhook deliveries. You need a cleanup daemon that runs independently of the bot process, can detect zombie bots, and can recover partial state from wherever it was interrupted.
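The zombie check at the heart of such a daemon can be quite small. This sketch assumes each bot writes a heartbeat timestamp somewhere the daemon can read; the `BotRecord` fields and the 120-second timeout are hypothetical, and the actual recovery steps (uploading partial recordings, deleting the VM) are left out.

```python
from dataclasses import dataclass


@dataclass
class BotRecord:
    bot_id: str
    last_heartbeat: float  # unix seconds of the most recent heartbeat
    meeting_ended: bool    # the bot reported that the meeting is over
    vm_deleted: bool       # the VM has been confirmed deleted


def find_zombies(records, now, heartbeat_timeout=120.0):
    """Return IDs of bots whose VM still exists but should not.

    A bot is a zombie if its meeting ended without the VM being deleted,
    or if it has been silent for longer than heartbeat_timeout.
    """
    zombies = []
    for record in records:
        if record.vm_deleted:
            continue
        silent = now - record.last_heartbeat > heartbeat_timeout
        if record.meeting_ended or silent:
            zombies.append(record.bot_id)
    return zombies
```

Because the check only reads stored state, it can run on a different machine from the bots and keeps working when a bot's VM has lost its network connection.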
Every team rebuilds this from scratch
After we shipped our own version of this, we started talking to other engineering teams. The pattern was identical every time. Team wants meeting intelligence. Assigns one engineer for "a few weeks." That engineer disappears into browser automation and audio drivers. The feature ships six to twelve weeks later. The engineer has become the internal expert on ALSA configuration and Deepgram WebSocket reconnect logic.
This is not a product problem. It's infrastructure. The actual product work — what you do with the transcript, how you surface insights, how you integrate meeting data into your workflow — that's where the differentiation is. Nobody is building a competitive advantage on their ALSA configuration.
And yet every team that wants meeting intelligence has to build it, maintain it, and debug it when it breaks at 2 AM because a new Chrome update changed the way audio is routed.
What we built
Serveka is what we wished existed when we were building that first recording feature. One API call. Bot joins in about 15 seconds. Real-time transcript arrives over SSE within 300ms of each speaker finishing a sentence. Recordings are MP4 with synchronized audio. Webhooks fire for every lifecycle event — bot joined, participant joined, transcript segment, meeting ended, recording ready.
We maintain all five layers. You don't have to think about them. When Zoom updates its web client and breaks the waiting room automation, we fix it. When Deepgram changes its WebSocket behavior, we update the integration. When Google Meet adds a new kind of admission dialog, we handle it.
The API is deliberately minimal. You POST a meeting URL, get back a bot ID, and then receive events. That's it. The complexity is ours to own.
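On the receiving side, consuming an SSE stream mostly means parsing the standard `event:` and `data:` line format. The parser below is a generic sketch of that format only. The event names and JSON payloads in the usage note are made up for illustration and are not Serveka's actual schema, so check the API reference for the real ones.

```python
import json


def parse_sse(lines):
    """Yield (event_name, payload) for each event in an iterable of SSE lines.

    Assumes every event's data is a JSON document, which is an assumption
    of this sketch rather than a property of SSE itself.
    """
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data:
                yield event, json.loads("\n".join(data))
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # The format allows one optional space after the colon.
            data.append(value[1:] if value.startswith(" ") else value)
```

Feeding it the lines `event: transcript.segment`, `data: {"text": "hello"}`, and a blank line would yield one `("transcript.segment", {"text": "hello"})` pair, which your code can then route to a handler per event type.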
Who this is for
If you're building an AI notetaker, a sales call intelligence tool, a coaching platform, an accessibility product for the deaf and hard-of-hearing, a legal transcript service, or anything else that needs to be in meetings and get data out — Serveka is the infrastructure layer you don't want to build.
You can have a bot in a meeting and transcript data arriving in your system in an afternoon. We've seen teams go from "we want meeting intelligence" to a working integration in a day. The product work starts immediately. The infrastructure work is already done.
There's a free tier with full API access and no credit card required. The quickstart gets you a working bot in under five minutes.