Drinkbot

May 22, 2026
A story of friendship, 10,000 beers, and multi-agent orchestration. drinks-pearl.vercel.app

In May 2025, my friends and I wanted a way to stay in touch after graduation, so we started a challenge. The rules were simple.

“Text the groupchat a picture every time you have a beer.”

Repeat until ten thousand. One year later, we’re halfway done.

To celebrate, I built a live-updating leaderboard for my friends. Here’s how:

TL;DR on maintaining friendships:

  • Distance takes away spontaneous togetherness. Stay close by building a system of rituals to create spontaneity.
  • Keep a stupid, low-friction sense of play.
  • Small, consistent gestures go a long way, so make it easy to try a little, every day.
  • Spontaneity is the real friendship infrastructure. Friendship without structure evaporates, so invent the lowest-friction structure imaginable.
Chapter 1 - iMessage (Data Source)

The ideal format of every drink “log” is <image> <# drink> <type>. For example: <image> 9999, Modelo. In practice, the values may or may not be separated by a comma, and a log can also include a story, a thought, or a feeling. Data gets messy when you drink.
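Here is a minimal sketch of parsing the text half of such a log. It assumes the number always comes first and treats everything after it as free text. The names `DrinkLog` and `parse_log` are hypothetical and are not taken from the actual project.

```python
import re
from typing import NamedTuple, Optional


class DrinkLog(NamedTuple):
    number: int          # the running drink count, e.g. 9999
    rest: Optional[str]  # drink type and/or story, if any


# Leading number, an optional comma, then whatever else was typed.
LOG_RE = re.compile(r"^\s*(\d+)\s*,?\s*(.*)$", re.DOTALL)


def parse_log(text: str) -> Optional[DrinkLog]:
    """Return a DrinkLog, or None when the text does not start with a number."""
    match = LOG_RE.match(text)
    if match is None:
        return None
    rest = match.group(2).strip()
    return DrinkLog(number=int(match.group(1)), rest=rest or None)


print(parse_log("9999, Modelo"))
print(parse_log("9998 IPA, long day at work"))
print(parse_log("cheers"))
```

Anything this regex cannot read (corrections, ranges, numbers sent by someone else) is exactly the messy part that needs more than a pattern match.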

For a while, our friend Jacob manually maintained a spreadsheet with four columns: person / type of drink / date / details.

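When I first experimented with automating this process, I realized scraping iMessage data is notoriously difficult. There is no public API for reading messages, so the only real way in is chat.db, the local SQLite database that Messages keeps on a Mac. A Mac Mini signed in to the groupchat was perfect for this.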

The polling mechanism is a simple launchd job that pulls every message newer than the last one it saw, and hands the new rows (text + attached images) downstream for parsing. No webhooks, no cloud.
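A rough sketch of the "newer than the last one it saw" poll follows. It assumes the `message` table exposes `ROWID` and `text` columns, and it runs against an in-memory stand-in table rather than the real chat.db. The function name `fetch_new_messages` and the sample rows are hypothetical, and attachment lookup is left out.

```python
import sqlite3


def fetch_new_messages(conn, last_rowid):
    """Return (rows, new_cursor) for every message with ROWID > last_rowid."""
    rows = conn.execute(
        "SELECT ROWID, text FROM message WHERE ROWID > ? ORDER BY ROWID",
        (last_rowid,),
    ).fetchall()
    # Keep the old cursor when nothing new arrived.
    new_cursor = rows[-1][0] if rows else last_rowid
    return rows, new_cursor


# Stand-in for chat.db so the sketch runs anywhere (hypothetical sample data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE message (text TEXT)")
conn.executemany(
    "INSERT INTO message (text) VALUES (?)",
    [("9999, Modelo",), ("9998",)],
)

rows, cursor = fetch_new_messages(conn, last_rowid=0)  # first run: both rows
again, cursor = fetch_new_messages(conn, cursor)       # next run: nothing new
```

In a real launchd job the cursor would have to be persisted between runs (a small file or a table would do), and the database would ideally be opened read-only so the poller can never corrupt Messages' own data.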

But while I was building it, a complex data-wrangling problem emerged.

Chapter 2 - Edge Cases / Wrangling
  • Logging multiple drinks at once for others, with only one image
  • Easy: log text immediately before the image OR immediately after.
  • Correcting a mistake via <number> <or number> or <actually number>
  • “Someone do the math,” eventually followed by a <number> text from somebody else.
  • Two drinks in one log <number number>
  • Accidentally jumping 2 numbers.
  • Jumping / counting for others.
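To make the "text immediately before OR immediately after the image" case concrete, here is a small sketch that pairs each image with an adjacent text from the same sender. The `Message` and `LogEvent` types, the sample names, and the tie-break (prefer the text after the image) are all assumptions for illustration, not the project's actual logic.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Message:
    sender: str
    text: Optional[str] = None
    image: Optional[str] = None  # attachment path, if any


@dataclass
class LogEvent:
    sender: str
    image: str
    text: Optional[str]


def segment(messages: List[Message]) -> List[LogEvent]:
    """Attach to each image the neighbouring text sent by the same person."""
    used = set()  # indexes of text messages already claimed by an image
    events = []
    for i, msg in enumerate(messages):
        if msg.image is None:
            continue
        text = msg.text  # caption sent together with the image
        if text is None:
            for j in (i + 1, i - 1):  # assumption: try "after" first
                if 0 <= j < len(messages) and j not in used:
                    nb = messages[j]
                    if nb.image is None and nb.text and nb.sender == msg.sender:
                        text = nb.text
                        used.add(j)
                        break
        events.append(LogEvent(msg.sender, msg.image, text))
    return events


chat = [
    Message("jacob", image="a.jpg"),
    Message("jacob", text="9999, Modelo"),
    Message("sam", text="nice"),
    Message("sam", text="9998 IPA"),
    Message("sam", image="b.jpg"),
]
events = segment(chat)
```

A fixed rule like this breaks on a sequence such as text, image, text, image, where the first image grabs the second image's caption. That is the kind of ambiguity the rest of the list is about.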
Chapter 3 - Facial Embeddings

I tried multiple solutions:

  • Heuristics → falls apart because a person can log for others.
  • Facial recognition → tricky for hand shots (front-facing camera) or group photos
    • Determine if photo has faces or not.
      • If YES, open-set recognition system:
        • Detect face
        • Convert to embedding (ArcFace / FaceNet)
        • Compare against the 12 known people
        • Decide:
          • If similarity is high → assign person
          • If not → label as “unknown”
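The decide step above can be sketched as a nearest-match search with a similarity cutoff. This is an illustration only: the 3-dimensional vectors stand in for real ArcFace / FaceNet embeddings, the names in the gallery are hypothetical, and the 0.6 threshold is a placeholder that would need tuning on real photos.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def identify(
    embedding: List[float],
    gallery: Dict[str, List[float]],
    threshold: float = 0.6,  # placeholder cutoff, not a tuned value
) -> Tuple[str, float]:
    """Return (person, similarity), or ("unknown", similarity) below the cutoff."""
    best_name, best_sim = "unknown", -1.0
    for name, reference in gallery.items():
        sim = cosine(embedding, reference)
        if sim > best_sim:
            best_name, best_sim = name, sim
    if best_sim < threshold:
        return "unknown", best_sim
    return best_name, best_sim


# Toy gallery: one reference embedding per known person.
gallery = {"jacob": [1.0, 0.0, 0.0], "sam": [0.0, 1.0, 0.0]}
print(identify([0.9, 0.1, 0.0], gallery))  # close to jacob's reference
print(identify([0.5, 0.5, 0.7], gallery))  # close to nobody
```

The "unknown" branch is what makes this open-set: a stranger in a group photo should fall below the cutoff instead of being forced onto the nearest friend.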

This kinda worked!

But I decided that FULLY automating this would only be possible if we set two rules for submissions in the groupchat:

  1. If logging a drink for someone else, their face MUST be in the photo.
  2. Don’t jump numbers. If the last number was n, the log must include n-1 and the new last number, written as <1000, 999, 998>, <1000-998>, or <1000 + 999>.
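Rule 2 is mechanical enough to check in code. The sketch below expands the three allowed formats into a list of numbers and verifies that they continue from the last number seen. It reads the rule as a countdown (each new drink is n-1), which is an interpretation of the examples above. The function names are hypothetical, and the input is assumed to be only the number part of the message.

```python
import re
from typing import List


def expand_numbers(text: str) -> List[int]:
    """Turn '1000, 999, 998', '1000-998' or '1000 + 999' into a list of ints."""
    text = text.strip().strip("<>")
    range_match = re.fullmatch(r"\s*(\d+)\s*-\s*(\d+)\s*", text)
    if range_match:
        start, end = int(range_match.group(1)), int(range_match.group(2))
        step = -1 if start >= end else 1
        return list(range(start, end + step, step))
    # Comma- or plus-separated lists: just collect the numbers in order.
    return [int(n) for n in re.findall(r"\d+", text)]


def is_valid_log(numbers: List[int], last_seen: int) -> bool:
    """True when numbers run last_seen-1, last_seen-2, ... with no gaps."""
    if not numbers:
        return False
    expected = list(range(last_seen - 1, last_seen - 1 - len(numbers), -1))
    return numbers == expected


print(expand_numbers("<1000-998>"))
print(is_valid_log(expand_numbers("<1000, 999, 998>"), last_seen=1001))
print(is_valid_log([1000, 998], last_seen=1001))
```

A log that fails the check does not have to be rejected; it can simply be flagged for review, which keeps the ledger consistent without policing the chat.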
Chapter 4 - Agent Orchestration

At this point I realized what I could try next. “A person can log for others” is precisely why pure heuristics and pure face recognition each fall apart on their own. Neither signal wins by default, but an agent can weigh them. An orchestrator’s most valuable function is conflict detection: when a vision agent says “person A” but the sender + text say “person B,” that disagreement is exactly my ambiguous case. Instead of forcing a guess, the orchestrator:

  • resolves automatically when confidence is high and signals agree,
  • routes to a human-in-the-loop queue (a quick “who was this? [A] [B] [unknown]” ping) when they conflict or fall below threshold.
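The two branches above can be sketched as a single function. The `Decision` type, the `reconcile` name, and the 0.8 confidence threshold are hypothetical choices for illustration; `text_person` stands for whoever the sender + text point to.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Decision:
    action: str                  # "auto" or "ask_human"
    person: Optional[str]        # set only when resolved automatically
    candidates: Tuple[str, ...]  # choices offered in the human ping


def reconcile(
    vision_person: str,
    vision_conf: float,
    text_person: str,
    threshold: float = 0.8,  # placeholder, would need tuning
) -> Decision:
    """Resolve automatically on confident agreement, otherwise ask a human."""
    agree = vision_person == text_person
    confident = vision_person != "unknown" and vision_conf >= threshold
    if agree and confident:
        return Decision("auto", vision_person, (vision_person,))
    # Build the "[A] [B] [unknown]" options without duplicates.
    named = [p for p in (vision_person, text_person) if p != "unknown"]
    options = tuple(dict.fromkeys(named + ["unknown"]))
    return Decision("ask_human", None, options)


print(reconcile("jacob", 0.95, "jacob"))  # signals agree, high confidence
print(reconcile("jacob", 0.95, "sam"))    # conflict: ping the group
print(reconcile("jacob", 0.50, "jacob"))  # agreement, but low confidence
```

The point of returning a `Decision` instead of a name is that "I don't know" becomes a first-class outcome the rest of the pipeline can handle.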

A single model trying to do all of this at once is exactly what gets “confused”: it has to hold the whole chat history, the running count, the corrections, and the face match in one context window, and it loses track.

Multi-agent orchestration helps because it lets you give each ambiguity its own narrow agent with scoped context and one job, then have an orchestrator reconcile them against a single source of truth (a structured ledger, not the raw chat).

  1. Segmenter agent: turns the raw message stream into candidate “log events.” Its only job is boundaries: where does one drink-log end and the next begin? This is the “log text immediately after image” and “two drinks in one log <number number>” cases. It doesn’t care who drank or what the number is; it just chunks the stream into events with attached media.

  2. Vision agent: given one event’s image, returns {has_faces, person, confidence}. This is the open-set pipeline: detect → embed (ArcFace) → compare to the 12 → assign or unknown. Crucially it returns a confidence, not a verdict.

  3. Attribution agent: answers “who drank it?” by fusing three signals:

  • who sent the message,
  • the vision agent’s face match + confidence,
  • textual cues (“logging for Jacob,” “do 3 for me”).
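One simple way to fuse the three signals is a weighted vote, sketched below. Everything here is an assumption for illustration: the weights, the `attribute` and `text_cue` names, and the tiny "for <name>" regex, which a real attribution agent would replace with something far more robust.

```python
import re
from collections import defaultdict
from typing import Dict, Optional, Tuple

# Very rough cue detector: "logging for Jacob", "do 3 for me", ...
CUE_RE = re.compile(r"\bfor (\w+)", re.IGNORECASE)


def text_cue(text: Optional[str], sender: str) -> Optional[str]:
    """Return the person named after 'for', mapping 'me' to the sender."""
    match = CUE_RE.search(text or "")
    if match is None:
        return None
    name = match.group(1).lower()
    return sender if name == "me" else name


def attribute(
    sender: str,
    face_person: Optional[str],
    face_conf: float,
    text: Optional[str],
    w_sender: float = 0.4,  # hypothetical weights, not tuned
    w_text: float = 0.8,
) -> Tuple[str, Dict[str, float]]:
    """Score each candidate drinker and return (best, all_scores)."""
    scores: Dict[str, float] = defaultdict(float)
    scores[sender] += w_sender               # signal 1: who sent the message
    if face_person and face_person != "unknown":
        scores[face_person] += face_conf     # signal 2: face match + confidence
    cued = text_cue(text, sender)
    if cued:
        scores[cued] += w_text               # signal 3: textual cue
    best = max(scores, key=scores.get)
    return best, dict(scores)


print(attribute("sam", "jacob", 0.9, "logging for Jacob"))
print(attribute("sam", "unknown", 0.0, "9998 Modelo"))
```

Keeping the full score table around, not just the winner, is what lets an orchestrator notice when the top two candidates are close and hand the case to a human.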
Chapter 5 - The Frontend

Live-push FastAPI setup:

  • REST / JSON: the bulk of it. GET /leaderboard, /drinks, /total, /recent, /messages, /chats, and /attachment/{rowid} (returns a FileResponse image, not JSON).
  • SSE: the one streaming endpoint, /leaderboard/stream, via sse-starlette’s EventSourceResponse. One DB poll feeds all clients, and the browser gets pushed updates instead of asking. I later swapped SSE for plain 5s polling because it was simpler and survived the flaky tunnel better.

Basically, the watcher polled chat.db every 2s and wrote to drinks.db; the API’s SSE task polled drinks.db every 1s; and, after the swap, the browser polled the API every 5s.
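The "one DB poll feeds all clients" idea can be shown without any web framework. The sketch below uses only the standard library: a single poll loop publishes a Server-Sent Events frame to one queue per connected client, and only when the data changed. It is not the FastAPI / sse-starlette code itself; `Broadcaster`, `poll_loop`, `sse_frame`, and the `rounds` parameter (there so the demo terminates) are hypothetical names.

```python
import asyncio
import json


def sse_frame(payload, event="leaderboard"):
    """Format one Server-Sent Events message (event name, data, blank line)."""
    return f"event: {event}\ndata: {json.dumps(payload)}\n\n"


class Broadcaster:
    """Fans a single stream of updates out to every subscribed client."""

    def __init__(self):
        self._clients = set()

    def subscribe(self):
        queue = asyncio.Queue()
        self._clients.add(queue)
        return queue

    def unsubscribe(self, queue):
        self._clients.discard(queue)

    def publish(self, payload):
        frame = sse_frame(payload)
        for queue in self._clients:
            queue.put_nowait(frame)


async def poll_loop(broadcaster, read_leaderboard, interval=1.0, rounds=None):
    """Poll once per interval; publish only when the snapshot changed."""
    last, done = None, 0
    while rounds is None or done < rounds:
        current = read_leaderboard()
        if current != last:
            broadcaster.publish(current)
            last = current
        done += 1
        await asyncio.sleep(interval)


async def demo():
    broadcaster = Broadcaster()
    client_a, client_b = broadcaster.subscribe(), broadcaster.subscribe()
    snapshots = iter([{"jacob": 1}, {"jacob": 1}, {"jacob": 2}])  # fake DB reads
    await poll_loop(broadcaster, lambda: next(snapshots), interval=0, rounds=3)
    return client_a.qsize(), client_b.qsize(), client_a.get_nowait()


result = asyncio.run(demo())
```

Three reads of the fake database produce two frames per client, because the unchanged middle snapshot is skipped. Plain browser polling gives up that push behavior in exchange for having no long-lived connection to drop, which fits the flaky-tunnel trade-off described above.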

Chapter 6 - Agentic UI

I made a beautiful Spotify Wrapped-style visualization for our 5,000-drink milestone and one-year post-grad reunion, and I personally chose the best moments to highlight. But as the logging continues and time moves forward, I don’t want to curate 24/7.

Can I make an agent that curates my photo album? To be continued!