The one-minute summary
A guest seated at a table still covered in pizza crumbs and a half-empty glass from the previous party will not write "the dining room needs cleaning" in a review. They will not come back, and they will leave two stars on Google Maps. Computer vision now spots the exact second a table empties out and fires an alert to the floor staff. LATAM QSR chains are rolling it out, and Dodo Pizza is one of the most dissected cases at engineering conferences.
- A dirty table costs the chain 3 to 7 USD per peak hour. Across 30 tables and 5 peak hours, that is 450 to 1,050 USD in lost revenue per day.
- Production stack for restaurant-table computer vision: YOLOv8/YOLO11 + an RTSP camera + an edge device (NVIDIA Jetson Orin Nano). F1-score in production: 92-96%.
- Dodo Pizza started with CV at the pickup counter: a finished pizza waiting more than 30 seconds triggers an alert. They then expanded the model to detect table state in the dining room.
- LATAM detail: cameras in the dining room are regulated by LGPD (Brazil), LFPDPPP (Mexico), Ley 29733 (Peru), and Ley 1581 (Colombia). Visible notice and privacy-by-design are required.
- MVP budget per restaurant: 1,500-3,000 USD in hardware plus 60-120 hours of an ML engineer. Payback closes within 3-6 months.
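The lost-revenue figure in the first bullet is plain multiplication. A minimal sketch of that arithmetic follows; the function and parameter names are illustrative, and the 3-7 USD per table per peak hour range is the article's own estimate rather than a measured value.

```python
def daily_dirty_table_loss(cost_per_table_hour_usd: float,
                           tables: int,
                           peak_hours_per_day: float) -> float:
    """Lost revenue per day if every table carries the given hourly cost during peak."""
    return cost_per_table_hour_usd * tables * peak_hours_per_day


low = daily_dirty_table_loss(3, tables=30, peak_hours_per_day=5)
high = daily_dirty_table_loss(7, tables=30, peak_hours_per_day=5)
print(f"{low:.0f} to {high:.0f} USD per day")  # 450 to 1050 USD per day
```

Swapping in your own table count and peak length gives a first-order ceiling for what the pilot can recover.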
Why catching dirty tables actually matters
In a sit-down restaurant, table turnover is direct revenue. Every minute between "guests left" and "the table is ready for the next party" is a lost seat. In QSR chains with dine-in — Dodo Pizza, Burger King, KFC, Starbucks — the average table turnover runs 20-35 minutes. If the floor staff spots a dirty table 4-6 minutes late, turnover drops by 10-15%.
The second pain is reputational. A guest sits at a table dotted with the previous diner's crumbs, snaps an Instagram story, and drops a "filthy" review on Google Maps. In LATAM, where Google Maps drives foot traffic in the tourist districts of cities like Lima, Cartagena, and CDMX, one review of that kind costs more than the floor cleaner's monthly salary.
The third is managerial: nobody sets a KPI on what they cannot measure. Before CV, "table cleanup time" was a myth. The manager would report "everything clean," and verifying it meant walking the floor in person.
Computer vision for retail and restaurants closes all three gaps. The camera watches the floor 24/7, and an object-detection model assigns every table one of four states on every frame: "occupied," "guests left," "dirty," or "clean, ready." The metric "average time from departure to ready" becomes a real number, one that bonuses can be paid against.
How it works under the hood
The stack 80% of LATAM QSR projects pick today rests on six components. Drop one — or swap it for something "simpler" — and accuracy slips below 85%, with the floor staff losing trust in alerts within two weeks.
#1. Video source
Existing dining-room IP cameras (RTSP stream, 1080p, 15-25 fps). No need to buy new cameras: the security cameras already in place work. The condition is that they be digital — analog CCTV cannot feed the pipeline.
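A 15-25 fps stream carries far more frames than table-state detection needs. The sketch below shows one way to thin it out with OpenCV's `VideoCapture`, using `grab()` to skip frames without decoding them. The helper names are hypothetical, and the target rate of one frame every two seconds is an assumption derived from the "5 frames over 10 seconds" rule described in the state-logic step.

```python
from typing import Any, Iterator


def frame_stride(camera_fps: float, target_fps: float) -> int:
    """How many camera frames to advance per analysed frame (at least 1)."""
    if camera_fps <= 0 or target_fps <= 0:
        raise ValueError("fps values must be positive")
    return max(1, round(camera_fps / target_fps))


def sample_frames(rtsp_url: str, camera_fps: float, target_fps: float) -> Iterator[Any]:
    """Yield roughly `target_fps` decoded frames per second from an RTSP stream."""
    import cv2  # opencv-python; imported lazily so frame_stride works without it

    stride = frame_stride(camera_fps, target_fps)
    cap = cv2.VideoCapture(rtsp_url)
    if not cap.isOpened():
        raise ConnectionError(f"cannot open stream: {rtsp_url}")
    try:
        index = 0
        while cap.grab():                   # advance one frame without decoding it
            if index % stride == 0:
                ok, frame = cap.retrieve()  # decode only the frames we keep
                if ok:
                    yield frame
            index += 1
    finally:
        cap.release()


print(frame_stride(25, 0.5))  # 50
print(frame_stride(15, 0.5))  # 30
```

A production reader would also need reconnect logic for dropped streams, which is left out here.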
#2. Edge device in the restaurant
An NVIDIA Jetson Orin Nano at 249-499 USD, or a mini PC with a GPU. The model runs in real time on-site and frames never leave the premises — critical for privacy and for avoiding 200-400 USD per month of RTSP traffic to the cloud.
#3. Detection model
An object-detection model from the YOLO family, usually YOLOv8m or YOLO11s, trained on a custom "dirty table / clean table" dataset. The COCO base model does not know what "table with pizza crumbs" looks like — it has to be fine-tuned on 800-2,000 labeled images from the specific restaurant. The official Ultralytics docs cover the fine-tuning flow step by step.
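As a rough illustration of that fine-tuning flow, the sketch below writes a dataset config and starts training with the Ultralytics Python API. The class list, the `tables.yaml` filename, the folder layout, and the epoch count are assumptions made for the example, not values from any of the projects described here.

```python
from pathlib import Path

# Hypothetical class list, borrowed from the table states named in this article.
CLASS_NAMES = ["occupied", "guests_left", "dirty", "clean_ready"]


def dataset_yaml(root: str, class_names: list[str]) -> str:
    """Render a dataset config in the YAML layout Ultralytics reads."""
    lines = [f"path: {root}", "train: images/train", "val: images/val", "names:"]
    lines += [f"  {i}: {name}" for i, name in enumerate(class_names)]
    return "\n".join(lines) + "\n"


def fine_tune(root: str, epochs: int = 100, imgsz: int = 640) -> None:
    """Fine-tune COCO-pretrained YOLOv8m weights on the restaurant's labeled images."""
    from ultralytics import YOLO  # lazy import: requires `pip install ultralytics`

    config = Path(root) / "tables.yaml"
    config.write_text(dataset_yaml(root, CLASS_NAMES))
    model = YOLO("yolov8m.pt")  # pretrained checkpoint as the starting point
    model.train(data=str(config), epochs=epochs, imgsz=imgsz)


print(dataset_yaml("/data/tables", CLASS_NAMES))
```

Calling `fine_tune("/data/tables")` assumes the images and YOLO-format label files already sit under `images/train` and `images/val` in that folder.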
#4. State logic
A state machine sits on top of the detector. A single frame decides nothing: "dirty table on frame 1" might be a shadow or a server's hand. The alert fires only when 5 consecutive frames over 10 seconds report the same state. Basic frame operations run on OpenCV.
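The consecutive-frame rule can be sketched as a small debouncer per table. This is a minimal version under one assumption: the detector is sampled about once every two seconds, so five identical readings in a row correspond to the 10-second window. The class and field names are illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TableDebouncer:
    """Confirms a table state only after `min_frames` identical consecutive readings."""
    min_frames: int = 5
    confirmed: Optional[str] = None   # last state that passed the filter
    _candidate: Optional[str] = None  # state currently being counted
    _streak: int = 0

    def update(self, observed: str) -> Optional[str]:
        """Feed one per-frame reading; return the state on a confirmed change, else None."""
        if observed == self._candidate:
            self._streak += 1
        else:
            self._candidate, self._streak = observed, 1
        if self._streak >= self.min_frames and observed != self.confirmed:
            self.confirmed = observed
            return observed
        return None


table7 = TableDebouncer()
readings = ["clean"] * 5 + ["dirty", "clean"] + ["dirty"] * 5
changes = [state for state in map(table7.update, readings) if state]
print(changes)  # ['clean', 'dirty']
```

The lone "dirty" reading in the middle (a shadow or a passing hand) resets the streak and never reaches the floor staff; only the sustained run does.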
#5. Event queue and integration
Apache Kafka or an MQTT broker pushes events to the operations system. The notification reaches the waiter on Slack or Telegram; in LATAM, the standard channel is the WhatsApp Business API. If the restaurant already runs Odoo POS, the Dodo Pizza Real-Time Order Chain case covers the integration.
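To make the event step concrete, here is a hedged sketch of building one table-state event and handing it to a transport. The topic scheme, the `store_id` value, and the `ts` field are assumptions; `publish` stands in for whatever client is wired in, for example the `publish(topic, payload)` method of a paho-mqtt client.

```python
import json
import time
from typing import Callable, Optional


def build_event(store_id: str, table: int, status: str,
                ts: Optional[float] = None) -> tuple[str, str]:
    """Return (topic, JSON payload) for one confirmed table-state change."""
    topic = f"stores/{store_id}/tables/{table}/status"  # hypothetical topic layout
    stamp = int(ts if ts is not None else time.time())
    payload = json.dumps({"table": table, "status": status, "ts": stamp})
    return topic, payload


def emit(publish: Callable[[str, str], None],
         store_id: str, table: int, status: str) -> None:
    """Hand the event to the transport; no image data is ever part of the payload."""
    topic, payload = build_event(store_id, table, status)
    publish(topic, payload)


# A list stands in for the broker here so the example runs without a network.
outbox: list[tuple[str, str]] = []
emit(lambda topic, payload: outbox.append((topic, payload)), "cdmx-01", 7, "dirty")
print(outbox[0][0])  # stores/cdmx-01/tables/7/status
```

Keeping the transport behind a plain callable makes it easy to swap MQTT for Kafka, or to test the pipeline offline.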
#6. Dashboard
The manager sees a floor map color-coded by table, the average cleanup time per shift, and the three slowest servers. Without this final piece, the system is just a detector nobody audits.
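The two dashboard numbers, average cleanup time and the slowest servers, are simple aggregations over the event log. A minimal sketch, assuming each cleanup has already been reduced to a (server name, seconds from "guests left" to "clean, ready") pair; that record shape and the sample names are hypothetical.

```python
from collections import defaultdict
from statistics import mean

Cleanup = tuple[str, float]  # (server name, seconds to get the table ready)


def average_cleanup_s(cleanups: list[Cleanup]) -> float:
    """Average cleanup time across the whole shift, in seconds."""
    return mean(seconds for _, seconds in cleanups)


def slowest_servers(cleanups: list[Cleanup], top: int = 3) -> list[tuple[str, float]]:
    """Servers ranked by their own average cleanup time, slowest first."""
    per_server: dict[str, list[float]] = defaultdict(list)
    for server, seconds in cleanups:
        per_server[server].append(seconds)
    averages = {server: mean(values) for server, values in per_server.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)[:top]


shift = [("Ana", 90), ("Luis", 240), ("Ana", 110), ("Marta", 150), ("Luis", 200)]
print(average_cleanup_s(shift))   # 158
print(slowest_servers(shift, 2))  # [('Luis', 220), ('Marta', 150)]
```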
Production accuracy lands between 92 and 96% F1-score. Roughly speaking, 4-8% of detections are false positives or false negatives (for example, a dirty table goes undetected for 30 seconds). With the state machine tuned correctly, that is a tolerable error rate.
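For readers who want the metric spelled out: F1 is the harmonic mean of precision and recall, so the same score can hide different mixes of false alarms and misses. The sketch below uses made-up counts to show two error profiles that both score 0.94.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from true positives, false positives, and false negatives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)  # share of alerts that were real
    recall = tp / (tp + fn)     # share of real dirty tables that were caught
    return 2 * precision * recall / (precision + recall)


# Balanced errors: 60 false alarms, 60 misses.
print(round(f1_score(tp=940, fp=60, fn=60), 2))   # 0.94
# Few false alarms but many misses: same F1, very different floor experience.
print(round(f1_score(tp=940, fp=20, fn=100), 2))  # 0.94
```

When evaluating a pilot, it is worth looking at precision and recall separately, since false alarms annoy the staff while misses cost covers.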
The Dodo Pizza case: where they started, where they landed
Dodo Pizza is a global chain with more than 1,000 stores, built on its own Dodo IS platform and known for radical operational transparency (real-time sales dashboards). It operates in Mexico, Brazil, Nigeria, the UK, and other markets.
Dodo's first CV use case was not tables, but the pickup counter: a camera over the counter detected when a finished pizza had been sitting more than 30 seconds and pinged the courier or the floor staff. That cut "ready → in the guest's hand" time by a double-digit percentage.
The team then extended the approach. According to public engineering posts on the Dodo blog at Habr and conference talks, CV at Dodo now covers three fronts: finished-product QC (pizza shape, box temperature), pickup-counter control (wait time), and dining-room and cleanup control (litter detection and table state).
In LATAM stores, dirty-table detection matters more than in Russian ones. Mexico and Brazil push dine-in; Russian pizzerias live on delivery. So the Mexico-Brazil perimeter became the natural pilot zone for table-status CV.
Dodo does not buy a packaged CV product from a vendor. They assemble the stack from open-source components — YOLO, OpenCV, Kafka — and keep the model on the edge device. That buys them two things: control over the data (dining-room frames never leave the store) and the ability to retrain for local quirks (a Mexican diner leaves different litter than a Russian one).
When it works, when it breaks
The pilot pays back when the concept is dine-in with 15+ tables and a real weekend peak. At 5 tables, the ROI does not close. Three conditions usually go unmentioned in sales pitches and break the project if they're missing.
Works cleanly when:
- The concept is dine-in with 15+ tables and a real lunch or weekend dinner peak.
- The restaurant already has IP cameras — or is willing to put 500-1,500 USD into the initial install per store.
- There's POS integration to cross-check "table free per POS" against "table free per camera." That closes the cold start.
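The POS cross-check in the last bullet amounts to comparing two sets of table numbers. A minimal sketch, assuming both systems can report which tables they currently consider free; the function name and result keys are hypothetical.

```python
def reconcile(pos_free: set[int], camera_free: set[int]) -> dict[str, set[int]]:
    """Split tables by which source reports them as free."""
    return {
        "agreed_free": pos_free & camera_free,
        "pos_only": pos_free - camera_free,     # POS says free, camera disagrees
        "camera_only": camera_free - pos_free,  # camera says free, POS disagrees
    }


result = reconcile(pos_free={3, 7, 12}, camera_free={7, 12, 15})
print(result["pos_only"])     # {3}
print(result["camera_only"])  # {15}
```

In the early weeks, the disagreement sets are a natural review queue: each mismatch is either a model error worth adding to the dataset or a bill that was closed at the wrong moment.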
Works with tuning when:
- Lighting shifts a lot through the day (large windows, direct sun). The dataset must cover multiple times of day, or the F1-score drops on the evening shift.
- The floor layout changes from time to time. Each table is its own region of interest, so the regions have to be re-labeled after every change.
- There are glass tables or highly reflective surfaces. The model gets confused by reflections — additional data is required.
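The per-table region of interest mentioned in this list can be as simple as a rectangle per table in camera pixels. The sketch below assigns a detection to the table whose region contains the centre of its bounding box; the coordinates and table numbers are invented for illustration, and real layouts may need polygons rather than rectangles.

```python
from typing import Optional

Box = tuple[float, float, float, float]  # x1, y1, x2, y2 in pixels

# Hypothetical layout for one camera; it must be redrawn when the floor plan changes.
TABLE_ROIS: dict[int, Box] = {
    7: (100, 200, 400, 450),
    8: (450, 200, 750, 450),
}


def table_for_detection(box: Box, rois: dict[int, Box]) -> Optional[int]:
    """Return the table whose region contains the centre of the detection box."""
    cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
    for table_id, (x1, y1, x2, y2) in rois.items():
        if x1 <= cx <= x2 and y1 <= cy <= y2:
            return table_id
    return None  # detection falls outside every known table


print(table_for_detection((120, 220, 380, 430), TABLE_ROIS))  # 7
print(table_for_detection((0, 0, 50, 50), TABLE_ROIS))        # None
```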
Does not work — or gets expensive — when:
- It's a fast-food format with disposable trays and the guest clears their own. The problem stops being "dirty table" and becomes "tray not returned" — different model, different business logic.
- The dining room is small (5-8 tables). Paying staff to walk the floor every 7 minutes is cheaper.
- Management does not act on alerts. If "table 7 dirty for 40 seconds" gets no response, the system degrades to ignored-by-default within two weeks.
Common rollout mistakes
#1. Starting with a pretrained model and skipping fine-tuning
Base YOLO knows the 80 COCO classes ("couch," "person," "fork") but not "table after pizza." With the pretrained model and no fine-tuning, you'll sit at 50-60% production accuracy, which infuriates the floor staff and kills trust in the system. The minimum is 800 labeled images from the specific restaurant.
#2. Streaming video to the cloud
RTSP to the cloud means 200-400 USD/month in traffic per restaurant plus a privacy footprint (LGPD/LFPDPPP fines). Inference belongs on the edge; only events like {"table":7,"status":"dirty"} should leave the store.
#3. Alerts without an operational change
The system gets installed but "table cleanup time" never enters the server's KPIs. Two weeks later alerts are ignored. CV without operational reform is money on fire.
#4. Ignoring LATAM privacy regulation
Brazil's LGPD requires a visible camera notice (see ANPD). Mexico's LFPDPPP is supervised by INAI. Skipping that means a fine plus a PR crisis. Privacy by design: faces are blurred on the edge, frames are not persisted.
#5. One dataset for the entire chain
A model trained in CDMX will not work in Cartagena (different tables, different lighting, different food). Fine-tune per cluster of restaurants.
Anonymous case: a 22-table pizzeria in CDMX
Independent pizzeria — not Dodo — with 22 tables in a tourist pocket of Roma Norte. Before the pilot: average turnover of 31 minutes per table, manual measurement showing 5.8 minutes between "guests left" and "table ready." Estimated cost: roughly 2,600 USD per week in covers lost across the Friday-Saturday peak.
Eight-week pilot: 2 existing IP cameras recabled, 1 Jetson Orin Nano, a YOLOv8m model fine-tuned on 1,400 images labeled in CVAT. Install cost: 4,800 USD covering hardware, WhatsApp Business integration, and hours from a freelance ML engineer based in Lima. Monthly operation: 380 USD.
After the pilot (week 8): average cleanup time dropped to 1.9 minutes, turnover tightened to 27 minutes. The added peak covers translated to roughly 1,350 USD per week of extra revenue. Effective payback at week 16. The manager backed the pilot by adding "cleanup time" to the shift close.
For the first two weeks the system was loud and the servers complained. Only after we fine-tuned against false positives from the window light did it settle.
LATAM context: price, law, and pilot reality
MVP budget per restaurant (Lima / Bogotá / CDMX market, May 2026):
| Component | USD range |
|---|---|
| Edge device (Jetson Orin Nano + case) | 300-500 |
| IP cameras (2-3 if not already installed) | 200-450 |
| ML build (model + integration), LATAM freelance | 2,500-5,000 |
| ML build, partner studio | 8,000-15,000 |
| Support and monthly retraining | 300-500 |
Total for a 3-5 store pilot: 15,000-30,000 USD upfront plus 1,500-2,500 USD/month in operations. For budget context on analogous projects, see machine learning for retail and QSR.
Legal minimum per country:
- BR: LGPD (Law 13.709/2018). Visible camera notice and, for mid-size chains, an appointed Encarregado de Dados (DPO).
- MX: LFPDPPP. Privacy notice at the entrance and on the receipt. Regulator: INAI.
- PE: Ley 29733. Registration of personal-data banks with Peru's ANPD.
- CO: Ley 1581 of 2012. Habeas Data, registered with SIC.
- AR: Ley 25.326. Registration with AAIP and visible signage.
The rule shared across jurisdictions: no frame persistence on the edge, real-time processing, only aggregated events leave the store ("table 7, dirty, 40 sec").
Payback. If a restaurant runs 20,000 to 50,000 USD per month in revenue and CV delivers a 5 to 8% turnover improvement (a typical pilot result), that's 1,000-4,000 USD per month in additional revenue. Pilot payback: 3-6 months. At chain scale (50+ stores), unit cost drops 2-3x thanks to a shared dataset and centralized DevOps.
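The payback estimate can be reproduced with two small functions. The revenue and improvement ranges come from the paragraph above; the 4,500 USD upfront cost and 400 USD monthly operations in the last call are illustrative values picked from inside the budget table's ranges, not figures from a real pilot.

```python
def monthly_uplift(revenue_usd: float, turnover_gain: float) -> float:
    """Extra monthly revenue if turnover improves by the given fraction."""
    return revenue_usd * turnover_gain


def payback_months(upfront_usd: float, uplift_usd: float,
                   monthly_ops_usd: float = 0.0) -> float:
    """Months until the upfront cost is covered by uplift net of operations."""
    net = uplift_usd - monthly_ops_usd
    if net <= 0:
        return float("inf")  # the pilot never pays back at these numbers
    return upfront_usd / net


print(f"{monthly_uplift(20_000, 0.05):.0f}")  # 1000
print(f"{monthly_uplift(50_000, 0.08):.0f}")  # 4000
print(f"{payback_months(4_500, 1_500, monthly_ops_usd=400):.1f} months")  # 4.1 months
```

Note that this treats added revenue as if it were all margin; plugging in contribution margin instead of revenue gives a more conservative figure.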
What to do next
Table CV is the first step. The next obvious use cases — queue management at the till, SOP compliance, product-defect detection, drive-thru queue monitoring — each deserve their own article. For a general definition and historical context of the technology, the Wikipedia overview on computer vision is the right starting point.
If your chain has 5 or more restaurants in LATAM and you are weighing a pilot, start with one store, one use case ("table free → server alert"), 4 weeks of data collection and 4 weeks of piloting. If 8 weeks fail to add 3% to turnover, the design is wrong.
To go deeper on the stack or talk through the pilot with a team that has already shipped CV in LATAM QSR, see the Computer Vision service page or book a 30-minute diagnostic.
Frequently asked questions
How many cameras do I need per dining room?
One camera covers 6-10 tables at a 100-120° viewing angle. A 25-table dining room needs 3-4 cameras. Every table must be visible without obstruction.
Can I reuse the existing security cameras?
Yes, if they are IP (RTSP), 1080p, and at a reasonable angle. Analog CCTV does not work — the pipeline requires a digital feed.
How long does dataset labeling take?
For a production model: 800-2,000 images × 2-4 minutes of labeling each comes to roughly 27-133 hours. It is usually more efficient to hire external labelers working in Roboflow or CVAT than to do it in-house.
What about guest privacy?
Faces are detected and blurred on-device before the frame enters the table detector. Frames are not stored; only events leave the premises. Combined with a visible camera notice, this addresses the core requirements of LGPD, LFPDPPP, and Ley 29733.
What accuracy is realistic?
92-96% F1-score on the test set with quality labeling and 1,500+ images. Below 90%, the system annoys the floor staff with false positives and loses their trust.
Does it work in a low-light dining room?
Modern IP cameras with IR illumination deliver acceptable quality in dim conditions. If the venue is intentionally dark (lounge, bar), expect problems. Test against the actual on-site conditions.
Is a turnkey solution worth buying?
In the US, Presto Vision and SMG are options. In LATAM there are a handful of local integrators — mostly in Brazil and Mexico. As of May 2026, a packaged solution runs 400-800 USD/month/restaurant against 200-300 USD/month for an in-house build. The packaged route fits chains of up to 10 stores.
What comes after the table pilot?
The natural next use cases are till-queue management, SOP compliance (hands washed, gloves on), product-defect detection, and drive-thru queue monitoring. Each is its own project with its own dataset and stack.
