
Lead scoring in LATAM: from rules to ML for B2B teams in 2026

Three maturity levels (rules → logistic regression → XGBoost), metrics so you don't fool yourself, and a real Mexico case — 3.1% → 8.7% in 90 days.
What actually works, what burns money, and where ML has no business being.

Sergei Filatov, Founder · data-metrics.pro · May 26, 2026
◷ 13 min read

One-minute summary

If your SDR burns 70% of their day chasing leads that will never close, you don't have lead scoring. You have a color-coded spreadsheet and intuition that won't scale. You fix this in 4–12 weeks with discipline — not with an "AI feature".

79% of B2B leads never make it to a deal. The number has been recycled in MarketingSherpa and Forrester reports since 2012 and it hasn't moved in a decade. LATAM is worse: the regional B2B SaaS market grew from USD 1.2B in invested capital in 2020 to USD 4.8B+ in 2024 (regional VC trackers), but 80% of that growth is eaten by SDR teams that qualify leads by feel.

Lead scoring is not a "CRM feature" or an AI button. It is a contract between marketing and sales about which lead is worth a call today and which one stays in nurture until it ripens. Without a formal scoring layer, every SDR builds their own qualification model in their head — and those models are incompatible with each other, unverifiable, and collapse the moment someone rotates off the team.

This article is the pillar on lead scoring for B2B teams in LATAM. Inside: three maturity levels (from CRM rules to production ML), exact model-quality metrics, a real Mexico case where MQL→demo went from 3.1% to 8.7%, and an honest list of situations where lead scoring won't work — because half of the regional consultants sell "AI models" where there isn't even basic telemetry.

For the record: Sergei Filatov / Hacker Sergio, Forbes 30 Under 30 LATAM. We have shipped predictive scoring and CRM analytics for Aeroflot (loyalty +22%), Estée Lauder (12-brand remarketing, ROAS 1.5× → 4.2×) and large retail chains in Russia and Mexico. Now — for SMBs and mid-market in LATAM via applied Machine Learning.

TL;DR for executives

  • Lead scoring = ranking leads by close probability. Without it, the SDR works blind and wastes 60–70% of their time on non-target contacts.
  • Three maturity levels: (1) rules in Odoo/HubSpot/Salesforce — 1–2 weeks; (2) logistic regression — 4–6 weeks; (3) gradient boosting (XGBoost/LightGBM) with MLOps — 8–12 weeks.
  • Minimum data for ML: 500 leads/month + 12 months of history + ≥ 100 closed deals as the positive class.
  • Baseline quality metrics: AUC-ROC ≥ 0.75, precision@top-decile ≥ 3× baseline conversion, lift in the first decile ≥ 2.5×.
  • Typical LATAM ROI: MQL→demo from 3% to 8% in 90 days, +30–50% in deal velocity, payback 6–12 weeks.
  • Where it does NOT work: startups under 100 leads/month, sales cycles over 18 months, outbound with a different ICP per campaign.
Which level are you on today? If you never calibrated your scoring, assume you are at level 0 — spreadsheet and gut feel — even if your CRM tells you otherwise. Read the section on when it works and when it doesn't before spending a dollar on ML.

Context: why now, why LATAM

Lead scoring as a concept has been around since the 1980s. IBM popularized BANT (Budget, Authority, Need, Timeline) for qualifying enterprise customers. In 2008 HubSpot and Marketo democratized rule-based scoring for SMB: every lead action (email open, whitepaper download, pricing-page visit) added points; past the threshold, the lead went to an SDR.

Since 2018 the standard for B2B SaaS with ≥ 500 leads/month is predictive lead scoring via ML. Salesforce Einstein, HubSpot Predictive Scoring, Marketo Predictive Audiences, Microsoft Dynamics Sales Insights — they all share one idea: the model learns from your historical win/loss data and predicts which new lead statistically resembles the ones that closed before.

#1. Three structural shifts in LATAM 2024–2026

B2B SaaS boom. Tiendanube, Bitso, Kavak, Clip, Cobre, Nowports closed Series B–D rounds focused on operational efficiency. That means more CRO/RevOps roles in the region, arriving with US playbooks and demanding lead scoring as the minimum viable piece of a sales process.

WhatsApp as the primary channel. In Mexico, Colombia, Argentina, and Peru, 70–85% of B2B communication runs on WhatsApp Business. This breaks the classic email funnel: the lead does not leave an email or download a PDF — they message "I want info" in the chat 30 seconds after seeing the ad. Lead scoring has to incorporate WhatsApp response velocity, first-message length, and question specificity. None of that ships out of the box in HubSpot, Salesforce, or Marketo — you need custom feature engineering.

AI-assisted SDRs. ChatGPT, Claude, and Gemini are wired into the CRM stack via Zapier, Make, Bardeen, or direct API integrations. The SDR no longer writes first-touch by hand: they pick between 5 generated variants. Scoring stops being just a filter and becomes a dispatcher — it decides which prompt template the model should use for each specific lead.

#2. Which of the three situations are you in?

If you run a B2B team in Mexico City, Bogotá, Lima, Santiago, or Buenos Aires in 2026:

  • A. You have no lead scoring. SDRs work leads in arrival order (FIFO), MQL→demo conversion sits below 5%, and nobody knows why. Our estimate: ~60% of B2B companies in the region.
  • B. You have rule-based scoring, but it has not been calibrated in a year or more. The rules were written by the previous head of sales, the ICP shifted twice, and the weights were never revisited. Conversion degrades 10–15% per year from drift.
  • C. You have an ML model, but no MLOps. Someone from data science built the scoring a year ago, pushed it to production via a cron job, and the model is drifting — with no monitoring.

What to do in each case — below.

Three technical maturity levels

#1. Level 1 — Rule-based scoring in the CRM (1–2 weeks)

The baseline every company starts from. Inside Odoo CRM, HubSpot, Salesforce, or Pipedrive you define rules like "+10 if role is CEO/CTO", "+15 if company has 50–500 employees", "−20 if email is gmail.com", "+25 if visited /pricing in the last 7 days".

Minimum feature set for LATAM B2B:

Demographic:

  • Role (CEO/CTO/Director — high weight; Analyst/Coordinator — low weight)
  • Company size (50–500 employees — sweet spot for mid-market SaaS)
  • Industry (e-commerce, fintech, retail chain — priority; agencies, freelancers — lower)

Firmographic:

  • Geolocation (Mexico City, São Paulo, Bogotá, Lima, Santiago — higher; tier-2 cities — lower)
  • Tech stack (Shopify, VTEX, Stripe, AWS — high weight)
  • Company stage (Series A+ vs bootstrap)

Behavioral:

  • Opened an email in the last 7 days (+5)
  • Visited /pricing (+15), /demo (+25)
  • Downloaded a case study (+10)
  • Replied to the nurture sequence (+20)

Negative signals:

  • Personal email (gmail/hotmail/yahoo): −10
  • Job-seeker / student based on profile: −30
  • Competitor: −1000 (effectively exclude)

In Odoo this is configured via the crm_score module (community) or predictive_lead_scoring (Enterprise). In HubSpot, the Score builder in Marketing Hub Professional/Enterprise. In Salesforce, Lead Scoring + Einstein Lead Scoring. Thresholds: MQL = ≥ 50 points, SQL = ≥ 80, hot = ≥ 100. Tune them to the distribution of your historical data.
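
As a rough illustration of what such a rule set amounts to outside any particular CRM, here is a minimal sketch in plain Python. The weights and thresholds are the illustrative ones from the lists above; the `Lead` fields and function names are hypothetical, not part of Odoo, HubSpot, or Salesforce.

```python
from dataclasses import dataclass

# Assumed lookup sets; extend them to match your own ICP.
PERSONAL_DOMAINS = {"gmail.com", "hotmail.com", "yahoo.com"}
SENIOR_ROLES = {"CEO", "CTO", "Director"}


@dataclass
class Lead:
    role: str
    employees: int
    email: str
    opened_email_7d: bool = False
    visited_pricing: bool = False
    visited_demo: bool = False
    downloaded_case_study: bool = False
    replied_to_nurture: bool = False
    is_job_seeker: bool = False
    is_competitor: bool = False


def rule_score(lead: Lead) -> int:
    """Sum the points of every rule that matches the lead."""
    domain = lead.email.rsplit("@", 1)[-1].lower()
    rules = [
        (lead.role in SENIOR_ROLES, 10),
        (50 <= lead.employees <= 500, 15),
        (lead.opened_email_7d, 5),
        (lead.visited_pricing, 15),
        (lead.visited_demo, 25),
        (lead.downloaded_case_study, 10),
        (lead.replied_to_nurture, 20),
        (domain in PERSONAL_DOMAINS, -10),
        (lead.is_job_seeker, -30),
        (lead.is_competitor, -1000),  # effectively excludes the lead
    ]
    return sum(points for matched, points in rules if matched)


def tier(score: int) -> str:
    """Map a score to the MQL / SQL / hot thresholds from the text."""
    if score >= 100:
        return "hot"
    if score >= 80:
        return "SQL"
    if score >= 50:
        return "MQL"
    return "nurture"
```

Keeping the rules in a single list of (condition, points) pairs makes the weights easy to review with sales and easy to recalibrate later.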

#2. Level 2 — Logistic regression (4–6 weeks)

When the rules plateau (usually around 6–12 months), you move to ML. Logistic regression is the right first step: interpretable (every feature has a coefficient with a clear sign), trains fast, does not overfit on small datasets.

Stack:

  • Python 3.11+, scikit-learn, pandas, numpy
  • Storage: PostgreSQL (Odoo's DB) or BigQuery/ClickHouse for analytics
  • Feature engineering: dbt or Airflow
  • Serving: a FastAPI endpoint with a webhook into the CRM

Pipeline:

  1. Extract — all leads from the last 12–24 months with outcome (closed_won / closed_lost / no_response).
  2. Labels — binary: 1 = closed_won within 90 days of creation, 0 = everything else.
  3. Features — 15–30 demographic / firmographic / behavioral attributes.
  4. Train/test split — 80/20, stratified by label.
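  5. Training — LogisticRegression(class_weight='balanced', penalty='l1', solver='liblinear') for imbalanced data + automatic feature selection. The solver must be set explicitly: scikit-learn's default lbfgs solver does not support the L1 penalty.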
  6. Evaluation — AUC-ROC, precision-recall curve, calibration plot, confusion matrix.
  7. Deploy — model in joblib + FastAPI service + cron retraining every 30 days + MLflow registry.
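
Steps 2 and 4 of the pipeline above can be sketched with the standard library alone. This is a minimal illustration, assuming each lead carries a creation date, an outcome string, and an optional close date; the function names are hypothetical, and in a real project scikit-learn's own stratified splitting would normally do the second part.

```python
import random
from datetime import date
from typing import List, Optional, Tuple


def make_label(created: date, outcome: str, closed: Optional[date],
               window_days: int = 90) -> int:
    """Return 1 if the lead was closed_won within window_days of creation, else 0."""
    if outcome != "closed_won" or closed is None:
        return 0
    return int((closed - created).days <= window_days)


def stratified_split(labels: List[int], test_size: float = 0.2,
                     seed: int = 42) -> Tuple[List[int], List[int]]:
    """Split row indices so train and test keep the same share of positives."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_size)
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return sorted(train), sorted(test)
```

Stratifying matters here because closed_won is the rare class: a plain random split on a small dataset can leave the test set with almost no positives.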

Target metrics:

  • AUC-ROC ≥ 0.75 (random baseline = 0.5)
  • Precision@top-10% ≥ 3× baseline conversion rate
  • Lift in the first decile ≥ 2.5×

If you land on AUC < 0.65 — the data is noisy or the features do not separate classes. Do not deploy: dig into features and data quality.
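
To make the three target metrics concrete, here is a small standard-library sketch of how they can be computed from test-set labels and model scores. The function names are hypothetical. AUC is computed by pairwise comparison, which is fine for a sanity check but quadratic in the number of leads; a library implementation is the better choice at scale.

```python
def auc_roc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision_at_top(y_true, scores, fraction=0.1):
    """Conversion rate among the top `fraction` of leads ranked by score."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    k = max(1, round(len(ranked) * fraction))
    return sum(y for _, y in ranked[:k]) / k


def lift_at_top(y_true, scores, fraction=0.1):
    """Top-bucket conversion rate divided by the overall (baseline) rate."""
    baseline = sum(y_true) / len(y_true)
    return precision_at_top(y_true, scores, fraction) / baseline
```

Read the output against the baseline conversion rate, not in isolation: a top-decile precision of 20% is excellent when the baseline is 3% and unremarkable when it is 15%.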

#3. Level 3 — Gradient boosting (XGBoost/LightGBM, 8–12 weeks)

When you have ≥ 5,000 leads/month and complex non-linear interactions between features (for example, company_size × industry × source — pure non-linearity), you move to boosting.

Advantages:

  • Captures non-linear interactions and pairwise effects better.
  • Handles missing values natively.
  • SHAP values for interpretability — critical for the sales team: they need to understand why the score is what it is.

Disadvantages:

  • Needs more positive examples (5,000+).
  • Harder in production (versioning, monitoring, A/B testing).
  • Can overfit fast on small segments.

Additions to the stack:

  • MLflow for experiment tracking and model registry.
  • DVC or LakeFS for data versioning.
  • Evidently AI or Arize for drift monitoring.
  • Grafana dashboard for business metrics (conversion by score bucket).

Production metrics:

  • AUC-ROC ≥ 0.82
  • Precision@top-decile ≥ 4× baseline
  • Drift detection: PSI < 0.1 per feature, monthly.
  • Recalibration cadence: ≤ 30 days.

| Level | Time | Min leads/month | Expected AUC | Typical cost USD |
|---|---|---|---|---|
| 1 · CRM rules | 1–2 wks | 50+ | n/a | 3,000–8,000 |
| 2 · Logistic regression | 4–6 wks | 500+ | 0.72–0.80 | 15,000–35,000 |
| 3 · XGBoost + MLOps | 8–12 wks | 5,000+ | 0.82–0.88 | 40,000–120,000 |
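
The "PSI < 0.1 per feature" check in the production metrics can be illustrated with a short standard-library sketch. It assumes a numeric feature, bins it by quantiles of the reference (training) sample, and uses a small epsilon to avoid taking the log of zero for empty bins; the bin count, the epsilon, and the function name are assumptions, and tools such as Evidently AI provide their own implementations.

```python
import math
from bisect import bisect_right


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a reference sample and a fresh one."""
    ref = sorted(expected)
    # Interior cut points placed at quantiles of the reference sample.
    edges = [ref[int(len(ref) * i / bins)] for i in range(1, bins)]

    def shares(values):
        counts = [0] * bins
        for v in values:
            counts[bisect_right(edges, v)] += 1
        # Floor each share at eps so empty bins do not break the logarithm.
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))
```

A value near zero means the feature looks like it did at training time; the 0.1 threshold from the list above is the point at which the feature deserves a closer look before the next retrain.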

#4. Odoo integration (the LATAM SMB pattern)

Typical architecture: a Python service outside Odoo reads leads via the XML-RPC API every hour, scores them, and writes the result back into expected_revenue or a custom ml_score field, plus a score_explanation field (text with the SHAP explanation). That gives the SDR a sortable column in the Kanban view and automatic assignment by threshold via Odoo automations (base_automation). For the full implementation pattern, see the Machine Learning for LATAM SMBs pillar.
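
A minimal sketch of that read-score-write loop, using Python's standard xmlrpc.client and Odoo's external API (authenticate on /xmlrpc/2/common, execute_kw on /xmlrpc/2/object). It assumes the custom fields ml_score and score_explanation already exist on crm.lead (fields created through the UI would carry an x_ prefix), that the listed standard fields are the ones your model needs, and that score_fn is your own scoring function; all of these are placeholders to adapt.

```python
import xmlrpc.client


def connect(url, db, user, password):
    """Authenticate against Odoo and return (uid, models proxy)."""
    common = xmlrpc.client.ServerProxy(f"{url}/xmlrpc/2/common")
    uid = common.authenticate(db, user, password, {})
    models = xmlrpc.client.ServerProxy(f"{url}/xmlrpc/2/object")
    return uid, models


def score_open_leads(models, db, uid, password, score_fn, limit=500):
    """Read active leads, score each one, and write the result back."""
    leads = models.execute_kw(
        db, uid, password, "crm.lead", "search_read",
        [[["type", "=", "lead"], ["active", "=", True]]],
        {"fields": ["id", "function", "email_from", "partner_name"], "limit": limit},
    )
    for lead in leads:
        # score_fn is a hypothetical callable returning (score, explanation text).
        score, explanation = score_fn(lead)
        models.execute_kw(
            db, uid, password, "crm.lead", "write",
            [[lead["id"]], {"ml_score": score, "score_explanation": explanation}],
        )
    return len(leads)
```

Passing the models proxy in as an argument keeps the loop testable with a fake object; only connect() needs a live Odoo instance. Writing one record per call is simple but chatty, so batching is worth considering at higher volumes.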

When it works — and when it doesn't

Half of the LATAM startups buy "AI lead scoring solutions" and burn three months because the basic conditions are not met. This is the most important section of the article.

#1. It works (green light)

You have ≥ 500 leads/month and ≥ 12 months of history. ML needs statistics. 100 leads/month × 12 months = 1,200 leads, of which ~50–80 closed_won. Not enough even for logistic regression. Rule-based: fine with 50 leads/month; ML: from 500/month.

Clear ICP and a single sales process. One ideal customer type (for example, Shopify e-commerce stores with USD 1–10M GMV in Mexico/Colombia) — the model learns one pattern. Three ICPs with different cycles — you need three models, not one.

Marketing and sales aligned on what "qualified" means. The hardest part. If the head of marketing thinks MQL = "downloaded a PDF" and the VP Sales thinks MQL = "company with ≥ USD 5M ARR" — the model learns noise. Lead scoring goes in after the marketing↔sales SLA, not before.

Clean CRM data. ≥ 80% of leads have the required fields filled in (company, role, source, industry). If 40% of industry is "N/A", the model has little to learn from, and gradient boosting will fit spurious patterns to the missingness itself.
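
The two measurable conditions above (volume and data completeness) are easy to check before any modelling starts. The sketch below is one way to do it, assuming leads are available as dictionaries; the thresholds come from this article (500 leads/month, 12 months of history, 100 closed deals, 80% fill rate), while the function names and the set of values treated as "empty" are assumptions.

```python
REQUIRED_FIELDS = ("company", "role", "source", "industry")
EMPTY_VALUES = (None, "", "N/A")


def required_fill_rate(leads, required=REQUIRED_FIELDS):
    """Share of leads that have every required field filled in."""
    complete = sum(
        all(lead.get(field) not in EMPTY_VALUES for field in required)
        for lead in leads
    )
    return complete / len(leads)


def ml_readiness_gaps(leads_per_month, months_of_history, closed_won, fill_rate):
    """Return the list of unmet conditions; an empty list means ML is worth a try."""
    checks = [
        (leads_per_month >= 500, "fewer than 500 leads/month"),
        (months_of_history >= 12, "less than 12 months of history"),
        (closed_won >= 100, "fewer than 100 closed_won deals"),
        (fill_rate >= 0.80, "required fields filled on under 80% of leads"),
    ]
    return [message for ok, message in checks if not ok]
```

For the example in the text (100 leads/month, 12 months, roughly 50–80 closed_won), the check reports two gaps, which is the signal to stay on rule-based scoring.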

#2. It does NOT work (red light — do not spend the money)

Before you sign with an "AI lead scoring" vendor: count your leads from the last month in the CRM. If you are under 100, the project will fail — no matter how good the data scientist is. All you will buy is technical debt and two lost months.

Startup with < 100 leads/month. Do not launch ML. Stick with rule-based, focus on pipeline velocity and manual conversion tracking. Revisit at 12–18 months.

Sales cycle > 18 months. If you sell enterprise deals at USD 250k+ on 2–3 year cycles, labels (closed_won) arrive too late. By the time the model is trained — the market has moved. Use MEDDIC/MEDDPICC manually.

Outbound-only with a different ICP per campaign. You switch target verticals every week — the model will not generalize. Run separate playbooks, not a shared scoring layer.

One product / one customer. Enterprise deals (1–2 per year) — statistically pointless. Score manually with custom research per account.

WhatsApp-only with no CRM logging. Classic LATAM problem: SDRs write from personal WhatsApp and nothing is logged. Lead scoring without telemetry is useless. Fix tracking first (WhatsApp Business API via Twilio / 360dialog / Gupshup + integration with Odoo/HubSpot), then score.

#3. Gray zone

  • SMB with 200–500 leads/month. Rule-based + simple logistic regression works. ROI on XGBoost does not justify the cost.
  • Marketplace / two-sided. Scoring works for one side (sellers or buyers), but you need a different model per side.
  • Freemium SaaS / PLG. The "lead" is a signup, not a contact form. What you score is usage behavior, not demographics. That is product-led growth scoring — a different methodology.

Five mistakes that kill lead scoring

#1. Not using time-decay on behavioral features

A lead who opened an email yesterday ≠ a lead who opened one three months ago. Most rule-based models count them the same. Six months in, you have "zombie leads" with 150 points who cooled off in Q1.

Fix: assign each behavioral event a half-life. Email open = 14 days. Pricing visit = 30 days. Demo request = 90 days. In Odoo this runs as a scheduled action that recalculates the score daily with exponential decay.
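
The decay itself is a single formula: points × 0.5^(age in days / half-life). A minimal sketch follows, using the half-lives from the fix above; the base points reuse the illustrative weights from the Level 1 list (treating a demo request like a /demo visit is an assumption), and the event names are hypothetical.

```python
from datetime import datetime

# Half-lives from the text; base points are illustrative assumptions.
HALF_LIFE_DAYS = {"email_open": 14, "pricing_visit": 30, "demo_request": 90}
BASE_POINTS = {"email_open": 5, "pricing_visit": 15, "demo_request": 25}


def decayed_points(event_type: str, event_time: datetime, now: datetime) -> float:
    """Points for one event, halved every HALF_LIFE_DAYS[event_type] days."""
    age_days = (now - event_time).total_seconds() / 86400
    return BASE_POINTS[event_type] * 0.5 ** (age_days / HALF_LIFE_DAYS[event_type])


def behavioral_score(events, now: datetime) -> float:
    """Sum decayed points over a list of (event_type, event_time) pairs."""
    return sum(decayed_points(kind, when, now) for kind, when in events)
```

Run daily, this makes a lead that was active in Q1 and silent since then drift back toward zero instead of sitting at the top of the list.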

#2. Too many features → overfitting

"More features is better" is a myth. With 200 closed_won and 80 features the model finds "patterns" in the noise. Train AUC = 0.95, on new leads — random.

Fix: rule of thumb — no more than N/10 features, where N = positive examples. 200 closed_won = max 20 features. L1 regularization (Lasso) for automatic feature selection. Cross-validation is mandatory before deploy, not just train/test split.

#3. Ignoring negative signals

Most companies score only the "value-adding" events (opened, downloaded, visited). That produces score inflation: every "active" lead has 80+ points, but 90% is garbage.

Fix: subtract explicitly for red flags. Personal email: −10. Job-seeker role: −30. Competitor: −1000 (exclude). IP outside ICP geo: −20. Bot pattern (10+ pages in 30 seconds): −50.

#4. Not recalibrating the model regularly

A model trained in January drifts 15–25% by December. The market shifts, the ICP evolves, new channels appear. Without retraining — the score decays.

Fix: cron on the 1st of each month — retrain on fresh data from the last 12 months, compare performance, deploy only if AUC did not get worse. Every version in the MLflow registry for rollback.
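
The decision logic of that monthly job is small enough to sketch in a few lines. This assumes both the production model and the retrained candidate have been evaluated on the same fresh holdout set; the 0.65 floor reuses the "do not deploy" threshold from the Level 2 section, and the function names are hypothetical.

```python
from datetime import date


def training_window(run_date: date, months: int = 12):
    """Return (start, end) covering the `months` calendar months before run_date."""
    year, month = run_date.year, run_date.month - months
    while month <= 0:
        month += 12
        year -= 1
    return date(year, month, 1), run_date


def promotion_decision(candidate_auc: float, production_auc: float,
                       floor: float = 0.65) -> str:
    """Promote the retrained model only if it is usable and not worse than production."""
    if candidate_auc < floor:
        return "reject"
    if candidate_auc < production_auc:
        return "keep_production"
    return "promote"
```

Whatever the outcome, logging both AUC values alongside the model version makes a later rollback a lookup rather than an investigation.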

#5. Handing scores to SDRs without context

The SDR gets "Lead X = 87". What does that mean? Why 87 and not 65? Without explanation, they either ignore the scoring (when their intuition disagrees) or follow it blindly (worse).

Fix: every score ships with an explanation via SHAP values: "87 points: +25 CTO role, +20 company 100–500, +15 visited /pricing 3× this week, +27 demographic match". The SDR sees the reasoning, validates it, and calibrates their intuition. In Odoo — a custom score_explanation field (HTML text); the ML service writes a rendered bullet list there.
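
How the explanation string gets built is mostly formatting. The sketch below assumes the per-feature contributions (in points) have already been computed elsewhere, for example from SHAP values or from logistic-regression coefficients; it only ranks and renders them, both as plain text and as an HTML bullet list. The function name and the "other" bucket are assumptions.

```python
from html import escape


def explain(contributions: dict, top_n: int = 4):
    """Render the largest contributions as (plain text, HTML bullet list)."""
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    shown = ranked[:top_n]
    if ranked[top_n:]:
        # Fold the remaining features into one line so the parts still sum to the total.
        shown.append(("other", sum(value for _, value in ranked[top_n:])))
    total = round(sum(contributions.values()))
    text = f"{total} points: " + ", ".join(f"{v:+.0f} {name}" for name, v in shown)
    items = "".join(f"<li>{v:+.0f} {escape(name)}</li>" for name, v in shown)
    return text, f"<ul>{items}</ul>"
```

Sorting by absolute value keeps large negative signals visible, which is often the part the SDR most needs to see.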

#6. Bonus LATAM mistake: ignoring WhatsApp engagement

If 70% of the sales process runs through WhatsApp but the scoring is built on email opens, you are blind to the primary signal. Integrate the WhatsApp Business API (Twilio, 360dialog, Gupshup, Wati) with the CRM, then log message velocity, response time, message length, and the questions/statements ratio as features. In LATAM B2B cases this typically adds 0.05–0.08 to AUC.
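
As an illustration of what those WhatsApp features could look like once the conversation is logged, here is a standard-library sketch. It assumes each message is a (timestamp, sender, text) tuple with sender "lead" or "sdr", sorted by time; counting a message as a question when it contains "?" and flooring the conversation span at one hour are simplifying assumptions, and all names are hypothetical.

```python
from datetime import datetime
from statistics import mean, median


def whatsapp_features(messages):
    """Derive simple engagement features from one logged conversation."""
    lead_msgs = [(ts, text) for ts, sender, text in messages if sender == "lead"]
    if not lead_msgs:
        return None
    # Time the lead took to answer each SDR message that got a direct reply.
    response_times = [
        (ts_b - ts_a).total_seconds()
        for (ts_a, who_a, _), (ts_b, who_b, _) in zip(messages, messages[1:])
        if who_a == "sdr" and who_b == "lead"
    ]
    span_hours = (lead_msgs[-1][0] - lead_msgs[0][0]).total_seconds() / 3600
    questions = sum("?" in text for _, text in lead_msgs)
    return {
        "first_message_length": len(lead_msgs[0][1]),
        "avg_message_length": mean(len(text) for _, text in lead_msgs),
        "messages_per_hour": len(lead_msgs) / max(span_hours, 1.0),
        "median_response_seconds": median(response_times) if response_times else None,
        "question_share": questions / len(lead_msgs),
    }
```

The question share is used here instead of a raw questions/statements ratio so the feature stays defined when a lead sends only questions.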

Case: B2B SaaS in Mexico — from 3.1% to 8.7% MQL→demo in 90 days

Context (anonymized under NDA). Mid-market B2B SaaS in Mexico, target: retail chains with 50+ stores. 5,200 inbound leads per month, 7 SDRs on the team, average deal size USD 24k ARR, 90-day sales cycle.

Problem. MQL→demo conversion had been stuck at 3.1% for a year and a half. The SDRs complained: 70% of assigned leads never responded or said "no estoy interesado". Marketing thought their work was fine (cost-per-MQL USD 12); sales said the leads were garbage. The classic marketing↔sales conflict with no numerical basis.

What we did in 90 days:

Days 1–14 — diagnosis. We pulled 47,000 leads from the last 12 months out of HubSpot and Salesforce, joined them on the email key. We analyzed the distribution of closed_won by feature. Finding: 80% of real deals came from companies with 50–250 retail stores, running Shopify or VTEX, with role "Director de Operaciones" or "CTO". The existing scoring did not surface them — weights were inflated toward "email opens" (an easily-gamed metric).

Days 15–45 — rule-based v2. We rewrote the rules in HubSpot:

  • High weight: ≥ 50 retail stores, retail/e-commerce, Director+ role, uses Shopify/VTEX.
  • Negative: agencies, freelancers, gmail email, IP outside Mexico.
  • Behavioral: /pricing visit + /demo-request + ≥ 2 emails opened = +50 bonus.

By day 30: MQL→demo from 3.1% to 5.4%. The project had already paid for itself.

Days 46–90 — ML model (logistic regression). We built a predictive model on 18 features (demographic + firmographic + behavioral + WhatsApp velocity). Test AUC = 0.79, precision@top-decile = 4.2×. We replaced the rule-based score with the model, keeping rule-based as a fallback for cold-start (leads with under 7 days of history).

Results on day 91:

| Metric | Before | After | Delta |
|---|---|---|---|
| MQL → demo | 3.1% | 8.7% | +180% |
| SDR utilization on quality leads | 32% | 58% | +26 pts |
| Demo → closed_won | 18% | 23% | +5 pts |
| ARR per SDR / quarter | baseline | +47% | +47% |

Project cost (all-in): USD 38k
Payback: 6 weeks

"The SDRs stopped fighting with marketing. Nobody argues about the morning lead list anymore — the list is the list, sorted by close probability."

What did NOT work. An attempt to add WhatsApp message-content features with a mini NLP classifier overfit on the small dataset (6 months of WhatsApp logging). We rolled it back and deferred until we had 12+ months of data. Plan: Q2 2026.

Closing: engineering, not magic

Lead scoring is not an "AI feature" you buy and forget. It is a process: marketing↔sales SLA, clean CRM telemetry, an iterative model, drift monitoring. Drop any one of the four pillars — money down the drain.

In LATAM the gap is especially wide: ~60% of B2B companies still run a 2019-era Excel "scoring", ~25% bought HubSpot Predictive and never configured it, ~10% built ML but do not monitor it. Only ~5% do it right. If you are in the first three groups, you have a 3–6 month head start on competitors who are also just waking up. Take it.

Want to diagnose your own funnel? Book a 30-minute audit — we walk through your Odoo/HubSpot/Salesforce together and tell you straight which maturity level you are on and whether ML is worth it today. No sales pitch.

FAQ

How much does lead scoring cost in LATAM?

It depends on the level. Rule-based in Odoo/HubSpot: USD 3,000–8,000 (2–4 weeks; audit + setup + team training). Logistic regression: USD 15,000–35,000 (6–10 weeks + integration). Production XGBoost with MLOps: USD 40,000–120,000 (10–16 weeks + 3 months of support).

SMBs with 200–500 leads/month usually start at level 1; mid-market with 1,000+ leads — straight to level 2.

What ROI should I actually expect?

Realistic range: MQL→demo rises 50–180% in the first 90 days after level 2. SDR utilization +20–35%. Payback in LATAM B2B cases: 6–12 weeks.

No guarantees. If the ICP is fuzzy or data is thin, ROI drops. That is why the "when it does NOT work" filter is the most important part of the project.

Lead scoring vs lead nurturing — what is the difference?

Scoring is ranking your existing leads. Nurturing is the email/WhatsApp cadence for the ones not ready yet. They work together: scoring decides who goes to demo now, who goes to nurture, and who gets dropped.

Without scoring, nurturing spams everyone equally and burns the list.

Can I use ChatGPT or Claude for lead scoring?

LLM-based scoring is a niche method. It is good for qualitative data (analyzing the first WhatsApp message, assessing fit from a company description). It is bad at the classic ranking task: expensive, slow, not interpretable, unstable.

Use LLMs as a feature extractor for a classical ML model, not as a replacement for the model.

What do I do with leads that got a low score?

Do not delete them. Three options: (1) automatic 90-day nurture flow; (2) re-scoring every 30 days (behavior changes); (3) a final sweep with a personalized offer before archiving on day 90.

5–8% of the "cold" leads revive in 60–180 days, especially in LATAM where the decision cycle tends to be longer than in the US.

How does this integrate with Odoo in practice?

Three patterns: (1) Odoo Enterprise's native predictive_lead_scoring module — simple, closed box; (2) external Python service via XML-RPC writing into a custom field — flexible, requires dev; (3) sklearn model inside Postgres via PL/Python — slow, zero infra.

For LATAM SMBs we recommend option 2 — the sweet spot between flexibility and maintenance cost.

Where do I get training data if the CRM is empty?

If you do not have 12 months of history — do not build ML. Spend the first 3–6 months on rule-based plus disciplined logging of every touchpoint. In parallel, buy intent data (ZoomInfo, Apollo, Clearbit, Cognism) for enrichment to bootstrap robust demographic features.

ML comes at month 12. Skipping this step is the #1 cause of failed projects.

What stack do the data-metrics.pro teams actually use?

Typical stack: Python 3.11 + scikit-learn + XGBoost for models; PostgreSQL as the source of truth; FastAPI for serving; MLflow for tracking; Evidently AI for drift; Grafana for business dashboards.

For the CRM side: Odoo Enterprise or HubSpot Pro, depending on what the client already has. We do not impose a stack — we adapt to the existing one.