Web4Guru AI Operations
Blog · How-to · 9 min read

How to scrape leads legally in 2026

The lines between legal, gray, and actionable. A playbook for building lead lists in 2026 without getting sued or banned.

Not legal advice. Consult counsel for your specific situation.

TL;DR

License from Apollo/Clearbit first. If you must scrape, stick to public pages with clean TOS, respect robots.txt, rate-limit, collect only business contact info, log provenance, and honor opt-outs fast (30 days at the outside).

What you'll learn

  • The legal tests for "public data" in the US, EU, and Canada
  • Which data sources are clean, gray, and forbidden
  • A compliance-clean scraping stack with rate limits and provenance
  • How to handle GDPR/CCPA opt-out and deletion requests

What you need

  • A data broker subscription (Apollo, Clearbit, or Lusha)
  • A Notion or Postgres table for provenance logging
  • A suppression list process
  • A privacy policy on your site (skip at your peril)

Step 1: Understand what "public" actually means

US case law (hiQ v. LinkedIn, 9th Cir. 2022) holds that scraping publicly accessible data likely does not violate the CFAA. For this playbook, "public" means three things: no login required, no TOS click-wrap that explicitly forbids automated access, and no personal/sensitive data. Stay on the right side of all three tests. When in doubt: if it requires a login or accepting a TOS click-wrap, treat it as private.

Step 2: Pick sources with clean TOS

Allowed with care: Google Maps, Yelp, Crunchbase public pages, G2 reviews, company About pages, published directories. Gray area: LinkedIn (TOS forbids scraping; risk of account ban, not jail). Avoid: anything behind a login, Facebook Groups, Instagram DMs. Read the TOS of any source before scraping. A "we reserve all rights" sentence isn't a prohibition; an explicit "no automated access" is.
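
One way to keep this decision out of ad-hoc judgment calls is to encode it as a source policy your crawler checks before doing anything else. A minimal sketch; the domain lists are illustrative examples from this post, not a legal ruling:

```python
# Source policy gate -- classify a domain before any crawl runs.
# The lists below are illustrative examples, not legal advice.
ALLOWED = {"yelp.com", "crunchbase.com", "g2.com"}    # clean public pages
GRAY = {"linkedin.com"}                               # TOS forbids scraping
FORBIDDEN = {"facebook.com", "instagram.com"}         # login-gated

def source_policy(domain: str) -> str:
    domain = domain.lower().removeprefix("www.")
    if domain in FORBIDDEN:
        return "forbidden"
    if domain in GRAY:
        return "gray"
    if domain in ALLOWED:
        return "allowed"
    return "review"  # default: a human reads the TOS first

assert source_policy("www.linkedin.com") == "gray"
assert source_policy("newdirectory.example") == "review"
```

Defaulting unknown domains to "review" means every new source gets a TOS read before the first request goes out.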

Step 3: Use an enrichment API instead of raw scraping

Apollo.io, Clearbit, and Lusha have negotiated data licenses; Apollo's Basic tier starts at $49/mo. Licensed data is legally cleaner than DIY scraping. Use these as your primary source and only scrape what they don't cover; 95% of the time, you don't need to scrape at all.
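
In code, "enrichment-first" just means you query the licensed API and only fall back to scraping for the misses. A sketch; ENRICH_URL, the payload shape, and ENRICH_API_KEY are hypothetical stand-ins, so consult your provider's (Apollo, Clearbit, Lusha) actual API docs for real endpoints, auth, and field names:

```python
# Enrichment-first sourcing. The endpoint and payload below are
# hypothetical -- check your provider's real API documentation.
import os
import requests

ENRICH_URL = "https://api.enricher.example/v1/enrich"  # hypothetical

def enrich_company(domain: str) -> dict | None:
    resp = requests.post(
        ENRICH_URL,
        json={"domain": domain},
        headers={"Authorization": f"Bearer {os.environ['ENRICH_API_KEY']}"},
        timeout=30,
    )
    return resp.json() if resp.ok else None

lead = enrich_company("acme.com")
if lead is None:
    pass  # only now consider careful, compliant scraping for the gap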

Step 4: When you do scrape, respect robots.txt

Before writing any crawler, fetch the site's robots.txt (e.g. example.com/robots.txt). If Disallow: / is set for your user agent, pick a different source. Ignoring it is evidence of bad faith in a dispute; courts weigh robots.txt as a signal of site-owner intent.
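
A programmatic check keeps this from being forgotten. A minimal sketch using Python's standard-library robotparser; "LeadBot/1.0" is a made-up user agent, so use your real, honest one:

```python
# Check robots.txt before crawling, using only the standard library.
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

USER_AGENT = "LeadBot/1.0"  # placeholder -- identify yourself honestly

def can_fetch(url: str) -> bool:
    parts = urlparse(url)
    rp = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()  # downloads and parses robots.txt
    return rp.can_fetch(USER_AGENT, url)

if not can_fetch("https://example.com/directory/page-1"):
    print("Disallowed by robots.txt: pick a different source")
```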

Step 5: Rate-limit requests

Max 1 request per 2 seconds, under 500 requests/hour to any one domain. Use rotating residential proxies (Bright Data from $15/GB, Smartproxy from $8.50/GB) to avoid IP bans — not to evade detection. Load-testing someone else's site with 1,000 req/sec is a CFAA issue, not a TOS issue.
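
Here's one way to enforce both limits. A sketch for a single target domain; in real use, keep one instance per domain so the accounting stays honest:

```python
# Polite fetcher: at most 1 request per 2 seconds, 500 requests/hour.
import time
import requests

MIN_INTERVAL = 2.0   # seconds between requests (1 req / 2 s)
HOURLY_CAP = 500     # max requests per domain per hour

class PoliteFetcher:
    def __init__(self) -> None:
        self.last_request = 0.0
        self.window_start = time.time()
        self.sent = 0

    def get(self, url: str) -> requests.Response:
        now = time.time()
        if now - self.window_start >= 3600:  # new hourly window
            self.window_start, self.sent = now, 0
        if self.sent >= HOURLY_CAP:
            raise RuntimeError("Hourly cap hit for this domain; stop.")
        wait = MIN_INTERVAL - (now - self.last_request)
        if wait > 0:
            time.sleep(wait)  # keep 2 seconds between requests
        self.last_request = time.time()
        self.sent += 1
        return requests.get(url, headers={"User-Agent": "LeadBot/1.0"},
                            timeout=30)
```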

Step 6: Only collect business contact info

Business email (firstname@company.com), phone (published office line), title, company name. Do NOT scrape personal emails (gmail, yahoo), personal phones, or home addresses. The legal exposure on personal data is dramatically higher, and personal data is what draws regulator attention.
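
A simple domain filter catches most personal addresses before they ever enter your database. A sketch; the freemail list is a starting point, not exhaustive:

```python
# Keep only business emails; drop freemail addresses before storage.
FREEMAIL_DOMAINS = {
    "gmail.com", "yahoo.com", "hotmail.com", "outlook.com",
    "aol.com", "icloud.com", "proton.me",
}

def is_business_email(email: str) -> bool:
    domain = email.rsplit("@", 1)[-1].strip().lower()
    return domain not in FREEMAIL_DOMAINS

leads = [{"email": "jane@acme.com"}, {"email": "john@gmail.com"}]
business_leads = [l for l in leads if is_business_email(l["email"])]
assert business_leads == [{"email": "jane@acme.com"}]
```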

Step 7: Store the source + timestamp

For every lead row, log: source URL, scrape date, field-by-field provenance. GDPR and CCPA both require this. A Notion database or a Postgres table with columns (field, value, source, collected_at) is sufficient. Provenance is the single most useful thing you can log. Makes regulator responses a 10-minute job.
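
The table really can be that simple. A runnable sketch using sqlite3 from the standard library; the same schema works in Postgres via psycopg:

```python
# Field-level provenance log: who, what, where from, and when.
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect("leads.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS provenance (
        lead_id      TEXT,
        field        TEXT,
        value        TEXT,
        source_url   TEXT,
        collected_at TEXT
    )
""")

def log_provenance(lead_id: str, field: str,
                   value: str, source_url: str) -> None:
    conn.execute(
        "INSERT INTO provenance VALUES (?, ?, ?, ?, ?)",
        (lead_id, field, value, source_url,
         datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()

log_provenance("lead-0001", "email", "jane@acme.com",
               "https://example.com/about")
```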

Step 8: Honor opt-out requests in 30 days

When anyone replies "remove me", the clock starts: CAN-SPAM gives you 10 business days to stop emailing, GDPR gives one month to act on an erasure request, and CCPA deadlines run from 15 business days (opt-out) to 45 days (deletion). Treat 30 days as your outer bound and act sooner. Maintain a suppression list and check it before every send. Keep a hashed email suppression list so the unsubscribed person never touches your send system again.
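
Hashing normalizes the address and keeps the raw email out of your send pipeline. A minimal in-memory sketch; persist the set in practice:

```python
# Hashed suppression list: store SHA-256 of normalized emails.
import hashlib

def email_hash(email: str) -> str:
    return hashlib.sha256(email.strip().lower().encode()).hexdigest()

suppressed: set[str] = set()

def suppress(email: str) -> None:
    suppressed.add(email_hash(email))

def can_send(email: str) -> bool:
    return email_hash(email) not in suppressed

suppress("Jane@Acme.com ")
assert not can_send("jane@acme.com")  # normalization catches case/whitespace
```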

Concrete example: a compliant 5K-lead list

An agency we work with built a 5,000-contact list of ecom founders in 10 days: 3,400 from Apollo ($49), 1,200 from Crunchbase public pages (rate-limited at 1 req/2s via Apify at $0.10/1K), 400 from conference attendee pages (public). Every row tagged with source + date. Suppression list checked at send time. Zero regulator inquiries in 18 months; reply rate 6.8%.

Common pitfalls + how to avoid them

  • Scraping logged-in pages. LinkedIn full profiles, Facebook groups — huge risk. Stick to public.
  • Collecting personal emails. gmail/yahoo = personal data. Regulator exposure.
  • No provenance. "I don't remember where this came from" is the worst answer to a regulator.
  • Ignoring opt-outs. Every ignored unsubscribe is a demonstrable violation. Build the suppression list first.

Key takeaways

  • License-first, scrape-second. Apollo et al. save you legal work.
  • Public, no-login, TOS-clean, robots.txt-compliant. Four gates.
  • Business contact info only. No personal data.
  • Log provenance. Row-level. Field-level if you can.
  • Honor opt-outs fast. Suppression list is non-negotiable.

FAQ

Can I scrape LinkedIn?

Legally in the US, per hiQ: scraping public profile data without logging in is likely not a CFAA violation. But LinkedIn's TOS forbids it, and they will ban accounts and IPs. Practical answer: use an Apollo-style data broker that has already paid for the data license.

Is scraping Google Maps legal?

Scraping public search results: gray. Using the official Google Places API ($17 per 1K requests): clearly legal. The API is cheap enough that DIY Maps scraping isn't worth the risk.

Does GDPR apply to B2B contacts?

Yes. EU personal data includes work emails (firstname@company.com is personal data in GDPR's definition). You need a lawful basis — "legitimate interest" is defensible for B2B, but you must document the assessment and honor opt-outs.

What's the safest approach?

License from Apollo/Clearbit/Lusha. They've done the legal work. You get data + provenance + a defensible story if a regulator asks.

Black Box does this automatically

The Research specialist licenses from Apollo, respects TOS, logs provenance, and checks suppression lists before every send. Compliance baked in.

Web4Guru is the team behind Black Box. We build AI companies for solo operators and small teams. Published April 23, 2026.