Home · Guides · April 26, 2026 · 12 min read

AI agency evaluation checklist

Twenty-five questions to ask before you sign an AI agency contract. Each one with the reason it matters and what a strong answer looks like.

By the Web4Guru AI Operations Team · Last updated April 26, 2026

The fastest way to lose money on an AI engagement is to skip diligence. Most buyers ask 4-6 questions on a sales call, sign the SOW, and discover the issues in week 8. This checklist exists so you do not become one of them. Send it to every agency on your shortlist. Score each answer 1-5. Compare. The patterns will tell you everything.

We use this checklist on ourselves with new prospective clients, and we welcome you using it on us. If we cannot answer something, we will say so. That is the signal.

Section 1: Team and capability (questions 1-6)

1. Who specifically will work on our account, by name?

Why it matters: The most common bait-and-switch is a founder pitch followed by a junior delivery team. You are buying the people, not the logo.

Strong answer: Names, titles, percentage of their time on your account, and examples of work each has shipped.

2. What is the senior-to-junior ratio on this account?

Why it matters: AI work fails on architectural mistakes more than on execution. You want senior judgment on the design phase.

Strong answer: At least one senior architect actively engaged through build, not just on the sales call.

3. How long has the agency been doing AI work specifically?

Why it matters: Many agencies pivoted from traditional dev work in 2024-25. That is not disqualifying, but it changes the depth available.

Strong answer: A specific date, a specific team, and a coherent story about what was built before and what changed.

4. How many engagements have you completed in our vertical?

Why it matters: Vertical-specific knowledge compresses the discovery phase. A first-of-vertical engagement is risky for both sides.

Strong answer: A specific number, with willingness to introduce you to one or two reference clients.

5. What models, frameworks, and tools do you build on?

Why it matters: Tells you the agency's center of gravity. A shop that only knows GPT-3.5 wrappers is different from one that builds production agentic systems.

Strong answer: Specific stack with rationale for each choice and a clear opinion on what they avoid and why.

6. Show us a non-public artifact of your work.

Why it matters: Marketing case studies prove copywriting, not engineering. A live demo, a sanitized prompt library, or a shared dashboard reveals more.

Strong answer: They have something to show beyond the website.

Section 2: Process and methodology (questions 7-12)

7. Walk us through your end-to-end process.

Why it matters: Mature agencies have a documented process. Improvised agencies make it up per engagement.

Strong answer: Named phases, deliverables per phase, gating decisions between phases. See our own methodology page for an example.

8. Walk us through a comparable engagement, including what failed.

Why it matters: The single most diagnostic question on this list. Real agencies have failures and have learned from them. Agencies that have never failed have either never shipped or never told the truth about it.

Strong answer: A specific story, ideally recent, with a clear lesson and a concrete change in process or stack.

9. How do you handle scope creep?

Why it matters: The single largest source of cost overruns and relationship damage. The agency needs a clear answer.

Strong answer: Written change-order process, transparent rates, willingness to push back on bad scope additions.

10. What is your incident-response process?

Why it matters: AI systems break differently than traditional software. You need to know who picks up the phone at 3 a.m.

Strong answer: Defined SLA, escalation paths by severity, postmortem cadence.

11. How do you measure the system after launch?

Why it matters: Without measurement, you cannot tell if it is working, drifting, or quietly failing.

Strong answer: A dashboard with named KPIs, an evaluator that catches model drift, regular reviews.

12. How do you handle model deprecation?

Why it matters: OpenAI and Anthropic deprecate models on a regular cadence. Migration is real work and someone has to pay for it.

Strong answer: Either bundled into the retainer with notice, or a published rate sheet for migration work. Vague answers mean you will pay surprise invoices.

Section 3: Commercial and contractual (questions 13-18)

13. Who owns the IP at the end of the engagement?

Why it matters: The prompts, the code, the dashboards. Default agency contracts often retain ownership; you may want to negotiate.

Strong answer: Clear and specific by artifact type, with willingness to negotiate.

14. What is the exit clause?

Why it matters: You will eventually want to take this in-house, switch vendors, or sunset the system. The exit terms should not punish that.

Strong answer: Notice period (typically 30-60 days), handoff process, transition support pricing.

15. What external API spend will land on our card?

Why it matters: Often 20-60% of total cost. Surprises here ruin the ROI.

Strong answer: A line-item estimate with assumptions stated, monthly monitoring, a heads-up before crossing budget.

16. What is your billing model and payment terms?

Why it matters: Hourly billing for AI work usually means the agency has not productized. Net-30 vs. net-15 affects your cash flow.

Strong answer: Productized fee or retainer with clear scope, reasonable payment terms, no hidden surcharges.

17. Who pays if the model gets it wrong and causes harm?

Why it matters: Liability allocation matters. Agencies that take none of the risk are arbitrating.

Strong answer: Agency carries professional liability insurance, named limits in contract, willingness to discuss how harm is allocated by failure mode.

18. Will you sign our DPA, BAA, or vendor agreement?

Why it matters: Refusing to sign client paper signals immaturity or risk aversion.

Strong answer: Yes, with reasonable redline negotiation. Mature agencies have done this many times.

Section 4: Security and compliance (questions 19-22)

19. What is your data-handling and security posture?

Why it matters: Your data will pass through their systems. The exposure is real.

Strong answer: SOC 2 (or pursuing it), ISO 27001 (or aware of why not), documented incident response, encrypted storage and transit.

20. How do you handle PII and sensitive data?

Why it matters: AI agents can leak in subtle ways. Prompt injection, data exfiltration via tool calls, log exposure.

Strong answer: Redaction, scoped access tokens, audit logs, no-training guarantees from underlying model providers.

21. Do you use enterprise model endpoints with no-training guarantees?

Why it matters: Free or consumer tier API access can train on your data. Enterprise endpoints do not.

Strong answer: Yes, with documentation. Specific endpoint names.

22. How do you handle prompt injection and adversarial inputs?

Why it matters: Real attack vector against AI systems. Naive agencies do not have defenses.

Strong answer: Input validation, evaluator gates, output filtering, principle of least privilege on tool access.

Section 5: Outcomes and references (questions 23-25)

23. Can we speak to two reference clients in our vertical?

Why it matters: Marketing claims do not survive a 30-minute call with a peer.

Strong answer: Yes, with intros within a week. Reluctance is a major flag.

24. Have you ever talked a prospective client OUT of a project?

Why it matters: Sales-shop agencies cannot name an example. Real consultancies can.

Strong answer: Specific story with reasoning. Bonus points if they tell you when you should walk on this engagement.

25. What is the most expensive mistake you have made on a client account in the past 24 months, and what changed because of it?

Why it matters: The closing-question version of question 8. Tests honesty one more time. The lesson learned should map to a process or technical change you can verify.

Strong answer: Specific, recent, with a documented change in approach.

Scoring guide

1 — Did not answer or answered evasively
2 — Generic, marketing-language answer with no specifics
3 — Specific answer that is reasonable but not exceptional
4 — Specific, demonstrable, evidence-backed answer
5 — Exceptional, including nuance and willingness to concede limitations

A genuinely capable agency will score 80+ across the 25 questions (3.2 average). Anything under 70 is a concern. Perfect 125 should make you suspicious — answer the same question to a different agency rep and see if the answer holds.

Use it on us

We mean this literally. Send us this checklist on a sales call and we will work through every question on camera. If we cannot answer something, we will say so. The checklist works on us as well as it works on anyone else, and that is the standard buyers should hold every agency to.

Frequently asked questions

How do I use this checklist?

Send it to two or three agencies you are considering. Score each answer 1-5. Compare. The patterns matter more than the individual answers — agencies that dodge the same questions reveal a similar weakness.

How many questions should an agency answer well?

A genuinely capable agency will answer 22 of 25 well. Be skeptical of perfect scores; perfection on a vendor questionnaire usually means a salesperson, not an engineer, completed it.

What if an agency refuses to answer some questions?

Specific questions about IP, exit clauses, and failure history are reasonable to push on. Refusal to answer security or pricing questions is a hard signal to walk.

Should I share this checklist with the agency upfront?

Yes. Good agencies welcome it. Lazy agencies push back. The negotiation behavior is a useful signal.

What is the single most important question on this list?

Question 8 — the worked example of a comparable engagement, including what failed. An agency that cannot name a failure has either no real experience or no honesty.

How long should the evaluation process take?

Two to four weeks for a serious engagement. Two to four days for a productized one. Faster than that and you are not doing diligence; slower and you are stalling.

Can I use this checklist for AI consultancies, not just agencies?

Yes. The questions about delivery and operations apply less, but everything around team, IP, security, and exit applies cleanly.

What is the most common red flag missed?

The agency cannot name the actual people who will work on your account. Brand-name pitches by founders followed by junior delivery teams remains the most common bait-and-switch.

Should the agency have case studies on their website?

Helpful but not sufficient. Case studies are marketing. The worked example of a failure (question 8) is more diagnostic than three glossy success stories.

What if I am evaluating Web4Guru with this checklist?

Use it. We will answer everything. If we cannot, we will say so plainly. The checklist works on us as well as it works on anyone.

Ready to use this on us?

Book the call. Bring the checklist. We will work through every question.

Talk to us See pricing