Configure Your Package

Session focus, e.g. "Focus on applying NNT to antibiotic prescribing decisions in urgent care"
Cognitive level(s) this session targets:

  • Remember: Recall EBM terminology & study designs
  • Understand: Explain evidence hierarchies & bias types
  • Apply: Use EBM frameworks in clinical scenarios
  • Analyze: Critically appraise AI outputs & literature
  • Evaluate: Judge evidence quality & applicability
  • Create: Design EBM workflows & teaching activities
Competency domains:

  • Medical Knowledge
  • Interpersonal & Communication Skills
  • Patient Care
  • Professionalism
  • Practice-Based Learning & Improvement
  • Society & Community
  • Health Literacy & Education

A3.01–A3.19 instructional objective categories:

  • Patient Care
  • Medical Knowledge
  • Interprofessional Collaborative Practice
  • Professionalism
  • Professional & Legal Aspects
  • Healthcare Finance & Systems
  • Clinical & Technical Skills
  • Clinical Reasoning & Problem-Solving


Configure the options on the left and click Generate. Each package includes a complete session plan, learner activities, AI-integrated exercises, critical appraisal questions, and faculty notes.

EBM Course Builder — Premium Feature

Subscribe to generate your own packages. Below is a complete sample so you can see exactly what you get.

Challenge Mode · AI Evidence Challenge

Critical Appraisal of AI-Generated Diagnostic Evidence: D-dimer in PE Diagnosis

PA Students, Didactic Phase · Small Group · Bias & Diagnostic Accuracy

Learning Objectives:

  1. Appraise an AI-generated clinical evidence summary for logical consistency, source fidelity, and appropriate uncertainty
  2. Identify at least 4 planted weaknesses in an AI-generated recommendation, including unsupported conclusions and citation misrepresentation
  3. Evaluate the applicability of diagnostic accuracy evidence to specific patient populations and clinical contexts
  4. Construct a corrected, evidence-based recommendation that addresses the identified flaws
Intentionally Flawed — For Critical Appraisal Exercise

AI-Generated Clinical Recommendation:

"D-dimer has excellent diagnostic accuracy for ruling out pulmonary embolism in all adult patients presenting to the emergency department. A negative D-dimer result (below 0.5 mg/L) effectively excludes PE with a sensitivity of 98%, making anticoagulation unnecessary when the test is negative. Multiple RCTs have demonstrated that D-dimer-guided management is safe across all patient populations, including elderly patients and pregnant women. The Wells Score adds minimal clinical utility when D-dimer testing is readily available. Based on systematic review evidence, clinicians should use D-dimer as a first-line standalone test without pre-test probability assessment."

Planted Weaknesses (Summary):

  • Claims "all adult patients" — ignores high pretest probability patients where D-dimer is not appropriate
  • "Multiple RCTs" — D-dimer studies are predominantly cohort designs, not RCTs
  • Age-adjusted cutoffs (patient age × 10 µg/L in patients >50) omitted entirely
  • Pregnant patients require different reference ranges — overgeneralization is dangerous
  • Wells Score dismissed without evidence — contradicts major clinical guidelines (ACEP, AHA)
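The age-adjusted cutoff mentioned above is simple arithmetic, and faculty may find it useful to show learners the calculation explicitly. A minimal sketch (units in µg/L FEU; assumes the conventional 500 µg/L, i.e. 0.5 mg/L, threshold applies at or below age 50):

```python
def d_dimer_cutoff_ug_l(age_years: int) -> int:
    """Age-adjusted D-dimer cutoff in µg/L (FEU): age x 10 for patients
    over 50, otherwise the conventional 500 µg/L (0.5 mg/L) threshold."""
    return age_years * 10 if age_years > 50 else 500

# A 78-year-old with a D-dimer of 700 µg/L is "positive" at the
# conventional cutoff (500) but below the age-adjusted cutoff (780).
assert d_dimer_cutoff_ug_l(78) == 780
assert d_dimer_cutoff_ug_l(40) == 500
```

This single line of logic is exactly what the flawed summary omits, and it makes a concrete discussion point for why the fixed 0.5 mg/L threshold over-calls older patients.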
Critical Appraisal Questions:

  1. Can you trace each cited claim to a specific, retrievable source? What happens when you check?
  2. Does the recommendation apply to the patient populations described, or does it overgeneralize?
  3. Were the study designs cited appropriate for the claims made (e.g., RCT vs. cohort for diagnostic studies)?
  4. What subgroups or exceptions were omitted that would change clinical management?
  5. Does the AI output acknowledge uncertainty or present findings with inappropriate confidence?
  6. Would following this recommendation harm any specific patient populations?
  7. What additional evidence or context would you need before acting on this recommendation?
Debrief Questions:

  1. What specific language in the AI summary made it sound credible, even though the reasoning was flawed?
  2. How would you explain to a colleague why this AI recommendation could lead to patient harm?
  3. At what point in your clinical workflow is AI evidence retrieval most — and least — trustworthy?
  4. Design a "verification habit" you would use in practice when AI presents diagnostic evidence.

Planted Weaknesses — Instructor Guide:

  • Unsupported generalization: "All adult patients" — D-dimer should only be used in low-to-moderate pretest probability per Wells Criteria. Students should cite Wells PE Score evidence.
  • Study design misrepresentation: No RCTs exist for D-dimer as a standalone PE rule-out strategy. The Christopher Study (NEJM 2006) was a management cohort study, not an RCT.
  • Omitted subgroups: Age-adjusted cutoffs (Righini 2014, JAMA) increase specificity by ~5% in patients >50 — omission can lead to unnecessary downstream imaging and workup in older patients.
  • Polished wording hiding shallow reasoning: "Excellent diagnostic accuracy" without sensitivity/specificity breakdown is a signal of AI output quality failure.
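To drive home why "sensitivity of 98%" cannot justify standalone use without a pretest probability assessment, instructors can walk learners through the likelihood-ratio arithmetic. A minimal sketch using the 98% sensitivity quoted in the flawed summary and an assumed, illustrative specificity of 40% (not a figure from this package):

```python
def post_test_prob_negative(pretest: float, sens: float, spec: float) -> float:
    """Post-test probability of disease after a NEGATIVE test result,
    using the negative likelihood ratio LR- = (1 - sens) / spec."""
    lr_neg = (1 - sens) / spec
    pre_odds = pretest / (1 - pretest)
    post_odds = pre_odds * lr_neg
    return post_odds / (1 + post_odds)

# Illustrative values only: sensitivity 0.98 (as quoted in the flawed
# summary) and an assumed specificity of 0.40.
low = post_test_prob_negative(0.05, 0.98, 0.40)   # low pretest probability
high = post_test_prob_negative(0.60, 0.98, 0.40)  # high pretest probability
print(f"low pretest:  {low:.1%}")    # ~0.3% residual risk
print(f"high pretest: {high:.1%}")   # ~7% residual risk
```

The same negative test leaves roughly a 7% residual probability of PE in a high pretest probability patient — far too high to safely withhold further workup — which is precisely why the Wells Score cannot be dismissed.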

This is a sample package. Subscribe to generate your own for any EBM topic, learner level, and teaching format.

Get Full Access — $19/mo

Includes all premium tools: EBM Builder, OSCE Builder, Assignment Builder, and more.