
What LLMs Get Wrong About Documentation

  • Writer: Dhini Nasution
  • Dec 7, 2025
  • 7 min read

Updated: Dec 8, 2025

(and What Providers Should Demand Instead) 


Large language models (LLMs) are suddenly everywhere in healthcare. 

Vendors promise auto-generated notes, “one-click” HCC documentation, and “AI scribes” that quietly listen in the background. Early studies and pilots show that AI scribes can reduce time spent on documentation and improve perceived efficiency and satisfaction, but they also raise questions about accuracy, safety, and over-reliance. PMC.NCBI.NIH


For clinicians exhausted by clicks and templates, this all sounds tempting. 

But there’s a hard truth: LLMs are very good at writing — and very bad at owning clinical responsibility. 

If you treat them like a smarter dictation tool, you’re mostly fine. If you treat them like a clinical brain, you’re in trouble. 

This article breaks down what LLMs systematically get wrong about documentation, why that matters in value-based care, and what a safe provider-facing AI stack actually needs.


1. LLMs don’t “understand” the record — they predict text 


An LLM doesn’t see a patient; it sees tokens

Given a prompt, it produces the most statistically likely continuation based on its training data. That’s why it can write fluent, human-sounding notes. But: 

  • It doesn’t truly “know” whether a patient has CKD stage 3b or 4. 

  • It doesn’t inherently care about temporal consistency (diagnosis dates, chronic vs resolved, annual recapture). 

  • It can hallucinate diagnoses, medications, or prior history that were never documented. 

Recent work on LLMs in medicine and note generation has shown that even strong models can generate false but plausible clinical details (“hallucinations”), including unsupported diagnoses or findings, which pose real safety risks when outputs are not tightly constrained. JAMA Network 

For narrative tasks (like drafting patient letters), that risk is manageable. For problem lists and HCC documentation, it’s unacceptable if used naively.  


 

2. “Pretty summaries” are not clinically faithful notes 

LLMs are fantastic at turning a messy chart or transcript into a clean paragraph. But clinical documentation is not just “the story”; it’s the data backbone for: 

  • Care continuity 

  • Quality measurement 

  • Risk adjustment (HCC / RAF) 

  • Legal and regulatory audit trails 

Recent evaluations of AI-generated clinical notes emphasize that high-quality drafts still require careful physician review, and that field-specific requirements (precision, attribution, temporal accuracy) are easy to violate when models are optimized mainly for fluent language. Frontiers 

Typical failure modes when you ask an LLM to “write the note” from raw text or transcripts include: 


a. Loss of critical qualifiers 

  • “CKD” vs “CKD 3b due to diabetic nephropathy” 

  • “Depression” vs “Recurrent major depressive disorder, in partial remission” 

  • “History of” vs active conditions 

LLMs tend to simplify, because their objective is fluent text, not coding precision. That simplification can decrease specificity, which hurts both care quality and revenue integrity in risk-based contracts. 


b. Confusing past and present 


Models can easily: 

  • Pull in old diagnoses as if they are current. 

  • Repeat historical problems that have since resolved. 

  • Miss the subtleties of “suspected vs ruled out vs confirmed.” 

In value-based care, that means overstating some risks and missing others — a nightmare from both compliance and actuarial perspectives. 


c. Making up “linkages” that never existed 


To satisfy the style of a good SOAP note, an LLM might assert relationships like: 

“The patient’s CHF is likely due to long-standing uncontrolled hypertension…” 

even if the chart never actually said that. 

To a human, it reads as reasonable. To a regulator or malpractice lawyer, it’s a statement of clinical judgment attributed to you. 


3. LLMs don’t know your contracts, your coding rules, or your risk model 


Even if an LLM is fine-tuned on clinical notes, it still doesn’t: 

  • Know your payer-specific rules (MA vs MSSP vs commercial ACO). 

  • Align with your internal coding compliance policies. 

  • Understand HCC v28 vs v24, local coverage determinations, or recent audit findings. 


If you simply ask: 


“Generate the assessment and plan and list all relevant diagnoses.” 

You’ll get something fluent, but not necessarily compliant or contract-aligned. 

You need logic layers outside the LLM that: 

  • Enforce which codes can be suggested and when. 

  • Check that documentation supports any suggested chronic condition. 

  • Respect visit type, encounter context, and policy constraints. 


The LLM can draft; it must not be the source of truth for what’s allowed. 
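
To make “logic layers outside the LLM” concrete, here is a minimal Python sketch of such a guard layer. Everything in it is illustrative rather than a real implementation: the policy dictionary, field names, and checks stand in for whatever your compliance team and contracts actually require.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class SuggestedCode:
    icd10: str               # e.g. "N18.32" (illustrative: CKD stage 3b)
    evidence_ids: List[str]  # pointers to the chart facts that support it

@dataclass
class EncounterContext:
    visit_type: str      # e.g. "annual_wellness"
    payer_program: str   # e.g. "MA", "MSSP", "commercial_aco"

def code_is_allowed(code: SuggestedCode,
                    ctx: EncounterContext,
                    policy: Dict[str, dict],
                    evidence_store: Dict[str, dict]) -> bool:
    """Deterministic guard layer: the LLM may draft language for a code,
    but only codes that pass these checks are ever shown to the clinician."""
    rules = policy.get(code.icd10)
    if rules is None:
        return False                                  # not on the allowed list at all
    if ctx.visit_type not in rules["visit_types"]:
        return False                                  # wrong encounter type
    if ctx.payer_program not in rules["programs"]:
        return False                                  # not covered by this contract
    # every suggestion must point at real, current chart evidence
    return all(eid in evidence_store for eid in code.evidence_ids)
```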


4. Garbage in, fluent garbage out 


Most EHR data is: 

  • Fragmented across encounters and systems 

  • Full of copy-forwarded problem lists 

  • Riddled with typos, legacy codes, and outdated diagnoses 


An LLM that ingests all of this unfiltered and then “summarizes” it will: 

  • Echo existing errors 

  • Amplify outdated information 

  • Bury what little signal you had under even more text 


Studies on documentation burden and EHR usability show that clinicians already face information overload, long notes, and complex navigation, all of which increase cognitive load and error risk. JAMA Network 


If you simply add generative AI on top of noisy, inconsistent data, you don’t get “smart notes” — you get faster wrongness in a more polished paragraph. 

The model isn’t broken; the data pipeline and governance are. 

 

5. The invisible cost: trust and cognitive load 


The promise is: “AI will reduce documentation burden.” But if clinicians: 


  • Don’t know what the AI looked at 

  • Can’t see why it suggested a diagnosis 

  • Have to re-read everything to be sure nothing is fabricated 


…then you’ve simply moved the burden from “typing notes” to “auditing notes.” We already know that EHR work and after-hours charting are strongly associated with burnout, information overload, and diagnostic risk. JAMA Network


Two predictable outcomes with naive LLM deployment: 


  1. Overtrust – Clinicians accept AI-generated notes with minimal review. 

     Risk: silent propagation of hallucinated diagnoses and false linkages. 

  2. Undertrust – Clinicians don’t believe the AI, so they redo the work manually. 

     Result: more clicks, more burnout, no ROI. 


The only sustainable path is calibrated trust: clinicians know exactly where the AI is strong, where it’s weak, and what they remain responsible for. 

 

6. What “good” looks like: LLMs as assistants, not authors 


For provider organizations, the question isn’t “LLMs: yes or no?” It’s how you use them. 

A safer pattern looks like this. 


a. Structured data first, language second 


Pipeline order should be: 

  1. Ingest and normalize data: meds, labs, vitals, claims, problem lists, notes. 

  2. Apply deterministic and ML logic for: 

    1. Phenotypes (e.g., suspected CKD based on eGFR over time) 

    2. Quality and risk rules (e.g., open gaps, recapture needs) 

  3. Then let the LLM turn that structured, auditable evidence into: 

    1. Drafted assessment language 

    2. Patient-facing explanations 

    3. Follow-up letters or summaries 

The LLM explains and rephrases; it does not invent the underlying facts. 
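
A minimal sketch of that ordering, assuming a hypothetical eGFR-trend rule and a generic llm_draft callable for whichever model you use; thresholds, field names, and wording are illustrative, not clinical guidance.

```python
def suspect_ckd_3b(egfr_values):
    """Illustrative deterministic phenotype rule: three consecutive
    eGFR results in the 30-44 range suggest possible CKD stage 3b."""
    recent = egfr_values[-3:]
    return len(recent) == 3 and all(30 <= v < 45 for v in recent)

def build_evidence(egfr_values, uacr_mg_g):
    """Structured, auditable evidence bundle produced BEFORE any LLM call."""
    return {
        "suggested": "CKD stage 3b",
        "egfr_trend": egfr_values[-3:],
        "uacr_mg_g": uacr_mg_g,
    }

def draft_assessment(evidence, llm_draft):
    """The LLM's only job: turn pre-validated facts into prose.
    `llm_draft` stands in for whatever model call your stack uses."""
    prompt = ("Write one concise assessment sentence using ONLY these facts, "
              f"adding nothing that is not listed: {evidence}")
    return llm_draft(prompt)

# Usage sketch: the rules decide whether there is anything to say at all.
egfr_history = [51, 47, 42, 44, 40]
if suspect_ckd_3b(egfr_history):
    evidence = build_evidence(egfr_history, uacr_mg_g=60)
    # draft = draft_assessment(evidence, llm_draft=my_model_call)  # clinician reviews before accepting
```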


b. Evidence-linked suggestions


Every AI-driven diagnosis suggestion should come with receipts: 

  • “Suggested: CKD 3b” 

  • Because: 

    • eGFR 42, 44, 40 over last 18 months 

    • UACR 60 mg/g 

    • Existing diabetes + hypertension codes 


The LLM can then draft: 


“Chronic kidney disease stage 3b, likely secondary to long-standing diabetes and hypertension, with stable but reduced eGFR (40–44) and albuminuria (UACR 60 mg/g).” 

But the decision to accept that diagnosis remains with the clinician, who can verify the evidence in one glance. 
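
One way to keep those receipts machine-readable is to carry the evidence and the clinician’s decision on the suggestion object itself, so the “why” is always one click away. The structure below is a sketch, not a prescribed schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Evidence:
    kind: str     # "lab", "code", "note_excerpt"
    detail: str   # e.g. "eGFR 42, 44, 40 over the last 18 months"
    source: str   # pointer back to the chart element it came from

@dataclass
class Suggestion:
    diagnosis: str                          # e.g. "CKD stage 3b"
    evidence: List[Evidence] = field(default_factory=list)
    draft_sentence: str = ""                # LLM-drafted wording, never auto-applied
    status: str = "pending"                 # clinician sets: accepted / modified / rejected

ckd_suggestion = Suggestion(
    diagnosis="CKD stage 3b",
    evidence=[
        Evidence("lab", "eGFR 42, 44, 40 over last 18 months", "lab:egfr"),
        Evidence("lab", "UACR 60 mg/g", "lab:uacr"),
        Evidence("code", "Existing diabetes + hypertension codes", "problem_list"),
    ],
)
```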


c. Tight scope: from “write the note” to “draft this part” 

Instead of giving the model the entire chart and saying “write everything,” constrain it: 

  • “Draft the HPI from this transcript.” 

  • “Draft the patient letter explaining new diagnosis X.” 

  • “Draft the provider-to-provider summary for this referral.” 

  • “Turn this evidence bundle into a concise assessment sentence.” 


This reduces hallucination risk and keeps the model focused on what it does best: turning clear inputs into clear language. 
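
One simple way to enforce that constraint in software is to expose only a fixed menu of narrow drafting tasks and fail closed on anything else; the task names and wording below are illustrative.

```python
# The only drafting tasks that exist; there is no "write the whole note" option.
TASK_TEMPLATES = {
    "hpi_from_transcript":
        "Draft the HPI section from this visit transcript:\n{transcript}",
    "patient_letter":
        "Draft a plain-language letter explaining this new diagnosis:\n{diagnosis}",
    "referral_summary":
        "Draft a provider-to-provider summary for this referral:\n{referral_context}",
    "assessment_sentence":
        "Turn this evidence bundle into one concise assessment sentence, "
        "using only the facts provided:\n{evidence}",
}

def build_prompt(task: str, **inputs) -> str:
    if task not in TASK_TEMPLATES:
        raise ValueError(f"Unsupported task: {task}")  # fail closed; never improvise
    return TASK_TEMPLATES[task].format(**inputs)
```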


d. Clear human-in-the-loop workflows 


Design the system so that: 


  • Clinicians see every suggested change to the problem list and diagnoses. 

  • Acceptance is explicit (check boxes, “apply all,” etc.). 

  • You log what was accepted, modified, or rejected — and use that feedback to continuously tune prompts and policies. 


Safety frameworks now being proposed for clinical LLM use emphasize exactly this kind of error taxonomy, safety assessment, and interface design, rather than just “turn it on and see what happens.” Nature 


Without this, you’re flying blind on both safety and performance. 
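
As a sketch of what that logging can look like in practice (field names are illustrative): every suggestion gets an explicit clinician decision, and the resulting log doubles as the feedback loop for tuning prompts and policies.

```python
import json
import time

def record_decision(log_path, suggestion_id, clinician_id, decision, edited_text=None):
    """Append an auditable record of what the clinician did with a suggestion.
    `decision` is one of: "accepted", "modified", "rejected"."""
    entry = {
        "ts": time.time(),
        "suggestion_id": suggestion_id,
        "clinician_id": clinician_id,
        "decision": decision,
        "edited_text": edited_text,   # only populated when the draft was modified
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(entry) + "\n")

# Aggregating this log shows where clinicians routinely reject or rewrite
# suggestions -- exactly the signal needed to retune prompts and policies.
```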

 

7. Questions providers should ask any “AI documentation” vendor 


Before you roll out an AI scribe or LLM-based documentation tool, ask: 

  1. What exactly does the LLM do? 

    1. Does it only draft language, or does it also pick diagnoses/codes? 

    2. Where are deterministic rules vs. where is free-text generation? 

  2. How do you prevent hallucinations? 

    1. Do you enforce that all diagnoses come from a structured evidence set? 

    2. Can the model introduce a condition that isn’t supported by chart data? 

  3. Can I see the evidence behind each suggestion? 

    1. Is there a one-click view of “why this diagnosis is suggested”? 

    2. Are you using phenotypes, risk rules, or just text pattern matching? 

  4. How do you handle my specific contracts and coding policies? 

    1. Can the system be tuned for my HCC version, my payers, my internal guidelines? 

    2. Who is on the hook if the AI encourages noncompliant coding? 

  5. What data do you train on? 

    1. Is my data used to train your models? How is PHI protected? 

    2. Can we keep a private, tenant-specific model boundary? 

  6. How do clinicians stay in control? 

    1. Is there an easy way to decline or edit AI suggestions? 

    2. What does the audit trail look like? 


If a vendor can’t answer these clearly, they’re not ready to handle your documentation. 

 

8. The real opportunity: documentation as a byproduct of good care, not more typing 

Used well, LLMs can help: 

  • Turn conversations into high-quality draft notes. 

  • Turn complex evidence into succinct, guideline-aligned assessments. 

  • Turn population-level analytics into patient-specific nudges. 


But that only works when: 

  • The clinical facts are governed outside the model (data layer, rules, phenotypes). 

  • The model’s job is language, not judgment. 

  • Clinicians are in the loop, not out of it. 


Documentation in value-based care is too important to outsource to a black box that “sounds smart.” 


The future isn’t “AI that writes your chart.”

It’s AI that quietly does the data work, surfaces the right facts, and lets you be the doctor — while making the documentation fall out naturally from that process. 

That’s the bar we should hold LLMs — and their vendors — to. 

 

References  

  1. Shah SV, et al. Accuracy, Consistency, and Hallucination of Large Language Models for Clinical Note Tasks. JAMA Netw Open. 

  2. Roustan D, et al. Clinicians’ Guide to Large Language Models. JMIR. 

  3. Jung KH, et al. Large Language Models in Medicine: Applications, Challenges, and Ethics. PMC. 

  4. Asgari E, et al. Framework to Assess Clinical Safety and Hallucination in LLM-Generated Notes (CREOLA). Nature. 

  5. Palm E, et al. Assessing the Quality of AI-Generated Clinical Notes. Frontiers in Artificial Intelligence. 

  6. Sasseville M, et al. Impact of AI Scribes on Streamlining Clinical Documentation: A Systematic Review. PMC. 

  7. Leung TI, et al. AI Scribes in Health Care: Balancing Transformative Potential and Risk. JMIR Med Inform. 

  8. Holmgren AJ, et al. EHR Usability, Satisfaction, and Burden Among Physicians. JAMA Netw Open. 

  9. Asgari E, et al. Impact of EHR Use on Cognitive Load and Burnout. JMIR Med Inform. 

  10. Cho H, et al. EHR System Use and Documentation Burden in Acute and Critical Care Nurses. PMC. 

  11. Nijor S, et al. Patient Safety Issues from Information Overload in the EHR. J Patient Saf. 

  12. Dixit RA, et al. EHR-Use Issues and Diagnostic Error: A Scoping Review. PMC. 

  13. Singh H, et al. Situational Awareness and Diagnostic Error in Primary Care. PMC. 

  14. Tajirian T, et al. Assessing the Impact on EHR Burden in Clinical Practice. JMIR Hum Factors. 

  15. Frayha N. How to Teach Good EHR Documentation and Deflate Bloated Notes. AMA J Ethics. 

 
