What about GPT-4 as a chatbot for medicine? Let’s study that with Lee et al. (2023)

Lee et al. 2023 — GPT-4 as an AI Chatbot for Medicine — GIVEMEA Study Guide

Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine

Peter Lee, Sebastien Bubeck & Joseph Petro · New England Journal of Medicine · 388(13):1233–1239 · March 30, 2023

Special Report · Exploratory/Narrative · GPT-4 · OpenAI · Microsoft Research · NEJM 2023 · 6-Month Study
>90% USMLE accuracy · 6-month research period · 3 scenario examples · 11 references
Central Finding
GPT-4 demonstrates remarkable capability across medical note-taking, clinical reasoning, and consultation tasks — yet hallucinations, absence of private patient data in training, and unresolved questions about acceptable AI performance mean that careful human oversight remains essential before clinical deployment.

Research Question

What are the practical capabilities, limitations, and risks of GPT-4 when applied to healthcare delivery and medical research — specifically across documentation, innate medical knowledge, and clinical consultation?

What GPT-4 Is — and Is Not

GPT-4 is a general-purpose large language model with a chat interface, trained entirely on openly available internet data including medical texts, research papers, and health websites. Crucially, it has never been trained on private electronic health records or restricted institutional data. It was not designed for medicine specifically — its medical capability emerges from general-purpose cognitive training rather than clinical fine-tuning.

Three Demonstration Scenarios

The paper illustrates GPT-4’s medical potential through three worked examples, all executed in December 2022 with a pre-release version: (1) generating a structured SOAP medical note from a physician-patient encounter transcript, including catching its own BMI hallucination when asked to self-review; (2) correctly answering a USMLE Step 1 question about post-streptococcal glomerulonephritis with full clinical reasoning; and (3) advising a clinician through a COPD exacerbation curbside consult with adaptive follow-up responses.

The Hallucination Problem — and a Partial Solution

In the medical note example, GPT-4 fabricated a BMI of 14.8 not supported by the transcript. However, when a separate GPT-4 session was given the full transcript and note and asked to check for errors, it identified the hallucination. The paper proposes this self-verification approach — using GPT-4 to catch GPT-4’s mistakes — as a practical mitigation strategy for deployment, while acknowledging it is not a complete solution.

Authorship and Conflict of Interest

All three authors are employees of Microsoft Research (Lee, Bubeck) and Nuance Communications (Petro), the company whose DAX ambient documentation product is directly cited as a GPT-4 integration target. The paper explicitly acknowledges this bias. This does not invalidate the findings but frames the paper as an insider exploration rather than independent evaluation — an important critical reading consideration.


AI Technology

GPT-4
Generative Pretrained Transformer 4 — OpenAI’s most advanced publicly released general-purpose LLM as of March 2023. Trained on open internet data (not private health records). Not designed for medicine but shows emerging medical capability. Described as a “work in progress” likely in near-constant change.
Large Language Model (LLM)
A neural network AI system trained on vast text datasets to generate statistically probable language continuations. GPT-4, LaMDA, and GPT-3.5 are all LLMs. None were trained specifically for healthcare — their medical capability is a by-product of general-purpose cognitive training on open web data.
Prompt Engineering
The art and science of crafting input queries (prompts) to maximise the quality of an LLM’s response. Lee et al. note that GPT-4 is currently sensitive to the precise form and wording of prompts — meaning results vary significantly based on how questions are phrased. Future systems are expected to be less sensitive.
Self-Verification Loop
A proposed deployment pattern in which a second, independent GPT-4 session is given the original prompt plus the first session’s output and asked to identify errors. Demonstrated in the paper to successfully catch the BMI hallucination. Proposed by the authors as a practical mitigation strategy for clinical note-taking applications.
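The pattern can be sketched in a few lines of Python. This is a minimal illustration of the loop described in the paper, not the authors’ implementation: the `chat` parameter stands in for whatever chat-completion API is in use, and the prompt wording is invented for illustration.

```python
# Sketch of the self-verification loop: a second, independent LLM session
# reviews the first session's output against the source transcript.

def build_verification_prompt(transcript: str, note: str) -> str:
    """Assemble the reviewer prompt: the original input plus the first session's output."""
    return (
        "You are reviewing a medical note for errors.\n"
        "Check every claim in the note against the transcript; flag any "
        "statement not supported by it (e.g. fabricated vitals or values).\n\n"
        f"--- TRANSCRIPT ---\n{transcript}\n\n"
        f"--- NOTE ---\n{note}\n\n"
        "List unsupported claims, or reply 'NO ERRORS FOUND'."
    )

def verify_note(transcript: str, note: str, chat) -> str:
    # Fresh session: the reviewer sees only this prompt, not the first
    # session's conversation state, so it cannot inherit that state's error.
    messages = [{"role": "user", "content": build_verification_prompt(transcript, note)}]
    return chat(messages)
```

The key design point, per the paper, is independence: the reviewing session gets the raw transcript and the finished note, but none of the generating session’s context.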

Clinical Applications

Medical Note Taking
The first scenario demonstrated: GPT-4 receives a physician-patient encounter transcript (via a product like Nuance DAX) and generates a structured SOAP note with billing codes. The paper shows both the capability (accurate summary) and the limitation (hallucinated BMI) and proposes the self-verification loop as a safeguard.
Curbside Consult
An informal, rapid consultation between healthcare professionals — typically a brief question asked “in passing.” GPT-4 is shown handling a curbside consult about COPD exacerbation, providing clinically structured reasoning and adapting to follow-up questions (no sputum, but cyanosis present). Authors suggest GPT-4 may become a routine “first opinion” tool.
SOAP Note
Subjective / Objective / Assessment / Plan — a standard structured format for clinical documentation. GPT-4 can produce notes in SOAP and other medical formats, automatically include billing codes (ICD), and generate HL7 FHIR-compliant lab and prescription orders — reducing the documentation burden on clinicians.
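As a structure, a SOAP note is simply four ordered sections. A minimal Python sketch of that structure follows; the field contents are invented placeholders, not the paper’s worked example.

```python
# The four ordered sections of a SOAP note, with invented placeholder content.
soap_note = {
    "Subjective": "Patient reports shortness of breath for 3 days.",  # patient-reported history
    "Objective": "Temp 37.1 C; O2 sat 94% on room air.",              # measurable findings
    "Assessment": "Likely COPD exacerbation.",                        # clinical interpretation
    "Plan": "Start bronchodilator; follow up in 48 hours.",           # next steps
}

for section, text in soap_note.items():
    print(f"{section}: {text}")
```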
USMLE
U.S. Medical Licensing Examination — the standardised assessment required for physician licensure in the United States. GPT-4 answers over 90% of written USMLE questions correctly and provides detailed clinical reasoning for its answers. Prior work showed GPT-3.5 also passed the USMLE, suggesting this is an emergent LLM capability rather than specific fine-tuning.

Risks & Limitations

Hallucination
When a GPT-4 response contains false or fabricated information stated with apparent confidence. The paper’s key example: GPT-4 inserted a BMI of 14.8 into a medical note when the transcript contained no weight data. Particularly dangerous in medicine because errors are subtle and stated convincingly — the patient or clinician may not question the output.
Training Data Gap
GPT-4 was trained only on open internet data — it has never seen private EHR data, restricted clinical records, or proprietary institutional protocols. This means it has no knowledge of individual patients, local treatment pathways, or formulary-specific prescribing norms. All clinical output must be verified against actual patient records.
Near-Constant Change
Lee et al.’s warning that GPT-4 “is likely to be in a state of near-constant change, with behavior that may improve or degrade over time.” The March 2023 public release no longer exhibited the hallucinations from the December 2022 pre-release examples. This instability is a critical challenge for clinical standardisation and regulatory approval.
Acceptable Performance Problem
The open question the paper identifies but cannot answer: what level of performance is acceptable for a general AI system in medicine? Prior narrow AI tools had precisely defined operating envelopes — GPT-4’s general intelligence makes that framework inapplicable. How much fact-checking is enough? How much can the user trust the output? These questions remain unresolved.

Standards & Tools

HL7 FHIR
Health Level Seven Fast Healthcare Interoperability Resources — the international standard for electronic health data exchange. GPT-4 can generate lab orders and prescriptions compliant with HL7 FHIR, enabling direct integration with hospital information systems. This is one of the paper’s strongest arguments for near-term clinical utility.
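To make the interoperability claim concrete, here is the general shape of a FHIR R4 ServiceRequest (a lab order) of the kind the paper says GPT-4 can generate. This is an illustrative sketch, not output from the paper; the LOINC code and patient reference are example values chosen for the demonstration.

```python
import json

# Illustrative FHIR R4 ServiceRequest (lab order). Example values only:
# the LOINC code and patient ID are not taken from the paper.
lab_order = {
    "resourceType": "ServiceRequest",
    "status": "active",
    "intent": "order",
    "code": {
        "coding": [{
            "system": "http://loinc.org",
            "code": "2345-7",          # LOINC: Glucose [Mass/volume] in Serum or Plasma
            "display": "Glucose, serum",
        }]
    },
    "subject": {"reference": "Patient/example-123"},  # hypothetical patient ID
}

print(json.dumps(lab_order, indent=2))
```

Because the format is standardised, an order in this shape can be posted directly to a FHIR-compliant hospital information system, which is the substance of the paper’s integration argument.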
Nuance DAX
Dragon Ambient eXperience — a clinical ambient documentation product from Nuance Communications (a Microsoft subsidiary). DAX records physician-patient encounters and produces clinical documentation. Author Joseph Petro is from Nuance; GPT-4 was tested with DAX transcripts in preliminary work. Represents the most direct commercial pipeline for GPT-4 in clinical note-taking.
LaMDA
Language Model for Dialogue Applications — Google’s general-purpose conversational AI, cited as a predecessor comparator to GPT-4. Like GPT-4, LaMDA was not specifically trained for healthcare but has shown varying medical competence. Lee et al. use LaMDA as context for the broader landscape of general-purpose LLMs being evaluated in clinical settings.
EHR (Electronic Health Record)
A digital system storing a patient’s complete clinical history within a healthcare organisation. Private EHR data were explicitly absent from GPT-4’s training — it has no access to real patient records. This training gap is one of the paper’s key limitations: GPT-4 cannot personalise responses to a specific patient’s clinical history without being given that data in the prompt itself.

Study Design at a Glance

This is a Special Report — a narrative exploration, not a controlled trial or systematic review. Over 6 months (approximately July–December 2022), researchers at Microsoft Research and Nuance Communications tested GPT-4 across a range of healthcare tasks using a pre-release version of the model. The paper presents three curated scenario examples plus extensive supplementary transcripts.

What Was Tested

  • Medical documentation: Generating SOAP notes from physician-patient encounter transcripts; self-verification via a second GPT-4 session
  • Innate medical knowledge: Performance on written USMLE questions, including reasoning explanation
  • Clinical consultation: Curbside consult questions about COPD exacerbation with adaptive follow-up
  • Additional capabilities noted: Research article summarisation, FHIR-compliant order generation, multilingual translation, after-visit summaries, explanation-of-benefits decoding

Data Sources

  • USMLE sample questions (publicly available)
  • Dataset for Automated Medical Transcription (Kazi et al., Zenodo, 2020) — used instead of actual DAX patient transcripts to protect privacy
  • Researcher-designed curbside consult prompts
  • GPT-4 was not given access to real patient records at any point

Version Note

  • All three main examples were run in December 2022 on a pre-release GPT-4
  • The publicly released March 2023 version no longer exhibited the hallucinations shown in the paper’s figures
  • Authors re-ran examples with the public version and provide transcripts in the supplementary appendix
  • This versioning instability is itself flagged as a key clinical deployment concern

Strengths

  • First substantive NEJM report on GPT-4 specifically in clinical scenarios — high visibility and agenda-setting
  • Concrete worked examples with full transcripts allow independent evaluation by readers
  • Authors honestly disclose their own hallucination example rather than presenting only successful demonstrations
  • Supplementary appendix with additional transcripts provides more evidence than the article alone

Limitations

  • No independent evaluation — all testing done by authors employed by the entities that created GPT-4 (explicitly acknowledged)
  • No quantitative benchmarking methodology — the >90% USMLE figure is the only numerical claim; all other demonstrations are qualitative
  • Pre-release model used for main examples — the publicly available version behaved differently
  • No patient safety or harm assessment — the paper explores capabilities without measuring actual clinical outcomes
  • No comparison condition — there is no baseline (e.g. human clinician performance on same tasks) against which GPT-4 is measured

Cited References — Close-Up. Focus on the anchoring works that drive the paper’s argument.

[7]

Singhal et al. (2022) — Large Language Models Encode Clinical Knowledge (Med-PaLM)

Preprint · arXiv:2212.13138 · Google Research
★★★ · Primary Comparator · Preprint 2022 · Medical LLM Benchmark
Why this reference matters

Med-PaLM is Google’s medically fine-tuned LLM — GPT-4’s most direct competitor for clinical applications at the time of writing. Its inclusion positions GPT-4 within a landscape of competing general-purpose LLMs being tested in medical settings. Notably, Med-PaLM was specifically trained on medical data, whereas GPT-4’s medical capability emerges purely from general training.

Key contribution to this paper’s argument

Singhal et al. is cited to establish that the field of medical LLM evaluation already exists and that GPT-4 is entering an active research space. By referencing Med-PaLM — a specialist model — alongside GPT-4’s general-purpose architecture, Lee et al. implicitly make the case that general capability may rival or exceed specialised fine-tuning, at least on benchmark tasks.

Connection to the wider citation network

Forms the “AI landscape” cluster with [8] (ChatGPT/GPT-3.5 USMLE performance). These two references collectively establish the lineage: GPT-3.5 passed the USMLE, Med-PaLM was purpose-built for medicine, and GPT-4 outperforms both without medical-specific training. The progression narrative is implicit but central to the paper’s optimism.

[8]

Kung et al. (2023) — Performance of ChatGPT on USMLE

PLOS Digital Health · 2(2):e0000198 · 2023
★★★ · Direct Precursor · PLOS Digit. Health 2023 · USMLE Benchmark Baseline
Why this reference matters

Kung et al. demonstrated that GPT-3.5 (ChatGPT) could pass all three steps of the USMLE with no specialised training — a landmark result that galvanised medical AI interest. Lee et al.’s claim that GPT-4 answers over 90% of USMLE questions correctly is a direct advancement on this prior result, making [8] the essential baseline for interpreting GPT-4’s medical knowledge benchmark.

Key contribution to this paper’s argument

Without [8], the USMLE performance figure would seem to appear from nowhere. With it, the paper establishes a clear progression: GPT-3.5 passed, GPT-4 exceeds 90%. This trajectory is the empirical backbone for the paper’s broader claim that LLM medical capability is real and accelerating, not a one-off result.

Connection to the wider citation network

Paired with [7] (Singhal/Med-PaLM) to form the “prior medical AI” cluster. Together they establish the shoulders GPT-4 stands on. Notably, both [7] and [8] are from late 2022 / early 2023, reflecting how compressed this literature was — the entire field of benchmarked medical LLM evaluation is less than a year old at the time of publication.

[9]

Nuance — Dragon Ambient eXperience (DAX)

Nuance Communications · nuance.com/healthcare · Product Documentation
★★ · Commercial Pipeline · Gray Literature · Deployment Context
Why this reference matters

DAX is the real-world clinical documentation product into which GPT-4 is being integrated by Nuance (a Microsoft subsidiary). Author Joseph Petro is employed by Nuance. This citation is as much a product disclosure as it is an academic reference — it reveals the commercial pipeline the paper is demonstrating capability for.

Key contribution to this paper’s argument

DAX grounds the note-taking scenario in a concrete deployment context. Rather than purely theoretical capability, the paper points toward an imminent real-world application: after patient consent, GPT-4 listens to the encounter via a smart-speaker-style ambient device and produces the note automatically. This makes the paper’s demonstration immediately relevant to practicing clinicians.

Connection to the wider citation network

This is the most commercially significant citation in the paper. Combined with [10] (the Kazi dataset, which was used instead of actual DAX recordings for privacy reasons), it reveals the research setup: DAX provides the real transcripts in practice, but the published example uses a public stand-in. The gap between the published example and the actual intended deployment is an important limitation.

[5]

OpenAI — Introducing ChatGPT (November 30, 2022)

openai.com/blog/chatgpt · Blog Post
★★ · Timeline Anchor · Gray Literature · Technology Introduction
Why this reference matters

The ChatGPT launch blog post anchors the public timeline of LLM chatbot availability. ChatGPT (GPT-3.5) launched November 30, 2022 — the paper was published March 30, 2023, exactly 4 months later. This compressed timeline is the context for understanding how rapidly the field moved from public debut to NEJM analysis.

Key contribution to this paper’s argument

Used to position GPT-4 as the successor to the model that had already captured massive public and professional attention. By anchoring to the ChatGPT launch, the paper frames GPT-4 not as a research curiosity but as the next step of a technology already in widespread public use — including by patients and clinicians searching for health information.

Connection to the wider citation network

Alongside [6] (Corbelle et al. on hallucination) and [7]/[8] (medical LLM benchmarks), this forms the “rapid emergence” cluster — citations establishing that this technology arrived fast, spread fast, and is already being used in medicine without systematic evaluation. This urgency justifies the paper’s exploratory rather than rigorous study design.

[6]

Corbelle et al. (2022) — Dealing with Hallucination and Omission in Neural NLG

INLG 2022 · Proceedings of the 15th International Conference on Natural Language Generation
★★ · Definitional Source · INLG 2022 · Hallucination Definition
Why this reference matters

This is the formal academic source for the term “hallucination” as used in natural language generation — the technical vocabulary the paper borrows to describe GPT-4’s fabricated outputs. Grounding the term in an NLG conference paper gives it precision: hallucination refers specifically to generated content that is factually ungrounded relative to the input, not mere error.

Key contribution to this paper’s argument

By citing [6] when introducing hallucination, Lee et al. signal that this is a known, studied phenomenon in AI — not an unexpected flaw unique to GPT-4. This framing is double-edged: it normalises hallucination as a documented LLM property (managing expectations), while also implying the problem is unlikely to be quickly solved. The citation subtly supports the paper’s caution even amid its optimism.

Connection to the wider citation network

The only computer science conference paper in the reference list — all other citations are medical journals, regulatory documents, or gray literature. This signals the paper’s bridging function: bringing NLG/AI technical concepts into a clinical audience that may not be familiar with them. Corbelle et al. is the “translation” citation that allows Lee et al. to use a precise technical term without defining it at length.

Reference Network Summary

This paper’s citation network is notably compact (11 references for an NEJM Special Report) and falls into three distinct clusters. The medical AI benchmark cluster, [7] (Med-PaLM) and [8] (ChatGPT USMLE), provides the empirical baseline against which GPT-4’s performance is measured, establishing a progression narrative from GPT-3.5 to GPT-4. The clinical deployment cluster, [9] (Nuance DAX) and [10] (Kazi dataset), grounds the note-taking scenario in a concrete commercial pipeline, revealing Microsoft’s intended integration pathway. The AI context cluster, [5] (ChatGPT launch) and [6] (Corbelle on hallucination), frames GPT-4 within the rapid emergence of public LLMs and introduces the technical vocabulary of hallucination from NLG research.

Two features of this network are unusual. First, three of the eleven references are gray literature (a blog post, a product page, and a dataset repository), reflecting the speed of the field and the lack of peer-reviewed precedent. Second, all three authors have direct financial relationships with the companies whose products are featured: Microsoft Research (the model), Nuance (the deployment product), and OpenAI (the underlying system). That makes the entire citation network an insider account rather than an independent evaluation. This does not invalidate the work, but it is the most important single fact for critical reading.


Question 1 of 5
What does the paper mean when it says GPT-4 “hallucinated” in the medical note example?
Answer: a fabricated but plausible-sounding output. GPT-4 stated the patient’s BMI as 14.8, but no weight data appeared in the transcript that would allow this calculation. When a separate GPT-4 session was asked to review the note, it identified this as an unsupported claim; this self-verification loop is one of the paper’s proposed mitigation strategies. Subtle, confidently stated errors like this are likely to go unnoticed without active verification, which is what makes hallucinations particularly dangerous in clinical contexts.
Question 2 of 5
Why does the paper use a publicly available dataset (Kazi et al.) for the note-taking example rather than real Nuance DAX transcripts?
Answer: to respect patient privacy. The authors had been working with real DAX transcripts in their research, but transcripts from actual clinical encounters contain protected health information and cannot be reproduced in a publication, so the published example uses the publicly available Kazi dataset instead. This gap between the published demonstration and the actual intended deployment context is one of the study’s methodological limitations.
Question 3 of 5
What does the paper mean by GPT-4 being in “a state of near-constant change,” and why does this matter for clinical use?
Answer: OpenAI updates the model continually, so behaviour may improve or degrade between versions. The paper demonstrates this directly: the December 2022 pre-release model produced hallucinations that the March 2023 public release had already corrected. The reverse is equally possible, and for clinical tools that means a validated workflow could be replaced by a different-behaving version without explicit notification. This unpredictability is a fundamental challenge to validation and standardisation.
Question 4 of 5
The paper openly acknowledges the authors’ bias. What is the nature of that bias, and why is it important to acknowledge?
Answer: the conflict of interest is financial and institutional. Peter Lee and Sebastien Bubeck are from Microsoft Research; Joseph Petro is from Nuance Communications, a Microsoft subsidiary whose DAX product is the primary deployment pipeline described; OpenAI is explicitly thanked for early access. The paper evaluating GPT-4 was therefore written by people financially dependent on GPT-4’s success. Acknowledging this openly is a mark of intellectual honesty, but readers must weigh the findings accordingly: nothing here has been independently reproduced by parties without a financial stake in the outcome.
Question 5 of 5
What is the “acceptable performance problem” identified in the paper’s conclusion?
Answer: previous clinical AI tools, such as image analysis or risk stratification models, had precisely defined tasks and measurable performance thresholds (sensitivity, specificity, AUC). GPT-4’s general intelligence means it operates across an unbounded range of tasks with no single evaluable operating envelope. How much fact-checking is required? What error rate is tolerable for a consult versus a note versus a diagnosis? Without an agreed standard for acceptable performance, deployment decisions become unavoidably subjective; the paper identifies this as the fundamental unresolved challenge for clinical deployment.
Core Thesis
GPT-4 is not an end in itself — it is the opening of a door to new possibilities and new risks. Used carefully and with appropriate caution, these evolving tools have the potential to help healthcare providers give the best care possible.
  • 📋 Documentation Is the Strongest Near-Term Use Case

    GPT-4’s ability to generate SOAP notes from encounter transcripts, produce FHIR-compliant orders, and write after-visit summaries represents the most mature and immediately deployable application. The Nuance DAX integration provides an existing commercial channel. The self-verification loop (using GPT-4 to check GPT-4) partially mitigates hallucination risk in this constrained context.

  • ⚠️ Hallucination Is Subtle and Dangerous in Medicine

    The BMI hallucination is significant precisely because it was plausible — a specific numerical value, contextually appropriate, stated with full confidence. It would not trigger alarm bells the way an obviously absurd response would. The paper’s warning that errors “can be subtle and are often stated in such a convincing manner” is the most important clinical safety message in the article.

  • 🎓 Medical Knowledge Emerges Without Medical Training

    GPT-4 answers over 90% of USMLE questions correctly despite never being trained on clinical data. This is a remarkable result — and a slightly unsettling one. It suggests that public internet text about medicine is sufficient to encode substantial medical knowledge, but also that this knowledge has unknown gaps, biases, and inconsistencies inherited from whatever was written online.

  • 🔄 Model Instability Undermines Clinical Standardisation

    The same prompt produced different results in December 2022 versus March 2023. For clinical tools, this is not a minor inconvenience — it means a validated workflow can silently become invalid when OpenAI updates the model. No equivalent problem exists for FDA-cleared software or pharmaceutical products, which are fixed at point of approval. This is the deepest structural challenge for GPT-4’s clinical deployment.

  • 👥 Insider Evaluation Is Not Independent Evidence

    All three authors are employed by entities with direct financial interest in GPT-4’s clinical adoption. The paper is valuable as a demonstration and a provocation — but it is not an independent clinical evaluation. The next step is third-party replication with pre-specified endpoints, comparator conditions, and adversarial testing by researchers without commercial stakes in the outcome.

  • The Evaluation Framework for General AI in Medicine Does Not Yet Exist

    This may be the paper’s most important contribution: naming the “acceptable performance problem.” Narrow AI was evaluated against task-specific thresholds. GPT-4’s unbounded general capability makes those frameworks inapplicable. Building a new evaluation methodology for general-purpose clinical AI — one that accounts for task diversity, hallucination rates, versioning instability, and acceptable error thresholds — is the field’s most urgent research need.
