What about GPT-4 as a chatbot for medicine? Let's study that with Lee et al. (2023)
Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine
Peter Lee, Sebastien Bubeck & Joseph Petro · New England Journal of Medicine · 388(13):1233–1239 · March 30, 2023
GPT-4 demonstrates remarkable capability across medical note-taking, clinical reasoning, and consultation tasks — yet hallucinations, absence of private patient data in training, and unresolved questions about acceptable AI performance mean that careful human oversight remains essential before clinical deployment.
Research Question
What are the practical capabilities, limitations, and risks of GPT-4 when applied to healthcare delivery and medical research — specifically across documentation, innate medical knowledge, and clinical consultation?
What GPT-4 Is — and Is Not
GPT-4 is a general-purpose large language model with a chat interface, trained entirely on openly available internet data including medical texts, research papers, and health websites. Crucially, it has never been trained on private electronic health records or restricted institutional data. It was not designed for medicine specifically — its medical capability emerges from general-purpose cognitive training rather than clinical fine-tuning.
Three Demonstration Scenarios
The paper illustrates GPT-4's medical potential through three worked examples, all executed in December 2022 with a pre-release version: (1) generating a structured SOAP medical note from a physician-patient encounter transcript, including catching its own BMI hallucination when a second session was asked to review the output; (2) correctly answering a USMLE Step 1 question about post-streptococcal glomerulonephritis with full clinical reasoning; and (3) advising a clinician through a COPD exacerbation curbside consult with adaptive follow-up responses.
The Hallucination Problem — and a Partial Solution
In the medical note example, GPT-4 fabricated a BMI of 14.8 that was not supported by the transcript. However, when a separate GPT-4 session was given the full transcript and note and asked to check for errors, it identified the hallucination. The paper proposes this self-verification approach, using GPT-4 to catch GPT-4's mistakes, as a practical mitigation strategy for deployment, while acknowledging it is not a complete solution.
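The two-session pattern is simple enough to sketch in code. Below is a minimal illustration, assuming the OpenAI Python SDK's chat-completions interface; the prompts, file name, and helper function are hypothetical stand-ins, not the authors' actual setup.

```python
# Minimal sketch of the two-session self-verification pattern.
# Assumes the OpenAI Python SDK (pip install openai); prompts and the
# helper are illustrative, not the paper's actual configuration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def chat(system: str, user: str) -> str:
    """Run one isolated chat session and return the assistant's reply."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    )
    return response.choices[0].message.content

with open("encounter.txt") as f:  # physician-patient encounter transcript
    transcript = f.read()

# Session 1: draft a SOAP note from the transcript.
note = chat(
    "You are a medical scribe. Write a SOAP note from this transcript.",
    transcript,
)

# Session 2: a *separate* session checks the note against the transcript.
# This is the step that caught the fabricated BMI in the paper's example.
review = chat(
    "You are a careful reviewer. List every statement in the note that "
    "is not supported by the transcript.",
    f"TRANSCRIPT:\n{transcript}\n\nNOTE:\n{note}",
)
print(review)
```

The key design point is isolation: the reviewing session has no memory of generating the note, so it cannot simply defend its own output.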
Authorship and Conflict of Interest
Lee and Bubeck are employees of Microsoft Research, and Petro of Nuance Communications, the company whose DAX ambient documentation product is directly cited as a GPT-4 integration target. The paper explicitly acknowledges this conflict of interest. It does not invalidate the findings, but it frames the paper as an insider exploration rather than an independent evaluation, an important consideration for critical reading.
■ AI Technology ■ Clinical Applications ■ Risks & Limitations ■ Standards & Tools
Study Design at a Glance
This is a Special Report — a narrative exploration, not a controlled trial or systematic review. Over 6 months (approximately July–December 2022), researchers at Microsoft Research and Nuance Communications tested GPT-4 across a range of healthcare tasks using a pre-release version of the model. The paper presents three curated scenario examples plus extensive supplementary transcripts.
What Was Tested
- Medical documentation: Generating SOAP notes from physician-patient encounter transcripts; self-verification via a second GPT-4 session
- Innate medical knowledge: Performance on written USMLE questions, including reasoning explanation
- Clinical consultation: Curbside consult questions about COPD exacerbation with adaptive follow-up
- Additional capabilities noted: Research article summarisation, FHIR-compliant order generation (see the sketch after this list), multilingual translation, after-visit summaries, explanation-of-benefits decoding
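To make the FHIR point concrete: a FHIR-compliant order is just a structured JSON resource. The fragment below shows a minimal FHIR R4 MedicationRequest of the kind an LLM would need to emit; the patient reference, drug, and code are invented for illustration and do not come from the paper.

```python
# Illustration only: a minimal FHIR R4 MedicationRequest resource.
# Patient reference, drug, and RxNorm code are hypothetical examples.
import json

order = {
    "resourceType": "MedicationRequest",
    "status": "draft",  # draft until a clinician signs off
    "intent": "order",
    "medicationCodeableConcept": {
        "coding": [{
            "system": "http://www.nlm.nih.gov/research/umls/rxnorm",
            "code": "197361",  # hypothetical RxNorm code
            "display": "Amlodipine 5 MG Oral Tablet",
        }]
    },
    "subject": {"reference": "Patient/example"},
    "dosageInstruction": [{"text": "Take one tablet by mouth daily"}],
}

print(json.dumps(order, indent=2))
```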
Data Sources
- USMLE sample questions (publicly available)
- Dataset for Automated Medical Transcription (Kazi et al., Zenodo, 2020) — used instead of actual DAX patient transcripts to protect privacy
- Researcher-designed curbside consult prompts
- GPT-4 was not given access to real patient records at any point
Version Note
- All three main examples were run in December 2022 on a pre-release GPT-4
- The publicly released March 2023 version no longer exhibited the hallucinations shown in the paper’s figures
- Authors re-ran examples with the public version and provide transcripts in the supplementary appendix
- This versioning instability is itself flagged as a key clinical deployment concern
Strengths
- First substantive NEJM report on GPT-4 specifically in clinical scenarios — high visibility and agenda-setting
- Concrete worked examples with full transcripts allow independent evaluation by readers
- Authors honestly disclose their own hallucination example rather than presenting only successful demonstrations
- Supplementary appendix with additional transcripts provides more evidence than the article alone
Limitations
- No independent evaluation — all testing was done by authors employed by the entities developing and commercialising GPT-4 (explicitly acknowledged)
- No quantitative benchmarking methodology — the >90% USMLE figure is the only numerical claim; all other demonstrations are qualitative
- Pre-release model used for main examples — the publicly available version behaved differently
- No patient safety or harm assessment — the paper explores capabilities without measuring actual clinical outcomes
- No comparison condition — there is no baseline (e.g. human clinician performance on same tasks) against which GPT-4 is measured
Cited References — Close-Up. Focus on the anchoring works that drive the paper's argument.
Singhal et al. (2022) — Large Language Models Encode Clinical Knowledge (Med-PaLM)
Why this reference matters
Med-PaLM is Google's medically tuned LLM and was GPT-4's most direct competitor for clinical applications at the time of writing. Its inclusion positions GPT-4 within a landscape of competing LLMs being tested in medical settings. Notably, Med-PaLM was specifically adapted to the medical domain, whereas GPT-4's medical capability emerges purely from general training.
Key contribution to this paper’s argument
Singhal et al. is cited to establish that the field of medical LLM evaluation already exists and that GPT-4 is entering an active research space. By referencing Med-PaLM — a specialist model — alongside GPT-4’s general-purpose architecture, Lee et al. implicitly make the case that general capability may rival or exceed specialised fine-tuning, at least on benchmark tasks.
Connection to the wider citation network
Forms the “AI landscape” cluster with [8] (ChatGPT/GPT-3.5 USMLE performance). These two references collectively establish the lineage: GPT-3.5 passed the USMLE, Med-PaLM was purpose-built for medicine, and GPT-4 outperforms both without medical-specific training. The progression narrative is implicit but central to the paper’s optimism.
Kung et al. (2023) — Performance of ChatGPT on USMLE
Why this reference matters
Kung et al. showed that GPT-3.5 (ChatGPT) performed at or near the passing threshold on all three steps of the USMLE with no specialised training, a landmark result that galvanised interest in medical AI. Lee et al.'s claim that GPT-4 answers over 90% of USMLE questions correctly directly advances this prior result, making [8] the essential baseline for interpreting GPT-4's medical knowledge benchmark.
Key contribution to this paper’s argument
Without [8], the USMLE performance figure would seem to appear from nowhere. With it, the paper establishes a clear progression: GPT-3.5 passed, GPT-4 exceeds 90%. This trajectory is the empirical backbone for the paper’s broader claim that LLM medical capability is real and accelerating, not a one-off result.
Connection to the wider citation network
Paired with [7] (Singhal/Med-PaLM) to form the “prior medical AI” cluster. Together they establish the shoulders GPT-4 stands on. Notably, both [7] and [8] are from late 2022 / early 2023, reflecting how compressed this literature was — the entire field of benchmarked medical LLM evaluation is less than a year old at the time of publication.
Nuance — Dragon Ambient eXperience (DAX)
Why this reference matters
DAX is the real-world clinical documentation product into which GPT-4 is being integrated by Nuance (a Microsoft subsidiary). Author Joseph Petro is employed by Nuance. This citation is as much a product disclosure as it is an academic reference — it reveals the commercial pipeline the paper is demonstrating capability for.
Key contribution to this paper’s argument
DAX grounds the note-taking scenario in a concrete deployment context. Rather than purely theoretical capability, the paper points toward an imminent real-world application: after patient consent, GPT-4 listens to the encounter via a smart-speaker-style ambient device and produces the note automatically. This makes the paper’s demonstration immediately relevant to practicing clinicians.
Connection to the wider citation network
This is the most commercially significant citation in the paper. Combined with [10] (the Kazi dataset, which was used instead of actual DAX recordings for privacy reasons), it reveals the research setup: DAX provides the real transcripts in practice, but the published example uses a public stand-in. The gap between the published example and the actual intended deployment is an important limitation.
OpenAI — Introducing ChatGPT (November 30, 2022)
Why this reference matters
The ChatGPT launch blog post anchors the public timeline of LLM chatbot availability. ChatGPT (GPT-3.5) launched November 30, 2022 — the paper was published March 30, 2023, exactly 4 months later. This compressed timeline is the context for understanding how rapidly the field moved from public debut to NEJM analysis.
Key contribution to this paper’s argument
Used to position GPT-4 as the successor to the model that had already captured massive public and professional attention. By anchoring to the ChatGPT launch, the paper frames GPT-4 not as a research curiosity but as the next step of a technology already in widespread public use — including by patients and clinicians searching for health information.
Connection to the wider citation network
Alongside [6] (Corbelle et al. on hallucination) and [7]/[8] (medical LLM benchmarks), this forms the “rapid emergence” cluster — citations establishing that this technology arrived fast, spread fast, and is already being used in medicine without systematic evaluation. This urgency justifies the paper’s exploratory rather than rigorous study design.
Corbelle et al. (2022) — Dealing with Hallucination and Omission in Neural NLG
Why this reference matters
This is the formal academic source for the term “hallucination” as used in natural language generation — the technical vocabulary the paper borrows to describe GPT-4’s fabricated outputs. Grounding the term in an NLG conference paper gives it precision: hallucination refers specifically to generated content that is factually ungrounded relative to the input, not mere error.
Key contribution to this paper’s argument
By citing [6] when introducing hallucination, Lee et al. signal that this is a known, studied phenomenon in AI — not an unexpected flaw unique to GPT-4. This framing is double-edged: it normalises hallucination as a documented LLM property (managing expectations), while also implying the problem is unlikely to be quickly solved. The citation subtly supports the paper’s caution even amid its optimism.
Connection to the wider citation network
The only computer science conference paper in the reference list — all other citations are medical journals, regulatory documents, or gray literature. This signals the paper’s bridging function: bringing NLG/AI technical concepts into a clinical audience that may not be familiar with them. Corbelle et al. is the “translation” citation that allows Lee et al. to use a precise technical term without defining it at length.
This paper's citation network is notably compact (11 references for a NEJM Special Report) and falls into three distinct clusters. The medical AI benchmark cluster, [7] (Med-PaLM) and [8] (ChatGPT USMLE), provides the empirical baseline against which GPT-4's performance is measured, establishing a progression narrative from GPT-3.5 to GPT-4. The clinical deployment cluster, [9] (Nuance DAX) and [10] (Kazi dataset), grounds the note-taking scenario in a concrete commercial pipeline, revealing Microsoft's intended integration pathway. The AI context cluster, [5] (ChatGPT launch) and [6] (Corbelle on hallucination), frames GPT-4 within the rapid emergence of public LLMs and introduces the technical vocabulary of hallucination from NLG research.

Two features of this network are unusual. First, three of the eleven references are gray literature (a blog post, a product page, and a dataset repository), reflecting the speed of the field and the lack of peer-reviewed precedent. Second, all three authors have direct financial relationships with the companies whose products are featured: Microsoft (OpenAI's principal partner and investor), Nuance (maker of the DAX deployment product), and by extension OpenAI (creator of GPT-4 itself). This makes the entire citation network an insider account rather than an independent evaluation. It does not invalidate the work, but it is the most important single fact for critical reading.
GPT-4 is not an end in itself — it is the opening of a door to new possibilities and new risks. Used carefully and with appropriate caution, these evolving tools have the potential to help healthcare providers give the best care possible.
Documentation Is the Strongest Near-Term Use Case
GPT-4’s ability to generate SOAP notes from encounter transcripts, produce FHIR-compliant orders, and write after-visit summaries represents the most mature and immediately deployable application. The Nuance DAX integration provides an existing commercial channel. The self-verification loop (using GPT-4 to check GPT-4) partially mitigates hallucination risk in this constrained context.
Hallucination Is Subtle and Dangerous in Medicine
The BMI hallucination is significant precisely because it was plausible — a specific numerical value, contextually appropriate, stated with full confidence. It would not trigger alarm bells the way an obviously absurd response would. The paper’s warning that errors “can be subtle and are often stated in such a convincing manner” is the most important clinical safety message in the article.
Medical Knowledge Emerges Without Medical Training
GPT-4 answers over 90% of USMLE questions correctly despite never having been trained on private clinical data or fine-tuned for medicine. This is a remarkable result, and a slightly unsettling one. It suggests that public internet text about medicine is sufficient to encode substantial medical knowledge, but also that this knowledge has unknown gaps, biases, and inconsistencies inherited from whatever was written online.
Model Instability Undermines Clinical Standardisation
The same prompt produced different results in December 2022 versus March 2023. For clinical tools, this is not a minor inconvenience — it means a validated workflow can silently become invalid when OpenAI updates the model. No equivalent problem exists for FDA-cleared software or pharmaceutical products, which are fixed at point of approval. This is the deepest structural challenge for GPT-4’s clinical deployment.
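One partial mitigation, which the paper does not prescribe but which follows from its concern, is to pin a dated model snapshot rather than a floating alias, and to log which snapshot produced each output. A sketch, assuming the OpenAI API's dated snapshot identifiers (such as the since-retired gpt-4-0314):

```python
# Sketch of version pinning as a partial mitigation (not from the paper).
# A floating alias like "gpt-4" can change behaviour silently on update;
# a dated snapshot stays fixed until the provider retires it.
from openai import OpenAI

PINNED_MODEL = "gpt-4-0314"  # dated snapshot, fixed at validation time
client = OpenAI()

response = client.chat.completions.create(
    model=PINNED_MODEL,
    messages=[{"role": "user", "content": "Summarise this after-visit note for the patient."}],
)

# Record the exact model alongside the output, so a validated workflow
# can detect when the snapshot it was validated against has changed.
print(response.model, response.choices[0].message.content)
```

Pinning only narrows the window: snapshots are eventually retired, so the re-validation problem the paper identifies does not disappear.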
Insider Evaluation Is Not Independent Evidence
All three authors are employed by entities with direct financial interest in GPT-4’s clinical adoption. The paper is valuable as a demonstration and a provocation — but it is not an independent clinical evaluation. The next step is third-party replication with pre-specified endpoints, comparator conditions, and adversarial testing by researchers without commercial stakes in the outcome.
The Evaluation Framework for General AI in Medicine Does Not Yet Exist
This may be the paper’s most important contribution: naming the “acceptable performance problem.” Narrow AI was evaluated against task-specific thresholds. GPT-4’s unbounded general capability makes those frameworks inapplicable. Building a new evaluation methodology for general-purpose clinical AI — one that accounts for task diversity, hallucination rates, versioning instability, and acceptable error thresholds — is the field’s most urgent research need.
