Gilbert et al., Large Language Model AI Chatbots Require Approval as Medical Devices, 2023


Large Language Model AI Chatbots Require Approval as Medical Devices

Gilbert, Harvey, Melvin, Vollebregt & Wicks · Nature Medicine · 2023

Comment/Commentary · Regulatory Analysis · AI in Healthcare · EU MDR · FDA · LLM Safety
6 Approval Challenges · 9 Task Classifications · 3 Regulatory Tables · 5 Authors
Central Argument
Current LLM chatbots do not meet the international principles for AI in healthcare — including bias control, explainability, transparency, and oversight — and their near-infinite, unverifiable outputs effectively preclude approval as medical devices under current EU and US regulatory frameworks.

Research Question

When LLM chatbots such as ChatGPT are applied to clinical or patient-facing tasks, do they constitute medical devices under existing EU and US law, and can they realistically achieve regulatory approval in their current form?

The Regulatory Reality

Under EU law (MDR 2017/745) and US FDA guidance, any software that performs more than simple database functions to assist in diagnosis, prevention, monitoring, prediction, prognosis, treatment, or disease alleviation is classified as a medical device. Most proposed clinical LLM applications fall squarely within this definition, making regulatory approval a legal requirement rather than a recommendation.

The Fundamental Problem

LLMs generate “reasonable continuations” of text drawn from vast, unvetted web data. They have near-infinite input/output spaces, no inheritable quality assurance, and a well-documented tendency to “hallucinate” plausible but false information. This makes pre-market verification, risk mitigation, and post-market surveillance technically impossible under current frameworks — ruling out valid marketing as medical devices in the EU.

A Path Forward Exists

The authors outline concrete steps developers can take: limiting intended scope, applying quality management systems immediately, training on validated medical corpora, constraining outputs, conducting clinical trials, and implementing real-time fact-checking. Emerging models with fewer hallucinations (e.g., GatorTron, trained on 82 billion words of de-identified clinical text) suggest progress, though public details remain insufficient to assess current approvability.

Disclaimers Are No Defence

The authors cite DxGPT as an example of a tool already directing users to enter patient data for diagnostic suggestions while calling itself a “research experiment.” They argue such disclaimers do not exempt developers from medical device law: experimentation on human subjects must occur in properly controlled, authorised clinical trials with appropriate patient safeguards.

Task Classification Under EU/US Law

Adapted from Table 2 of Gilbert et al. (2023). Task status under EU MDR and US FDA guidance.

Task | EU / UK Status | US FDA Status
Assist patient to prepare for telehealth consultation | Non-device | Non-device
Triage non-critical emergency department patients | Medical device | Depends on function
Diagnostic decision support (DDS) | Medical device | Medical device (specific/time-critical)
Therapeutic-planning decision support (TDS) | Medical device | Medical device (specific/time-critical)
Counseling or talk therapy delivery | Medical device | Medical device
Generation of clinical reports | Depends on purpose | Non-device if requirements met
Doctor’s discharge letters | Depends on purpose | Non-device if requirements met
Specific post-treatment patient information | Depends on purpose | Non-device if requirements met

■ AI Technology   ■ Regulatory Concepts   ■ Risk & Safety   ■ Legal & Policy Frameworks

AI Technology

Large Language Model (LLM)
A neural network language model (e.g. GPT-4, PaLM) trained on billions of unidentified web pages and books to generate statistically “reasonable” continuations of text. LLMs reassemble what was most commonly written by humans and cannot guarantee factual accuracy.
Hallucination
The tendency of LLMs to produce highly convincing statements that are factually wrong, or to invent plausible but non-existent citations. Regarded by Gilbert et al. as an intrinsic structural property of LLM-based chat models, not merely a correctable error.
Grounding
A constraint technique (used in Google’s LaMDA) that involves collecting and annotating chats between users and the LLM to improve quality, safety and factual accuracy. The paper notes that grounding is unlikely to fully resolve hallucination because inaccuracy is intrinsic to LLMs.
Tokenized Encoding
The way LLMs represent language: text is broken into tokens (word fragments) and encoded as numerical vectors. The model learns statistical relationships between token sequences — it does not “understand” meaning but predicts probable next tokens.
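To make this concrete, here is a deliberately toy sketch of the idea in Python. The vocabulary, the greedy longest-match rule, and the "##" continuation marker are all invented for illustration and do not correspond to any real tokenizer:

```python
# Toy illustration of tokenized encoding. Real tokenizers (e.g. BPE) are
# learned from data, but the principle is the same: text becomes integer
# IDs, and the model only ever sees statistics over those IDs.
toy_vocab = {"medic": 0, "al": 1, "device": 2, "##s": 3, " ": 4}

def encode(text: str) -> list[int]:
    """Greedily match the longest vocabulary piece at each position."""
    ids, i = [], 0
    while i < len(text):
        for end in range(len(text), i, -1):  # try longest pieces first
            piece = text[i:end]
            key = piece if piece in toy_vocab else "##" + piece
            if key in toy_vocab:
                ids.append(toy_vocab[key])
                i = end
                break
        else:
            raise ValueError(f"no token covers {text[i:]!r}")
    return ids

print(encode("medical devices"))  # → [0, 1, 4, 2, 3]
```

The model downstream works only with sequences like `[0, 1, 4, 2, 3]`, learning which ID tends to follow which; nothing in that representation encodes whether a statement is true.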
GatorTron
A medical LLM trained exclusively on 82 billion words of de-identified clinical text, showing improved accuracy over general LLMs for medical questions. Cited by the authors as a promising direction, while noting that even the medical literature is not uniformly correct or current.

Regulatory Concepts

Medical Device (SaMD)
Software as a Medical Device: any software performing more than simple database functions to assist in diagnosis, prevention, monitoring, prediction, prognosis, treatment, or alleviation of disease. Under EU MDR and US FDA guidance, most clinical LLM applications fall within this definition.
Clinical Decision Support (CDS)
Software that provides specific diagnostic or therapeutic advice to clinicians or patients. All patient-facing CDS and most clinician-facing CDS must undergo medical device registration in the EU and US. LLM-based CDS cannot be approved under current law due to verification and reliability failures.
Quality Management System (QMS)
A regulated framework (e.g. ISO 13485) that medical device manufacturers must apply throughout the development lifecycle. The EU MDR requires QMS compliance for all medical devices. LLMs “have no inheritable quality assurance from their developers,” making QMS compliance extremely challenging.
Intended Purpose
The specific clinical use for which a medical device is developed and for which it seeks regulatory approval. The paper notes that search engines are not medical devices because their developers did not intend a medical diagnostic or therapeutic purpose when creating them — a distinction that LLM chatbot developers cannot always claim.
Post-market Surveillance
The ongoing monitoring of a medical device’s safety and performance after regulatory approval and market release. The EU MDR specifically requires this for all medical devices. For LLMs with near-infinite output combinations, the paper argues comprehensive surveillance is practically impossible.

Risk & Safety

Verification (Regulatory)
The process of confirming that a device’s outputs meet its design requirements across its full range of inputs. For LLMs with a near-infinite input/output space (including hallucinated outputs), verification is described in the paper as effectively untestable — a fundamental barrier to approval.
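A quick back-of-the-envelope calculation shows the scale of the barrier. Both figures below are illustrative assumptions, not numbers from the paper:

```python
# Rough arithmetic on the input space of a chat model, assuming a
# ~50,000-token vocabulary and prompts of only 20 tokens (both figures
# are illustrative assumptions, not from Gilbert et al.).
vocab_size = 50_000
prompt_len = 20
distinct_prompts = vocab_size ** prompt_len  # ≈ 9.5e93 possible prompts

# Even checking a billion prompts per second for roughly the age of the
# universe (~4.4e17 seconds) covers a vanishing fraction of the space.
checked = 10**9 * 4.4e17
print(f"{distinct_prompts:.2e} prompts; "
      f"fraction checkable: {checked / distinct_prompts:.1e}")
```

Real prompts are longer and outputs multiply the space further, so exhaustive testing is off the table by any margin; this is the sense in which verification is "effectively untestable".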
Explainability
The capacity of an AI system to provide transparent, understandable reasons for its outputs. LLMs cannot tell patients where advice came from, why it is given, or how ethical trade-offs were considered. Explainability is listed as one of the international consensus principles for AI in healthcare that current LLMs fail to meet.
Provenance
The ability to trace the origin and quality of information used in an LLM’s training data. Because LLMs are trained on vast, unidentified web text, there is no control over provenance when an LLM is used as the underlying model for a medical device built on top of it via an API.
API Exclusion
The regulatory principle that LLMs, lacking any inheritable quality assurance from their developers, are excluded from use as external “plug-in” components (via API) of a certified medical device under EU law. A medical device cannot simply wrap an unvalidated LLM and claim its approval.

Legal & Policy Frameworks

EU MDR 2017/745
EU Medical Device Regulation: the primary European regulatory framework for medical devices. Requires QMS, clinical evaluation, post-market surveillance, and clinical follow-up. Gilbert et al. argue that current LLMs cannot meet these requirements, effectively precluding EU market approval.
FDA CDS Guidance (2022)
US FDA guidance on clinical decision support software. Distinguishes between non-device CDS (where clinicians can independently verify the basis for recommendations) and regulated CDS. LLMs fail the “non-device” threshold because they provide no genuine sources and offer no certainty or confidence indicators.
Risk Classification
Under EU MDR and FDA frameworks, medical devices are assigned a risk class (I, IIa, IIb, III in EU; Class I, II, III in USA) based on severity of risk and intended use. Class determines the level of pre-market scrutiny required. Clinical decision support typically falls in higher risk classes requiring clinical data.
DxGPT
An LLM-powered diagnostic tool (using GPT-3) that the paper cites as a live example of an unapproved medical device operating internationally. DxGPT invites users to enter patient descriptions for diagnostic suggestions while carrying disclaimers calling it a “research experiment” — disclaimers the authors argue do not override medical device law.

Type of Paper

This is a Comment article in Nature Medicine — a structured expert opinion piece, not an empirical study. It draws on regulatory law, published clinical reports, and technological analysis rather than primary data collection. Its authority rests on the expertise of its authors across medical device regulation, clinical AI, and health law.

Author Expertise

  • Stephen Gilbert (TU Dresden): medical device regulation and digital health
  • Hugh Harvey (Hardian Health): clinical AI and radiology
  • Tom Melvin (Trinity College Dublin): former senior medical officer, Health Products Regulatory Authority Ireland; former co-chair, EU Clinical Investigation and Evaluation Working Group
  • Erik Vollebregt (Axon Lawyers, Amsterdam): medical device law, EU and international frameworks
  • Paul Wicks (Wicks Digital Health): digital therapeutics, patient engagement, clinical research

The Six Regulatory Approval Challenges (Table 1)

  • Verification: near-infinite inputs/outputs, including hallucinated outputs, make models untestable
  • Provenance: no control over training data quality when an LLM is used as an underlying API component
  • Changes: LLMs are not fixed models — generative constraints can be adapted on-market without re-approval
  • Usability: near-infinite range of user experiences depending on how questions are phrased
  • Risks: no proven method to prevent harmful outputs
  • Surveillance: near-infinite outputs make ongoing post-market surveillance impossible

The Search Engine Analogy — and Its Limits

  • Search engines are used by ~two-thirds of patients before consultations and by most doctors 1–3 times daily
  • Search engines are not regulated medical devices because their developers did not create them with an intended medical purpose
  • LLM chatbot integration into search engines adds risk: conversational mimicry increases users’ confidence in results, raising the stakes when those results are wrong
  • An LLM chatbot marketed to patients for health decisions cannot claim the same exemption if it has an intended clinical purpose

Steps Toward Approvability (Table 3)

  • Define a clearly limited intended purpose; exclude emergency or critical use cases
  • Design to inform — not drive — medical decisions; choose an appropriate risk class
  • Implement performance benchmarks for narrow use cases; stop or tightly constrain on-market adaptivity
  • Constrain the LLM to prevent harmful advice; control data protection risks
  • Use only self-developed LLMs or external LLMs explicitly documented for medical device use
  • Develop from authoritative medical sources; rigorously test, constrain, retest, and document
  • Link automated real-time fact-checking in feedback to the LLM
  • Conduct comprehensive clinical trials following regulatory frameworks before market release

Strengths of the Commentary

  • Brings together regulatory law, technical AI critique, and clinical risk in a unified argument
  • Provides concrete, actionable guidance for developers (Table 3), avoiding a purely prohibitionist stance
  • Published in a high-impact venue (Nature Medicine), ensuring broad reach across clinical and research communities
  • Authors include a practising medical device lawyer and a former national regulatory officer, lending legal precision

Limitations and Considerations

  • Published in early 2023 — the LLM landscape has evolved rapidly; some claims about specific models may be partially outdated
  • Focuses on EU and US frameworks; regulatory variation across other jurisdictions (UK, Canada, Australia, Asia) is not fully addressed
  • Does not quantify the prevalence or severity of harms from currently deployed LLM tools
  • Several authors declare conflicts of interest through advisory and consulting relationships with digital health companies

Key sources cited by Gilbert et al. (2023) and their roles in the argument.

Ref 1

Singhal et al. (2022) — Large Language Models Encode Clinical Knowledge

Preprint · arXiv:2212.13138 · Google Research (Med-PaLM)
Medical LLM Google PaLM
Introduces Med-PaLM, Google’s medically fine-tuned version of PaLM. Gilbert et al. cite this alongside the LaMDA safety and grounding constraints to illustrate that even leading developers acknowledge significant remaining safety challenges. Used as the technical backdrop for the hallucination and grounding discussion.
Ref 3

Lee, Bubeck & Petro (2023) — Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine

New England Journal of Medicine · 388:1233–1239 · 2023
NEJM Commentary GPT-4
The companion NEJM piece describing GPT-4’s potential in medicine. Gilbert et al. use it as the optimistic counterpoint — the claims of transformative potential in CDS and clinical communication — before explaining why these same applications trigger medical device law and currently cannot be approved.
Ref 5

Wolfram (2023) — What Is ChatGPT Doing and Why Does It Work?

Stephen Wolfram Writings · February 2023
Technical Explainer
Provides the technical grounding for the paper’s claim that LLMs simply “reassemble what was most commonly written by humans.” Used to explain why hallucination is intrinsic — the model predicts statistically likely token sequences, not truth — and why no training constraint can fully eliminate fabricated outputs.
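The claim can be caricatured in a few lines: a toy bigram model that, like an LLM at enormously larger scale, emits the statistically most common continuation regardless of truth. The corpus is invented for illustration:

```python
# A minimal bigram "language model": emit whichever word most often
# followed the current word in training text. The tiny corpus is invented
# for illustration; a real LLM does this over token vectors at vastly
# larger scale, with no notion of truth, only frequency.
from collections import Counter, defaultdict

corpus = ("aspirin treats headache . aspirin treats fever . "
          "aspirin causes bleeding .").split()

follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1  # count each observed continuation

def next_word(word: str) -> str:
    # The most common continuation wins, whether or not it is true here.
    return follows[word].most_common(1)[0][0]

print(next_word("aspirin"))  # → "treats" (seen twice, vs "causes" once)
```

No amount of extra training data changes the objective: the model is rewarded for being typical, not for being right, which is why the paper treats hallucination as structural.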
Ref 9

EU MDR — Regulation (EU) 2017/745 of the European Parliament and Council

Official Journal of the European Union · 5 April 2017
Primary Legislation Regulatory
The primary legal foundation for the EU medical device classification arguments throughout the paper. Defines the SaMD category, QMS requirements, post-market surveillance obligations, and the API/component rules that exclude unvalidated LLMs from use within certified medical devices.
Ref 10

US FDA — Clinical Decision Support Guidance (2022)

US Food and Drug Administration · 2022
FDA Guidance Regulatory
The US counterpart to EU MDR for software classification. The paper uses FDA guidance to establish that LLMs fail the “non-device CDS” threshold because they do not provide genuine sources and cannot enable clinicians to independently verify the basis of their recommendations — a key non-device carve-out.
Ref 14

Yang et al. (2022) — GatorTron: A Large Clinical Language Model to Unlock Patient Information from Unstructured Electronic Health Records

NPJ Digital Medicine · 5:194 · 2022
Medical LLM Domain-specific Training
Describes GatorTron, an LLM trained on 82 billion words of de-identified clinical text showing improved medical question-answering. Cited as evidence that domain-specific, controlled training can reduce (but not eliminate) hallucination — and as the closest existing model to the approvability pathway Gilbert et al. describe.


Question 1 of 5
According to Gilbert et al., what is the fundamental technical reason why LLMs cannot currently be approved as medical devices under EU law?
✓ Correct. The paper identifies six interlinked problems in Table 1, all rooted in the near-infinite input/output space: verification is untestable, provenance is uncontrollable, on-market changes are undocumented, usability is unpredictable, risks cannot be prevented, and surveillance is impossible. These are structural, not merely logistical, barriers.
Not quite. The correct answer is C. The authors’ core argument is that hallucination and near-infinite output space are intrinsic to how LLMs work — not problems of cost, submission, or trial scale. This makes the EU MDR’s requirements for verification and surveillance technically unachievable with current LLM architecture.
Question 2 of 5
Under EU and UK law, which of the following LLM chatbot use cases would NOT be classified as a medical device?
✓ Correct. Per Table 2 of the paper, helping a patient prepare for a telehealth consultation (e.g. equipment setup) is classified as “Non-device” in both the EU/UK and US. Tasks that do not involve diagnosis, treatment, or clinical decision support fall outside the medical device definition.
Not quite. The correct answer is B. The paper’s Table 2 shows that diagnostic support (A), therapeutic planning (C), and talk therapy (D) are all medical devices under EU law. Only purely administrative or logistical tasks like helping set up equipment for a telehealth call escape device classification.
Question 3 of 5
Why do Gilbert et al. argue that search engines are not regulated medical devices, while LLM chatbots with medical applications are?
✓ Correct. “Intended purpose” is the decisive legal concept: medical device law is triggered by what the developer claims the tool is for, not by how users happen to use it. Because search engines were created as general information retrieval tools — not clinical tools — they are not devices. An LLM marketed for diagnostics or patient advice cannot claim this exemption.
Not quite. The correct answer is C. The paper makes clear the key distinction is “intended purpose,” the legal concept at the heart of EU MDR and FDA CDS guidance. Accuracy (A) is not the regulatory criterion; prior approval (B) would be circular; search engines (D) absolutely use AI algorithms and ranking models.
Question 4 of 5
What does the paper say about disclaimers such as “this is a research experiment” in the context of LLM tools used for clinical tasks?
✓ Correct. The DxGPT example illustrates this: the tool carries a disclaimer calling it a “research experiment” while actively inviting users to enter patient data for diagnostics. The paper is explicit — medical device law applies regardless of such disclaimers, and genuine experimentation must be conducted under controlled clinical trial conditions that protect participants.
Not quite. The correct answer is C. The paper argues that disclaimers simply do not override the intent and function of a tool under medical device law. Whether the audience is patients or professionals is not the sole determinant. The DxGPT case illustrates that “research experiment” framing is insufficient protection.
Question 5 of 5
According to the paper, what is the international consensus on the key principles for AI in healthcare, and how do current LLMs measure against them?
✓ Correct. The paper explicitly states that there is “international agreement on the key principles for AI in healthcare” — control of bias, explainability, transparency, and systems of oversight — reflected in both the proposed EU AI Act framework and US FDA guidance. Current LLM chatbots fail to meet all four criteria.
Not quite. The correct answer is C. The paper argues that genuine international consensus already exists around four principles — bias control, explainability, transparency, and oversight — and that these are reflected in emerging regulatory frameworks. Current LLMs fail on all four counts, making them incompatible with approved clinical deployment.
Core Thesis
LLM chatbots applied to clinical and patient-facing tasks are medical devices under existing EU and US law, but their intrinsic architectural properties — hallucination, infinite output space, uncontrollable provenance — make regulatory approval in their current form effectively impossible.
  • ⚖️

    Clinical LLM Use Triggers Medical Device Law — Now

    This is not a future concern. Any LLM chatbot intended for diagnosis, therapeutic planning, triage, or therapy delivery is already a medical device under EU MDR and US FDA frameworks. Regulators in both jurisdictions have previously acted to remove unvalidated clinical software from the market.

  • 🚫

    Hallucination Is Structural, Not Incidental

    The paper’s most important technical claim is that inaccuracy and hallucination are intrinsic to how LLMs work: they predict statistically plausible token sequences, not factual statements. Grounding and constraint techniques reduce but cannot eliminate this. This structural property is what makes current approval impossible, not a fixable engineering bug.

  • 🔍

    “Intended Purpose” Is the Decisive Legal Concept

    Medical device law is triggered by what a developer claims a tool is for, not by how users happen to use it. The search engine analogy shows that general-purpose tools escape classification; LLMs marketed for clinical tasks cannot. “Research experiment” disclaimers provide no legal shelter if the functional design and interface direct clinical use.

  • 🗺️

    A Path to Approvability Exists — but Requires Radical Scope Limitation

    The paper’s Table 3 outlines a realistic route: narrow intended purpose, quality management from day one, training on authoritative medical corpora, constrained adaptivity, real-time fact-checking, and full clinical trials. GatorTron, trained on 82 billion words of de-identified clinical text, points toward what domain-specific, validated LLMs might achieve.

  • 🌐

    International AI Health Principles Already Exist — and LLMs Fail Them All

    Bias control, explainability, transparency, and oversight are the internationally agreed foundations for AI in healthcare, reflected in EU and US regulatory proposals. Current LLMs cannot tell patients where advice comes from, why it is offered, or whether ethical trade-offs were considered — failing every criterion.

  • 🏥

    Regulation Enables Innovation — It Does Not Block It

    The paper’s closing argument is that regulatory approval is not an obstacle but a legitimising process. It levels the playing field between developers, links safety data to promising innovations, and builds the public trust that clinical AI tools need to be adopted. The huge effort put into LLM creativity and plausibility should be matched by equivalent effort on safety validation.
