Large Language Model AI Chatbots Require Approval as Medical Devices
Gilbert, Harvey, Melvin, Vollebregt & Wicks · Nature Medicine · 2023
Current LLM chatbots do not meet the international principles for AI in healthcare — including bias control, explainability, transparency, and oversight — and their near-infinite, unverifiable outputs effectively preclude approval as medical devices under current EU and US regulatory frameworks.
Research Question
When LLM chatbots such as ChatGPT are applied to clinical or patient-facing tasks, do they constitute medical devices under existing EU and US law, and can they realistically achieve regulatory approval in their current form?
The Regulatory Reality
Under EU law (MDR 2017/745) and US FDA guidance, any software that performs more than simple database functions to assist in diagnosis, prevention, monitoring, prediction, prognosis, treatment, or disease alleviation is classified as a medical device. Most proposed clinical LLM applications fall squarely within this definition, making regulatory approval a legal requirement rather than a recommendation.
The Fundamental Problem
LLMs generate “reasonable continuations” of text drawn from vast, unvetted web data. They have near-infinite input/output spaces, no inheritable quality assurance, and a well-documented tendency to “hallucinate” plausible but false information. This makes pre-market verification, risk mitigation, and post-market surveillance technically impossible under current frameworks — ruling out valid marketing as medical devices in the EU.
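The "reasonable continuation" mechanism can be illustrated with a deliberately tiny toy model. This is my own sketch, not anything from the paper: a bigram chain stands in for an LLM, and the stand-in corpus is invented. The point is that such a model optimises statistical plausibility, never truth, so it can fluently combine fragments into a claim its training data never made.

```python
import random

# Toy illustration of "reasonable continuation": a bigram model that
# emits statistically plausible next words with no notion of truth.
corpus = (
    "aspirin treats headache . aspirin treats fever . "
    "ibuprofen treats headache . ibuprofen treats inflammation ."
).split()

# Count bigram transitions: word -> list of observed next words.
transitions = {}
for prev, nxt in zip(corpus, corpus[1:]):
    transitions.setdefault(prev, []).append(nxt)

def continue_text(word, steps=3, seed=0):
    """Sample a plausible continuation starting from `word`."""
    random.seed(seed)
    out = [word]
    for _ in range(steps):
        out.append(random.choice(transitions[out[-1]]))
    return " ".join(out)

# Depending on the sample, this may emit "aspirin treats inflammation ." --
# a fluent sentence the corpus never contained: a miniature hallucination.
print(continue_text("aspirin"))
```

Scaling the vocabulary and context window up by many orders of magnitude changes the fluency, not the underlying property: outputs are sampled for plausibility, which is why hallucination is structural rather than incidental.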
A Path Forward Exists
The authors outline concrete steps developers can take: limiting intended scope, applying quality management systems immediately, training on validated medical corpora, constraining outputs, conducting clinical trials, and implementing real-time fact-checking. Emerging models with fewer hallucinations (e.g., GatorTron, trained on 82 billion words of de-identified clinical text) suggest progress, though public details remain insufficient to assess current approvability.
Disclaimers Are No Defence
The authors cite DxGPT as an example of a tool already directing users to enter patient data for diagnostic suggestions while calling itself a “research experiment.” They argue such disclaimers do not exempt developers from medical device law: experimentation on human subjects must occur in properly controlled, authorised clinical trials with appropriate patient safeguards.
Task Classification Under EU/US Law
Adapted from Table 2 of Gilbert et al. (2023). Task status under EU MDR and US FDA guidance.
| Task | EU / UK Status | US FDA Status |
|---|---|---|
| Assist patient to prepare for telehealth consultation | Non-device | Non-device |
| Triage non-critical emergency department patients | Medical device | Depends on function |
| Diagnostic decision support (DDS) | Medical device | Medical device (specific/time-critical) |
| Therapeutic-planning decision support (TDS) | Medical device | Medical device (specific/time-critical) |
| Counseling or talk therapy delivery | Medical device | Medical device |
| Generation of clinical reports | Depends on purpose | Non-device if requirements met |
| Doctor’s discharge letters | Depends on purpose | Non-device if requirements met |
| Specific post-treatment patient information | Depends on purpose | Non-device if requirements met |
■ AI Technology ■ Regulatory Concepts ■ Risk & Safety ■ Legal Frameworks
Type of Paper
This is a Comment article in Nature Medicine — a structured expert opinion piece, not an empirical study. It draws on regulatory law, published clinical reports, and technological analysis rather than primary data collection. Its authority rests on the expertise of its authors across medical device regulation, clinical AI, and health law.
Author Expertise
- Stephen Gilbert (TU Dresden): medical device regulation and digital health
- Hugh Harvey (Hardian Health): clinical AI and radiology
- Tom Melvin (Trinity College Dublin): former senior medical officer, Health Products Regulatory Authority Ireland; former co-chair, EU Clinical Investigation and Evaluation Working Group
- Erik Vollebregt (Axon Lawyers, Amsterdam): medical device law, EU and international frameworks
- Paul Wicks (Wicks Digital Health): digital therapeutics, patient engagement, clinical research
The Six Regulatory Approval Challenges (Table 1)
- Verification: near-infinite inputs/outputs, including hallucinated outputs, make models untestable
- Provenance: no control over training data quality when an LLM is used as an underlying API component
- Changes: LLMs are not fixed models — generative constraints can be adapted on-market without re-approval
- Usability: near-infinite range of user experiences depending on how questions are phrased
- Risks: no proven method to prevent harmful outputs
- Surveillance: near-infinite outputs make ongoing post-market surveillance impossible
The Search Engine Analogy — and Its Limits
- Search engines are used by ~two-thirds of patients before consultations and by most doctors 1–3 times daily
- Search engines are not regulated medical devices because their developers did not create them with an intended medical purpose
- LLM chatbot integration into search engines adds risk: conversational mimicry increases users’ confidence in results, raising the stakes when those results are wrong
- An LLM chatbot marketed to patients for health decisions cannot claim the same exemption if it has an intended clinical purpose
Steps Toward Approvability (Table 3)
- Define a clearly limited intended purpose; exclude emergency or critical use cases
- Design to inform — not drive — medical decisions; choose an appropriate risk class
- Implement performance benchmarks for narrow use cases; stop or tightly constrain on-market adaptivity
- Constrain the LLM to prevent harmful advice; control data protection risks
- Use only self-developed LLMs or external LLMs explicitly documented for medical device use
- Develop from authoritative medical sources; rigorously test, constrain, retest, and document
- Feed automated, real-time fact-checking results back into the LLM
- Conduct comprehensive clinical trials following regulatory frameworks before market release
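One of the steps above, constraining the LLM to prevent harmful advice, can be sketched as a fail-closed output gate. This is a hypothetical illustration, not a design from the paper: `llm_draft` and `VETTED_ANSWERS` are invented placeholders standing in for a real model call and a validated clinical knowledge base.

```python
# Hypothetical sketch of output constraint: every draft answer is checked
# against a vetted reference source, and anything that cannot be verified
# is withheld. VETTED_ANSWERS and llm_draft are placeholder stand-ins.

VETTED_ANSWERS = {
    "pre-visit checklist": "Bring your medication list and recent test results.",
}

def llm_draft(question: str) -> tuple[str, str]:
    """Stand-in for an LLM call: returns (matched_topic, draft_text)."""
    return "pre-visit checklist", "Bring your medication list and recent test results."

def constrained_answer(question: str) -> str:
    topic, draft = llm_draft(question)
    vetted = VETTED_ANSWERS.get(topic)
    if vetted is None or draft != vetted:
        # Fail closed: unverifiable output never reaches the user,
        # shrinking a near-infinite output space to an auditable set.
        return "This question is outside my verified scope."
    return draft

print(constrained_answer("What should I bring to my telehealth visit?"))
```

The design choice matters for the authors' argument: only by collapsing the output space to a finite, vetted set does pre-market verification and post-market surveillance become tractable at all.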
Strengths of the Commentary
- Brings together regulatory law, technical AI critique, and clinical risk in a unified argument
- Provides concrete, actionable guidance for developers (Table 3), avoiding a purely prohibitionist stance
- Published in a high-impact venue (Nature Medicine), ensuring broad reach across clinical and research communities
- Authors include a practising medical device lawyer and a former national regulatory officer, lending legal precision
Limitations and Considerations
- Published in early 2023 — the LLM landscape has evolved rapidly; some claims about specific models may be partially outdated
- Focuses on EU and US frameworks; regulatory variation across other jurisdictions (UK, Canada, Australia, Asia) is not fully addressed
- Does not quantify the prevalence or severity of harms from currently deployed LLM tools
- Several authors declare conflicts of interest through advisory and consulting relationships with digital health companies
Key sources cited by Gilbert et al. (2023) and their roles in the argument.
- Singhal et al. (2022), "Large Language Models Encode Clinical Knowledge" (Medical LLM; Google PaLM)
- Lee, Bubeck & Petro (2023), "Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine" (NEJM Commentary; GPT-4)
- Wolfram (2023), "What Is ChatGPT Doing and Why Does It Work?" (Technical Explainer)
- EU MDR, Regulation (EU) 2017/745 of the European Parliament and Council (Primary Legislation; Regulatory)
- US FDA, Clinical Decision Support Software Guidance (2022) (FDA Guidance; Regulatory)
- Yang et al. (2022), "GatorTron: A Large Clinical Language Model to Unlock Patient Information from Unstructured Electronic Health Records" (Medical LLM; Domain-specific Training)
LLM chatbots applied to clinical and patient-facing tasks are medical devices under existing EU and US law, but their intrinsic architectural properties — hallucination, infinite output space, uncontrollable provenance — make regulatory approval in their current form effectively impossible.
Clinical LLM Use Triggers Medical Device Law — Now
This is not a future concern. Any LLM chatbot intended for diagnosis, therapeutic planning, triage, or therapy delivery is already a medical device under EU MDR and US FDA frameworks. Regulators in both jurisdictions have previously acted to remove unvalidated clinical software from the market.
Hallucination Is Structural, Not Incidental
The paper’s most important technical claim is that inaccuracy and hallucination are intrinsic to how LLMs work: they predict statistically plausible token sequences, not factual statements. Grounding and constraint techniques reduce but cannot eliminate this. This structural property is what makes current approval impossible, not a fixable engineering bug.
“Intended Purpose” Is the Decisive Legal Concept
Medical device law is triggered by what a developer claims a tool is for, not by how users happen to use it. The search engine analogy shows that general-purpose tools escape classification; LLMs marketed for clinical tasks cannot. “Research experiment” disclaimers provide no legal shelter if the functional design and interface direct clinical use.
A Path to Approvability Exists — but Requires Radical Scope Limitation
The paper’s Table 3 outlines a realistic route: narrow intended purpose, quality management from day one, training on authoritative medical corpora, constrained adaptivity, real-time fact-checking, and full clinical trials. GatorTron, trained on 82 billion words of de-identified clinical text, points toward what domain-specific, validated LLMs might achieve.
International AI Health Principles Already Exist — and LLMs Fail Them All
Bias control, explainability, transparency, and oversight are the internationally agreed foundations for AI in healthcare, reflected in EU and US regulatory proposals. Current LLMs cannot tell patients where advice comes from, why it is offered, or whether ethical trade-offs were considered — failing every criterion.
Regulation Enables Innovation — It Does Not Block It
The paper’s closing argument is that regulatory approval is not an obstacle but a legitimising process. It levels the playing field between developers, links safety data to promising innovations, and builds the public trust that clinical AI tools need to be adopted. The huge effort put into LLM creativity and plausibility should be matched by equivalent effort on safety validation.
