Consider an illustrative example, constructed but representative of failure modes we see across the language access space. A clinician tells a patient "take this medication twice daily, but never with grapefruit juice." An interpretation platform renders this into the patient's language. The output reads, in back-translation, "take this medication twice daily, but never without grapefruit juice." The negation has flipped. Every other word is correct. The grammar is correct. The dose is correct. A BLEU score on that exchange would still look strong, because most n-grams from the reference also appear in the candidate. The encounter is unsafe.
Literal-overlap metrics miss the entire reason good medical interpretation is hard. Real interpretation is not word-by-word substitution. It is context-aware work: streamlining the message, handling gender ambiguity, picking the right word for the situation and restructuring sentences so they land in the target language the way they would have landed in the source. To measure quality in 2026, we have to measure what good interpretation actually does.
1. AI or human is not the question anymore
1.1 The hybrid question has settled
Health systems that have implemented AI medical interpretation have not adopted it as a replacement for human interpreters. They have adopted a hybrid workflow where AI handles routine encounters and human interpreters handle the cases AI is not yet ready for. The debate of three years ago has been overtaken by operational reality. The interesting questions have moved.
1.2 The three real questions
For a CMIO or language access director evaluating a platform in 2026, the questions are about three things.
First, the quality of the interpretation itself: how accurately does the AI render the source, in clinical terms, in the patient's dialect, in the right register?
Second, the quality of the integration into the clinical workflow: how fast is the connection, how is it documented, how does it hand off to a human when the encounter calls for it?
Third, how the vendor measures their own accuracy: what does the vendor's internal audit actually look like? What triggers a human review? What happens when a human reviewer disagrees with the AI?
As our CEO Eyal Heldenberg puts it: "the vendor has to run internal audits on the quality of interpretation." As Katy Haynes notes in her CHCF report on AI and language access in healthcare, vendors have to cover both generic benchmark testing and live encounter review. On the other end, medical sites have the responsibility to choose the right vendor and to build internal protocols on top.
At this stage all parties have to approach this with humility. We are evolving in a new environment. Frameworks are still being built, tested and adjusted. The two-sided responsibility is not a hedge. It is the only model that works in a field this young.
2. What does interpretation quality actually mean in a clinical setting?
2.1 The four properties
Four properties show up in every serious framework we have seen. They survived as the working framework because they map cleanly onto what goes wrong in a clinical encounter when interpretation fails. Each property checks a different thing and a single failure can trigger more than one.
- Translation accuracy is a coverage check on the source. Is every clinically relevant piece of information from the source present in the output, with nothing dropped, nothing fabricated and nothing modified? If "twice daily" becomes "as needed," or "never with grapefruit juice" becomes "never without grapefruit juice," the source information has been modified and accuracy has failed.
- Medical-term retention is a vocabulary check on the output. Are clinical terms (conditions, medications, procedures, dosing units, anatomical locations) rendered with the term a clinician in the target language would actually use? "Hypertension" is not "high blood pressure feeling." "NPO after midnight" (Latin nil per os) is not "do not eat at night."
- Semantic adequacy: the translation preserves the meaning of the source, not just its surface. It does not flatten urgency, soften a warning or shift a directive into a suggestion. This is the criterion that catches the meaning reversal in the grapefruit juice example, and it is one of the hardest properties for any interpretation system to guarantee at scale.
- Cultural tone: wording suits the setting, the patient and the cultural context. Pediatric language is not adult language. And dialect depth matters more than language coverage: handling Mexican Spanish but stumbling on Dominican Spanish is not really covering Spanish nuances.
The four properties overlap deliberately. Accuracy and adequacy in particular can both fail on the same utterance (a negation flip fails both: the source information has been modified and the meaning the patient takes away is the opposite of what the clinician intended). They can also fail independently. A dropped non-clinical word fails accuracy but preserves adequacy. A softened directive fails adequacy but preserves accuracy. A quality program that scores all four separately catches more failure modes than any single composite score.
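A minimal sketch of why separate scores catch more than a composite. The class name, the 0-to-1 scale, the 0.8 pass threshold and the example scores are all illustrative assumptions, not values taken from any real quality program.

```python
from dataclasses import dataclass

# The four properties named in this section, in a fixed order.
PROPERTIES = (
    "translation_accuracy",
    "medical_term_retention",
    "semantic_adequacy",
    "cultural_tone",
)

@dataclass(frozen=True)
class PropertyScores:
    """Hypothetical per-utterance scores on an assumed 0.0-1.0 scale."""
    translation_accuracy: float
    medical_term_retention: float
    semantic_adequacy: float
    cultural_tone: float

    def failed(self, threshold=0.8):
        """Names of the properties scoring below the (assumed) threshold."""
        return [name for name in PROPERTIES if getattr(self, name) < threshold]

    def composite(self):
        """Plain average, shown only to contrast with per-property scoring."""
        return sum(getattr(self, name) for name in PROPERTIES) / len(PROPERTIES)

# A negation flip fails accuracy and adequacy at the same time.
negation_flip = PropertyScores(0.2, 1.0, 0.1, 1.0)
# A softened directive fails adequacy only.
softened_directive = PropertyScores(1.0, 1.0, 0.4, 1.0)

print(negation_flip.failed())
print(softened_directive.failed(), round(softened_directive.composite(), 2))
```

In this toy setup the softened directive averages about 0.85, above the assumed threshold, so a composite alone would pass it while the per-property view still flags semantic adequacy.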
2.2 What is new in 2026
These properties are not new.
What is new is how to grade them when an AI is producing the interpretation in real time, hundreds of thousands of times a week, in 45+ languages. Medical interpretation (live speech) is not medical translation (documents).
A miss on translation accuracy is a wrong dose or a flipped negation. A miss on medical-term retention is a clinician misunderstanding the patient. A miss on semantic adequacy is a softened safety warning or a flattened directive. A miss on cultural tone is an eroded encounter. Every quality program in the field has to grade all four.
3. Why a broader definition of quality is overdue
3.1 The hidden assumption in literal metrics
The deeper problem with literal-overlap metrics is the assumption they bake in: that there is a "correct" target-language version of every source sentence and quality is how closely the candidate matches it. That assumption was never quite right and it is now visibly wrong. A skilled medical interpreter, human or AI, does not produce the closest possible word-by-word match. They produce the version that lands correctly in the target language and the target context. That work has at least four dimensions literal metrics cannot see: streamlining redundancy, gender disambiguation, choosing the right word for the situation and restructuring sentences.
3.2 Two concrete examples
Gender disambiguation. Many languages mark gender on verbs, adjectives or articles where English does not. Concrete example: a Brazilian Portuguese-speaking patient is in the room with her adult son. The clinician says "she's been managing her blood sugar well this month." A literal translation that defaults to the masculine form (the statistically more common verb agreement in many systems) renders the sentence as if the clinician is praising the son rather than the patient. The patient hears it. The son hears it. Trust takes a hit immediately. A context-aware system carries the established subject through the discourse and uses the correct gendered form.
Choosing the right word for the situation. Many clinical terms have multiple valid translations and the right one depends on the patient, the encounter and the register. The clinician says "we'd like to draw some blood for labs." In Spanish, this can be rendered as the formal "vamos a extraerle sangre para los análisis," the everyday "le vamos a sacar sangre para los exámenes" or the colloquial "le vamos a sacar sangre." A literal metric would treat all three as equally valid because they all map back to the source. A context-aware system picks the one that fits the patient and the encounter type.
4. Why BLEU is not enough
4.1 BLEU is linear, interpretation is context-dependent
BLEU was introduced in 2002 by Papineni and colleagues at IBM as an automatic alternative to slow, expensive human evaluation. It counts how many short word sequences (n-grams) in the candidate translation overlap with one or more human reference translations and produces a score between 0 and 1. The math is linear: more overlap is a higher score, regardless of what the words actually mean in context.
That is the central failure for medical interpretation. An English idiom like "cold turkey" illustrates it in one line. A clinician tells a patient "you'll need to stop the medication cold turkey." A literal-overlap metric is satisfied if the target language output contains the closest word-for-word match. In Spanish, that match is "pavo frío," literally a cold turkey, the bird. The correct interpretation is "dejar de golpe" or "abstinencia repentina" (stop abruptly or sudden withdrawal). BLEU compares strings, not meaning: it can reward the wrong answer whenever the wrong answer shares more surface words with the reference than the right one does, and it penalizes a correct rendering that is simply worded differently from the reference.
4.2 Document translation versus live interpretation
BLEU was designed to grade document translation: a source text and a target text, both static, both reviewable, both with reference translations available. Live medical interpretation is a different problem. It is bidirectional. It is real-time. It involves turn-taking, repair, multi-speaker dynamics and the context-aware behaviors above. Most of what makes live interpretation hard does not exist in the document-translation setting BLEU was built for. Using BLEU to grade live interpretation is like using a written-exam rubric to grade a clinical viva.
4.3 BLEU scores errors, it does not annotate them
Even when BLEU correctly identifies that a translation is bad, it does not say why it is bad. The score is a single number that does not distinguish a wrong medication term from a softened safety warning from a register mismatch. For a vendor trying to improve their system, this is the difference between a thermometer and a diagnosis. Modern evaluation frameworks produce structured, per-criterion scores that point to the specific failure mode. BLEU points nowhere. For a longer treatment of how the underlying models have changed, see our earlier post on LLM capabilities and medical terminology accuracy.
5. The modern stack: MQM in software, with human escalation
5.1 MQM is the framework, now running automatically
Between BLEU and the modern evaluation stack sits Multidimensional Quality Metrics (MQM). MQM emerged in the early 2010s as an error-typology framework: a reviewer tags each error against a structured taxonomy (accuracy, fluency, terminology, style) and weights each error by severity (minor, major, critical). MQM is the closest thing the translation industry has to a clinical lab's quality control protocol and the four properties above map directly onto its top-level categories.
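As a rough sketch of what MQM-style tagging produces, the snippet below records each error with a category and a severity and totals a penalty per category. The 1/5/25 severity weights are an assumption borrowed from commonly used MQM scoring setups, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum

class Severity(Enum):
    # Assumed weights; real programs tune these to their own risk model.
    MINOR = 1
    MAJOR = 5
    CRITICAL = 25

@dataclass(frozen=True)
class MqmError:
    category: str       # e.g. "accuracy", "fluency", "terminology", "style"
    severity: Severity
    note: str

def penalty_by_category(errors):
    """Sum severity weights per category instead of one blended number."""
    totals = {}
    for error in errors:
        totals[error.category] = totals.get(error.category, 0) + error.severity.value
    return totals

def has_critical(errors):
    """True if any tagged error is critical, whatever the total penalty."""
    return any(error.severity is Severity.CRITICAL for error in errors)

errors = [
    MqmError("accuracy", Severity.CRITICAL, "negation flipped: with -> without"),
    MqmError("style", Severity.MINOR, "register more formal than the source"),
]

print(penalty_by_category(errors))
print(has_critical(errors))
```

Unlike a single overlap score, the output says which category failed and how badly, which is what makes the result actionable.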
For most of its history MQM was operationalized through expert human reviewers tagging by hand. That made it accurate but slow, at roughly five to fifteen minutes per segment, which is fine for a research benchmark and impossible for a platform producing hundreds of thousands of utterances a week. What has changed in the last two years is that MQM has gone automated. Systems like GEMBA-MQM and multi-agent LLM evaluators now apply MQM-style tagging without a human in the loop and they form the basis of the modern LLM-as-judge stack. The strongest LLM judges for translation are MQM, implemented in software.
Every interpreted utterance can be scored against the four properties and the context-aware behaviors in close to real time. The judge produces a structured score per criterion, not a single number. A semantic-adequacy score below the "high" band sends the utterance to a review queue. In one representative example, Mayo Clinic researchers evaluating LLM-generated EHR summaries found that an LLM-as-Judge framework achieved an intraclass correlation coefficient of 0.818 with human evaluators and completed evaluations in 22 seconds.
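The routing rule described here can be sketched as follows. The three band names, the judgment dictionary shape and the queue record are assumptions for illustration; the only rule taken from the text is that semantic adequacy below the top band triggers review.

```python
# Hypothetical ordinal bands, ordered worst to best.
BANDS = ("low", "medium", "high")

def criteria_needing_review(judgment, required="high", gated=("semantic_adequacy",)):
    """Return the gated criteria whose band falls below the required band."""
    floor = BANDS.index(required)
    return [name for name in gated if BANDS.index(judgment[name]) < floor]

# One structured judgment per utterance: a band per criterion, not one number.
judgment = {
    "translation_accuracy": "high",
    "medical_term_retention": "high",
    "semantic_adequacy": "medium",
    "cultural_tone": "high",
}

review_queue = []
flagged = criteria_needing_review(judgment)
if flagged:
    review_queue.append({"utterance_id": "utt-001", "criteria": flagged})

print(review_queue)
```

Passing more names in `gated` extends the same rule to the other properties without changing the routing logic.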
5.2 Human escalation is non-negotiable
LLM judges are not perfect. Recent global health research shows that even the highest-performing LLM-judge achieved human-equivalent evaluations on only four of eleven criteria, which is why a human review layer is required. Credentialed medical interpreters and bilingual clinicians review a sampled percentage of encounters plus all flagged cases and their judgments feed back into the judge's prompts. The evaluation layer mirrors the operational layer where providers escalate from AI to human interpreters: AI at scale, human at the edges. One thing standard MQM was not built for is the live-conversation dimensions that matter most in medical interpretation, including turn-taking, repair and multi-speaker dynamics. Extending MQM to those dimensions is part of what a serious medical interpretation evaluation stack does.
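A minimal sketch of the selection step for the human layer: every flagged encounter goes to review, plus a random sample of the rest. The 5% sample rate, the fixed seed and the encounter record shape are assumptions chosen for the example.

```python
import math
import random

def select_for_review(encounters, sample_rate=0.05, seed=0):
    """All flagged encounters plus a random sample of the unflagged ones."""
    rng = random.Random(seed)  # seeded so the sample is reproducible in audits
    flagged = [e for e in encounters if e["flagged"]]
    unflagged = [e for e in encounters if not e["flagged"]]
    sample_size = math.ceil(sample_rate * len(unflagged))
    return flagged + rng.sample(unflagged, sample_size)

# 100 hypothetical encounters, three of them flagged by the automated judge.
encounters = [{"id": i, "flagged": i in {7, 42, 99}} for i in range(100)]
selected = select_for_review(encounters)

print(len(selected))
print(sorted(e["id"] for e in selected if e["flagged"]))
```

Sampling only from unflagged encounters keeps the random sample an honest check on what the judge let through, rather than a second look at cases it already caught.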
6. What buyers should take away
6.1 BLEU shows its limits
What works for word-for-word document translation does not work for contextual live interpretation. Translation is about matching words. Interpretation is about preserving meaning. BLEU was built for the first job and a generation of statistical machine translation systems was the right context for it. A clinician asking a patient about grapefruit juice interactions is the wrong context. The metric that won the 2000s is not the metric that grades patient safety in the 2020s.
6.2 The four properties have not changed, the toolbox has
Translation accuracy, medical-term retention, semantic adequacy and cultural tone are still the four things any serious quality program scores against.
What has changed is who does the scoring.
MQM, which used to be a slow manual protocol with expert reviewers tagging segments by hand, now mainly runs automatically through LLM-as-judge systems that apply its taxonomy at scale. The framework is the same. The volume is hundreds of thousands of utterances a week instead of dozens. Pairing those automated evaluators with human reviewer escalation on samples and flagged cases is what modern AI medical interpretation actually requires.
6.3 The evaluation stack is necessary and still evolving
At this stage all parties have to acknowledge that frameworks are being built, tested and adjusted in real time. Evaluation is not optional. The vendor's responsibility is to run internal audits on the quality of interpretation. The medical site's responsibility is to choose the right vendor and to build internal protocols on top. Both sides are operating in a new and rapidly evolving environment, which is why the two-sided responsibility model is not a hedge but the only one that works.
The operational test for a CMIO or language access director is simple. Ask the vendor to walk you through their evaluation stack the way you'd ask a hospital lab to walk you through quality control. Ask three things: what gets measured on every encounter, what triggers a human review and what happens when a human reviewer disagrees with the AI. This is also the regulatory frame buyers are being held to under Section 1557 and its meaningful access requirements, which the HHS Office for Civil Rights applies to language assistance services in healthcare. A vendor with real answers is running a quality program. Reach out to us about No Barrier's quality program.