
How Should We Measure Medical Translation Quality in the AI Era?

AI or human is not the question anymore. The real questions are quality of interpretation, quality of workflow integration and how the vendor measures their own accuracy.

Taumer Baum

Co-founder and CTO, No Barrier

Last Updated: May 27, 2026

7 Minute Read

Consider an illustrative example, constructed but representative of failure modes we see across the language access space. A clinician tells a patient "take this medication twice daily, but never with grapefruit juice." An interpretation platform renders this into the patient's language. The output reads, in back-translation, "take this medication twice daily, but never without grapefruit juice." The negation has flipped. Every other word is correct. The grammar is correct. The dose is correct. A BLEU score on that exchange would look excellent, because almost every n-gram from the reference matches the candidate. The encounter is unsafe.

Literal-overlap metrics miss the entire reason good medical interpretation is hard. Real interpretation is not word-by-word substitution. It is context-aware work: streamlining the message, handling gender ambiguity, picking the right word for the situation and restructuring sentences so they land in the target language the way they would have landed in the source. To measure quality in 2026, we have to measure what good interpretation actually does.

1. AI or human is not the question anymore

1.1 The hybrid question has settled

Health systems that have implemented AI medical interpretation have not adopted it as a replacement for human interpreters. They have adopted a hybrid workflow where AI handles routine encounters and human interpreters handle the cases AI is not yet ready for. The debate of three years ago has been overtaken by operational reality. The interesting questions have moved.

1.2 The three real questions

For a CMIO or language access director evaluating a platform in 2026, the questions are about three things.

First, the quality of the interpretation itself: how accurately does the AI render the source, in clinical terms, in the patient's dialect, in the right register?

Second, the quality of the integration into the clinical workflow: how fast is the connection, how is it documented, how does it hand off to a human when the encounter calls for it?

Third, how the vendor measures their own accuracy: what does the vendor's internal audit actually look like? What triggers a human review? What happens when a human reviewer disagrees with the AI?

As our CEO Eyal Heldenberg puts it: "the vendor has to run internal audits on the quality of interpretation." As Katy Haynes notes in her CHCF report on AI and language access in healthcare, vendors have to cover both generic benchmark testing and live encounter review. On the other end, medical sites have the responsibility to choose the right vendor and to build internal protocols on top.

At this stage all parties have to approach this with humility. We are evolving in a new environment. Frameworks are still being built, tested and adjusted. The two-sided responsibility is not a hedge. It is the only model that works in a field this young.

2. What does interpretation quality actually mean in a clinical setting?

2.1 The four properties

Four properties show up in every serious framework we have seen. They survived as the working framework because they map cleanly onto what goes wrong in a clinical encounter when interpretation fails. Each property checks a different thing and a single failure can trigger more than one.

  • Translation accuracy is a coverage check on the source. Is every clinically relevant piece of information from the source present in the output, with nothing dropped, nothing fabricated and nothing modified? If "twice daily" becomes "as needed," or "never with grapefruit juice" becomes "never without grapefruit juice," the source information has been modified and accuracy has failed.
  • Medical-term retention is a vocabulary check on the output. Are clinical terms (conditions, medications, procedures, dosing units, anatomical locations) rendered with the term a clinician in the target language would actually use? "Hypertension" is not "high blood pressure feeling." "NPO after midnight" (Latin nil per os) is not "do not eat at night."
  • Semantic adequacy: the translation preserves the meaning of the source, not just its surface. It does not flatten urgency, soften a warning or shift a directive into a suggestion. This is the criterion that catches the grapefruit juice example, and it is one of the hardest properties for any interpretation system to guarantee at scale.
  • Cultural tone: wording suits the setting, the patient and the cultural context. Pediatric language is not adult language. And dialect depth matters more than language coverage: handling Mexican Spanish but stumbling on Dominican Spanish is not really covering Spanish.

The four properties overlap deliberately. Accuracy and adequacy in particular can both fail on the same utterance (a negation flip fails both: the source information has been modified and the meaning the patient takes away is the opposite of what the clinician intended). They can also fail independently. A dropped minor detail fails accuracy but preserves adequacy. A softened directive fails adequacy but preserves accuracy. A quality program that scores all four separately catches more failure modes than any single composite score.

2.2 What is new in 2026

These properties are not new.

What is new is how to grade them when an AI is producing the interpretation in real time, hundreds of thousands of times a week, in 45+ languages. Medical interpretation (live speech) is not medical translation (documents).

A miss on translation accuracy is a wrong dose or a flipped negation. A miss on medical-term retention is a clinician misunderstanding the patient. A miss on semantic adequacy is a softened safety warning or a flattened directive. A miss on cultural tone is an eroded encounter. Every quality program in the field has to grade all four.

3. Why a broader definition of quality is overdue

3.1 The hidden assumption in literal metrics

The deeper problem with literal-overlap metrics is the assumption they bake in: that there is a "correct" target-language version of every source sentence and quality is how closely the candidate matches it. That assumption was never quite right and it is now visibly wrong. A skilled medical interpreter, human or AI, does not produce the closest possible word-by-word match. They produce the version that lands correctly in the target language and the target context. That work has at least four dimensions literal metrics cannot see: streamlining redundancy, gender disambiguation, choosing the right word for the situation and restructuring sentences.

3.2 Two concrete examples

Gender disambiguation. Many languages mark gender on verbs, adjectives or articles where English does not. Concrete example: a Brazilian Portuguese-speaking patient is in the room with her adult son. The clinician says "she's been managing her blood sugar well this month." A literal translation that defaults to the masculine form (the statistically more common default in many systems) renders the sentence as if the clinician is praising the son rather than the patient. The patient hears it. The son hears it. Trust takes a hit immediately. A context-aware system carries the established subject through the discourse and uses the correct gendered form.

Choosing the right word for the situation. Many clinical terms have multiple valid translations and the right one depends on the patient, the encounter and the register. The clinician says "we'd like to draw some blood for labs." In Spanish, this can be rendered as the formal "vamos a extraerle sangre para los análisis," the everyday "le vamos a sacar sangre para los exámenes" or the colloquial "le vamos a sacar sangre." A literal metric would treat all three as equally valid because they all map back to the source. A context-aware system picks the one that fits the patient and the encounter type.

4. Why BLEU is not enough

4.1 BLEU is linear, interpretation is context-dependent

BLEU was introduced in 2002 by Papineni and colleagues at IBM as an automatic alternative to slow, expensive human evaluation. It counts how many short word sequences (n-grams) in the candidate translation overlap with one or more human reference translations and produces a score between 0 and 1. The math is linear: more overlap is a higher score, regardless of what the words actually mean in context.
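
To make the overlap counting concrete, here is a minimal sketch of a sentence-level BLEU applied to the grapefruit juice example from the introduction. It is a simplification under stated assumptions: a single reference, whitespace tokenization, no smoothing, and the function names `ngrams` and `bleu` are ours for illustration. Standard toolkits add tokenization rules and smoothing, so their numbers will differ.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous run of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(reference, candidate, max_n=4):
    """Simplified BLEU: clipped n-gram precisions, geometric mean, brevity penalty."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # unsmoothed: one empty n-gram order zeroes the score
        log_precisions.append(math.log(matched / total))
    # Brevity penalty only punishes candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)


reference = "take this medication twice daily but never with grapefruit juice"
flipped = "take this medication twice daily but never without grapefruit juice"
paraphrase = "take this medicine two times a day and avoid grapefruit juice"

flipped_score = bleu(reference, flipped)        # unsafe output, about 0.71
paraphrase_score = bleu(reference, paraphrase)  # safe output, 0.0 without smoothing
```

Under these assumptions the flipped negation scores about 0.71, while a paraphrase that keeps the patient safe scores 0.0 because it shares no three-word sequence with the reference. The ranking is the opposite of the clinical one.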

That is the central failure for medical interpretation. An English idiom like "cold turkey" illustrates it in one line. A clinician tells a patient "you'll need to stop the medication cold turkey." A literal-overlap metric is satisfied if the target language output contains the closest word-for-word match. In Spanish, that match is "pavo frío," literally a cold bird. The correct interpretation is "dejar de golpe" or "abstinencia repentina" (stop abruptly or sudden withdrawal). BLEU rewards the wrong answer because the wrong answer overlaps with the source more literally than the right one.

4.2 Document translation versus live interpretation

BLEU was designed to grade document translation: a source text and a target text, both static, both reviewable, both with reference translations available. Live medical interpretation is a different problem. It is bidirectional. It is real-time. It involves turn-taking, repair, multi-speaker dynamics and the context-aware behaviors above. Most of what makes live interpretation hard does not exist in the document-translation setting BLEU was built for. Using BLEU to grade live interpretation is like using a written-exam rubric to grade a clinical viva.

4.3 BLEU scores errors, it does not annotate them

Even when BLEU correctly identifies that a translation is bad, it does not say why it is bad. The score is a single number that does not distinguish a wrong medication term from a softened safety warning from a register mismatch. For a vendor trying to improve their system, this is the difference between a thermometer and a diagnosis. Modern evaluation frameworks produce structured, per-criterion scores that point to the specific failure mode. BLEU points nowhere. For a longer treatment of how the underlying models have changed, see our earlier post on LLM capabilities and medical terminology accuracy.

5. The modern stack: MQM in software, with human escalation

5.1 MQM is the framework, now running automatically

Between BLEU and the modern evaluation stack sits Multidimensional Quality Metrics (MQM). MQM emerged in the early 2010s as an error-typology framework: a reviewer tags each error against a structured taxonomy (accuracy, fluency, terminology, style) and weights each error by severity (minor, major, critical). MQM is the closest thing the translation industry has to a clinical lab's quality control protocol and the four properties above map directly onto its top-level categories.
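
As a rough illustration of how severity-weighted tagging works, the sketch below totals MQM-style penalties per category. The weights (1, 5, 25), the `MqmError` type and the `penalty_by_category` helper are assumptions made for this example, not a description of any particular vendor's scoring model. Real MQM deployments choose their own weights and normalization.

```python
from dataclasses import dataclass

# Illustrative severity weights (an assumption; programs pick their own).
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 25}


@dataclass(frozen=True)
class MqmError:
    category: str  # "accuracy", "fluency", "terminology" or "style"
    severity: str  # "minor", "major" or "critical"
    note: str      # reviewer's free-text explanation


def penalty_by_category(errors):
    """Sum severity weights per category so each failure mode stays visible."""
    totals = {}
    for error in errors:
        weight = SEVERITY_WEIGHTS[error.severity]
        totals[error.category] = totals.get(error.category, 0) + weight
    return totals


tagged = [
    MqmError("accuracy", "critical", "negation flipped: 'with' became 'without'"),
    MqmError("style", "minor", "register slightly too formal for the patient"),
]
penalties = penalty_by_category(tagged)  # {"accuracy": 25, "style": 1}
```

Keeping the totals per category, rather than collapsing them into one number, is what lets a reviewer see that the critical problem here is accuracy and not style.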

For most of its history MQM was operationalized through expert human reviewers tagging by hand. That made it accurate but slow, at roughly five to fifteen minutes per segment, which is fine for a research benchmark and impossible for a platform producing hundreds of thousands of utterances a week. What has changed in the last two years is that MQM has been automated. Systems like GEMBA-MQM and multi-agent LLM evaluators now apply MQM-style tagging without a human in the loop and they form the basis of the modern LLM-as-judge stack. The strongest LLM judges for translation are MQM, implemented in software.

Every interpreted utterance can be scored against the four properties and the context-aware behaviors in close to real time. The judge produces a structured score per criterion, not a single number. A semantic-adequacy score below "high" sends the utterance to a review queue. In one representative example, Mayo Clinic researchers evaluating LLM-generated EHR summaries found that an LLM-as-Judge framework achieved an intraclass correlation coefficient of 0.818 with human evaluators and completed evaluations in 22 seconds.
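
A per-criterion score with a review trigger might look like the following sketch. The three-level `Band` scale, the property keys and the `review_reasons` helper are hypothetical. Only the rule that semantic adequacy below "high" triggers review comes from the description above, and the thresholds for the other three properties are placeholders.

```python
from enum import IntEnum


class Band(IntEnum):
    LOW = 0
    MEDIUM = 1
    HIGH = 2


# Minimum acceptable band per property. The semantic-adequacy threshold
# follows the text; the other three are illustrative assumptions.
REQUIRED = {
    "translation_accuracy": Band.MEDIUM,
    "medical_term_retention": Band.MEDIUM,
    "semantic_adequacy": Band.HIGH,
    "cultural_tone": Band.MEDIUM,
}


def review_reasons(scores):
    """Return the properties that fall below their required band."""
    return [name for name, minimum in REQUIRED.items() if scores[name] < minimum]


utterance_scores = {
    "translation_accuracy": Band.HIGH,
    "medical_term_retention": Band.HIGH,
    "semantic_adequacy": Band.MEDIUM,  # a softened warning, for example
    "cultural_tone": Band.HIGH,
}
reasons = review_reasons(utterance_scores)  # ["semantic_adequacy"]
needs_review = bool(reasons)
```

Returning the list of failing properties, not just a yes or no, tells the human reviewer where to look first.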

5.2 Human escalation is non-negotiable

LLM judges are not perfect. Recent global health research shows that even the highest-performing LLM-judge achieved human-equivalent evaluations on only four of eleven criteria, which is why a human review layer is required. Credentialed medical interpreters and bilingual clinicians review a sampled percentage of encounters plus all flagged cases and their judgments feed back into the judge's prompts. The evaluation layer mirrors the operational layer where providers escalate from AI to human interpreters: AI at scale, human at the edges. One thing standard MQM was not built for is the live-conversation dimensions that matter most in medical interpretation, including turn-taking, repair and multi-speaker dynamics. Extending MQM to those dimensions is part of what a serious medical interpretation evaluation stack does.
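
The "sample plus all flagged cases" rule can be sketched in a few lines. This is one possible approach, not a documented implementation: the 5% default rate, the `in_sample` and `needs_human_review` names and the choice of hashing the encounter ID (so the same encounter always gets the same decision) are all assumptions for illustration.

```python
import hashlib


def in_sample(encounter_id: str, rate: float) -> bool:
    """Deterministically place roughly `rate` of encounters in the audit sample."""
    digest = hashlib.sha256(encounter_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform integer in [0, 2**64)
    return bucket < rate * 2**64


def needs_human_review(encounter_id: str, flagged: bool, rate: float = 0.05) -> bool:
    """Every flagged encounter is reviewed; unflagged ones only if sampled."""
    return flagged or in_sample(encounter_id, rate)
```

Hashing instead of calling a random number generator keeps the audit sample reproducible, which matters when a buyer asks the vendor to show which encounters were reviewed and why.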

6. What buyers should take away

6.1 BLEU shows its limits

What works for word-for-word document translation does not work for contextual live interpretation. Translation is about matching words. Interpretation is about preserving meaning. BLEU was built for the first job and a generation of statistical machine translation systems was the right context for it. A clinician asking a patient about grapefruit juice interactions is the wrong context. The metric that won the 2000s is not the metric that grades patient safety in the 2020s.

6.2 The four properties have not changed, the toolbox has

Translation accuracy, medical-term retention, semantic adequacy and cultural tone are still the four things any serious quality program scores against.

What has changed is who does the scoring.

MQM, which used to be a slow manual protocol with expert reviewers tagging segments by hand, now mainly runs automatically through LLM-as-judge systems that apply its taxonomy at scale. The framework is the same. The volume is hundreds of thousands of utterances a week instead of dozens. Pairing those automated evaluators with human reviewer escalation on samples and flagged cases is what modern AI medical interpretation actually requires.

6.3 The evaluation stack is necessary and still evolving

At this stage all parties have to acknowledge that frameworks are being built, tested and adjusted in real time. Evaluation is not optional. The vendor's responsibility is to run internal audits on quality of interpretation. The medical site's responsibility is to choose the right vendor and to build internal protocols on top. Both sides are operating in a new, rapidly evolving environment, which is why the two-sided responsibility model is not a hedge but the only one that works.

The operational test for a CMIO or language access director is simple. Ask the vendor to walk you through their evaluation stack the way you'd ask a hospital lab to walk you through quality control. Ask three things: what gets measured on every encounter, what triggers a human review and what happens when a human reviewer disagrees with the AI. This is also the regulatory frame buyers are held to under Section 1557, whose meaningful access requirements the HHS Office for Civil Rights applies to language assistance services in healthcare. A vendor with real answers is running a quality program. Reach out to us about No Barrier's quality program.

FAQs

1. What are the four properties that define medical interpretation quality in 2026?

The four properties are translation accuracy, medical-term retention, semantic adequacy and cultural tone.

2. Why is BLEU not enough for measuring medical interpretation quality?

BLEU is a literal-overlap metric that counts how many word sequences in a translation match a reference text, which makes it structurally blind to context, idioms, gender disambiguation and clinical safety. A translation can score high on BLEU while reversing a negation, softening a safety warning or picking the wrong register for the patient, all of which are unsafe in a clinical encounter. BLEU was designed for grading static document translation in research benchmarks rather than live bidirectional medical interpretation. Even when it correctly identifies that a translation is bad, the single number does not annotate which property failed or why.

3. What is MQM and how does it relate to LLM-as-judge in medical interpretation?

Multidimensional Quality Metrics (MQM) is an error-typology framework where each translation error is tagged against a structured taxonomy of accuracy, fluency, terminology and style, then weighted by severity (minor, major, critical). The strongest LLM judges for translation today, including GEMBA-MQM and multi-agent LLM evaluators, are MQM implemented in software: they apply the same structured tagging automatically that human reviewers used to do by hand. The choice today is not between MQM and LLM-as-judge but between manual MQM (slow and accurate) and automated MQM (fast and scalable, paired with human reviewer escalation on samples and flagged cases).

4. What's the maturity of evaluation systems for medical interpretation?

The evaluation environment for medical interpretation is new and still being shaped. BLEU has been ruled out as appropriate for live clinical encounters because it measures word overlap rather than meaning. MQM, now running automatically through LLM-as-judge systems like GEMBA-MQM, is the basis of the modern stack and is deployed at scale but the field as a whole is still being tested, calibrated and adjusted. This is why vendors run internal audits and medical sites build their own protocols on top, rather than relying on any single metric as a finished standard.

5. What should a CMIO ask an AI medical interpretation vendor about quality?

Ask the vendor to walk you through their evaluation stack.

Taumer Baum

Co-founder and CTO, No Barrier

Tomer is a Voice AI engineer passionate about creating intuitive, human-centered applications. He focuses on designing AI-driven tools that make healthcare communication clearer, faster and more accessible for both clinicians and patients. Tomer frequently shares insights on using AI tools, advancing interoperability and fostering technology that empowers clinicians and patients alike.


May 26, 2026

