Calculating Word Error Rate for Tibetan Automatic Speech Recognition
Summary
In this post I present a method for calculating Word Error Rate for Tibetan text for the evaluation of Tibetan Automatic Speech Recognition models. I begin by explaining the problem and motivating the desire for Word Error Rate scores for these models. Then, I explain in detail how Word Error Rate can be calculated and averaged to understand model performance. I demonstrate a method for performing these calculations on Tibetan texts and present a Python module I have created, tibetan_wer, that provides this functionality. Finally, I present an example of model training and how the metrics provided by tibetan_wer can provide important insights about model performance.
Introduction
Automatic Speech Recognition (ASR) is a machine learning task in which a model is trained to take in an audio file and generate text transcriptions of that audio. That is to say, when given an audio file of a human voice saying “I like cake.” the model should output the text string “I like cake”.
But, of course, no model is perfect and we need ways of measuring that imperfection. There are two common metrics for this task: Character Error Rate and Word Error Rate. For both of these metrics, a lower score is better.
Character Error Rate (CER) calculates how many character-level mistakes are made in the transcription. Thus, if “I like cake.” is mistakenly rendered as “I like ceke.”, then the Character Error Rate reflects the incorrect substitution of “e” for “a”. There would be an additional penalty then if the output was “I like cekes”, because in addition to the substitution of “e” for “a”, there is also the incorrect addition of an “s”.
Word Error Rate (WER), on the other hand, calculates word-level mistakes in the transcription. Thus the mistaken outputs “I like ceke.” and “I like cekes” would both be counted as having one incorrect substitution, “ceke” for “cake” or “cekes” for “cake”, respectively.
Calculating CER for Tibetan is no different from calculating CER for any other language and can be done with any of the typical libraries for this task, such as jiwer. However, the ambiguity of where words begin and end in written Tibetan makes accurately calculating WER with standard libraries impossible.
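For comparison, CER for a pair of Tibetan strings can be computed directly with jiwer, since it needs no knowledge of word boundaries. Here is a minimal sketch, assuming a recent version of jiwer (older releases name the arguments slightly differently):

import jiwer

reference = "འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཚལ་ལོ༔"
prediction = "གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཚལ་ལོ༔"

# jiwer.cer compares the two strings character by character,
# so Tibetan word segmentation is not needed here.
print(jiwer.cer(reference, prediction))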
In general, we can simply forgo the use of WER for Tibetan ASR, albeit at the cost of meaningful insights into model performance. However, we encounter major problems when we consider using speech input for translation.
Consider the case of translating English to Tibetan. If a speaker says “I like cake.” and the audio is correctly transcribed as “I like cake.” then we might expect a translation model to output “ང་ནི་ཀེཀ་ལ་དགའ།”. If the transcription is improperly rendered as “I like ceke.” then we would probably prefer that the translation model treat this as a typo or a mistake and produce the same translation.
However, if the ASR model provides the text “I like coke.” we must now consider whether the speaker intends to say that they like “cake” and there has been a mistake, or if the speaker intends to convey that they enjoy Coca-cola. The ideal translation would reflect this difference.
Understanding how likely it is that a model has mistaken a single character, as opposed to mistaking an entire word, can be an important factor in how we handle these kinds of mistakes.
We could distinguish between a model that regularly mistakes words from a model that simply makes mistakes about individual characters by looking at the relative WER and CER scores. Thus, there is reason to believe that WER matters, and that it would be beneficial to have a method for calculating WER for Tibetan.
Word Error Rate
Word Error Rate (WER) is a measure of word-level mistakes in the transcription of audio relative to some “correct” transcription. For two strings of text, a prediction and a reference, the WER of the prediction is measured by summing the number of incorrect insertions, deletions, and substitutions relative to the reference and then normalizing this sum by dividing by the total number of words in the reference.
An “insertion” (I) is a word that appears in a place in the prediction but not in the reference. A “deletion” (D) is a word that appears in a place in the reference but not in the prediction. A “substitution” (S) is a swapping of a correct word in the reference for an incorrect word in the prediction.
For example, if the reference is “I like cake”, an insertion would be “I really like cake”, a deletion would be “I cake.” and a substitution would be “I like coke.” The total number of words (N) in the reference is 3.
WER has a simple formula: WER = (I + D + S) / N
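For example, the substitution case above (“I like coke.” for “I like cake”) has I = 0, D = 0, and S = 1, so WER = (0 + 0 + 1) / 3 ≈ 0.33.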
In practice, calculating a score computationally can be more complicated, though. This calculation is equivalent to calculating the Levenshtein Distance. From Wikipedia:
The Levenshtein distance between two strings a, b (of length |a| and |b| respectively) is given by lev(a, b), where:

lev(a, b) =
- |a|, if |b| = 0
- |b|, if |a| = 0
- lev(tail(a), tail(b)), if head(a) = head(b)
- 1 + min( lev(tail(a), b), lev(a, tail(b)), lev(tail(a), tail(b)) ), otherwise

Here the tail of some string x is the string of all but the first character of x (i.e. tail(x_0 x_1 … x_n) = x_1 x_2 … x_n), and head(x) is the first character of x (i.e. head(x_0 x_1 … x_n) = x_0).
Martin Thoma provides pseudo-code on his website for calculating the Levenshtein Distance between two strings.
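A minimal Python sketch of the same dynamic-programming idea, applied to lists of words rather than individual characters so that the resulting distance maps directly onto WER, might look like this (my own illustration, not the exact implementation used in tibetan_wer):

def levenshtein(reference, prediction):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn the reference word list into the prediction."""
    m, n = len(reference), len(prediction)
    # dist[i][j] holds the distance between the first i reference words
    # and the first j predicted words.
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i  # delete all i reference words
    for j in range(n + 1):
        dist[0][j] = j  # insert all j predicted words
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if reference[i - 1] == prediction[j - 1]:
                dist[i][j] = dist[i - 1][j - 1]  # words match, no cost
            else:
                dist[i][j] = 1 + min(dist[i - 1][j],      # deletion
                                     dist[i][j - 1],      # insertion
                                     dist[i - 1][j - 1])  # substitution
    return dist[m][n]

# One substitution ("coke" for "cake") gives a distance of 1,
# so WER = 1 / 3 for the three-word reference.
print(levenshtein(["I", "like", "cake"], ["I", "like", "coke"]))  # 1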
Averaging WER
In ASR applications, a model is typically generating large numbers of predictions to be compared against a large set of references. As such, an average WER score is desirable. There are two common ways of calculating this average: a micro-average, and a macro-average.
The micro-average of a set of WER results sums all errors across the batch and divides by the total number of reference words. This approach is useful for getting a sense of the model’s overall accuracy across an entire corpus, rather than at the level of individual predictions. This is the typical average and is the default implementation.
The macro-average computes WER for each example and then averages the WER scores. This approach treats every prediction as having equal weight.
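To see the difference, consider a hypothetical batch of two examples: one prediction with 1 error against a 2-word reference, and another with 0 errors against an 8-word reference. The micro-average is (1 + 0) / (2 + 8) = 0.10, while the macro-average is (1/2 + 0/8) / 2 = 0.25, because the short, error-prone example carries as much weight as the long, correct one.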
However, this still requires that Tibetan text be segmented into identifiable words. WER is straightforward for languages whose writing system provides predictable breaks between words (e.g. English), but Tibetan does not do this.
Luckily, the Botok tokenizer performs reliable word segmentation on Tibetan text!
Calculating Word Error Rate for Tibetan Text
The Botok tokenizer takes in a string of Tibetan text and produces a list of “token” objects. Each of these objects consists of a dictionary of information about a given word in the original string.
For example, we can use Botok to tokenize the string “འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཚལ་ལོ༔”. Botok will then return a list of tokens, the first of which is shown below.
text: "འཇམ་དཔལ་"
text_cleaned: "འཇམ་དཔལ་"
text_unaffixed: "འཇམ་དཔལ་"
syls: ["འཇམ", "དཔལ"]
pos: PROPN
lemma: འཇམ་དཔལ་
senses: | pos: PROPN, freq: 12647, affixed: False, lemma: འཇམ་དཔལ་ |
char_types: |CONS|CONS|CONS|TSEK|CONS|CONS|CONS|TSEK|
chunk_type: TEXT
freq: 12647
syls_idx: [[0, 1, 2], [4, 5, 6]]
syls_start_end: [{'start': 0, 'end': 4}, {'start': 4, 'end': 8}]
start: 0
len: 8
We can then take the ‘text’ field from each token and construct a version of the original string that has been split into individual words. Once the strings have been split, we can calculate WER as described in the previous section.
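A sketch of this segmentation step, assuming Botok's default configuration (the preprocessing inside tibetan_wer may differ in detail), might look like this:

from botok import WordTokenizer

# Building the tokenizer loads (and, on first use, downloads) Botok's default wordlist.
wt = WordTokenizer()

text = "འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཚལ་ལོ༔"
tokens = wt.tokenize(text)

# Keep only the surface text of each token to obtain a word-segmented
# version of the original string, ready for a standard WER calculation.
words = [token.text for token in tokens]
print(words)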
Using tibetan_wer
tibetan_wer is a Python module that uses Botok to perform word segmentation on Tibetan text and then calculates the WER (both micro- and macro-averaged) between a list of ‘prediction’ strings and a list of ‘reference’ strings. It also provides counts of the total number of substitutions, insertions, and deletions. These values are returned as a dictionary.
The module can be installed like so:
pip install --upgrade tibetan_wer
In the simplest case, usage might look like this:
from tibetan_wer.metrics import wer
prediction = ['གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཚལ་ལོ༔']
reference = ['འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཚལ་ལོ༔']
result = wer(prediction, reference)
print(f"Micro-Average WER Score: {result['micro_wer']:.3f}")
print(f"Macro-Average WER Score: {result['macro_wer']:.3f}")
print(f"Substitutions: {result['substitutions']:.3f}")
print(f"Insertions: {result['insertions']:.3f}")
print(f"Deletions: {result['deletions']:.3f}")
This will output:
Micro-Average WER Score: 0.111
Macro-Average WER Score: 0.111
Substitutions: 0.000
Insertions: 0.000
Deletions: 1.000
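In this example, the prediction is missing only the word འཇམ་དཔལ་, which Botok segments as a single token, so there is one deletion out of nine reference words and the WER is 1/9 ≈ 0.111. With only one sentence pair, the micro- and macro-averages are necessarily identical.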
tibetan_wer also supports the Syllable Error Rate (SER). The SER is identical to WER except that the unit of measurement is syllables rather than words. Syllables are extracted from input strings by splitting them at each tsek. This functionality was suggested by Ganga Gyatso. SER can be used very similarly to WER, as seen in the code block below.
from tibetan_wer.metrics import ser
prediction = ['གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཚལ་ལོ༔']
reference = ['འཇམ་དཔལ་གཞོན་ནུར་གྱུར་པ་ལ་ཕྱག་འཚལ་ལོ༔']
result = ser(prediction, reference)
print(f"Micro-Average SER Score: {result['micro_ser']:.3f}")
print(f"Macro-Average SER Score: {result['macro_ser']:.3f}")
print(f"Substitutions: {result['substitutions']:.3f}")
print(f"Insertions: {result['insertions']:.3f}")
print(f"Deletions: {result['deletions']:.3f}")
Which will output:
Micro-Average SER Score: 0.200
Macro-Average SER Score: 0.200
Substitutions: 0.000
Insertions: 0.000
Deletions: 2.000
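In syllable terms, the prediction is missing the two syllables འཇམ and དཔལ out of the ten syllables in the reference, so the SER is 2/10 = 0.200. As before, with a single sentence pair the micro- and macro-averages coincide.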
However, the intended use case is as part of assessing model training. To use tibetan_wer for this, you can define custom metrics for model training like so:
import evaluate
from tibetan_wer.metrics import wer as tib_wer, ser as tib_ser
cer_metric = evaluate.load("cer")
def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    # we do not want to group tokens when computing the metrics
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    cer = cer_metric.compute(predictions=pred_str, references=label_str)

    tib_wer_res = tib_wer(predictions=pred_str, references=label_str)
    tib_ser_res = tib_ser(predictions=pred_str, references=label_str)

    macro_wer = tib_wer_res['macro_wer']
    micro_wer = tib_wer_res['micro_wer']
    word_subs = tib_wer_res['substitutions']
    word_ins = tib_wer_res['insertions']
    word_dels = tib_wer_res['deletions']

    macro_ser = tib_ser_res['macro_ser']
    micro_ser = tib_ser_res['micro_ser']
    syl_subs = tib_ser_res['substitutions']
    syl_ins = tib_ser_res['insertions']
    syl_dels = tib_ser_res['deletions']

    return {"cer": cer,
            "tib_macro_wer": macro_wer,
            "tib_micro_wer": micro_wer,
            "word_substitutions": word_subs,
            "word_insertions": word_ins,
            "word_deletions": word_dels,
            "tib_macro_ser": macro_ser,
            "tib_micro_ser": micro_ser,
            "syllable_substitutions": syl_subs,
            "syllable_insertions": syl_ins,
            "syllable_deletions": syl_dels
            }
You can then set the transformers trainer to use these metrics like so:
trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,  # use custom metrics
    tokenizer=processor.feature_extractor,
)
trainer.train()
Sample Results
To demonstrate the benefits of tibetan_wer, I’ve trained a Whisper model for 5 epochs using data from openpecha/stt-training-data.
Whisper is a speech recognition model developed by OpenAI. The version used for this demonstration is the ‘tiny’ size, which has 39 million parameters.
The dataset, openpecha/stt-training-data, consists of recordings of spoken Tibetan along with their text transcriptions. For this demonstration, only 10,000 samples were used, with 9,000 used for training and 1,000 used for evaluation.
The model was evaluated with the standard CER and WER metrics provided by the typical evaluate and jiwer libraries, as well as with tibetan_wer.
After 5 epochs, the final values were:
| CER | Standard WER | Tib Macro WER | Tib Micro WER | Substitutions | Insertions | Deletions |
|---|---|---|---|---|---|---|
| 0.533724 | 1.033610 | 0.784910 | 0.784000 | 9159 | 1233 | 3945 |
Note that the standard WER is worse than 1.0, implying more errors than there are words in the references. However, this is a result of the inability of the standard implementation to properly segment Tibetan text: it treats every string of Tibetan that lacks whitespace as a single long word, so the only time a Tibetan output is counted as containing a correct word is when the entire output is correct. In contrast, the tibetan_wer values are lower, more closely matching the progress demonstrated by the CER metric.
Tracking these scores over the entirety of the run, we can see in the graph below that the standard WER metric never improves, even appearing to get worse over time, in contradiction with the evident improvement in the CER score. However, both the macro and micro averaged scores from tibetan_wer improve in close alignment with the CER score.
We can also see that the macro-averaged score, which treats every individual prediction as having equal weight, is initially worse than the micro-average, which measures performance across the entire evaluation set. However, as training progressed, the two metrics came to align more closely. This might tell us that the model initially struggled with certain especially difficult individual samples (which are weighted more heavily by the macro-average) but that it eventually learned to handle those samples, bringing performance on them more in line with the model’s performance on the evaluation corpus as a whole.
The ability to specifically track substitutions, insertions, and deletions is also informative here. We can see in the table above, for example, that the final model is far more likely to miss words than to add them. Tracking these metrics over the entire run, we can see in the graph below that improvements in performance come almost entirely from a reduction in substitutions and insertions. This shows that the model is getting better at correctly identifying words and at avoiding adding extra words, but remains fairly constant in how frequently it misses words entirely.
Conclusion
Word Error Rate is an important, informative metric for evaluating Automatic Speech Recognition models. However, standard methods for calculating WER cannot be used for Tibetan because of its writing system. This leaves standard WER evaluation useless, or even misleading, for Tibetan ASR models.
tibetan_wer provides a means of accurately evaluating WER for Tibetan by combining the word segmentation features of the Botok tokenizer with a custom implementation of the Levenshtein distance calculation, which allows for not only finding micro- and macro-averaged WER but also tracking total numbers of substitutions, insertions, and deletions on evaluation data.
The performance of the model presented above should not be understood as representative of performance in general. It is a very small model trained on an exceptionally small set of data, but the information about model training that can be gained by using tibetan_wer makes it a useful example of how metrics beyond the Character Error Rate can be used to better understand the performance of Tibetan ASR models.
If you are interested in using tibetan_wer for your work, or are interested in adapting it to other use cases, please do not hesitate to contact me either by commenting here or emailing me at billingsmoore [at] gmail [dot] com with “Tibetan WER” in the subject line.



