Consider the power of Ctrl-F to find words in a document, or the usefulness of text-based information retrieval with Google. These tools are indispensable today and it’s hard to imagine life without them.
Keyword search is the task of automatically finding instances of words in speech. In high-resource languages, such as English and Chinese, one approach is to perform speech recognition on the audio to produce a transcript, and then search the transcript for the word of interest. However, this requires the recognizer to be accurate enough that its single-best transcription gets the relevant words right. Only if it meets that standard can the word simply be looked up in the transcript.
There is another aspect to this problem. Consider the case where we are interested in finding instances of a word such as “eating”. Since we’re likely interested in the concept underlying the word, we probably want to find not just instances of “eating”, but also other inflections of the word such as “eats”, “ate”, and “eat”. Again, in languages such as English, the solution is straightforward. For such high-resource languages we have well-established lexical resources that provide all possible inflections, so we can automatically search for all of them.
However, for most languages speech recognition accuracy is limited and curated resources covering all inflections are not available. What do we do for such languages? We may have limited established speech corpora and cannot train a good speech recognition system. Or, we might not have resources that cover inflections of all words we might be interested in. Or, we might be plagued with both problems.
This isn’t just a problem for a few languages. In fact, most of the world’s languages fall in this “long tail”. There are around 7,000 languages in the world, while only the most popular hundred or so have substantial resources for this task.
Such problems have been of interest to programs such as LORELEI (Strassel & Tracey, 2016). The goal of that program is to have language technology available “in the context of a rapidly emerging and quickly evolving situation like a natural disaster or disease outbreak”. This sort of problem was also of interest to the IARPA Babel program, which focused on “innovations in how to rapidly model a novel language with significantly less training data that are also much noisier and more heterogeneous than what has been used in the current state-of-the-art.” The IARPA program appears to enjoy a run-on sentence more than LORELEI, but you get the idea. (https://www.iarpa.gov/index.php/research-programs/babel)
I recently presented a paper that makes a step forward on this task at the Special Interest Group on Computational Morphology and Phonology (SIGMORPHON) workshop, which was co-located with the annual meeting of the Association for Computational Linguistics (ACL).
The work was done with collaborators at Johns Hopkins University while I was in my final month as a postdoc there. In the paper we present a pipeline that generates inflections of keywords before searching for them in speech, along with an approach for generating evaluation sets for this task.
Now, let’s get into the details.
Keyword Search
To address the first issue, keyword search is typically framed as searching for instances of a word in a lattice, which represents a large number of possible transcriptions of an utterance. Pictured below is a lattice representing possible transcriptions of the utterance “how to recognize speech using common sense”. The lattice includes “recognize speech” along with similar-sounding alternatives such as “wreck a nice beach”.
The one-best transcription is often unreliable, particularly when the speech recognition system doesn’t perform well in the target language. In such cases it’s better to search for possibilities in the lattice, as that may surface instances of the word that the single best transcription missed.
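The idea can be illustrated with a toy sketch (this is an illustration of the concept, not the paper’s implementation; `keyword_in_lattice` and the toy lattice below are made up for this post). A lattice is a directed graph whose arcs carry word hypotheses; a keyword is detected if some path through the lattice contains it, even when the one-best path doesn’t.

```python
# Toy word lattice: a dict mapping each state to its outgoing arcs,
# where each arc is a (word, next_state) pair.
lattice = {
    0: [("recognize", 1), ("wreck", 2)],
    1: [("speech", 5)],
    2: [("a", 3)],
    3: [("nice", 4)],
    4: [("beach", 5)],
}

def keyword_in_lattice(arcs, start, keyword):
    """Return True if any path from `start` traverses an arc
    labelled with `keyword` (simple iterative depth-first search)."""
    stack, seen = [start], set()
    while stack:
        state = stack.pop()
        if state in seen:
            continue
        seen.add(state)
        for word, next_state in arcs.get(state, []):
            if word == keyword:
                return True
            stack.append(next_state)
    return False
```

Even if the recognizer’s one-best path were “wreck a nice beach”, searching the lattice still finds “recognize”, since the alternative path is retained as a hypothesis.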
Inflection-Set Keyword Search
Our contribution was to search not only for a word like “recognize”, but also alternative inflections such as “recognizing”, “recognized”, and “recognizes”. We call this Inflection-Set Keyword Search. However, in applying this approach to low-resource languages, we need a way to generate those possible inflections, because in general we don’t have them already as a pre-existing list.
The basic idea is illustrated below. We take a Zulu lemma (on the left), and produce a number of possible inflections, some incorrect, some correct (in the middle). We then search for these generated inflections in speech (on the right), finding possible instances of the word.
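The two stages can be sketched as follows (a deliberately minimal illustration: the made-up suffix rules stand in for a learned inflection model, and a flat word sequence stands in for the lattice the real pipeline searches).

```python
def generate_inflections(lemma, suffixes=("", "s", "d", "ing")):
    """Overgenerate candidate surface forms for a lemma by appending
    naive suffixes; some outputs will be invalid words."""
    return {lemma + suffix for suffix in suffixes}

def search_inflection_set(speech_words, lemma):
    """Return which generated inflections occur in the speech."""
    return generate_inflections(lemma) & set(speech_words)
```

For the lemma “recognize” this generates “recognize”, “recognizes”, “recognized”, and the invalid “recognizeing”; searching “she recognized the speech” then finds “recognized”, which a search for the bare lemma would have missed.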
This is a worthwhile task to examine in the low-resource context. The paper presents a pipeline for inflection-set keyword search, so that approaches to generating inflections can be evaluated in the context of a full pipeline. This gives a meaningful way of evaluating inflection generation approaches on a real-world downstream task.
The paper compares several approaches to generating such inflections. One approach uses established resources, such as Unimorph (Kirov et al., 2018), to determine inflections. At the other extreme, no inflection resources are available in the language of interest. In that case, one option we evaluated is to predict inflections using cross-lingual distant supervision.
Here are two findings. The first is that the model is quite robust to overgeneration of inflections. That is, it’s better to err on the side of producing too many candidate inflections than too few. This is because even if you generate an inflection that isn’t valid in the language, it still needs to match what actually occurs in the speech before it can be detected downstream, so invalid candidates rarely cause spurious detections.
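A small sketch of why overgeneration is forgiving (the `detect` helper and the junk forms below are invented for illustration): adding invalid candidate inflections doesn’t change the detections, because those forms never occur in the speech.

```python
def detect(candidates, speech_words):
    """Candidates that actually occur in the speech hypotheses."""
    return set(candidates) & set(speech_words)

speech = "they were eating and she ate quickly".split()

valid = {"eat", "eats", "ate", "eating"}
overgenerated = valid | {"eatted", "ateing", "eatened"}  # invalid forms
```

Here `detect(overgenerated, speech)` equals `detect(valid, speech)`: the junk candidates simply fail to match anything.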
The other finding is that the best-performing approach, by far, is the one that uses the Unimorph inflections. It even performs better than an oracle reference measurement, which knows exactly which form of a word occurs in the test set. Paired with recent findings from work on inflection generation, this hints at what might be the best approach to inflection-set keyword search for low-resource languages going forward: invest a modest amount of time gathering inflections (1,000 or so) in the target language of interest. This allows supervised training of accurate inflection generation models, which can then be plugged into the keyword search pipeline.
Conclusion
In summary, the paper presents an inflection-set keyword search pipeline, along with an evaluation set for assessing inflection generation approaches. A preliminary evaluation of inflection generation approaches, using this keyword search pipeline, indicates two key things. First, keyword search is robust to morphological overgeneration: you can generate some incorrect inflections without hurting results. Second, the results suggest it’s likely worth investing in gathering a modest amount of inflection training data in a target language, rather than relying purely on cross-lingual distant supervision.
References
Strassel, S., & Tracey, J. (2016). LORELEI language packs: data, tools, and resources for technology development in low resource languages. LREC, 10–11.
Kirov, C., Cotterell, R., Sylak-Glassman, J., Walther, G., Vylomova, E., Xia, P., Faruqui, M., Mielke, S., McCarthy, A. D., Kübler, S., Yarowsky, D., Eisner, J., & Hulden, M. (2018). UniMorph 2.0: Universal Morphology. LREC.