Foreign Language Interpretation Gets a Machine Learning Boost from IARPA

Some of the hottest trending languages are Kazakh, Swahili and Pashto. Well, at least for the U.S. Intelligence Community (IC), that’s sometimes true.

It’s probably safe to say that no organization is more interested in what foreign nationals are saying and writing than the IC. This is especially true for what’s being said in widely spoken languages of U.S. adversaries, like China and Russia. However, it’s also the case for “low resource” languages that are spoken by much smaller populations around the globe, like Kazakh, Swahili and Pashto.

The perennial challenge the IC has faced is how to quickly and accurately interpret those lesser-used languages, or any language for that matter.

Using human beings to translate the quadrillions of words written and spoken by people around the world every day would be an incredibly time-intensive and expensive endeavor. Fortunately, with its Machine Translation for English Retrieval of Information in Any Language (MATERIAL) program, the Intelligence Advanced Research Projects Activity (IARPA) is revolutionizing the way the IC consumes foreign language information.

Because the program uses machine learning to turn multilingual text and speech media into usable intelligence information for analysts, regardless of their language expertise, the need for human translation is waning substantially.

“The MATERIAL program has really altered the landscape by making it possible for anyone to efficiently find information in low resource languages,” said MATERIAL Program Manager Dr. Carl Rubino. “This is a game-changer for the IC, revolutionizing the way we access important foreign language data.”

Launched in October 2017, the MATERIAL program charged its performers, including Johns Hopkins University, Raytheon BBN Technologies, Columbia University and the University of Southern California Information Sciences Institute, with building robust, automated language capabilities over a four-year period. MATERIAL’s ultimate goal was to build Cross-Language Information Retrieval (CLIR) systems that would find speech and text content in diverse lower-resource languages, using only English search queries, and succinctly relay the retrieved relevant foreign language information in English. Performers have done just that and exceeded expectations.

The CLIR systems that performers developed include state-of-the-art automatic speech recognition and machine translation systems and models, not only for Kazakh, Swahili and Pashto but also for other languages such as Tagalog, Somali, Lithuanian, Georgian, Bulgarian and Farsi.

MATERIAL technologies were recently deployed in SCALE 2021, a multinational Summer Workshop at Johns Hopkins University that is devoted to exploring topics in human language technology. This summer’s topic was Cross-Language Information Retrieval. Using lessons learned and baseline models from the program, SCALE scientists were able to develop customized CLIR capabilities for Chinese, Russian and Farsi.

“I’m thrilled this technology is taking root,” Dr. Rubino said. “With continued IC investment and championship, this relatively novel approach for data discovery should soon be a standard and reliable tool for our analysts.”