Machine Translation for English Retrieval of Information in Any Language (MATERIAL)

The MATERIAL performers will develop an “English-in, English-out” information retrieval system that, given a domain-sensitive English query, will retrieve relevant data from a large multilingual repository and display the retrieved information in English as query-biased summaries. MATERIAL queries will consist of two parts: a domain specification and an English word (or string of words) that capture the information need of an English-speaking user, e.g., “zika virus” in the domain of GOVERNMENT vs. “zika virus” in the domain of HEALTH, or “asperger’s syndrome” in the domain of EDUCATION vs. “asperger’s syndrome” in the domain of SCIENCE. The English summaries produced by the system should convey the relevance of the retrieved information to the domain-limited query to enable an English-speaking user to determine whether the document meets the information needs of the query.

Current methods to produce similar technologies require a substantial investment in training data and/or language specific development and expertise, entailing many months or years of development. A goal of this program is to drastically decrease the time and data needed to field systems capable of fulfilling an English-in, English out task. Limited machine translation and automatic speech recognition training data will be provided from multiple low resource languages to enable performers to learn how to quickly adapt their methods to a wide variety of materials in various genres and domains. As the program progresses, performers will apply and adapt these methods in increasingly shortened time frames to new languages. Program data will include formal and informal genres of text and speech which will not be fully captured by the training data. Image and video are out of scope for this program.

Performers will be evaluated, relative to a baseline system, on their ability to accurately retrieve materials relevant to an English domain-specific query from a database of multi-domain, multi-genre documents in a low resource language, and their ability to convey the relevance of those documents through summaries presented to English speaking domain experts.

To develop such an end-to-end system, large multi-disciplinary teams will be required with expertise in a number of relevant technical areas including, but not limited to, natural language processing, low resource languages, machine translation, corpora analysis, domain adaptation, computational linguistics, speech recognition, language identification, semantics, summarization, information retrieval, and machine learning. Since language-independent approaches with quick ramp up time are sought, foreign language expertise in the languages of the program is not expected. IARPA anticipates that universities and companies from around the world will participate in this research program. Researchers will be encouraged to publish their findings in publicly-available, academic journals.

Contracting Office Address

Office of the Director of National Intelligence
Intelligence Advanced Research Projects Activity
Washington, DC 20511

Primary Point of Contact

Carl Rubino
Program Manager