MATERIAL

Machine Translation for English Retrieval of Information in Any Language

Intelligence Value

The MATERIAL program aims to revolutionize the way the Intelligence Community consumes foreign-language information by turning multilingual text and speech media into usable intelligence information for analysts, regardless of their language expertise.

Summary

A large portion of the ever-increasing amount of text, audio, and video data produced in today’s world is generated by populations of emerging importance in lower-resource languages. This rich source of data is of little value if the information cannot be searched effectively. Launched in October 2017, the MATERIAL program is a 47-month venture that seeks to address this challenge by building robust, automated language capabilities with limited linguistic resources, expertise, and tools.

MATERIAL’s ultimate goal is to build a Cross-Language Information Retrieval (CLIR) system that finds speech and text content in diverse lower-resource languages using English search queries. This system will allow analysts to submit queries in English and receive short English summaries of the relevant foreign-language items, with each summary making clear why the item is relevant to the analyst’s information need. Success is measured by a novel end-to-end retrieval metric that assesses the system’s ability to retrieve all relevant documents while producing few false alarms.
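The trade-off described above, finding all relevant documents while raising few false alarms, can be illustrated with a small scoring sketch. This page does not define the program's metric, so the formula below is only a generic miss/false-alarm weighting of that kind, not the program's actual definition. The names QueryResult, query_value, and mean_query_value and the weight beta are hypothetical and chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class QueryResult:
    """Documents returned for one English query, plus the ground truth."""
    retrieved: set[str]  # document IDs the system returned
    relevant: set[str]   # document IDs judged relevant to the query


def query_value(result: QueryResult, collection_size: int, beta: float = 1.0) -> float:
    """Score one query as 1 - (P_miss + beta * P_false_alarm).

    beta is an assumed weight on false alarms; it is not taken from the source.
    """
    n_relevant = len(result.relevant)
    n_nonrelevant = collection_size - n_relevant
    misses = len(result.relevant - result.retrieved)
    false_alarms = len(result.retrieved - result.relevant)
    p_miss = misses / n_relevant if n_relevant else 0.0
    p_false_alarm = false_alarms / n_nonrelevant if n_nonrelevant else 0.0
    return 1.0 - (p_miss + beta * p_false_alarm)


def mean_query_value(results: list[QueryResult], collection_size: int,
                     beta: float = 1.0) -> float:
    """Average the per-query value over queries that have relevant documents."""
    scored = [r for r in results if r.relevant]
    if not scored:
        raise ValueError("no query has any relevant document")
    return sum(query_value(r, collection_size, beta) for r in scored) / len(scored)


if __name__ == "__main__":
    results = [
        QueryResult(retrieved={"d1", "d2"}, relevant={"d1", "d2"}),  # perfect
        QueryResult(retrieved={"d3", "d9"}, relevant={"d3", "d4"}),  # 1 miss, 1 false alarm
    ]
    print(mean_query_value(results, collection_size=10, beta=2.0))
```

Under this sketch a system that returns exactly the relevant documents scores 1.0, one that returns nothing scores 0.0, and a larger beta penalizes false alarms more heavily than misses.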

The MATERIAL program will provide:

  • State-of-the-art Automatic Speech Recognition (ASR) and Machine Translation (MT) systems and models for Tagalog, Swahili, Somali, Lithuanian, Bulgarian, Pashto, Farsi, Kazakh and Georgian
  • Highly competitive models optimized for informal and formal speech and text 
  • Novel end-to-end CLIR systems available with Dockerized, exchangeable component technologies
  • Innovative ways of utilizing transfer learning methods to quickly develop models in new languages, combined with incorporation of large amounts of diverse multilingual unstructured text and speech data to drastically improve model performance on new domains and genres of data
  • Text processing tools to address morphology and divergent spelling
  • Text, audio, and video data crawlers released for U.S. government use 
  • Annotated, reusable datasets in multiple languages for CLIR, ASR, and MT research
  • Open-source neural MT framework Marian (co-funded) 
  • Bitext harvesting tool Paracrawl (co-funded)
  • Novel cross-language query-biased summary generation technique

Related Publications

To access MATERIAL program-related publications, please visit Google Scholar.


Contact Information

Program Manager

Dr. Carl Rubino

carl.rubino@iarpa.gov

301-243-2081

Contracting Office Address

Office of the Director of National Intelligence

Intelligence Advanced Research Projects Activity

Washington, DC 20511

Research Area(s)

Machine learning, Machine translation, Natural language processing, Automatic speech recognition, Cross-Language Information Retrieval

Related Program(s)

Broad Agency Announcement (BAA)

Link(s) to BAA

IARPA-BAA-16-11

Solicitation Status

CLOSED

Proposer's Day Date

September 27, 2016

BAA Release Date

January 19, 2017

BAA Question Period

January 19, 2017 — February 20, 2017

Proposal Due Date

Monday, March 20, 2017

Program Summary

Testing and Evaluation Partners

  • Massachusetts Institute of Technology Lincoln Laboratory
  • National Institute of Standards and Technology
  • University of Maryland Applied Research Laboratory for Intelligence and Security
  • Tarragon Consulting Corporation

Prime Performers

  • Johns Hopkins University
  • Raytheon BBN Technologies
  • Columbia University
  • University of Southern California Information Sciences Institute