What is Mathematical Information Retrieval (MIR)?

Mathematical Information Retrieval is concerned with finding information in documents that include mathematics. This is important both for technical disciplines that use math frequently (e.g. Physics and Computer Science) and for students and members of the general public looking for introductory information online (e.g. tutorials, or information from Wikipedia and WolframAlpha). Example use cases include looking up a mathematical concept expressed using keywords and/or formulae (e.g. the Pythagorean Theorem), finding technical papers that use similar mathematical models, and browsing for documents containing a given formula. In the last case, keywords can be added to narrow the search towards specific topics or communities (e.g. Geometry, specific publishers or journals) and resources (e.g. tutorials, proofs, or computer programs).

Math Retrieval

Recent surveys on MIR are available from the links below. Research began in the mid-1990s and has become increasingly active in recent years, partly due to the NTCIR math retrieval tasks. The first general math-aware search engine was created in the early 2000s for the online NIST Digital Library of Mathematical Functions (DLMF). Additional information on MIR is available from the “References” link above.

Why is MIR difficult, and what are some of the key challenges?

Mathematical notation has dialects. Different communities use different conventions for naming variables (e.g. using ‘variance’ vs. ‘v’) and for defining operators (e.g. a horizontal line over an ‘x’ may represent Boolean negation or the average of a set of values). Individual authors also redefine and adapt notation for their immediate needs. This flexibility is beneficial for authors and readers, but makes automatic interpretation very difficult.

In addition, evaluating or computing a formula requires knowing the definitions and values represented by symbols. Automatically recovering this evaluation environment is a very difficult language processing task involving analysis of both text and mathematical notation.
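
The dependence on an evaluation environment can be illustrated with a small Python sketch. It is only an illustration, not part of any MIR system: the community labels (`statistics`, `boolean_algebra`) and all function names are hypothetical. The same notation, an ‘x’ with a horizontal line over it, is evaluated differently depending on which convention is assumed.

```python
from statistics import mean


def overbar_as_mean(x):
    """Read the overbar as the average of a set of values."""
    return mean(x)


def overbar_as_negation(x):
    """Read the overbar as Boolean negation."""
    return not x


# Hypothetical mapping from a community's convention to an interpretation.
# A real system would have to infer this from the surrounding text.
OVERBAR_INTERPRETATIONS = {
    "statistics": overbar_as_mean,
    "boolean_algebra": overbar_as_negation,
}


def evaluate_overbar(community, x):
    """Evaluate 'x with an overbar' under the given community's convention."""
    return OVERBAR_INTERPRETATIONS[community](x)


print(evaluate_overbar("statistics", [2, 4, 6]))
print(evaluate_overbar("boolean_algebra", True))
```

Here the convention is passed in explicitly. The hard part described above is that a retrieval system has no such argument and must recover the convention from the document itself.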

Given this situation, the following challenging questions arise:

  • How can we find and rank semantically similar math formulae?
  • How can we find and rank visually similar math formulae?
  • Should appearance and semantics be integrated in formula retrieval?
  • How should text and formulae be combined in queries, i.e. what type of query languages should be supported?
  • What role should text and formulae matches play in the final ranking of search results?


How are queries represented?

Queries for the task include some combination of formulae and keywords. Mathematical formulae, in their machine-readable forms, are expressed as trees. We use the formula and keywords shown in the thought bubbles above to provide an example query encoding.

Our example query is represented as shown in the XML tree below. The query formula is represented three ways: as a LaTeX string, and in two XML encodings (MathML). The formula appearance is described by the LaTeX string (demarcated by <TeXquery> tags) and Presentation MathML (demarcated by <pquery> tags), while the mathematical operations in the expression are represented in Content MathML (demarcated by <cquery> tags).

The <words> tag is used to list the query keywords (‘infinite series conditionally convergent’).
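
As a rough sketch of this encoding, the Python snippet below assembles a query with the standard library's `xml.etree.ElementTree`. Only the tag names `TeXquery`, `pquery`, `cquery` and `words` come from the description above. The wrapper elements `query` and `formula`, the function name `build_query`, and the placeholder formula `x^2` are assumptions made for illustration; the placeholder is not the formula from the example figure, and MathML namespaces are omitted for brevity.

```python
import xml.etree.ElementTree as ET


def build_query(latex, presentation_mathml, content_mathml, keywords):
    """Assemble a query holding one formula in three encodings plus keywords."""
    query = ET.Element("query")                # hypothetical wrapper element
    formula = ET.SubElement(query, "formula")  # hypothetical wrapper element

    # Appearance of the formula: LaTeX string and Presentation MathML.
    ET.SubElement(formula, "TeXquery").text = latex
    ET.SubElement(formula, "pquery").append(ET.fromstring(presentation_mathml))

    # Mathematical operations in the formula: Content MathML.
    ET.SubElement(formula, "cquery").append(ET.fromstring(content_mathml))

    # Query keywords as a single space-separated string.
    ET.SubElement(query, "words").text = " ".join(keywords)
    return query


# Placeholder formula x^2, chosen only to keep the example short.
example = build_query(
    latex="x^2",
    presentation_mathml="<math><msup><mi>x</mi><mn>2</mn></msup></math>",
    content_mathml="<math><apply><power/><ci>x</ci><cn>2</cn></apply></math>",
    keywords=["infinite", "series", "conditionally", "convergent"],
)

print(ET.tostring(example, encoding="unicode"))
```

Both MathML fragments are themselves trees: the Presentation tree records layout (a superscript), while the Content tree records the operation (a power applied to `x` and `2`).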

Example of math tree structure

How can I obtain sample topics and test collections?


Where can I learn more about previous NTCIR math retrieval tasks?

Web pages and papers from previous NTCIR Math Tasks may be found at the “References” link above.