Data

For data related to the NTCIR-12 MathIR Task, please visit the NTCIR Project's NTCIR-12 MathIR page.

105,120 scientific papers, automatically divided into 8,301,578 search units in total, containing about 60 million math formulae.
Converted into HTML5 and XHTML5 formats by the KWARC project (http://kwarc.info/).
Each search unit is stored as an independent file in both HTML5 and XHTML5. (Either format is sufficient for the task; choose whichever you prefer.)
From the following arXiv categories: math, cs, physics:math-ph, stat, physics:hep-th, physics:nlin
WARNING: The HTML5 and XHTML5 directories each require about 173 GB when uncompressed.
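
Given the size noted in the warning, it may be worth checking free disk space before unpacking. The following is a minimal sketch, not part of the official distribution: the function name `has_room` and its parameters are hypothetical, and the 173 figure is read here as decimal gigabytes, which is an assumption.

```python
import shutil

# Size from the warning above, per format directory.
# Interpreting "173G" as 173 * 10**9 bytes is an assumption.
CORPUS_GB_PER_FORMAT = 173


def has_room(target_dir, required_gb=CORPUS_GB_PER_FORMAT, n_formats=1):
    """Return True if target_dir's filesystem has enough free space.

    n_formats is 1 if only HTML5 or only XHTML5 is unpacked, 2 for both.
    """
    free_bytes = shutil.disk_usage(target_dir).free
    return free_bytes >= required_gb * n_formats * 10**9


if __name__ == "__main__":
    # Check the current directory for a single format.
    print("enough space:", has_room("."))
```

Since one format is sufficient for the task, unpacking only HTML5 or only XHTML5 halves the space needed.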

Note: This corpus was updated in Feb. 2016 during the course of the competition. The links below are for the final version used to produce results for NTCIR-12, along with a new ‘formula appearance only’ version of the data set.
- README (corpus overview, formula representation, description of data construction)
- Dataset (.tar.bz2 archive, 450 MB compressed)
  - Formula Presentation MathML + LaTeX only (.tar.bz2 archive, 17.2 MB compressed); see 00_README.md in the archive for details
The document set contains 319,689 articles and 592,443 (tagged) formulae in total.
The collection is broken up into ‘math’ articles containing <math> tags, and ‘text’ articles that do not. Here is a summary of the contents of the corpus:

NTCIR-11 Math Topic data and collections may be found here: http://research.nii.ac.jp/ntcir/permission/ntcir-11/perm-en-MATH.html
NTCIR-10 Math Topic data can be downloaded from IDR/NII (Informatics Research Data Repository) at: http://www.nii.ac.jp/cscenter/idr/en/ntcir/ntcir.html
The NTCIR-10 Test Collection is available here: http://research.nii.ac.jp/ntcir/data/data-en.html

Information about file formats and data sets is provided below.

9,982 scientific papers, automatically divided into 501,156 search units in total.
Converted into HTML5 and XHTML5 formats by the KWARC project (http://kwarc.info/).
Each search unit is stored as an independent file in both HTML5 and XHTML5.

105,120 scientific papers, automatically divided into 8,301,578 search units in total, containing about 60 million math formulae.
Converted into HTML5 and XHTML5 formats by the KWARC project (http://kwarc.info/).
Each search unit is stored as an independent file in both HTML5 and XHTML5. (Either format is sufficient for the task; choose whichever you prefer.)
From the following arXiv categories: math, cs, physics:math-ph, stat, physics:hep-th, physics:nlin
WARNING: The HTML5 and XHTML5 directories each require about 173 GB when uncompressed.
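
Because each search unit is a standalone file, the formulae in one unit can be counted with a standard XML parser. The sketch below is illustrative only: it assumes the XHTML5 variant is well-formed XML and that formulae appear as `math` elements in the MathML namespace, and the function name `count_formulae` is hypothetical. The corpus README remains the authority on the actual markup.

```python
import xml.etree.ElementTree as ET

# W3C MathML namespace. That the corpus places its <math> elements in this
# namespace is an assumption, not something stated on this page.
MATHML_NS = "http://www.w3.org/1998/Math/MathML"


def count_formulae(xhtml_text):
    """Count MathML <math> elements in one XHTML5 search unit."""
    root = ET.fromstring(xhtml_text)
    return sum(1 for _ in root.iter("{%s}math" % MATHML_NS))


if __name__ == "__main__":
    # Tiny hand-written stand-in for a search unit (not real corpus data).
    sample = (
        '<html xmlns="http://www.w3.org/1999/xhtml"><body><p>Let '
        '<math xmlns="http://www.w3.org/1998/Math/MathML"><mi>x</mi></math>'
        ' be real.</p></body></html>'
    )
    print(count_formulae(sample))
```

The HTML5 variant is not guaranteed to be well-formed XML, so an HTML parser would be the safer choice for those files.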

Each topic includes (i) a list of keywords and (ii) a list of formulae. Both should be taken into account in a run.
Topics are distributed in XML form. Submissions may be in either TSV or XML form. Please use the XML form if the submission contains “justifications”, i.e., formulae that support the returned document.
A detailed description can be found in Formats for topics and submissions for NTCIR-11 Math-2 Task (Updated 2014/06/02)
Sample topics can be found in “NTCIR11-Math2 Topic examples”. (zip compressed XML file)
A submission validation script is now available for download.
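
The rules above can be sketched as a small data model. This is only an illustration of the two-part topic structure and of the TSV-versus-XML choice; all class, field, and function names are hypothetical, and the sketch does not reproduce the official XML schema or TSV column layout, which are defined in the formats document.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Topic:
    """One topic: a list of keywords plus a list of formulae (names hypothetical)."""
    topic_id: str
    keywords: List[str] = field(default_factory=list)
    formulae: List[str] = field(default_factory=list)


@dataclass
class Hit:
    """One returned document, optionally with supporting formulae."""
    doc_id: str
    score: float
    justifications: List[str] = field(default_factory=list)


def submission_form(hits):
    """Pick the submission form: XML if any hit carries justifications, else TSV."""
    return "xml" if any(h.justifications for h in hits) else "tsv"


if __name__ == "__main__":
    # Made-up identifiers for illustration only.
    hits = [Hit("doc-a", 0.9), Hit("doc-b", 0.7, justifications=["formula-1"])]
    print(submission_form(hits))
```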

In addition to the regular Math-2 task, there is an optional free Wikipedia subtask that uses the same topic and submission formats.
Please refer to http://ntcir11-wmc.nii.ac.jp for further information and to download the queries and dataset.

100,000 XHTML documents transformed into XHTML+MathML using the LaTeXML converter (7.2 GB).

The NTCIR-10 document set is available from the KWARC Project; if you wish to obtain it, please contact ntcadm-math@nii.ac.jp.

45 annotated papers to support studies of semantic formula search. All natural language descriptions of mathematical expressions are manually annotated, and all mathematical expressions in the dataset are expressed using MathML Parallel Markup.
For details, please refer to the evaluation documentation for the NTCIR-Math Math Understanding Subtask and to http://ntcir-math.nii.ac.jp/descext/

For the distribution of topics and relevance judgments from the NTCIR-10 Math Pilot Task, please refer to http://ntcir-math.nii.ac.jp/ntcir10-math/datasets-2/