Data
NTCIR12 MathIR Task
Dataset Release (NTCIR12)
 For data related to the NTCIR12 MathIR Task, please visit the NTCIR12 MathIR page of the NTCIR Project.
ArXiv document set (the same as NTCIR11 Math2)
 105,120 scientific papers, divided automatically into a total of 8,301,578 search units and containing about 60 million math formulae.
 Converted into HTML5 and XHTML5 formats by the KWARC project (http://kwarc.info/).
 Each search unit is stored as a separate file in both HTML5 and XHTML5. (Either format alone is sufficient for the task; choose whichever you prefer.)
 From the following arXiv categories: math, cs, physics:math-ph, stat, physics:hep-th, physics:nlin
 WARNING: Each of the HTML5 and XHTML5 directories requires about 173 GB when uncompressed.
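Given the size warning above, it may be worth checking free disk space before unpacking. The following is only a minimal sketch: the function names and the `target_dir` argument are hypothetical, and it assumes that "173G" means 173 × 10^9 bytes.

```python
import shutil

# Assumption: "173G" is read as 173 * 10**9 bytes; adjust if binary units were meant.
REQUIRED_BYTES_PER_FORMAT = 173 * 10**9


def required_bytes(formats: int = 1) -> int:
    """Bytes needed to hold `formats` uncompressed directories (HTML5 and/or XHTML5)."""
    return formats * REQUIRED_BYTES_PER_FORMAT


def has_room(target_dir: str, formats: int = 1) -> bool:
    """Return True if the filesystem holding target_dir has enough free space."""
    free = shutil.disk_usage(target_dir).free
    return free >= required_bytes(formats)
```

Since one format is enough for the task, calling `has_room(target_dir)` with the default `formats=1` is usually the relevant check; pass `formats=2` only if both directories will be unpacked.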
Wikipedia Corpus (NEW)

Note: This corpus was updated in February 2016, during the course of the competition. The links below point to the final version used to produce the NTCIR12 results, along with a new ‘formula appearance only’ version of the data set.
 README (corpus overview, formula representation, description of data construction)

Dataset (.tar.bz2 archive, 450 MB compressed)
 Formula Presentation MathML + LaTeX only (.tar.bz2 archive, 17.2 MB compressed). See 00_README.md in the archive for details.
 The document set contains a total of 319,689 articles and 592,443 (tagged) formulae.
 The collection is broken up into ‘math’ articles containing <math> tags, and ‘text’ articles that do not. Here is a summary of the contents of the corpus:
 MathTagArticles (~10% of collection, containing <math> tags)
 31,839 articles (.html files, stored in 16 .tar.bz2 archive files)
 580,068 formulae
 TextArticles (~90% of collection, without <math> tags)
 287,850 articles (.html files, stored in 144 .tar.bz2 archive files)
 12,375 formulae (many are very small, e.g. isolated symbols).
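The summary figures above can be cross-checked with simple arithmetic, and the math/text split can be spot-checked on a downloaded archive. The sketch below is illustrative only: `count_math_tags` is a hypothetical helper, the internal layout of the archives is not described here, and the tag count is a naive substring match rather than a real HTML parse.

```python
import tarfile

# Figures quoted in the corpus summary above.
MATH_ARTICLES, TEXT_ARTICLES = 31_839, 287_850
MATH_FORMULAE, TEXT_FORMULAE = 580_068, 12_375

total_articles = MATH_ARTICLES + TEXT_ARTICLES  # matches the stated 319,689
total_formulae = MATH_FORMULAE + TEXT_FORMULAE  # matches the stated 592,443
math_share = MATH_ARTICLES / total_articles     # just under 0.10, i.e. "~10%"


def count_math_tags(archive_path: str) -> dict:
    """Map each .html member of a .tar.bz2 archive to its number of '<math' openings."""
    counts = {}
    with tarfile.open(archive_path, "r:bz2") as archive:
        for member in archive:
            # Skip directories and anything that is not an .html article.
            if not (member.isfile() and member.name.endswith(".html")):
                continue
            data = archive.extractfile(member).read()
            counts[member.name] = data.count(b"<math")
    return counts
```

Reading members directly from the archive avoids unpacking all of the .tar.bz2 files to disk just to inspect them.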
Available Topics and Data Sets from Earlier Competitions (NTCIR10, NTCIR11)
 NTCIR11 Math Topic data and collections may be found here: http://research.nii.ac.jp/ntcir/permission/ntcir11/permenMATH.html
 NTCIR10 Math Topic data is downloadable from IDR/NII, Informatics Research Data Repository, at:
http://www.nii.ac.jp/cscenter/idr/en/ntcir/ntcir.html. The NTCIR10 Test Collection is available here: http://research.nii.ac.jp/ntcir/data/dataen.html
Information about file formats and data sets is provided below.
NTCIR11 Math2 Task
Initial document set
 9,982 scientific papers, divided automatically into a total of 501,156 search units
 Converted into HTML5 and XHTML5 formats by the KWARC project (http://kwarc.info/).
 Each search unit is stored as a separate file in both HTML5 and XHTML5.
Full document set
 105,120 scientific papers, divided automatically into a total of 8,301,578 search units and containing about 60 million math formulae.
 Converted into HTML5 and XHTML5 formats by the KWARC project (http://kwarc.info/).
 Each search unit is stored as a separate file in both HTML5 and XHTML5. (Either format alone is sufficient for the task; choose whichever you prefer.)
 From the following arXiv categories: math, cs, physics:math-ph, stat, physics:hep-th, physics:nlin
 WARNING: Each of the HTML5 and XHTML5 directories requires about 173 GB when uncompressed.
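To get a feel for the granularity of the two document sets, the stated counts can be turned into rough averages. This is a back-of-the-envelope sketch; `units_per_paper` is a hypothetical helper, and the formula figure is approximate because the source gives only "about 60 M".

```python
def units_per_paper(search_units: int, papers: int) -> float:
    """Average number of search units per paper."""
    return search_units / papers


initial = units_per_paper(501_156, 9_982)       # initial document set
full = units_per_paper(8_301_578, 105_120)      # full document set
formulae_per_unit = 60_000_000 / 8_301_578      # full set, approximate

print(f"{initial:.1f} {full:.1f} {formulae_per_unit:.1f}")  # prints "50.2 79.0 7.2"
```

On these figures, a paper in the full set is split into roughly 79 search units on average, each holding about 7 formulae.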
Topics and submissions formats
 Each topic includes (i) a list of keywords and (ii) a list of formulae. Both should be taken into account in the run.
 Topics are distributed in XML. Submissions may be in either TSV or XML; please use XML if the submission contains “justifications”, i.e., formulae that support the returned document.
 A detailed description can be found in Formats for topics and submissions for NTCIR11 Math2 Task (Updated 2014/06/02)
 Sample topics can be found in “NTCIR11Math2 Topic examples” (zip-compressed XML file).
 A submission validation script is now available: Download
Wikipedia Open Task Dataset
 In addition to the regular Math2 task, there is an optional free
Wikipedia subtask that uses the same topic and submission formats.
Please refer to http://ntcir11wmc.nii.ac.jp for further information and to download the queries and dataset.
NTCIR10 Math Pilot Task
 100,000 XHTML documents transformed into XHTML+MathML using the LaTeXML converter (7.2G).
The NTCIR10 document set is available from the KWARC Project; if you wish to obtain it, please contact ntcadmmath@nii.ac.jp.
 Annotated Dataset for Math Description Extraction
45 annotated papers to support studies of semantic formula search. All natural language descriptions of mathematical expressions are manually annotated, and all mathematical expressions in the dataset are expressed using MathML Parallel Markup.
For details, please refer to the evaluation documentation for the NTCIRMath Math Understanding Subtask and to http://ntcirmath.nii.ac.jp/descext/
 For the distribution of topics and relevance judgments at the NTCIR10 Math Pilot Task, please refer to http://ntcirmath.nii.ac.jp/ntcir10math/datasets2/