Data


NTCIR-12 MathIR Task

Dataset Release (NTCIR-12)

ArXiv document set (the same as NTCIR-11 Math-2)

  • 105,120 scientific papers, automatically divided into 8,301,578 search units and containing about 60 million math formulae in total.
  • Converted into HTML5 and XHTML5 formats by the KWARC project (http://kwarc.info/).
  • Each search unit is stored as a separate file in both HTML5 and XHTML5. (Either format is sufficient for the task; choose whichever you prefer.)
  • From the following arXiv categories: math, cs, physics:math-ph, stat, physics:hep-th, physics:nlin
  • WARNING: The HTML5 and XHTML5 directories each require about 173G of disk space when uncompressed.
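Given the size noted in the warning above, it may be worth checking free disk space before uncompressing. The following is a minimal sketch, not part of the official release: the function name has_room is hypothetical, and it assumes that "173G" means 173 * 10**9 bytes per format directory.

```python
import shutil

# Assumption: "173G" is read as 173 * 10**9 bytes per format directory.
PER_FORMAT_BYTES = 173 * 10**9


def has_room(path, formats=1, per_format_bytes=PER_FORMAT_BYTES):
    """Return True if the filesystem holding `path` has enough free space.

    `formats` is 1 if only HTML5 or only XHTML5 will be unpacked, 2 for both.
    """
    needed = formats * per_format_bytes
    return shutil.disk_usage(path).free >= needed


if __name__ == "__main__":
    # Check the current directory for a single-format extraction.
    print(has_room(".", formats=1))
```

Pass formats=2 if both the HTML5 and XHTML5 directories are to be unpacked on the same filesystem.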

Wikipedia Corpus (NEW)

  • Note: this corpus was updated in February 2016, during the course of the competition. The links below point to the final version used to produce the NTCIR-12 results, along with a new ‘formula appearance only’ version of the data set.
    • README (corpus overview, formula representation, description of data construction)
    • Dataset (.tar.bz2 archive, 450MB compressed)
  • The document set contains 319,689 articles and 592,443 (tagged) formulae in total.
  • The collection is broken up into ‘math’ articles containing <math> tags, and ‘text’ articles that do not. Here is a summary of the contents of the corpus:
    • MathTagArticles (~10% of collection, containing <math> tags)
      • 31,839 articles (.html files, stored in 16 .tar.bz2 archive files)
      • 580,068 formulae
    • TextArticles (~90% of collection, without <math> tags)
      • 287,850 articles (.html files, stored in 144 .tar.bz2 archive files)
      • 12,375 formulae (many are very small, e.g. isolated symbols).
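The split into ‘math’ and ‘text’ articles can be checked locally once the archives are downloaded. The sketch below is illustrative only: the function name count_math is hypothetical, and it assumes that formulae appear as literal <math ...> opening tags inside the .html members. The README describes the actual formula representation and should be treated as authoritative.

```python
import re
import tarfile

# Matches an opening <math> tag (with or without attributes), not </math>.
MATH_OPEN = re.compile(rb"<math\b", re.IGNORECASE)


def count_math(archive_path):
    """Return (html_files, files_with_math, math_tags) for one .tar.bz2 archive."""
    html_files = files_with_math = math_tags = 0
    with tarfile.open(archive_path, "r:bz2") as tar:
        for member in tar:
            # Skip directories and anything that is not an .html article.
            if not (member.isfile() and member.name.endswith(".html")):
                continue
            html_files += 1
            data = tar.extractfile(member).read()
            n = len(MATH_OPEN.findall(data))
            math_tags += n
            if n:
                files_with_math += 1
    return html_files, files_with_math, math_tags
```

Summing the three counters over all archives gives totals that can be compared with the figures listed above; small differences are possible if the corpus counts formulae differently from this simple tag match.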

Available Topics and Data Sets from Earlier Competitions (NTCIR-10, NTCIR-11)

Information about file formats and data sets is provided below.

NTCIR-11 Math-2 Task

Initial document set

  • 9,982 scientific papers, automatically divided into 501,156 search units.
  • Converted into HTML5 and XHTML5 formats by the KWARC project (http://kwarc.info/).
  • Each search unit is stored as a separate file in both HTML5 and XHTML5.

Full document set

  • 105,120 scientific papers, automatically divided into 8,301,578 search units and containing about 60 million math formulae in total.
  • Converted into HTML5 and XHTML5 formats by the KWARC project (http://kwarc.info/).
  • Each search unit is stored as a separate file in both HTML5 and XHTML5. (Either format is sufficient for the task; choose whichever you prefer.)
  • From the following arXiv categories: math, cs, physics:math-ph, stat, physics:hep-th, physics:nlin
  • WARNING: The HTML5 and XHTML5 directories each require about 173G of disk space when uncompressed.

Topics and submissions formats

  • Each topic includes (i) a list of keywords and (ii) a list of formulae. Both should be taken into account in a run.
  • Topics are distributed in XML. Submissions may be in either TSV or XML; please use XML if the submission contains “justifications”, i.e., formulae that support the returned document.
  • A detailed description can be found in Formats for topics and submissions for NTCIR-11 Math-2 Task (Updated 2014/06/02)
  • Sample topics can be found in “NTCIR11-Math2 Topic examples” (zip-compressed XML file).
  • A submission validation script is now available: Download

Wikipedia Open Task Dataset

  • In addition to the regular Math-2 task, there is an optional free
    Wikipedia subtask that uses the same topic and submission formats.
    Please refer to http://ntcir11-wmc.nii.ac.jp for further information and for the query and dataset downloads.

NTCIR-10 Math Pilot Task

    • 100,000 XHTML documents transformed into XHTML+MathML using the LaTeXML converter (7.2G).

The NTCIR-10 document set is available from the KWARC Project; if you wish to obtain it, please contact ntcadm-math@nii.ac.jp.

    • Annotated Dataset for Math Description Extraction

45 annotated papers to support studies of semantic formula search. All natural language descriptions of mathematical expressions are manually annotated, and all mathematical expressions in the dataset are expressed in MathML Parallel Markup.
For details, please refer to the evaluation documentation for the NTCIR-Math Math Understanding Subtask and to http://ntcir-math.nii.ac.jp/descext/

Annotation example on Brat (http://brat.nlplab.org/)