Data

Initial document set

  • 9,982 scientific papers, automatically divided into a total of 501,156 search units
  • Converted into HTML5 and XHTML5 formats by the KWARC project (http://kwarc.info/).
  • Each search unit is stored as an independent file in both HTML5 and XHTML5 formats.

Full document set

  • 105,120 scientific papers, automatically divided into a total of 8,301,578 search units and containing about 60 million math formulae
  • Converted into HTML5 and XHTML5 formats by the KWARC project (http://kwarc.info/).
  • Each search unit is stored as an independent file in both HTML5 and XHTML5 formats. (Either format is sufficient for the task; please choose whichever you prefer.)
  • From the following arXiv categories: math, cs, physics:math-ph, stat, physics:hep-th, physics:nlin
  • WARNING: Each of the HTML5 and XHTML5 directories requires about 173 GB when uncompressed.
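
Given the size warning above, it may help to check free disk space before uncompressing. The following is a minimal sketch, not part of the task distribution: the helper names (`bytes_needed`, `has_room`), the 10% headroom, and the extraction directory are illustrative assumptions, and the quoted figure is read here as 173 × 10^9 bytes per format.

```python
import shutil

# Figure quoted above: about 173G per format directory once uncompressed.
# Assumption: "G" is read as 10**9 bytes; adjust if GiB was meant.
REQUIRED_BYTES_PER_FORMAT = 173 * 10**9


def bytes_needed(num_formats: int = 1, headroom: float = 0.10) -> int:
    """Space to reserve for `num_formats` directories plus a safety margin."""
    if num_formats not in (1, 2):
        raise ValueError("num_formats must be 1 (HTML5 or XHTML5) or 2 (both)")
    return int(REQUIRED_BYTES_PER_FORMAT * num_formats * (1 + headroom))


def has_room(free_bytes: int, num_formats: int = 1, headroom: float = 0.10) -> bool:
    """True if `free_bytes` covers the uncompressed data plus the margin."""
    return free_bytes >= bytes_needed(num_formats, headroom)


if __name__ == "__main__":
    target_dir = "."  # hypothetical extraction directory
    free = shutil.disk_usage(target_dir).free
    print(f"free: {free / 10**9:.1f} GB, enough for one format: {has_room(free)}")
```

Since only one of the two formats is needed for the task, checking with `num_formats=1` is usually sufficient.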

Topic and submission formats

  • Each topic includes (i) a list of keywords and (ii) a list of formulae. Both should be taken into account in a run.
  • Topics are distributed in XML. Submissions may be in either TSV or XML; please use XML if the submission contains “justifications”, i.e., formulae that support a returned document.
  • A detailed description can be found in “Formats for topics and submissions for NTCIR-11 Math-2 Task” (updated 2014/06/02).
  • Sample topics can be found in “NTCIR11-Math2 Topic examples”. (zip compressed XML file)
  • A submission validation script is now available: Download