The NTCIR-10 Math Understanding Subtask aims to extract natural
language descriptions of mathematical expressions in a document. The goal
of this task is to identify the natural language descriptions that relate
to expressions. For example, the sentence "a variable X is defined as a
probabilistic distribution" shows that the description of "X" is "a
variable" and the description of "a variable X" is "a probabilistic
distribution".
The development data collection includes:
The annotated XML files are made by hand. The task is to design methods
able to make connections between mathematical expressions and their
descriptions.
Attention: The specifications of this task
are subject to change without notice.
In the example below, the annotation covers six mathematical expressions: MATH_0801.2412_19, MATH_0801.2412_20, MATH_0801.2412_21, MATH_0801.2412_22, MATH_0801.2412_23 and MATH_0801.2412_24.
Before Annotation
If a permutation MATH_0801.2412_19 contains the pattern
MATH_0801.2412_20 then clearly the reverse of MATH_0801.2412_21, that is
MATH_0801.2412_22, contains the reverse of MATH_0801.2412_23, which is the
pattern MATH_0801.2412_24.
After Annotation
If <description mid="19" did="1" type="P"
spid="2_1"><sdescription mid="19" did="1" sid="1" spid="2_1">a
permutation</sdescription></description><math
mid="19" spid="2_2">MATH_0801.2412_19</math>
contains <description mid="20" did="1" type="P"
spid="2_3"><sdescription mid="20" did="1" sid="1" spid="2_3">the
pattern</sdescription></description>
<math mid="20" spid="2_4">MATH_0801.2412_20</math>then
clearly <description mid="22" did="1" type="P"
spid="2_5"><sdescription mid="22" did="1" sid="1" spid="2_5">the
reverse of <math mid="21" spid="2_6">MATH_0801.2412_21</math></sdescription></description>,
that is <math mid="22" spid="2_7">MATH_0801.2412_22</math>,
<cdescription mid="24" did="1" cid="1"
spid="2_8">contains the reverse of <math
mid="23" spid="2_9">MATH_0801.2412_23</math></cdescription>,
which is the <description mid="24" did="1"
type="P" spid="2_10"><sdescription mid="24" did="1" sid="1"
spid="2_10">pattern</sdescription></description><math
mid="24" spid="2_11">MATH_0801.2412_24</math>.
We provide annotated datasets. In these datasets, all expressions are
replaced by symbols in the format MATH_{PaperNumber}_{MathID}.
For example, an original sentence "a variable X is defined as a
probabilistic distribution" is converted to "a variable MATH_1_1 is
defined as a probabilistic distribution". If you want to refer to the
content of the mathematical expressions, you can use the *.math files, in
which we provide them in MathML format.
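The placeholder format above can be recognized with a regular expression. The following is a minimal sketch, assuming that the paper number contains only letters, digits, dots, and hyphens and that the math ID is a run of digits; the names `MATH_SYMBOL` and `find_math_symbols` are illustrative and not part of the distributed data or tools.

```python
import re

# Assumed shape: MATH_<PaperNumber>_<MathID>. The paper number may contain
# dots or hyphens (e.g. 0801.2412); the math ID is assumed to be digits only.
MATH_SYMBOL = re.compile(r"MATH_(?P<paper>[A-Za-z0-9.\-]+)_(?P<math_id>\d+)")


def find_math_symbols(text):
    """Return (paper_number, math_id) pairs for every placeholder in text."""
    return [
        (m.group("paper"), int(m.group("math_id")))
        for m in MATH_SYMBOL.finditer(text)
    ]


sentence = "a variable MATH_1_1 is defined as a probabilistic distribution"
print(find_math_symbols(sentence))
```

The extracted pair can then be used to look up the corresponding expression in the *.math files; how those files are indexed is not specified here, so that step is left out.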
You can get more information about the annotation policy from the attached
file (Annotation-Brat.pdf).
The document sets included in the NTCIR-10 Math Understanding test collection are as follows.
Document | Paper ID | File name | Event: MATH | Event: MATH-GROUP | Argument of event: Description | Argument of event: C-Description | Argument of event: ShortDescription | Argument of event: Condition |
ArXiv | 0801.2412 | 0801.2412_3.xml | 4 | 0 | 1 | 0 | 1 | 0 |
ArXiv | 0801.2412 | 0801.2412_5.xml | 14 | 0 | 5 | 0 | 5 | 0 |
ArXiv | 0801.2412 | 0801.2412_6.xml | 26 | 0 | 12 | 1 | 12 | 0 |
ArXiv | 0801.2412 | 0801.2412_7.xml | 19 | 0 | 6 | 0 | 6 | 0 |
ArXiv | 0801.2412 | 0801.2412_8.xml | 39 | 0 | 16 | 1 | 16 | 0 |
ArXiv | 0801.2412 | 0801.2412_9.xml | 16 | 0 | 5 | 0 | 5 | 0 |
ArXiv | 0801.2412 | 0801.2412_10.xml | 32 | 0 | 8 | 0 | 8 | 0 |
ArXiv | 0801.2412 | 0801.2412_11.xml | 31 | 0 | 11 | 0 | 11 | 0 |
ArXiv | 0801.2412 | 0801.2412_12.xml | 27 | 0 | 9 | 2 | 9 | 0 |
ArXiv | 0801.2412 | 0801.2412_13.xml | 32 | 0 | 8 | 0 | 8 | 0 |
ArXiv | 0801.2412 | 0801.2412_14.xml | 5 | 0 | 1 | 0 | 1 | 0 |
ArXiv | 0806.4135 | 0806.4135_9.xml | 143 | 3 | 53 | 5 | 53 | 3 |
ArXiv | 0806.4135 | 0806.4135_10.xml | 65 | 0 | 13 | 0 | 13 | 3 |
ArXiv | 0806.4135 | 0806.4135_11.xml | 150 | 0 | 23 | 6 | 23 | 0 |
ArXiv | 0808.0212 | 0808.0212_3.xml | 0 | 0 | 0 | 0 | 0 | 0 |
ArXiv | 0808.0212 | 0808.0212_4.xml | 26 | 0 | 2 | 0 | 2 | 0 |
ArXiv | 0808.0212 | 0808.0212_8.xml | 12 | 0 | 0 | 0 | 0 | 0 |
ArXiv | 0808.0212 | 0808.0212_9.xml | 141 | 0 | 31 | 1 | 31 | 1 |
ArXiv | 0808.0212 | 0808.0212_10.xml | 8 | 0 | 2 | 0 | 2 | 0 |
ArXiv | 0808.0212 | 0808.0212_11.xml | 24 | 1 | 2 | 0 | 2 | 0 |
ArXiv | 0811.2449 | 0811.2449_3.xml | 5 | 0 | 0 | 0 | 0 | 0 |
ArXiv | 0811.2449 | 0811.2449_5.xml | 4 | 0 | 0 | 0 | 0 | 0 |
ArXiv | 0811.2449 | 0811.2449_6a.xml | 12 | 0 | 2 | 0 | 2 | 0 |
ArXiv | 0811.2449 | 0811.2449_6b.xml | 162 | 1 | 31 | 3 | 31 | 1 |
ArXiv | 0811.2449 | 0811.2449_6c.xml | 7 | 0 | 4 | 0 | 4 | 0 |
ArXiv | 0811.2449 | 0811.2449_6d.xml | 11 | 1 | 3 | 0 | 3 | 0 |
ArXiv | 0811.2449 | 0811.2449_6e.xml | 49 | 0 | 10 | 2 | 10 | 0 |
ArXiv | 0811.2449 | 0811.2449_6f.xml | 36 | 0 | 3 | 0 | 3 | 0 |
ArXiv | 0811.2449 | 0811.2449_7.xml | 12 | 0 | 1 | 0 | 1 | 0 |
ArXiv | 0902.4089 | 0902.4089_3.xml | 7 | 0 | 1 | 0 | 1 | 0 |
ArXiv | 0902.4089 | 0902.4089_6.xml | 16 | 0 | 9 | 3 | 9 | 0 |
ArXiv | 0902.4089 | 0902.4089_7.xml | 24 | 0 | 8 | 3 | 7 | 0 |
ArXiv | 0902.4089 | 0902.4089_9.xml | 16 | 0 | 4 | 0 | 4 | |
ArXiv | 0902.4089 | 0902.4089_11a.xml | 7 | 0 | 2 | 1 | 2 | 0 |
ArXiv | 0902.4089 | 0902.4089_11b.xml | 127 | 2 | 31 | 3 | 31 | 0 |
ArXiv | 0902.4089 | 0902.4089_11c.xml | 25 | 0 | 6 | 2 | 6 | 0 |
ArXiv | 0902.4089 | 0902.4089_12.xml | 56 | 0 | 7 | 1 | 7 | 1 |
ArXiv | 0904.0684 | 0904.0684_3.xml | 4 | 0 | 0 | 0 | 0 | 0 |
ArXiv | 0904.0684 | 0904.0684_5.xml | 2 | 0 | 0 | 0 | 0 | 0 |
ArXiv | 0904.0684 | 0904.0684_6.xml | 48 | 3 | 10 | 0 | 10 | 0 |
ArXiv | 0904.0684 | 0904.0684_7.xml | 4 | 0 | 0 | 0 | 0 | 0 |
ArXiv | 0904.0684 | 0904.0684_8.xml | 88 | 0 | 18 | 0 | 18 | 0 |
ArXiv | 0904.0684 | 0904.0684_9.xml | 44 | 0 | 8 | 2 | 8 | 0 |
ArXiv | 0904.0684 | 0904.0684_10.xml | 2 | 0 | 0 | 0 | 0 | 0 |
ArXiv | 0904.0684 | 0904.0684_11.xml | 38 | 0 | 3 | 0 | 3 | 0 |
ArXiv | 0905.1426 | 0905.1426_6.xml | 5 | 0 | 2 | 1 | 2 | 0 |
ArXiv | 0905.1426 | 0905.1426_10.xml | 66 | 0 | 25 | 4 | 25 | 0 |
ArXiv | 0905.1426 | 0905.1426_15.xml | 106 | 0 | 23 | 1 | 21 | 0 |
ArXiv | 0905.1426 | 0905.1426_26.xml | 61 | 0 | 14 | 0 | 14 | 0 |
ArXiv | 0905.1426 | 0905.1426_29.xml | 2 | 0 | 0 | 0 | 0 | 0 |
ArXiv | 0906.1240 | 0906.1240_5.xml | 16 | 0 | 9 | 1 | 9 | 0 |
ArXiv | 0906.1240 | 0906.1240_6.xml | 11 | 0 | 1 | 0 | 1 | 0 |
ArXiv | 0906.1240 | 0906.1240_7.xml | 157 | 0 | 30 | 3 | 30 | 0 |
ArXiv | 090.1544 | 090.1544_8.xml | 149 | 3 | 64 | 9 | 64 | 0 |
ArXiv | 0906.1612 | 0906.1612_3.xml | 4 | 0 | 2 | 1 | 2 | 0 |
ArXiv | 0906.1612 | 0906.1612_6.xml | 48 | 3 | 21 | 3 | 21 | 1 |
ArXiv | 0906.1612 | 0906.1612_7.xml | 166 | 7 | 63 | 6 | 63 | 0 |
ArXiv | 0906.1612 | 0906.1612_8.xml | 282 | 10 | 77 | 11 | 77 | 1 |
ArXiv | 0906.1612 | 0906.1612_9.xml | 111 | 1 | 23 | 1 | 23 | 1 |
ArXiv | 0906.1612 | 0906.1612_11.xml | 1 | 0 | 0 | 0 | 0 | 0 |
Tag | Description |
|
<section> |
</section> |
The section tag is a root tag of the document. |
<math mid="MID" spid="SPID"> |
</math> |
The content of the math tag is a mathematical expression. Each
mathematical expression has its own ID, given by the mid
attribute. The mid values are assigned sequentially (1, 2, ...) to each
mathematical expression, starting from the beginning of the paper.
The spid attribute is a span ID that is unique across
a paper. It is written in the form <section
ID>_<span ID>, where the section ID is the section
number and the span ID is the span number. Span IDs are assigned
sequentially (1, 2, ...) from the beginning of a file to each span
(mathematical expression, description, or related sequential phrase).
e.g. <math mid="10" spid="0_23"> |
<description mid="MID1 MID2 ..." did="ID1 ID2 ..."
type="TYPE1 TYPE2 ..." spid="SPID"> |
</description> |
The content of the description tag is the description of the
mathematical expression indicated by the mid attribute. The
mid attribute specifies one or more mathematical expressions
by MID1, MID2, ... . The did attribute is the ID of the description for
each mathematical expression. The type attribute specifies the type of
the description for each mathematical expression. As with the mid
attribute, each value of the type attribute, TYPE1,
TYPE2, ..., is defined for the corresponding mathematical expression. A TYPE
is a sequence of one or more of the following characters:
<description mid="10 20" did="1 1" type="P A"
spid="2_5"> means that mathematical expression 10
has description 1 with type P and mathematical expression
20 has description 1 with type A. Each mathematical
expression can have several descriptions, numbered 1, 2, ... in the
did attribute. The spid value 2_5 marks this as the fifth span in the
file for section 2.
e.g. <description mid="10" did="1" type="P"
spid="0_22"> |
<cdescription mid="MID1 MID2 ..." did="ID1 ID2 ..."
cid="ID1 ID2 ..." spid="SPID"> |
</cdescription> |
The content of the cdescription tag is a discontinued (non-contiguous)
part of the description of the mathematical expression indicated by
the mid attribute. The did attribute specifies the related
main description. The cid attribute is a unique ID of the
cdescription for a mathematical expression.
e.g. <description mid="10" did="1" type="P"
spid="0_22"> |
<sdescription mid="MID1 MID2 ..." did="ID1 ID2 ..."
sid="ID1 ID2 ..." spid="ID"> |
</sdescription> |
The content of the sdescription tag is the short description of a
mathematical expression indicated by the mid
attribute. The did attribute specifies the related
long description. The sid attribute is a unique ID of
the sdescription for a mathematical expression.
e.g.
|
<condition mid="MID1 MID2 ..." did="ID1 ID2 ..."
cnid="ID1 ID2 ..." spid="ID"> |
</condition> |
The content of the condition tag is the condition of the
description of a mathematical expression indicated by the mid
attribute. The did attribute specifies the main
description. The cnid attribute is a unique ID of the
condition.
e.g. Suppose |
<mathgroup mid="ID" spid="ID"> |
</mathgroup> |
The attributes of the mathgroup tag are the same as
those of the math tag. A mathgroup is usually assigned
to a mathematical expression containing words.
e.g. <mathgroup mid="2" spid="0_87"> |
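The pairing rule for multi-valued attributes in the table above (the n-th mid goes with the n-th did and the n-th type) can be sketched as follows. This is only an illustration under the stated reading of the table: `DescriptionLink` and `expand_description` are hypothetical names, and the attributes are passed in as a plain dictionary rather than read from a file.

```python
from typing import NamedTuple


class DescriptionLink(NamedTuple):
    mid: str      # ID of the mathematical expression
    did: str      # ID of the description for that expression
    type: str     # type string for that expression
    section: str  # section ID part of spid
    span: str     # span ID part of spid


def expand_description(attrs):
    """Expand one description tag's attributes into one link per expression."""
    mids = attrs["mid"].split()
    dids = attrs["did"].split()
    types = attrs["type"].split()
    if not (len(mids) == len(dids) == len(types)):
        raise ValueError("mid, did and type must list the same number of values")
    # spid is assumed to have the form <section ID>_<span ID>.
    section, _, span = attrs["spid"].partition("_")
    return [
        DescriptionLink(m, d, t, section, span)
        for m, d, t in zip(mids, dids, types)
    ]


links = expand_description(
    {"mid": "10 20", "did": "1 1", "type": "P A", "spid": "2_5"}
)
for link in links:
    print(link)
```

The length check is a design choice made for this sketch; the task description does not say how mismatched attribute lists should be handled.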
<section>
...
If
<description mid="19" did="1"
type="P" spid="2_1">
<sdescription mid="19" did="1"
sid="1" spid="2_1">
a permutation
</sdescription>
</description>
<math mid="19" spid="2_2">
MATH_0801.2412_19
</math>
contains
<description mid="20" did="1"
type="P" spid="2_3">
<sdescription mid="20" did="1"
sid="1" spid="2_3">
the pattern
</sdescription>
</description>
<math mid="20" spid="2_4">
MATH_0801.2412_20
</math>
then clearly
<description mid="22" did="1"
type="P" spid="2_5">
<sdescription mid="22" did="1"
sid="1" spid="2_5">
the reverse of
<math mid="21" spid="2_6">
MATH_0801.2412_21
</math>
</sdescription>
</description>
, that is
<math mid="22" spid="2_7">
MATH_0801.2412_22
</math>
,
<cdescription mid="24" did="1"
cid="1" spid="2_8">
contains the reverse of
<math
mid="23" spid="2_9">
MATH_0801.2412_23
</math>
</cdescription>
, which is the
<description mid="24" did="1"
type="P" spid="2_10">
<sdescription mid="24" did="1"
sid="1" spid="2_10">
pattern
</sdescription>
</description>
<math mid="24" spid="2_11">
MATH_0801.2412_24
</math>
.
...
</section>
<section>
...
Namely, Simion and Stanton [57] essentially studied
<description mid="45 46 47 48"
did="1 1 1 1" type="P P P P" spid="3_1">
<sdescription mid="45 46 47 48" did="1 1 1 1" sid="1 1 1 1" spid="3_1">
the patterns
</sdescription>
</description>
<math mid="45" spid="3_2">
MATH_0801.2412_45
</math>
,
<math mid="46" spid="3_3">
MATH_0801.2412_46
</math>
,
<math mid="47" spid="3_4">
MATH_0801.2412_47
</math>
, and
<math mid="48" spid="3_5">
MATH_0801.2412_48
</math>
and their relation to a set of orthogonal polynomials
generalizing the Laguerre polynomials, and one of these patterns also
played a crucial role in the proof by Foata and Zeilberger [31] that
Denert's statistic is Mahonian.
...
</section>
<section>
...
<math mid="13" spid="33">
MATH_C04-1197_22
</math>
and
<math mid="14" spid="34">
MATH_C04-1197_23
</math>
are
<description mid="13 14" did="1 1"
type="P P" spid="35">
the numbers of
</description>
<cdescription mid="13" did="1" cid="1"
spid="36">
inequality
</cdescription>
and
<cdescription mid="14" did="1"
cid="1" spid="37">
equality
</cdescription>
<cdescription mid="13 14" did="1 1" cid="2 2"
spid="38">
constraints
</cdescription>
...
</section>
In this case, the description "the numbers of" is shared by two
mathematical expressions. The cdescription tags are used to express "the
numbers of inequality constraints" and "the numbers of equality
constraints". Note that "constraints" is also shared by the two expressions.
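The way a shared description and its cdescription pieces combine into one description per expression can be illustrated with a short script. This is a minimal sketch using Python's standard xml.etree.ElementTree on a condensed copy of the example above; it assumes the main description comes first and the cdescription pieces follow in cid order, which matches this example but is not stated as a general rule. The function name `collect_descriptions` is illustrative.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Condensed copy of the example above (whitespace simplified).
SAMPLE = """<section>
<math mid="13" spid="33">MATH_C04-1197_22</math> and
<math mid="14" spid="34">MATH_C04-1197_23</math> are
<description mid="13 14" did="1 1" type="P P" spid="35">the numbers of</description>
<cdescription mid="13" did="1" cid="1" spid="36">inequality</cdescription> and
<cdescription mid="14" did="1" cid="1" spid="37">equality</cdescription>
<cdescription mid="13 14" did="1 1" cid="2 2" spid="38">constraints</cdescription>
</section>"""


def collect_descriptions(xml_text):
    """Map each (mid, did) pair to its full description text."""
    root = ET.fromstring(xml_text)
    main = {}
    continued = defaultdict(list)
    for elem in root.iter():
        if elem.tag not in ("description", "cdescription"):
            continue
        # Flatten nested tags and normalize whitespace.
        text = " ".join("".join(elem.itertext()).split())
        mids = elem.get("mid").split()
        dids = elem.get("did").split()
        if elem.tag == "description":
            for mid, did in zip(mids, dids):
                main[(mid, did)] = text
        else:
            cids = elem.get("cid").split()
            for mid, did, cid in zip(mids, dids, cids):
                continued[(mid, did)].append((int(cid), text))
    # Assumption: main description first, then cdescription pieces by cid.
    return {
        key: " ".join([text] + [t for _, t in sorted(continued[key])])
        for key, text in main.items()
    }


result = collect_descriptions(SAMPLE)
for key, text in result.items():
    print(key, text)
```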
Tag |
Description | |
<section> |
</section> |
The section tag is the root tag of the document. |
<content> |
</content> |
The content tag contains the original text. Its parent tag must be
section. |
<annotation> |
</annotation> |
The annotation tag indicates relations between a mathematical
expression and its descriptions. |
<span id="SPID"> |
</span> |
The content of the span element is a mathematical
expression, a description, or a description-related phrase. This
tag is only used inside the content tag. The id
attribute must have the form "<section ID>_<span ID>". |
<math
attr ... /> |
The attributes of the math tag specify the relation between a
mathematical expression and its descriptions. This tag is only used
inside the annotation tag. The attributes that can be used in
the tag are shown below. All span references use the id attributes of
the span tags. Where multiple values are allowed, they are
separated by a space (" ").
<annotation>
<math mid="67" tid="0_118" count="1" did="0_119" sid="0_120"
type="A" /> |
|
<mathgroup attr ... /> |
The attributes of the mathgroup tag are the same as
those of the math tag. A mathgroup is usually assigned
to a mathematical expression containing words.
e.g. <content> </annotation> |
<section>
<content>
If
<span id="2_1">
a permutation
</span>
<span id="2_2">
MATH_0801.2412_19
</span>
contains
<span id="2_3">
the pattern
</span>
<span id="2_4">
MATH_0801.2412_20
</span>
then clearly
<span id="2_5">
the reverse of
<span id="2_6">
MATH_0801.2412_21
</span>
</span>
, that is
<span id="2_7">
MATH_0801.2412_22
</span>
,
<span id="2_8">
contains the reverse of
<span
id="2_9">
MATH_0801.2412_23
</span>
</span>
, which is the
<span id="2_10">
pattern
</span>
<span id="2_11">
MATH_0801.2412_24
</span>
...
</content>
<annotation>
<math mid="19" tid="2_2" count="1" did="2_1"
sid="2_1" type="P"/>
<math mid="20" tid="2_4" count="1" did="2_3"
sid="2_3" type="P"/>
<math mid="21" tid="2_6" count="0"/>
<math mid="22" tid="2_7" count="1" did="2_5"
sid="2_5" type="P"/>
<math mid="23" tid="2_9" count="0"/>
<math mid="24" tid="2_11" count="1" did="2_10"
cid="2_8" sid="2_10" type="P"/>
...
</annotation>
</section>
<section>
<content>
...
Namely, Simion and Stanton [57] essentially studied
<span id="3_1">
the patterns
</span>
<span
id="3_2">
MATH_0801.2412_45
</span>
,
<span id="3_3">
MATH_0801.2412_46
</span>
,
<span id="3_4">
MATH_0801.2412_47
</span>
, and
<span id="3_5">
MATH_0801.2412_48
</span>
and their relation to a set of orthogonal polynomials
generalizing the Laguerre polynomials, and one of these patterns also
played a crucial role in the proof by Foata and Zeilberger [31] that
Denert's statistic is Mahonian.
</content>
<annotation>
...
<math mid="45"
tid="3_2" count="1" did="3_1" sid="3_1" type="P"/>
<math mid="46" tid="3_3"
count="1" did="3_1" sid="3_1" type="P"/>
<math mid="47" tid="3_4" count="1" did="3_1"
sid="3_1" type="P"/>
<math mid="48" tid="3_5"
count="1" did="3_1" sid="3_1" type="P"/>
...
</annotation>
</section>
<section>
<content>
...
<span id="5_33">
MATH_C04-1197_22
</span>
and
<span id="5_34">
MATH_C04-1197_23
</span>
are
<span id="5_35">
the numbers of
</span>
<span id="5_36">
inequality
</span>
and
<span id="5_37">
equality
</span>
<span id="5_38">
constraints
</span>
...
</content>
<annotation>
...
<math mid="22" tid="5_33" count="1" did="5_35" cid="5_36,5_38"
sid="5_35" type="P" />
<math mid="23" tid="5_34" count="1" did="5_35" cid="5_37,5_38"
sid="5_35" type="P" />
...
</annotation>
</section>
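In this second format the annotation refers to spans by ID, so reading it means resolving tid, did and cid back to span text. The following is a minimal sketch with Python's standard xml.etree.ElementTree on a condensed copy of the last example. It accepts both space- and comma-separated ID lists, since the tag description mentions spaces while the example uses commas; the function names are illustrative, and joining did and cid text in that order is an assumption.

```python
import re
import xml.etree.ElementTree as ET

# Condensed copy of the last example above (whitespace simplified).
SAMPLE = """<section>
<content>
<span id="5_33">MATH_C04-1197_22</span> and
<span id="5_34">MATH_C04-1197_23</span> are
<span id="5_35">the numbers of</span>
<span id="5_36">inequality</span> and
<span id="5_37">equality</span>
<span id="5_38">constraints</span>
</content>
<annotation>
<math mid="22" tid="5_33" count="1" did="5_35" cid="5_36,5_38" sid="5_35" type="P" />
<math mid="23" tid="5_34" count="1" did="5_35" cid="5_37,5_38" sid="5_35" type="P" />
</annotation>
</section>"""


def span_text(elem):
    """Flatten nested spans and normalize whitespace."""
    return " ".join("".join(elem.itertext()).split())


def resolve_annotations(xml_text):
    """Map each mid to its expression text and its resolved description."""
    root = ET.fromstring(xml_text)
    spans = {s.get("id"): span_text(s) for s in root.find("content").iter("span")}
    resolved = {}
    for math in root.find("annotation").iter("math"):
        # did and cid are absent when count="0"; default them to empty strings.
        raw = math.get("did", "") + " " + math.get("cid", "")
        ids = [i for i in re.split(r"[\s,]+", raw.strip()) if i]
        resolved[math.get("mid")] = {
            "expression": spans[math.get("tid")],
            "description": " ".join(spans[i] for i in ids),
        }
    return resolved


resolved = resolve_annotations(SAMPLE)
for mid, info in resolved.items():
    print(mid, info)
```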
Please see http://research.nii.ac.jp/ntcir/ntcir-10/howto.html.
The test collection has been constructed and used for NTCIR. It may be
used for research purposes only.
The document collection included in the test collection was provided to
NII free of charge. The providers of the document data kindly
understand the importance of test collections in research on information
access technologies, and have therefore granted the use of the data for
research purposes. Please remember that the document data in the NTCIR test
collection is copyrighted and has commercial value as data. It is
imperative for our continued good relations with the data
producers/providers that we researchers behave as reliable partners,
use the data only for research purposes under the user agreement, and
take care not to violate any of their rights.