NTCIR Project

NTCIR-10 Math Understanding Subtask


The NTCIR-10 Math Understanding Subtask aims to extract natural language descriptions of mathematical expressions in a document, that is, to identify the descriptions that relate to each expression. For example, the sentence "a variable X is defined as a probabilistic distribution" shows that the description of "X" is "a variable" and the description of "a variable X" is "a probabilistic distribution".

The development data collection includes:

The annotated XML files were created by hand. The task is to design methods that link mathematical expressions to their descriptions.

Attention: The specifications of this task are subject to change without notice.

Example

In the example below, the annotation covers six mathematical expressions: MATH_0801.2412_19, MATH_0801.2412_20, MATH_0801.2412_21, MATH_0801.2412_22, MATH_0801.2412_23 and MATH_0801.2412_24.

Before Annotation
If a permutation MATH_0801.2412_19 contains the pattern MATH_0801.2412_20 then clearly the reverse of MATH_0801.2412_21, that is MATH_0801.2412_22, contains the reverse of MATH_0801.2412_23, which is the pattern MATH_0801.2412_24.

After Annotation

If <description mid="19" did="1" type="P" spid="2_1"><sdescription mid="19" did="1" sid="1" spid="2_1">a permutation</sdescription></description><math mid="19" spid="2_2">MATH_0801.2412_19</math> contains <description mid="20" did="1" type="P" spid="2_3"><sdescription  mid="20" did="1" sid="1" spid="2_3">the pattern</sdescription></description> <math mid="20" spid="2_4">MATH_0801.2412_20</math>then clearly <description mid="22" did="1" type="P" spid="2_5"><sdescription mid="22" did="1" sid="1" spid="2_5">the reverse of <math mid="21" spid="2_6">MATH_0801.2412_21</math></sdescription></description>, that is <math mid="22" spid="2_7">MATH_0801.2412_22</math>, <cdescription mid="24" did="1" cid="1" spid="2_8">contains the reverse of <math mid="23" spid="2_9">MATH_0801.2412_23</math></cdescription>, which is the <description mid="24" did="1" type="P" spid="2_10"><sdescription mid="24" did="1" sid="1" spid="2_10">pattern</sdescription></description><math mid="24" spid="2_11">MATH_0801.2412_24</math>.

The entire collection is provided by the co-organizers.

Annotation Policy

We provide annotated datasets. In these datasets, all expressions are replaced by symbols in the format MATH_{PaperNumber}_{MathID}. For example, the original sentence "a variable X is defined as a probabilistic distribution" is converted to "a variable MATH_1_1 is defined as a probabilistic distribution". If you want to refer to the content of the mathematical expressions, you can use the *.math files, which provide them in MathML format.
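As a rough illustration of the placeholder format described above, the following Python sketch splits a placeholder into its paper number and math ID. The names `parse_placeholder` and `find_placeholders` and the regular expression are assumptions made for this example; they are not part of the task distribution, and the pattern assumes that the paper number contains only letters, digits, dots, and hyphens.

```python
import re

# Assumed pattern: "MATH_" + paper number + "_" + numeric math ID.
# The paper number may contain dots or hyphens (e.g. 0801.2412, C04-1197).
PLACEHOLDER = re.compile(r"MATH_([A-Za-z0-9.\-]+)_(\d+)")

def parse_placeholder(token):
    """Return (paper_number, math_id) for one placeholder, or None if it does not match."""
    match = PLACEHOLDER.fullmatch(token)
    if match is None:
        return None
    return match.group(1), int(match.group(2))

def find_placeholders(sentence):
    """Return (paper_number, math_id) for every placeholder found in a sentence."""
    return [(m.group(1), int(m.group(2))) for m in PLACEHOLDER.finditer(sentence)]

sentence = "a variable MATH_1_1 is defined as a probabilistic distribution"
placeholders = find_placeholders(sentence)
```

Because the math ID is parsed as an integer, it can be compared directly with the mid values used in the annotation files.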


You can get more information about the annotation policy from the attached file (Annotation-Brat.pdf).

Documents

(1) List of document sets

The document sets included in the NTCIR-10 Math Understanding test collection are as follows.

                                          Event             Argument of event
Document  Paper ID   File name            MATH  MATH-GROUP  Description  C-Description  ShortDescription  Condition
ArXiv     0801.2412  0801.2412_3.xml      4     0           1            0              1                 0
                     0801.2412_5.xml      14    0           5            0              5                 0
                     0801.2412_6.xml      26    0           12           1              12                0
                     0801.2412_7.xml      19    0           6            0              6                 0
                     0801.2412_8.xml      39    0           16           1              16                0
                     0801.2412_9.xml      16    0           5            0              5                 0
                     0801.2412_10.xml     32    0           8            0              8                 0
                     0801.2412_11.xml     31    0           11           0              11                0
                     0801.2412_12.xml     27    0           9            2              9                 0
                     0801.2412_13.xml     32    0           8            0              8                 0
                     0801.2412_14.xml     5     0           1            0              1                 0
          0806.4135  0806.4135_9.xml      143   3           53           5              53                3
                     0806.4135_10.xml     65    0           13           0              13                3
                     0806.4135_11.xml     150   0           23           6              23                0
          0808.0212  0808.0212_3.xml      0     0           0            0              0                 0
                     0808.0212_4.xml      26    0           2            0              2                 0
                     0808.0212_8.xml      12    0           0            0              0                 0
                     0808.0212_9.xml      141   0           31           1              31                1
                     0808.0212_10.xml     8     0           2            0              2                 0
                     0808.0212_11.xml     24    1           2            0              2                 0
          0811.2449  0811.2449_3.xml      5     0           0            0              0                 0
                     0811.2449_5.xml      4     0           0            0              0                 0
                     0811.2449_6a.xml     12    0           2            0              2                 0
                     0811.2449_6b.xml     162   1           31           3              31                1
                     0811.2449_6c.xml     7     0           4            0              4                 0
                     0811.2449_6d.xml     11    1           3            0              3                 0
                     0811.2449_6e.xml     49    0           10           2              10                0
                     0811.2449_6f.xml     36    0           3            0              3                 0
                     0811.2449_7.xml      12    0           1            0              1                 0
          0902.4089  0902.4089_3.xml      7     0           1            0              1                 0
                     0902.4089_6.xml      16    0           9            3              9                 0
                     0902.4089_7.xml      24    0           8            3              7                 0
                     0902.4089_9.xml      16    0           4                           0                 4
                     0902.4089_11a.xml    7     0           2            1              2                 0
                     0902.4089_11b.xml    127   2           31           3              31                0
                     0902.4089_11c.xml    25    0           6            2              6                 0
                     0902.4089_12.xml     56    0           7            1              7                 1
          0904.0684  0904.0684_3.xml      4     0           0            0              0                 0
                     0904.0684_5.xml      2     0           0            0              0                 0
                     0904.0684_6.xml      48    3           10           0              10                0
                     0904.0684_7.xml      4     0           0            0              0                 0
                     0904.0684_8.xml      88    0           18           0              18                0
                     0904.0684_9.xml      44    0           8            2              8                 0
                     0904.0684_10.xml     2     0           0            0              0                 0
                     0904.0684_11.xml     38    0           3            0              3                 0
          0905.1426  0905.1426_6.xml      5     0           2            1              2                 0
                     0905.1426_10.xml     66    0           25           4              25                0
                     0905.1426_15.xml     106   0           23           1              21                0
                     0905.1426_26.xml     61    0           14           0              14                0
                     0905.1426_29.xml     2     0           0            0              0                 0
          0906.1240  0906.1240_5.xml      16    0           9            1              9                 0
                     0906.1240_6.xml      11    0           1            0              1                 0
                     0906.1240_7.xml      157   0           30           3              30                0
          090.1544   090.1544_8.xml       149   3           64           9              64                0
          0906.1612  0906.1612_3.xml      4     0           2            1              2                 0
                     0906.1612_6.xml      48    3           21           3              21                1
                     0906.1612_7.xml      166   7           63           6              63                0
                     0906.1612_8.xml      282   10          77           11             77                1
                     0906.1612_9.xml      111   1           23           1              23                1
                     0906.1612_11.xml     1     0           0            0              0                 0

(2) XML Tags used in document records


We present two annotation styles which can be converted into each other. You can use whichever one of them you prefer. We provide 3 kinds of annotations, namely, the full version which includes descriptions and short descriptions, the long version which only includes descriptions, and the short version which only includes short descriptions.

(a) Notation 1 (Simple)

Tag Description
<section>
</section>
The section tag is a root tag of the document.
<math mid="MID" spid="SPID">
</math>
The content of the math tag is a mathematical expression. Each mathematical expression has its own ID, given by the mid attribute; mid values are assigned sequentially (1, 2, ...) from the beginning of the paper. The spid attribute is an ID unique across the paper, written in the form <section ID>_<span ID>, where the section ID is the section number and the span ID is the span number. Span IDs are assigned sequentially (1, 2, ...) from the beginning of a file to each span (mathematical expression, description, or related phrase).
e.g.
<math mid="10" spid="0_23">
    MATH_0806.4135_10
</math>
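
To make the spid convention concrete, here is a minimal Python sketch that splits an spid into its section and span numbers. The `SpanId` type and the `parse_spid` function are hypothetical helpers introduced for this example, and the sketch assumes that both parts are plain integers.

```python
from typing import NamedTuple

class SpanId(NamedTuple):
    section: int  # section number (the part before the underscore)
    span: int     # span number within the file (the part after the underscore)

def parse_spid(spid):
    """Split an spid such as "0_23" into its section and span numbers."""
    section, separator, span = spid.partition("_")
    if not separator:
        raise ValueError(f"spid {spid!r} is not of the form <section ID>_<span ID>")
    return SpanId(int(section), int(span))

# Sorting by the parsed value orders spans numerically rather than as strings.
ordered = sorted(["2_10", "2_2", "0_23"], key=parse_spid)
```

Parsing the two parts as integers matters when ordering spans, since a plain string sort would place "2_10" before "2_2".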

<description mid="MID1 MID2 ..." did="ID1 ID2 ..." type="TYPE1 TYPE2 ..." spid="SPID">
</description>
The content of the description tag is the description of the mathematical expression indicated by the mid attribute. The mid attribute specifies one or more mathematical expressions by MID1, MID2, ... .
The did attribute is the ID of the description for a mathematical expression.
The type attribute specifies the type of the description for each mathematical expression. As with the mid attribute, each value of the type attribute (TYPE1, TYPE2, ...) corresponds to one mathematical expression. A TYPE is a sequence of one or more of the following characters:
    1. assumption: A
    2. condition: C
    3. proposition: P
For example, <description mid="10 20" did="1 1" type="P A" spid="2_5"> means that mathematical expression 10 has description 1 with type P, and mathematical expression 20 has description 1 with type A. Each mathematical expression can have several descriptions, numbered 1, 2, ... in the did attribute. The spid value 2_5 means that this is the fifth span in the file for section 2.
e.g.
<description mid="10" did="1" type="P" spid="0_22">
    A partition
</description>
<math mid="10" spid="0_23">

    MATH_0806.4135_10
</math>
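
The parallel mid, did, and type lists described above can be expanded into one record per mathematical expression. The following Python sketch shows one way to do this with the standard library; `expand_description` and `TYPE_NAMES` are names invented for this example, and the sketch assumes that the three attributes always list the same number of space-separated values.

```python
import xml.etree.ElementTree as ET

# Type characters as listed in the tag description above.
TYPE_NAMES = {"A": "assumption", "C": "condition", "P": "proposition"}

def expand_description(element):
    """Return one record per mathematical expression referenced by a description tag."""
    mids = element.get("mid", "").split()
    dids = element.get("did", "").split()
    types = element.get("type", "").split()
    if not (len(mids) == len(dids) == len(types)):
        raise ValueError("mid, did and type must list the same number of values")
    return [
        # A TYPE may hold several characters, so each one is mapped separately.
        {"mid": int(m), "did": int(d), "types": [TYPE_NAMES[c] for c in t]}
        for m, d, t in zip(mids, dids, types)
    ]

element = ET.fromstring(
    '<description mid="10 20" did="1 1" type="P A" spid="2_5">text</description>'
)
records = expand_description(element)
```

For the tag used in the explanation above, this yields one record for expression 10 (proposition) and one for expression 20 (assumption).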

<cdescription mid="MID1 MID2 ..." did="ID1 ID2 ..." cid="ID1 ID2 ..." spid="SPID">
</cdescription>
The content of the cdescription tag is a discontinuous part of the description of the mathematical expression indicated by the mid attribute. The did attribute specifies the related main description. The cid attribute is a unique ID of the cdescription for a mathematical expression.
e.g.
<description  mid="10" did="1" type="P" spid="0_22">
    <sdescription mid="10" did="1" sid="1" spid="0_22">

        A partition
    </sdescription>
</description>
<math mid="10" spid="0_23">

    MATH_0806.4135_10
</math>
<cdescription mid="10" did="1" cid="1" spid="0_24">

    of
    <math mid="11" spid="0_25">
        MATH_0806.4135_11
    </math>
</cdescription>

<sdescription mid="MID1 MID2 ..." did="ID1 ID2 ..." sid="ID1 ID2 ..." spid="ID">
</sdescription>
The content of the sdescription tag is the short description of a mathematical expression indicated by the mid attribute. The did attribute specifies the related long description. The sid attribute is a unique ID of the sdescription for a mathematical expression.
e.g.
<description mid="10" did="1" type="P" spid="0_22">
    <sdescription mid="10" did="1" sid="1" spid="0_22">

    A partition
    </sdescription>
</description>
<math mid="10" spid="0_23">

   MATH_0806.4135_10
</math>

<condition mid="MID1 MID2 ..." did="ID1 ID2 ..." cnid="ID1 ID2 ..." spid="ID">
</condition>
The content of the condition tag is the condition of the description of a mathematical expression indicated by the mid attribute. The did attribute specifies the main description. The cnid attribute is a unique ID of the condition.
e.g.
Suppose
<condition mid="69" did="1" cnid="1" spid="0_117">
    <math mid="67" spid="0_118">

        MATH_0806.4135_67
    </math>
    is
    <description mid="67" did="1" type="A" spid="0_119">
        <sdescription mid="67" did="1" sid="1" spid="0_120">
            an additive subgroup
       </sdescription>
       of
       <math mid="68" spid="0_121">
            MATH_0806.4135_68
       </math>
       which contains the inverses of each of its nonzero elements
    </description>
</condition>
.
Then
<math mid="69" spid="0_122">
    MATH_0806.4135_69
</math>
is
<description mid="69" did="1" type="C" spid="0_123">
    <sdescription mid="69" did="1" sid="1" spid="0_124">
        a subfield
    </sdescription>
    of
    <math mid="70" spid="0_125">
        MATH_0806.4135_70
    </math>
</description>
.
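
As an illustration of how a condition refers to a description, the Python sketch below collects the text of each condition under the (mid, did) pair it points to, using the example above as input. The function names are hypothetical, and flattening the nested tags into plain text is a simplification chosen for this example.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<section>Suppose
<condition mid="69" did="1" cnid="1" spid="0_117">
<math mid="67" spid="0_118">MATH_0806.4135_67</math> is
<description mid="67" did="1" type="A" spid="0_119">
<sdescription mid="67" did="1" sid="1" spid="0_120">an additive subgroup</sdescription>
of <math mid="68" spid="0_121">MATH_0806.4135_68</math>
which contains the inverses of each of its nonzero elements
</description>
</condition>. Then
<math mid="69" spid="0_122">MATH_0806.4135_69</math> is
<description mid="69" did="1" type="C" spid="0_123">
<sdescription mid="69" did="1" sid="1" spid="0_124">a subfield</sdescription>
of <math mid="70" spid="0_125">MATH_0806.4135_70</math>
</description>.
</section>"""

def flatten(element):
    """Join all text inside an element and collapse runs of whitespace."""
    return " ".join("".join(element.itertext()).split())

def conditions_by_description(section):
    """Map the (mid, did) pair of a description to the text of its conditions."""
    found = {}
    for condition in section.iter("condition"):
        pairs = zip(condition.get("mid").split(), condition.get("did").split())
        for mid, did in pairs:
            found.setdefault((mid, did), []).append(flatten(condition))
    return found

conditions = conditions_by_description(ET.fromstring(SAMPLE))
```

Here the only key is ("69", "1"), so the condition text is attached to description 1 of expression 69, the "a subfield of ..." description.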

<mathgroup mid="ID" spid="ID">
</mathgroup>
The attributes of the mathgroup tag are the same as those of the math tag. A mathgroup is usually assigned to a mathematical expression containing words.
e.g.
<mathgroup mid="2" spid="0_87">
    <description mid="55" did="1" type="P" spid="0_88">
        <sdescription mid="55" did="1" sid="1" spid="0_88">
            block system
        </sdescription>
    </description>
    <math mid="55" spid="0_89">
        MATH_0806.4135_55
    </math>
    <cdescription mid="55" did="1" cid="1" spid="0_90">
        of
        <math mid="56" spid="0_91">
            MATH_0806.4135_56
        </math>
    </cdescription>
</mathgroup>

Example

<section>
    ...
    If
    <description mid="19" did="1" type="P" spid="2_1">
        <sdescription mid="19" did="1" sid="1" spid="2_1">
            a permutation
        </sdescription>
    </description>
    <math mid="19" spid="2_2">
        MATH_0801.2412_19
    </math>
    contains
    <description mid="20" did="1" type="P" spid="2_3">
        <sdescription mid="20" did="1" sid="1" spid="2_3">
            the pattern
        </sdescription>
    </description>
    <math mid="20" spid="2_4">
        MATH_0801.2412_20
    </math>
    then clearly
    <description mid="22" did="1" type="P" spid="2_5">
        <sdescription mid="22" did="1" sid="1" spid="2_5">
            the reverse of
            <math mid="21" spid="2_6">
                MATH_0801.2412_21
            </math>
        </sdescription>
    </description>

    , that is
    <math mid="22" spid="2_7">
        MATH_0801.2412_22
    </math>
    ,
    <cdescription mid="24" did="1" cid="1" spid="2_8">
        contains the reverse of
        <math mid="23" spid="2_9">
            MATH_0801.2412_23
        </math>
    </cdescription>

    , which is the
    <description mid="24" did="1" type="P" spid="2_10">
        <sdescription mid="24" did="1" sid="1" spid="2_10">
            pattern
        </sdescription>
    </description>

    <math mid="24" spid="2_11">
        MATH_0801.2412_24
    </math>
    .
...
</section>


In this case, the description of MATH_0801.2412_19 is "a permutation". It is also the short description. The final expression MATH_0801.2412_24 has a complicated structure. It has a description "pattern" but also has a cdescription "contains the reverse of MATH_0801.2412_23".

<section>
    ...
    Namely, Simion and Stanton [57] essentially studied
    <description mid="45 46 47 48" did="1 1 1 1" type="P P P P" spid="3_1">
        <sdescription mid="45 46 47 48" did="1 1 1 1" sid="1 1 1 1" spid="3_1">
            the patterns
        </sdescription>
    </description>

    <math mid="45" spid="3_2">
        MATH_0801.2412_45
    </math>
    ,

    <math mid="46" spid="3_3">
        MATH_0801.2412_46
    </math>
    ,

    <math mid="47" spid="3_4">
        MATH_0801.2412_47
    </math>
    , and

    <math mid="48" spid="3_5">
        MATH_0801.2412_48
    </math>
    and their relation to a set of orthogonal polynomials generalizing the Laguerre polynomials, and one of these patterns also played a crucial role in the proof by Foata and Zeilberger [31] that Denert's statistic is Mahonian.
    ...
</section>

In this case, a description "the patterns" covers 4 mathematical expressions.

<section>
...
    <math mid="13" spid="33">
        MATH_C04-1197_22
    </math>
    and
    <math mid="14" spid="34">
        MATH_C04-1197_23
    </math>
    are
    <description mid="13 14" did="1 1" type="P P" spid="35">
        the numbers of
    </description>
    <cdescription mid="13" did="1" cid="1" spid="36">

        inequality
    </cdescription>
    and
    <cdescription mid="14" did="1" cid="1" spid="37">
    equality
    </cdescription>
    <cdescription mid="13 14" did="1 1" cid="2 2" spid="38">

        constraints
    </cdescription>
...
</section>


In this case, the description "the numbers of" is shared by two mathematical expressions. The cdescription tags are used to express "the numbers of inequality constraints" and "the numbers of equality constraints". Note that "constraints" is shared by both expressions.
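
The way a shared description and its cdescription parts combine can be sketched in Python as follows. This is only an illustration under stated assumptions: `full_descriptions` is a hypothetical helper, the sample is a trimmed copy of the example above, and joining the description with its cdescription parts in cid order is a convention chosen for this sketch rather than a rule stated by the task.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

SAMPLE = """<section>
<math mid="13" spid="33">MATH_C04-1197_22</math> and
<math mid="14" spid="34">MATH_C04-1197_23</math> are
<description mid="13 14" did="1 1" type="P P" spid="35">the numbers of</description>
<cdescription mid="13" did="1" cid="1" spid="36">inequality</cdescription> and
<cdescription mid="14" did="1" cid="1" spid="37">equality</cdescription>
<cdescription mid="13 14" did="1 1" cid="2 2" spid="38">constraints</cdescription>
</section>"""

def text_of(element):
    """Join all text inside an element and collapse runs of whitespace."""
    return " ".join("".join(element.itertext()).split())

def full_descriptions(section):
    """Map (mid, did) to the description text followed by its cdescription parts."""
    heads = {}
    parts = defaultdict(list)
    for desc in section.iter("description"):
        for mid, did in zip(desc.get("mid").split(), desc.get("did").split()):
            heads[(mid, did)] = text_of(desc)
    for cdesc in section.iter("cdescription"):
        triples = zip(
            cdesc.get("mid").split(), cdesc.get("did").split(), cdesc.get("cid").split()
        )
        for mid, did, cid in triples:
            parts[(mid, did)].append((int(cid), text_of(cdesc)))
    # Assumption of this sketch: parts are appended to the description in cid order.
    return {
        key: " ".join([head] + [text for _, text in sorted(parts[key])])
        for key, head in heads.items()
    }

result = full_descriptions(ET.fromstring(SAMPLE))
```

With this input, expression 13 maps to "the numbers of inequality constraints" and expression 14 to "the numbers of equality constraints", matching the reading given above.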

(b) Notation 2

 Tag
 Description
<section>
</section>
The section tag is the root tag of the document.
<content>
</content>
The content tag contains the original text. Its parent tag must be section.
<annotation>
</annotation>
The annotation tag indicates relations between a mathematical expression and its descriptions.
<span id="SPID"> </span> The content of the span element is a mathematical expression, a description, or a description-related phrase. This tag is only used inside the content tag. The id attribute must have the form "<section ID>_<span ID>".
 <math  attr ... />

 The attributes of the math tag specify the relation between a mathematical expression and its descriptions. This tag is only used inside the annotation tag. The attributes that can be used in the tag are listed below. All span references use the id attributes of the span tags. Where an attribute takes multiple values, they are separated as shown in that attribute's format below.
  1. mid="ID": Each math tag has a unique ID, by which the relation between a mathematical expression and its descriptions can be referenced. IDs are assigned sequentially (1, 2, ...) from the beginning of a file.
  2. tid="SPID": Specifies the mathematical expression span.
  3. count="N": Specifies the number of descriptions. If N=0, the mathematical expression has no descriptions.
  4. did="SPID1;SPID2;...": Contains references to the description spans.
  5. cid="SPID11,SPID12,...;SPID21,SPID22,...;...": Contains references to discontinuous description spans. Here SPID11 and SPID12 are discontinuous parts of the description SPID1.
  6. sid="SPID11,SPID12,...;SPID21,SPID22,...;...": Contains references to short descriptions.
  7. cnid="SPID11,SPID12,...;SPID21,SPID22,...;...": Contains references to conditions.
  8. type="TYPE1;TYPE2;...": Specifies the types of the descriptions of a mathematical expression. Each TYPE must be a sequence of one or more of the following characters:
    1. assumption: A
    2. condition: C
    3. proposition: P
    2. condition: C
    3. proposition: P
e.g.
<content>
<span id="0_22">
    A partition
</span>
<span id="0_23">
    MATH_0806.4135_10
</span>
<span id="0_24">
    of
    <span id="0_25">
        MATH_0806.4135_11
    </span>
</span>
</content>
<annotation>
    <math mid="10" tid="0_23" count="1" did="0_22" cid="0_24" sid="0_22" type="P" />
</annotation>
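
The attribute formats listed above can be read with a few lines of Python. The sketch below is a hedged illustration, not a reference parser: `split_groups`, `group_at`, and `parse_math` are hypothetical names, and the code assumes that ";" separates the entries for different descriptions and "," separates several spans belonging to the same description.

```python
import xml.etree.ElementTree as ET

def split_groups(value):
    """Split "a,b;c" into [["a", "b"], ["c"]]; a missing attribute gives []."""
    if not value:
        return []
    return [[item for item in group.split(",") if item] for group in value.split(";")]

def group_at(groups, index):
    """Return the group for the description at this position, or [] if there is none."""
    return groups[index] if index < len(groups) else []

def parse_math(element):
    """Turn one Notation 2 math tag into a plain dictionary."""
    dids = [d for d in element.get("did", "").split(";") if d]
    cids = split_groups(element.get("cid"))
    sids = split_groups(element.get("sid"))
    cnids = split_groups(element.get("cnid"))
    types = [t for t in element.get("type", "").split(";") if t]
    descriptions = [
        {
            "did": did,
            "cid": group_at(cids, i),
            "sid": group_at(sids, i),
            "cnid": group_at(cnids, i),
            "type": types[i] if i < len(types) else None,
        }
        for i, did in enumerate(dids)
    ]
    return {
        "mid": element.get("mid"),
        "tid": element.get("tid"),
        "count": int(element.get("count", "0")),
        "descriptions": descriptions,
    }

tag = ET.fromstring(
    '<math mid="10" tid="0_23" count="1" did="0_22" cid="0_24" sid="0_22" type="P" />'
)
parsed = parse_math(tag)
```

Keeping the span references as strings makes it easy to look them up later among the id attributes of the span tags in the content.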

<content>
Suppose
<span id="0_117">
    <span id="0_118">
        MATH_0806.4135_67
    </span>
    is
    <span id="0_119">
        <span id="0_120">
            an additive subgroup
        </span>
        of
        <span id="0_121">
            MATH_0806.4135_68
        </span>
        which contains the inverses of each of its nonzero elements
    </span>
</span>
. Then
<span id="0_122">
    MATH_0806.4135_69
</span>
is
<span id="0_123">
    <span id="0_124">
        a subfield
    </span>
    of
    <span id="0_125">
        MATH_0806.4135_70
    </span>
</span>
.
</content>
<annotation>
    <math mid="67" tid="0_118" count="1" did="0_119" sid="0_120" type="A" />
    <math mid="68" tid="0_121" count="0" />
    <math mid="69" tid="0_122" count="1" did="0_123" sid="0_124" cnid="0_117" type="C" />
    <math mid="70" tid="0_125" count="0" />
</annotation>
<mathgroup attr ... />
The attributes of the mathgroup tag are the same as those of the math tag. A mathgroup is usually assigned to a mathematical expression containing words.
e.g.
<content>
Any
<span id="0_87">
    <span id="0_88">

        block system
    </span>
    <span id="0_89">
        MATH_0806.4135_55
    </span>
    <span id="0_90">
 
       of
        <span id="0_91">
 
           MATH_0806.4135_56
        </span>
    </span>
</span>
is
<span id="0_92">
    <span id="0_93">
        the set of translates
    </span>
   of
   <span id="0_94">
       a proper vector subspace
   </span>
   <span id="0_95">
       MATH_0806.4135_57
   </span>
   of
   <span id="0_96">
   MATH_0806.4135_58
   </span>
</span>
, that is,
<span id="0_97">
    MATH_0806.4135_59
</span>
.
</content>
<annotation>
    <math mid="55" tid="0_89" count="1" did="0_88" cid="0_90" sid="0_88" type="P" />
    <math mid="56" tid="0_91" count="0" />
    <math mid="57" tid="0_95" count="1" did="0_94" sid="0_94" type="P" />
    <math mid="58" tid="0_96" count="0" />
    <math mid="59" tid="0_97" count="1" did="0_92" sid="0_93" type="P" />
    <mathgroup mid="2" tid="0_87" count="1" did="0_92" sid="0_93" type="P" />
</annotation>

Example

<section>
<content>
    If
    <span id="2_1">
        a permutation
    </span>
    <span id="2_2">
        MATH_0801.2412_19
    </span>
    contains
    <span id="2_3">
        the pattern
    </span>
    <span id="2_4">
        MATH_0801.2412_20
    </span>
    then clearly
    <span id="2_5">
        the reverse of
        <span id="2_6">
            MATH_0801.2412_21
        </span>
    </span>

    , that is
    <span id="2_7">
        MATH_0801.2412_22
    </span>
    ,
    <span id="2_8">
        contains the reverse of
        <span id="2_9">
            MATH_0801.2412_23
        </span>
    </span>

    , which is the
    <span id="2_10">
        pattern
    </span>
    <span id="2_11">
        MATH_0801.2412_24
    </span>
    ...
</content>
<annotation>
    <math mid="19" tid="2_2" count="1" did="2_1" sid="2_1" type="P"/>
    <math mid="20" tid="2_4" count="1" did="2_3" sid="2_3" type="P"/>
    <math mid="21" tid="2_6" count="0"/>
    <math mid="22" tid="2_7" count="1" did="2_5" sid="2_5" type="P"/>
    <math mid="23" tid="2_9" count="0"/>
    <math mid="24" tid="2_11" count="1" did="2_10" cid="2_8" sid="2_10" type="P"/>

    ...
</annotation>
</section>

<section>
<content>

    ...
    Namely, Simion and Stanton [57] essentially studied
    <span id="3_1">
        the patterns
    </span>
    <span id="3_2">
        MATH_0801.2412_45
    </span>
    ,
    <span id="3_3">
        MATH_0801.2412_46
    </span>
    ,
    <span id="3_4">
        MATH_0801.2412_47
    </span>
    , and
    <span id="3_5">
        MATH_0801.2412_48
    </span>
    and their relation to a set of orthogonal polynomials generalizing the Laguerre polynomials, and one of these patterns also played a crucial role in the proof by Foata and Zeilberger [31] that Denert's statistic is Mahonian.
</content>

<annotation>

    ...
    <math mid="45" tid="3_2" count="1" did="3_1" sid="3_1" type="P"/>
    <math mid="46" tid="3_3" count="1" did="3_1" sid="3_1" type="P"/>
    <math mid="47" tid="3_4" count="1" did="3_1" sid="3_1" type="P"/>

    <math mid="48" tid="3_5" count="1" did="3_1" sid="3_1" type="P"/>
    ...
</annotation>

</section>


<section>
<content>
...
    <span id="5_33">
        MATH_C04-1197_22
    </span>
    and
    <span id="5_34">
        MATH_C04-1197_23
    </span>
    are
    <span id="5_35">
        the numbers of
    </span>
    <span id="5_36">

        inequality
    </span>
    and
    <span id="5_37">
        equality
    </span>
    <span id="5_38">

        constraints
    </span>
...

</content>
<annotation>
...
    <math mid="22" tid="5_33" count="1" did="5_35" cid="5_36,5_38" sid="5_35" type="P" />
    <math mid="23" tid="5_34" count="1" did="5_35" cid="5_37,5_38" sid="5_35" type="P" />
...

</annotation>
</section>
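
To show how the annotation tags in Notation 2 point back into the content, the following Python sketch resolves each math tag to the text of its description spans. It is a minimal sketch under stated assumptions: `span_texts` and `resolve` are hypothetical helpers, the sample is a trimmed copy of the last example above, and appending the cid spans after the did span is a convention of this sketch.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<section>
<content>
<span id="5_33">MATH_C04-1197_22</span> and
<span id="5_34">MATH_C04-1197_23</span> are
<span id="5_35">the numbers of</span>
<span id="5_36">inequality</span> and
<span id="5_37">equality</span>
<span id="5_38">constraints</span>
</content>
<annotation>
<math mid="22" tid="5_33" count="1" did="5_35" cid="5_36,5_38" sid="5_35" type="P" />
<math mid="23" tid="5_34" count="1" did="5_35" cid="5_37,5_38" sid="5_35" type="P" />
</annotation>
</section>"""

def span_texts(section):
    """Index every span in the content by its id, with whitespace collapsed."""
    return {
        span.get("id"): " ".join("".join(span.itertext()).split())
        for span in section.find("content").iter("span")
    }

def resolve(section):
    """Map each expression placeholder to its list of reassembled descriptions."""
    texts = span_texts(section)
    resolved = {}
    for math in section.find("annotation").iter("math"):
        dids = [d for d in math.get("did", "").split(";") if d]
        cid_groups = math.get("cid", "").split(";")
        descriptions = []
        for i, did in enumerate(dids):
            extra = cid_groups[i].split(",") if i < len(cid_groups) else []
            # Assumption of this sketch: cid spans follow the did span in listed order.
            pieces = [texts[did]] + [texts[c] for c in extra if c]
            descriptions.append(" ".join(pieces))
        resolved[texts[math.get("tid")]] = descriptions
    return resolved

result = resolve(ET.fromstring(SAMPLE))
```

For this sample, the two placeholders resolve to "the numbers of inequality constraints" and "the numbers of equality constraints" respectively.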

More Information

Please see http://research.nii.ac.jp/ntcir/ntcir-10/howto.html.

Notice

The test collection has been constructed and used for NTCIR. It may be used for research purposes only.
The document collection included in the test collection was provided to NII free of charge. The providers of the document data kindly understand the importance of test collections in research on information access technologies, and thus granted the use of the data for research purposes. Please remember that the document data in the NTCIR test collection is copyrighted and has commercial value as data. It is imperative for our continued good relations with the data producers/providers that we researchers behave as reliable partners, use the data only for research purposes under the user agreement, and take care not to violate any of their rights.

Last updated on: 2012-06-04
ntc-admin