﻿ NTCIR-10 Math Understanding Subtask


The NTCIR-10 Math Understanding Subtask aims to extract natural language descriptions of mathematical expressions in a document, that is, to identify the natural language text that describes each expression. For example, the sentence "a variable X is defined as a probabilistic distribution" shows that the description of "X" is "a variable" and the description of "a variable X" is "a probabilistic distribution".

The development data collection includes:

• 60 XML files extracted from 10 arXiv papers (arXMLiv).
• The papers are taken from the mathematics and physics domains.
• We provide 2 annotation formats: a simple format in which all tags are embedded in the text, and a format in which the text and the descriptions are separated.
• We provide 3 kinds of annotations: description and short description, description only, and short description only.

The annotated XML files were created by hand. The task is to design methods that can connect mathematical expressions with their descriptions.

Attention: The specifications of this task are subject to change without notice.

Example

In the example below, the annotation covers 6 mathematical expressions: MATH_0801.2412_19, MATH_0801.2412_20, MATH_0801.2412_21, MATH_0801.2412_22, MATH_0801.2412_23 and MATH_0801.2412_24.

Before Annotation
If a permutation MATH_0801.2412_19 contains the pattern MATH_0801.2412_20 then clearly the reverse of MATH_0801.2412_21, that is MATH_0801.2412_22, contains the reverse of MATH_0801.2412_23, which is the pattern MATH_0801.2412_24.

After Annotation

If <description mid="19" did="1" type="P" spid="2_1"><sdescription mid="19" did="1" sid="1" spid="2_1">a permutation</sdescription></description> <math mid="19" spid="2_2">MATH_0801.2412_19</math> contains <description mid="20" did="1" type="P" spid="2_3"><sdescription mid="20" did="1" sid="1" spid="2_3">the pattern</sdescription></description> <math mid="20" spid="2_4">MATH_0801.2412_20</math> then clearly <description mid="22" did="1" type="P" spid="2_5"><sdescription mid="22" did="1" sid="1" spid="2_5">the reverse of <math mid="21" spid="2_6">MATH_0801.2412_21</math></sdescription></description>, that is <math mid="22" spid="2_7">MATH_0801.2412_22</math>, <cdescription mid="24" did="1" cid="1" spid="2_8">contains the reverse of <math mid="23" spid="2_9">MATH_0801.2412_23</math></cdescription>, which is the <description mid="24" did="1" type="P" spid="2_10"><sdescription mid="24" did="1" sid="1" spid="2_10">pattern</sdescription></description> <math mid="24" spid="2_11">MATH_0801.2412_24</math>.

The entire collection is provided by the co-organizers.

Annotation Policy

We provide annotated datasets. In these datasets, all expressions are replaced by symbols of the form MATH_{PaperNumber}_{MathID}. For example, the original sentence "a variable X is defined as a probabilistic distribution" is converted to "a variable MATH_1_1 is defined as a probabilistic distribution". To refer to the content of a mathematical expression, use the *.math files, which provide the expressions in MathML.
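The symbol format above can be recognized with a regular expression. The following is a minimal sketch, not part of the task distribution: the function name find_symbols and the exact pattern are assumptions, in particular that the MathID is a decimal number and that the PaperNumber contains no whitespace.

```python
import re

# Assumed shape of a placeholder: MATH_{PaperNumber}_{MathID}.
# PaperNumber: any run of non-whitespace characters (e.g. "0801.2412").
# MathID: a decimal number (assumption based on the examples on this page).
SYMBOL = re.compile(r"MATH_(?P<paper>\S+?)_(?P<math_id>\d+)\b")


def find_symbols(text):
    """Return (paper_number, math_id) pairs for every placeholder in text."""
    return [(m.group("paper"), int(m.group("math_id")))
            for m in SYMBOL.finditer(text)]


sentence = "a variable MATH_1_1 is defined as a probabilistic distribution"
print(find_symbols(sentence))
```

The returned paper number and math ID could then be used to look up the corresponding entry in the *.math files; the layout of those files is not described here, so that step is left out.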

You can get more information about the annotation policy from the attached file (Annotation-Brat.pdf).

Documents

(1) List of document sets

The document sets included in the NTCIR-10 Math Understanding test collection are as follows.

                                         Event             Argument of event
Document  Paper ID   File name           MATH  MATH-GROUP  Description  C-Description  ShortDescription  Condition
arXiv     0801.2412  0801.2412_3.xml     4     0           1            0              1                 0
                     0801.2412_5.xml     14    0           5            0              5                 0
                     0801.2412_6.xml     26    0           12           1              12                0
                     0801.2412_7.xml     19    0           6            0              6                 0
                     0801.2412_8.xml     39    0           16           1              16                0
                     0801.2412_9.xml     16    0           5            0              5                 0
                     0801.2412_10.xml    32    0           8            0              8                 0
                     0801.2412_11.xml    31    0           11           0              11                0
                     0801.2412_12.xml    27    0           9            2              9                 0
                     0801.2412_13.xml    32    0           8            0              8                 0
                     0801.2412_14.xml    5     0           1            0              1                 0
          0806.4135  0806.4135_9.xml     143   3           53           5              53                3
                     0806.4135_10.xml    65    0           13           0              13                3
                     0806.4135_11.xml    150   0           23           6              23                0
          0808.0212  0808.0212_3.xml     0     0           0            0              0                 0
                     0808.0212_4.xml     26    0           2            0              2                 0
                     0808.0212_8.xml     12    0           0            0              0                 0
                     0808.0212_9.xml     141   0           31           1              31                1
                     0808.0212_10.xml    8     0           2            0              2                 0
                     0808.0212_11.xml    24    1           2            0              2                 0
          0811.2449  0811.2449_3.xml     5     0           0            0              0                 0
                     0811.2449_5.xml     4     0           0            0              0                 0
                     0811.2449_6a.xml    12    0           2            0              2                 0
                     0811.2449_6b.xml    162   1           31           3              31                1
                     0811.2449_6c.xml    7     0           4            0              4                 0
                     0811.2449_6d.xml    11    1           3            0              3                 0
                     0811.2449_6e.xml    49    0           10           2              10                0
                     0811.2449_6f.xml    36    0           3            0              3                 0
                     0811.2449_7.xml     12    0           1            0              1                 0
          0902.4089  0902.4089_3.xml     7     0           1            0              1                 0
                     0902.4089_6.xml     16    0           9            3              9                 0
                     0902.4089_7.xml     24    0           8            3              7                 0
                     0902.4089_9.xml     16    0           4            0              4
                     0902.4089_11a.xml   7     0           2            1              2                 0
                     0902.4089_11b.xml   127   2           31           3              31                0
                     0902.4089_11c.xml   25    0           6            2              6                 0
                     0902.4089_12.xml    56    0           7            1              7                 1
          0904.0684  0904.0684_3.xml     4     0           0            0              0                 0
                     0904.0684_5.xml     2     0           0            0              0                 0
                     0904.0684_6.xml     48    3           10           0              10                0
                     0904.0684_7.xml     4     0           0            0              0                 0
                     0904.0684_8.xml     88    0           18           0              18                0
                     0904.0684_9.xml     44    0           8            2              8                 0
                     0904.0684_10.xml    2     0           0            0              0                 0
                     0904.0684_11.xml    38    0           3            0              3                 0
          0905.1426  0905.1426_6.xml     5     0           2            1              2                 0
                     0905.1426_10.xml    66    0           25           4              25                0
                     0905.1426_15.xml    106   0           23           1              21                0
                     0905.1426_26.xml    61    0           14           0              14                0
                     0905.1426_29.xml    2     0           0            0              0                 0
          0906.1240  0906.1240_5.xml     16    0           9            1              9                 0
                     0906.1240_6.xml     11    0           1            0              1                 0
                     0906.1240_7.xml     157   0           30           3              30                0
          090.1544   090.1544_8.xml      149   3           64           9              64                0
          0906.1612  0906.1612_3.xml     4     0           2            1              2                 0
                     0906.1612_6.xml     48    3           21           3              21                1
                     0906.1612_7.xml     166   7           63           6              63                0
                     0906.1612_8.xml     282   10          77           11             77                1
                     0906.1612_9.xml     111   1           23           1              23                1
                     0906.1612_11.xml    1     0           0            0              0                 0

(2) XML Tags used in document records

We present two annotation styles which can be converted into each other. You can use whichever one of them you prefer. We provide 3 kinds of annotations, namely, the full version which includes descriptions and short descriptions, the long version which only includes descriptions, and the short version which only includes short descriptions.

(a) Notation 1 (Simple)

Tag: Description

<section>: The section tag is the root tag of the document.

<math>: The content of the math tag is a mathematical expression. Each mathematical expression has its own ID, given by the mid attribute. The mid values are assigned as 1, 2, ... to the mathematical expressions in order from the beginning of the paper. The spid attribute is an ID that is unique across a paper. It has the form {section ID}_{span ID}, where the section ID is the section number and the span ID is the span number. Span IDs are assigned as 1, 2, ... from the beginning of a file to each span (mathematical expression, description, or related sequential phrase).
e.g. MATH_0806.4135_10

<description>: The content of the description tag is the description of the mathematical expression indicated by the mid attribute. The mid attribute specifies one or more mathematical expressions as MID1, MID2, ... . The did attribute is the ID of the description for a mathematical expression. The type attribute specifies the type of the description. As with the mid attribute, each value of the type attribute, TYPE1, TYPE2, ..., is defined for the corresponding mathematical expression. A TYPE is a sequence of one or more of the following characters:
• assumption: A
• condition: C
• proposition: P
For example, mid="10 20" did="1 1" type="P A" spid="2_5" means that mathematical expression 10 has description 1 with type P and mathematical expression 20 has description 1 with type A. Each mathematical expression can have several descriptions, numbered 1, 2, ... in the did attribute. The spid value shows that this span is the fifth span of the file, in section 2.
e.g. A partition MATH_0806.4135_10

<cdescription>: The content of the cdescription tag is a discontinuous description of the mathematical expression indicated by the mid attribute. The did attribute specifies the related main description. The cid attribute is a unique ID of the cdescription for a mathematical expression.
e.g. A partition MATH_0806.4135_10 of MATH_0806.4135_11

<sdescription>: The content of the sdescription tag is the short description of the mathematical expression indicated by the mid attribute. The did attribute specifies the related long description. The sid attribute is a unique ID of the sdescription for a mathematical expression.
e.g. A partition MATH_0806.4135_10

<condition>: The content of the condition tag is the condition of the description of the mathematical expression indicated by the mid attribute. The did attribute specifies the main description. The cnid attribute is a unique ID of the condition.
e.g. Suppose MATH_0806.4135_67 is an additive subgroup of MATH_0806.4135_68 which contains the inverses of each of its nonzero elements. Then MATH_0806.4135_69 is a subfield of MATH_0806.4135_70.

<mathgroup>: The attributes of the mathgroup tag are the same as those of the math tag. A mathgroup is usually assigned to a mathematical expression containing words.
e.g. block system MATH_0806.4135_55 of MATH_0806.4135_56

Example

<section>
...
If
<description mid="19" did="1" type="P" spid="2_1">
<sdescription mid="19" did="1" sid="1" spid="2_1">
a permutation
</sdescription>
</description>
<math mid="19" spid="2_2">
MATH_0801.2412_19
</math>
contains
<description mid="20" did="1" type="P" spid="2_3">
<sdescription mid="20" did="1" sid="1" spid="2_3">
the pattern
</sdescription>
</description>
<math mid="20" spid="2_4">
MATH_0801.2412_20
</math>
then clearly
<description mid="22" did="1" type="P" spid="2_5">
<sdescription mid="22" did="1" sid="1" spid="2_5">
the reverse of
<math mid="21" spid="2_6">
MATH_0801.2412_21
</math>
</sdescription>
</description>
, that is
<math mid="22" spid="2_7">
MATH_0801.2412_22
</math>
,
<cdescription mid="24" did="1" cid="1" spid="2_8">
contains the reverse of
<math mid="23" spid="2_9">
MATH_0801.2412_23
</math>
</cdescription>
, which is the
<description mid="24" did="1" type="P" spid="2_10">
<sdescription mid="24" did="1" sid="1" spid="2_10">
pattern
</sdescription>
</description>
<math mid="24" spid="2_11">
MATH_0801.2412_24
</math>
.
...
</section>

In this case, the description of MATH_0801.2412_19 is "a permutation". It is also the short description. The final expression MATH_0801.2412_24 has a complicated structure. It has a description "pattern" but also has a cdescription "contains the reverse of MATH_0801.2412_23".
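A reader could recover these mid-to-description pairs from a Notation 1 file with a standard XML parser. The sketch below uses Python's xml.etree.ElementTree on a shortened copy of the example; the function name descriptions_by_mid and the whitespace normalization are illustrative choices, not part of the task specification.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the Notation 1 example (first two expressions only).
SNIPPET = """<section>If
<description mid="19" did="1" type="P" spid="2_1"><sdescription mid="19" did="1" sid="1" spid="2_1">a permutation</sdescription></description>
<math mid="19" spid="2_2">MATH_0801.2412_19</math> contains
<description mid="20" did="1" type="P" spid="2_3"><sdescription mid="20" did="1" sid="1" spid="2_3">the pattern</sdescription></description>
<math mid="20" spid="2_4">MATH_0801.2412_20</math></section>"""


def descriptions_by_mid(xml_text):
    """Map each mid to the list of description texts that refer to it."""
    root = ET.fromstring(xml_text)
    result = {}
    for elem in root.iter("description"):
        # itertext() flattens nested sdescription/math content into plain text;
        # split/join collapses line breaks and repeated spaces.
        text = " ".join("".join(elem.itertext()).split())
        # A description may list several mids separated by spaces.
        for mid in elem.get("mid", "").split():
            result.setdefault(mid, []).append(text)
    return result


print(descriptions_by_mid(SNIPPET))
```

Iterating over "sdescription" instead of "description" would give the short descriptions in the same way.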

<section>
...
Namely, Simion and Stanton  essentially studied
<description mid="45 46 47 48" did="1 1 1 1" type="P P P P" spid="3_1">
<sdescription mid="45 46 47 48" did="1 1 1 1" sid="1 1 1 1" spid="3_1">
the patterns
</sdescription>
</description>

<math mid="45" spid="3_2">
MATH_0801.2412_45
</math>
,

<math mid="46" spid="3_3">
MATH_0801.2412_46
</math>
,

<math mid="47" spid="3_4">
MATH_0801.2412_47
</math>
, and

<math mid="48" spid="3_5">
MATH_0801.2412_48
</math>
and their relation to a set of orthogonal polynomials generalizing the Laguerre polynomials, and one of these patterns also played a crucial role in the proof by Foata and Zeilberger  that Denert's statistic is Mahonian.
...
</section>

In this case, a description "the patterns" covers 4 mathematical expressions.

<section>
...
<math mid="13" spid="33">
MATH_C04-1197_22
</math>
and
<math mid="14" spid="34">
MATH_C04-1197_23
</math>
are
<description mid="13 14" did="1 1" type="P P" spid="35">
the numbers of
</description>
<cdescription mid="13" did="1" cid="1" spid="36">

inequality
</cdescription>
and
<cdescription mid="14" did="1" cid="1" spid="37">
equality
</cdescription>
<cdescription mid="13 14" did="1 1" cid="2 2" spid="38">

constraints
</cdescription>
...
</section>

In this case, the description "the numbers of" is shared by two mathematical expressions. The cdescription tags are used to express "the numbers of inequality constraints" and "the numbers of equality constraints". Note that "constraints" is also shared by the two expressions.
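The way a shared description and its cdescription parts combine can be sketched in code. The example below pairs the space-separated mid, did and cid values position by position and appends the continuations to the main description in cid order. Both the pairing by position and the ordering by cid are assumptions drawn from the example above, and the names full_descriptions and clean are hypothetical.

```python
import xml.etree.ElementTree as ET

# Copy of the example above, with the math tags closed by </math>.
SNIPPET = """<section>
<math mid="13" spid="33">MATH_C04-1197_22</math> and
<math mid="14" spid="34">MATH_C04-1197_23</math> are
<description mid="13 14" did="1 1" type="P P" spid="35">the numbers of</description>
<cdescription mid="13" did="1" cid="1" spid="36">inequality</cdescription> and
<cdescription mid="14" did="1" cid="1" spid="37">equality</cdescription>
<cdescription mid="13 14" did="1 1" cid="2 2" spid="38">constraints</cdescription>
</section>"""


def clean(elem):
    """Flatten an element's text and collapse whitespace."""
    return " ".join("".join(elem.itertext()).split())


def full_descriptions(xml_text):
    """Return {(mid, did): description followed by its continuations}."""
    root = ET.fromstring(xml_text)
    heads = {}  # (mid, did) -> main description text
    tails = {}  # (mid, did) -> list of (cid, continuation text)
    for d in root.iter("description"):
        for mid, did in zip(d.get("mid").split(), d.get("did").split()):
            heads[(mid, did)] = clean(d)
    for c in root.iter("cdescription"):
        mids, dids, cids = (c.get(k).split() for k in ("mid", "did", "cid"))
        for mid, did, cid in zip(mids, dids, cids):
            tails.setdefault((mid, did), []).append((int(cid), clean(c)))
    # Assumption: sorting by cid reproduces the reading order.
    return {key: " ".join([head] + [t for _, t in sorted(tails.get(key, []))])
            for key, head in heads.items()}


for key, text in full_descriptions(SNIPPET).items():
    print(key, text)
```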

(b) Notation 2

Tag: Description

<section>: The section tag is the root tag of the document.

<content>: The content tag contains the original text. Its parent tag must be section.

<annotation>: The annotation tag indicates relations between a mathematical expression and its descriptions.

<span>: The content of the span element specifies a mathematical expression, a description, or a description-related item. This tag is only used inside the content tag. The id attribute must be of the form "{section ID}_{span ID}".

<math>: The attributes of the math tag specify a relation between a mathematical expression and its descriptions. This tag is only used inside the annotation tag. The attributes that can be used in the tag are shown below. All span references use the id attributes of the span tags. Where multiple values are allowed, they are separated by a space (" ").
• mid="ID": Each math tag has a unique ID. The relation between a mathematical expression and its description can be referred to by this ID. IDs are assigned as 1, 2, ... from the beginning of a file.
• tid="SPID": Specifies the mathematical expression span.
• count="N": Specifies the number of descriptions. If N=0, there are no descriptions for the mathematical expression.
• did="SPID1;SPID2;...": Contains references to the description spans.
• cid="SPID11,SPID12,...;SPID21,SPID22,...;...": Contains references to discontinuous description spans. Here SPID11 and SPID12 are discontinuous descriptions of the description SPID1.
• sid="SPID11,SPID12,...;SPID21,SPID22,...;...": Contains references to short descriptions.
• cnid="SPID11,SPID12,...;SPID21,SPID22,...;...": Contains references to conditions.
• type="TYPE1;TYPE2;...": Specifies the types of the descriptions of a mathematical expression. Each TYPE must be a sequence of one or more of the following characters: assumption: A, condition: C, proposition: P.
e.g. A partition MATH_0806.4135_10 of MATH_0806.4135_11
e.g. Suppose MATH_0806.4135_67 is an additive subgroup of MATH_0806.4135_68 which contains the inverses of each of its nonzero elements. Then MATH_0806.4135_69 is a subfield of MATH_0806.4135_70.

<mathgroup>: The attributes of the mathgroup tag are the same as those of the math tag. A mathgroup is usually assigned to a mathematical expression containing words.
e.g. Any block system MATH_0806.4135_55 of MATH_0806.4135_56 is the set of translates of a proper vector subspace MATH_0806.4135_57 of MATH_0806.4135_58, that is, MATH_0806.4135_59.

Example

<section>
<content>
If
<span id="2_1">
a permutation
</span>
<span id="2_2">
MATH_0801.2412_19
</span>
contains
<span id="2_3">
the pattern
</span>
<span id="2_4">
MATH_0801.2412_20
</span>
then clearly
<span id="2_5">
the reverse of
<span id="2_6">
MATH_0801.2412_21
</span>
</span>

, that is
<span id="2_7">
MATH_0801.2412_22
</span>
,
<span id="2_8">
contains the reverse of
<span id="2_9">
MATH_0801.2412_23
</span>
</span>

, which is the
<span id="2_10">
pattern
</span>
<span id="2_11">
MATH_0801.2412_24
</span>
...
</content>
<annotation>
<math mid="19" tid="2_2" count="1" did="2_1" sid="2_1" type="P"/>
<math mid="20" tid="2_4" count="1" did="2_3" sid="2_3" type="P"/>
<math mid="21" tid="2_6" count="0"/>
<math mid="22" tid="2_7" count="1" did="2_5" sid="2_5" type="P"/>
<math mid="23" tid="2_9" count="0"/>
<math mid="24" tid="2_11" count="1" did="2_10" cid="2_8" sid="2_10" type="P"/>

...
</annotation>
</section>

<section>
<content>

...
Namely, Simion and Stanton  essentially studied
<span id="3_1">
the patterns
</span>
<span id="3_2">
MATH_0801.2412_45
</span>
,
<span id="3_3">
MATH_0801.2412_46
</span>
,
<span id="3_4">
MATH_0801.2412_47
</span>
, and
<span id="3_5">
MATH_0801.2412_48
</span>
and their relation to a set of orthogonal polynomials generalizing the Laguerre polynomials, and one of these patterns also played a crucial role in the proof by Foata and Zeilberger  that Denert's statistic is Mahonian.
</content>

<annotation>

...
<math mid="45" tid="3_2" count="1" did="3_1" sid="3_1" type="P"/>
<math mid="46" tid="3_3" count="1" did="3_1" sid="3_1" type="P"/>
<math mid="47" tid="3_4" count="1" did="3_1" sid="3_1" type="P"/>

<math mid="48" tid="3_5" count="1" did="3_1" sid="3_1" type="P"/>
...
</annotation>

</section>

<section>
<content>
...
<span id="5_33">
MATH_C04-1197_22
</span>
and
<span id="5_34">
MATH_C04-1197_23
</span>
are
<span id="5_35">
the numbers of
</span>
<span id="5_36">

inequality
</span>
and
<span id="5_37">
equality
</span>
<span id="5_38">

constraints
</span>
...

</content>
<annotation>
...
<math mid="22" tid="5_33" count="1" did="5_35" cid="5_36,5_38" sid="5_35" type="P" />
<math mid="23" tid="5_34" count="1" did="5_35" cid="5_37,5_38" sid="5_35" type="P" />
...

</annotation>
</section>
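
In Notation 2 the annotation only stores span IDs, so a consumer has to resolve them against the content part. The following is a minimal sketch of that lookup, assuming the ";" and "," separators shown in the attribute formats above; the function name resolve and the output record layout are hypothetical, and mathgroup, sid, cnid and type are ignored for brevity.

```python
import xml.etree.ElementTree as ET

# Copy of the last Notation 2 example, without the "..." elisions.
SNIPPET = """<section>
<content>
<span id="5_33">MATH_C04-1197_22</span> and
<span id="5_34">MATH_C04-1197_23</span> are
<span id="5_35">the numbers of</span>
<span id="5_36">inequality</span> and
<span id="5_37">equality</span>
<span id="5_38">constraints</span>
</content>
<annotation>
<math mid="22" tid="5_33" count="1" did="5_35" cid="5_36,5_38" sid="5_35" type="P" />
<math mid="23" tid="5_34" count="1" did="5_35" cid="5_37,5_38" sid="5_35" type="P" />
</annotation>
</section>"""


def resolve(xml_text):
    """Replace span references in each math annotation by the span text."""
    root = ET.fromstring(xml_text)
    # Index every span (including nested ones) by its id attribute.
    spans = {s.get("id"): " ".join("".join(s.itertext()).split())
             for s in root.find("content").iter("span")}
    records = []
    for m in root.find("annotation").iter("math"):
        dids = m.get("did").split(";") if m.get("did") else []
        cid_groups = m.get("cid").split(";") if m.get("cid") else []
        descriptions = []
        for i, did in enumerate(dids):
            parts = [spans[did]]
            # The i-th cid group holds the continuations of the i-th did.
            if i < len(cid_groups) and cid_groups[i]:
                parts += [spans[c] for c in cid_groups[i].split(",")]
            descriptions.append(" ".join(parts))
        records.append({"mid": m.get("mid"),
                        "expression": spans[m.get("tid")],
                        "descriptions": descriptions})
    return records


for record in resolve(SNIPPET):
    print(record)
```

A math tag with count="0" has no did attribute, so it yields an empty descriptions list.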

