Task 1: Recognition and normalization of temporal expressions
Task definition
The aim of this task is to advance research on the processing of temporal expressions, which is needed in other NLP applications such as question answering, summarisation, textual entailment and document classification. The task follows on from previous TempEval events organised to evaluate time expressions for English and Spanish (SemEval-2013). This time we provide a corpus of Polish documents fully annotated with temporal expressions. The annotation consists of boundaries, classes and normalised values of temporal expressions. The original TIMEX3 annotation guidelines are available here. The annotation of the Polish texts is based on a modified version of the guidelines at the level of annotating boundaries/types (here) and local/global normalisation (here).
Training data
The training dataset contains 1500 documents from the KPWr corpus. Each document is an XML file with the annotations described above. Example:
<DOCID>344245.xml</DOCID>
<DCT><TIMEX3 tid="t0" functionInDocument="CREATION_TIME" type="DATE"
value="2006-12-16"></TIMEX3></DCT>
<TEXT>
<TIMEX3 tid="t1" type="DATE" value="2006-12-16">Dziś</TIMEX3> Creative
Commons obchodzi czwarte urodziny - przedsięwzięcie ruszyło dokładnie
<TIMEX3 tid="t2" type="DATE" value="2002-12-16">16 grudnia 2002</TIMEX3>
w San Francisco. (...) Z kolei w <TIMEX3 tid="t4" type="DATE"
value="2006-12-18">poniedziałek</TIMEX3> ogłoszone zostaną wyniki
głosowanie na najlepsze blogi. W ciągu <TIMEX3 tid="t5" type="DURATION"
value="P8D">8 dni</TIMEX3> internauci oddali ponad pół miliona głosów.
Z najnowszego raportu Gartnera wynika, że w <TIMEX3 tid="t6" type="DATE"
value="2007">przyszłym roku</TIMEX3> blogosfera rozrośnie się
do rekordowego rozmiaru 100 milionów blogów. (...)
</TEXT>
Training data is available here.
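For orientation, the snippet below shows one way to read the TIMEX3 annotations out of a training file with a regular expression. It is only an illustrative sketch (the helper read_timexes is ours, not part of any task toolkit), assuming the documents follow the layout shown above.

import re

# Illustrative sketch: extract TIMEX3 annotations from a training document.
TIMEX_RE = re.compile(r'<TIMEX3\s+([^>]*)>(.*?)</TIMEX3>', re.DOTALL)
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')

def read_timexes(path):
    """Return a list of dicts with tid, type, value and the surface text."""
    with open(path, encoding='utf-8') as f:
        doc = f.read()
    timexes = []
    for attrs, text in TIMEX_RE.findall(doc):
        entry = dict(ATTR_RE.findall(attrs))
        entry['text'] = text
        timexes.append(entry)
    return timexes

# read_timexes('344245.xml') would yield, among others,
# {'tid': 't1', 'type': 'DATE', 'value': '2006-12-16', 'text': 'Dziś'}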
Test data
The test dataset will contain about 100-200 documents. Each document follows the same TimeML format as the training data, but only the
<TIMEX3 tid="t0" functionInDocument="CREATION_TIME"/>
annotation is given. Example:
<DOCID>344245.xml</DOCID>
<DCT><TIMEX3 tid="t0" functionInDocument="CREATION_TIME" type="DATE"
value="2006-12-16"></TIMEX3></DCT>
<TEXT>
Dziś Creative Commons obchodzi czwarte urodziny - przedsięwzięcie ruszyło
dokładnie 16 grudnia 2002 w San Francisco. (...) Z kolei w poniedziałek
ogłoszone zostaną wyniki głosowanie na najlepsze blogi. W ciągu 8 dni
internauci oddali ponad pół miliona głosów. Z najnowszego raportu Gartnera
wynika, że w przyszłym roku blogosfera rozrośnie się do rekordowego
rozmiaru 100 milionów blogów. (...)
</TEXT>
Test data is available for download here.
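Presumably, a system returns the test documents with the TIMEX3 tags filled back into the TEXT element, i.e. in the same format as the training data. The sketch below reflects that assumption rather than an official specification: it wraps one recognised expression, given by character offsets, in a TIMEX3 tag, while the tid, type and value must come from your own recognition and normalisation components.

from xml.sax.saxutils import escape

def tag_timex(text, start, end, tid, timex_type, value):
    """Wrap the character span [start, end) of `text` in a TIMEX3 tag."""
    span = escape(text[start:end])
    tag = (f'<TIMEX3 tid="{tid}" type="{timex_type}" '
           f'value="{value}">{span}</TIMEX3>')
    return text[:start] + tag + text[end:]

# tag_timex(raw_text, 0, 4, 't1', 'DATE', '2006-12-16') turns
# "Dziś Creative Commons ..." into
# '<TIMEX3 tid="t1" type="DATE" value="2006-12-16">Dziś</TIMEX3> Creative Commons ...'
# When inserting several tags, apply them right-to-left so earlier offsets stay valid.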
Evaluation metrics
We use the same evaluation procedure as described in [1] (Section 4.1: Temporal Entity Extraction). We need to evaluate:
- How many entities are correctly identified,
- If the extents for the entities are correctly identified,
- How many entity attributes are correctly identified.
We use classical precision and recall for the recognition.
How many entities are correctly identified:
We evaluate the entities using the entity-based precision and recall below:
Precision = |Sys_entity ∩ Ref_entity| / |Sys_entity|
Recall = |Sys_entity ∩ Ref_entity| / |Ref_entity|
where Sys_entity contains the entities extracted by the system that we want to evaluate, and Ref_entity contains the entities from the reference annotation that are being compared.
If the extents for the entities are correctly identified:
We compare our entities using both strict match and relaxed match. When there is an exact match between the system entity and the gold entity, we call it a strict match, e.g. “16 grudnia 2002” vs “16 grudnia 2002”. When there is an overlap between the system entity and the gold entity, we call it a relaxed match, e.g. “16 grudnia 2002” vs “2002”. When there is a relaxed match, we also compare the attribute values.
How many entity attributes are correctly identified:
We evaluate the entity attributes using the attribute F1-score, which captures how well the system identified both the entity and its attribute together.
We measure P, R and F1 for both strict and relaxed match, as well as the relaxed attribute F1 for the value and type attributes. The most important metric is the relaxed F1-score for the value attribute.
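The snippet below restates these scores in code. It is only a rough sketch under our own representation (entities as dicts with a character span, a type and a value), not the official evaluation tool.

def overlaps(a, b):
    """Relaxed match: the two (start, end) spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]

def prf(sys_items, ref_items, match):
    """Precision, recall and F1 under a given matching predicate."""
    tp_sys = sum(1 for s in sys_items if any(match(s, r) for r in ref_items))
    tp_ref = sum(1 for r in ref_items if any(match(s, r) for s in sys_items))
    p = tp_sys / len(sys_items) if sys_items else 0.0
    r = tp_ref / len(ref_items) if ref_items else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Entities are dicts such as {'span': (10, 25), 'type': 'DATE', 'value': '2002-12-16'}.
strict   = lambda s, r: s['span'] == r['span']
relaxed  = lambda s, r: overlaps(s['span'], r['span'])
value_ok = lambda s, r: relaxed(s, r) and s['value'] == r['value']
type_ok  = lambda s, r: relaxed(s, r) and s['type'] == r['type']

# prf(sys_entities, ref_entities, relaxed) -> (P, R, F1) for relaxed match;
# prf(sys_entities, ref_entities, value_ok) -> the attribute score for value.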
Evaluation script
We use the evaluation script provided by Naushad UzZaman on GitHub:
git clone https://github.com/naushadzaman/tempeval3_toolkit.git
SemEval2013 example usage
cd tempeval3_toolkit
python TE3-evaluation.py data/gold data/system
> === Timex Performance ===
> Strict Match F1 P R
> 87.5 84.85 90.32
> Relaxed Match F1 P R
> 93.75 90.91 96.77
> Attribute F1 Value Type
> 50.0 93.75
PolEval2019 example usage
cd tempeval3_toolkit
wget http://tools.clarin-pl.eu/share/PolEval2019_TIMEX3.zip
7za x PolEval2019_TIMEX3.zip
python TE3-evaluation.py PolEval2019_TIMEX3/example/test PolEval2019_TIMEX3/example/test-gold/
> === Timex Performance ===
> Strict Match F1 P R
> 0.0 0.0 0.0
> Relaxed Match F1 P R
> 0.0 0.0 0.0
> Attribute F1 Value Type
> 0.0 0.0
References
[1] UzZaman, N., Llorens, H., Derczynski, L., Allen, J., Verhagen, M., & Pustejovsky, J. (2013). SemEval-2013 Task 1: TempEval-3: Evaluating Time Expressions, Events, and Temporal Relations. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pp. 1-9. URL