Abstract
Discourse relations connect two or more segments. Segmentation is an important step in the process of
annotating discourse relations, but it is often not discussed extensively in annotation methods or manuals.
Ideally, implementing segmentation rules results in text segments that correspond to the units of thought
related to each other. However, in many widely used annotation systems this does not always seem to be the
case. Most formalized segmentation rules (e.g. Carlson & Marcu, 2001; Mann & Thompson, 1988; Reese,
Hunter, Asher, Denis, & Baldridge, 2007; Sanders & van Wijk, 1996) would, for instance, not allow segmenting
the conditional relation in (1), either because too many elements in S1 have been elided or because the
segment following if would break up a larger unit. Still, the segmentation indicated in (1) seems very plausible
and exactly captures the two segments related by the connective if.
(1) (context: The virus harms cold-blooded animals.) It does not replicate at temperatures above 25° centigrade and [would,]S2a if [present in fish for human consumption,]S1 [be inactivated when ingested.]S2b (ep 00-03-01)
In this presentation we discuss fragments encountered during an annotation effort of (explicit) local discourse relations from the Europarl corpus (Koehn, 2005) that are problematic to segment under most segmentation guidelines. We focus on three specific problems: ellipsis, complement structures, and perspective markers. We propose segmentation options that result in segments that do justice to the interpretation of the discourse relation, and we use translations (from the Europarl Direct corpus; Cartoni, Zufferey, & Meyer, 2013) as additional support for our analysis. Finally, we explore ways to formulate rules that produce text segments that do justice to interpretation. We conclude that segmentation is in part dependent on the propositional content of text fragments, and that completely separating segmentation and annotation (i.e. treating them as a two-step process) does not always yield text segments that correspond to the text units between which a conceptual relationship (potentially signaled by a connective) holds (see also Verhagen, 2001). Although relying partly on the content of a text fragment results in better text segmentation, it in turn raises problems for (semi-)automatically segmenting texts. Identifying specific problems, such as the ones addressed here, and being more explicit about the segmentation strategies used in the annotation of discourse relations are important steps toward solving these problems.
Original language | English |
---|---|
Publication status | Published - 25 Jan 2016 |
Event | LPTS2016 - Valencia, Spain; Duration: 24 Jan 2016 → 26 Jan 2016 |
Conference
Conference | LPTS2016 |
---|---|
Country/Territory | Spain |
City | Valencia |
Period | 24/01/16 → 26/01/16 |