CET-AT Service Catalogue

LX-Tagger

Provided by:

University of Lisbon (NLX-GROUP) - Natural Language and Speech Group, Department of Informatics, Faculty of Sciences

Function: Language Technologies

Task: Morphological Annotation, Paragraph Splitting, Sentence Splitting, Tokenization

About

LX-Tagger takes raw Portuguese text as input, splits it into sentences, tokenizes it, and annotates each token with a part-of-speech tag.

The LX-Tagger service is composed of a set of shallow processing tools:

LX Sentence Splitter:

Marks sentence boundaries with <s>…</s>, and paragraph boundaries with <p>…</p>.
Unwraps sentences split over different lines.
An F-score of 99.94% was obtained when testing on a 12,000-sentence corpus accurately hand-tagged for sentence and paragraph boundaries.
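
As an illustration of how the markup described above might be consumed downstream, the following minimal Python sketch collects the sentences of each paragraph from text marked with <p>…</p> and <s>…</s>. It assumes that sentences are nested inside paragraphs and that the tags carry no attributes; the function name extract_sentences and the sample string are hypothetical and are not part of the service.

```python
import re


def extract_sentences(marked_text: str) -> list[list[str]]:
    """Return one list of sentences per paragraph.

    Assumption: <s>...</s> elements are nested inside <p>...</p> elements
    and neither tag carries attributes.
    """
    paragraphs = re.findall(r"<p>(.*?)</p>", marked_text, flags=re.DOTALL)
    return [
        [s.strip() for s in re.findall(r"<s>(.*?)</s>", p, flags=re.DOTALL)]
        for p in paragraphs
    ]


# Hypothetical sample, written by hand to follow the markup described above.
sample = (
    "<p><s>Isto é um exemplo.</s><s>Aqui está outro.</s></p>"
    "<p><s>Novo parágrafo.</s></p>"
)
paragraphs = extract_sentences(sample)
```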

LX-Tokenizer:

Segments text into lexically relevant tokens, using whitespace as the separator. Note that, in these examples, the | (vertical bar) symbol is used to mark the token boundaries more clearly.

um exemplo → |um|exemplo|

Expands contractions. Note that the first element of an expanded contraction is marked with an _ (underscore) symbol:

do → |de_|o|

Marks spacing around punctuation or symbols. The \* and the */ symbols indicate a space to the left and a space to the right, respectively:

um, dois e três → |um|,*/|dois|e|três|

5.3 → |5|.|3|

1. 2 → |1|.*/|2|

8 . 6 → |8|\*.*/|6|
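
The spacing marks make it possible to recover the original spacing around punctuation. The Python sketch below is one way to do this for the four examples above. It is not the tool's own code: it assumes that ordinary tokens are separated by a single space, that a punctuation or symbol token has a space on a given side only when the corresponding mark is present, and that any token without letters or digits counts as punctuation. The function names are hypothetical, and contractions and clitics are not handled.

```python
def split_marks(token: str) -> tuple[str, bool, bool]:
    """Strip the \\* (left) and */ (right) spacing marks from a token."""
    has_left = token.startswith("\\*") and len(token) > 2
    if has_left:
        token = token[2:]
    has_right = token.endswith("*/") and len(token) > 2
    if has_right:
        token = token[:-2]
    return token, has_left, has_right


def is_symbol(core: str) -> bool:
    """Assumption: a token with no letters or digits is punctuation or a symbol."""
    return not any(ch.isalnum() for ch in core)


def detokenize(bar_delimited: str) -> str:
    """Rebuild the surface string from a |-delimited token sequence."""
    parsed = [split_marks(t) for t in bar_delimited.strip("|").split("|")]
    pieces: list[str] = []
    for i, (core, has_left, _has_right) in enumerate(parsed):
        if i > 0:
            prev_core, _prev_left, prev_has_right = parsed[i - 1]
            prev_is_symbol = is_symbol(prev_core)
            cur_is_symbol = is_symbol(core)
            if prev_is_symbol or cur_is_symbol:
                # Next to punctuation, a space exists only where it is marked.
                needs_space = (prev_is_symbol and prev_has_right) or (
                    cur_is_symbol and has_left
                )
            else:
                # Between two ordinary tokens a space is assumed.
                needs_space = True
            if needs_space:
                pieces.append(" ")
        pieces.append(core)
    return "".join(pieces)
```

For instance, detokenize(r"|8|\*.*/|6|") gives back "8 . 6", while detokenize("|5|.|3|") gives back "5.3".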

Detaches clitic pronouns from the verb. The detached pronoun is marked with a - (hyphen) symbol. When in mesoclisis, a -CL- mark is used to signal the original position of the detached clitic. Additionally, possible vocalic alterations of the verb form are marked with a # (hash) symbol:

dá-se-lho → |dá|-se|-lhe_|-o|

afirmar-se-ia → |afirmar-CL-ia|-se|

vê-las → |vê#|-las|
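
The marks introduced so far (_ for the first element of a contraction, - for a detached clitic, -CL- for mesoclisis, # for a vocalic alteration) can be read off a token with simple string tests. The Python sketch below shows one possible way of doing so; the function name token_flags and the flag labels are hypothetical, chosen only for this illustration.

```python
def token_flags(token: str) -> set[str]:
    """Report which tokenizer marks a single token carries.

    The flag names are illustrative labels, not part of the tool's output.
    """
    flags: set[str] = set()
    if token.endswith("_"):
        flags.add("contraction_first_element")
    # A clitic is a hyphen followed by letters, optionally ending in "_".
    # This keeps a bare hyphen, or a hyphen with spacing marks, from matching.
    if token.startswith("-") and token[1:].rstrip("_").isalpha():
        flags.add("clitic")
    if "-CL-" in token:
        flags.add("mesoclisis_host")
    if token.endswith("#"):
        flags.add("altered_verb_form")
    return flags


# Tokens taken from the examples above.
flags_by_token = {
    t: token_flags(t) for t in ["dá", "-se", "-lhe_", "-o", "afirmar-CL-ia", "vê#", "-las"]
}
```

Note that a token can carry more than one mark: -lhe_ is both a detached clitic and the first element of an expanded contraction.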

This tool also handles ambiguous strings. These are words that, depending on their particular occurrence, can be tokenized in different ways. For instance:

deste → |deste| when occurring as a Verb

deste → |de_|este| when occurring as a contraction (Preposition + Demonstrative)

This tool achieves an F-score of 99.72%.

LX-Tagger:

Assigns a single morpho-syntactic tag, from the tagset, to every token. The tag is attached to the token by using a / (slash) symbol as separator:

um exemplo → um/IA exemplo/CN

Each individual token in a multi-token expression gets the tag of that expression prefixed by "L" and followed by the number of its position within the expression:

de maneira a que → de/LCJ1 maneira/LCJ2 a/LCJ3 que/LCJ4
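
To illustrate how this output format might be read, the Python sketch below splits each token/TAG pair at the last slash and then merges the tokens of a multi-token expression back into a single entry carrying the expression's tag. It assumes that pairs are separated by whitespace and that every tag of the form L + uppercase letters + digits marks a multi-token expression; the function names are hypothetical.

```python
import re

# Assumption: tags such as LCJ1, LCJ2, ... mark positions in a multi-token expression.
MWE_TAG = re.compile(r"^L([A-Z]+)(\d+)$")


def parse_tagged(line: str) -> list[tuple[str, str]]:
    """Split 'token/TAG' items; the last slash is the separator."""
    pairs = []
    for item in line.split():
        form, tag = item.rsplit("/", 1)
        pairs.append((form, tag))
    return pairs


def group_expressions(pairs: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Merge consecutive L<TAG><n> tokens into one (expression, TAG) entry."""
    grouped: list[tuple[str, str]] = []
    last_pos = 0  # position of the last token added to an open expression
    for form, tag in pairs:
        match = MWE_TAG.match(tag)
        if match is None:
            grouped.append((form, tag))
            last_pos = 0
            continue
        base, pos = match.group(1), int(match.group(2))
        if pos > 1 and pos == last_pos + 1 and grouped[-1][1] == base:
            # Continue the expression opened by the previous token(s).
            grouped[-1] = (grouped[-1][0] + " " + form, base)
        else:
            grouped.append((form, base))
        last_pos = pos
    return grouped


simple = parse_tagged("um/IA exemplo/CN")
merged = group_expressions(parse_tagged("de/LCJ1 maneira/LCJ2 a/LCJ3 que/LCJ4"))
```

Splitting at the last slash rather than the first keeps tokens that themselves contain a slash, such as a punctuation token carrying the */ spacing mark, in one piece.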

This tagger was developed using the TnT software on a manually annotated 260k-token corpus. An accuracy of 96.87% was obtained under 10-fold cross-validation.

These tools work in a pipeline scheme, where each tool takes as input the output of the previous tool.
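
The pipeline scheme amounts to plain function composition. The Python sketch below only illustrates the order of application: the three stages are hypothetical placeholders that wrap their input in a label, not the actual tools, and the names run_pipeline and placeholder_stage are invented for this example.

```python
from functools import reduce
from typing import Callable

Stage = Callable[[str], str]


def run_pipeline(stages: list[Stage], raw_text: str) -> str:
    """Feed the output of each stage to the next one, in order."""
    return reduce(lambda text, stage: stage(text), stages, raw_text)


def placeholder_stage(name: str) -> Stage:
    """Hypothetical stand-in for a real tool: it only labels its input."""
    def apply(text: str) -> str:
        return f"{name}({text})"
    return apply


pipeline = [
    placeholder_stage("split"),     # stands in for the sentence splitter
    placeholder_stage("tokenize"),  # stands in for the tokenizer
    placeholder_stage("tag"),       # stands in for the tagger
]
result = run_pipeline(pipeline, "um exemplo")
```

Here result is the string "tag(tokenize(split(um exemplo)))", which makes the nesting order visible: the sentence splitter runs first and the tagger last.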

Language Coverage

Portuguese

Target Users

Researchers
Developers
Students
Language professionals

Get Started with the service

Free for non-commercial research. To be negotiated for other uses.

Support

Helpdesk: https://portulanclarin.net/helpdesk/

Access the service

Request - Web Service: https://portulanclarin.net/workbench/lx-tagger/