LX-Tagger

Provided by:
University of Lisbon (NLX-GROUP) - Natural Language and Speech Group, Department of Informatics, Faculty of Sciences


LX-Tagger takes raw Portuguese text as input, splits it into sentences, tokenizes it, and annotates each token with a part-of-speech tag.

The LX-Tagger service is composed of a set of shallow processing tools:

  • LX Sentence Splitter: 
    • Marks sentence boundaries with <s>…</s>, and paragraph boundaries with <p>…</p>.
    • Unwraps sentences split over different lines.
    • An F-score of 99.94% was obtained when testing on a 12,000-sentence corpus accurately hand-tagged for sentence and paragraph boundaries.
  • LX-Tokenizer:
    • Segments text into lexically relevant tokens, using whitespace as the separator. Note that, in these examples, the | (vertical bar) symbol is used to mark the token boundaries more clearly.

um exemplo → |um|exemplo|

    • Expands contractions. Note that the first element of an expanded contraction is marked with an _ (underscore) symbol:

do → |de_|o|

    • Marks spacing around punctuation or symbols. The \* and the */ symbols indicate a space to the left and a space to the right, respectively:

um, dois e três → |um|,*/|dois|e|três| 

5.3 → |5|.|3|

1. 2 → |1|.*/|2|

8 . 6 → |8|\*.*/|6|

    • Detaches clitic pronouns from the verb. The detached pronoun is marked with a - (hyphen) symbol. When in mesoclisis, a -CL- mark is used to signal the original position of the detached clitic. Additionally, possible vocalic alterations of the verb form are marked with a # (hash) symbol:

dá-se-lho → |dá|-se|-lhe_|-o| 

afirmar-se-ia → |afirmar-CL-ia|-se| 

vê-las → |vê#|-las| 

    • This tool also handles ambiguous strings. These are words that, depending on their particular occurrence, can be tokenized in different ways. For instance:

deste → |deste| when occurring as a Verb 

deste → |de_|este| when occurring as a contraction (Preposition + Demonstrative)

    • This tool achieves an F-score of 99.72%.
  • LX-Tagger:

    • Assigns a single morpho-syntactic tag, from the tagset, to every token. The tag is attached to the token by using a / (slash) symbol as separator:

um exemplo → um/IA exemplo/CN 

    • Each individual token in a multi-token expression gets the tag of that expression, prefixed by "L" and followed by the number of its position within the expression:

de maneira a que → de/LCJ1 maneira/LCJ2 a/LCJ3 que/LCJ4 

    • This tagger was developed using the TnT software on a manually annotated 260k-token corpus. An accuracy of 96.87% was obtained under 10-fold cross-validation.
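
The token/TAG notation and the "L" prefix for multi-token expressions described above can be read back with a few lines of code. The following is a minimal sketch, not part of the service: the function names parse_tagged and group_expressions are hypothetical, and the pattern for multi-token tags (the letter L, an uppercase tag, then a position number) is an assumption inferred from the de/LCJ1 example.

```python
import re
from typing import List, Tuple

# Assumed shape of a multi-token tag: "L" + expression tag + position (e.g. LCJ1).
MULTI_TOKEN_TAG = re.compile(r"^L([A-Z]+)(\d+)$")


def parse_tagged(line: str) -> List[Tuple[str, str]]:
    """Split whitespace-separated 'token/TAG' items into (token, tag) pairs.

    The tag is taken after the LAST slash, so a token that itself ends in a
    slash (such as the ',*/' spacing mark) is kept intact.
    """
    pairs = []
    for item in line.split():
        token, _, tag = item.rpartition("/")
        pairs.append((token, tag))
    return pairs


def group_expressions(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Merge consecutive tokens of one multi-token expression into one entry."""
    grouped: List[Tuple[str, str]] = []
    for token, tag in pairs:
        match = MULTI_TOKEN_TAG.match(tag)
        if match is None:
            grouped.append((token, tag))  # ordinary single-token tag
            continue
        base, position = match.group(1), int(match.group(2))
        if position > 1 and grouped and grouped[-1][1] == base:
            # Continuation of the expression started by the previous token.
            grouped[-1] = (grouped[-1][0] + " " + token, base)
        else:
            grouped.append((token, base))
    return grouped


if __name__ == "__main__":
    print(parse_tagged("um/IA exemplo/CN"))
    print(group_expressions(parse_tagged("de/LCJ1 maneira/LCJ2 a/LCJ3 que/LCJ4")))
```

Splitting on the last slash rather than the first is a deliberate choice here, because the tokenizer's spacing marks can leave a slash inside the token itself.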

These tools work in a pipeline scheme, where each tool takes as input the output of the previous tool.
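
Because each tool consumes the output of the previous one, downstream code may need to interpret the marks that LX-Tokenizer leaves on tokens. The sketch below decodes the marks described above (\* and */ for spacing, _ for the first element of a contraction, - for a detached clitic, # for a vocalic alteration, -CL- for mesoclisis). The function names and the returned dictionary layout are hypothetical, and the | delimiter is only the display convention used in the examples, so this illustrates the notation rather than the service's actual output format.

```python
from typing import Dict, List


def split_tokens(marked: str) -> List[str]:
    """Split the '|tok|tok|' display notation into a list of tokens."""
    return [tok for tok in marked.strip().split("|") if tok]


def decode_token(token: str) -> Dict[str, object]:
    """Strip tokenizer marks from one token and report what they signalled."""
    info: Dict[str, object] = {
        "form": token,
        "space_left": False,        # \* mark: space to the left
        "space_right": False,       # */ mark: space to the right
        "clitic": False,            # leading hyphen: detached clitic
        "contraction_head": False,  # trailing underscore: first part of a contraction
        "verb_altered": False,      # trailing hash: vocalic alteration of the verb
        "mesoclisis": False,        # -CL- marks the clitic's original position
    }
    form = token
    if form.startswith("\\*"):
        info["space_left"] = True
        form = form[2:]
    if form.endswith("*/"):
        info["space_right"] = True
        form = form[:-2]
    # The length checks keep a bare '-', '_' or '#' token from being emptied.
    if len(form) > 1 and form.startswith("-"):
        info["clitic"] = True
        form = form[1:]
    if len(form) > 1 and form.endswith("_"):
        info["contraction_head"] = True
        form = form[:-1]
    if len(form) > 1 and form.endswith("#"):
        info["verb_altered"] = True
        form = form[:-1]
    if "-CL-" in form:
        info["mesoclisis"] = True
    info["form"] = form
    return info


if __name__ == "__main__":
    for tok in split_tokens("|dá|-se|-lhe_|-o|"):
        print(decode_token(tok))
```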

Language Coverage
Portuguese

Target Users
  • Researchers
  • Developers
  • Students
  • Language professionals
Get Started with the service

Free for non-commercial research. To be negotiated for other uses.

Support

Helpdesk: https://portulanclarin.net/helpdesk/

Access the service

Request - Web Service : https://portulanclarin.net/workbench/lx-tagger/
