LX-Tagger takes raw Portuguese text as input, splits it into sentences, tokenizes it, and annotates each token with a part-of-speech tag.
The LX-Tagger service is composed of a set of shallow processing tools:
LX Sentence Splitter:
Marks sentence boundaries with <s>…</s>, and paragraph boundaries with <p>…</p>.
Unwraps sentences split over different lines.
An F-score of 99.94% was obtained when testing on a 12,000-sentence corpus accurately hand-tagged for sentence and paragraph boundaries.
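As an illustration of how the <s>…</s> and <p>…</p> marks might be consumed downstream, here is a minimal Python sketch. It assumes the marks appear literally in the output and are not nested within themselves; the function name `read_marked_text` is hypothetical and not part of the service.

```python
import re

# Assumption: paragraphs and sentences are delimited by literal,
# non-nested <p>...</p> and <s>...</s> marks, as described above.
PARAGRAPH = re.compile(r"<p>(.*?)</p>", re.DOTALL)
SENTENCE = re.compile(r"<s>(.*?)</s>", re.DOTALL)


def read_marked_text(marked):
    """Return a list of paragraphs, each a list of sentence strings."""
    return [
        [sentence.strip() for sentence in SENTENCE.findall(paragraph)]
        for paragraph in PARAGRAPH.findall(marked)
    ]


if __name__ == "__main__":
    sample = "<p><s>Um exemplo.</s><s>Outro exemplo.</s></p><p><s>Fim.</s></p>"
    for paragraph in read_marked_text(sample):
        print(paragraph)
```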
LX-Tokenizer:
Segments text into lexically relevant tokens, using whitespace as the separator. Note that, in these examples, the | (vertical bar) symbol is used to mark the token boundaries more clearly.
um exemplo → |um|exemplo|
Expands contractions. Note that the first element of an expanded contraction is marked with an _ (underscore) symbol:
do → |de_|o|
Marks spacing around punctuation or symbols. The \* and the */ symbols indicate a space to the left and a space to the right, respectively:
um, dois e três → |um|,*/|dois|e|três|
5.3 → |5|.|3|
1. 2 → |1|.*/|2|
8 . 6 → |8|\*.*/|6|
Detaches clitic pronouns from the verb. The detached pronoun is marked with a - (hyphen) symbol. When in mesoclisis, a -CL- mark is used to signal the original position of the detached clitic. Additionally, possible vocalic alterations of the verb form are marked with a # (hash) symbol:
dá-se-lho → |dá|-se|-lhe_|-o|
afirmar-se-ia → |afirmar-CL-ia|-se|
vê-las → |vê#|-las|
This tool also handles ambiguous strings. These are words that, depending on their particular occurrence, can be tokenized in different ways. For instance:
deste → |deste| when occurring as a Verb
deste → |de_|este| when occurring as a contraction (Preposition + Demonstrative)
This tool achieves an F-score of 99.72%.
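The marks described above can be interpreted mechanically. The following Python sketch is not the tokenizer itself; it only reads the notation used in the examples. It restores the original spacing from the \* and */ marks and labels the contraction, clitic, and verb-alteration marks. All function names are hypothetical, and the sketch assumes that a token with no letters or digits is a punctuation mark or symbol. It does not attempt to rejoin contractions or reattach clitics.

```python
LEFT_SPACE = r"\*"   # a space preceded the symbol in the raw text
RIGHT_SPACE = "*/"   # a space followed the symbol in the raw text


def split_bar_notation(example):
    """Split the |tok|tok| notation used in the examples into tokens."""
    return [token for token in example.split("|") if token]


def parse_token(token):
    """Return (text, is_symbol, space_left, space_right) for one token."""
    space_left = token.startswith(LEFT_SPACE)
    if space_left:
        token = token[len(LEFT_SPACE):]
    space_right = token.endswith(RIGHT_SPACE)
    if space_right:
        token = token[:-len(RIGHT_SPACE)]
    # Assumption: tokens without any letter or digit are symbols.
    is_symbol = not any(ch.isalnum() for ch in token)
    return token, is_symbol, space_left, space_right


def restore_spacing(tokens):
    """Rebuild the raw string, honouring the spacing marks on symbols."""
    parsed = [parse_token(token) for token in tokens]
    pieces = []
    for i, (text, is_symbol, space_left, _) in enumerate(parsed):
        if i > 0:
            _, prev_symbol, _, prev_right = parsed[i - 1]
            if prev_symbol or is_symbol:
                # Next to a symbol, a space exists only where it is marked.
                space = (prev_symbol and prev_right) or (is_symbol and space_left)
            else:
                space = True  # two ordinary tokens were separated by whitespace
            if space:
                pieces.append(" ")
        pieces.append(text)
    return "".join(pieces)


def token_marks(token):
    """Label the contraction and clitic marks carried by a token."""
    marks = set()
    if token.endswith("_"):
        marks.add("contraction_first")   # first element of an expanded contraction
    if len(token) > 1 and token[0] == "-" and token[1].isalpha():
        marks.add("clitic")              # detached clitic pronoun
    if token.endswith("#"):
        marks.add("altered_verb")        # verb form with a vocalic alteration
    if "-CL-" in token:
        marks.add("mesoclisis_host")     # original position of a detached clitic
    return marks


if __name__ == "__main__":
    print(restore_spacing(split_bar_notation("|um|,*/|dois|e|três|")))
    print(token_marks("-lhe_"))
```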
LX-Tagger:
Assigns a single morpho-syntactic tag, from the tagset, to every token. The tag is attached to the token by using a / (slash) symbol as separator:
um exemplo → um/IA exemplo/CN
Each individual token in multi-token expressions gets the tag of that expression prefixed by "L" and followed by the number of its position within the expression:
de maneira a que → de/LCJ1 maneira/LCJ2 a/LCJ3 que/LCJ4
This tagger was developed using the TnT software on a manually annotated 260k-token corpus. An accuracy of 96.87% was obtained under 10-fold cross-validation.
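To show how the token/TAG output and the L-prefixed multi-token tags might be read back, here is a short Python sketch. It assumes tokens are separated by whitespace, that the last slash in each item separates token from tag, and that any tag of the form L + letters + digits belongs to a multi-token expression. The function names are hypothetical.

```python
import re

# Assumption: multi-token tags look like L<TAG><position>, e.g. LCJ1, LCJ2.
MWE_TAG = re.compile(r"^L([A-Z]+)(\d+)$")


def parse_tagged(line):
    """Split 'token/TAG token/TAG ...' into (token, tag) pairs."""
    pairs = []
    for item in line.split():
        # Split on the LAST slash, since a token may itself contain '/'.
        token, _, tag = item.rpartition("/")
        pairs.append((token, tag))
    return pairs


def group_expressions(pairs):
    """Merge consecutive parts of a multi-token expression into one entry."""
    groups = []
    expected = None  # (base tag, position) that would continue the open expression
    for token, tag in pairs:
        match = MWE_TAG.match(tag)
        if match:
            base, position = match.group(1), int(match.group(2))
            if position > 1 and expected == (base, position):
                groups[-1][0].append(token)
            else:
                groups.append(([token], base))
            expected = (base, position + 1)
        else:
            groups.append(([token], tag))
            expected = None
    return [(" ".join(tokens), tag) for tokens, tag in groups]


if __name__ == "__main__":
    tagged = "de/LCJ1 maneira/LCJ2 a/LCJ3 que/LCJ4"
    print(group_expressions(parse_tagged(tagged)))
```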
These tools work in a pipeline scheme, where each tool takes as input the output of the previous tool.
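The pipeline scheme can be pictured as plain function composition. The Python sketch below only illustrates that idea: the three stage functions are toy stand-ins, not the actual tools, and "TAG" is a placeholder rather than a tag from the tagset.

```python
from functools import reduce


def run_pipeline(text, stages):
    """Feed the output of each stage into the next one, in order."""
    return reduce(lambda data, stage: stage(data), stages, text)


# Toy stand-ins for the three tools (hypothetical, for illustration only).
def toy_splitter(text):
    return f"<p><s>{text.strip()}</s></p>"


def toy_tokenizer(marked):
    return marked.replace("<s>", "<s> ").replace("</s>", " </s>")


def toy_tagger(tokens):
    return " ".join(
        item if item.startswith("<") else f"{item}/TAG"
        for item in tokens.split()
    )


if __name__ == "__main__":
    stages = [toy_splitter, toy_tokenizer, toy_tagger]
    print(run_pipeline("um exemplo", stages))
```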
Language Coverage
Portuguese
Target Users
Researchers
Developers
Students
Language professionals
Get Started with the service
Free for non-commercial research. To be negotiated for other uses.