LX-Tagger takes raw Portuguese text as input, splits it into sentences, tokenizes it, and annotates each token with a part-of-speech tag.
The LX-Tagger service is composed of a set of shallow processing tools:
LX Sentence Splitter:
Marks sentence boundaries with <s>…</s>, and paragraph boundaries with <p>…</p>.
Unwraps sentences split over different lines.
An F-score of 99.94% was obtained when testing on a 12,000-sentence corpus that was carefully hand-tagged for sentence and paragraph boundaries.
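As an illustration of how the boundary markup described above could be consumed downstream, here is a minimal Python sketch. It assumes the output arrives as a single string in which <s>…</s> elements are nested inside <p>…</p> elements; the function name read_split_output and the sample string are hypothetical and are not part of the service.

```python
import re


def read_split_output(marked_up: str) -> list[list[str]]:
    """Return one list of sentences per paragraph found in the markup."""
    paragraphs = []
    # DOTALL lets a paragraph or sentence span several lines.
    for para in re.findall(r"<p>(.*?)</p>", marked_up, flags=re.DOTALL):
        sentences = [
            s.strip() for s in re.findall(r"<s>(.*?)</s>", para, flags=re.DOTALL)
        ]
        paragraphs.append(sentences)
    return paragraphs


# Hypothetical sample, written by hand for this sketch.
sample = (
    "<p><s>Isto é um exemplo.</s><s>Outra frase.</s></p>"
    "<p><s>Novo parágrafo.</s></p>"
)
paragraphs = read_split_output(sample)
```

A regular expression is enough here only because the two element types are assumed not to nest within themselves; a real consumer may prefer a proper markup parser.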
LX Tokenizer:
Segments text into lexically relevant tokens, using whitespace as the separator. In the examples below, the | (vertical bar) symbol is used only to make the token boundaries easier to see.
um exemplo → |um|exemplo|
Expands contractions. Note that the first element of an expanded contraction is marked with an _ (underscore) symbol:
do → |de_|o|
Marks spacing around punctuation or symbols. The \* and the */ symbols indicate a space to the left and a space to the right, respectively:
um, dois e três → |um|,*/|dois|e|três|
5.3 → |5|.|3|
1. 2 → |1|.*/|2|
8 . 6 → |8|\*.*/|6|
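The spacing marks make it possible to recover the original spacing around punctuation. The Python sketch below shows one way this could be done for the bar-delimited notation used in these examples. It covers only the \* and */ marks (not contractions or clitics), and it assumes that a token with no letters or digits is punctuation and that plain word tokens are always separated by a space. The names parse_token and restore_spacing are hypothetical.

```python
LEFT_MARK = "\\*"   # a space preceded the symbol in the original text
RIGHT_MARK = "*/"   # a space followed the symbol in the original text


def parse_token(raw: str) -> tuple[str, bool, bool]:
    """Strip the spacing marks and report which ones were present."""
    space_left = raw.startswith(LEFT_MARK)
    if space_left:
        raw = raw[len(LEFT_MARK):]
    space_right = raw.endswith(RIGHT_MARK)
    if space_right:
        raw = raw[: -len(RIGHT_MARK)]
    return raw, space_left, space_right


def is_punctuation(text: str) -> bool:
    # Assumption of this sketch: no letters or digits means punctuation.
    return not any(ch.isalnum() for ch in text)


def restore_spacing(marked: str) -> str:
    """Rebuild the original string from a bar-delimited token line."""
    tokens = [parse_token(t) for t in marked.split("|") if t]
    pieces = []
    for i, (text, space_left, _) in enumerate(tokens):
        if i > 0:
            prev_text, _, prev_space_right = tokens[i - 1]
            # Words imply a space; punctuation needs an explicit mark.
            after_prev = prev_space_right or not is_punctuation(prev_text)
            before_this = space_left or not is_punctuation(text)
            if after_prev and before_this:
                pieces.append(" ")
        pieces.append(text)
    return "".join(pieces)
```

Applied to the four examples above, restore_spacing returns the left-hand side of each arrow, for instance restore_spacing(r"|8|\*.*/|6|") gives "8 . 6".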
Detaches clitic pronouns from the verb. The detached pronoun is marked with a - (hyphen) symbol. When in mesoclisis, a -CL- mark is used to signal the original position of the detached clitic. Additionally, possible vocalic alterations of the verb form are marked with a # (hash) symbol:
dá-se-lho → |dá|-se|-lhe_|-o|
afirmar-se-ia → |afirmar-CL-ia|-se|
vê-las → |vê#|-las|
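To make the role of the three clitic marks concrete, the following Python sketch reverses the detachment for a single verb written in the bar-delimited notation of these examples. It is an illustration under stated assumptions, not the tool's own logic: the function name reattach_clitics is hypothetical, the first token is assumed to be the verb, and contracted clitics such as lhe_ + o are not merged back into lho.

```python
def reattach_clitics(marked: str) -> str:
    """Rebuild the surface verb form from one bar-delimited example."""
    tokens = [t for t in marked.split("|") if t]
    verb, clitics = tokens[0], tokens[1:]
    if not all(c.startswith("-") for c in clitics):
        raise ValueError("every token after the verb must be a detached clitic")
    attached = "".join(clitics)
    if "-CL-" in verb:
        # Mesoclisis: the clitics return to the position of the -CL- mark.
        return verb.replace("-CL-", attached + "-")
    # Enclisis: drop the # alteration mark and append the clitics.
    return verb.rstrip("#") + attached
```

With these assumptions, reattach_clitics("|afirmar-CL-ia|-se|") gives "afirmar-se-ia" and reattach_clitics("|vê#|-las|") gives "vê-las".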
This tool also handles ambiguous strings. These are words that, depending on their particular occurrence, can be tokenized in different ways. For instance:
deste → |deste| when occurring as a Verb
deste → |de_|este| when occurring as a contraction (Preposition + Demonstrative)
This tool achieves an F-score of 99.72%.
LX POS Tagger:
Assigns a single morpho-syntactic tag from the tagset to every token. The tag is attached to the token with a / (slash) symbol as the separator:
um exemplo → um/IA exemplo/CN
Each individual token in a multi-token expression gets the tag of that expression, prefixed by "L" and followed by the number of its position within the expression:
de maneira a que → de/LCJ1 maneira/LCJ2 a/LCJ3 que/LCJ4
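The token/tag notation and the L-prefixed tags can be read back with a few lines of Python. The sketch below is a hedged illustration: the function names parse_tagged and group_expressions are hypothetical, tagged items are assumed to be separated by whitespace, and expression tags are assumed to consist of "L", uppercase letters, and a trailing position number.

```python
import re

# Assumed shape of a multi-token tag: L + expression tag + position.
EXPRESSION_TAG = re.compile(r"^L([A-Z]+)(\d+)$")


def parse_tagged(line: str) -> list[tuple[str, str]]:
    """Split 'token/TAG' items; the last slash separates token and tag."""
    pairs = []
    for item in line.split():
        token, _, tag = item.rpartition("/")
        pairs.append((token, tag))
    return pairs


def group_expressions(pairs: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Merge the tokens of each multi-token expression into one entry."""
    grouped: list[tuple[str, str]] = []
    for token, tag in pairs:
        match = EXPRESSION_TAG.match(tag)
        if match is None:
            grouped.append((token, tag))
        elif match.group(2) == "1" or not grouped:
            # Position 1 opens a new expression carrying the bare tag.
            grouped.append((token, match.group(1)))
        else:
            # Later positions extend the expression opened before them.
            prev_token, prev_tag = grouped[-1]
            grouped[-1] = (prev_token + " " + token, prev_tag)
    return grouped
```

For the example above, group_expressions(parse_tagged("de/LCJ1 maneira/LCJ2 a/LCJ3 que/LCJ4")) yields a single entry, ("de maneira a que", "CJ"). Splitting on the last slash keeps tokens that themselves end in a slash, such as a punctuation token carrying the */ mark, intact.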
This tagger was developed using the TnT software on a manually annotated 260k-token corpus. An accuracy of 96.87% was obtained under 10-fold cross-validation.
These tools work in a pipeline scheme, where each tool takes as input the output of the previous tool.
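The pipeline scheme can be sketched in Python as plain function composition. Everything below is illustrative: make_pipeline and the three toy stages are hypothetical stand-ins written for this sketch, the /X tag is a placeholder, and the real tools are far more elaborate.

```python
import re
from functools import reduce
from typing import Callable

Stage = Callable[[str], str]


def make_pipeline(*stages: Stage) -> Stage:
    """Chain stages so each one receives the previous stage's output."""
    def run(text: str) -> str:
        return reduce(lambda data, stage: stage(data), stages, text)
    return run


def toy_splitter(text: str) -> str:
    # Stand-in: treats the whole input as one sentence in one paragraph.
    return f"<p><s>{text.strip()}</s></p>"


def toy_tokenizer(marked_up: str) -> str:
    # Stand-in: collapses runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", marked_up)


def toy_tagger(marked_up: str) -> str:
    # Stand-in: attaches the placeholder tag X to every token.
    def tag_sentence(match: re.Match) -> str:
        tagged = " ".join(tok + "/X" for tok in match.group(1).split())
        return "<s>" + tagged + "</s>"
    return re.sub(r"<s>(.*?)</s>", tag_sentence, marked_up)


pipeline = make_pipeline(toy_splitter, toy_tokenizer, toy_tagger)
result = pipeline("um   exemplo")
```

Because each stage only sees the previous stage's output, the order of the arguments to make_pipeline fixes the order of processing, which mirrors the splitter, tokenizer, tagger sequence described above.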
Language Coverage
Portuguese
Target Users
Language professionals
Get Started with the service
Free for non-commercial research. Terms for other uses are to be negotiated.