LX-Suite performs shallow processing of Portuguese. It consists of the following tools:
LX Sentence Splitter:
Marks sentence boundaries with <s>…</s>, and paragraph boundaries with <p>…</p>.
Unwraps sentences split over different lines.
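The two behaviours above can be illustrated with a minimal sketch. It is not the LX Sentence Splitter itself: the function name mark_up is hypothetical, paragraphs are assumed to be separated by blank lines, and the boundary rule (a ., ! or ? followed by whitespace) is a deliberately naive stand-in that would break on abbreviations.

```python
import re

def mark_up(text: str) -> str:
    """Wrap paragraphs in <p>...</p> and sentences in <s>...</s>."""
    # Assumption: a blank line separates paragraphs.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    marked = []
    for para in paragraphs:
        # Unwrap: join the lines of a paragraph into one line.
        flat = " ".join(line.strip() for line in para.splitlines() if line.strip())
        # Naive rule: a sentence ends at ., ! or ? followed by whitespace.
        sentences = re.split(r"(?<=[.!?])\s+", flat)
        body = "".join(f"<s>{s}</s>" for s in sentences if s)
        marked.append(f"<p>{body}</p>")
    return "\n".join(marked)

print(mark_up("Isto é um\nexemplo. Outra frase.\n\nNovo parágrafo."))
```

The sentence "Isto é um exemplo." is split over two input lines and comes out as a single <s> element.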
LX-Tokenizer:
Segments text into lexically relevant tokens, using whitespace as the separator. Note that, in these examples, the | (vertical bar) symbol is used to mark the token boundaries more clearly.
um exemplo → |um|exemplo|
Expands contractions. Note that the first element of an expanded contraction is marked with an _ (underscore) symbol:
do → |de_|o|
Marks spacing around punctuation or symbols. The \* and the */ symbols indicate a space to the left and a space to the right, respectively:
um, dois e três → |um|,*/|dois|e|três|
5.3 → |5|.|3|
1. 2 → |1|.*/|2|
8 . 6 → |8|\*.*/|6|
Detaches clitic pronouns from the verb. The detached pronoun is marked with a - (hyphen) symbol. When in mesoclisis, a -CL- mark is used to signal the original position of the detached clitic. Additionally, possible vocalic alterations of the verb form are marked with a # (hash) symbol:
dá-se-lho → |dá|-se|-lhe_|-o|
afirmar-se-ia → |afirmar-CL-ia|-se|
vê-las → |vê#|-las|
This tool also handles ambiguous strings. These are words that, depending on their particular occurrence, can be tokenized in different ways. For instance:
deste → |deste| when occurring as a Verb
deste → |de_|este| when occurring as a contraction (Preposition + Demonstrative)
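As a rough illustration of the notation, the sketch below reproduces the whitespace, contraction and punctuation-spacing examples above. It is an assumption-laden toy, not the LX-Tokenizer: the function tokenize and the two-entry CONTRACTIONS table are hypothetical, clitics are not detached, and ambiguous strings such as deste are left as a single token because resolving them needs context.

```python
import re

# Tiny illustrative table (hypothetical); a real tokenizer needs a full
# lexicon and must disambiguate strings such as "deste" in context.
CONTRACTIONS = {"do": ("de_", "o"), "da": ("de_", "a")}

# A token is either a run of word characters or one punctuation/symbol character.
TOKEN_RE = re.compile(r"(?P<word>\w+)|(?P<punct>[^\w\s])")

def tokenize(text: str) -> str:
    tokens = []
    for m in TOKEN_RE.finditer(text):
        tok = m.group()
        if m.lastgroup == "word":
            # Expand known contractions; the first element carries the "_" mark.
            tokens.extend(CONTRACTIONS.get(tok, (tok,)))
            continue
        # Punctuation: record whether a space stood to its left or right.
        space_left = m.start() > 0 and text[m.start() - 1].isspace()
        space_right = m.end() < len(text) and text[m.end()].isspace()
        tokens.append(("\\*" if space_left else "") + tok + ("*/" if space_right else ""))
    return "|" + "|".join(tokens) + "|"

for sample in ["um exemplo", "do", "um, dois e três", "5.3", "1. 2", "8 . 6"]:
    print(sample, "→", tokenize(sample))
```

Because the marks record the original spacing, the input string can be rebuilt from the token sequence, which is why 5.3, 1. 2 and 8 . 6 yield three different outputs.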
LX-Tagger:
Assigns a single morpho-syntactic tag, from the tagset, to every token. The tag is attached to the token, using a / (slash) symbol as separator:
um exemplo → um/IA exemplo/CN
Each individual token in multi-token expressions gets the tag of that expression prefixed by "L" and followed by the number of its position within the expression:
de maneira a que → de/LCJ1 maneira/LCJ2 a/LCJ3 que/LCJ4
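Downstream code usually needs the tagged output as (token, tag) pairs, with multi-token expressions put back together. The sketch below shows one possible way to do that. The helper names are hypothetical, and it assumes that no ordinary tag in the tagset has the shape "L" + letters + digits.

```python
import re

# Assumption: only multi-token expression tags look like L<TAG><position>.
MWE_TAG = re.compile(r"L([A-Z]+)(\d+)")

def parse_tagged(line: str):
    """Split 'token/TAG token/TAG' into a list of (token, tag) pairs."""
    pairs = []
    for item in line.split():
        # Split at the last slash so the tag is always the final field.
        token, _, tag = item.rpartition("/")
        pairs.append((token, tag))
    return pairs

def group_expressions(pairs):
    """Merge tokens tagged L<TAG>1, L<TAG>2, ... into one (expression, TAG) entry."""
    grouped = []
    open_mwe = False  # True while the last entry is an expression being extended
    for token, tag in pairs:
        m = MWE_TAG.fullmatch(tag)
        if m is None:
            grouped.append((token, tag))
            open_mwe = False
        elif m.group(2) == "1" or not open_mwe:
            # Position 1 starts a new expression.
            grouped.append((token, m.group(1)))
            open_mwe = True
        else:
            expression, base_tag = grouped[-1]
            grouped[-1] = (expression + " " + token, base_tag)
    return grouped

print(parse_tagged("um/IA exemplo/CN"))
print(group_expressions(parse_tagged("de/LCJ1 maneira/LCJ2 a/LCJ3 que/LCJ4")))
```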
LX-Featurizer (nominal):
Assigns inflection feature values to words from the nominal categories, namely Gender (masculine or feminine), Number (singular or plural) and, when applicable, Person (1st, 2nd or 3rd):
os/DA gatos/CN → os/DA#mp gatos/CN#mp
Assigns degree feature values (diminutive, superlative and comparative) to words from the nominal categories:
os/DA gatinhos/CN → os/DA#mp gatinhos/CN#mp-dim
Some words are invariant, so the featurizer cannot always determine a feature value from the word form alone. In those cases, it assigns a g value for an underspecified Gender and an n value for an underspecified Number. Given adequate context, however, the featurizer may be able to resolve such cases.
Assigns a lemma to words from the nominal categories (Adjectives, Common Nouns and Past Participles). This lemma corresponds to the form that one would find in a dictionary, typically the masculine singular form. The lemma is inserted into the token, with / (slash) as a delimiter.
Assigns a lemma and inflection feature values to verbs. The lemma corresponds to the infinitive form of the verb. The lemma is inserted into the token, with / (slash) as a delimiter.
escrevi/V → escrevi/ESCREVER/V#ppi-1s
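The fully annotated tokens shown in the examples can be taken apart with a few lines of code. The sketch below is one possible reading of the format, not an official parser: the Annotated type and parse_token function are hypothetical, and it assumes that word forms and lemmas never contain a slash. The feature string is returned as-is, without decoding.

```python
from typing import NamedTuple, Optional

class Annotated(NamedTuple):
    form: str
    lemma: Optional[str]     # None when no lemma was inserted
    tag: str
    features: Optional[str]  # None when no "#..." suffix is present

def parse_token(item: str) -> Annotated:
    """Parse 'form/TAG#feats' or 'form/LEMMA/TAG#feats'."""
    fields = item.split("/")
    if len(fields) not in (2, 3):
        raise ValueError(f"unexpected token format: {item!r}")
    # Only the last field is split on '#', because the tokenizer may also
    # leave a '#' mark inside the word form (as in 'vê#').
    tag, _, features = fields[-1].partition("#")
    lemma = fields[1] if len(fields) == 3 else None
    return Annotated(fields[0], lemma, tag, features or None)

for item in ["os/DA#mp", "gatinhos/CN#mp-dim", "escrevi/ESCREVER/V#ppi-1s"]:
    print(parse_token(item))
```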
Language Coverage
Portuguese
Target Users
Researchers
Developers
Students
Language professionals
Get Started with the service
Free for non-commercial research. Terms for other uses are to be negotiated.