LX-Suite

Provided by:
University of Lisbon (NLX-GROUP) - Natural Language and Speech Group, Department of Informatics, Faculty of Sciences


LX-Suite performs shallow processing of Portuguese. It consists of the following tools:

LX Sentence Splitter:

  • Marks sentence boundaries with <s>…</s>, and paragraph boundaries with <p>…</p>.
  • Unwraps sentences split over different lines.
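As an illustration of how the splitter markup could be consumed downstream, the following minimal sketch collects sentences per paragraph. It assumes that <s>…</s> elements are nested inside <p>…</p> elements and that the tags carry no attributes; the function name split_marked_text and the sample string are hypothetical, not part of LX-Suite.

```python
import re

def split_marked_text(marked: str) -> list[list[str]]:
    """Return a list of paragraphs, each a list of sentence strings.

    Assumption: sentences (<s>) are nested inside paragraphs (<p>).
    """
    paragraphs = re.findall(r"<p>(.*?)</p>", marked, flags=re.DOTALL)
    return [
        [s.strip() for s in re.findall(r"<s>(.*?)</s>", p, flags=re.DOTALL)]
        for p in paragraphs
    ]

# Hypothetical sample in the markup described above.
sample = (
    "<p><s>Isto é um exemplo.</s><s>Outra frase.</s></p>"
    "<p><s>Novo parágrafo.</s></p>"
)
paragraphs = split_marked_text(sample)
```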

LX-Tokenizer:

  • Segments text into lexically relevant tokens, using whitespace as the separator. Note that, in these examples, the | (vertical bar) symbol is used to mark the token boundaries more clearly.

um exemplo → |um|exemplo|

  • Expands contractions. Note that the first element of an expanded contraction is marked with an _ (underscore) symbol:

do → |de_|o|

  • Marks spacing around punctuation or symbols. The \* and the */ symbols indicate a space to the left and a space to the right, respectively:

um, dois e três → |um|,*/|dois|e|três|

5.3 → |5|.|3|

1. 2 → |1|.*/|2|

8 . 6 → |8|\*.*/|6|

  • Detaches clitic pronouns from the verb. The detached pronoun is marked with a - (hyphen) symbol. When in mesoclisis, a -CL- mark is used to signal the original position of the detached clitic. Additionally, possible vocalic alterations of the verb form are marked with a # (hash) symbol:

dá-se-lho → |dá|-se|-lhe_|-o|

afirmar-se-ia → |afirmar-CL-ia|-se|

vê-las → |vê#|-las|

  • This tool also handles ambiguous strings. These are words that, depending on their particular occurrence, can be tokenized in different ways. For instance:

deste → |deste| when occurring as a Verb

deste → |de|este| when occurring as a contraction (Preposition + Demonstrative)
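The marks described above can be decoded mechanically. The sketch below reads the illustrative |-delimited notation used in these examples and records each mark as a flag on the token. It is only a reading aid under stated assumptions: the Token class and parse_tokens function are hypothetical names, the | delimiter is the one used in this description (not necessarily the tool's actual output format), and a literal | in the text is not handled.

```python
from dataclasses import dataclass

@dataclass
class Token:
    form: str
    space_left: bool = False        # token carried the \* mark
    space_right: bool = False       # token carried the */ mark
    contraction_head: bool = False  # first element of an expanded contraction (_)
    clitic: bool = False            # detached clitic pronoun (leading -)
    mesoclisis: bool = False        # verb form containing the -CL- placeholder
    altered_verb: bool = False      # verb form with a vocalic alteration (#)

def parse_tokens(line: str) -> list[Token]:
    """Decode a line such as '|um|,*/|dois|' into Token objects."""
    tokens = []
    for piece in line.split("|"):
        if not piece:
            continue  # empty strings before the first and after the last bar
        tok = Token(form=piece)
        # Length checks keep a bare mark character from being stripped to nothing.
        if len(tok.form) > 2 and tok.form.startswith("\\*"):
            tok.space_left, tok.form = True, tok.form[2:]
        if len(tok.form) > 2 and tok.form.endswith("*/"):
            tok.space_right, tok.form = True, tok.form[:-2]
        if "-CL-" in tok.form:
            tok.mesoclisis = True
        if len(tok.form) > 1 and tok.form.startswith("-"):
            tok.clitic, tok.form = True, tok.form[1:]
        if len(tok.form) > 1 and tok.form.endswith("_"):
            tok.contraction_head, tok.form = True, tok.form[:-1]
        if len(tok.form) > 1 and tok.form.endswith("#"):
            tok.altered_verb, tok.form = True, tok.form[:-1]
        tokens.append(tok)
    return tokens

listing = parse_tokens("|um|,*/|dois|e|três|")
spaced = parse_tokens("|8|\\*.*/|6|")
clitics = parse_tokens("|dá|-se|-lhe_|-o|")
```

Keeping the marks as boolean flags rather than inside the form makes it straightforward to recover the bare word forms while still being able to reconstruct the original spacing.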

LX-Tagger:

  • Assigns a single morpho-syntactic tag, from the tagset, to every token. The tag is attached to the token, using a / (slash) symbol as separator:

um exemplo → um/IA exemplo/CN

  • Each individual token in multi-token expressions gets the tag of that expression prefixed by "L" and followed by the number of its position within the expression:

de maneira a que → de/LCJ1 maneira/LCJ2 a/LCJ3 que/LCJ4
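To show how the "L" prefix and position number can be used, the following sketch regroups the tokens of a multi-token expression and recovers the tag of the expression as a whole. It assumes whitespace-separated token/TAG pairs as in the examples above; the function name group_tagged and the regular expression are illustrative choices, not part of LX-Tagger.

```python
import re

# "L" + expression tag + position, e.g. LCJ1, LCJ2, ...
MWE_TAG = re.compile(r"^L([A-Z]+)(\d+)$")

def group_tagged(line: str) -> list[tuple[list[str], str]]:
    """Group tagged tokens into (word forms, tag) units.

    Tokens of a multi-token expression are merged into one unit that
    carries the tag of the expression without the "L" prefix and position.
    """
    groups = []
    in_mwe = False
    for item in line.split():
        form, tag = item.rsplit("/", 1)
        m = MWE_TAG.match(tag)
        if m and int(m.group(2)) > 1 and in_mwe:
            groups[-1][0].append(form)  # continuation of the open expression
            continue
        if m:
            groups.append(([form], m.group(1)))  # position 1 opens an expression
            in_mwe = True
        else:
            groups.append(([form], tag))
            in_mwe = False
    return groups

simple = group_tagged("um/IA exemplo/CN")
expression = group_tagged("de/LCJ1 maneira/LCJ2 a/LCJ3 que/LCJ4")
```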

LX-Featurizer (nominal):

  • Assigns inflection feature values to words from the nominal categories: Gender (masculine or feminine), Number (singular or plural) and, when applicable, Person (1st, 2nd and 3rd):

os/DA gatos/CN → os/DA#mp gatos/CN#mp

  • Assigns degree feature values (diminutive, superlative and comparative) to words from the nominal categories:

os/DA gatinhos/CN → os/DA#mp gatinhos/CN#mp-dim

  • Sometimes, because of so-called invariant words, the featurizer is not able to determine a feature value. In those cases, it assigns a g value for an underspecified Gender and an n value for an underspecified Number. Note, however, that if provided with adequate context, the featurizer might resolve such cases:

Vi/V pianistas/CN → Vi/V pianistas/CN#gp

Vi/V as/DA pianistas/CN → Vi/V as/DA#fp pianistas/CN#fp
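The feature suffix can be unpacked as in the sketch below. It covers only what the examples above show: a Gender letter followed by a Number letter after the # symbol, optionally followed by a hyphen and a degree code. The letters s for singular and n for underspecified Number follow the description above; how Person is encoded is not shown here, so it is left out. The function name decode_nominal is hypothetical.

```python
GENDER = {"m": "masculine", "f": "feminine", "g": "underspecified"}
NUMBER = {"s": "singular", "p": "plural", "n": "underspecified"}

def decode_nominal(token: str) -> dict:
    """Decode a token such as 'gatinhos/CN#mp-dim' into its parts."""
    head, _, feats = token.partition("#")
    form, tag = head.rsplit("/", 1)
    core, _, degree = feats.partition("-")
    return {
        "form": form,
        "tag": tag,
        "gender": GENDER.get(core[0:1]),  # None when no feature was assigned
        "number": NUMBER.get(core[1:2]),
        "degree": degree or None,         # degree code kept as-is, e.g. "dim"
    }

cats = decode_nominal("gatinhos/CN#mp-dim")
pianists = decode_nominal("pianistas/CN#gp")
```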

LX-Lemmatizer (nominal):

  • Assigns a lemma to words from the nominal categories (Adjectives, Common Nouns and Past Participles). This lemma corresponds to the form that one would find in a dictionary, typically the masculine singular form. The lemma is inserted into the token, with / (slash) as a delimiter:

gatas/CN#fp → gatas/GATO/CN#fp

normalíssimo/ADJ#ms-sup → normalíssimo/NORMAL/ADJ#ms-sup

LX-Lemmatizer and Featurizer (verbal):

  • Assigns a lemma and inflection feature values to verbs. The lemma corresponds to the infinitive form of the verb. The lemma is inserted into the token, with / (slash) as a delimiter:

escrevi/V → escrevi/ESCREVER/V#ppi-1s
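Putting the layers together, a fully annotated token has the shape form/LEMMA/TAG#features, where the lemma and the features may be absent. The sketch below splits such a token into its fields. It is a minimal illustration that assumes the word form itself contains no / or # characters; the names Annotated and parse_annotated are hypothetical, and the feature string is returned undecoded.

```python
from typing import NamedTuple, Optional

class Annotated(NamedTuple):
    form: str
    lemma: Optional[str]     # None when no lemma was inserted
    tag: str
    features: Optional[str]  # raw feature string, e.g. "ppi-1s"

def parse_annotated(token: str) -> Annotated:
    """Split 'form/LEMMA/TAG#features' (lemma and features optional)."""
    head, sep, feats = token.partition("#")
    parts = head.split("/")
    if len(parts) == 3:
        form, lemma, tag = parts
    elif len(parts) == 2:
        form, tag = parts
        lemma = None
    else:
        raise ValueError(f"unexpected token shape: {token!r}")
    return Annotated(form, lemma, tag, feats if sep else None)

verb = parse_annotated("escrevi/ESCREVER/V#ppi-1s")
noun = parse_annotated("gatas/GATO/CN#fp")
```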

Language Coverage
Portuguese

Target Users
  • Researchers
  • Developers
  • Students
  • Language professionals
Get Started with the service

Free for non-commercial research. To be negotiated for other uses.

Support

Helpdesk: https://portulanclarin.net/helpdesk/

Access the service

Request - Web Service : https://portulanclarin.net/workbench/lx-suite/
