LinA is a software package for automatic text processing. It performs linguistic analysis of large volumes of unstructured text, including social media content. The available modules are:
Text filters: extract content from plain text, HTML, PDF, MS Office, OpenOffice, XML, TMX, XLIFF and many other file formats.
Language identification: automatically determines the language of a document.
Sentence segmentation and tokenization: identifies sentence and word boundaries.
Truecasing: normalizes the casing of a text, e.g. "This Sentence Is in ENGLISH." → "this sentence is in English."
Part-of-speech tagging: assigns a part of speech to every word.
Morphological analysis and lemmatization: analyses unknown words according to morphological rules (including compound splitting for German) and generates the base form of a word in the current context.
Several configurable output methods: XML writer, Lucene index writer, etc.
LinA processes a wide variety of input formats and is available for English and German. Some modules are also available for other languages; contact the provider for details.
LinA is written entirely in Java, which makes it fast and lets it run on Unix, Windows and macOS.
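
To give a rough idea of what the segmentation, tokenization and truecasing modules listed above do, here is a minimal sketch in Python. It is not LinA's code or API: the function names (split_sentences, tokenize, truecase) and the casing lookup table are hypothetical, and the regular expressions are deliberately naive stand-ins for the rule-based or statistical methods a real system would use.

```python
import re


def split_sentences(text):
    """Naive segmentation: break after '.', '!' or '?' when whitespace follows."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Naive tokenization: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def truecase(tokens, known_casing):
    """Toy truecasing: lowercase each token, then restore casing from a lookup table.

    known_casing maps a lowercased word to its preferred form (an assumption
    for this sketch; a real truecaser would learn these forms from data).
    """
    return [known_casing.get(t.lower(), t.lower()) for t in tokens]


text = "This Sentence Is in ENGLISH. It has two sentences!"
for sentence in split_sentences(text):
    print(truecase(tokenize(sentence), {"english": "English"}))
```

For the first sentence this yields the tokens this, sentence, is, in, English and a final period, which matches the truecasing example in the module list. The sketch also shows why these steps need real linguistic knowledge: the naive splitter would wrongly break a sentence after an abbreviation such as "e.g." whenever whitespace follows it.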