ILSP Focused Crawler

Provided by:
ATHENA Research Center (ILSP / ATHENA R.C.) - Institute for Language and Speech Processing


ILSP Focused Crawler (ILSP-FC) is a research prototype for acquiring domain-specific monolingual and bilingual corpora. The required input from the user consists of a list of seed URLs pointing to relevant web pages and a list of terms that describe a topic. ILSP-FC integrates modules for text normalization, language identification, document clean-up, text classification, bilingual document alignment (i.e. identification of pairs of documents that are translations of each other) and sentence alignment. If the user does not provide a list of terms, the software can be used as a general crawler.
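
To illustrate how a list of topic terms can steer a crawl, here is a minimal sketch of term-based relevance filtering. It is not ILSP-FC's actual classifier: the function names, the term weights and the threshold are assumptions made for this example only.

```python
import re


def relevance_score(text, weighted_terms):
    """Sum weight * occurrence count for every topic term found in the text."""
    lowered = text.lower()
    score = 0.0
    for term, weight in weighted_terms.items():
        # Match whole words/phrases only, case-insensitively.
        pattern = r"\b" + re.escape(term.lower()) + r"\b"
        score += weight * len(re.findall(pattern, lowered))
    return score


def is_relevant(text, weighted_terms, threshold=1.0):
    """Accept a page if its score reaches the (hypothetical) threshold."""
    # With no topic terms, every page is accepted (general crawling).
    if not weighted_terms:
        return True
    return relevance_score(text, weighted_terms) >= threshold
```

With an empty term list, `is_relevant` accepts every page, which mirrors the general-crawler mode described above.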

ILSP-FC is developed by researchers at ILSP/Athena RIC and is currently used in the European Language Resource Coordination (ELRC) Data effort. ELRC Data implements the acquisition of language resources and language processing services, as well as their provision to the language resource repository of the Connecting Europe Facility (CEF) eTranslation platform, which helps European and national public administrations exchange information across language barriers in the EU.

An initial version of the crawler was produced during PANACEA, an EU FP7 project for the acquisition and production of Language Resources. It was then extended during the QTLaunchPad project, a European Commission-funded collaborative research initiative dedicated to overcoming quality barriers in machine and human translation and in language technologies; and the FP7-PEOPLE Abu-MaTran project for enhancing industry-academia cooperation in the adoption of machine translation technologies.

More information on how to download and use ILSP-FC can be found in the Documentation.

ILSP-FC is a Java project released under the GNU GPL v3.0 license. It depends on open-source libraries for web mining and for building data-processing workflows. If you would like to try ILSP-FC as a bilingual crawler before installing it, you can use this web service on small websites.

If you use ILSP-FC in scientific work, please cite: Papavassiliou, V., Prokopidis, P., and Thurmair, G. (2013). A modular open-source focused crawler for mining monolingual and bilingual corpora from the web. In Proceedings of the Sixth Workshop on Building and Using Comparable Corpora, pages 43-51. Sofia, Bulgaria: Association for Computational Linguistics.

The pair detection module of ILSP-FC was used for aligning documents in the WMT16 Bilingual Document Alignment Shared Task. The system reached a recall of 91% in the soft scoring setting prepared by the organizers. More details are presented in the system paper: Papavassiliou, V., Prokopidis, P., and Piperidis, S. (2016). The ILSP/ARC submission to the WMT 2016 bilingual document alignment shared task. In Proceedings of the First Conference on Machine Translation. Berlin, Germany: Association for Computational Linguistics.

Features

The current version of ILSP-FC lets the user either run all relevant processes as a pipeline or run a single process (e.g. export, deduplication, or pair detection).

In a configuration for acquiring monolingual data, ILSP-FC applies the following processes (one after the other):

  • crawls the web until an expiration criterion is met (i.e. harvests web pages and stores the ones that are in the targeted language and, if required, relevant to a targeted topic)
  • exports the stored data (i.e. stores the downloaded web pages/documents and, for each of them, generates a cesDoc file with its content and metadata)
  • discards (near) duplicate documents
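
The (near) duplicate removal step can be sketched as follows. This is only an illustration under stated assumptions, not the method ILSP-FC implements: it compares documents by the Jaccard similarity of their word 3-grams (shingles), and the function names and the 0.8 threshold are hypothetical.

```python
import re


def shingles(text, n=3):
    """Return the set of word n-grams (shingles) of a text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(docs, threshold=0.8):
    """Keep a document only if it is not too similar to one already kept."""
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc)
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept
```

The pairwise comparison is quadratic in the number of documents, which is acceptable for a sketch; a large crawl would need an indexing scheme such as MinHash to stay tractable.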

In a configuration for acquiring parallel data, it applies the following processes (one after the other):

  • crawls a website with content in the targeted languages (i.e. harvests the website and stores pages that are in the targeted languages and, if required, relevant to a targeted topic)
  • exports the stored data (i.e. stores each downloaded web page/document and generates a cesDoc file with its content and metadata)
  • discards (near) duplicate documents
  • identifies pairs of (candidate) parallel documents and generates a cesAlign file for each detected pair
  • aligns the segments in each detected document pair and generates a TMX file for each pair
  • merges the TMX files of the document pairs to create the final output, i.e. a single TMX file that includes all (or a selection of) segment pairs
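
The final merging step can be illustrated with a short sketch that combines several TMX documents into one and drops repeated segment pairs. It is an assumption-laden example, not ILSP-FC's code: the function names are hypothetical, the header attributes are placeholder values, and inline markup inside `<seg>` elements is ignored when comparing segments.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def segment_pair(tu):
    """Return a hashable key for one <tu>: sorted (language, text) tuples."""
    return tuple(sorted(
        (tuv.get(XML_LANG, ""), (tuv.findtext("seg") or "").strip())
        for tuv in tu.findall("tuv")
    ))


def merge_tmx(documents):
    """Merge TMX documents (given as strings) into a single <tmx> element."""
    merged = ET.Element("tmx", {"version": "1.4"})
    # Placeholder header values chosen for this sketch only.
    ET.SubElement(merged, "header", {
        "creationtool": "merge-sketch", "creationtoolversion": "0",
        "segtype": "sentence", "o-tmf": "unknown",
        "adminlang": "en", "srclang": "*all*", "datatype": "plaintext",
    })
    body = ET.SubElement(merged, "body")
    seen = set()
    for doc in documents:
        for tu in ET.fromstring(doc).iter("tu"):
            key = segment_pair(tu)
            if key not in seen:  # keep each segment pair once
                seen.add(key)
                body.append(tu)
    return merged
```

Calling `ET.tostring(merge_tmx(docs), encoding="unicode")` serializes the result; a selection of segment pairs, rather than all of them, could be obtained by adding a filter next to the `seen` check.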


Workflow

                Input    Output
Media Type
Resource type   Corpus   Corpus
Data format     HTML     TMX