ILSP Focused Crawler (ILSP-FC) is a research prototype for acquiring domain-specific monolingual and bilingual corpora. The required input from the user consists of a list of seed URLs pointing to relevant web pages and a list of terms that describe a topic. ILSP-FC integrates modules for text normalization, language identification, document clean-up, text classification, bilingual document alignment (i.e. identification of pairs of documents that are translations of each other) and sentence alignment. If the user does not provide a list of terms, the software can be used as a general crawler.
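To give an intuition for how a term list can drive topic-focused crawling, the sketch below scores a page's text against the user's terms and keeps only pages above a threshold. This is an illustrative toy in Python, not ILSP-FC's actual classifier; the function names and the single-word-term assumption are ours.

```python
import re

def relevance_score(text, topic_terms):
    """Fraction of a page's tokens that match a topic term.
    Illustrative only: assumes single-word terms; a real focused
    crawler would also handle multi-word terms and term weights."""
    tokens = re.findall(r"\w+", text.lower())
    counts = {}
    for tok in tokens:
        counts[tok] = counts.get(tok, 0) + 1
    hits = sum(counts.get(term.lower(), 0) for term in topic_terms)
    return hits / max(len(tokens), 1)

def is_relevant(text, topic_terms, threshold=0.01):
    """Decide whether a fetched page is on-topic enough to store."""
    return relevance_score(text, topic_terms) >= threshold
```

With an empty term list every page would trivially score zero and a threshold of zero would accept everything, which mirrors the "general crawler" mode described above.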
An initial version of the crawler was produced during PANACEA, an EU FP7 project for the acquisition and production of Language Resources. It was then extended during QTLaunchPad, a European Commission-funded collaborative research initiative dedicated to overcoming quality barriers in machine and human translation and in language technologies, and during the FP7-PEOPLE Abu-MaTran project, which aimed to enhance industry-academia cooperation in the adoption of machine translation technologies.
More information on how to download and use ILSP-FC can be found in the Documentation.
The pair detection module of ILSP-FC was used for aligning documents in the WMT16 Bilingual Document Alignment Shared Task. The system reached a recall of 91% in the soft scoring setting prepared by the organizers. More details are presented in the system paper: Papavassiliou, V., Prokopidis, P., and Piperidis, S. (2016). The ILSP/ARC submission to the WMT 2016 bilingual document alignment shared task. In Proceedings of the First Conference on Machine Translation. Berlin, Germany: Association for Computational Linguistics.
The current version of ILSP-FC offers the user the option to run all relevant processes in a pipeline or to run a specific process only (e.g. export, deduplication, or pair detection).
In a configuration for acquiring monolingual data, ILSP-FC applies the following processes (one after the other):
crawls the web until an expiration criterion is met (i.e. harvests webpages and stores the ones that are in the targeted language, and relevant to a targeted topic if required)
exports the stored data (i.e. stores the downloaded web pages/documents and, for each page, generates a CesDoc file with its content and metadata)
discards (near) duplicate documents
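The deduplication step in the list above can be illustrated with word-shingle overlap: two documents whose shingle sets have a high Jaccard similarity are treated as near duplicates and only the first is kept. This is a minimal sketch under our own assumptions (shingle size, threshold), not ILSP-FC's actual deduplication algorithm.

```python
def shingles(text, n=5):
    """Set of word n-grams ('shingles') of a document's text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets."""
    return len(a & b) / len(a | b) if a | b else 1.0

def deduplicate(docs, threshold=0.8, n=5):
    """Keep the first document of each group of (near-)duplicates."""
    kept, kept_shingles = [], []
    for doc in docs:
        sh = shingles(doc, n)
        if all(jaccard(sh, ks) < threshold for ks in kept_shingles):
            kept.append(doc)
            kept_shingles.append(sh)
    return kept
```

For large crawls, pairwise comparison becomes expensive; production systems typically approximate Jaccard similarity with MinHash signatures instead.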
In a configuration for acquiring parallel data, it applies the following processes (one after the other):
crawls a website with content in the targeted languages (i.e. harvests the website and stores pages that are in the targeted languages, and relevant to a targeted topic if required)
exports the stored data (i.e. stores each downloaded web page/document and generates a CesDoc file with its content and metadata)
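One common heuristic for detecting candidate translation pairs on a crawled website is to match URLs that differ only in a language marker (e.g. `/en/` vs `/el/`). The sketch below illustrates that idea only; the marker lists and function names are our assumptions, and the actual ILSP-FC pair detector combines several features beyond URL matching.

```python
import re

# Hypothetical language markers; a real system would make these configurable.
LANG_MARKERS = {"en": ["en", "english"], "el": ["el", "greek"]}

def normalize_url(url, lang):
    """Replace any marker of `lang` in the URL with a placeholder,
    so that translated pages collapse to the same key."""
    out = url.lower()
    for marker in LANG_MARKERS[lang]:
        # Match the marker only when not embedded inside a longer word.
        out = re.sub(rf"(?<![a-z]){marker}(?![a-z])", "@", out)
    return out

def detect_pairs(urls_by_lang):
    """Pair URLs across two languages whose normalized forms match."""
    (l1, urls1), (l2, urls2) = urls_by_lang.items()
    index = {normalize_url(u, l2): u for u in urls2}
    return [(u, index[normalize_url(u, l1)])
            for u in urls1 if normalize_url(u, l1) in index]
```

URL matching alone misses pairs on sites without parallel URL structure, which is why content-based features (document length, shared links and images, structural similarity) are typically combined with it.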