ParaCrawl OpenSource Pipeline



Bitextor is a pipeline that runs a collection of scripts to produce a parallel corpus from a collection of multilingual websites. The current version (v7) has been released by the ParaCrawl project (Broader Web-Scale Provision of Parallel Corpora for European Languages).

To run Bitextor, it is necessary to provide:

  1. The source where the parallel data will be searched: one or more websites (namely, Bitextor needs website hostnames)
  2. The two languages in which the user is interested: language IDs (according to ISO 639-1)
  3. A source of bilingual information between these two languages: either a bilingual lexicon (such as those available at the bitextor-data repository), an MT system, or a parallel corpus to be used to produce either a lexicon or an MT system (depending on the alignment strategy chosen, see below)
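The three inputs above can be pictured as a small record that is checked before a run. The following is a minimal sketch only: `PipelineInputs`, its field names, and `validate` are hypothetical names introduced for illustration and are not Bitextor's configuration format. The language check only tests that an ID has the two-letter shape of an ISO 639-1 code, not that the code is actually assigned.

```python
import re
from dataclasses import dataclass

# Hypothetical container for the three inputs listed above.
# This is NOT Bitextor's own configuration schema.
@dataclass
class PipelineInputs:
    hosts: list               # website hostnames, e.g. "www.example.com"
    lang1: str                # language ID of the first language
    lang2: str                # language ID of the second language
    bilingual_resource: str   # "lexicon", "mt" or "parallel_corpus"

# A bare hostname: dot-separated labels, no scheme, port or path.
HOSTNAME_RE = re.compile(
    r"^(?=.{1,253}$)([a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,63}$",
    re.IGNORECASE,
)
# Shape of an ISO 639-1 code: exactly two lowercase letters.
LANG_RE = re.compile(r"^[a-z]{2}$")
RESOURCES = {"lexicon", "mt", "parallel_corpus"}

def validate(inputs):
    """Return a list of problems; an empty list means the inputs look plausible."""
    problems = []
    if not inputs.hosts:
        problems.append("at least one hostname is required")
    for host in inputs.hosts:
        if not HOSTNAME_RE.match(host):
            problems.append(f"not a bare hostname: {host!r}")
    for lang in (inputs.lang1, inputs.lang2):
        if not LANG_RE.match(lang):
            problems.append(f"not a two-letter language ID: {lang!r}")
    if inputs.lang1 == inputs.lang2:
        problems.append("the two languages must differ")
    if inputs.bilingual_resource not in RESOURCES:
        problems.append(f"unknown bilingual resource: {inputs.bilingual_resource!r}")
    return problems
```

For example, `validate(PipelineInputs(["www.example.com"], "en", "fr", "lexicon"))` returns an empty list, whereas passing a full URL instead of a hostname is reported as a problem.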

Features

The pipeline includes the following stages:

  1. Crawling: documents are downloaded from the specified websites
  2. Pre-processing: downloaded documents are normalized, boilerplates are removed, plain text is extracted, and language is identified
  3. Document alignment: parallel documents are identified. Two strategies are implemented for this stage: one combines bilingual lexica with a collection of features extracted from HTML, using a linear regressor to produce a score in [0,1]; the other uses machine translation and a TF/IDF strategy to score document pairs
  4. Segment alignment: each pair of documents is processed to identify parallel segments. Again, two strategies are implemented: one using the tool Hunalign, and another using Bleualign, which can only be used together with the MT-based document alignment strategy (both steps then rely on machine translations)
  5. Post-processing: the parallel corpus obtained is cleaned with the tool Bicleaner, translation units are deduplicated, and additional quality metrics are computed
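To make the MT-based document alignment strategy from stage 3 more concrete, the sketch below scores document pairs with TF/IDF vectors and cosine similarity. It assumes the source-language documents have already been machine-translated into the target language and tokenized. The function names, the smoothed IDF formula, and the greedy best-match selection are illustrative assumptions, not Bitextor's actual implementation.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each tokenized document to a {term: tf * idf} dictionary."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            # Smoothed IDF (an assumption of this sketch) keeps weights positive.
            term: (count / len(doc)) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def best_matches(translated_docs, target_docs):
    """For each translated source document, return (best target index, score)."""
    vectors = tfidf_vectors(translated_docs + target_docs)
    src = vectors[:len(translated_docs)]
    tgt = vectors[len(translated_docs):]
    return [
        max(((j, cosine(s, t)) for j, t in enumerate(tgt)), key=lambda p: p[1])
        for s in src
    ]

# Toy data: source documents already translated into the target language.
translated = [["the", "cat", "sleeps"], ["prices", "rise", "again"]]
target = [["prices", "rise", "sharply"], ["the", "cat", "sleeps", "here"]]
matches = best_matches(translated, target)
```

Here the first translated document is paired with the second target document and vice versa, each with a score between 0 and 1. A real aligner would also have to enforce one-to-one pairings and discard low-scoring pairs, which this sketch omits.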


Target Users
  • European Commission, Directorate General for Translation
  • Researchers
  • Developers
  • Any third party interested in training MT engines

Get Started with the service

Free of charge

Support

Helpdesk: https://github.com/bitextor/bitextor/issues

Access the service

Request - Source Code: https://github.com/bitextor/bitextor

Terms of Use: GPL-3.0

Workflow

Input
  Data format (other): Website hostnames

Output
  Resource type: Corpus
  Data format: TMX
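
As an illustration of the TMX output format, the sketch below builds a minimal TMX 1.4 document from aligned segment pairs using Python's standard library. The function name `build_tmx` and the header values are placeholders chosen for this example; the TMX files produced by Bitextor may carry additional properties that are not reproduced here.

```python
import xml.etree.ElementTree as ET

# ElementTree serializes this qualified name as the "xml:lang" attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def build_tmx(pairs, src_lang, tgt_lang):
    """Return a TMX 1.4 string with one translation unit per segment pair."""
    tmx = ET.Element("tmx", version="1.4")
    # Required header attributes; the values are placeholders for this sketch.
    ET.SubElement(tmx, "header", {
        "creationtool": "example-sketch",
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "plain-text",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for src_text, tgt_text in pairs:
        tu = ET.SubElement(body, "tu")  # one translation unit per pair
        for lang, text in ((src_lang, src_text), (tgt_lang, tgt_text)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")

tmx_text = build_tmx([("Hello", "Bonjour")], "en", "fr")
```

Each `tu` element holds one `tuv` per language, and the segment text sits in a `seg` child, so a consumer can recover the pairs by iterating over the `tu` elements.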