CET-AT Service Catalogue

ParaCrawl OpenSource Pipeline

Provided by:

Omniscien Technologies (Trading) B.V.

Prompsit Language Engineering S.L.

TAUS

University of Alicante

University of Edinburgh

Function: Language Technologies

Task: Crawling, Alignment

Bitextor is a pipeline that runs a collection of scripts to produce a parallel corpus from a collection of multilingual websites. The current version (v7) was released by the ParaCrawl project (Broader Web-Scale Provision of Parallel Corpora for European Languages).

To run Bitextor, it is necessary to provide:

The source in which parallel data will be searched: one or more websites (namely, Bitextor needs website hostnames)
The two languages in which the user is interested: language IDs (according to ISO 639-1)
A source of bilingual information between these two languages: either a bilingual lexicon (such as those available at the bitextor-data repository), an MT system, or a parallel corpus to be used to produce either a lexicon or an MT system (depending on the alignment strategy chosen, see below)
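
As an illustration of the three inputs listed above, the following minimal sketch checks them before a run. It is not part of Bitextor: the `PipelineInputs` class, the `validate` function, and the resource labels are hypothetical names chosen for this example, and the language check only tests the two-letter shape of an ISO 639-1 code, not membership in the standard.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical container for the three inputs; not a Bitextor data structure.
@dataclass
class PipelineInputs:
    hosts: List[str]          # website hostnames, e.g. "www.example.com"
    lang1: str                # first language ID (ISO 639-1 shape, e.g. "en")
    lang2: str                # second language ID
    bilingual_resource: str   # assumed labels: "lexicon", "mt" or "parallel_corpus"

# A hostname is a dot-separated list of labels; URLs with a scheme or path are rejected.
HOSTNAME_RE = re.compile(
    r"^(?=.{1,253}$)([a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,63}$",
    re.IGNORECASE,
)
LANG_RE = re.compile(r"^[a-z]{2}$")  # shape check only
RESOURCES = {"lexicon", "mt", "parallel_corpus"}

def validate(inputs: PipelineInputs) -> List[str]:
    """Return a list of human-readable problems; an empty list means the inputs look usable."""
    errors = []
    if not inputs.hosts:
        errors.append("at least one website hostname is required")
    for host in inputs.hosts:
        if not HOSTNAME_RE.match(host):
            errors.append(f"not a hostname: {host}")
    for code in (inputs.lang1, inputs.lang2):
        if not LANG_RE.match(code):
            errors.append(f"not a two-letter language ID: {code}")
    if inputs.lang1 == inputs.lang2:
        errors.append("the two languages must differ")
    if inputs.bilingual_resource not in RESOURCES:
        errors.append(f"unknown bilingual resource: {inputs.bilingual_resource}")
    return errors

if __name__ == "__main__":
    print(validate(PipelineInputs(["www.example.com"], "en", "es", "lexicon")))
```

Passing a full URL such as `https://example.com/` instead of a hostname is reported as an error, which mirrors the requirement that hostnames, not URLs, are supplied.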

Features

The pipeline includes the following stages:

Crawling: documents are downloaded from the specified websites
Pre-processing: downloaded documents are normalized, boilerplate is removed, plain text is extracted, and language is identified
Document alignment: parallel documents are identified. Two strategies are implemented for this stage: one uses bilingual lexica and a collection of features extracted from HTML, which a linear regressor combines to produce a score in [0,1]; the other uses machine translation and a TF/IDF strategy to score document pairs
Segment alignment: each pair of documents is processed to identify parallel segments. Again, two strategies are implemented: one uses the tool Hunalign, and the other uses Bleualign, which can only be used together with the MT-based document alignment strategy (both rely on machine translations)
Post-processing: final steps that clean the resulting parallel corpus with the tool bicleaner, deduplicate translation units, and compute additional quality metrics
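
To make the MT-based document alignment idea more concrete, here is a minimal sketch of TF/IDF scoring of document pairs. It assumes the source-language documents have already been machine translated into the target language, uses whitespace tokenization and a greedy best match, and is not Bitextor's actual implementation; the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

def tfidf_vectors(docs: List[str]) -> List[Dict[str, float]]:
    """Build one sparse TF/IDF vector per document (whitespace tokens, lowercased)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        # Smoothed IDF keeps every weight positive, even for very common terms.
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in term_freq.items()
        })
    return vectors

def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors; in [0, 1] for non-negative weights."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def best_matches(translated_src: List[str], tgt: List[str]) -> List[Tuple[int, int, float]]:
    """For each translated source document, return (source index, target index, score)."""
    if not tgt:
        return []
    vectors = tfidf_vectors(translated_src + tgt)
    src_vecs = vectors[:len(translated_src)]
    tgt_vecs = vectors[len(translated_src):]
    matches = []
    for i, src_vec in enumerate(src_vecs):
        scores = [cosine(src_vec, tgt_vec) for tgt_vec in tgt_vecs]
        j = max(range(len(scores)), key=scores.__getitem__)
        matches.append((i, j, scores[j]))
    return matches

if __name__ == "__main__":
    translated = ["the cat sat on the mat", "stock prices fell sharply today"]
    target = ["stock prices dropped sharply today", "a cat sat on a mat"]
    for src_idx, tgt_idx, score in best_matches(translated, target):
        print(src_idx, tgt_idx, round(score, 3))
```

A greedy per-document maximum is the simplest choice for a sketch; a real system would typically also enforce one-to-one pairing and a minimum score threshold.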

Target Users

European Commission, Directorate General for Translation
Researchers
Developers
Any third party interested in training MT engines

Get Started with the service

Free of charge

Support

Helpdesk: https://github.com/bitextor/bitextor/issues

Access the service

Request - Source Code: https://github.com/bitextor/bitextor

Terms of Use: GPL-3.0

Workflow

                     Input              Output
Media Type
Resource type                           Corpus
Data format                             TMX
Data format (other)  Website hostnames
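
Since the output corpus is delivered as TMX, the sketch below shows how aligned segment pairs map onto the basic TMX element structure (`tmx`, `header`, `body`, `tu`, `tuv`, `seg`). It is an illustration only: the `build_tmx` function is hypothetical, the header values are placeholders, and the files produced by Bitextor may carry additional properties that are not reproduced here.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Tuple

# Qualified name that ElementTree serializes as the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def build_tmx(pairs: Iterable[Tuple[str, str]], src_lang: str, tgt_lang: str) -> str:
    """Serialize (source, target) segment pairs as a minimal TMX 1.4 document."""
    tmx = ET.Element("tmx", {"version": "1.4"})
    # Placeholder header values; a real export would describe the producing tool.
    ET.SubElement(tmx, "header", {
        "creationtool": "example-sketch",
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "plain-text",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for src_text, tgt_text in pairs:
        tu = ET.SubElement(body, "tu")  # one translation unit per aligned pair
        for lang, text in ((src_lang, src_text), (tgt_lang, tgt_text)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")

if __name__ == "__main__":
    print(build_tmx([("Hello world", "Hola mundo")], "en", "es"))
```

Each translation unit holds one `tuv` per language, so the same structure extends to any language pair given as ISO 639-1 IDs.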
