CET-AT Service Catalogue

BSP - Example Sentence Generation Pipeline

Provided by:

Peter Kolb & Prochazkova GbR - Linguatools

Function: Language Technologies

Task: Crawling

About

BSP is a web crawler that identifies web pages that are translations of each other, extracts the pages' contents, splits them into sentences using LinA, and aligns the sentences. The following classifiers are then applied:

GarbageFilter: removes boilerplate text from navigation elements and menus, as well as sentences that consist only of keyword lists and other SEO spam.
Parallelness Scorer: uses a dictionary-based classifier to make sure that the aligned sentences really are translations of each other.
Machine Translation Filter: detects pages that were translated by machine translation engines.
Domain Classification: assigns each page up to three domains from our domain hierarchy.
Duplicate detection: identifies duplicate sentence pairs. Only unique sentence pairs are kept.
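
The last and the second of these steps can be illustrated with a minimal Python sketch. It is not BSP's actual implementation: the SentencePair type, the normalize, deduplicate and parallelness functions, the toy dictionary and the 0.5 threshold are all hypothetical names and values chosen for illustration. The sketch assumes that duplicates are detected after lowercasing and whitespace normalization, and that the dictionary-based score is the fraction of source words with a known translation in the target sentence.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class SentencePair:
    """Hypothetical container for one aligned sentence pair."""
    source: str  # English sentence
    target: str  # German sentence


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.strip().lower())


def deduplicate(pairs):
    """Keep only the first occurrence of each normalized sentence pair."""
    seen = set()
    unique = []
    for pair in pairs:
        key = (normalize(pair.source), normalize(pair.target))
        if key not in seen:
            seen.add(key)
            unique.append(pair)
    return unique


def parallelness(pair, dictionary):
    """Fraction of source words that have a dictionary translation in the target."""
    source_tokens = re.findall(r"\w+", pair.source.lower())
    target_tokens = set(re.findall(r"\w+", pair.target.lower()))
    if not source_tokens:
        return 0.0
    hits = sum(
        1 for token in source_tokens
        if dictionary.get(token, set()) & target_tokens
    )
    return hits / len(source_tokens)


# Toy bilingual dictionary (assumption): English word -> possible German words.
TOY_DICTIONARY = {
    "the": {"der", "die", "das"},
    "house": {"haus"},
    "is": {"ist"},
    "big": {"groß"},
}

pairs = [
    SentencePair("The house is big.", "Das Haus ist groß."),
    SentencePair("the  house is big.", "das Haus ist groß."),  # duplicate variant
    SentencePair("Read more", "Impressum"),                    # not a translation
]

unique_pairs = deduplicate(pairs)
# The 0.5 threshold is an arbitrary illustrative value.
kept = [p for p in unique_pairs if parallelness(p, TOY_DICTIONARY) >= 0.5]
```

A real dictionary-based classifier would also need to handle inflection, compounds and function words, which this sketch ignores.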

BSP is currently tailored to the English-German language pair and is being adapted to further language pairs.

Language Coverage

English (Latin), German (Latin)

Get Started with the service

Contact the provider

Support

Helpdesk: peter.kolb@linguatools.org
