BSP is a web crawler that identifies web pages that are translations of each other, extracts their contents, splits them into sentences using LinA, and aligns the sentences. The following classifiers are then applied:
GarbageFilter: filters out boilerplate text from navigation elements and menus, as well as sentences that consist only of keyword lists or other SEO spam.
Parallelness Scorer: uses a dictionary-based classifier to make sure that the aligned sentences really are translations of each other.
Machine Translation Filter: detects pages that were translated by machine translation engines.
Domain Classification: assigns each page up to three domains from our domain hierarchy.
Duplicate Detection: identifies duplicate sentence pairs and keeps only one copy of each.
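
To make two of the steps above more concrete, here is a minimal sketch of dictionary-based parallelness scoring and duplicate detection. It is not BSP's actual implementation: the function names, the toy dictionary format (a source word mapped to a set of accepted target words), the coverage-ratio score, and the whitespace and case normalization used for duplicates are all assumptions made for illustration.

```python
import re
from typing import Dict, Iterable, List, Set, Tuple

SentencePair = Tuple[str, str]


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def parallelness_score(source: str, target: str,
                       dictionary: Dict[str, Set[str]]) -> float:
    """Hypothetical score: the fraction of source tokens that have at
    least one dictionary translation present in the target sentence."""
    source_tokens = tokenize(source)
    target_tokens = set(tokenize(target))
    if not source_tokens:
        return 0.0
    hits = sum(
        1 for token in source_tokens
        if dictionary.get(token, set()) & target_tokens
    )
    return hits / len(source_tokens)


def normalize(text: str) -> str:
    """Assumed duplicate key: lowercase with collapsed whitespace."""
    return " ".join(text.lower().split())


def deduplicate(pairs: Iterable[SentencePair]) -> List[SentencePair]:
    """Keep the first occurrence of each normalized sentence pair."""
    seen: Set[SentencePair] = set()
    unique: List[SentencePair] = []
    for source, target in pairs:
        key = (normalize(source), normalize(target))
        if key not in seen:
            seen.add(key)
            unique.append((source, target))
    return unique
```

In a pipeline like the one described, a pair would presumably be kept only if its score exceeds some threshold; that threshold, like the dictionary itself, would have to be tuned per language pair.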
BSP is currently tailored to the English-German language pair and is being adapted to further language pairs.