XOresearch ASR

Provided by:


Implemented as a REST API server that can be deployed on a Linux machine with a GPU (it can also run on CPU). It accepts almost any audio or video format and transcribes the audio track to text. A single GPU node can process about 20 audio streams simultaneously. The server software can be modified to meet client needs.
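As a rough illustration of how a client might submit a file to a REST server of this kind, here is a minimal Python sketch using only the standard library. The endpoint URL, the form field names ("file", "n_best") and the JSON reply are assumptions made for the example; the listing does not document the actual API, so the provider's documentation takes precedence.

```python
import json
import urllib.request
import uuid

# Hypothetical endpoint: the real URL and path are not documented in this listing.
ASR_URL = "http://localhost:8080/transcribe"


def build_transcribe_request(audio_bytes, filename, url=ASR_URL, n_best=1):
    """Build (but do not send) a multipart/form-data POST carrying one media file.

    The field names "file" and "n_best" are illustrative assumptions.
    """
    boundary = uuid.uuid4().hex
    option_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="n_best"\r\n\r\n'
        f"{n_best}\r\n"
    ).encode()
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    closing = f"\r\n--{boundary}--\r\n".encode()
    body = option_part + file_header + audio_bytes + closing
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )


def transcribe(audio_bytes, filename, **kwargs):
    """Send the request and decode the reply, assuming the server answers in JSON."""
    request = build_transcribe_request(audio_bytes, filename, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)
```

Building the request separately from sending it makes the payload easy to inspect before pointing the client at a real deployment.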

Features

  • accent-agnostic design with a single universal model: there is no need to switch models depending on the origin of the audio stream, as competing solutions require, which may leave them not versatile enough for production deployments;
  • channel-quality-agnostic design: 8 kHz telephone and 16+ kHz broadcast audio are handled by the same model without loss of quality, which increases versatility further;
  • optional slower inference mode (2.5x slower for 4 additional hypotheses) with increased accuracy and alternative versions of the decoded text, which can be used for human-driven correction of the transcript, leading to almost 100% top-5 accuracy in most cases;
  • noise resilient, with no need for a noise-removal pre-pass for moderate amounts of noise;
  • can easily be extended to unseen cases just by adding data to the training pipeline, free of charge;
  • built-in noise-resilient voice activity detection (VAD) based on a fast neural network.
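
To make the top-5 accuracy figure in the list above concrete, here is a small Python sketch of one common way to score n-best output: an utterance counts as correct if the reference transcript appears anywhere among the first k hypotheses. This is an assumed, utterance-level definition used for illustration; the listing does not state how the provider measures the figure, and the sample sentences are made up.

```python
def top_k_accuracy(references, nbest_lists, k=5):
    """Fraction of utterances whose reference is among the first k hypotheses."""

    def norm(text):
        # Compare case-insensitively and ignore extra whitespace.
        return " ".join(text.lower().split())

    hits = sum(
        norm(ref) in {norm(hyp) for hyp in hyps[:k]}
        for ref, hyps in zip(references, nbest_lists)
    )
    return hits / len(references)


# Made-up example data: two utterances, two hypotheses each.
refs = ["turn left at the light", "call me back"]
nbest = [
    ["turn left at the lights", "turn left at the light"],
    ["call me bag", "call me pack"],
]
print(top_k_accuracy(refs, nbest, k=1))  # 0.0
print(top_k_accuracy(refs, nbest, k=5))  # 0.5
```

In a human-driven correction workflow, a hit within the first k hypotheses means the editor can pick the right text from the alternatives rather than retype it.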

Language Coverage
English (Latin)

Get Started with the service

Contact the provider

Web Service