NLP CHALLENGE:
Audio recordings transcription
$6,000 in cash prizes in competition to find the best transcriptions of audio recordings.
Overview of the challenge
Participants worked with a large-scale dataset of audio transcriptions. The task was to aggregate multiple transcriptions into a single high-quality transcription. For this contest, we obtained Wikipedia articles vocalized by a popular voice assistant. Every recording was then transcribed by several Toloka contributors. A detailed description of the data can be found on the competition GitHub page.

Thanks to all the participants and congratulations to the winners!
Task Details
Prizes
Timeline
Data
Speech-to-text projects are ubiquitous on crowdsourcing platforms. However, individual annotators on these platforms are often unskilled or even malicious. Therefore, transcriptions collected on crowdsourcing platforms may be noisy.

To mitigate this problem, each audio recording is typically transcribed by multiple crowd annotators. But how do we aggregate these multiple transcriptions to obtain the final high-quality transcription? The goal of this competition is to answer this question.

Specifically, participants are asked to build a model that aggregates multiple transcriptions of an audio recording, obtained on a crowdsourcing platform, into a single high-quality transcription.
$6,000 in cash prizes
The contest was sponsored by Toloka, which generously provided $6,000 in cash prizes. The prizes were distributed among the top 3 teams:

  • First Place: $3,000
  • Second Place: $2,000
  • Third Place: $1,000

Each prize was split equally among all members of the team.
    Research publication
    The top-performing teams were invited to submit reports on their solutions to the VLDB 2021 Crowd Science Workshop.





    1. April 15. Practice phase started
    We released the training data, and participants could start building their models
    2. May 5. Competition began
    We released the test data and started evaluating submissions on the public part of the test set
    3. May 25. Toloka funds were distributed
    We gave out Toloka promo codes to the top 20 teams on the public leaderboard
    4. June 7. Toloka funds were distributed
    We gave out a second round of Toloka promo codes to the top 20 teams on the public leaderboard that submitted their solutions between May 25 and June 7
    5. June 18. Competition ended
    We closed the submission portal and evaluated submissions on the private test set
    6. June 28. VLDB Crowd Science Workshop submission deadline
    Teams were encouraged to submit a report on their solution to the workshop
    We generated a large number of audio recordings using Yandex SpeechKit and had them transcribed on the Toloka crowdsourcing platform. A detailed description of the data is available on the competition GitHub page.

    Outcomes from the contest: language models applied to the task as summarization won. However, even simple but ingenious approaches, such as taking the median under edit distance, performed better than the classic approaches ROVER and RASA/HRRASA available in Crowd-Kit.
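
As a rough illustration of the "median under edit distance" idea mentioned above, here is a minimal sketch. It assumes a word-level Levenshtein distance and lowercase whitespace tokenization; the function names `edit_distance` and `median_transcription` are hypothetical and are not taken from the competition code or from Crowd-Kit.

```python
from typing import List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def median_transcription(candidates: List[str]) -> str:
    """Pick the candidate with the smallest total distance to all candidates.

    Assumption: distance is measured over lowercase, whitespace-split words.
    Ties are resolved in favor of the earliest candidate.
    """
    tokenized = [c.lower().split() for c in candidates]

    def total_distance(i: int) -> int:
        return sum(edit_distance(tokenized[i], other) for other in tokenized)

    best = min(range(len(candidates)), key=total_distance)
    return candidates[best]


candidates = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "the cat sat on the mat",
    "a cat sit on the mat",
]
print(median_transcription(candidates))
```

This variant only selects one of the existing transcriptions; it never builds a new one, which is one reason language-model approaches can do better.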
    Follow the most efficient approach
    The challenge winners share their experience
    Mikhail Orzhenovskii
    1st place
    I started with a non-ML baseline, but then the simplest language-model-based approach achieved a significantly higher score, so I continued fine-tuning the language model. Different augmentation tricks did not help, so I focused on hyperparameter search.
    Sergey Pletenev
    2nd place
    For me, this competition was a good first step toward speech-to-text projects. I decided to treat the task as text summarization. I experimented with different models, such as T5, BART, and PEGASUS. The most difficult part was getting the model to generate one and only one sentence per set. After the competition I still had a lot of ideas left to implement, e.g., generating noise in the text that resembles ASR noise. Such noise is quite different from grammatical errors.
    Ilya Karpov
    3rd place
    I used a linear combination of the mean WER between the current hypothesis and the others, a pretrained language model (bert-base), a hypothesis classifier (fine-tuned from a pretrained language model for 1 epoch), hypothesis length, and 2 performer features (consistency with other performers and total number of tasks). Additionally, I collected 1.5k phrases from open sources using Toloka. I asked performers to search for the best hypothesis of my model on the first page of Yandex search results.
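
The first feature in that description, the mean WER between one hypothesis and the others, can be sketched as follows. This is only an illustration under stated assumptions: the language-model, classifier, and performer features are omitted, the weights `w_wer` and `w_len` are hypothetical placeholders rather than the values used in the solution, and all function names are invented for the example.

```python
from typing import List, Sequence


def word_edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance with unit costs."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (0 if r == h else 1)))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        # Convention assumed here: empty reference gives 0 or 1.
        return 0.0 if not hyp else 1.0
    return word_edit_distance(ref, hyp) / len(ref)


def mean_wer(index: int, hypotheses: List[str]) -> float:
    """Average WER of one hypothesis, using every other one as the reference."""
    others = [h for i, h in enumerate(hypotheses) if i != index]
    if not others:
        return 0.0
    return sum(wer(o, hypotheses[index]) for o in others) / len(others)


def pick_hypothesis(hypotheses: List[str],
                    w_wer: float = -1.0, w_len: float = 0.0) -> str:
    """Score each hypothesis with a linear combination and return the best.

    Only two of the features from the description are included; the weights
    are hypothetical and would normally be fitted on training data.
    """
    scores = [w_wer * mean_wer(i, hypotheses) + w_len * len(h.split())
              for i, h in enumerate(hypotheses)]
    best = max(range(len(hypotheses)), key=scores.__getitem__)
    return hypotheses[best]


hyps = ["hello world", "hello word", "hello world"]
print(pick_hypothesis(hyps))
```

A negative weight on mean WER rewards hypotheses that agree with the rest of the set, which is the intuition behind using it as a consensus feature.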