NLP CHALLENGE: Audio recordings transcription
$6,000 in cash prizes is up for grabs. Compete to find the best transcriptions of audio recordings.
Overview of the challenge
During the challenge, you will be working with a large-scale dataset of audio transcriptions. Your task will be to aggregate multiple transcriptions into a single high-quality transcription.
For this contest, we obtained Wikipedia articles vocalized by a popular voice assistant. Each recording was then transcribed by several Toloka contributors.
Why participate?
Large-scale dataset of audio transcriptions
$6,000 in cash prizes
The challenge is part of the VLDB 2021 Crowd Science Workshop
ML and beyond
We'll be handing out Toloka grants to the top 20 teams on the leaderboard twice during the challenge. With all of the crowdsourcing platform's features at your full disposal, you can pursue your creative solution even further.


Speech-to-text projects are ubiquitous on crowdsourcing platforms. However, individual annotators on these platforms are often unskilled or even malicious. Therefore, transcriptions collected on crowdsourcing platforms may be noisy.

To account for this problem, each audio recording is typically transcribed by multiple crowd annotators. But how do we aggregate these multiple transcriptions to obtain a final high-quality transcription? The goal of this competition is to answer this question.

Specifically, participants build a model that aggregates multiple transcriptions of an audio recording, obtained on a crowdsourcing platform, into a single high-quality transcription.
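
To make the task concrete, here is a minimal baseline sketch. It is not the organizers' method or the starter kit's code. It assumes, hypothetically, that the crowd answers arrive as (recording_id, text) pairs; the actual data format is described on the competition GitHub page. For each recording, the sketch returns the submitted transcription that has the smallest total word-level edit distance to all the others (the "medoid"). The function names are illustrative.

```python
from collections import defaultdict


def word_edit_distance(a, b):
    """Levenshtein distance between two lists of word tokens."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # delete tok_a
                            curr[j - 1] + 1,      # insert tok_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def aggregate(transcriptions):
    """Pick the transcription closest, in total, to all the others."""
    tokenized = [t.lower().split() for t in transcriptions]
    best_index = min(
        range(len(tokenized)),
        key=lambda i: sum(word_edit_distance(tokenized[i], other)
                          for other in tokenized),
    )
    return transcriptions[best_index]


def aggregate_all(rows):
    """Group (recording_id, text) pairs (an assumed format) and aggregate each group."""
    grouped = defaultdict(list)
    for recording_id, text in rows:
        grouped[recording_id].append(text)
    return {rid: aggregate(texts) for rid, texts in grouped.items()}
```

A baseline like this can only return one of the submitted transcriptions, so it cannot repair an error that every annotator made in a different place. Methods that merge transcriptions word by word may do better, at the cost of more complexity.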
$6,000 in cash prizes
The contest is sponsored by Toloka, which generously provides $6,000 in cash prizes, distributed among the top 3 teams:

  • First Place: $3,000
  • Second Place: $2,000
  • Third Place: $1,000

Each prize is split equally among all members of the winning team.
    Research publication
    The top-performing teams will be invited to submit reports on their solutions to the VLDB 2021 Crowd Science Workshop.





    1. April 15. Practice phase starts
       We release the training data, and you can start building your model.
    2. May 5. Competition begins
       We release the test data and start evaluating submissions on the public part of the test set.
    3. May 25. Toloka funds are distributed
       We give out Toloka promo codes to the top 20 teams on the public leaderboard.
    4. June 7. Toloka funds are distributed
       We give out a second round of Toloka promo codes to the top 20 teams on the public leaderboard that submitted solutions between May 25 and June 7.
    5. June 18. Competition ends
       We close the submission portal and evaluate submissions on the private test set.
    6. June 28. VLDB Crowd Science Workshop submission deadline
       Teams are encouraged to submit a report on their solution to the workshop.
    For this competition, we generated a large number of audio recordings using Yandex SpeechKit and had them transcribed on the Toloka crowdsourcing platform. You can find a detailed description of the data on the competition GitHub page.

    You can participate alone or form a team of up to 5 participants. If you participate in a team, create a single account for your team on the challenge portal and use it to submit predictions.

    On our GitHub, we provide a starter kit to help you jump into the competition. Are you ready to make your first submission?

    Keep in touch with the community
    Join our Telegram chat where you can discuss the competition with organizers and other participants!

    If you have a private question, email the organizers at vldb21crowdchallenge@crowdscience.ai.