Datathon@IndoML 2026

About the Datathon

Welcome to Datathon@IndoML 2026 - a research-oriented data science competition held in conjunction with IndoML 2026. Building on the success of previous editions, this year's datathon challenges participants to tackle noise event detection and removal in real-world Indic speech - a critical problem for inclusive, robust Automatic Speech Recognition across Indian languages.

The competition is organised into two tightly coupled tracks. Track 1 (Detection): detect noise events with precise timestamps, making effective use of data annotated at different quality levels. Track 2 (Removal): suppress the detected events while preserving the underlying speech. The dataset is a curated subset of the Vaani corpus, consisting of ~150 hours of real-world Indic audio with three levels of annotation quality.

Top-performing teams will be invited to attend IndoML 2026 and present their solutions to leading researchers and professionals from academia and industry. These teams will also receive exciting cash prizes.

Motivation

Most academic speech enhancement benchmarks are built around stationary noise (white, pink, café hum) or studio-recorded mixtures. Field recordings from rural and semi-urban India contain bursty, semantically rich events — a passing motorbike, a hen, a pressure-cooker whistle, a doorbell, a TV in another room — that current denoisers either smear over or treat as speech.

Two consequences follow:

Downstream ASR fails disproportionately on speakers from these environments, amplifying the existing performance gap between high-resource and low-resource Indic languages.
Standard sound-event detection (SED) datasets (AudioSet, DESED, DCASE) under-represent both Indian acoustic contexts and Indian speech, making transfer learning from them brittle.

This challenge targets that gap directly. It asks participants to treat noise as a first-class, labelled, time-localised object — and then to suppress it without harming the speech.

Datathon Chairs

Dr. Subhajit Datta

Heritage Institute of Technology

Dr. Mahesh Mohan

IIT Kharagpur

Dr. Prasanta Kumar Ghosh

IISc Bangalore

Dr. Debopriyo Banerjee

Inception - G42

Technical Volunteers

Shivay Vadhera

IIT Bombay

Nihar Desai

ARTPARK, IISc

Pavan Kumar J

ARTPARK, IISc

Sujith P

ARTPARK, IISc

Shubhadip Nag

Walmart

Announcements

April 2026

Website Launched!

The official website for Datathon@IndoML 2026 is now live. Registration details, task description, dataset, and timeline will be announced soon.

Coming Soon

Registration Opens

Stay tuned for updates on registration, task details, and important dates.

Task Description

Noise Event Detection & Removal in Indic Speech

Robust, Inclusive Speech Processing for Real-World Indic Speech

Speech recordings collected in real Indian environments are dominated by non-stationary background events - vehicle horns, dogs barking, children crying, doorbells, ringtones, kitchen appliances, and devotional music. These events degrade downstream Automatic Speech Recognition (ASR).

This challenge invites participants to build a two-stage system on the Vaani dataset that (i) detects noise events with precise timestamps, and (ii) removes them while preserving the underlying speech. The challenge is framed under the Responsible AI theme, with explicit emphasis on robustness, linguistic inclusivity across multiple Indian languages, and methodological transparency.

Challenge Tracks

Participants may enter either track independently or both.

01

Detection

Detect Events

Detect noise events in Indic speech recordings with precise onset/offset timestamps. Effectively utilise data annotated at different levels.

Raw Audio → Your Model → Event JSON

[{"onset": 1.24, "offset": 3.81}, {"onset": 4.31, "offset": 4.71}, {"onset": 5.04, "offset": 5.41}]

Evaluation Metrics: Event-based F1, Segment Dice

The event timeline (onset/offset pairs) from Track 1 is passed as conditioning input to Track 2, guiding the model on where to suppress noise.

02

Removal

Suppress & Preserve

Suppress the detected noise events while preserving the underlying speech signal - output clean, intelligible audio.

Audio + Events → Your Model → Clean WAV

16 kHz mono WAV - one per test clip, original filename retained

Evaluation Metrics: SI-SDR, Delta WER, PESQ (top-5 only)

Competition Process

Submission

Track 1: Submit a JSON file with onset/offset events per clip.
Track 2: Submit cleaned 16 kHz mono WAV files.
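As a rough illustration of the two submission artefacts, the sketch below serialises detected events to JSON and checks that an output file is 16 kHz mono. The exact JSON schema is not specified here, so the clip-name-keyed layout is an assumption modelled on the onset/offset example in the Track 1 description, and the function names `events_to_json` and `is_16k_mono_wav` are hypothetical.

```python
import json
import wave


def events_to_json(events_by_clip):
    """Serialise {clip_name: [(onset, offset), ...]} to a JSON string.

    The clip-name-keyed layout is an assumption, not the official schema.
    """
    payload = {
        clip: [
            {"onset": round(onset, 3), "offset": round(offset, 3)}
            for onset, offset in sorted(events)
        ]
        for clip, events in events_by_clip.items()
    }
    return json.dumps(payload, indent=2)


def is_16k_mono_wav(path):
    """Return True if the PCM WAV file at `path` is 16 kHz and single-channel."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate() == 16000 and wav.getnchannels() == 1
```

Running a check like `is_16k_mono_wav` over every output file before upload is a cheap way to catch resampling or channel mistakes. Note that the standard-library `wave` module reads uncompressed PCM files only.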

Evaluation

Automated scoring on held-out test clips. Track 1: F1 + Dice. Track 2: SI-SDR + ΔWER, then PESQ for top-5.

Leaderboard

Live rankings published after each submission window. Final standings adjusted by expert Novelty Score for top-5 entries.

Awards

Top teams invited to present at IndoML 2026 and receive cash prizes. Code release required for prize-eligible entries.

Evaluation Metrics

Track 1 — Detection

Primary

Event-based F1

A prediction counts as correct when its temporal extent matches the ground-truth event to within ±20% of the event duration.
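The matching rule above leaves some details open, so the following is only a sketch under stated assumptions: a predicted event matches a reference event when both its onset and its offset fall within 20% of the reference event's duration, and each reference event can be matched at most once (greedy, in onset order). The official scorer may differ on both points. The function name `event_f1` is hypothetical.

```python
def event_f1(predicted, reference, tolerance=0.2):
    """Event-based F1 for lists of (onset, offset) pairs in seconds.

    Assumed rule: onset and offset must each lie within
    tolerance * reference_duration of the reference boundaries.
    """
    matched = set()  # indices of reference events already claimed
    true_positives = 0
    for p_on, p_off in sorted(predicted):
        for i, (r_on, r_off) in enumerate(reference):
            if i in matched:
                continue
            slack = tolerance * (r_off - r_on)
            if abs(p_on - r_on) <= slack and abs(p_off - r_off) <= slack:
                matched.add(i)
                true_positives += 1
                break
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)
```

For example, with reference events `[(1.0, 3.0), (4.0, 5.0)]` and predictions `[(1.1, 3.2), (4.5, 5.0), (7.0, 8.0)]`, only the first prediction matches under this rule, giving precision 1/3, recall 1/2 and F1 of 0.4.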

Primary

Segment Dice

Temporal overlap between predicted and reference event segments: Dice = 2 |P ∩ G| / (|P| + |G|), where P is the predicted event region and G is the ground-truth event region.
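The Dice formula above can be evaluated directly on interval lists, without rasterising to frames. The sketch below is one way to do it. It merges overlapping intervals first so that no duration is counted twice, and it returns 1.0 when both lists are empty, which is an assumed convention rather than something the scoring rules state. The names `merge_intervals` and `segment_dice` are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping (onset, offset) pairs into disjoint intervals."""
    merged = []
    for onset, offset in sorted(intervals):
        if merged and onset <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], offset))
        else:
            merged.append((onset, offset))
    return merged


def segment_dice(predicted, reference):
    """Dice = 2 |P n G| / (|P| + |G|) over total event durations in seconds."""
    p = merge_intervals(predicted)
    g = merge_intervals(reference)
    # Pairwise overlap is safe because each merged list is disjoint.
    overlap = sum(
        max(0.0, min(p_off, g_off) - max(p_on, g_on))
        for p_on, p_off in p
        for g_on, g_off in g
    )
    total = sum(off - on for on, off in p) + sum(off - on for on, off in g)
    # Assumed convention: two empty timelines agree perfectly.
    return 2 * overlap / total if total > 0 else 1.0
```

A prediction of `[(1.0, 3.0)]` against a reference of `[(2.0, 4.0)]` overlaps for one second out of four seconds of combined duration, so the score is 0.5.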

Overall ranking: equal-weight average of the F1 and Dice scores.

Track 2 — Removal

Primary

SI-SDR

Scale-Invariant Signal-to-Distortion Ratio between enhanced signal and synthetic clean reference.
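For readers unfamiliar with the metric, here is a minimal pure-Python sketch of the commonly used SI-SDR definition: both signals are made zero-mean, the estimate is projected onto the reference to obtain the target component, and the ratio of target energy to residual energy is reported in dB. The organisers' scorer may differ in details such as alignment or averaging across clips, and the function name `si_sdr` is hypothetical.

```python
import math


def si_sdr(estimate, reference):
    """SI-SDR in dB between two equal-length sequences of samples."""
    if len(estimate) != len(reference) or not reference:
        raise ValueError("signals must be non-empty and of equal length")
    n = len(reference)
    mean_e = sum(estimate) / n
    mean_r = sum(reference) / n
    e = [x - mean_e for x in estimate]
    r = [x - mean_r for x in reference]
    ref_energy = sum(x * x for x in r)
    if ref_energy == 0:
        raise ValueError("reference signal has zero energy")
    # Optimal scaling of the reference that best explains the estimate.
    scale = sum(a * b for a, b in zip(e, r)) / ref_energy
    target = [scale * x for x in r]
    target_energy = sum(t * t for t in target)
    residual_energy = sum((a - t) ** 2 for a, t in zip(e, target))
    if residual_energy == 0:
        return math.inf
    if target_energy == 0:
        return -math.inf
    return 10 * math.log10(target_energy / residual_energy)
```

Because of the projection step, multiplying the estimate by any non-zero constant leaves the score unchanged, which is what makes the metric scale-invariant. In practice a vectorised implementation would be used for full-length audio.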

Primary

Delta WER

A frozen multilingual Indic ASR is run on both noisy and enhanced clips. Delta WER = WER_noisy - WER_enhanced. Higher is better.
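The Delta WER computation can be sketched as follows. This is an illustration only: it assumes whitespace tokenisation and corpus-level WER (total word errors divided by total reference words), neither of which is specified above, and it takes ASR transcripts as plain strings rather than running any ASR model. The function names are hypothetical.

```python
def word_errors(reference, hypothesis):
    """Return (edit distance in words, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1], len(ref)


def corpus_wer(references, hypotheses):
    """Pooled WER: total word errors over total reference words."""
    errors = words = 0
    for ref, hyp in zip(references, hypotheses):
        e, n = word_errors(ref, hyp)
        errors += e
        words += n
    return errors / words


def delta_wer(references, noisy_hyps, enhanced_hyps):
    """Delta WER = WER_noisy - WER_enhanced (higher is better)."""
    return corpus_wer(references, noisy_hyps) - corpus_wer(references, enhanced_hyps)
```

If the ASR output on the noisy clip is `"a x c"` and on the enhanced clip is `"a b c d"` for the reference `"a b c d"`, the noisy WER is 2/4 and the enhanced WER is 0, so Delta WER is 0.5. A negative value would mean the enhancement made recognition worse.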

Top-5 Only

PESQ

Perceptual Evaluation of Speech Quality - intelligibility & naturalness vs. clean references. Evaluated only for the top-5 entries in the initial ranking.

Initial ranking: equal-weight average of SI-SDR and Delta WER. The top-5 entries are then evaluated with PESQ.

Novelty Score - Following metric-based ranking, an expert panel applies a novelty factor to the top-5 entries in each track to determine the final standings. The score rewards original contributions: new architectures, novel training regimes, principled event-conditioning, or semi-supervised approaches exploiting unlabelled Vaani audio.

Responsible AI Alignment

This challenge is positioned under the Responsible AI track on three explicit axes:

Robustness

Real-world Indic recordings - not curated studio mixtures - are the evaluation distribution. Systems are scored on actual ASR improvement (Delta WER), not signal-level metrics alone.

Inclusivity

Vaani spans multiple Indian languages and a wide range of speakers. The eval set is monitored for language-wise and class-wise balance so no sub-population is under-represented.

Transparency

Top-5 submissions must release code and document pre-trained dependencies. The frozen ASR used for Delta WER is publicly identified for independent reproducibility.

Dataset

Vaani Corpus

A large-scale, openly released Indic speech dataset spanning multiple Indian languages, collected across districts of India. Learn more at vaani.iisc.ac.in or read the paper.

The dataset consists of three types of annotated noise events, totalling ~150 hours of training audio. An additional 10 hours of noise events with clean timestamps will be provided for final evaluation. Effectively utilising data annotated at different quality levels is a key part of the challenge.

Sample Preview on HuggingFace

#	Annotation Type	Duration (hrs)	Description
1	Clean Timestamps (🥇 Gold)	20	Noise events with precise timestamps where mutual agreement between multiple annotators has been verified
2	Noisy Timestamps (🥈 Silver)	100	Annotated noise events with timestamps, but agreement between annotators is not verified
3	No Timestamps (🥉 Bronze)	30	Only noise event tags present in the transcript — no onset/offset timestamps
Training Total		150

Annotation Types & Distribution

Three levels of annotation quality, distributed across the training set as follows:

Noisy Timestamps (🥈 Silver): 100 hrs
No Timestamps (🥉 Bronze): 30 hrs
Clean Timestamps (🥇 Gold): 20 hrs

Clean timestamp data is the smallest subset. Effectively leveraging noisy and tag-only annotations alongside clean labels is a key part of the challenge.

Clean Timestamps (🥇 Gold)

Noise events with precise onset/offset timestamps where mutual agreement between multiple annotators has been verified.

[ { "category": "vehicle_traffic", "tag": "<horn>", "start": "2.714", "end": "3.761" },
{ "category": "human_non_speech", "tag": "[breathing]", "start": "4.938", "end": "5.410" },
{ "Verification_status": "Verified" } ]

20 hrs

Noisy Timestamps (🥈 Silver)

Annotated noise events with timestamps, but agreement between annotators is not verified.

[ { "category": "vehicle_traffic", "tag": "<horn>", "start": "2.714", "end": "3.761" },
{ "category": "human_non_speech", "tag": "[breathing]", "start": "4.938", "end": "5.410" } ]

100 hrs
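The Gold and Silver examples above share the same layout, so one loader can cover both. The sketch below assumes the annotation is a JSON list of event objects with string-valued "start" and "end" fields, optionally followed by a "Verification_status" entry, exactly as in the two examples. Real files may carry additional fields, and the function name `parse_timestamped_annotation` is hypothetical.

```python
import json


def parse_timestamped_annotation(raw):
    """Parse a Gold/Silver annotation string into (events, verified).

    Layout is inferred from the examples shown here, not from a formal spec.
    """
    events = []
    verified = False
    for entry in json.loads(raw):
        if "Verification_status" in entry:
            verified = entry["Verification_status"] == "Verified"
        elif "start" in entry and "end" in entry:
            events.append({
                "category": entry.get("category"),
                "tag": entry.get("tag"),
                "onset": float(entry["start"]),   # timestamps arrive as strings
                "offset": float(entry["end"]),
            })
    return events, verified
```

Keeping the `verified` flag alongside the events makes it straightforward to weight or filter Gold and Silver clips differently during training.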

No Timestamps (🥉 Bronze)

Only noise event tags present in the transcript — no onset/offset timestamps provided.

<noise> सजावट के <horn> </horn> लिए यहाँ एक गुलाब भी
<horn> </horn> लगाया गया है। </noise>

30 hrs
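Bronze clips carry only clip-level evidence of which events occur, which can still serve as weak labels. The sketch below counts event tags in a transcript like the one above. It assumes events appear as angle-bracket opening tags and that `<noise>` is a wrapper around the utterance rather than an event class. Both are guesses from the single example, and the function name `bronze_event_tags` is hypothetical.

```python
import re
from collections import Counter

# Opening tags only: "<horn>" matches, "</horn>" does not.
OPENING_TAG = re.compile(r"<(?!/)([^<>/\s]+)>")


def bronze_event_tags(transcript, wrapper_tags=("noise",)):
    """Count event tags in a tag-only transcript, skipping assumed wrapper tags."""
    tags = OPENING_TAG.findall(transcript)
    return Counter(tag for tag in tags if tag not in wrapper_tags)
```

Applied to the example transcript, this yields a count of two for `horn`. The position of each tag between words could also give a coarse ordering of events, though no timing information can be recovered from it.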

Important Dates

Registration Opens

TBA

Development Phase

TBA

Test Phase

TBA

Results Announcement

TBA

Presentation at IndoML 2026

TBA

All deadlines will be at 12:00 Noon IST (Indian Standard Time).

Prizes

Cash Prizes

Exciting cash prizes await top-performing teams. Detailed prize distribution will be announced along with the registration opening.

Present at IndoML 2026

Top teams will be invited to present their solutions at IndoML 2026, in front of leading researchers from academia and industry.

Novelty Award

An expert panel award for the most original methodological contribution: new architectures, novel training regimes, or unsupervised approaches.

FAQ

Who can participate?

Students and early-career professionals are welcome. Each team must include at least one member affiliated with an Indian university or research institution.

Is there a team size limit?

There is no restriction on team size. However, each participant may only join one team.

What models and data can we use?

We follow an open model and open data policy. Teams may use any publicly available, closed-source, or proprietary models, along with additional data or augmentation strategies.

How will submissions be evaluated?

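Track 1 is ranked by the equal-weight average of event-based F1 and segment Dice. Track 2 is ranked by the equal-weight average of SI-SDR and Delta WER, with PESQ evaluated for the top-5 entries. An expert Novelty Score then adjusts the final standings of the top-5 entries in each track. See the Evaluation Metrics section for details.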

Will top teams get travel support?

Top-performing teams will be invited to present at IndoML 2026. Details regarding travel support will be communicated later.

Contact Us

Email

datathon@indoml.in

IndoML

indoml.in

Expected Outcomes

Public Benchmark

A public benchmark for noise-event-aware speech enhancement on Indic audio — a gap that currently has no widely adopted dataset.

Open-Source Systems

Open-sourced winning systems, raising the floor of available denoising tools for Indian-language ASR.

Evaluation Harness

A reusable evaluation harness pairing event-detection metrics with downstream ΔWER on real audio.

Previous Editions

Check out last year's edition: Datathon@IndoML 2025 - Evaluating LLM-Powered AI Tutors