Existing audio deepfake datasets have greatly expanded evaluation across generators, languages, and domains, but most remain generation-centric and offer limited support for studying post-generation delivery. In realistic misuse scenarios, forged audio is often altered by platform re-encoding, telephony transmission, or replay-like recapture before it reaches a listener or an automated system. Recent delivery-related studies usually model such factors as isolated distortions rather than as structured, ordered delivery routes, and often lack control over transcript and speaker conditions.
ChainBench-ADD addresses this gap by treating post-generation delivery as a structured benchmark dimension. It models delivery through reusable operators, ordered templates, and realized chains across five delivery families: direct, platform-like, telephony, simulated replay, and hybrid. Each delivered sample remains linked to a clean bona fide or spoof parent under matched-parent control, enabling attribution of detector behavior specifically to delivery rather than to differences in content, speaker, or source quality.
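The operator/template/chain design with matched-parent control can be pictured as a small data model: an ordered list of waveform transforms realized against a parent utterance, with the parent link preserved in the output. The sketch below is illustrative only; the names (`Operator`, `Template`, `DeliveredSample`, `apply_chain`) and field layout are assumptions, not the benchmark's actual API.

```python
from dataclasses import dataclass
from typing import Callable, List

import numpy as np

# An operator is a waveform-domain transform: (waveform, sample_rate) -> waveform.
Operator = Callable[[np.ndarray, int], np.ndarray]

@dataclass
class Template:
    family: str                # e.g. "telephony" (hypothetical value)
    name: str                  # e.g. "nb_mulaw"
    operators: List[Operator]  # ordered: applied left to right

@dataclass
class DeliveredSample:
    parent_id: str             # link back to the clean bona fide / spoof parent
    template: Template         # which delivery route produced this waveform
    waveform: np.ndarray

def apply_chain(parent_id: str, wav: np.ndarray, sr: int,
                template: Template) -> DeliveredSample:
    """Realize one delivery chain while preserving the matched-parent link."""
    out = wav
    for op in template.operators:
        out = op(out, sr)
    return DeliveredSample(parent_id, template, out)
```

Because every delivered sample carries its `parent_id` and template, detector scores on delivered audio can be compared directly against scores on the untouched parent, which is what makes attribution to delivery (rather than content or speaker) possible.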
The current release contains 941,201 waveforms derived from 55,813 parents (18,703 bona fide from Common Voice and AISHELL-3; 37,110 spoof from six contemporary TTS systems) across 448 speakers in English and Mandarin Chinese. From the shared metadata, we define five evaluation tasks: in-chain detection, three matched local interventions (operator substitution, parameter perturbation, and order swap), and lineage-based delivery robustness. A leave-one-template-out protocol further tests transfer to unseen templates within a family.
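The leave-one-template-out protocol reduces to a split over the shared metadata: within one delivery family, hold out all samples of one template for testing and train on the remaining templates. A minimal sketch, assuming each metadata record carries hypothetical `family` and `template` fields:

```python
def loto_split(records, family, held_out_template):
    """Leave-one-template-out: train on all templates in `family`
    except the held-out one; test on the held-out template only."""
    in_family = [r for r in records if r["family"] == family]
    train = [r for r in in_family if r["template"] != held_out_template]
    test = [r for r in in_family if r["template"] == held_out_template]
    return train, test
```

Cycling the held-out template over every template in a family yields one transfer measurement per template, probing generalization to unseen delivery routes within that family.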
| Operator | Representative Settings | Families |
|---|---|---|
| resample | 16→8, 16→24, 16→8→16, 16→24→16, 16→32→16 kHz | P / T / H |
| band-limit | Narrowband (250–3400 Hz), Wideband (50–7000 Hz) | T / H |
| codec | AAC (24/32/48 kbps), Opus (16/24/32 kbps), GSM, μ-law PCM | P / T / H |
| re-encode | Same/cross-codec AAC/Opus recompression (24/32 kbps) | P / T / H |
| packet loss | 1/3/5/10% loss; burst 2/3/5; repeat-fade, interpolation, or noise-fill | T / H |
| noise | White, pink, brown, babble, hiss, hum; 30/20/15/10 dB SNR | R / H |
| RIR | Small/medium/large rooms; RT60 0.2/0.4/0.6/0.8 s; 0.5/1/2/3 m | R / H |
| call-path | Joint channel filtering, codec, packet loss, jitter (0/8/16 ms), AGC | T / H |
Families: P = Platform-like, T = Telephony, R = Simulated Replay, H = Hybrid. All operators are waveform-domain transforms, not symbolic tags.
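Two of the operators above are simple enough to sketch directly in the waveform domain: the μ-law PCM codec step (compand to 8 bits and expand back) and additive noise at a target SNR. These are simplified NumPy illustrations under assumed conventions (mono float waveforms in [-1, 1]), not the benchmark's exact implementations.

```python
import numpy as np

def mulaw_roundtrip(wav: np.ndarray, mu: float = 255.0) -> np.ndarray:
    """Compand to 8-bit mu-law and expand back, as in mu-law PCM telephony."""
    x = np.clip(wav, -1.0, 1.0)
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)   # compress
    q = np.round((y + 1.0) / 2.0 * 255.0) / 255.0 * 2.0 - 1.0  # 8-bit quantize
    return np.sign(q) * ((1.0 + mu) ** np.abs(q) - 1.0) / mu   # expand

def add_noise(wav: np.ndarray, snr_db: float, rng=None) -> np.ndarray:
    """Add white noise scaled so the output has the requested SNR in dB."""
    if rng is None:
        rng = np.random.default_rng(0)
    noise = rng.standard_normal(len(wav))
    sig_p = np.mean(wav ** 2)
    noise_p = np.mean(noise ** 2)
    scale = np.sqrt(sig_p / (noise_p * 10 ** (snr_db / 10.0)))
    return wav + scale * noise
```

Chaining such functions in a fixed order is exactly what a realized delivery template does; the remaining operators (codecs, RIR convolution, packet loss) follow the same waveform-in, waveform-out contract but require external codec and impulse-response resources.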
Each card below shows one parent utterance delivered through all five families. Compare how different delivery scenarios alter the same underlying speech.
| Family | Template | Operator Chain | Audio |
|---|---|---|---|
| Direct | direct_clean | identity (no processing) | |
| Platform-like | aac_reencode | codec→re-encode | |
| Telephony | session_nb_mulaw | call-path | |
| Sim. Replay | rir_noise_resample | RIR→noise→resample | |
| Hybrid | reencode_rir_plr | re-encode→RIR→packet loss | |
| Family | Template | Operator Chain | Audio |
|---|---|---|---|
| Direct | direct_clean | identity (no processing) | |
| Platform-like | aac_single | codec | |
| Telephony | wb_opus | band-limit→codec | |
| Sim. Replay | noise_rir | noise→RIR | |
| Hybrid | rir_aac | RIR→codec | |
| Family | Template | Operator Chain | Audio |
|---|---|---|---|
| Direct | direct_clean | identity (no processing) | |
| Platform-like | aac_single | codec | |
| Telephony | nb_mulaw | band-limit→codec | |
| Sim. Replay | rir_reencode | RIR→re-encode | |
| Hybrid | aac_rir | codec→RIR | |
| Family | Template | Operator Chain | Audio |
|---|---|---|---|
| Direct | direct_clean | identity (no processing) | |
| Platform-like | aac_reencode | codec→re-encode | |
| Telephony | nb_mulaw_plr | band-limit→codec→packet loss | |
| Sim. Replay | rir_noise | RIR→noise | |
| Hybrid | bandlimit_codec_rir | band-limit→codec→RIR | |
ChainBench-ADD distinguishes between code and the benchmark package. The code repository, including construction scripts, configurations, and baseline implementations, is released under the MIT License. The dataset release is distributed under the ChainBench-ADD Dataset Terms of Use.
ChainBench-ADD is assembled from multiple upstream speech resources and speech-generation systems. All third-party datasets, models, audio, and other external assets remain subject to their original licenses, terms, and attribution requirements. The ChainBench-ADD Dataset Terms of Use apply to this benchmark release as distributed by the authors and do not replace or weaken any applicable upstream obligations.
ChainBench-ADD is released for research on audio deepfake detection, robustness evaluation, forensic analysis, provenance, and benchmarking. It is provided for defensive and scientific use only. The benchmark must not be used to support or enable impersonation, fraud, harassment, social engineering, unauthorized voice cloning, deceptive media generation, biometric surveillance, or the training or improvement of systems intended for deceptive speech generation.
Redistribution of the benchmark package, or any subset that includes third-party material, is permitted only to the extent allowed by all applicable upstream terms and must preserve the relevant notices and this use statement.