Abstract: As a critical sub-task of Automatic Music Transcription (AMT), Automatic Drum Transcription (ADT) remains challenging. The difficulty stems from the inherent characteristics of percussive sound sources: unlike the sounds of melodic instruments (e.g., violins and guitars), drum sounds exhibit transient, broadband energy distributions and lack a well-defined harmonic series, which hinders the extraction of symbolic representations from music audio. Moreover, existing ADT methods often overlook the contextual dependencies of musical performances, failing to capture pattern consistency within musical sections and the evolution of fill passages between them. To address these limitations, we propose Hi-Drum, a novel paradigm that combines a self-supervised music understanding model with a Large Language Model (LLM) and integrates a Hierarchical Drum Module (HDM). The HDM adopts a three-branch parallel architecture to explicitly model coarse-to-fine features, capturing both local variations and global similarities. To further explore the critical roles of dynamics and timing, we design a multi-task fine-tuning strategy that incorporates three core components of symbolic representations: onset, velocity, and frame detection. Extensive experiments on multiple open-source datasets demonstrate that Hi-Drum achieves state-of-the-art performance.
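The multi-task objective named in the abstract (onset, velocity, and frame detection) can be illustrated with a minimal sketch. The abstract does not state the loss functions or their weights, so everything here is an assumption made for illustration: binary cross-entropy for the onset and frame targets, a mean squared error on velocity counted only at frames where an onset occurs, and equal weights. The names `binary_cross_entropy`, `masked_mse`, and `multitask_loss` are hypothetical and are not taken from Hi-Drum.

```python
import math
from typing import Sequence, Tuple


def binary_cross_entropy(pred: Sequence[float], target: Sequence[float],
                         eps: float = 1e-7) -> float:
    """Mean binary cross-entropy between probabilities and 0/1 targets."""
    total = 0.0
    for p, y in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(pred)


def masked_mse(pred: Sequence[float], target: Sequence[float],
               mask: Sequence[float]) -> float:
    """Mean squared error over positions where mask == 1."""
    active = sum(mask)
    if active == 0:
        return 0.0
    squared = sum(m * (p - y) ** 2 for p, y, m in zip(pred, target, mask))
    return squared / active


def multitask_loss(onset_pred: Sequence[float], onset_true: Sequence[float],
                   frame_pred: Sequence[float], frame_true: Sequence[float],
                   vel_pred: Sequence[float], vel_true: Sequence[float],
                   weights: Tuple[float, float, float] = (1.0, 1.0, 1.0)) -> float:
    """Weighted sum of onset, frame, and velocity losses (assumed form)."""
    w_onset, w_frame, w_vel = weights
    onset_loss = binary_cross_entropy(onset_pred, onset_true)
    frame_loss = binary_cross_entropy(frame_pred, frame_true)
    # Velocity is only meaningful where a drum hit actually starts.
    vel_loss = masked_mse(vel_pred, vel_true, onset_true)
    return w_onset * onset_loss + w_frame * frame_loss + w_vel * vel_loss
```

Masking the velocity term with the onset targets keeps silent frames from dominating the regression loss. Whether Hi-Drum does the same is not stated in the abstract.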
Overall model architecture. The encoder extracts general music representations from the audio signal. These representations then undergo multi-scale feature aggregation in the three-branch HDM, after which the Qwen decoder performs sequence modeling. Finally, the multi-task fine-tuning module generates the drum score prediction.
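To make the idea of three-branch, coarse-to-fine aggregation concrete, here is a minimal sketch of one way parallel branches can summarize the same frame sequence at different time scales. It is not the HDM itself. The pooling operator (non-overlapping average pooling), the window sizes, the nearest-neighbor upsampling, and the concatenation step are all assumptions chosen for illustration, and the function names are hypothetical.

```python
from typing import List, Sequence

Frames = List[List[float]]  # indexed as [time][feature]


def avg_pool_time(frames: Frames, window: int) -> Frames:
    """Average non-overlapping windows of `window` frames along time."""
    pooled = []
    for start in range(0, len(frames), window):
        chunk = frames[start:start + window]
        dim = len(chunk[0])
        pooled.append([sum(f[d] for f in chunk) / len(chunk) for d in range(dim)])
    return pooled


def upsample_time(frames: Frames, factor: int, length: int) -> Frames:
    """Repeat each pooled frame `factor` times, trimmed to `length` frames."""
    return [frames[t // factor] for t in range(length)]


def three_branch_aggregate(frames: Frames,
                           windows: Sequence[int] = (1, 4, 16)) -> Frames:
    """Concatenate fine, medium, and coarse views of every frame."""
    length = len(frames)
    branches = [upsample_time(avg_pool_time(frames, w), w, length)
                for w in windows]
    fused = []
    for t in range(length):
        combined: List[float] = []
        for branch in branches:
            combined.extend(branch[t])  # fine features first, coarse last
        fused.append(combined)
    return fused
```

With `windows=(1, 4, 16)`, each output frame holds its own features alongside averages over its surrounding 4-frame and 16-frame blocks. In this toy form, the short window stands in for local variation and the long windows for section-level similarity.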
Case Study
In this section, we present case studies across diverse genres and BPMs to demonstrate Hi-Drum's robustness. By integrating a Hierarchical Drum Module (HDM) with an LLM, Hi-Drum captures coarse-to-fine rhythmic features. Comparing its output with the ground truth and with OaF Drums highlights its superior ability to model local variations and global structural consistency in complex scenarios.
[Demo-01]
[Demo-02]
[Demo-03]