arXiv Sound
arxiv-sound.bsky.social
Automated posting of sound-related articles uploaded to arxiv.org (eess.AS + cs.SD)

Source: https://github.com/dsuedholt/bsky-paperbot-sound/

Inspired by @paperposterbot.bsky.social and https://twitter.com/ArxivSound
CustNetGC, a CNN with Custom Network Grad-CAM and CatBoost, uses spectral features (L-mHP, Spectral Slopes) from voice to predict Parkinson's Disease with 99.06% accuracy.
A Novel CustNetGC Boosted Model with Spectral Features for Parkinson's Disease Prediction
Abishek Karthik, Pandiyaraju V, Dominic Savio M, Rohit Swaminathan S
arxiv.org
November 20, 2025 at 11:33 AM
LargeSHS, a large-scale dataset of music adaptations from SecondHandSongs, contains 1.7M metadata entries and 900k audio links, enabling research in cover song generation.
LargeSHS: A large-scale dataset of music adaptation
Chih-Pin Tan, Hsuan-Kai Kao, Li Su, Yi-Hsuan Yang
arxiv.org
November 20, 2025 at 11:03 AM
Auden-Voice, a general-purpose voice encoder, balances identity and paralinguistic cues through multi-task training, demonstrating strong performance with LLMs.
Auden-Voice: General-Purpose Voice Encoder for Speech and Language Understanding
Mingyue Huo, Wei-Cheng Tseng, Yiwen Shao, Hao Zhang, Dong Yu
arxiv.org
November 20, 2025 at 10:33 AM
CASTELLA, a large-scale human-annotated audio dataset for audio moment retrieval, is introduced; fine-tuning a model on CASTELLA improved performance by 10.4 points.
CASTELLA: Long Audio Dataset with Captions and Temporal Boundaries
Hokuto Munakata, Takehiro Imamura, Taichi Nishimura, Tatsuya Komatsu
arxiv.org
November 20, 2025 at 10:03 AM
Paper advocates for preference alignment in music generation, highlighting challenges in temporal coherence and subjective quality assessment; techniques like MusicRL and DiffRhythm+ are discussed.
Aligning Generative Music AI with Human Preferences: Methods and Challenges
Dorien Herremans, Abhinaba Roy
arxiv.org
November 20, 2025 at 9:33 AM
Quality control pipeline implemented for MELD and IEMOCAP datasets; transfer learning from speaker and face recognition, with MAMBA fusion, achieved 64.8% accuracy.
Quality-Controlled Multimodal Emotion Recognition in Conversations with Identity-Based Transfer Learning and MAMBA Fusion
Zanxu Wang, Homayoon Beigi
arxiv.org
November 20, 2025 at 9:03 AM
Fine-tuning Audio-MAE and PANNs for COVID-19 detection showed limited generalization despite demographic stratification; small dataset sizes hinder deep learning performance.
Fine-tuning Pre-trained Audio Models for COVID-19 Detection: A Technical Report
Daniel Oliveira de Brito, Letícia Gabriella de Souza, Marcelo Matheus Gauy, Marcelo Finger, Arnaldo Candido Junior
arxiv.org
November 20, 2025 at 8:33 AM
SpotlightTTS enhances expressive TTS using voiced-aware style extraction and style direction adjustment, improving expressiveness and speech quality over baselines.
Voiced-Aware Style Extraction and Style Direction Adjustment for Expressive Text-to-Speech
Nam-Gyu Kim
arxiv.org
November 20, 2025 at 8:03 AM
IHearYou is a framework that detects depression by linking voice features to DSM-5 indicators and runs locally for privacy; validated on the DAIC-WOZ dataset, it shows consistent feature-indicator associations.
IHearYou: Linking Acoustic Features to DSM-5 Depressive Behavior Indicators
Jonas Länzlinger, Katharina Müller, Bruno Rodrigues
arxiv.org
November 20, 2025 at 7:33 AM
OBHS compresses audio by block-wise Huffman coding with canonical code representation and fallback mechanisms, achieving up to 93.6% compression with low complexity.
OBHS: An Optimized Block Huffman Scheme for Real-Time Audio Compression
Muntahi Safwan Mahfi, Md. Manzurul Hasan, Gahangir Hossain
arxiv.org
November 20, 2025 at 7:03 AM
CPFG-Net, a conditional variational autoencoder, controllably predicts perceptual features and tonal structures from melodies, generating harmonically coherent chord progressions based on a new dataset.
A Controllable Perceptual Feature Generative Model for Melody Harmonization via Conditional Variational Autoencoder
Dengyun Huang, Yonghua Zhu
arxiv.org
November 19, 2025 at 11:33 AM
IMSE replaces MET with Amplitude-Aware Linear Attention (MALA) and DE with Inception Depthwise Convolution (IDConv), reducing parameters by 16.8% compared to MUSE while maintaining performance.
IMSE: Efficient U-Net-based Speech Enhancement using Inception Depthwise Convolution and Amplitude-Aware Linear Attention
Xinxin Tang, Bin Qin, Yufang Li
arxiv.org
November 19, 2025 at 11:03 AM
TTA model, trained on 358k hours of speech data across ASR/ST and speech-text alignment tasks, produces robust cross-lingual speech representations; outperforms Whisper.
TTA: Transcribe, Translate and Alignment for Cross-lingual Speech Representation
Wei Liu, Jiahong Li, Yiwen Shao, Dong Yu
arxiv.org
November 19, 2025 at 10:33 AM
AQA system uses BEATs for feature extraction and Qwen2.5-7B-Instruct fine-tuned with GRPO for audio question answering, achieving 62.6% accuracy in the DCASE 2025 Challenge.
Audio Question Answering with GRPO-Based Fine-Tuning and Calibrated Segment-Level Predictions
Marcel Gibier, Nolwenn Celton, Raphaël Duroselle, Pierre Serrano, Olivier Boeffard, Jean-François Bonastre
arxiv.org
November 19, 2025 at 10:03 AM
Segmentwise pruning in audio-language models reduces computing costs by selectively retaining tokens; a time-aware strategy prunes 75% of tokens with at most a 2% decrease in performance.
Segmentwise Pruning in Audio-Language Models
Marcel Gibier, Raphaël Duroselle, Pierre Serrano, Olivier Boeffard, Jean-François Bonastre
arxiv.org
November 19, 2025 at 9:33 AM
CountEM, a novel AMT framework, uses note event histograms for supervision, refining predictions iteratively via Expectation-Maximization and reducing the need for local alignment.
Count The Notes: Histogram-Based Supervision for Automatic Music Transcription
Jonathan Yaffe, Ben Maman, Meinard Müller, Amit H. Bermano
arxiv.org
November 19, 2025 at 9:03 AM
FxSearcher, a gradient-free framework, uses Bayesian Optimization and a CLAP-based score function to find the best audio effect configurations based on a text prompt, preventing artifacts with a guiding prompt.
FxSearcher: gradient-free text-driven audio transformation
Hojoon Ki, Jongsuk Kim, Minchan Kwon, Junmo Kim
arxiv.org
November 19, 2025 at 8:33 AM
A systematic review of audio papers reveals preference learning is underexplored despite challenges in evaluating generative models; only 6% of papers consider preference learning.
Preference-Based Learning in Audio Applications: A Systematic Analysis
Aaron Broukhim, Yiran Shen, Prithviraj Ammanabrolu, Nadir Weibel
arxiv.org
November 19, 2025 at 8:03 AM
Principled Coarse-Graining (PCG) verifies speculative decoding proposals at the level of Acoustic Similarity Groups (ASGs), increasing acceptance and throughput on LibriTTS while maintaining intelligibility.
Principled Coarse-Grained Acceptance for Speculative Decoding in Speech
Moran Yanuka, Paul Dixon, Eyal Finkelshtein, Daniel Rotman, Raja Giryes
arxiv.org
November 19, 2025 at 7:33 AM
Speaker identification, knowledge distillation, and hierarchical attention fusion address speaker ambiguity and class imbalance in multi-speaker emotion recognition; achieves 67.75 on MELD and IEMOCAP.
Emotion Recognition in Multi-Speaker Conversations through Speaker Identification, Knowledge Distillation, and Hierarchical Fusion
Xiao Li, Kotaro Funakoshi, Manabu Okumura
arxiv.org
November 19, 2025 at 7:03 AM
CNN model performance in binaural sound source localization evaluated with various combinations of amplitude-based and phase-based features; generalization requires channel spectrograms with both ILD and IPD.
Systematic evaluation of time-frequency features for binaural sound source localization
Davoud Shariat Panah, Alessandro Ragano, Dan Barry, Jan Skoglund, Andrew Hines
arxiv.org
November 18, 2025 at 11:36 AM
PASE, a generative speech enhancer leveraging WavLM, uses representation distillation to mitigate linguistic hallucinations by cleaning final-layer features and reduces acoustic hallucinations with a dual-stream vocoder.
PASE: Leveraging the Phonological Prior of WavLM for Low-Hallucination Generative Speech Enhancement
Xiaobin Rong, Qinwen Hu, Mansur Yesilbursa, Kamil Wojcicki, Jing Lu
arxiv.org
November 18, 2025 at 11:08 AM
AMPBench, a new benchmark, reveals that Large Audio-Language Models (LALMs) exhibit a motion perception deficit, struggling to infer the direction and trajectory of moving sound sources.
Spatial Blind Spot: Auditory Motion Perception Deficits in Audio LLMs
Zhe Sun, Yujun Cai, Jiayu Yao, Yiwei Wang
arxiv.org
November 18, 2025 at 10:41 AM
Analysis of V2A models reveals that they struggle to accurately generate Foley sounds; FoleyBench is proposed to address the issue.
FoleyBench: A Benchmark For Video-to-Audio Models
Satvik Dixit, Koichi Saito, Zhi Zhong, Yuki Mitsufuji, Chris Donahue
arxiv.org
November 18, 2025 at 10:14 AM
Real-Time Single-Path TFC-TDF UNET (RT-STT), a lightweight real-time low-latency model for music demixing, uses channel expansion-based feature fusion and quantization to reduce inference time.
Towards Practical Real-Time Low-Latency Music Source Separation
Junyu Wu, Jie Liu, Tianrui Pan, Jie Tang, Gangshan Wu
arxiv.org
November 18, 2025 at 9:46 AM