Lightnews — Scholar-powered news

arXiv Sound

@arxiv-sound.bsky.social

CustNetGC, a CNN with Custom Network Grad-CAM and CatBoost, uses spectral features (L-mHP, Spectral Slopes) from voice to predict Parkinson's Disease with 99.06% accuracy.

A Novel CustNetGC Boosted Model with Spectral Features for Parkinson's Disease Prediction

Abishek Karthik, Pandiyaraju V, Dominic Savio M, Rohit Swaminathan S

arxiv.org

November 20, 2025 at 11:33 AM

arXiv Sound

@arxiv-sound.bsky.social

LargeSHS, a large-scale dataset of music adaptations from SecondHandSongs, contains 1.7M metadata entries and 900k audio links, enabling research in cover song generation.

LargeSHS: A large-scale dataset of music adaptation

Chih-Pin Tan, Hsuan-Kai Kao, Li Su, Yi-Hsuan Yang

arxiv.org

November 20, 2025 at 11:03 AM

arXiv Sound

@arxiv-sound.bsky.social

Auden-Voice, a general-purpose voice encoder, balances identity and paralinguistic cues through multi-task training, demonstrating strong performance with LLMs.

Auden-Voice: General-Purpose Voice Encoder for Speech and Language Understanding

Mingyue Huo, Wei-Cheng Tseng, Yiwen Shao, Hao Zhang, Dong Yu

arxiv.org

November 20, 2025 at 10:33 AM

arXiv Sound

@arxiv-sound.bsky.social

CASTELLA, a large-scale human-annotated audio dataset for audio moment retrieval, is introduced; fine-tuning a model on CASTELLA improved performance by 10.4 points.

CASTELLA: Long Audio Dataset with Captions and Temporal Boundaries

Hokuto Munakata, Takehiro Imamura, Taichi Nishimura, Tatsuya Komatsu

arxiv.org

November 20, 2025 at 10:03 AM

arXiv Sound

@arxiv-sound.bsky.social

Paper advocates for preference alignment in music generation, highlighting challenges in temporal coherence and subjective quality assessment; techniques like MusicRL and DiffRhythm+ are discussed.

Aligning Generative Music AI with Human Preferences: Methods and Challenges

Dorien Herremans, Abhinaba Roy

arxiv.org

November 20, 2025 at 9:33 AM

arXiv Sound

@arxiv-sound.bsky.social

Quality control pipeline implemented for MELD and IEMOCAP datasets; transfer learning from speaker and face recognition, with MAMBA fusion, achieved 64.8% accuracy.

Quality-Controlled Multimodal Emotion Recognition in Conversations with Identity-Based Transfer Learning and MAMBA Fusion

Zanxu Wang, Homayoon Beigi

arxiv.org

November 20, 2025 at 9:03 AM

arXiv Sound

@arxiv-sound.bsky.social

Fine-tuning Audio-MAE and PANNs for COVID-19 detection showed limited generalization despite demographic stratification; small dataset sizes hinder deep learning performance.

Fine-tuning Pre-trained Audio Models for COVID-19 Detection: A Technical Report

Daniel Oliveira de Brito, Letícia Gabriella de Souza, Marcelo Matheus Gauy, Marcelo Finger, Arnaldo Candido Junior

arxiv.org

November 20, 2025 at 8:33 AM

arXiv Sound

@arxiv-sound.bsky.social

SpotlightTTS enhances expressive TTS using voiced-aware style extraction and style direction adjustment, improving expressiveness and speech quality over baselines.

Voiced-Aware Style Extraction and Style Direction Adjustment for Expressive Text-to-Speech

Nam-Gyu Kim

arxiv.org

November 20, 2025 at 8:03 AM

arXiv Sound

@arxiv-sound.bsky.social

IHearYou is a framework that detects depression by linking voice features to DSM-5 indicators and runs locally for privacy; validated on the DAIC-WOZ dataset, it shows consistent feature-indicator associations.

IHearYou: Linking Acoustic Features to DSM-5 Depressive Behavior Indicators

Jonas Länzlinger, Katharina Müller, Bruno Rodrigues

arxiv.org

November 20, 2025 at 7:33 AM

arXiv Sound

@arxiv-sound.bsky.social

OBHS compresses audio by block-wise Huffman coding with canonical code representation and fallback mechanisms, achieving up to 93.6% compression with low complexity.

OBHS: An Optimized Block Huffman Scheme for Real-Time Audio Compression

Muntahi Safwan Mahfi, Md. Manzurul Hasan, Gahangir Hossain

arxiv.org

November 20, 2025 at 7:03 AM

arXiv Sound

@arxiv-sound.bsky.social

CPFG-Net, a conditional variational autoencoder, controllably predicts perceptual features and tonal structures from melodies, generating harmonically coherent chord progressions based on a new dataset.

A Controllable Perceptual Feature Generative Model for Melody Harmonization via Conditional Variational Autoencoder

Dengyun Huang, Yonghua Zhu

arxiv.org

November 19, 2025 at 11:33 AM

arXiv Sound

@arxiv-sound.bsky.social

IMSE replaces MET with Amplitude-Aware Linear Attention (MALA) and DE with Inception Depthwise Convolution (IDConv), reducing parameters by 16.8% compared to MUSE while maintaining performance.

IMSE: Efficient U-Net-based Speech Enhancement using Inception Depthwise Convolution and Amplitude-Aware Linear Attention

Xinxin Tang, Bin Qin, Yufang Li

arxiv.org

November 19, 2025 at 11:03 AM

arXiv Sound

@arxiv-sound.bsky.social

TTA model, trained on 358k hours of speech data across ASR/ST and speech-text alignment tasks, produces robust cross-lingual speech representations; outperforms Whisper.

TTA: Transcribe, Translate and Alignment for Cross-lingual Speech Representation

Wei Liu, Jiahong Li, Yiwen Shao, Dong Yu

arxiv.org

November 19, 2025 at 10:33 AM

arXiv Sound

@arxiv-sound.bsky.social

AQA system uses BEATs for feature extraction and Qwen2.5-7B-Instruct fine-tuned with GRPO for audio question answering, achieving 62.6% accuracy in the DCASE 2025 Challenge.

Audio Question Answering with GRPO-Based Fine-Tuning and Calibrated Segment-Level Predictions

Marcel Gibier, Nolwenn Celton, Raphaël Duroselle, Pierre Serrano, Olivier Boeffard, Jean-François Bonastre

arxiv.org

November 19, 2025 at 10:03 AM

arXiv Sound

@arxiv-sound.bsky.social

Segmentwise pruning in audio-language models reduces computing costs by selectively retaining tokens; a time-aware strategy incurs at most a 2% decrease in performance while pruning 75% of tokens.

Segmentwise Pruning in Audio-Language Models

Marcel Gibier, Raphaël Duroselle, Pierre Serrano, Olivier Boeffard, Jean-François Bonastre

arxiv.org

November 19, 2025 at 9:33 AM

arXiv Sound

@arxiv-sound.bsky.social

CountEM, a novel AMT framework, uses note event histograms for supervision, refining predictions iteratively via Expectation-Maximization and reducing the need for local alignment.

Count The Notes: Histogram-Based Supervision for Automatic Music Transcription

Jonathan Yaffe, Ben Maman, Meinard Müller, Amit H. Bermano

arxiv.org

November 19, 2025 at 9:03 AM

arXiv Sound

@arxiv-sound.bsky.social

FxSearcher, a gradient-free framework, uses Bayesian Optimization and a CLAP-based score function to find the best audio effect configurations based on a text prompt, preventing artifacts with a guiding prompt.

FxSearcher: gradient-free text-driven audio transformation

Hojoon Ki, Jongsuk Kim, Minchan Kwon, Junmo Kim

arxiv.org

November 19, 2025 at 8:33 AM

arXiv Sound

@arxiv-sound.bsky.social

A systematic review of audio papers reveals preference learning is underexplored despite challenges in evaluating generative models; only 6% of papers consider preference learning.

Preference-Based Learning in Audio Applications: A Systematic Analysis

Aaron Broukhim, Yiran Shen, Prithviraj Ammanabrolu, Nadir Weibel

arxiv.org

November 19, 2025 at 8:03 AM

arXiv Sound

@arxiv-sound.bsky.social

Principled Coarse-Graining (PCG) verifies speculative decoding proposals at the level of Acoustic Similarity Groups (ASGs), increasing acceptance and throughput on LibriTTS while maintaining intelligibility.

Principled Coarse-Grained Acceptance for Speculative Decoding in Speech

Moran Yanuka, Paul Dixon, Eyal Finkelshtein, Daniel Rotman, Raja Giryes

arxiv.org

November 19, 2025 at 7:33 AM

arXiv Sound

@arxiv-sound.bsky.social

Speaker identification, knowledge distillation, and hierarchical attention fusion address speaker ambiguity and class imbalance in multi-speaker emotion recognition; achieves 67.75 on MELD and IEMOCAP.

Emotion Recognition in Multi-Speaker Conversations through Speaker Identification, Knowledge Distillation, and Hierarchical Fusion

Xiao Li, Kotaro Funakoshi, Manabu Okumura

arxiv.org

November 19, 2025 at 7:03 AM

arXiv Sound

@arxiv-sound.bsky.social

CNN model performance in binaural sound source localization evaluated with various combinations of amplitude-based and phase-based features; generalization requires channel spectrograms with both ILD and IPD.

Systematic evaluation of time-frequency features for binaural sound source localization

Davoud Shariat Panah, Alessandro Ragano, Dan Barry, Jan Skoglund, Andrew Hines

arxiv.org

November 18, 2025 at 11:36 AM

arXiv Sound

@arxiv-sound.bsky.social

PASE, a generative speech enhancer leveraging WavLM, uses representation distillation to mitigate linguistic hallucinations by cleaning final-layer features and reduces acoustic hallucinations with a dual-stream vocoder.

PASE: Leveraging the Phonological Prior of WavLM for Low-Hallucination Generative Speech Enhancement

Xiaobin Rong, Qinwen Hu, Mansur Yesilbursa, Kamil Wojcicki, Jing Lu

arxiv.org

November 18, 2025 at 11:08 AM

arXiv Sound

@arxiv-sound.bsky.social

AMPBench, a new benchmark, reveals that Large Audio-Language Models (LALMs) exhibit a motion perception deficit, struggling to infer the direction and trajectory of moving sound sources.

Spatial Blind Spot: Auditory Motion Perception Deficits in Audio LLMs

Zhe Sun, Yujun Cai, Jiayu Yao, Yiwei Wang

arxiv.org

November 18, 2025 at 10:41 AM

arXiv Sound

@arxiv-sound.bsky.social

Analysis of V2A models reveals that they struggle to generate Foley sounds accurately; the paper proposes FoleyBench to address the issue.

FoleyBench: A Benchmark For Video-to-Audio Models

Satvik Dixit, Koichi Saito, Zhi Zhong, Yuki Mitsufuji, Chris Donahue

arxiv.org

November 18, 2025 at 10:14 AM

arXiv Sound

@arxiv-sound.bsky.social

Real-Time Single-Path TFC-TDF UNET (RT-STT), a lightweight real-time low-latency model for music demixing, uses channel expansion-based feature fusion and quantization to reduce inference time.

Towards Practical Real-Time Low-Latency Music Source Separation

Junyu Wu, Jie Liu, Tianrui Pan, Jie Tang, Gangshan Wu

arxiv.org

November 18, 2025 at 9:46 AM
