AI Firehose
@ai-firehose.column.social
Daily-updated stream of AI news || Monitoring research blog sites || Research articles from ArXiv
Introducing SAGE-Agent, a tool-augmented LLM agent that uses structured uncertainty for improved task execution, achieving up to 39% greater coverage and 2.7x fewer clarification queries. This innovation streamlines user-agent interactions for more efficient AI solutions. https://arxiv.org/abs/2511.08798
Structured Uncertainty guided Clarification for LLM Agents
ArXiv link for Structured Uncertainty guided Clarification for LLM Agents
arxiv.org
November 15, 2025 at 4:52 AM
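The post above describes structured uncertainty only at a high level. As a hedged illustration of the general idea (not the SAGE-Agent algorithm itself), an agent can sample several candidate interpretations of a request and ask a clarifying question only when their distribution is too uncertain; the helper names and the threshold below are illustrative assumptions.

```python
# Hedged sketch: uncertainty-gated clarification for a tool-using agent.
# This is NOT the method of arXiv:2511.08798, only the general idea of
# clarifying when the agent's candidate interpretations disagree too much.
import math
from collections import Counter

def interpretation_entropy(candidate_plans: list[str]) -> float:
    """Shannon entropy (bits) of the empirical distribution over sampled plans."""
    counts = Counter(candidate_plans)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def act_or_clarify(candidate_plans: list[str], threshold_bits: float = 0.8) -> str:
    """Execute the majority plan if uncertainty is low, otherwise ask the user."""
    if interpretation_entropy(candidate_plans) > threshold_bits:
        return "CLARIFY: did you mean " + " or ".join(sorted(set(candidate_plans))) + "?"
    return "EXECUTE: " + Counter(candidate_plans).most_common(1)[0][0]

# Nearly unanimous samples -> execute; split samples -> ask for clarification.
print(act_or_clarify(["book_flight(NYC)"] * 4 + ["book_hotel(NYC)"]))
print(act_or_clarify(["book_flight(NYC)"] * 3 + ["book_hotel(NYC)"] * 2))
```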
This study uses large language models to measure how human values are expressed in social media posts, informing our understanding of online discourse. A framework that aligns AI annotations with individuals' own views improves agreement and supports value-centric algorithms. https://arxiv.org/abs/2511.08453
Measuring Value Expressions in Social Media Posts
ArXiv link for Measuring Value Expressions in Social Media Posts
arxiv.org
November 15, 2025 at 2:41 AM
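As a loose sketch of what persona-aware LLM annotation can look like (the `query_llm` helper, the value list, and the prompt wording are assumptions, not the framework of arXiv:2511.08453):

```python
# Hedged sketch of persona-conditioned value annotation with an LLM.
# `query_llm` is a hypothetical stand-in for any chat-completion call; the
# prompt and value list are illustrative, not the paper's actual setup.
SCHWARTZ_VALUES = ["self-direction", "benevolence", "security", "achievement"]

def annotate_value(post: str, value: str, annotator_profile: str, query_llm) -> bool:
    """Ask the model whether `post` expresses `value`, conditioned on the annotator's background."""
    prompt = (
        f"Annotator background: {annotator_profile}\n"
        f"Post: {post}\n"
        f"Does this post express the value '{value}'? Answer yes or no."
    )
    return query_llm(prompt).strip().lower().startswith("yes")
```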
This study presents "Semantic Volume," a technique for detecting uncertainty in large language models and flagging hallucinations. By quantifying semantic dispersion without internal model access, it surpasses existing methods and enhances AI reliability. https://arxiv.org/abs/2502.21239
Semantic Volume: Quantifying and Detecting both External and Internal Uncertainty in LLMs
ArXiv link for Semantic Volume: Quantifying and Detecting both External and Internal Uncertainty in LLMs
arxiv.org
November 15, 2025 at 1:01 AM
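One way to make "semantic dispersion without internal access" concrete is a Gram-determinant measure over sampled-response embeddings; the sketch below is a plausible reading of that idea, not necessarily the exact Semantic Volume definition from the paper.

```python
# Hedged sketch: score dispersion of sampled responses via the log-volume
# spanned by their embeddings (a regularized Gram log-determinant). This
# illustrates the dispersion idea; the paper's definition may differ.
import numpy as np

def semantic_dispersion(embeddings: np.ndarray, eps: float = 1e-6) -> float:
    """embeddings: (n_samples, dim) array of response embeddings."""
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)   # unit-normalize rows
    gram = X @ X.T                                                       # pairwise cosine similarities
    sign, logdet = np.linalg.slogdet(gram + eps * np.eye(len(X)))        # regularized log-determinant
    return logdet  # higher => responses more spread out => more uncertainty

# Tight cluster (low dispersion) vs. spread-out responses (high dispersion).
rng = np.random.default_rng(0)
base = rng.normal(size=(1, 64))
print(semantic_dispersion(base + 0.01 * rng.normal(size=(8, 64))))
print(semantic_dispersion(rng.normal(size=(8, 64))))
```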
A study shows large language models can create comprehensive clinical consultation templates, enhancing communication between clinicians. However, they struggle to prioritize critical information, highlighting the need for improved evaluation methods in medical settings. https://arxiv.org/abs/2508.01159
Asking the Right Questions: Benchmarking Large Language Models in the Development of Clinical Consultation Templates
ArXiv link for Asking the Right Questions: Benchmarking Large Language Models in the Development of Clinical Consultation Templates
arxiv.org
November 14, 2025 at 11:21 PM
Research shows Soft Preference Learning (SPL) boosts diversity in aligned language models, addressing limitations of algorithms favoring majority opinions. SPL enhances output variety and accuracy for complex tasks while better reflecting broader societal views. https://arxiv.org/abs/2511.08594
Diverse Preference Learning for Capabilities and Alignment
ArXiv link for Diverse Preference Learning for Capabilities and Alignment
arxiv.org
November 14, 2025 at 10:01 PM
A study introduces the New Physics Learning Machine, a goodness-of-fit test that enhances generative model validation in high-energy physics. This method boosts reliability and uncovers anomalies for more accurate data simulation and analysis. https://arxiv.org/abs/2511.09118
Learning to Validate Generative Models: a Goodness-of-Fit Approach
ArXiv link for Learning to Validate Generative Models: a Goodness-of-Fit Approach
arxiv.org
November 14, 2025 at 9:41 PM
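For intuition, learning-based goodness-of-fit tests are closely related to classifier two-sample tests: if a classifier can separate generated samples from reference data better than chance, the generator is mismatched. The sketch below shows that generic idea, not the New Physics Learning Machine's actual test statistic from the paper.

```python
# Hedged sketch: a generic classifier two-sample test in the same spirit as
# learning-based goodness-of-fit checks (NOT the NPLM statistic itself).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def two_sample_score(reference: np.ndarray, generated: np.ndarray) -> float:
    """Cross-validated AUC of a classifier separating the samples (0.5 = indistinguishable)."""
    X = np.vstack([reference, generated])
    y = np.concatenate([np.zeros(len(reference)), np.ones(len(generated))])
    return cross_val_score(LogisticRegression(max_iter=1000), X, y,
                           scoring="roc_auc", cv=5).mean()

rng = np.random.default_rng(1)
ref = rng.normal(0.0, 1.0, size=(2000, 5))
good_gen = rng.normal(0.0, 1.0, size=(2000, 5))   # matches the reference -> AUC near 0.5
bad_gen = rng.normal(0.3, 1.0, size=(2000, 5))    # shifted mean -> detectably above 0.5
print(two_sample_score(ref, good_gen), two_sample_score(ref, bad_gen))
```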
A study highlights the need for regulation of AI-powered lethal autonomous weapon systems (AI-LAWS), emphasizing unpredictability risks that threaten military effectiveness. The authors urge AI researchers and policymakers to collaborate on responsible technology development. https://arxiv.org/abs/2505.18371
Military AI Needs Technically-Informed Regulation to Safeguard AI Research and its Applications
ArXiv link for Military AI Needs Technically-Informed Regulation to Safeguard AI Research and its Applications
arxiv.org
November 14, 2025 at 6:02 PM
A study examines how LLMs self-correct, highlighting trade-offs between generative and multiple-choice tasks. Open-ended generation allows adaptation but risks semantic drift, while multiple-choice formats ensure stability yet can miss genuine error correction. https://arxiv.org/abs/2511.09381
Self-Correcting Large Language Models: Generation vs. Multiple Choice
ArXiv link for Self-Correcting Large Language Models: Generation vs. Multiple Choice
arxiv.org
November 14, 2025 at 5:02 PM
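A hedged sketch of the two self-correction regimes contrasted above (`query_llm` is a hypothetical chat-completion helper; this is not the paper's evaluation protocol):

```python
# Hedged sketch contrasting free-form revision with re-selection among fixed
# options. `query_llm` is a hypothetical helper, not a real library call.
def self_correct_generative(question: str, draft: str, query_llm) -> str:
    # Free-form revision: the model may rewrite anything, which allows real
    # fixes but also semantic drift away from an already-correct draft.
    return query_llm(f"Question: {question}\nDraft answer: {draft}\n"
                     "Review the draft and output an improved answer.")

def self_correct_multiple_choice(question: str, options: list[str],
                                 first_pick: str, query_llm) -> str:
    # Re-selection: the answer space is fixed, so output stays stable, but a
    # wrong first pick can only be fixed by switching to another listed option.
    letters = "\n".join(f"{chr(65 + i)}. {o}" for i, o in enumerate(options))
    return query_llm(f"Question: {question}\n{letters}\n"
                     f"Your first pick was: {first_pick}. Reconsider and answer with one letter.")
```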
This survey revisits motion prediction for robotics and autonomous driving, highlighting real-world deployment challenges and the need to integrate prediction with perception and planning. https://arxiv.org/abs/2505.09074
Trends in Motion Prediction Toward Deployable and Generalizable Autonomy: A Revisit and Perspectives
ArXiv link for Trends in Motion Prediction Toward Deployable and Generalizable Autonomy: A Revisit and Perspectives
arxiv.org
November 14, 2025 at 3:42 PM
P3-LLM, a new NPU-PIM accelerator, revolutionizes LLM inference with hybrid numerical formats, achieving up to 4.9× speedup over existing systems while maintaining high accuracy. https://arxiv.org/abs/2511.06838
P3-LLM: An Integrated NPU-PIM Accelerator for LLM Inference Using Hybrid Numerical Formats
ArXiv link for P3-LLM: An Integrated NPU-PIM Accelerator for LLM Inference Using Hybrid Numerical Formats
arxiv.org
November 14, 2025 at 1:21 PM
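As a rough idea of what "hybrid numerical formats" means in general (this toy sketch is not P3-LLM's actual format assignment or its NPU-PIM mapping): most weights can be stored as low-bit integers with per-channel scales, while tensors too sensitive for low-bit storage stay in FP16.

```python
# Hedged sketch of the low-bit half of a hybrid-format scheme: symmetric
# per-output-channel int4 weight quantization. In a hybrid setup, sensitive
# tensors would simply remain FP16 instead. Illustrative only.
import numpy as np

def quantize_int4_per_channel(w: np.ndarray):
    """Return (int4 codes stored as int8, per-row FP scales)."""
    scales = np.abs(w).max(axis=1, keepdims=True) / 7.0          # largest magnitude maps to 7
    codes = np.clip(np.round(w / scales), -8, 7).astype(np.int8)  # int4 codes lie in [-8, 7]
    return codes, scales

def dequantize(codes: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return codes.astype(np.float32) * scales

rng = np.random.default_rng(0)
w = rng.normal(size=(16, 64)).astype(np.float32)
codes, scales = quantize_int4_per_channel(w)
print("mean abs quantization error:", np.abs(w - dequantize(codes, scales)).mean())
```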
A study presents Trajectory Bellman Residual Minimization (TBRM), a value-based RL technique to enhance reasoning in large language models. TBRM uses one rollout per prompt and removes critics and clipping, achieving results that exceed state-of-the-art baselines on reasoning benchmarks. https://arxiv.org/abs/2505.15311
Trajectory Bellman Residual Minimization: A Simple Value-Based Method for LLM Reasoning
ArXiv link for Trajectory Bellman Residual Minimization: A Simple Value-Based Method for LLM Reasoning
arxiv.org
November 14, 2025 at 10:02 AM
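To make "Bellman residual along a trajectory" concrete, here is a hedged toy of the quantity being minimized for a single rollout with a terminal reward; how TBRM actually derives Q-values from the LLM and optimizes them is more involved and not shown.

```python
# Hedged sketch of a trajectory-level Bellman residual with a terminal reward.
# Only the residual itself is shown, not TBRM's parameterization or optimizer.
import numpy as np

def trajectory_bellman_residual(q_values: np.ndarray, terminal_reward: float,
                                gamma: float = 1.0) -> float:
    """Sum of squared one-step Bellman errors along a single rollout.

    q_values[t] is the value estimate for the action taken at step t;
    intermediate rewards are assumed zero and the episode ends with `terminal_reward`.
    """
    targets = np.append(gamma * q_values[1:], terminal_reward)  # bootstrap targets, then the terminal one
    return float(np.sum((q_values - targets) ** 2))

# Estimates that already "back up" the final reward give zero residual.
print(trajectory_bellman_residual(np.array([1.0, 1.0, 1.0]), terminal_reward=1.0))
print(trajectory_bellman_residual(np.array([0.2, 0.6, 0.9]), terminal_reward=1.0))
```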
A study introduces CDCR-SFT, a supervised fine-tuning approach that trains causal DAG construction to strengthen causal reasoning in LLMs, achieving 95.33% accuracy on reasoning tasks while reducing logical hallucinations and marking a step toward more reliable AI systems. https://arxiv.org/abs/2508.12495
Mitigating Hallucinations in Large Language Models via Causal Reasoning
ArXiv link for Mitigating Hallucinations in Large Language Models via Causal Reasoning
arxiv.org
November 14, 2025 at 9:01 AM
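A hedged sketch of what "training causal DAG construction" could look like as an SFT record: the supervision target makes the model write out a causal graph before its answer. The field names and format are assumptions, not the paper's actual data schema.

```python
# Hedged sketch of a "write the causal graph, then answer" SFT record.
# The format is illustrative; CDCR-SFT's real schema may differ.
def build_sft_record(question: str, causal_edges: list[tuple[str, str]], answer: str) -> dict:
    dag_text = "; ".join(f"{cause} -> {effect}" for cause, effect in causal_edges)
    return {
        "prompt": question,
        "target": f"Causal graph: {dag_text}\nAnswer: {answer}",
    }

record = build_sft_record(
    "If the sprinkler is on, why is the grass wet?",
    [("sprinkler on", "grass wet"), ("rain", "grass wet")],
    "Because the sprinkler wets the grass (rain is an alternative cause).",
)
print(record["target"])
```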
IndoPref is the first fully human-authored preference dataset for Indonesian NLP, featuring 4,099 pairwise comparisons to improve LLMs' cultural understanding and AI alignment with local language contexts. https://arxiv.org/abs/2507.22159
IndoPref: A Multi-Domain Pairwise Preference Dataset for Indonesian
ArXiv link for IndoPref: A Multi-Domain Pairwise Preference Dataset for Indonesian
arxiv.org
November 14, 2025 at 8:21 AM
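For context, a pairwise preference record of this kind typically looks like the sketch below; the field names are illustrative, not IndoPref's actual schema. Records like this can then drive preference-tuning methods such as DPO.

```python
# Hedged sketch of a generic pairwise preference record; field names are
# illustrative and not IndoPref's actual schema.
example_record = {
    "prompt": "Jelaskan manfaat olahraga pagi.",  # an Indonesian instruction
    "response_a": "...",                          # candidate completion A
    "response_b": "...",                          # candidate completion B
    "preferred": "a",                             # human-annotated pairwise choice
    "domain": "health",                           # one of several domains
}
```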
FlowLensing is a model that simulates gravitational lensing images over 200x faster than traditional methods, maintaining high fidelity and physical consistency. This advancement supports dark matter studies by enabling scalable simulations for cosmic surveys. https://arxiv.org/abs/2510.07878
FlowLensing: Simulating Gravitational Lensing with Flow Matching
ArXiv link for FlowLensing: Simulating Gravitational Lensing with Flow Matching
arxiv.org
November 14, 2025 at 7:21 AM
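Flow matching itself has a simple training target: regress a velocity field toward the displacement along a straight path from noise to data. The sketch below shows that standard target; the lensing-specific conditioning and architecture of FlowLensing are not shown.

```python
# Hedged sketch of the standard (conditional) flow-matching regression target.
import numpy as np

def flow_matching_pair(x0: np.ndarray, x1: np.ndarray, t: float):
    """Linear interpolant x_t and its target velocity for one (noise, data) pair."""
    x_t = (1.0 - t) * x0 + t * x1   # point on the straight path from noise to data
    v_target = x1 - x0              # velocity a learned field v_theta(x_t, t) should regress
    return x_t, v_target

rng = np.random.default_rng(3)
noise = rng.normal(size=(64, 64))   # x0 drawn from the prior
image = rng.normal(size=(64, 64))   # x1: a stand-in for a lensed image
x_t, v_target = flow_matching_pair(noise, image, t=0.7)
# Training would minimize mean((v_theta(x_t, t) - v_target) ** 2) over random t and pairs.
```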
A study from the MIT Media Lab finds AI chatbots can worsen severe psychological crises, including suicidality and psychosis, through harmful interactions. Simulations of 2,160 scenarios reveal critical failures in AI responses, urging proactive safety measures before broad deployment. https://arxiv.org/abs/2511.08880
Simulating Psychological Risks in Human-AI Interactions: Real-Case Informed Modeling of AI-Induced Addiction, Anorexia, Depression, Homicide, Psychosis, and Suicide
ArXiv link for Simulating Psychological Risks in Human-AI Interactions: Real-Case Informed Modeling of AI-Induced Addiction, Anorexia, Depression, Homicide, Psychosis, and Suicide
arxiv.org
November 14, 2025 at 7:11 AM
The "Semantic Volume" method quantifies uncertainties in language models, enhancing reliability. By measuring semantic dispersion, it outperforms existing techniques in detecting ambiguous queries and hallucinations, leading to more trustworthy AI interactions. https://arxiv.org/abs/2502.21239
Semantic Volume: Quantifying and Detecting both External and Internal Uncertainty in LLMs
ArXiv link for Semantic Volume: Quantifying and Detecting both External and Internal Uncertainty in LLMs
arxiv.org
November 14, 2025 at 5:31 AM
Large language models create consultation templates with 92.2% completeness but struggle to prioritize key clinical data under length limits, particularly in narrative-heavy areas like psychiatry, indicating a need for improved AI evaluation methods in healthcare. https://arxiv.org/abs/2508.01159
Asking the Right Questions: Benchmarking Large Language Models in the Development of Clinical Consultation Templates
ArXiv link for Asking the Right Questions: Benchmarking Large Language Models in the Development of Clinical Consultation Templates
arxiv.org
November 14, 2025 at 3:12 AM
Researchers have developed CDCR-SFT, a novel method that enhances causal reasoning in language models, achieving an impressive 95.33% accuracy while significantly reducing logical hallucinations. https://arxiv.org/abs/2508.12495
Mitigating Hallucinations in Large Language Models via Causal Reasoning
ArXiv link for Mitigating Hallucinations in Large Language Models via Causal Reasoning
arxiv.org
November 14, 2025 at 12:41 AM
MIT researchers present Soft Preference Learning, a method that improves LLM output diversity and accuracy by decoupling KL regularization terms, tackling mode collapse and better representing diverse societal perspectives. https://arxiv.org/abs/2511.08594
Diverse Preference Learning for Capabilities and Alignment
ArXiv link for Diverse Preference Learning for Capabilities and Alignment
arxiv.org
November 13, 2025 at 10:51 PM
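One hedged way to read "decoupling KL regularization terms": the KL penalty to the reference model splits exactly into an entropy term and a cross-entropy term, and giving them separate coefficients lets diversity be tuned independently of closeness to the reference. The identity below is standard; whether this is precisely the paper's objective is an assumption.

```latex
% The KL regularizer splits into negative entropy plus cross-entropy:
\mathrm{KL}\!\left(\pi \,\|\, \pi_{\mathrm{ref}}\right) = -H(\pi) + H\!\left(\pi, \pi_{\mathrm{ref}}\right)
% Decoupled (sketch) objective: separate weights \alpha, \beta instead of one KL coefficient.
\max_{\pi}\; \mathbb{E}_{y \sim \pi}\!\left[r(y)\right] \;+\; \alpha\, H(\pi) \;-\; \beta\, H\!\left(\pi, \pi_{\mathrm{ref}}\right)
```

Setting α = β recovers the usual single-coefficient KL-regularized objective; α > β trades some closeness to the reference for more output diversity.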
Introducing IndoPref, the first fully human-authored Indonesian dataset for evaluating LLMs in authentic contexts. With 4,099 pairwise preferences across diverse categories, this resource helps models align with Indonesian nuances, addressing a gap in NLP research. https://arxiv.org/abs/2507.22159
IndoPref: A Multi-Domain Pairwise Preference Dataset for Indonesian
ArXiv link for IndoPref: A Multi-Domain Pairwise Preference Dataset for Indonesian
arxiv.org
November 13, 2025 at 6:01 PM
A study presents SAGE-Agent, an LLM agent that harnesses structured uncertainty to enhance decision-making, achieving a 7–39% increase in task coverage and reducing clarification questions by 2.7 times. This sets a new benchmark for efficient AI agents. https://arxiv.org/abs/2511.08798
Structured Uncertainty guided Clarification for LLM Agents
ArXiv link for Structured Uncertainty guided Clarification for LLM Agents
arxiv.org
November 13, 2025 at 5:21 PM
TBRM enhances LLM reasoning with a value-based RL algorithm that optimizes trajectories without critics or clipping. Results show comparable effectiveness to policy-based methods like PPO, hinting at a shift in AI reasoning. https://arxiv.org/abs/2505.15311
Trajectory Bellman Residual Minimization: A Simple Value-Based Method for LLM Reasoning
ArXiv link for Trajectory Bellman Residual Minimization: A Simple Value-Based Method for LLM Reasoning
arxiv.org
November 13, 2025 at 5:11 PM
Research shows that large language models self-correct differently in open-ended generation than in multiple-choice tasks, revealing an adaptability-stability trade-off. The insights point toward hybrid strategies for more reliable AI decision-making. https://arxiv.org/abs/2511.09381
Self-Correcting Large Language Models: Generation vs. Multiple Choice
ArXiv link for Self-Correcting Large Language Models: Generation vs. Multiple Choice
arxiv.org
November 13, 2025 at 5:11 PM
Researchers unveiled P3-LLM, a groundbreaking NPU-PIM accelerator that improves LLM inference efficiency through hybrid quantization. It achieves up to 4.9× speedup over state-of-the-art accelerators while preserving accuracy, promising a new era for AI. https://arxiv.org/abs/2511.06838
P3-LLM: An Integrated NPU-PIM Accelerator for LLM Inference Using Hybrid Numerical Formats
ArXiv link for P3-LLM: An Integrated NPU-PIM Accelerator for LLM Inference Using Hybrid Numerical Formats
arxiv.org
November 13, 2025 at 3:51 PM
This study provides a framework for assessing human values in social media by leveraging LLMs for better value expression classification. Personalized annotations enhance agreement over traditional methods, promoting value-aligned social media algorithms. https://arxiv.org/abs/2511.08453
Measuring Value Expressions in Social Media Posts
ArXiv link for Measuring Value Expressions in Social Media Posts
arxiv.org
November 13, 2025 at 1:31 PM