HGPU group
@hgpu.bsky.social
High performance computing on graphics processing units (GPU): AMD, Nvidia, Intel, CUDA, OpenCL, OpenGL, HPC
PRAGMA: A Profiling-Reasoned Multi-Agent Framework for Automatic Kernel Optimization

#CUDA #LLM #CodeGeneration

hgpu.org?p=30354
Designing high-performance kernels requires expert-level tuning and a deep understanding of hardware characteristics. Recent advances in large language models (LLMs) have enabled automated kernel g…
November 16, 2025 at 3:00 PM
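The abstract is cut off before the method, so as context only: a minimal CUDA sketch of the kind of rewrite a profiling-guided optimizer targets (serialized global atomics replaced by a shared-memory block reduction). The kernels below are illustrative and not taken from PRAGMA.

    // Hypothetical before/after pair showing a rewrite a profiler would suggest;
    // not output from the PRAGMA framework.
    __global__ void sum_naive(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) atomicAdd(out, in[i]);            // one global atomic per element
    }

    __global__ void sum_shared(const float* in, float* out, int n) {
        extern __shared__ float buf[];
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;
        __syncthreads();
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s) buf[threadIdx.x] += buf[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0) atomicAdd(out, buf[0]); // one global atomic per block
    }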
A High-Throughput GPU Framework for Adaptive Lossless Compression of Floating-Point Data

#CUDA #Compression #Package

hgpu.org?p=30353
The torrential influx of floating-point data from domains like IoT and HPC necessitates high-performance lossless compression to mitigate storage costs while preserving absolute data fidelity. Leve…
November 16, 2025 at 2:59 PM
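As background, a standard building block in lossless floating-point compression is XOR-delta preprocessing of the raw bit patterns (as in Gorilla-style compressors). The CUDA sketch below illustrates that idea only; it is not claimed to be this framework's pipeline.

    // Hedged sketch: XOR of neighboring values' bit patterns is mostly zeros when
    // values are similar, which a back-end entropy coder can exploit.
    #include <cstdint>

    __global__ void xor_delta(const double* in, uint64_t* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        uint64_t cur  = __double_as_longlong(in[i]);
        uint64_t prev = (i == 0) ? 0ull : __double_as_longlong(in[i - 1]);
        out[i] = cur ^ prev;   // first element is stored as its raw bit pattern
    }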
MT4G: A Tool for Reliable Auto-Discovery of NVIDIA and AMD GPU Compute and Memory Topologies

#CUDA #PTX #HIP #Benchmarking #Package

hgpu.org?p=30352
Understanding GPU topology is essential for performance-related tasks in HPC or AI. Yet, unlike for CPUs with tools like hwloc, GPU information is hard to come by, incomplete, and vendor-specific. …
November 16, 2025 at 2:58 PM
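For context on what the CUDA runtime itself exposes (the coarse, vendor-reported numbers that MT4G goes beyond with microbenchmarking), a minimal query sketch:

    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        int count = 0;
        cudaGetDeviceCount(&count);
        for (int d = 0; d < count; ++d) {
            cudaDeviceProp p;
            cudaGetDeviceProperties(&p, d);
            // Only coarse properties; cache and interconnect details need benchmarks.
            printf("GPU %d: %s | %d SMs | %zu MiB global | L2 %d KiB | %zu KiB smem/block\n",
                   d, p.name, p.multiProcessorCount,
                   p.totalGlobalMem >> 20, p.l2CacheSize / 1024,
                   p.sharedMemPerBlock / 1024);
        }
        return 0;
    }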
Characterizing the Performance of Parallel Data-Compression Algorithms across Compilers and GPUs

#CUDA #HIP #Compression #Package

hgpu.org?p=30342
Different compilers can generate code with notably different performance characteristics – even on the same system. Today, GPU developers have three popular options for compiling CUDA or HIP …
November 9, 2025 at 4:28 PM
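The truncation hides which three toolchains are meant; assuming the usual trio of nvcc, Clang's native CUDA support, and AMD's hipcc, here is a minimal kernel with the corresponding invocations noted in comments (an assumption, not the paper's setup).

    // Possible compiler invocations for the same source (assumed, see above):
    //   nvcc    -O3 -arch=sm_80            saxpy.cu  -o saxpy
    //   clang++ -O3 --cuda-gpu-arch=sm_80  saxpy.cu  -o saxpy -lcudart
    //   hipcc   -O3                        saxpy.hip -o saxpy   (after hipify)
    __global__ void saxpy(float a, const float* x, float* y, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }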
FP8-Flow-MoE: A Casting-Free FP8 Recipe without Double Quantization Error

#FP8 #Precision

hgpu.org?p=30341
Training large Mixture-of-Experts (MoE) models remains computationally prohibitive due to their extreme compute and memory demands. Although low-precision training promises to accelerate computatio…
November 9, 2025 at 4:28 PM
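A toy, host-side sketch of what "double quantization error" generally refers to: quantize, dequantize, then requantize under a second scale can compound rounding error versus one direct cast. The simulated quantizer, scales, and values below are made up for illustration and are not the paper's FP8 recipe.

    #include <cmath>
    #include <cstdio>

    // Simulated symmetric quantizer (stand-in for an FP8 cast with a scale).
    float quantize(float x, float scale, int levels) {
        float q   = std::round(x / scale);
        float lim = levels / 2.0f - 1.0f;
        q = std::fmax(-lim, std::fmin(lim, q));
        return q * scale;
    }

    int main() {
        float x     = 0.7348f;                                       // arbitrary value
        float once  = quantize(x, 0.01f, 256);                       // single direct cast
        float twice = quantize(quantize(x, 0.04f, 256), 0.01f, 256); // cast, then re-cast
        printf("direct err = %g, double err = %g\n",
               std::fabs(once - x), std::fabs(twice - x));
        return 0;
    }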
A Study of Floating-Point Precision Tuning in Deep Learning Operators Implementations

#CUDA #DeepLearning #DL #Package

hgpu.org?p=30330
Deep learning (DL) has already played a significant role in numerous fields, making it crucial to ensure the stability of both training and inference in DL systems. The computation of DL models can…
November 2, 2025 at 4:05 PM
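As one example of the precision decisions such a study covers (not an operator from the paper): accumulating a long reduction in float versus a wider accumulator.

    #include <cstdio>

    int main() {
        const int n = 1 << 24;
        float  acc_f = 0.0f;
        double acc_d = 0.0;
        for (int i = 0; i < n; ++i) {
            float v = 1e-4f;   // many small contributions, as in a mean or softmax denominator
            acc_f += v;        // float accumulator drops low-order bits as the sum grows
            acc_d += v;        // wider accumulator retains them
        }
        printf("float acc = %.6f, double acc = %.6f\n", acc_f, (float)acc_d);
        return 0;
    }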
A Compute Graph Simulation and Implementation Framework Targeting AMD Versal AI Engines

#AMD #FPGA #CodeGeneration #AI

hgpu.org?p=30316
We present a framework for developing compute graph-based applications targeting the AI Engine (AIE) array of AMD Versal SoCs. This framework enables users to embed AIE-based dataflow graph prototy…
October 26, 2025 at 8:03 PM
Adaptivity in AdaptiveCpp: Optimizing Performance by Leveraging Runtime Information During JIT-Compilation

#SYCL #HIP #CUDA #Performance #Package

hgpu.org?p=30304
Specializing kernels by including runtime information during just-in-time (JIT) compilation can improve performance at the expense of potentially generating more kernels. In this work, we contribu…
October 19, 2025 at 8:40 PM
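A rough analogy for specializing on runtime information, written with C++ templates in CUDA rather than AdaptiveCpp's IR-level JIT mechanism: once a runtime-known value becomes a compile-time constant, the compiler can fold the address math and unroll.

    // Generic kernel: the stride is read at run time, limiting optimization.
    __global__ void scale_generic(float* x, int n, int stride, float a) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i * stride] *= a;
    }

    // Specialized kernel: the stride is baked in (e.g. Stride == 1), so the
    // multiply disappears and accesses become provably contiguous.
    template <int Stride>
    __global__ void scale_specialized(float* x, int n, float a) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i * Stride] *= a;
    }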