HGPU group
@hgpu.bsky.social
High performance computing on graphics processing units (GPU): AMD, Nvidia, Intel, CUDA, OpenCL, OpenGL, HPC
PRAGMA: A Profiling-Reasoned Multi-Agent Framework for Automatic Kernel Optimization

#CUDA #LLM #CodeGeneration

hgpu.org?p=30354
Designing high-performance kernels requires expert-level tuning and a deep understanding of hardware characteristics. Recent advances in large language models (LLMs) have enabled automated kernel g…
November 16, 2025 at 3:00 PM
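The abstract is cut off before the method, so as context only: a minimal CUDA sketch of the kind of rewrite a profiling-guided optimizer targets (serialized global atomics replaced by a shared-memory block reduction). The kernels below are illustrative and not taken from PRAGMA.

    // Hypothetical before/after pair showing a rewrite a profiler would suggest;
    // not output from the PRAGMA framework.
    __global__ void sum_naive(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) atomicAdd(out, in[i]);            // one global atomic per element
    }

    __global__ void sum_shared(const float* in, float* out, int n) {
        extern __shared__ float buf[];
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;
        __syncthreads();
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s) buf[threadIdx.x] += buf[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0) atomicAdd(out, buf[0]); // one global atomic per block
    }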
A High-Throughput GPU Framework for Adaptive Lossless Compression of Floating-Point Data

#CUDA #Compression #Package

hgpu.org?p=30353
The torrential influx of floating-point data from domains like IoT and HPC necessitates high-performance lossless compression to mitigate storage costs while preserving absolute data fidelity. Leve…
November 16, 2025 at 2:59 PM
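As background, a standard building block in lossless floating-point compression is XOR-delta preprocessing of the raw bit patterns (as in Gorilla-style compressors). The CUDA sketch below illustrates that idea only; it is not claimed to be this framework's pipeline.

    // Hedged sketch: XOR of neighboring values' bit patterns is mostly zeros when
    // values are similar, which a back-end entropy coder can exploit.
    #include <cstdint>

    __global__ void xor_delta(const double* in, uint64_t* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        uint64_t cur  = __double_as_longlong(in[i]);
        uint64_t prev = (i == 0) ? 0ull : __double_as_longlong(in[i - 1]);
        out[i] = cur ^ prev;   // first element is stored as its raw bit pattern
    }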
MT4G: A Tool for Reliable Auto-Discovery of NVIDIA and AMD GPU Compute and Memory Topologies

#CUDA #PTX #HIP #Benchmarking #Package

hgpu.org?p=30352
Understanding GPU topology is essential for performance-related tasks in HPC or AI. Yet, unlike for CPUs with tools like hwloc, GPU information is hard to come by, incomplete, and vendor-specific. …
November 16, 2025 at 2:58 PM
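For context on what the CUDA runtime itself exposes (the coarse, vendor-reported numbers that MT4G goes beyond with microbenchmarking), a minimal query sketch:

    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        int count = 0;
        cudaGetDeviceCount(&count);
        for (int d = 0; d < count; ++d) {
            cudaDeviceProp p;
            cudaGetDeviceProperties(&p, d);
            // Only coarse properties; cache and interconnect details need benchmarks.
            printf("GPU %d: %s | %d SMs | %zu MiB global | L2 %d KiB | %zu KiB smem/block\n",
                   d, p.name, p.multiProcessorCount,
                   p.totalGlobalMem >> 20, p.l2CacheSize / 1024,
                   p.sharedMemPerBlock / 1024);
        }
        return 0;
    }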
Characterizing the Performance of Parallel Data-Compression Algorithms across Compilers and GPUs

#CUDA #HIP #Compression #Package

hgpu.org?p=30342
Different compilers can generate code with notably different performance characteristics – even on the same system. Today, GPU developers have three popular options for compiling CUDA or HIP …
November 9, 2025 at 4:28 PM
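The truncation hides which three toolchains are meant; assuming the usual trio of nvcc, Clang's native CUDA support, and AMD's hipcc, here is a minimal kernel with the corresponding invocations noted in comments (an assumption, not the paper's setup).

    // Possible compiler invocations for the same source (assumed, see above):
    //   nvcc    -O3 -arch=sm_80            saxpy.cu  -o saxpy
    //   clang++ -O3 --cuda-gpu-arch=sm_80  saxpy.cu  -o saxpy -lcudart
    //   hipcc   -O3                        saxpy.hip -o saxpy   (after hipify)
    __global__ void saxpy(float a, const float* x, float* y, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }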
FP8-Flow-MoE: A Casting-Free FP8 Recipe without Double Quantization Error

#FP8 #Precision

hgpu.org?p=30341
Training large Mixture-of-Experts (MoE) models remains computationally prohibitive due to their extreme compute and memory demands. Although low-precision training promises to accelerate computatio…
November 9, 2025 at 4:28 PM
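A toy, host-side sketch of what "double quantization error" generally refers to: quantize, dequantize, then requantize under a second scale can compound rounding error versus one direct cast. The simulated quantizer, scales, and values below are made up for illustration and are not the paper's FP8 recipe.

    #include <cmath>
    #include <cstdio>

    // Simulated symmetric quantizer (stand-in for an FP8 cast with a scale).
    float quantize(float x, float scale, int levels) {
        float q   = std::round(x / scale);
        float lim = levels / 2.0f - 1.0f;
        q = std::fmax(-lim, std::fmin(lim, q));
        return q * scale;
    }

    int main() {
        float x     = 0.7348f;                                       // arbitrary value
        float once  = quantize(x, 0.01f, 256);                       // single direct cast
        float twice = quantize(quantize(x, 0.04f, 256), 0.01f, 256); // cast, then re-cast
        printf("direct err = %g, double err = %g\n",
               std::fabs(once - x), std::fabs(twice - x));
        return 0;
    }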
A Study of Floating-Point Precision Tuning in Deep Learning Operators Implementations

#CUDA #DeepLearning #DL #Package

hgpu.org?p=30330
Deep learning (DL) has already played a significant role in numerous fields, making it crucial to ensure the stability of both training and inference in DL systems. The computation of DL models can…
November 2, 2025 at 4:05 PM
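As one example of the precision decisions such a study covers (not an operator from the paper): accumulating a long reduction in float versus a wider accumulator.

    #include <cstdio>

    int main() {
        const int n = 1 << 24;
        float  acc_f = 0.0f;
        double acc_d = 0.0;
        for (int i = 0; i < n; ++i) {
            float v = 1e-4f;   // many small contributions, as in a mean or softmax denominator
            acc_f += v;        // float accumulator drops low-order bits as the sum grows
            acc_d += v;        // wider accumulator retains them
        }
        printf("float acc = %.6f, double acc = %.6f\n", acc_f, (float)acc_d);
        return 0;
    }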
A Compute Graph Simulation and Implementation Framework Targeting AMD Versal AI Engines

#AMD #FPGA #CodeGeneration #AI

hgpu.org?p=30316
We present a framework for developing compute graph-based applications targeting the AI Engine (AIE) array of AMD Versal SoCs. This framework enables users to embed AIE-based dataflow graph prototy…
October 26, 2025 at 8:03 PM
Adaptivity in AdaptiveCpp: Optimizing Performance by Leveraging Runtime Information During JIT-Compilation

#SYCL #HIP #CUDA #Performance #Package

hgpu.org?p=30304
Specializing kernels by including runtime information during just-in-time (JIT) compilation can improve performance at the expense of potentially generating more kernels. In this work, we contribu…
October 19, 2025 at 8:40 PM
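A rough analogy for specializing on runtime information, written with C++ templates in CUDA rather than AdaptiveCpp's IR-level JIT mechanism: once a runtime-known value becomes a compile-time constant, the compiler can fold the address math and unroll.

    // Generic kernel: the stride is read at run time, limiting optimization.
    __global__ void scale_generic(float* x, int n, int stride, float a) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i * stride] *= a;
    }

    // Specialized kernel: the stride is baked in (e.g. Stride == 1), so the
    // multiply disappears and accesses become provably contiguous.
    template <int Stride>
    __global__ void scale_specialized(float* x, int n, float a) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i * Stride] *= a;
    }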