The LLVM Project Blog
LLVM Project News and Details from the Trenches

GSoC 2025: Rich Disassembler for LLDB
Hello! I’m Abdullah Amin, and this summer I had the exciting opportunity to participate in Google Summer of Code (GSoC) 2025 with the LLVM Compiler Infrastructure project. My project focused on extending LLDB with a **Rich Disassembler** that annotates machine instructions with source-level variable information.

**Mentors:** Adrian Prantl and Jonas Devlieghere

* * *

## Project Background

LLDB is LLVM’s debugger, capable of source-level debugging across multiple platforms and architectures. While LLDB’s disassembler shows machine instructions, it doesn’t provide much insight into how variables from the original program map onto registers or memory.

The goal of this project was to use DWARF debug information to enhance LLDB’s disassembly with **variable lifetime and location annotations**. This allows developers to better understand what each register contains at a given point in time, and how variables flow across instructions.

For example, instead of just seeing register usage:

```
0x100000f80 <+32>: movq (%rbx,%r15,8), %rdi
```

…the rich disassembler can add context:

```
0x100000f80 <+32>: movq (%rbx,%r15,8), %rdi ; i = r15
```

This makes it much easier to reason about code, especially in optimized builds.

* * *

## What We Accomplished

Over the summer, I implemented a prototype that integrates DWARF variable location ranges into LLDB’s disassembly pipeline. The key accomplishments are:

* **DWARFExpressionEntry API**: Added a new helper (`GetExpressionEntryAtAddress`) to expose variable location ranges from DWARF debug info. _PR: #144238_
* **Register-Resident Variable Annotations**: Updated the disassembler to annotate instructions when variables enter, change, or leave registers.
  _PR: #147460_
* **Stateful Variable Tracking**: Extended `Disassembler::PrintInstructions()` to track live variable states across instructions, emitting transitions such as:
  * `var = RDI` when a variable becomes live in a register
  * `var = <undef>` when a variable goes out of scope

  _PR: #152887_
* **Portable Tests**: Added new LLDB API tests under `lldb/test/API/functionalities/disassembler-variables/`. These use stable `.s` files (with original C seeds as comments) to generate `.o` files for disassembly. This ensures reliable, portable tests independent of compiler optimizations. _PRs: #152887 #155942 #156026_

Example test coverage includes:

* Function parameters passed in registers (integer, floating point, and mixed).
* Variables live across function calls.
* Loop-based register rotation.
* Constants and undefined ranges.

* * *

## How to use it

The annotations are available from LLDB’s disassembler. Enable them with:

```
(lldb) disassemble --variable-annotations
```

or

```
(lldb) disassemble -v
```

You can also target a specific symbol:

```
(lldb) disassemble -n loop_reg_rotate --variable-annotations
```

or

```
(lldb) disassemble -n loop_reg_rotate -v
```

### Example

C seed (kept minimal but forces interesting register reshuffles):

```c
__attribute__((noinline))
int loop_reg_rotate(int n, int seed) {
  volatile int acc = seed;  // keep as a named local
  int i = 0, j = 1, k = 2;  // extra pressure but not enough to spill
  for (int t = 0; t < n; ++t) {
    // Mix uses so the allocator may reshuffle regs for 'acc'
    acc = acc + i;
    asm volatile("" :: "r"(acc));  // pin 'acc' live here
    acc = acc ^ j;
    asm volatile("" :: "r"(acc));  // and here
    acc = acc + k;
    i ^= acc;
    j += acc;
    k ^= j;
  }
  asm volatile("" :: "r"(acc));
  return acc + i + j + k;
}
```

Disassembly with variable annotations (excerpt):

```
loop_reg_rotate.o`loop_reg_rotate:
0x0  <+0>:  pushq %rbp                ; n = RDI, seed = RSI
0x1  <+1>:  movq  %rsp, %rbp
0x4  <+4>:  movl  %esi, -0x4(%rbp)
0x7  <+7>:  testl %edi, %edi          ; j = 1, k = 2, t = 0, i = 0
0x9  <+9>:  jle   0x3d                ; <+61> at loop_reg_rotate.c
0xb  <+11>: xorl  %eax, %eax
0xd  <+13>: movl  $0x1, %edx
0x12 <+18>: movl  $0x2, %ecx
0x17 <+23>: nopw  (%rax,%rax)         ; j = RDX, k = RCX, i = RAX, t = <undef>
0x20 <+32>: addl  %eax, -0x4(%rbp)
0x23 <+35>: movl  -0x4(%rbp), %esi
0x26 <+38>: xorl  %edx, -0x4(%rbp)
0x29 <+41>: movl  -0x4(%rbp), %esi
0x2c <+44>: addl  %ecx, -0x4(%rbp)
0x2f <+47>: xorl  -0x4(%rbp), %eax
0x32 <+50>: addl  -0x4(%rbp), %edx
0x35 <+53>: xorl  %edx, %ecx
0x37 <+55>: decl  %edi
0x39 <+57>: jne   0x20                ; <+32> at loop_reg_rotate.c:8:9
0x3d <+61>: movl  $0x2, %ecx          ; j = 1, k = 2, i = 0
0x42 <+66>: movl  $0x1, %edx
0x47 <+71>: xorl  %eax, %eax
0x49 <+73>: movl  -0x4(%rbp), %esi    ; j = <undef>, k = <undef>, i = <undef>
0x4c <+76>: addl  %edx, %eax
0x4e <+78>: addl  %ecx, %eax
0x50 <+80>: addl  -0x4(%rbp), %eax
0x53 <+83>: popq  %rbp
0x54 <+84>: retq
```

In this example:

* Function parameters are annotated at entry (`n = RDI`, `seed = RSI`).
* Local temporaries (`i`, `j`, `k`) become live in specific registers and later go away when they leave scope or change location.
* Only transitions are printed (start/change/end), keeping the output concise.

* * *

## Current State

* **Working prototype complete:** Rich disassembly annotations are now functional for variables that reside fully in registers or constants.
* **Tested and validated:** A comprehensive set of tests confirms correctness across multiple scenarios, including register rotation, constants, and live-across-call variables.
* **Upstreamed into LLVM:** The core implementation, supporting infrastructure, and final refactoring/formatting changes have all been merged into the main LLVM repository. This means the feature is available in the latest development builds of LLDB.

* * *

## What’s Left to Do

One original goal of the project was to expose the rich disassembly annotations as **structured data** through LLDB’s scripting API, so that tooling can build on top of it. While the textual annotations and stateful tracking are complete, this structured API exposure remains future work.
I plan to continue working on this beyond GSoC as a follow-up contribution.

* * *

## Challenges and Learnings

* **DWARF complexity:** Navigating DWARF location expressions and ranges was challenging, but I gained a deep understanding of how debuggers map source variables to registers and memory.
* **Testing portability:** Early attempts at hand-writing DWARF with `yaml2obj` proved too fragile. Switching to compiler-generated `.s` files provided stable and portable tests.
* **Collaboration:** Working with my mentors taught me the value of incremental, reviewable patches and iterative design.

* * *

## Conclusion

LLDB’s disassembler is a feature aimed at advanced programmers who need detailed insights into optimized machine code. With the new variable annotations, it becomes easier to understand how source-level variables map to registers and how their lifetimes evolve, bridging the gap between source code and raw assembly. Future work will focus on structured API exposure, enabling new tooling to build on these annotations.

I am grateful to my mentors, **Adrian Prantl** and **Jonas Devlieghere**, for their guidance and support throughout the project, and to the LLVM community for reviewing and testing my work.

* * *

## Related Links

* PR #144238: Add DWARFExpressionEntry API
* PR #147460: Annotate disassembly with register-resident variables
* PR #152887: Stateful variable-location annotations
* PR #155942: Fix workflow testing issues (part 1)
* PR #156026: Fix workflow testing issues (part 2)
* PR #156118: Final code formatting and refactoring
* LLVM Repository
* My GitHub Profile
blog.llvm.org
November 17, 2025 at 11:37 PM
GSoC 2025: Introducing an ABI Lowering Library
# Introduction

In this post I’m going to outline details about a new ABI lowering library I’ve been developing for LLVM as part of GSoC 2025! The aim was to extract the ABI logic from Clang and create a reusable library that any LLVM frontend can use for correct C interoperability.

# The Problem We’re Solving

At the start of the program, I wrote about the fundamental gap in LLVM’s target abstraction. The promise is simple: frontends emit LLVM IR, and LLVM handles everything else. But this promise completely breaks down when it comes to Application Binary Interface (ABI) lowering. Every LLVM frontend that wants C interoperability has to reimplement thousands of lines of target-specific ABI logic. Here’s what that looks like in practice:

```c
struct Point { float x, y; };
struct Point add_points(struct Point a, struct Point b);
```

Seems innocent enough, right? But generating correct LLVM IR for this requires knowing:

* Are the struct arguments passed in registers or memory?
* If in registers, what register class is used?
* Are multiple values packed into a single register?
* Is the struct returned in registers or using a hidden return parameter?

The answer depends on subtle ABI rules that are target-specific, constantly evolving, and absolutely critical to get right. Miss one detail and you get silent memory corruption. This godbolt link shows the same simple struct using six different calling conventions across six different targets. And crucially: a frontend generating IR needs to know ALL of this before it can emit the right function signature.

As I outlined in my earlier blog post, LLVM’s type system simply can’t express all the information needed for correct ABI decisions. Two otherwise identical structs with different explicit alignment attributes have different ABIs. `__int128` and `_BitInt(128)` look similar but follow completely different rules.
# The Design

## Independent ABI Type System

At the heart of the library is `llvm::abi::Type`, a type system designed specifically for ABI decisions:

```cpp
class Type {
protected:
  TypeKind Kind;
  TypeSize SizeInBits;
  Align ABIAlignment;

public:
  TypeKind getKind() const { return Kind; }
  TypeSize getSizeInBits() const { return SizeInBits; }
  Align getAlignment() const { return ABIAlignment; }

  bool isInteger() const { return Kind == TypeKind::Integer; }
  bool isStruct() const { return Kind == TypeKind::Struct; }
  // ... other predicates that matter for ABI
};
```

It contains **more information than LLVM IR types** (which, for instance, don’t distinguish between `__int128` and `_BitInt(128)`; both are just `i128`), but **less information than frontend types** like Clang’s QualType (which carry parsing context, sugar, and other frontend-specific concerns that don’t matter for calling conventions).

```cpp
class IntegerType : public Type {
private:
  bool IsSigned;
  bool IsBitInt; // Crucially different from __int128!

public:
  IntegerType(uint64_t BitWidth, Align Align, bool Signed,
              bool BitInt = false);
};
```

## Frontend-to-ABI Mapping

The `QualTypeMapper` class handles the job of converting Clang frontend types to ABI types. **The ABI library is primarily intended to handle the C ABI.** The C type system is relatively simple, and as such the type mapping from frontend types to ABI types is straightforward: integers map to `IntegerType`, pointers map to `PointerType`, and structs map to `StructType` with their fields and offsets preserved.

However, Clang also needs support for the C++ ABI, and the type mapping for this case is significantly more complicated. C++ object layout involves vtables, base class subobjects, virtual inheritance, and all sorts of edge cases that need to be preserved for correct ABI decisions.
Here’s an excerpt showing how `QualTypeMapper` tackles C++ inheritance:

```cpp
const llvm::abi::StructType *
QualTypeMapper::convertCXXRecordType(const CXXRecordDecl *RD,
                                     bool canPassInRegs) {
  const ASTRecordLayout &Layout = ASTCtx.getASTRecordLayout(RD);

  SmallVector<llvm::abi::FieldInfo, 16> Fields;
  SmallVector<llvm::abi::FieldInfo, 8> BaseClasses;
  SmallVector<llvm::abi::FieldInfo, 8> VirtualBaseClasses;

  // Handle vtable pointer for polymorphic classes
  if (RD->isPolymorphic()) {
    const llvm::abi::Type *VtablePointer =
        createPointerTypeForPointee(ASTCtx.VoidPtrTy);
    Fields.emplace_back(VtablePointer, 0);
  }

  // Process base classes with proper offset calculation
  for (const auto &Base : RD->bases()) {
    const llvm::abi::Type *BaseType = convertType(Base.getType());
    uint64_t BaseOffset =
        Layout.getBaseClassOffset(
            Base.getType()->castAs<RecordType>()->getAsCXXRecordDecl())
            .getQuantity() * 8;

    if (Base.isVirtual())
      VirtualBaseClasses.emplace_back(BaseType, BaseOffset);
    else
      BaseClasses.emplace_back(BaseType, BaseOffset);
  }

  // ... field processing and final struct creation
}
```

Other frontends that only need C interoperability will have a much simpler mapping task.

## Target-Specific Classification

Each target implements the `ABIInfo` interface. I’ll show the BPF implementation here since it’s one of the simplest ABIs in LLVM; the classification logic fits in about 50 lines of code with straightforward rules: small aggregates go in registers, larger ones are passed indirectly. It’s worth noting that most real-world ABIs are not _this_ simple; targets like X86-64 are significantly more complex.
```cpp
class BPFABIInfo : public ABIInfo {
private:
  TypeBuilder &TB;

public:
  BPFABIInfo(TypeBuilder &TypeBuilder) : TB(TypeBuilder) {}

  ABIArgInfo classifyArgumentType(const Type *ArgTy) const {
    if (isAggregateTypeForABI(ArgTy)) {
      auto SizeInBits = ArgTy->getSizeInBits().getFixedValue();
      if (SizeInBits == 0)
        return ABIArgInfo::getIgnore();

      if (SizeInBits <= 128) {
        const Type *CoerceTy;
        if (SizeInBits <= 64) {
          auto AlignedBits = alignTo(SizeInBits, 8);
          CoerceTy = TB.getIntegerType(AlignedBits, Align(8), false);
        } else {
          const Type *RegTy = TB.getIntegerType(64, Align(8), false);
          CoerceTy = TB.getArrayType(RegTy, 2, 128);
        }
        return ABIArgInfo::getDirect(CoerceTy);
      }

      return ABIArgInfo::getIndirect(ArgTy->getAlignment().value());
    }

    if (const auto *IntTy = dyn_cast<IntegerType>(ArgTy)) {
      auto BitWidth = IntTy->getSizeInBits().getFixedValue();
      if (IntTy->isBitInt() && BitWidth > 128)
        return ABIArgInfo::getIndirect(ArgTy->getAlignment().value());
      if (isPromotableInteger(IntTy))
        return ABIArgInfo::getExtend(ArgTy);
    }

    return ABIArgInfo::getDirect();
  }
};
```

The key difference is that the ABI classification logic itself is **completely independent of Clang**. Any LLVM frontend can use it by implementing a mapper from its types to `llvm::abi::Type`. The library then performs ABI classification and outputs `llvm::abi::ABIFunctionInfo` with all the lowering decisions. For Clang specifically, the `ABITypeMapper` converts those `llvm::abi::Type` results back into `llvm::Type` and populates `clang::CGFunctionInfo`, which then continues through the normal IR generation pipeline.

# Results

The library and the new type system are implemented and working in PR #140112, currently enabled for BPF and X86-64 Linux targets. You can find the implementation under `llvm/lib/ABI/` with Clang integration in `clang/lib/CodeGen/CGCall.cpp`. Here’s what we’ve achieved so far:

## Clean Architecture

The three-layer separation is working beautifully.
Frontend concerns, ABI classification, and IR generation are now properly separated:

```cpp
// Integration point in Clang
if (CGM.shouldUseLLVMABI()) {
  SmallVector<const llvm::abi::Type *, 8> MappedArgTypes;
  for (CanQualType ArgType : argTypes)
    MappedArgTypes.push_back(getMapper().convertType(ArgType));

  tempFI.reset(llvm::abi::ABIFunctionInfo::create(
      CC, getMapper().convertType(resultType), MappedArgTypes));
  CGM.fetchABIInfo(getTypeBuilder()).computeInfo(*tempFI);
} else {
  CGM.getABIInfo().computeInfo(*FI); // Legacy path
}
```

## Performance Considerations Addressed

My earlier blog post worried about the overhead of “an additional type system.” The caching strategy handles this elegantly:

```cpp
const llvm::abi::Type *QualTypeMapper::convertType(QualType QT) {
  QT = QT.getCanonicalType().getUnqualifiedType();

  auto It = TypeCache.find(QT);
  if (It != TypeCache.end())
    return It->second; // Cache hit - no recomputation

  const llvm::abi::Type *Result = /* conversion logic */;
  if (Result)
    TypeCache[QT] = Result;
  return Result;
}
```

Combined with a `BumpPtrAllocator` for type storage, the performance impact is minimal in practice. The results are encouraging: most compilation stages show essentially no performance difference (well within measurement noise). The 0.20% regression in final Clang build times is expected, since we’ve added new code to the codebase, but the actual compilation performance impact is negligible.

# Future Work

There’s still plenty to explore:

## Upstreaming the progress so far

The work is being upstreamed to LLVM in stages, starting with PR #158329. This involves addressing reviewer feedback, ensuring compatibility with existing code, and validating that the new system produces identical results to the current implementation for all supported targets.

## Extended Target Support

Currently the ABI library supports the BPF and X86-64 SysV ABIs, but the architecture makes adding ARM, Windows calling conventions, and other targets straightforward.
## Cross-Frontend Compatibility

The real test will be when other frontends start using the library. We need to ensure that all frontends generate identical calling conventions for the same C function signature.

## Better Integration

There are still some rough edges in the Clang integration that could be smoothed out, and other LLVM projects could benefit from adopting the library.

# Acknowledgements

This work wouldn’t have been possible without my amazing mentors, Nikita Popov and Maksim Levental, who provided invaluable guidance throughout the project. The LLVM community’s feedback on the original RFC was instrumental in shaping the design. Special thanks to everyone who reviewed the code, provided feedback, and helped navigate all the ABI corner cases. The architecture only works because it’s built on decades of accumulated ABI knowledge that was already present in LLVM and Clang.

Looking back at my precursor blog post from earlier this year, I’m amazed at how much the design evolved during implementation. What started as a relatively straightforward “extract Clang’s ABI code” became a much more ambitious architectural rework. But the result is something that’s genuinely useful for the entire LLVM ecosystem.
blog.llvm.org
November 3, 2025 at 11:23 PM
GSoC 2025: Usability Improvements for the Undefined Behavior Sanitizer
## Introduction

My name is Anthony, and I had the pleasure of working on improving the Undefined Behavior Sanitizer this Google Summer of Code 2025. My mentors were Dan Liew and Michael Buch.

## Background

Undefined Behavior Sanitizer (UBSan) is a tool for detecting a subset of the undefined behaviors in the C, C++, and Objective-C languages at runtime. This project focused mainly on the trapping variant of UBSan, which is invoked through `-fsanitize-trap=<...>` along with `-fsanitize=<...>`. Trapping UBSan is a lighter-weight version of UBSan because, upon detection of undefined behavior, a trap instruction is executed rather than calling into a runtime library to handle the undefined behavior. This makes it more appealing for kernel, embedded, and production hardening use cases. For the cases of undefined behavior that UBSan can detect, check out the official clang documentation.

An issue with trapping UBSan prior to my work was that it was much harder to debug undefined behavior when it was detected, compared to the non-trapping mode. To illustrate this, consider this C program that reads integers from the command line arguments and adds them.

```c
#include <stdlib.h>
#include <stdio.h>

int add(int a, int b) {
  return a + b;
}

int main(int argc, const char** argv) {
  if (argc < 3)
    return 1;
  int a = atoi(argv[1]);
  int b = atoi(argv[2]);
  int result = add(a, b);
  printf("Added %d + %d = %d\n", a, b, result);
  return 0;
}
```

If this program is compiled and executed using UBSan with its userspace runtime, it provides helpful output diagnosing the problem and also allows execution to continue.
```
$ bin/clang -fsanitize=undefined add.c -g -o add && ./add 2147483647 1
add.c:5:13: runtime error: signed integer overflow: 2147483647 + 1 cannot be represented in type 'int'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior add.c:5:13
Added 2147483647 + 1 = -2147483648
```

In contrast, when using UBSan in trapping mode, the program immediately terminates when undefined behavior is detected, as shown below.

```
$ clang -fsanitize=undefined -fsanitize-trap=undefined add.c -g -o add && ./add 2147483647 1
[1]    54357 trace trap  ./add 2147483647 1
```

This is the expected behavior of trapping mode, but how should a developer debug what happened when a trap is hit? If we attach a debugger and run the example program, this is the output LLDB shows.

```
$ lldb ./add -- 2147483647 1
(lldb) target create "./add"
(lldb) settings set -- target.run-args "2147483647" "1"
(lldb) r
Process 17347 launched: 'add' (arm64)
Process 17347 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BREAKPOINT (code=1, subcode=0x100003d3c)
    frame #0: 0x0000000100003d3c add`add(a=2147483647, b=1) at add.c:5:13
   2    #include <stdio.h>
   3
   4    int add(int a, int b) {
-> 5      return a + b;
   6    }
   7
   8    int main(int argc, const char** argv) {
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BREAKPOINT (code=1, subcode=0x100003d3c)
  * frame #0: 0x0000000100003d3c add`add(a=2147483647, b=1) at add.c:5:13
    frame #1: 0x0000000100003eec add`main(argc=3, argv=0x000000016fdff110) at add.c:13:18
    frame #2: 0x00000001842bab98 dyld`start + 6076
(lldb) dis -p
add`add:
->  0x100003d3c <+40>: brk    #0x5500
    0x100003d40 <+44>: ldr    w0, [sp, #0x4]
    0x100003d44 <+48>: add    sp, sp, #0x10
    0x100003d48 <+52>: ret
```

We can see that a `brk` instruction was hit while handling the `a + b` expression, but there is no good explanation of what happened. `brk` is the trap instruction on arm64, but it is not particularly clear that this has anything to do with UBSan.
For this toy example we can speculate that integer overflow occurred, because the program was built with trapping UBSan and the trap was hit while handling `a + b`. However, in real programs built with trapping UBSan and potentially other hardening mechanisms, it is often far less obvious what happened.

For this particular example, the information that this is an integer-overflow UBSan check is actually there, but it is not very obvious. On x86_64 and arm64 the reason for trapping is encoded in the operand of the trap instruction. In this case the `#0x5500` immediate to the `brk` instruction encodes that this is a UBSan trap for integer overflow. The UBSan immediate is encoded as `('U' << 8) + SanitizerHandler`, where `SanitizerHandler` is the enum value from the `SanitizerHandler` enum inside Clang’s internals.

As we can see, the debugging experience with UBSan traps is not ideal, and improving this was the primary goal of the GSoC project.

## Human readable descriptions of UBSan traps in LLDB

The natural place to tackle the debugging experience was to look at debugger integration.

### Let the debugger handle most of the work

One approach would be to teach debuggers (e.g. LLDB) to decode the trap reason encoded in trap instructions. However, this approach wasn’t taken for several reasons:

* Using the trap reason encoded in trap instructions only works for x86_64 and arm64. The approach that I used works for all targets where debug info is supported (many more).
* Relying on decoding the trap reason encoded in the trap instruction creates a tight coupling between the compiler and the debugger, because if the encoding ever changes:
  * The debugger would need to be changed to adapt to the new encoding.
  * Older versions of the debugger would fail to work with binaries using the new encoding.
  * New versions of the debugger would fail to work with binaries using the old encoding.
* In contrast, encoding the trap reason as a string in the debug info is a much looser coupling, because the compiler is free to change the trap reason without changes to the debugger.

### Encoding the trap reason in the debug info

The approach I took is based on how `__builtin_verbose_trap` encodes its message into debug info [11] [12], a feature which was implemented in the past for libc++ hardening. The core idea is that the trap reason string gets encoded directly in the trap’s debug information. To accomplish this, we needed to find a place to “stuff” the string in the DWARF DIE tree. Using a `DW_TAG_subprogram` was deemed the most straightforward and space-efficient location. This means we create a synthetic `DISubprogram` which is not a real function in the compiled program; it exists only in the debug info as a container. While the string could have been placed elsewhere, for reasons outside the scope of this blog post, it resides on this fake function DIE, with the trap reason encoded in the `DW_TAG_subprogram`’s name. For a deeper dive into this design decision, see [15].
Let’s look at the LLVM IR of the previous example to see how this is implemented:

```
$ clang -fsanitize=undefined -fsanitize-trap=undefined add.c -g -S -emit-llvm -o - -fsanitize-debug-trap-reasons=basic
```

```llvm
; Function Attrs: noinline nounwind optnone ssp uwtable(sync)
define i32 @add(i32 noundef %a, i32 noundef %b) #0 !dbg !17 !func_sanitize !22 {
entry:
  %a.addr = alloca i32, align 4
  %b.addr = alloca i32, align 4
  store i32 %a, ptr %a.addr, align 4
    #dbg_declare(ptr %a.addr, !23, !DIExpression(), !24)
  store i32 %b, ptr %b.addr, align 4
    #dbg_declare(ptr %b.addr, !25, !DIExpression(), !26)
  %0 = load i32, ptr %a.addr, align 4, !dbg !27
  %1 = load i32, ptr %b.addr, align 4, !dbg !28
  %2 = call { i32, i1 } @llvm.sadd.with.overflow.i32(i32 %0, i32 %1), !dbg !29, !nosanitize !21
  %3 = extractvalue { i32, i1 } %2, 0, !dbg !29, !nosanitize !21
  %4 = extractvalue { i32, i1 } %2, 1, !dbg !29, !nosanitize !21
  %5 = xor i1 %4, true, !dbg !29, !nosanitize !21
  br i1 %5, label %cont, label %trap, !dbg !29, !prof !30, !nosanitize !21

trap:                                             ; preds = %entry
  call void @llvm.ubsantrap(i8 0) #4, !dbg !31, !nosanitize !21
  unreachable, !dbg !31, !nosanitize !21

cont:                                             ; preds = %entry
  ret i32 %3, !dbg !34
}

; ...

!29 = !DILocation(line: 5, column: 13, scope: !17)
!30 = !{!"branch_weights", i32 1048575, i32 1}
!31 = !DILocation(line: 0, scope: !32, inlinedAt: !29)
!32 = distinct !DISubprogram(name: "__clang_trap_msg$Undefined Behavior Sanitizer$Integer addition overflowed", scope: !2, file: !2, type: !33, flags: DIFlagArtificial, spFlags: DISPFlagDefinition, unit: !14)
```

The debug metadata for the `@llvm.ubsantrap` call is `!31`. That `DILocation` has the scope of the `DISubprogram` assigned to `!32`, which is the artificial function that encodes the trap category. This function’s name is formatted as `__clang_trap_msg$<Category>$<TrapMessage>` to encode the trap category (`Undefined Behavior Sanitizer`) and the specific message (`Integer addition overflowed`).
This function does not actually exist in the compiled program. It only exists in the debug info as a convenient way to describe the reason for trapping. When a trap is hit in the debugger, the debugger retrieves this string from the debug info and shows it as the reason for trapping. Note that the `DILocation` for `!31` has `inlinedAt:`, which tells us that the trap was inlined from `!32` into the location at `!29`, the location of the `a + b` expression in the `add` function. I implemented this change in this PR.

### Debug info size changes

One concern raised by a reviewer was the debug info size difference. This was one of the motivations for putting this feature under the new `-fsanitize-debug-trap-reasons` flag, because initially (prior to code review) my mentors and I had planned to have the trap reason feature accompany the `-fsanitize-trap=` flag. Although the `-fsanitize-debug-trap-reasons` flag is on by default (so long as trapping UBSan is enabled), having the trap reason feature under a flag allows users to opt out via `-fno-sanitize-debug-trap-reasons`.

Using bloaty, I compared a release build of clang with the `-fsanitize-debug-trap-reasons` flag enabled against one with it disabled (`-fno-sanitize-debug-trap-reasons`). We found that the size difference was negligible; results are below.
```
    FILE SIZE        VM SIZE
 --------------  --------------
  +0.3% +6.01Mi  +0.3% +6.01Mi    ,__debug_info
  +2.0% +2.26Mi  [ = ]       0    [Unmapped]
  +1.2% +1.35Mi  +1.2% +1.35Mi    ,__apple_names
  +0.0% +1.01Mi  +0.0% +1.01Mi    ,__debug_str
  +0.8%  +636Ki  +0.8%  +635Ki    ,__debug_line
  +0.4%  +161Ki  +0.4%  +161Ki    ,__debug_ranges
  +0.4% +47.9Ki  +0.4% +47.9Ki    ,__debug_abbrev
  +0.0%     +14  +0.0%     +14    ,__apple_types
  [ = ]       0  +0.0%      +8    ,__common
  [ = ]       0  +7.1%      +4    ,__thread_bss
  -0.0%      -4  -0.0%      -4    ,__const
  -0.0% -1.27Ki  -0.0% -1.27Ki    ,__cstring
  +0.2% +11.5Mi  +0.1% +9.19Mi    TOTAL
```

Note that the code size difference is likely negligible because, in optimized builds, trap instructions in a function get merged together, which causes the additional debug info my patch adds to be dropped. Realistically this will add a few more abbreviations into `.debug_abbrev` (the DWARF abbreviation section) and only a few extra bytes per UBSan trap (abbreviation code + 1 ULEB128 for the index into the string offset table) into `.debug_info` (the DWARF debug-info section). The rest of the `DW_TAG_subprogram` is encoded in the abbreviation for that fake frame [16]. The cost is also contingent on the number of traps emitted, since a new `DW_TAG_subprogram` DIE is emitted for each trap with this new feature.

A later comparison on a larger code base (“Big Google Binary”) actually found a rather significant size increase of about 18% with trap reasons enabled. Future work may involve looking into why this is happening, and how such drastic size increases can be reduced.

### Displaying the trap reason in the debugger

With support in the compiler for encoding the trap reasons for UBSan implemented, I then turned my attention to displaying these in the LLDB debugger.
In this particular case nothing new needed to be implemented in LLDB, because the `VerboseTrapFrameRecognizer` in LLDB, originally implemented for `__builtin_verbose_trap`, is general enough that it already supports any artificial function in the debug info of the form `__clang_trap_msg$<Category>$<TrapMessage>`. So if we take the running example and run it under LLDB, its output now looks like this:

```
$ clang -fsanitize=undefined -fsanitize-trap=undefined add.c -g -o add
$ lldb ./add -- 2147483647 1
(lldb) target create "add"
(lldb) settings set -- target.run-args "2147483647" "1"
(lldb) r
Process 81705 launched: '/add' (arm64)
Process 81705 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = Undefined Behavior Sanitizer: Integer addition overflowed
    frame #1: 0x0000000100003d3c add`add(a=2147483647, b=1) at add.c:5:13
   2    #include <stdio.h>
   3
   4    int add(int a, int b) {
-> 5      return a + b;
   6    }
   7
   8    int main(int argc, const char** argv) {
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = Undefined Behavior Sanitizer: Integer addition overflowed
    frame #0: 0x0000000100003d3c add`__clang_trap_msg$Undefined Behavior Sanitizer$Integer addition overflowed at add.c:0 [inlined]
  * frame #1: 0x0000000100003d3c add`add(a=2147483647, b=1) at add.c:5:13
    frame #2: 0x0000000100003eec add`main(argc=3, argv=0x000000016fdff110) at add.c:13:18
    frame #3: 0x00000001842bab98 dyld`start + 6076
(lldb) dis -p
add`__clang_trap_msg$Undefined Behavior Sanitizer$Integer addition overflowed:
->  0x100003d3c <+40>: brk    #0x5500
    0x100003d40 <+44>: ldr    w0, [sp, #0x4]
    0x100003d44 <+48>: add    sp, sp, #0x10
    0x100003d48 <+52>: ret
```

Notice that:

* The stop reason now shows as `Undefined Behavior Sanitizer: Integer addition overflowed`. Previously no helpful stop reason was shown.
* We are stopped with `frame #1` selected, and the artificial frame (`frame #0`) is present in the backtrace.
  LLDB does this so that the stop is not shown inside the artificial function, which would be confusing.
* The `dis -p` output claims we are inside the artificial function. This is an artifact of the implementation that is a little confusing, but worth the trade-off.

So for this portion of my GSoC project, the only thing I needed to do was add a test case to ensure LLDB behaved appropriately. This was done in this PR.

## RFC: Add a warning when `-fsanitize-trap=` is passed without an associated `-fsanitize=`

The next part of my GSoC project was to post an RFC and implement a sketch fix for a usability problem with trapping UBSan. Currently, clang does not warn about cases where `-fsanitize-trap=` does nothing (a silent no-op), particularly when `-fsanitize-trap=` is passed without `-fsanitize=`. For example:

```
$ clang -fsanitize-trap=undefined foo.c
```

emits no warning, even though `-fsanitize-trap=undefined` is not doing anything here. We thought it would be more user-friendly to add a warning for such cases, but due to some initial community pushback, it was decided that an RFC should be opened. I ended up writing a sketch patch that emitted a warning for such cases, so the invocation above would now emit:

```
warning: -fsanitize-trap=undefined has no effect because the "undefined" sanitizer is disabled; consider passing "-fsanitize=undefined" to enable the sanitizer
```

Unfortunately, we found that the emission of such warnings could become exceedingly complicated and a point of contention due to the existence of sanitizer groups, subgroups, and individual sanitizers. Determining the correct behavior for various cases, historical precedent with no-ops, interference with current build systems, prioritization of existing build systems over the user experience, and compatibility with gcc led to the end of the RFC.

## Expand upon the hard-coded strings in `-fsanitize-debug-trap-reasons` to be more specific

There were two initial design decisions to pick from here.
Either I could: **(a)** use string formatting, such as LLVM’s `formatv` (FormatVariadic) or `raw_ostream`. For some implementation context, the trap messages were generated in a function called `EmitTrapCheck`, so the idea was to pass extra information down the call stack from before `EmitTrapCheck` was called. Or **(b)** extend Clang’s diagnostic system to accommodate trap reasons. This is explained further below.

Due to the time it took to complete the first two tasks, I chose the first option. I deemed the second option too large a commitment to finish by the end of the GSoC coding period. Additionally, I was unsure whether building on top of the diagnostics subsystem would be approved, since diagnostics were originally intended to emit messages on the command line, not into debug info. After taking some time to investigate cases where extra information in trap messages could be useful, I put up a PR. The patch was admittedly quite messy, so to take a cleaner approach more aligned with Clang’s frontend, one of my mentors, Dan, ended up following through with option (b), extending the Clang diagnostics subsystem to work with trap messages. Extending the diagnostics subsystem lets us leverage its powerful string formatting engine.

However, this doesn’t mean the effort spent writing proper hard-coded trap messages under `-fsanitize-debug-trap-reasons` was thrown away. As of Dan’s patch, the flag now has two options: `-fsanitize-debug-trap-reasons=basic` for the hard-coded trap messages, and `-fsanitize-debug-trap-reasons=detailed` for the detailed trap messages that use the trap-reasons diagnostics API. This was done in case users do not want to deal with the larger binary sizes that come with the detailed trap reasons.
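To illustrate option (a), here is a minimal sketch of how extra operand information could be folded into a trap message with plain string formatting. The function name and message shape are hypothetical, invented for this example; the upstreamed work instead routes trap reasons through Clang’s diagnostics engine (option (b)).

```cpp
#include <sstream>
#include <string>

// Hypothetical sketch of option (a): build a more specific trap
// message by passing the operation and operand type down to the
// point where the trap message is emitted.
std::string makeTrapMessage(const std::string &Op, const std::string &Type) {
  std::ostringstream OS;
  OS << "signed " << Type << " " << Op << " overflowed";
  return OS.str();
}
```

The appeal of option (b) is that this kind of ad-hoc concatenation is replaced by the diagnostics subsystem’s existing formatting engine.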
## What I’ve Learned

Before I started this GSoC, I barely knew how to build Clang and LLVM or use git in a large open-source project. My mentors showed me the ropes on a lot of things (particularly how to properly use git and how to build and configure Clang), and I came out of this summer knowing much more about how to get my changes properly reviewed and upstreamed. I also gained a firmer understanding of the Undefined Behavior Sanitizer, better C++ programming practices, and the LLVM codebase.

## Work to Do

As stated earlier, some research is needed to figure out how the binary size increase can be minimized. As also stated previously, the diagnostics extension for trap messages has been upstreamed by Dan. Right now, only signed and unsigned overflow for addition, subtraction, and multiplication use this system. I plan to integrate what I found in my abandoned PR by building on top of what Dan has already done; this will happen after the GSoC coding period.

There is also an issue where trap messages are not emitted in cases where they should be, due to a null check. The purpose of the null check was to prevent a nullptr dereference that occurred in the debug-info prologue. This is a known issue for which there is currently no concrete solution.

## Conclusion

I want to give a special thanks to my mentors, Dan and Michael, for being there for me the whole way. They helped a lot with guiding me through git, the LLVM codebase, and even this blog post. I appreciate their commitment to the project and their patience with me. I’m incredibly grateful that I was able to work on this project, and I wouldn’t have traded it for anything else. Being a beginner to both LLVM and open source, I have to admit I was overwhelmed at first, but slowly, with their help, I gained at least a semblance of understanding of how things worked. I could not have asked for a better set of mentors, so again, a huge thanks to them.
I also want to extend my gratitude to the LLVM Foundation for this opportunity. I’ve had a lot of fun with this project, and I hope to contribute more to LLVM, and to open source in general, in the future.

## Landed PRs

https://github.com/llvm/llvm-project/commits?author=anthonyhatran

## External Links

[1] https://github.com/llvm/llvm-project/pull/145967
[2] https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html#id5
[3] https://discourse.llvm.org/t/clang-gsoc-2025-usability-improvements-for-trapping-undefined-behavior-sanitizer/84568/11
[4] https://github.com/llvm/llvm-project/pull/147997
[5] https://discourse.llvm.org/t/rfc-emit-a-warning-when-fsanitize-trap-is-passed-without-associated-fsanitize/87893
[6] https://github.com/llvm/llvm-project/pull/154618
[7] https://github.com/llvm/llvm-project/pull/153845
[8] https://github.com/llvm/llvm-project/issues/150707
[9] https://maskray.me/blog/2023-01-29-all-about-undefined-behavior-sanitizer
[10] https://github.com/llvm/llvm-project/pull/151231
[11] https://discourse.llvm.org/t/rfc-adding-builtin-verbose-trap-string-literal/75845
[12] https://github.com/llvm/llvm-project/pull/79230
[13] https://discourse.llvm.org/t/rfc-hardening-in-libc/73925
[14] https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html
[15] https://github.com/llvm/llvm-project/pull/145967#issuecomment-3054319138
[16] https://github.com/llvm/llvm-project/pull/145967#issuecomment-3068862442
blog.llvm.org
October 20, 2025 at 11:17 PM
GSoC 2025 - Support simple C++20 modules use from the Clang driver without a build system
Hi, my name is Naveen. For Google Summer of Code 2025, I’ve been working on adding native support for simple Clang and C++20 named module use from the Clang driver. This post outlines the project and its current status. My mentor for this project was Michael Spencer.

## Background

Modules solve many of the long-standing problems with the traditional header-based way of sharing code. They prevent leaking macros, let you explicitly choose what to export, and can improve compile time at scale. However, because modules must be precompiled before use, builds that rely on them need to schedule compilations in the right order, according to their imports. At the moment, Clang’s driver lacks native support to do this, which makes even simple tests or tiny programs using modules hard to compile without first setting up a build system.

## Goals

The goal of this project is to extend the build system in Clang’s driver to natively support simple use of Clang or C++20 named modules, by integrating Clang’s existing support for module dependency scanning. This should also support importing the C++ Standard Library modules `std` and `std.compat`, and add no overhead to cases where modules are not used. With the feature fully implemented, the following example should compile without any issue:

clang++ -std=c++23 main.cpp A.cpp -fmodules-driver -fmodules -fmodule-map-file=module.modulemap

// main.cpp
#include "MyLib.h"
import std;
import A;
auto main() -> int {
  std::println("{}", make_greeting("modules"));
  std::println("The answer is: {}", get_answer());
}

// A.cpp
export module A;
import std;
export auto make_greeting(std::string_view Name) -> std::string {
  return std::format("Hello, {}!", Name);
}

// module.modulemap
module MyLib {
  header "MyLib.h"
  export *
}

// MyLib.h
auto get_answer() -> int { return 42; }

Although one of the main advantages of modules is that they can be precompiled once and reused, support for caching was not included in the scope of this GSoC project.
## Design Overview & Challenges

### 1. Check: Enable the Modules Driver?

Once stabilized, the driver-managed module build feature should be enabled automatically for compilations that use C++20 named modules and have two or more source inputs. To detect named-module usage without adding noticeable overhead to compilations that don’t use modules, we added a fast scanner that inspects only the leading lines of a source input for named-module directives. We measured compile times for building Clang with the check enabled and disabled, and profiled it using perf. The benchmarks show that the check typically makes up less than 0.1% of total compile time (full benchmarks).

### 2. Modules Driver Logic

At a high level, the modules driver logic can be summarized as: (1) scan, (2) plan the build order, and (3) reorder/modify the jobs. Some parts introduced unique challenges:

#### Handling of the Standard Library modules

Clang’s dependency scanning tooling uses the generated `-cc1` compilation jobs’ command lines as input. Since we can’t know in advance whether a standard library module will be needed, we always build the jobs for `std` and `std.compat`. During the scanning phase, we only scan those standard library modules if a command-line source input depends on them. If a standard library module ends up unused, we drop its job and carefully remove its outputs from the linker command line.

#### Propagating scan diagnostics

After completing the dependency scan, we want to forward all diagnostics generated by the scan through the driver’s diagnostics engine. Because those diagnostics are generated outside the driver’s own invocation, they become invalid once the scan process ends. To prevent this, we serialize each scan diagnostic into an intermediate representation and deserialize it back into a regular diagnostic before emitting it.
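The fast usage check described above can be sketched roughly as follows. This is a simplified illustration, not Clang’s actual scanner; the function name, the line limit, and the purely prefix-based matching are all made up for this example.

```cpp
#include <sstream>
#include <string>

// Simplified sketch: scan only the leading lines of a source file
// for C++20 named-module directives ("import ...", "export module ...",
// "module ..."), so files that don't use modules are rejected cheaply.
bool looksLikeNamedModuleUnit(const std::string &Source, int MaxLines = 50) {
  std::istringstream In(Source);
  std::string Line;
  for (int N = 0; N < MaxLines && std::getline(In, Line); ++N) {
    size_t Pos = Line.find_first_not_of(" \t");
    if (Pos == std::string::npos)
      continue; // blank line
    std::string Trimmed = Line.substr(Pos);
    if (Trimmed.rfind("import ", 0) == 0 ||
        Trimmed.rfind("export module ", 0) == 0 ||
        Trimmed.rfind("module ", 0) == 0)
      return true;
  }
  return false;
}
```

A real implementation must also handle comments, preprocessor conditionals, and the global module fragment, which is why the check lives in a proper lexer-based scanner rather than string matching.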
## Outcome

The current upstream draft for this feature (see here) can successfully compile examples that use both C++20 named modules and Clang modules. Importing Standard Library modules is also supported, and the example above compiles without issues. Additionally, the module dependency graph can be emitted as a diagnostic remark (using `-Rmodules-driver`) in the form of a DOT graph.

Regular translation units are able to import both Clang and C++20 named modules. However, importing a Clang module into a C++20 named module interface unit, or vice versa, is not yet supported.

## Future work

While basic examples using modules compile correctly, there are still many command-line options and input configurations that are incompatible with, or may break, the modules driver in unexpected ways. In the near term, I plan to fix the draft’s remaining quirks, land this feature, and make it more robust. In addition, the modules driver should gain support for caching precompiled module files, since caching is one of the core strengths of modules and makes up for the initial overhead they add. In the longer term, support for imports between different kinds of module units should also be added. Because of the current design of Clang’s dependency scanning tooling, however, allowing C++20 named modules to be imported into Clang modules would require deeper architectural changes.

## Acknowledgements

I’d like to thank my mentor, Michael Spencer, for his invaluable help and guidance, as well as the GSoC and LLVM project admins for making this experience possible.

## Links to all PRs and RFCs

Over the course of this project, I’ve submitted the following PRs and RFCs:

**Project related**

* #156248 Add initial support for driver-managed module builds.
* #155450 Relocate previous work due to changes in the modules driver design.
* #149900 Adds scanner to detect C++20 module usage.
* #148674 Fixes a lexing error in the dependency scanning tooling’s scanner.
* #148685 Fixes a lexing error in the dependency scanning tooling’s scanner.
* #152811 Allow GraphWriter specialization required for DOT graph remark.
* #145857 Adds support for a test for `clang-scan-deps` on Windows.
* #145221 Adds C++20 named module outputs to the scanning format `experimental-full` to enable combined scanning of both module kinds.
* #143950 Implements P2223R2 for the dependency scanning tooling’s scanner.
* #142455 (NFC) Moves argument handling: Driver::BuildActions -> handleArguments
* #155523 (NFC) Removes dead code in the dependency scanning tooling.

**Misc. contributions**

* #145243 Implements P2223R2 for `clang-format`
* #141230 Fixes crash related to octal floating-point literals
* #139457 Fixes crash related to command line handling of source locations.

**RFCs**

* RFC: Support simple C++20 modules use from the Clang driver without a build system
* RFC: Link the Driver against clangDependencyScanning
blog.llvm.org
October 6, 2025 at 11:10 PM
GSoC 2025: Bfloat16 in LLVM libc
## Introduction

BFloat16 is a 16-bit floating-point format, introduced by Google and standardized in C++23 as `std::bfloat16_t`. It uses 1 sign bit, 8 exponent bits (the same as `float`), and 7 mantissa bits. This configuration lets BFloat16 represent a much wider dynamic range than IEEE `binary16` (a maximum finite value of roughly 3.4×10^38, compared to 65,504), though with lower precision. BFloat16 has become popular in AI and machine-learning use cases, where it offers significant performance advantages over IEEE `binary32` while approximately preserving its dynamic range.

The goal of this project was to implement the BFloat16 type in LLVM libc, along with the basic math functions like `fabsbf16`, `fmaxbf16`, etc. We also wanted all functions to be generic (platform-independent) and correctly rounded for all rounding modes.

## What was done

* The BFloat16 type was added to LLVM libc (`libc/src/__support/FPUtil/bfloat16.h`) #144463.
* All 70 expected basic math functions for `bfloat16` were implemented, using a generic approach that supports all architectures supported by LLVM libc (ARM, RISC-V, GPUs, x86, Darwin); see the table below.
* Two additional basic math functions were implemented: `iscanonicalbf16` and `issignalingbf16`.
* Higher math functions were implemented: `sqrtbf16` #156654 and `log_bf16` #157811 (open).
* Comparison operations for the `FPBits` class were added #144983.
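To make the format concrete, here is a small sketch of float-to-bfloat16 conversion. Because bfloat16 shares `float`’s sign and exponent bits, conversion amounts to keeping the top 16 bits of the `float` representation, with rounding on the discarded bits. This is an illustration only, not LLVM libc’s implementation: it rounds to nearest-even and does not special-case NaN payloads.

```cpp
#include <cstdint>
#include <cstring>

// Illustrative sketch: bfloat16 is the top half of a binary32.
// Round to nearest, ties to even, on the 16 discarded mantissa bits.
uint16_t floatToBFloat16(float F) {
  uint32_t Bits;
  std::memcpy(&Bits, &F, sizeof(Bits));
  uint32_t RoundingBias = 0x7FFF + ((Bits >> 16) & 1); // ties-to-even
  return static_cast<uint16_t>((Bits + RoundingBias) >> 16);
}

float bfloat16ToFloat(uint16_t H) {
  uint32_t Bits = static_cast<uint32_t>(H) << 16; // widen: low bits are zero
  float F;
  std::memcpy(&F, &Bits, sizeof(F));
  return F;
}
```

Widening back to `float` is exact, which is one reason the format is convenient: every bfloat16 value is also a `float` value.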
Basic Math Function | PR
--- | ---
`fabsbf16` | #148398
`ceilbf16`, `floorbf16`, `roundbf16`, `roundevenbf16`, `truncbf16` | #152352
`bf16add`, `bf16addf`, `bf16addl`, `bf16addf128`, `bf16sub`, `bf16subf`, `bf16subl`, `bf16subf128` | #152774
`fmaxbf16`, `fminbf16` | #152782
`bf16mul`, `bf16mulf`, `bf16mull`, `bf16mulf128` | #152847
`fmaximumbf16`, `fmaximum_magbf16`, `fmaximum_mag_numbf16`, `fmaximum_numbf16`, `fminimumbf16`, `fminimum_magbf16`, `fminimum_mag_numbf16`, `fminimum_numbf16` | #152881
`bf16div`, `bf16divf`, `bf16divl`, `bf16divf128` | #153191
`bf16fma`, `bf16fmaf`, `bf16fmal`, `bf16fmaf128` | #153231
`llrintbf16`, `llroundbf16`, `lrintbf16`, `lroundbf16`, `nearbyintbf16`, `rintbf16` | #153882
`fromfpbf16`, `fromfpxbf16`, `ufromfpbf16`, `ufromfpxbf16` | #153992
`nextafterbf16`, `nextdownbf16`, `nexttowardbf16`, `nextupbf16` | #153993
`getpayloadbf16`, `setpayloadbf16`, `setpayloadsigbf16` | #153994
`nanbf16` | #153995
`frexpbf16`, `ilogbbf16`, `ldexpbf16`, `llogbbf16`, `logbbf16` | #154427
`modfbf16`, `remainderbf16`, `remquobf16` | #154652
`canonicalizebf16`, `iscanonicalbf16`, `issignalingbf16`, `copysignbf16`, `fdimbf16` | #155567
`totalorderbf16`, `totalordermagbf16` | #155568
`scalbnbf16`, `scalblnbf16` | #155569
`fmodbf16` | #155575

The implementation status can be viewed on the libc `math.h` header implementation status page, which is updated regularly.

## What was not done

* The implementation used a generic approach and did not rely on the `__bf16` compiler intrinsic, as it is not available in all compiler versions. Our goal is to ensure that the type is supported by all compilers and versions supported by LLVM libc.
* Hardware optimizations provided by Intel’s AVX-512_BF16 were not utilized. These instructions only support round-to-nearest-even mode, always flush output denormals to zero, and treat input denormals as zero, which does not align with our goal. See the VCVTNE2PS2BF16 instruction description.
* ARMv9 SVE instructions were not utilized, as they are relatively new and not yet widely supported.
* Not all higher math functions were implemented, due to time constraints.

## Future Work

* Implement the remaining higher math functions.
* Perform performance comparisons with other libc implementations once their `bfloat16` support is available, and also with the CORE-MATH project.
* Update the test suite when the `mpfr_get_bfloat16` function becomes available.

## Acknowledgements

I would like to thank my mentors, Tue Ly and Nicolas Celik, for their invaluable guidance and support throughout this project. The project wouldn’t have been possible without them. I am also grateful to the LLVM Foundation and the GSoC admins for giving me this opportunity.
blog.llvm.org
October 6, 2025 at 11:11 PM
GSoC 2025: Improving Core Clang-Doc Functionality
I was selected as a contributor for GSoC 2025 under the project “Improving Core Clang-Doc Functionality” for LLVM. My mentors for the project were Paul Kirth and Petr Hosek.

Clang-Doc is a tool in clang-tools-extra that generates documentation from Clang’s AST and can output Markdown, HTML, YAML, and JSON. The project started in 2018, but major development eventually slowed. Recently, there have been efforts to get it back on track. This year, the GSoC project idea had a simple premise: improve core functionality.

# The Project

The project idea proposed three main areas of focus to improve documentation quality:

1. C++ support
2. Doxygen comments
3. Markdown support

First, not all C++ constructs were supported, like friends or concepts. Not supporting core C++ constructs in C++ documentation is not good. Second, it’s important that Doxygen command support is robust and that we support as many commands as possible. Lastly, having Markdown available to developers for documentation would be useful. Markdown provides the power of expression in an area that is technically dense; it can be used to highlight critical information and warnings.

# The Architecture

Here’s a quick overview of Clang-Doc’s architecture, which follows a map-reduce pattern:

1. Visit source declarations via Clang’s `ASTVisitor`.
2. Serialize relevant source information into an `Info` (Clang-Doc’s main data entity).
3. Write `Info`s into bitcode, reduce, and reread.
4. Serialize `Info`s into the desired format with a target backend.

The architecture seems straightforward at a glance, but Clang-Doc has critical flaws at step 4.

## The Bad

Clang-Doc has support for many formats. That sounds great in principle, but the backend pipeline’s execution made development extremely cumbersome.
Unlike in LLVM, Clang-Doc doesn’t have a framework like CodeGen that shares functionality across different targets. To document a `class`, every backend needs to independently implement logic to serialize the `class` into its target format. Each backend also has separate logic to write all of the documented entities to disk. There is also no IR in which `Info`s can be preprocessed, which means that any organizational preprocessing done in one backend can’t be shared.

Here’s the code for serializing the bases and virtual bases of a class in the HTML backend:

std::vector<std::unique_ptr<HTMLNode>> Parents =
    genReferenceList(I.Parents, I.Path);
std::vector<std::unique_ptr<HTMLNode>> VParents =
    genReferenceList(I.VirtualParents, I.Path);
if (!Parents.empty() || !VParents.empty()) {
  Out.emplace_back(std::make_unique<TagNode>(HTMLTag::TAG_P));
  auto &PBody = Out.back();
  PBody->Children.emplace_back(std::make_unique<TextNode>("Inherits from "));
  if (Parents.empty())
    appendVector(std::move(VParents), PBody->Children);
  ...

Here’s the same logic in Markdown:

std::string Parents = genReferenceList(I.Parents);
std::string VParents = genReferenceList(I.VirtualParents);
if (!Parents.empty() || !VParents.empty()) {
  if (Parents.empty())
    writeLine("Inherits from " + VParents, OS);
  else if (VParents.empty())
    writeLine("Inherits from " + Parents, OS);
  else
    writeLine("Inherits from " + Parents + ", " + VParents, OS);
  writeNewLine(OS);
}
...

You can see how differently the two backends need to handle these constructs, which makes it complicated to maintain feature parity. The HTML tag creation being so tightly coupled to the documentation serialization also highlights another problem: formatting bugs are difficult to identify.
This lack of generic handling imposed a very high maintenance cost for extending basic functionality. It also easily led to backend disparity: a construct might be serialized in YAML, but not in Markdown. Changes to how a documentation entity was handled would not be uniform across backends.

Testing was also in an awkward spot. If not all backends were guaranteed to generate the same documentation, which could be trusted as the source of truth? YAML was originally meant to serve this role, but it suffered from feature disparity. It’s a cumbersome process to implement support for a construct in YAML, verify it there, and then also go implement it in HTML. There’s a logical disconnect: what’s serialized in YAML isn’t guaranteed to be reflected in HTML, so what is the benefit of updating YAML if my documentation is shown through HTML?

## The Good

The good news is that Clang-Doc’s recent improvements had brought in changes that could rectify these problems, with a bit more work. Last year’s GSoC brought in great improvements that became the basis of my summer. First, last year’s GSoC contributor landed a large performance improvement; I might not have been able to test Clang-Doc on Clang itself without it.

The same contributor authored the Mustache template engine implementation in LLVM. Mustache templates allow Clang-Doc to shift away from manually generating HTML tags and eliminate high maintenance burdens. Templates could also solve the feature parity problem by using JSON to feed the templates. This was a huge part of my summer and allowed me to bring in great improvements that make Clang-Doc more flexible and easier to contribute to.

# Building a JSON Backend

While studying the codebase during the Community Bonding Period, I determined that creating a separate JSON backend would be extremely helpful. A JSON backend presented two immediate benefits:

1. We could use it to feed our Mustache HTML templates and future template usage.
2.
As the main feeder format, testing can be focused on the JSON output.

The existing Mustache backend in Clang-Doc already contained logic to create JSON documents, but they were immediately discarded once the templates were rendered. This backend is extremely beneficial to Clang-Doc because it completely eliminates any need for manual HTML tag generation, greatly reducing lines of code. If the JSON and template-rendering logic from the existing implementation were decoupled, we could apply the same pattern to any format we’d want to support. Markdown generation would be a similar case, where templates would automate the creation of all markup syntax.

This diagram models the architecture that Clang-Doc would follow given a unified JSON backend. Note the similarities to Clang, where our frontend (the visitation/serialization) gathers all the information we need and emits an intermediate representation (JSON). The JSON is then fed to the desired templates to produce our documentation, similar to how LLVM IR is used for different LLVM backends. Following this pattern would reduce the maintenance burden to only the JSON generation; all the formatting for HTML, Markdown, etc. would live in template files that are very simple to change, neatly separating documentation logic from display/formatting logic. Also note how much more streamlined it is compared to the previous diagram, where serialization logic was spread among Clang-Doc’s backends.

Thus, I adapted the JSON logic from the Mustache backend and created a separate JSON backend. I also added tests to ensure the C++ constructs that Clang-Doc already supported were properly serialized in JSON. I didn’t realize it at the time, but this would end up dramatically accelerating my pace of implementation. I was especially pleased with the timeframe of this feature, since I had no plans at all to work on it when submitting my proposal.
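The backend split can be illustrated with a toy sketch. The struct and function names here are invented for illustration and are not Clang-Doc’s actual API: the point is that one function serializes a record to JSON once, and any number of templates can render that JSON, so per-format C++ code like the HTML and Markdown snippets shown earlier is no longer needed.

```cpp
#include <string>
#include <vector>

// Toy model of a documented record: the "IR" a template consumes.
struct RecordDoc {
  std::string Name;
  std::vector<std::string> Parents;
};

// Single serialization point; HTML/Markdown formatting lives in
// templates that read this JSON, not in backend-specific C++.
std::string toJSON(const RecordDoc &R) {
  std::string Out = "{\"Name\":\"" + R.Name + "\",\"Parents\":[";
  for (size_t I = 0; I < R.Parents.size(); ++I) {
    if (I)
      Out += ",";
    Out += "\"" + R.Parents[I] + "\"";
  }
  return Out + "]}";
}
```

In the real tool, this role is filled by the JSON generator plus LLVM’s Mustache engine; the sketch just shows why the per-backend duplication disappears.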
## C++ Language Support and Testing

After landing the JSON generator in about a week, I returned to my proposed schedule by implementing support for missing C++ constructs. The new JSON generator allowed me to quickly implement and test these features because I didn’t have to worry about HTML formatting or appearance. I could work with the assumption that as long as the information was properly serialized into JSON, it could be displayed well in HTML later.

Testing is an area the JSON backend brought a lot of clarity to. Clang-Doc didn’t have a format in which all the information we wanted to document was validated. At one time, YAML was meant to be that format, but it suffered from feature disparity since it wasn’t relevant when something needed to be displayed in HTML. If we used HTML instead, there was a lot of other data (tags, indentation, classes, IDs) that would need to be validated alongside the construct. Testing the documentation and testing the displayed content are two different tasks. I ended up adding 14 different test files over the course of the summer to ensure test coverage.

### Pull Requests

* add tags to Mustache namespace template
* add a JSON generator
* add namespaces to JSON generator
* removed default label on some switches
* precommit and add support for concepts
* precommit and document global variables
* serialize isBuiltIn and IsTemplate
* precommit and serialize friends

# Comments

## Groups and Order

Comments weren’t ordered in Clang-Doc’s HTML documentation. They were just displayed in whatever order they were serialized in, which is the order they’re written in source. This made comments extremely difficult to read: you don’t want to search for another parameter comment after reading the first one, even if they’re expected to be written in order in source.
Funnily enough, Mustache made this a little more complicated. The only logic operation Mustache has to check whether a field exists is an iteration like `{{#Fields}}`, but any header that denotes a comment section would be duplicated. In

{{#Fields}}
<h3>Field Header</h3>
{{FieldInfo}}
{{/Fields}}

the `<h3>` header would be duplicated for every iteration over `Fields`. If the header were outside the iteration, it would be displayed even if there weren’t any elements in `Fields`. All of the logic to order comments needs to happen during the serialization to JSON itself, so I had to overhaul our comment organization.

Previously, Clang-Doc’s comments were organized exactly as in Clang’s AST, like the following:

* `FullComment`
  * `BriefComment`
    * `ParagraphComment`
      * `TextComment`
      * `TextComment`
  * `BriefComment`
    * `ParagraphComment`

Everything was unnecessarily nested under a `FullComment`, and `TextComment`s were also unnecessarily nested. Every non-verbatim comment’s text was held in one `ParagraphComment`. Since there was only one, we could reduce some boilerplate by mapping directly to the array of `TextComment`s. After the change, Clang-Doc’s comments were structured like this:

* `BriefComments`
  * `TextCommentArray`
  * `TextCommentArray`
* `ParagraphComments`
  * `TextCommentArray`

Now, we can just iterate over every type of comment, which means iterating over every array. This left our JSON documentation with a few more fields, since one is needed for every Doxygen command, but with easier identification of which comments exist in the documentation. After this refactor landed, I implemented support for the comments we had already supported and for ones we didn’t, like Doxygen code comments.

## Reaping the benefits of JSON

This was an area where the JSON backend really accelerated my progress. Without it, I would’ve written the same JSON logic but written tests for HTML output. This meant that I would’ve had to: 1.
Add the appropriate templating language to allow the comments to render. 2. Add the correct HTML tags to allow the test to pass.

As I mentioned, comments weren’t being generated well in HTML anyway, so I could’ve run into more annoyances if I had to follow that workflow. Instead, I could just write some really simple JSON.

### Pull Requests

Here are the pull requests I made during this phase of the project:

* add namespace references to VarInfo
* fix ASan complaints from passing RepositoryURL as reference
* integrate JSON as the source for Mustache templates
* separate comments into categories
* enable comments in class templates
* remove nesting of text comments inside paragraphs
* generate comments for functions
* add param comments to comment template
* add return comments to comment template
* add code comments to comment template

# Markdown

Markdown was the most speculative aspect of the project. It wasn’t clear whether we’d try to integrate a solution into Clang itself or keep it in clang-tools-extra.

## A JavaScript Solution

The first option I explored was suggested by my mentor: a JavaScript library called Markdown-Tag. This would’ve been really convenient, since all it requires is an HTML tag to enable rendering, so any comment text in a template could be easily rendered. Unfortunately, it requires all HTML to be sanitized, which defeats the purpose of a ready-made solution for us; we would have to parse any potential HTML in comments anyway.

## A Parser Solution

Without an out-of-the-box solution, we were left with implementing our own parser. When I considered this in my proposal, I knew an in-tree parser would want to conform to the simplest possible standard. Markdown has no official standard, so I opted for CommonMark conformance.
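Even the simplest Markdown construct hints at the work involved. Here is a toy sketch, nowhere near CommonMark conformance and not part of the actual parser draft, that renders only inline code spans; it illustrates the delimiter-matching a real parser must do for every construct, including nesting and unterminated spans.

```cpp
#include <string>

// Toy illustration: turn `code` spans into <code>...</code>.
// A conformant parser must also handle backslash escapes, multi-
// backtick delimiters, emphasis, links, blocks, and raw HTML.
std::string renderInlineCode(const std::string &Text) {
  std::string Out;
  bool InCode = false;
  for (char C : Text) {
    if (C == '`') {
      Out += InCode ? "</code>" : "<code>";
      InCode = !InCode;
    } else {
      Out += C;
    }
  }
  if (InCode)
    Out += "</code>"; // close an unterminated span
  return Out;
}
```

CommonMark specifies precise resolution rules for each of these cases, which is exactly why a standard was worth targeting.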
The summer ended without a complete solution, since a couple of weeks were spent researching whether this could be integrated directly into Clang’s comment parser or whether we’d need to build our own solution. You can see my initial draft here. I plan to continue working on this parser and landing it in Clang-Doc.

# Refactors, Name Mangling, and More!

During my summer, I would stumble into places where I would think “this could be better,” and my mentors usually agreed. Thus, there were a few patches where I dedicated time to general refactors to improve code reuse and hopefully make the lives of future contributors much easier than what I had to go through. In fact, one of my best refactors was of the JSON generator that I wrote, which my mentor noted had a lot of room for code reuse. Future me was extremely thankful for the easy-to-use functions I had added.

## Bitcode Refactor

The bitcode read/write operations contained highly repetitive code. Adding something to the documentation, like serializing `const` for a function, required several copy-pastes in several locations. It was structured like so:

case BI_MEMBER_TYPE_BLOCK_ID: {
  MemberTypeInfo TI;
  if (auto Err = readBlock(ID, &TI))
    return Err;
  if (auto Err = addTypeInfo(I, std::move(TI)))
    return Err;
  return llvm::Error::success();
}

`addTypeInfo` is specific to `MemberTypeInfo`, so every other type of `Info` would need to call its own function: hence the highly repetitive code. I refactored that block to this:

return handleTypeSubBlock<TypeInfo>(ID, I, CreateAddFunc(addTypeInfo<T, TypeInfo>));

`handleTypeSubBlock` contains the same logic as the previous block, but it calls a generic `Function`. All of this was achieved without compromising the performance of documentation generation.
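The shape of that refactor can be sketched generically. The names below are toy stand-ins, not Clang-Doc’s actual types or signatures: the idea is simply that one templated handler replaces the per-`Info` copy-pasted blocks, with a kind-specific callback supplying the only varying piece.

```cpp
#include <functional>
#include <string>

// Toy stand-in for a decoded bitcode sub-block.
struct ParsedBlock {
  std::string Payload;
};

// Toy Info kind; in the sketch, each kind just wraps a string.
struct TypeInfo {
  std::string Name;
};

// One generic handler: "read" the block into an InfoT, then hand it
// to the kind-specific add function. Each Info kind reuses this
// instead of duplicating the read/check/add boilerplate.
template <typename InfoT>
bool handleTypeSubBlock(const ParsedBlock &B,
                        const std::function<bool(InfoT)> &Add) {
  InfoT Parsed{B.Payload}; // stand-in for readBlock(...)
  return Add(std::move(Parsed));
}
```

The real implementation additionally threads `llvm::Error` through, but the deduplication idea is the same.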
## Mangling Filenames

Clang-Doc had a bug stemming from non-unique filenames. The YAML backend avoided this problem because its filenames were SymbolIDs, but that meant the lit tests had to use regex to find the file for FileCheck. Nasty.

In HTML and JSON, the filenames for classes were just the class name. If you had a template specialization, this caused problems. In HTML, we’d get duplicate HTML, resulting in wonky web pages. In JSON, we’d get a fatal error from the JSON parser, since there were two sets of top-level braces. I used `ItaniumMangleContext` to generate mangled names we could use for the filenames.

## Pull Requests

Here are the pull requests I made for refactors during the project:

* Serialize record files with mangled name
* refactor BitcodeReader::readSubBlock
* refactor JSONGenerator array usage
* refactor JSON for better Mustache compatibility

# Overview

* I implemented a new JSON generator that will serve as the basis for Clang-Doc’s documentation generation. This will vastly reduce overall lines of code and maintenance burdens.
* I added many tests to increase code coverage and ensure we are serializing all the information necessary for high-quality documentation.
* I refactored our comment handling to streamline the logic and improve the HTML output.
* I explored options for rendering Markdown and began an implementation of a parser that I plan to keep working on in the future.
* Along the way, I also did some refactoring to improve code reuse and reduce maintenance burdens by cutting boilerplate code.
After my work this summer, Clang-Doc is nearly ready to switch to HTML generation via Mustache templates, which will be a huge milestone. It is backed by the JSON generator, which allows for a much more flexible architecture that will change how we generate other documentation formats, like our existing Markdown backend. All of this was achieved without compromising the performance of documentation generation. I also hope that future contributors have an easier time than I did learning about and working with Clang-Doc; the threshold for contributing was high due to a disjointed architecture.

I'm also very excited to present my work and showcase Clang-Doc at this year's LLVM Dev Meeting during a technical talk. This is the first time I'll be presenting at a conference, and I didn't expect to have the opportunity when I started at the beginning of the summer.

Over the summer, I addressed these issues:

* template operator T() produces a bad name
* Add a JSON backend to clang-doc to better leverage mustache templates
* Reconsider use of enum InfoType::IT_default
* Add a JSON backend to clang-doc to better leverage mustache templates

# Future Work

These are issues that I identified over the summer that I wasn't able to address but that would benefit from community discussion and contribution.

## Doxygen Grouping

Doxygen has a very useful grouping feature that allows structures to be grouped under a custom heading or on separate pages. You can see it in llvm::sys::path. We opened an issue for Clang to track this, which ended up being a duplicate of an existing issue.
There would most likely have to be some major changes to Clang's comment parsing and Clang's own parsing. That's because a lot of the group-opening tokens in Clang are free-floating, like so:

```cpp
/// @{
class Foo {};
```

That `@{` doesn't attach to a Decl; only comments directly above a declaration are attached to a Decl in the AST. My mentors wisely advised that this would be too much to even consider this summer, and could probably be its own GSoC project.

## Cross-referencing

In Doxygen, you can use the `@copydoc` command to copy the documentation from one entity to another. Doxygen also displays where an entity is referenced, like where a function is invoked. Clang-Doc currently has no support for this kind of behavior.

Clang-Doc would need a preprocessing step where any reference to another entity is identified and then resolved. One of my mentors pointed out that this would be best done during the reduction step, where every `Info` is being visited anyway. This wasn't something I had considered in my proposal beyond identifying that `@copydoc` wasn't supported by the comment parser. It's a common feature of modern documentation, so hopefully Clang-Doc can acquire it soon.

# Acknowledgements

Thank you very much to my mentors, Paul Kirth and Petr Hosek, for guiding and advising me in this project. I learned so much from review feedback and our conversations. I would not be presenting at the Dev Meeting if not for the encouragement you both gave.
blog.llvm.org
September 29, 2025 at 11:09 PM
GSoC 2025 - Byte Type: Supporting Raw Data Copies in the LLVM IR
This summer I participated in GSoC under the LLVM Compiler Infrastructure. The goal of the project was to add a new byte type to the LLVM IR, capable of representing raw memory values. This new addition enables the native implementation of memory-related intrinsics in the IR, including `memcpy`, `memmove` and `memcmp`, fixes existing unsound transformations, and enables new optimizations, all with a minimal performance impact.

# Background

One of LLVM's longstanding problems is the absence of a type capable of representing raw memory values. Currently, memory loads of raw bytes are performed through an appropriately sized integer type. However, integers are incapable of representing an arbitrary memory value. Firstly, they do not retain pointer provenance information, rendering them unable to fully specify the value of a pointer. Secondly, loading memory values containing `poison` bits through an integer type taints the loaded value, as integer values are either `poison` or fully defined, with no way to represent individual `poison` bits.

Source languages such as C¹ and C++² provide proper types to inspect and manipulate raw memory. These include `char`, `signed char` and `unsigned char`. C++17 introduced the `std::byte` type, which offers similar raw memory access capabilities but does not support arithmetic operations. Currently, Clang lowers these types to the `i8` integer type, which does not accurately model their raw memory access semantics, leading to miscompilations such as the one reported in bug report 37469.
The absence of a similar type in the LLVM IR hinders the implementation of memory-related intrinsics such as `memcpy`, `memmove` and `memcmp`, and introduces additional friction when loading and converting memory values to other types, leading to implicit conversions that are hard to identify and reason about. The two core problems stemming from the absence of a proper type to access and manipulate raw memory, directly addressed by the byte type and explored throughout the remainder of this section, are summarized as follows:

1. Integers do not track provenance, rendering them incapable of representing a pointer.
2. Loads through integer types spread `poison` values, which taints the load result if the loaded value contains at least one `poison` bit (as occurs with padded values).

## Pointer Provenance

According to the LLVM Language Reference, pointers track provenance, which is _the ability to perform memory accesses through the pointer, in the sense of the pointer aliasing rules_. The main goal of tracking pointer provenance is to simplify alias analysis, yielding more precise results, which enables high-level optimizations.

Integers, unlike pointers, do not capture provenance information, being solely characterized by their numerical value. Therefore, loading a pointer through an integer type discards the pointer's provenance. This is problematic, as such loads can cause pointer escapes that go unnoticed by alias analysis. Once alias analysis is compromised, simple optimizations that rely on the absence of aliasing become invalid, compromising the correctness of the whole compilation process.
Currently, Alive2 defines the result of loading a pointer value through an integer type as `poison`. This implies that loads through integer types fail to accurately recreate the original memory value, hindering pointer copies via integer types. In the following example, storing a pointer to memory and loading it through the `i64` type yields `poison`, invalidating the transformation.

```llvm
define ptr @src(ptr %ptr, ptr %v) {
  store ptr %v, ptr %ptr
  %l = load ptr, ptr %ptr
  ret ptr %l
}

define ptr @tgt(ptr %ptr, ptr %v) {
  store ptr %v, ptr %ptr
  %l = load i64, ptr %ptr      ; poison
  %c = inttoptr i64 %l to ptr  ; poison
  ret ptr %c                   ; poison
}
```

## Undefined Behavior

LLVM's `poison` value is used to represent unspecified values, such as padding bits. Loading such memory values through an integer type propagates `poison` values, as integer types are either `poison` or have a fully-defined value, not providing enough granularity to represent individual `poison` bits. This hinders the copying of padded values.

Moreover, this lack of granularity can lead to subtle issues that are often overlooked. The LLVM Language Reference defines the `bitcast` instruction as a _no-op cast because no bits change with this conversion_. Nonetheless, while scalar types are either `poison` or have a fully-defined value, vector types in LLVM track `poison` values on a per-lane basis. This introduces potential pitfalls when casting vector types to non-vector types, as the cast operation can inadvertently taint non-`poison` lanes. In the following example, considering the first lane of `%v` to be `poison`, the result of casting the vector to an `i64` value is `poison`, regardless of the value of the second lane.
```llvm
define i64 @ub(ptr %ptr) {
  %v = load <2 x i32>, ptr %ptr     ; <i32 poison, i32 42>
  %c = bitcast <2 x i32> %v to i64  ; i64 poison
  ret i64 %c
}
```

Although covered by the Language Reference ("_the [bitcast] conversion is done as if the value had been stored to memory and read back as [the destination type]_"), this duality in the value representation between vector and scalar integer types constitutes a corner case that is not widely contemplated and often unnecessarily introduces undefined behavior.

# Implementing the Byte Type

Back in 2021, a GSoC project with a similar goal produced a working prototype of the byte type. This prototype introduced the byte type to the IR, lowered C and C++'s raw memory access types to the byte type, and implemented some optimizations over the new type.

The current project began by porting these patches to the latest version of LLVM, adapting the code to support the newly introduced opaque pointers. As the work progressed and new challenges emerged, the original proposal was iteratively refined. The implementation of the byte type in LLVM and Alive2 can be found here and here, respectively.

## Byte Type

The byte type is a first-class single-value type, with the same size and alignment as the equivalently sized integer type. Memory loads through the byte type yield the value's raw representation, without introducing any implicit casts. This allows the byte type to represent both pointer and non-pointer values.

Additionally, the byte type is equipped with the necessary granularity to represent `poison` values at the bit level, such that loads of padded values through the byte type do not taint the loaded value. As a consequence, a `bitcast` between vector and scalar byte types preserves the raw byte value. In the following example, a `poison` lane does not taint the cast result, unlike with equivalently sized integer types.
```llvm
define b64 @f(ptr %ptr) {
  %v = load <2 x b32>, ptr %ptr
  %c = bitcast <2 x b32> %v to b64
  ret b64 %c
}
```

These two properties of the byte type directly address the aforementioned problems, enabling the implementation of a user-defined `memcpy` in the IR, as shown in the following example. In a similar manner, a native implementation of `memmove` can be achieved.

```llvm
define ptr @my_memcpy(ptr %dst, ptr %src, i64 %n) {
entry:
  br label %for.cond

for.cond:
  %i = phi i64 [ 0, %entry ], [ %inc, %for.body ]
  %cmp = icmp ult i64 %i, %n
  br i1 %cmp, label %for.body, label %for.end

for.body:
  %arrayidx = getelementptr inbounds b8, ptr %src, i64 %i
  %byte = load b8, ptr %arrayidx
  %arrayidx1 = getelementptr inbounds b8, ptr %dst, i64 %i
  store b8 %byte, ptr %arrayidx1
  %inc = add i64 %i, 1
  br label %for.cond

for.end:
  ret ptr %dst
}
```

The newly implemented type also fixes existing optimizations. Previously, InstCombine lowered small calls to `memcpy` and `memmove` into integer load/store pairs. For the aforementioned reasons, this lowering is unsound. By using byte load/store pairs instead, the transformation, as shown in the following example, is now valid.

```llvm
; Before:
define void @my_memcpy(ptr %dst, ptr %src) {
  call void @llvm.memcpy(ptr %dst, ptr %src, i64 8)
  ret void
}

define void @my_memmove(ptr %dst, ptr %src) {
  call void @llvm.memmove(ptr %dst, ptr %src, i64 8)
  ret void
}

; After:
define void @my_memcpy(ptr %dst, ptr %src) {
  %l = load b64, ptr %src
  store b64 %l, ptr %dst
  ret void
}

define void @my_memmove(ptr %d, ptr %s) {
  %l = load b64, ptr %s
  store b64 %l, ptr %d
  ret void
}
```

SROA performs a similar transformation, lowering `memcpy` calls to integer load/store pairs. Similarly, this optimization pass was changed to use byte load/store pairs, as depicted in the following example.
```llvm
define void @src(ptr %a, ptr %b) {
  %mem = alloca i8
  call void @llvm.memcpy(ptr %mem, ptr %a, i32 1)
  call void @llvm.memcpy(ptr %a, ptr %mem, i32 1)
  ret void
}

define void @tgt(ptr %a, ptr %b) {
  %mem.copyload = load b8, ptr %a
  store b8 %mem.copyload, ptr %a
  ret void
}
```

## Bytecast Instruction

Byte values can be reinterpreted as values of other primitive types. This is achieved through the `bytecast` instruction. This cast instruction comes in two flavors, either allowing or disallowing type punning. Considering that a byte might hold a pointer or a non-pointer value, the `bytecast` follows these semantics:

* A vanilla `bytecast`, distinguished by the absence of the `exact` flag, is used to cast a byte to any other primitive type, allowing type punning. More precisely,
  * If the type of the value held by the byte matches the destination type of the cast, it is a no-op.
  * Otherwise, the cast operand undergoes a conversion to the destination type, converting pointers to non-pointer values and vice-versa, respectively wrapping a `ptrtoint` or `inttoptr` cast.
* A `bytecast` with the `exact` flag succeeds if the type of the value held by the byte and the destination type are both pointer or both non-pointer types. More specifically,
  * If the type of the value held by the byte matches the destination type of the cast, it is a no-op.
  * Otherwise, the result is `poison`, preventing type punning between pointer and non-pointer values.

The `exact` version of the `bytecast` mimics the reinterpretation of a value as if it had been stored to memory and loaded back through the cast destination type. This is aligned with the semantics adopted by the `bitcast` instruction, which "_is done as if the value had been stored to memory and read back as [the destination type]_", enabling store-to-load forwarding optimizations, such as the one depicted in the next example.
```llvm
define i8 @src(b8 %x) {
  %a = alloca b8
  store b8 %x, ptr %a
  %v = load i8, ptr %a
  ret i8 %v
}

define i8 @tgt(b8 %x) {
  %cast = bytecast exact b8 %x to i8
  ret i8 %cast
}
```

## Memcmp Lowering

The standard version of the `bytecast` enables the implementation of `memcmp` in the IR. Currently, calls to `memcmp` of small sizes are lowered to integer loads, followed by a subtraction comparing the two loaded values. Due to the aforementioned problems, this lowering is unsound. Loading the two memory values as bytes is insufficient, as comparisons between bytes are left undefined, so as to avoid overloading the IR by supporting comparisons between pointers and provenance-unaware values. To that end, the version of the `bytecast` which performs type punning is used, forcefully converting possible pointer values into their integer representation. The two values, then converted to integers, can be compared as before. The following example depicts the previous and new lowerings of a `memcmp` of 1 byte.

```llvm
define i32 @before(ptr %p, ptr %q) {
  %lhsc = load i8, ptr %p
  %lhsv = zext i8 %lhsc to i32
  %rhsc = load i8, ptr %q
  %rhsv = zext i8 %rhsc to i32
  %chardiff = sub i32 %lhsv, %rhsv
  ret i32 %chardiff
}

define i32 @after(ptr %p, ptr %q) {
  %lhsb = load b8, ptr %p
  %lhsc = bytecast b8 %lhsb to i8
  %lhsv = zext i8 %lhsc to i32
  %rhsb = load b8, ptr %q
  %rhsc = bytecast b8 %rhsb to i8
  %rhsv = zext i8 %rhsc to i32
  %chardiff = sub i32 %lhsv, %rhsv
  ret i32 %chardiff
}
```

## Load Widening

A common optimization performed by LLVM is to widen memory loads when lowering calls to `memcmp`. The previously proposed lowering falls short in the presence of such optimizations. While using a larger byte type to load the memory value preserves its raw value, the `bytecast` to an integer type yields `poison` if any of the loaded bits are `poison`. This is problematic, as the extra bits covered by the widened load could assume any value or even be uninitialized. As such, when performing load widening, the lowering depicted in the next example is performed. The `!uninit_is_nondet` metadata, proposed in the RFC proposing that uninitialized memory loads return `poison`, converts any `poison` bits to a non-deterministic value, preventing the `bytecast` to an integer type from yielding `poison`.

```llvm
define i32 @src(ptr %x, ptr %y) {
  %call = tail call i32 @memcmp(ptr %x, ptr %y, i64 2)
  ret i32 %call
}

define i32 @tgt(ptr %x, ptr %y) {
  %1 = load b16, ptr %x, !uninit_is_nondet
  %2 = load b16, ptr %y, !uninit_is_nondet
  %3 = bytecast b16 %1 to i16
  %4 = bytecast b16 %2 to i16
  %5 = call i16 @llvm.bswap.i16(i16 %3)
  %6 = call i16 @llvm.bswap.i16(i16 %4)
  %7 = zext i16 %5 to i32
  %8 = zext i16 %6 to i32
  %9 = sub i32 %7, %8
  ret i32 %9
}
```

## Casts, Bitwise and Arithmetic Operations

Values of other primitive types can be cast to the byte type using the `bitcast` instruction, as shown in the following example.

```llvm
%1 = bitcast i8 %val to b8
%2 = bitcast i64 %val to b64
%3 = bitcast ptr to b64  ; assuming pointers to be 64 bits wide
%4 = bitcast <8 x i8> to <8 x b8>
```

Furthermore, bytes can also be truncated, enabling store-to-load forwarding optimizations such as the one presented in the next example. Performing an exact `bytecast` to `i32`, followed by a `trunc` to `i8` and a `bitcast` to `b8`, would be unsound: if any of the unobserved bits of the byte value were `poison`, the `bytecast` would yield `poison`, invalidating the transformation.
```llvm
define b8 @src(b32 %x) {
  %a = alloca b32
  store b32 %x, ptr %a
  %v = load b8, ptr %a
  ret b8 %v
}

define b8 @tgt(b32 %x) {
  %trunc = trunc b32 %x to b8
  ret b8 %trunc
}
```

Due to the cumbersome semantics of performing arithmetic on provenance-aware values, arithmetic operations on the byte type are disallowed. Bitwise binary operations are also disallowed, with the exception of logical shift right. This instruction enables store-to-load forwarding optimizations with offsets, such as the one performed in the following example. To rule out sub-byte accesses, its use is restricted to shift amounts that are multiples of 8.

```llvm
define i8 @src(b32 %x) {
  %a = alloca b32
  %gep = getelementptr i8, ptr %a, i64 2
  store b32 %x, ptr %a
  %v = load i8, ptr %gep
  ret i8 %v
}

define i8 @tgt(b32 %x) {
  %shift = lshr b32 %x, 16
  %trunc = trunc b32 %shift to b8
  %cast = bytecast exact b8 %trunc to i8
  ret i8 %cast
}
```

## Value Coercion Optimizations

Some optimization passes perform transformations that are unsound under the premise that type punning is disallowed. One such optimization pass is GVN, which performs value coercion in order to eliminate redundant loads. Currently, a class of optimizations where a pointer load is coerced to a non-pointer value, or a non-pointer load is coerced to a pointer value, is reported as unsound by Alive2.

The following example illustrates one such optimization, in which GVN replaces the pointer load at `%v3` with a phi node, merging the pointer load at `%v2` with the coerced value at `%1`, resulting from an `inttoptr` cast. If the value stored in memory is a pointer, the source function returns the pointer value, while, in the target function, the load at `%v1` returns `poison`.

```llvm
declare void @use(...) readonly

define ptr @src(ptr %p, i1 %cond) {
  br i1 %cond, label %bb1, label %bb2
bb1:
  %v1 = load i64, ptr %p
  call void @use(i64 %v1)
  %1 = inttoptr i64 %v1 to ptr
  br label %merge
bb2:
  %v2 = load ptr, ptr %p
  call void @use(ptr %v2)
  br label %merge
merge:
  %v3 = load ptr, ptr %p
  ret ptr %v3
}

define ptr @tgt(ptr %p, i1 %cond) {
  br i1 %cond, label %bb1, label %bb2
bb1:
  %v1 = load i64, ptr %p
  call void @use(i64 %v1)
  %1 = inttoptr i64 %v1 to ptr
  br label %merge
bb2:
  %v2 = load ptr, ptr %p
  call void @use(ptr %v2)
  br label %merge
merge:
  %v3 = phi ptr [ %v2, %bb2 ], [ %1, %bb1 ]
  ret ptr %v3
}
```

The byte type can be leveraged to avoid the implicit type punning that hinders this kind of optimization, as depicted in the following example. Since the byte type can represent both pointer and non-pointer values, the loads at `%v1` and `%v2` can instead be performed using the byte type. The `bytecast` instruction is then used to convert the byte into the desired type. As the load through the byte type accurately models the loaded value, avoiding implicit casts, the `bytecast` yields the pointer stored in memory. This value can then be used to replace the load at `%v3`.

```llvm
declare void @use(...) readonly

define ptr @src(ptr %p, i1 %cond) {
  br i1 %cond, label %bb1, label %bb2
bb1:
  %v1 = load i64, ptr %p
  call void @use(i64 %v1)
  %1 = inttoptr i64 %v1 to ptr
  br label %merge
bb2:
  %v2 = load ptr, ptr %p
  call void @use(ptr %v2)
  br label %merge
merge:
  %v3 = load ptr, ptr %p
  ret ptr %v3
}

define ptr @tgt(ptr %p, i1 %cond) {
  %load = load b64, ptr %p
  br i1 %cond, label %bb1, label %bb2
bb1:
  %v1 = bytecast exact b64 %load to i64
  call void @use(i64 %v1)
  %1 = bytecast exact b64 %load to ptr
  br label %merge
bb2:
  %v2 = bytecast exact b64 %load to ptr
  call void @use(ptr %v2)
  br label %merge
merge:
  %v3 = phi ptr [ %v2, %bb2 ], [ %1, %bb1 ]
  ret ptr %v3
}
```

## Other Optimizations

Additional optimizations were also implemented. While these do not affect program correctness, they do contribute to performance improvements. Some of them include cast-pair eliminations and the combining of load and `bytecast` pairs with a single use, depicted in the following examples.

```llvm
define b32 @src_float(b32 %b) {
  %1 = bytecast exact b32 %b to float
  %2 = bitcast float %1 to b32
  ret b32 %2
}

define i8 @src_int(i8 %i) {
  %b = bitcast i8 %i to b8
  %c = bytecast exact b8 %b to i8
  ret i8 %c
}

define b32 @tgt_float(b32 %b) {
  ret b32 %b
}

define i8 @tgt_int(i8 %i) {
  ret i8 %i
}
```

```llvm
define i8 @src(ptr %p) {
  %b = load b8, ptr %p
  %c = bytecast exact b8 %b to i8
  ret i8 %c
}

define i8 @tgt(ptr %p) {
  %i = load i8, ptr %p
  ret i8 %i
}
```

## Clang

Given the raw memory access capabilities of the byte type, Clang was altered to lower C and C++'s raw memory access types to the byte type. These include `char`, `signed char`, `unsigned char` and `std::byte`. The new lowerings are depicted in the next example.
```c
void foo(
    unsigned char arg1,
    char arg2,
    signed char arg3,
    std::byte arg4);
```

```llvm
void @foo(
    b8 zeroext %arg1,
    b8 signext %arg2,
    b8 signext %arg3,
    b8 zeroext %arg4);
```

Additionally, code generation was updated to insert missing `bytecast` instructions where integer values were previously expected, such as in arithmetic and comparison operations involving character types. The next example depicts a function in C adding two `char` values, and the corresponding lowering to LLVM IR as performed by Clang.

```c
char sum(char a, char b) {
  return a + b;
}
```

```llvm
define b8 @sum(b8 %a, b8 %b) {
  %conv = bytecast exact b8 %a to i8
  %conv1 = sext i8 %conv to i32
  %conv2 = bytecast exact b8 %b to i8
  %conv3 = sext i8 %conv2 to i32
  %add = add nsw i32 %conv1, %conv3
  %conv4 = trunc i32 %add to i8
  %res = bitcast i8 %conv4 to b8
  ret b8 %res
}
```

## Summary

In summary, the byte type contributes the following changes and additions to the IR:

* **Raw memory representation:** Optimization passes can use the byte type to represent raw memory values, avoiding the introduction of implicit casts and treating both pointer and non-pointer values uniformly.
* **Bit-level `poison` representation:** The byte type provides the necessary granularity to represent individual `poison` bits, providing greater flexibility than integer types, which either have a fully-defined value or are tainted by `poison` bits.
* **`bitcast` instruction:** This instruction allows conversions from other primitive types to equivalently sized byte types. Casts between vector and scalar byte types do not taint the cast result in the presence of `poison` lanes, as occurs with integer types.
* **`bytecast` instruction:** This instruction enables the conversion of byte values to other primitive types. The standard version of the cast performs type punning, reinterpreting pointers as integers and vice-versa. The `exact` flag disallows type punning by returning `poison` if the type of the value held by the byte does not match the cast destination type.
* **`trunc` and `lshr` instructions:** The `trunc` and `lshr` instructions accept byte operands, behaving similarly to their integer counterparts. The latter only accepts shift amounts that are multiples of 8, ruling out sub-byte accesses.

# Results

## Benchmarks

The implementation was evaluated using the Phoronix Test Suite automated benchmarking tool, from which a set of 20 C/C++ applications, listed below, was selected.

| **Benchmark** | **Version** | **LoC** | **Description** |
|---|---|---|---|
| aircrack-ng | 1.7 | 66,988 | Tool suite to test WiFi/WLAN network security |
| botan | 2.17.3 | 147,832 | C++ library for cryptographic operations |
| compress-7zip | 24.05 | 247,211 | File archiving tool based on the 7-Zip format |
| compress-pbzip2 | 1.1.13 | 10,187 | Parallel implementation of bzip2 |
| compress-zstd | 1.5.4 | 90,489 | Lossless compression tool using Zstandard |
| draco | 1.5.6 | 50,007 | 3D mesh and point cloud compressing library |
| espeak | 1.51 | 45,192 | Compact open-source speech synthesizer |
| ffmpeg | 7.0 | 1,291,957 | Audio and video processing framework |
| fftw | 3.3.10 | 264,128 | Library for computing FFTs |
| graphics-magick | 1.3.43 | 267,450 | Toolkit for image editing and conversion |
| luajit | 2.1-git | 68,833 | JIT-compiler of the Lua programming language |
| ngspice | 34 | 527,637 | Open-source circuit simulator |
| openssl | 3.3 | 597,713 | Implementation of SSL/TLS |
| redis | 7.0.4 | 178,014 | In-memory data store |
| rnnoise | 0.2 | 146,693 | Neural network for audio noise reduction |
| scimark2 | 2.0 | 800 | Scientific computing suite written in ANSI C |
| sqlite-speedtest | 3.30 | 250,607 | Program for executing SQLite database tests |
| stockfish | 17 | 11,054 | Advanced open-source chess engine |
| tjbench | 2.1.0 | 57,438 | JPEG encoding and decoding tool |
| z3 | 4.14.1 | 512,002 | SMT solver and theorem prover |

All programs were compiled with the `-O3` pipeline on an AMD EPYC 9554P 64-core CPU. In order to minimize result variance, turbo boost, hyperthreading, and ASLR were disabled, the performance governor was used, and core pinning was applied. The plots, depicted below, display the compile-time, object-size, peak-memory-usage (maximum resident set size) and run-time performance differences between the implementation and upstream LLVM. The results reveal that the addition of the byte type had a minimal impact on all of the addressed performance metrics. Each result is averaged over three runs. The run-time results represent the average regression percentage across all tests of each benchmark.

The following plots show per-function assembly size distributions and differences, indicating that the addition of the `byte` type results in minor changes to the generated code, with the largest observed shift being approximately 5%. Each subplot includes the net byte-size change and the percentage of functions with differing assembly, disregarding non-semantic differences such as varying jump and call target addresses.

## Alive2

### LLVM Test Suite

The byte type was implemented in Alive2, enabling the verification of both the reworked and the newly added optimizations. Assessing both the correctness of the implementation and the broader impact of introducing the byte type into the IR, Alive2 was run over the LLVM test suite. Several previously unsound optimizations, which were addressed by the byte type, were identified in the tests listed below.
| **Test** | **Reason** |
|---|---|
| ExpandMemCmp/AArch64/memcmp.ll | `memcmp` to integer load/store pairs |
| ExpandMemCmp/X86/bcmp.ll | `bcmp` to integer load/store pairs |
| ExpandMemCmp/X86/memcmp-x32.ll | `memcmp` to integer load/store pairs |
| ExpandMemCmp/X86/memcmp.ll | `memcmp` to integer load/store pairs |
| GVN/metadata.ll | Unsound pointer coercions |
| GVN/pr24397.ll | Unsound pointer coercions |
| InstCombine/bcmp-1.ll | `bcmp` to integer load/store pairs |
| InstCombine/memcmp-1.ll | `memcmp` to integer load/store pairs |
| InstCombine/memcpy-to-load.ll | `memcpy` to integer load/store pairs |
| PhaseOrdering/swap-promotion.ll | `memcpy` to integer load/store pairs |
| SROA/alignment.ll | `memcpy` to integer load/store pairs |

It is worth noting that some additional tests containing unsound optimizations were addressed. However, Alive2 did not report them as unsound, due to the presence of unsupported features, such as multiple address spaces. Moreover, the `ExpandMemCmp` tests continue to be flagged as unsound by Alive2. This is because the required `!uninit_is_nondet` metadata has not yet been upstreamed and therefore remains absent from `memcmp` load-widening optimizations.

### Single-File Programs

The `alivecc` tool was used to verify the compilation of two single-file C programs, both compiled at the `-O2` optimization level. The results are presented below.

* `bzip2`: No differences were detected during verification.
* `sqlite3`: Two optimizations previously flagged as unsound by Alive2 were fixed. These occurred in the `sqlite3WhereOkOnePass` and `dup8bytes` functions. The reduced IR reveals that these were caused by lowerings of `memcpy` to integer load/store pairs.
# Future Work

After modifying Clang to lower the `char`, `unsigned char` and `signed char` types to the byte type, approximately 1800 Clang regression tests began failing. Over the course of the project, the number of failing tests was gradually reduced; currently, around 100 regression tests are still failing. LLVM is a fast-moving codebase, and due to the sheer number of Clang tests affected by the introduction of the byte type, maintaining a clean test suite constitutes a continuous effort.

The benchmarks were run on an x86-64 system. However, LLVM also supports other popular architectures, such as AArch64 and RISC-V, which may require additional performance evaluation.

Furthermore, the patches do not yet include any additions to the Language Reference.

# Conclusion

The addition of the byte type to the IR solves one of the long-lasting problems in LLVM, with a minimal performance impact. Optimization passes can now safely represent and manipulate raw memory values, fixing existing optimizations and setting up a solid foundation for new, previously inexpressible optimizations.
Participating in GSoC was both a great honor and a tremendous learning opportunity. Over the course of this project, I've learned a lot about compilers, optimizations and LLVM. It was also a valuable opportunity to get in touch with the LLVM community and contribute through the following pull requests:

* [InstCombine] Fold `(x == A) || (x & -Pow2) == A + 1` into range check
* [ADT] Add signed and unsigned mulExtended to APInt
* [Headers][X86] Allow pmuludq/pmuldq to be used in constexpr
* [LangRef] Fix `ptrtoaddr` code block
* [clang][x86] Add C/C++ and 32/64-bit test coverage to constexpr tests
* [Headers][X86] Allow AVX512 reduction intrinsics to be used in constexpr
* [InstCombine] Support offsets in `memset` to load forwarding
* [ConstantFolding] Merge constant gep `inrange` attributes
* [InstCombine] Propagate neg `nsw` when folding `abs(-x)` to `abs(x)`
* [LV] Peek through bitcasts when performing CSE

I would like to thank my mentor, Nuno Lopes, for his guidance and support. Not only did his experience and expertise help me get through some of the most challenging parts of the project, but his presence also made the whole process genuinely enjoyable. I also believe few people in the world could guide me so well through the Alive2 codebase!

I would also like to thank George Mitenkov, who laid the groundwork by developing the original prototype introducing the byte type. Not only did he accomplish quite a lot in a single summer, but he also wrote a phenomenal write-up, which greatly contributed to my understanding of the problem.

* * *

1. _Values stored in non-bit-field objects of any other object type consist of n x `CHAR_BIT` bits, where `n` is the size of an object of that type, in bytes. The value may be copied into an object of type `unsigned char [n]` (e.g., by `memcpy`); the resulting set of bytes is called the object representation of the value._ (C99 ISO Standard, 6.2.6.1.4) ↩︎
2. _The underlying bytes making up the object can be copied into an array of `char`, `unsigned char`, or `std::byte`. If the content of that array is copied back into the object, the object shall subsequently hold its original value._ (C++20 ISO Standard, 6.9.2) ↩︎
blog.llvm.org
September 15, 2025 at 10:51 PM
GSoC 2025: Profiling and Testing Math Functions on GPUs
With the increasing importance of GPU computing, having a robust and familiar C standard library becomes a valuable asset for developers. The LLVM project is actively working to provide this foundation, as a solid libc implementation enables more complex libraries to be developed for and used on GPUs. A key part of this effort is providing the C standard math library (LLVM-libm) on GPUs, often reusing the same target-agnostic implementations developed for CPU targets.

This context creates a twofold challenge. First, there is a need to systematically **verify the conformance** of these implementations to standards like OpenCL. Second, it is crucial to **benchmark their performance** against the highly optimized vendor libraries to understand the trade-offs involved.

This Google Summer of Code 2025 project was designed to address both challenges by developing a framework for conformance testing as well as refining and expanding the existing benchmarking infrastructure. The work provides two benefits to the LLVM community:

* It **empowers libc contributors** with a robust tool to validate their GPU math function implementations.
* It **builds trust with end-users** by providing transparent accuracy and performance data.

This report details the work completed on this project.

## Conformance Testing

To address the goal of accuracy verification, I implemented a C++ framework for conformance testing within the `offload/unittests/Conformance` directory. This framework was designed to be extensible, easy to use, and capable of testing various implementation providers (`llvm-libm`, `cuda-math`, `hip-math`) across different hardware platforms (AMD, NVIDIA).

### Key Components

The framework's power and simplicity come from a few key components that work together:

* **`DeviceContext`**: A lightweight wrapper around the new Offload API that abstracts away the low-level details of device discovery, resource management, and kernel launching.
* **`InputGenerator`**: An extensible interface for test input generation. The framework provides two concrete implementations:
  * **`ExhaustiveGenerator`**: Used for functions with small input spaces (e.g., half-precision functions and single-precision univariate functions), this generator iterates over every representable point in a given space, ensuring complete coverage.
  * **`RandomGenerator`**: Used for functions with large input spaces (e.g., single-precision bivariate and double-precision functions), this generator produces a massive, deterministic stream of random points to sample the space thoroughly.
* **`GpuMathTest`**: The main test harness class that orchestrates the entire process. It manages loading the correct GPU binary, setting up device buffers, invoking the generator, launching the kernel, and triggering the verification process.
* **`HostRefChecker`**: After the GPU computation is complete, this component calculates the expected result for each input on the host CPU (using LLVM-libm’s correctly rounded implementations) and computes the ULP (Units in the Last Place) distance to the actual result from the GPU.

This architecture makes writing a new, complete test simple and concise. For example, a full exhaustive test for the `expf` function requires only a few lines of code:

```cpp
#include "mathtest/TestRunner.hpp"
// ... other includes

// 1. Configure the test for the `expf` function.
namespace mathtest {

template <> struct FunctionConfig<expf> {
  static constexpr llvm::StringRef Name = "expf";
  static constexpr llvm::StringRef KernelName = "expfKernel";

  // ULP tolerance sourced from the OpenCL C Specification
  static constexpr uint64_t UlpTolerance = 3;
};
} // namespace mathtest

// 2. Define the main function to run the test
int main(int argc, const char **argv) {
  llvm::cl::ParseCommandLineOptions(argc, argv, "...");

  // 3. Define the input space and select the generator
  mathtest::IndexedRange<float> Range;
  mathtest::ExhaustiveGenerator<float> Generator(Range);

  // 4. Run the tests against all configured providers
  bool Passed = mathtest::runTests<expf>(
      Generator, mathtest::cl::getTestConfigs(), DEVICE_BINARY_DIR);

  return Passed ? EXIT_SUCCESS : EXIT_FAILURE;
}
```

### Contributions

This part of the project was submitted to the LLVM project through a series of pull requests, which can be grouped into the following categories:

* **Framework Creation and Evolution**
  * #149242: Add framework for math conformance tests on GPUs
  * #151714: Build device code as C++
  * #152362: Add support for CUDA Math and HIP Math providers
  * #154252: Add RandomGenerator for large input spaces
* **Adding Test Coverage**
  * #152013: Add tests for single-precision math functions
  * #154663: Add randomized tests for single-precision bivariate math functions
  * #155003: Add randomized tests for double-precision math functions
  * #155112: Add exhaustive tests for half-precision math functions
* **Enabling Work and Ecosystem Improvements**
  * Enabling functions in `libc`: #151841, #152157, #154857, #155060
  * Infrastructure and dependencies: #150083, #150140, #151820
* **Documentation**
  * #155190: Add README file

### Accuracy Results

The primary deliverable of the conformance testing work is a comprehensive set of accuracy data. The framework reports the maximum observed ULP (Units in the Last Place) distance for a wide range of functions across three providers: `llvm-libm`, `cuda-math`, and `hip-math`.

The table below presents a sample of these results for a few selected single-precision functions, all tested exhaustively. The tests were run on an AMD gfx1030 and an NVIDIA RTX 4000 SFF Ada Generation GPU, with ULP tolerances based on the OpenCL C specification.
#### Exhaustive Test Results for Selected Single-Precision Univariate Math Functions

Maximum observed ULP distance per provider:

| Function | ULP Tolerance | llvm-libm (AMDGPU) | llvm-libm (CUDA) | cuda-math (CUDA) | hip-math (AMDGPU) |
|---|---|---|---|---|---|
| cosf | 4 | 1 | 1 | 2 | 2 |
| expf | 3 | 0 | 0 | 2 | 1 |
| logf | 3 | 1 | 1 | 1 | 2 |
| sinf | 4 | 1 | 1 | 1 | 2 |
| tanf | 5 | 0 | 0 | 3 | 2 |

The complete accuracy results for all tested functions (including half, single, and double precision) will be published on the official LLVM-libc GPU Supported Functions page. For users and future contributors interested in running existing or adding new tests, detailed instructions are available in the project’s `README.md` file.

## Performance Profiling

Alongside accuracy, performance is a critical metric for a GPU math library. The second major goal of this project was to refine and expand the existing benchmarking framework to enable fair, reproducible, and insightful performance comparisons between LLVM-libc and vendor-optimized libraries. This effort involved overcoming several subtle challenges and significantly refactoring the infrastructure.

### Key Enhancements

The path to reliable performance data involved a series of foundational improvements to ensure that the results are fair, statistically sound, and reproducible. Key enhancements included:

* **Reproducibility and Fairness**: The initial framework was enhanced with a deterministic, per-thread pseudo-random number generator (PRNG) to ensure that LLVM-libc and vendor libraries are compared using the exact same input sequences. Additionally, to prevent misleading results caused by compiler optimizations, loop unrolling was explicitly disabled in the throughput measurement loop. This change prevents the compiler from aggressively optimizing the transparent libc code in a way that isn’t possible for vendor libraries, ensuring a true apples-to-apples comparison.
* **Statistical Soundness**: The framework’s statistical calculations were improved.
The standard deviation is now computed correctly using a sum-of-squares approach, and results from multiple GPU threads are aggregated using a statistically sound pooled mean and variance. The timing logic was also refined to subtract a baseline measurement of the empty benchmark loop, isolating the true cost of the function call.
* **Flexible Input Generation and New Benchmarks**: To support a wider range of functions, the framework was refactored with a pluggable input generation system. New distribution classes, `UniformExponent` (for values spanning orders of magnitude) and `UniformLinear` (for linear ranges), were introduced. This new flexibility enabled the addition of a comprehensive suite of benchmarks for the `exp` and `log` families.

### More Contributions

The contributions that refined and expanded the benchmarking infrastructure were submitted in the following pull requests: #153512, #153900, #153971, and #155727.

### Performance Results

As an example, see below part of the output from the `log` function benchmark on an NVIDIA RTX 4070 Laptop GPU. It highlights interesting performance characteristics. Notice the exceptionally low and nearly constant cycles per call for NVIDIA’s `__nv_logf`. Its IR reveals this is due to a compact `float`-only routine (a fixed sequence of FMAs plus simple bit manipulations) with Flush-To-Zero enabled and no lookup tables or divergent memory accesses.
```
Running Suite: LlvmLibcLogGpuBenchmark
Benchmark           | Cycles (Mean) | Stddev | Min  | Max  | Iterations | Threads |
------------------------------------------------------------------------------------------------------
LogAroundOne_1      | 1031          | 8      | 1017 | 1082 | 1984       | 32      |
LogAroundOne_128    | 608           | 2      | 604  | 615  | 1984       | 32      |
LogMedMag_1         | 1033          | 6      | 1015 | 1113 | 17024      | 32      |
LogMedMag_128       | 606           | 2      | 603  | 610  | 1344       | 32      |
NvLogAroundOne_1    | 1397          | 5      | 1397 | 1473 | 8480       | 32      |
NvLogAroundOne_128  | 1341          | 0      | 1341 | 1342 | 352        | 32      |
NvLogMedMag_1       | 1403          | 4      | 1403 | 1473 | 8480       | 32      |
NvLogMedMag_128     | 1342          | 0      | 1342 | 1344 | 576        | 32      |

Running Suite: LlvmLibcLogfGpuBenchmark
Benchmark           | Cycles (Mean) | Stddev | Min  | Max  | Iterations | Threads |
------------------------------------------------------------------------------------------------------
LogfAroundOne_1     | 1047          | 5      | 1035 | 1104 | 5952       | 32      |
LogfAroundOne_128   | 496           | 2      | 492  | 500  | 2880       | 32      |
LogfMedMag_1        | 1047          | 8      | 1035 | 1649 | 258688     | 32      |
LogfMedMag_128      | 495           | 2      | 491  | 498  | 1984       | 32      |
NvLogfAroundOne_1   | 61            | 0      | 61   | 61   | 1344       | 32      |
NvLogfAroundOne_128 | 94            | 0      | 94   | 94   | 576        | 32      |
NvLogfMedMag_1      | 61            | 0      | 61   | 61   | 1344       | 32      |
NvLogfMedMag_128    | 94            | 0      | 94   | 94   | 576        | 32      |
```

## Future Work

The next logical steps include:

* Expanding conformance test coverage to include new higher math functions as they are implemented in LLVM-libm.
* Adding performance benchmarks for more higher math functions.

## Acknowledgements

I would like to express my gratitude to my mentors, Joseph Huber and Tue Ly. I am deeply thankful for their belief in my potential, for their encouragement during the most challenging moments, and for their incredible availability to guide me and review my pull requests, often at night and on weekends. Their mentorship, rich with lessons in mathematics, programming, and GPU architecture, was certainly the best part of this experience.
I would also like to thank the entire LLVM community for creating a welcoming and collaborative environment.
blog.llvm.org
September 8, 2025 at 10:43 PM
LLVMCGO25 - CARTS: Enabling Event-Driven Task and Data Block Compilation for Distributed HPC
Hello everyone! I’m Rafael, a PhD candidate at the University of Delaware. I recently flew from Philadelphia to Las Vegas to attend the CGO conference, where I had the chance to present my project and soak in new ideas about HPC. In this blog, I’ll dive into the project I discussed at the conference and share some personal insights and lessons I learned along the way. Although comments aren’t enabled here, I’d love to hear from you; feel free to reach out at (_rafaelhg at udel dot edu_) if you’re interested in collaborating, have questions, or just want to chat.

## Motivation: Why CARTS?

Modern High-Performance Computing (HPC) and AI/ML workloads are pushing our hardware and software to the limits. Some key challenges include:

* **Evolving Architectures:** Systems now have complex memory hierarchies that need smart utilization.
* **Hardware Heterogeneity:** With multi-core CPUs, GPUs, and specialized accelerators in the mix, resource management gets tricky.
* **Performance Pressure:** Large-scale systems demand efficient handling of concurrency, synchronization, and communication.

These challenges led to the creation of CARTS, a compiler framework that combines the flexibility of MLIR with the reliability of LLVM to optimize applications for distributed HPC environments.

## A Closer Look at ARTS and Its Inspirations

At the heart of CARTS is ARTS. Originally, ARTS stood for the **Abstract Runtime System**. I often get mixed up and mistakenly call it the **Asynchronous Runtime System**. To keep things light, we sometimes joke about it being the **Any Runtime System**.
ARTS is inspired by the Codelet model, a concept I could talk about all day! The Codelet model breaks a computation into small, independent tasks (or “codelets”) that can run as soon as their data dependencies are met. If you’re curious to learn more about this model (or find it delightfully abstract), I suggest you visit our research group website at CAPSL, University of Delaware and check out the Codelet Model website.

### What Does ARTS Do?

ARTS is designed to support fine-grained, event-driven task execution in distributed systems. Here’s a simple breakdown of some key concepts:

* **Event-Driven Tasks (EDTs):** These are the basic units of work that can be scheduled independently. Think of an EDT as a small, self-contained task that runs once all its required data is ready.
* **DataBlocks:** These represent memory regions holding the data needed by tasks. ARTS tracks these DataBlocks across distributed nodes so that tasks have quick and efficient access to the data they need.
* **Events:** These are signals that tell the system when a DataBlock is ready or when a task has finished. They help synchronize tasks without the need for heavy locks.
* **Epochs:** These act as synchronization boundaries. An epoch groups tasks together, ensuring that all tasks within the group finish before moving on to the next phase.

By modeling tasks, DataBlocks, events, and epochs explicitly, ARTS makes it easier to analyze and optimize how tasks are executed across large, distributed systems.

## The CARTS Compiler Pipeline

Building on ARTS, CARTS creates a task-centric compiler workflow. Here’s how it works:

### Clang/Polygeist: From C/OpenMP to MLIR

* **Conversion Process:** Using the Polygeist infrastructure, we translate C/OpenMP code into MLIR. This process handles multiple dialects (like OpenMP, SCF, Affine, and Arith).
* **Extended Support:** We’ve enhanced it to handle more OpenMP constructs, including OpenMP tasks.

### ARTS Dialect: Simplifying Concurrency

* **Custom Language Constructs:** The ARTS dialect converts high-level OpenMP tasks into a form that directly represents EDTs, DataBlocks, events, and epochs.
* **Easier Analysis:** This clear representation makes it simpler to analyze and optimize the code.

### Optimization and Transformation Passes

* **EDT Optimization:** We remove redundant tasks and optimize task structures, for example, turning a “parallel” task that contains only one subtask into a “sync” task.
* **DataBlock Management:** We analyze memory access patterns to decide which DataBlocks are needed and optimize their usage.
* **Event Handling and Classic Optimizations:** We allocate and manage events, applying techniques like dead code elimination and common subexpression elimination to clean up the code.

### Lowering to LLVM IR and Runtime Integration

* **Conversion to LLVM IR:** The ARTS-enhanced MLIR is converted into LLVM IR. This involves outlining EDT regions into functions and inserting ARTS API calls for task, DataBlock, epoch, and event management.
* **Seamless Integration:** The final binary runs on the ARTS runtime, which schedules tasks dynamically based on data readiness.

## Looking Ahead: Future Directions for CARTS

The journey with CARTS is just beginning. Here’s a glimpse of what’s next:

* **Comprehensive Benchmarking:** Testing the infrastructure with a variety of benchmarks to validate performance under diverse scenarios.
* **Expanded OpenMP Support:** Enhancing support for additional OpenMP constructs such as loops, barriers, and locks.
* **Advanced Transformation Passes:** Developing techniques like dependency pruning, task splitting/fusion, and affine transformations to further optimize task management and data locality.
* **Memory-Centric Optimizations:** Implementing strategies like cache-aware tiling, data partitioning, and optimized memory layouts to reduce cache misses and enhance data transfer efficiency.
* **Feedback-Directed Compilation:** Incorporating runtime profiling data to adapt optimizations dynamically based on actual workload and hardware behavior.
* **Domain-Specific Extensions:** Creating specialized operations for domains such as stencil computations and tensor operations to boost performance in targeted HPC applications.

## Wrapping Up

Conferences like CGO are not just about technical presentations; they’re also about meeting people and sharing ideas. I really enjoyed the mix of technical sessions and informal conversations. One of my favorite moments was meeting a professor at the conference and joking about how we only seem to meet when we’re away from Newark. It’s these human connections, along with the valuable feedback on my work, that make attending such events worthwhile.

Here are a few personal takeaways:

* **Invaluable Feedback:** Presenting work-in-progress at LLVM CGO workshops has taught me that constructive criticism is the fuel for innovation.
* **Community Spirit:** Reconnecting with fellow researchers, whether through formal sessions or casual hallway conversations, enriches both our professional and personal lives. I encourage fellow PhD candidates and early-career researchers to take every opportunity to present your work; your ideas might not be 100% polished, but the community is there to help you refine them.

Presenting CARTS allowed me to share detailed technical insights, discuss the practical challenges of HPC, and even have a few laughs along the way.
While the technical details might seem dense at times, I hope the mix of personal anecdotes and hands-on explanations makes the topic accessible and engaging. If you’re interested in discussing more about ARTS, the Codelet model, or anything else related to HPC, please drop me an email at (_rafaelhg at udel dot edu_). I’d love to chat, collaborate, or simply hang out.

## Acknowledgements

* This work is supported by the US DOE Office of Science project “Advanced Memory to Support Artificial Intelligence for Science” at PNNL. PNNL is operated by Battelle Memorial Institute under Contract DE-AC06-76RL01830.
* Thanks to the LLVM Foundation for the travel award that made attending the CGO conference possible.
blog.llvm.org
August 16, 2025 at 10:23 PM
LLVM Fortran Levels Up: Goodbye flang-new, Hello flang!
LLVM has included a Fortran compiler “Flang” since LLVM 11 in late 2020. However, until recently the Flang binary was not `flang` (like `clang`) but instead `flang-new`.

LLVM 20 ends the era of `flang-new`. The community has decided that Flang is worthy of a new name. The “new” name? You guessed it, `flang`. A simple change that represents a major milestone for Flang.

This article will cover the almost 10 year journey of Flang. The first concepts, multiple rewrites, the adoption of LLVM’s Multi-Level Intermediate Representation (MLIR) and Flang entering the LLVM Project.

If you want to try `flang` right now, you can download it or try it in your browser using Compiler Explorer.

# Why Fortran?

Fortran was first created in the 1950s, and the name came from “Formula Translation”. Fortran focused on the mathematics use case and freed programmers from writing assembly code that could only run on specific machines.

Instead they could write code that looked like a formula. You expect this today but for the time it was a revolution. This feature led to heavy use in scientific computing: weather modelling, fluid dynamics and computational chemistry, just to name a few.

> Whilst many alternative programming languages have come and gone, it [Fortran] has regained its popularity for writing high performance codes. Indeed, over 80% of the applications running on ARCHER2, a 750,000 core Cray EX which is the UK national supercomputer, are written in Fortran.

* Fortran High-Level Synthesis: Reducing the barriers to accelerating High Performance Computing (HPC) codes on FPGAs (Gabriel Rodriguez-Canal et al., 2023)

Fortran has had a resurgence in recent years, gaining a package manager, an unofficial standard library and LFortran, a compiler that supports interactive programming (LFortran also uses LLVM).

For the full history of Fortran, IBM has an excellent article on the topic and I encourage you to look at the “Programmer’s Primer for Fortran” if you want to see the early form of Fortran.
If you want to learn the language, fortran-lang.org is a great place to start.

# Why Would You Make Another Fortran Compiler?

There are many Fortran compilers. Some are vendor specific, such as the Intel Fortran Compiler or NVIDIA’s HPC compilers. Then there are open source options like GFortran, which supports many platforms. Why build one more?

The two partners in the early days of Flang were the US National Labs and NVIDIA. For Pat McCormick (Flang project lead at Los Alamos National Laboratory), preserving the utility of Fortran code was imperative:

> These [Fortran] codes represent an essential capability that supports many elements of our [The United States’] scientific mission and will continue to do so for the foreseeable future. A fundamental risk facing these codes is the absence of a long-term, non-proprietary support path for Fortran.

GFortran might seem to counter that statement, but remember that a single project is a single point of failures, incompatibilities and disagreements. Having multiple implementations reduces that risk.

NVIDIA’s Gary Klimowicz laid out their goals for Flang in a presentation to FortranCon in 2020:

* Use a permissive license like that of LLVM, which is more palatable to commercial users and contributors.
* Develop an active community of Fortran compiler developers that includes companies and institutions.
* Support Fortran tool development by basing Flang on existing LLVM frameworks.
* Support Fortran language experimentation for future language standards proposals.

Intentions echoed by Pat McCormick:

> The overarching goal was to establish an open-source, modern implementation and simultaneously grow a community that spanned industry, academia, and federal agencies at both the national and international levels.

Fortran as a language also benefits from having many implementations. For C++ language features, it is common to implement them on top of Clang and GCC, to prove the feature is viable and get feedback.
Implementing the feature multiple times in different compilers uncovers assumptions that may be a problem for certain compilers, or certain groups of compiler users. In the same way, Flang and GFortran can provide that diversity.

However, even when features are standardised, standards can be ambiguous and implementations do make mistakes. A new compiler is a chance to uncover these.

Jeff Hammond (NVIDIA) is very familiar with this, having tested Flang with many existing applications. They had this to say on the motivations for Flang and how users have reacted to it:

> The Fortran language has changed quite a bit over the past 30 years. Modern Fortran deserves a modern compiler ecosystem, that’s not only capable of compiling all the old codes and all the code written for the current standard, but also supports innovation in the future.
>
> Because it’s a huge amount of work to build a feature-complete modern Fortran compiler, it’s useful to leverage the resources of the entire LLVM community for this effort. NVIDIA and ARM play leading roles right now, with important contributions from IBM, Fujitsu and LBNL [Lawrence Berkeley National Laboratory], e.g. related to test suites and coarrays. We hope to see the developer community grow in the future.
>
> Another benefit from the LLVM Fortran compiler is that users are more likely to invest in supporting a new compiler when it has full language support and runs on all the platforms. A broad developer base is critical to support all the platforms.
>
> What I have seen so far interacting with our Fortran users is that they are very excited about LLVM Flang and were willing to commit to supporting it in their build systems and CI systems, which has driven quality improvements in both the Flang compiler and the applications.
>
> Like Clang did with C and C++ codes when it started to become popular, Flang is helping to identify bugs in Fortran code that weren’t noticed before, which is making the Fortran software ecosystem better.
# PGI to LLVM: The Flang Timeline

The story of Flang really starts in 2015, but the Portland Group (PGI) collaborated with US National Labs prior to this. PGI would later become part of NVIDIA and be instrumental to the Flang project.

* **1989** The Portland Group is formed, to provide C, Fortran 77 and C++ compilers for the Intel i860 market.
* **1990** Intel bundles PGI compilers with its iPSC/860 supercomputer.
* **1996** PGI works with Sandia National Laboratories to provide compilers for the Accelerated Strategic Computing Initiative (ASCI) Option Red supercomputer.
* **December 2000** PGI becomes a wholly owned subsidiary of STMicroelectronics.
* **August 2011** Away from PGI, Bill Wendling starts an LLVM based Fortran compiler called “Flang” (later known as “Fort”). Bill is joined by several collaborators a few months later.
* **July 2013** PGI is sold to NVIDIA.

In late 2015 there were the first signs of what would become “Classic Flang”. Though at the time it was just “Flang”, I will use “Classic Flang” here for clarity. Development of what was to become “Fort” continued under the “Flang” name, completely separate from the Classic Flang project.

* **November 2015** NVIDIA joins the US Department of Energy Exascale Computing Project, including a commitment to create an open source Fortran compiler.

> “The U.S. Department of Energy’s National Nuclear Security Administration and its three national labs [Los Alamos, Lawrence Livermore and Sandia] have reached an agreement with NVIDIA’s PGI division to adapt and open-source PGI’s Fortran frontend, and associated Fortran runtime library, for contribution to the LLVM project.”

(this news is also the first appearance of Flang in an issue of LLVM Weekly)

* **May 2017** The first release of Classic Flang as a separate repository, outside of the LLVM Project. Composed of a PGI compiler frontend and a new backend that generates LLVM Intermediate Representation (LLVM IR).
* **August 2017** The Classic Flang project is announced officially (according to LLVM Weekly’s report; the original mailing list is offline). During this time, plans were formed to propose moving Classic Flang into the LLVM Project.
* **December 2017** The original “Flang” is renamed to “Fort” so as not to compete with Classic Flang.
* **April 2018** Steve Scalpone (NVIDIA) announces at the European LLVM Developers’ Conference that the frontend of Classic Flang will be rewritten to address feedback from the LLVM community. This new frontend became known as “F18”.
* **August 2018** Eric Schweitz (NVIDIA) begins work on what would become “Fortran Intermediate Representation”, otherwise known as “FIR”. This work would later become the `fir-dev` branch.
* **February 2019** Steve Scalpone proposes contributing F18 to the LLVM Project.
* **April 2019** F18 is approved for migration into the LLVM Project monorepo. At this point F18 was only the early parts of the compiler; it could not generate code (later `fir-dev` work addressed this). Despite that, it moved into `flang/` in the monorepo, awaiting the completion of the rest of the work.
* **June 2019** Peter Waller (Arm) proposes adding a Fortran mode to the Clang compiler driver.
* **August 2019** The first appearance of the `flang.sh` driver wrapper script (more on this later).
* **December 2019** The plan for rewriting the F18 git history to fit into the LLVM project is announced. This effort was led by Arm, with Peter Waller going so far as to write a custom tool to rewrite the history of F18. Kiran Chandramohan (Arm) proposes an OpenMP dialect for MLIR, with the intention of using it in Flang (discussion continues on Discourse during the following January).
* **February 2020** The plan for improvements to F18 to meet the standards required for inclusion in the LLVM monorepo is announced by Richard Barton (Arm).
* **April 2020** Upstreaming of F18 into the LLVM monorepo is completed.
At this point what was in the LLVM monorepo was F18, the rewritten frontend of Classic Flang. Classic Flang remained unchanged, still using the PGI based frontend. Around this time work started in the Classic Flang repo on the `fir-dev` branch that would enable code generation when using F18.

For the following events remember that Classic Flang was still in use. The Classic Flang binary is named `flang`, just like the folder F18 now occupies in the LLVM Project.

**Note:** Some LLVM changes referenced below will appear to have skipped an LLVM release. This is because they were done after the release branch was created, but before the first release from that branch was distributed.

* **April 2020** The first attempt at adding a new compiler driver for Flang is posted for review. It used the name `flang-tmp`. This change was later abandoned in favour of a different approach.
* **September 2020** Flang’s new compiler driver is added as an experimental option. This is the first appearance of the `flang-new` binary, instead of `flang-tmp` as proposed before.

> The name was intended as temporary, but not the driver.

* Andrzej Warzyński (Arm, Flang Driver Maintainer)

* **October 2020** Flang is included in an LLVM release for the first time in LLVM 11.0.0. There is an `f18` binary and the previously mentioned script `flang.sh`.
* **August 2021** `flang-new` is no longer experimental and replaces the previous Flang compiler driver binary `f18`.
* **October 2021** LLVM 13.0.0 is the first release to include a `flang-new` binary (alongside `f18`).
* **March 2022** LLVM 14.0.0 releases, with `flang-new` replacing `f18` as the Flang compiler driver.
* **April 2022** NVIDIA ceases development of the `fir-dev` branch in the Classic Flang project. Upstreaming of `fir-dev` to the LLVM Project begins around this date. `flang-new` can now do code generation if the `-flang-experimental-exec` option is used. This change used work originally done on the `fir-dev` branch.
* **May 2022** Kiran Chandramohan announces at the European LLVM Developers’ Meeting that Flang’s OpenMP 1.1 support is close to complete. The `flang.sh` compiler driver script becomes `flang-to-external-fc`. It allows the user to use `flang-new` to parse Fortran source code, then write it back to a file to be compiled with an existing Fortran compiler. The script can be put in place of an existing compiler to test Flang’s parser on large projects.
* **June 2022** Brad Richardson (Berkeley Lab) changes `flang-new` to generate code by default, removing the `-flang-experimental-exec` option.
* **July 2022** Valentin Clément (NVIDIA) announces that upstreaming of `fir-dev` to the LLVM Project is complete.
* **September 2022** LLVM 15.0.0 releases, including Flang’s experimental code generation option.
* **September 2023** LLVM 17.0.0 releases, with Flang’s code generation enabled by default.

At this point the LLVM Project contained Flang as it is known today, sometimes referred to as “LLVM Flang”. “LLVM Flang” is the combination of the F18 frontend and MLIR-based code generation from `fir-dev`. As opposed to “Classic Flang”, which combines a PGI based frontend and its own custom backend.

The initiative to upstream Classic Flang was in some sense complete. Though with all of the compiler rewritten in the process, what landed in the LLVM Project was very different to Classic Flang.

* **April 2024** The `flang-to-external-fc` script is removed.
* **September 2024** LLVM 19.1.0 releases. The first release of `flang-new` as a standalone compiler.
* **October 2024** The community deems that Flang has met the criteria to not be “new” and the name is changed. Goodbye `flang-new`, hello `flang`!
* **November 2024** AMD announces its next generation Fortran compiler, based on LLVM Flang. Arm releases an experimental version of its new Arm Toolchain for Linux product, which includes LLVM Flang as the Fortran compiler.
* **March 2025** LLVM 20.1.0 releases.
The first time the `flang` binary has been included in a release.

# Flang and the Definition of New

Renaming Flang was discussed a few times before the final proposal. It was always contentious, so for the final proposal Brad Richardson decided to use the LLVM proposal process. Rarely used, but specifically designed for these situations.

> After several rounds of back and forth, I thought the discussion was devolving and there wasn’t much chance we’d come to a consensus without some outside perspective.

* Brad Richardson

That outside perspective included Chris Lattner (co-founder of the LLVM Project), who quickly identified a unique problem:

> We have a bit of an unprecedented situation where an LLVM project is taking the name of an already established compiler [Classic Flang]. Everyone seems to want the older flang [Classic Flang] to fade away, but flang-new is not as mature and it isn’t clear when and what the criteria should be for that.

Confusion about the `flang` name was a key motivation for Brad Richardson too:

> Part of my concern was that the name “flang-new” would get common usage before we were able to change it. I think it’s now been demonstrated that that concern was valid, because right now [November 2024] fpm [Fortran Package Manager] recognizes the compiler by that name.
>
> My main goal at that point was just clear goals for when we would make the name change.

No single list of goals was agreed, but some came up many times:

* Known limitations and supported features should be documented.
* As much as possible, work that was expected to fix known bugs should be completed, to prevent duplicate bug reports.
* Unimplemented language features should fail with a message saying that they are unimplemented, rather than with a confusing failure or by producing incorrect code.
* LLVM Flang should perform relatively well when compared to other Fortran compilers.
* LLVM Flang must have a reasonable pass rate with a large Fortran language test suite, and the results of that must be shown publicly.
* All reasonable steps should be taken to prevent anyone using a pre-packaged Classic Flang confusing it with LLVM Flang.

You will see a lot of relative language in those, like “reasonable”. No one could say exactly what that meant, but everyone agreed that it was inevitable that one day it would all be true.

Paul T Robinson summarised the dilemma early in the thread:

> > the plan is to replace Classic Flang with the new Flang in the future.
>
> I suppose one of the relevant questions here is: Has the future arrived?

After that Steve Scalpone (NVIDIA) gave their perspective that it was not yet time to change the name. So the community got to work on those goals:

* Many performance and correctness issues were addressed by the “High Level Fortran Intermediate Representation” (HLFIR) (which this article will explain later).
* A cross-company team including Arm, Huawei, Linaro, Nvidia and Qualcomm collaborated to make it possible to build the popular SPEC 2017 benchmark with Flang.
* Flang gained support for OpenMP up to version 2.5, and was able to compile OpenMP specific benchmarks like SPEC OMP and the NAS Parallel Benchmarks.
* Linaro showed that the performance of Flang compared favourably with Classic Flang and was not far behind GFortran.
* The GFortran test suite was added to the LLVM Test Suite, and Flang achieved good results.
* Fujitsu’s test suite was made public and tested with Flang. The process to make IBM’s Fortran test suite public was started.

With all that done, in October of 2024 `flang-new` became `flang`. The future had arrived.

> And it’s merged! It’s been a long (and sometimes contentious) process, but thank you to everyone who contributed to the discussion.

* Brad Richardson, closing out the proposal.

The goals the community achieved have certainly been worth it for Flang as a compiler, but did Brad achieve their own goals?
> What did I hope to see as a result of the name change? I wanted it to be easier for more people to try it out.

So once you have finished reading this article, download Flang or try it out on Compiler Explorer. You know at least one person will appreciate it!

# Fortran Intermediate Representation (FIR)

All compilers that use LLVM as a backend eventually produce code in the form of the LLVM Intermediate Representation (LLVM IR). A drawback of this is that LLVM IR does not include language specific information. This means that, for example, it cannot be used to optimise arrays in a way specific to Fortran programs.

One solution to this has been to build a higher level IR that represents the unique features of the language, optimise that, then convert the result into LLVM IR. Eric Schweitz (NVIDIA) started to do that for Fortran in late 2018:

> FIR was originally conceived as a high-level IR that would interoperate with LLVM but have a representation more friendly and amenable to Fortran optimizations.

Naming is hard but Eric did well here:

> FIR was a pun of sorts. Fortran IR and meant to be evocative of the trees (Abstract Syntax Trees).

We will not go into detail about this early FIR, because MLIR was revealed soon after Eric started the project and they quickly adopted it.

> When MLIR was announced, I quickly switched gears from building data structures for a new “intermediate IR” to porting my IR design to MLIR and using that instead.
>
> I believe FIR was probably the first “serious project” outside of Google to start using MLIR.

The FIR work continued to develop, with Jean Perier (NVIDIA) joining Eric on the project. It became its own public branch `fir-dev`, which was later contributed to the LLVM Project.

The following sections will go into detail on the intermediate representations that Flang uses today.

# MLIR

The journey from Classic Flang to LLVM Flang involved a rewrite of the entire compiler. This provided an opportunity to pick up new things from the LLVM Project.
Most notably MLIR. “Multi-Level Intermediate Representation” (MLIR) was first introduced to the LLVM community in 2019, around the time that F18 was approved to move into the LLVM Project.

The problem that MLIR addresses is the same one that Eric Schweitz tackled with FIR: it is difficult to map high level details of programming languages into LLVM IR. You either have to attach them to the IR as metadata, try to recover the lost details later, or fight an uphill battle to add the details to LLVM IR itself. These details are crucial for producing optimised code in certain languages (Fortran array optimisations were one use case referenced).

This led languages such as Swift and Rust to create their own IRs that include information relevant to their own optimisations. After that IR has been optimised it is converted into LLVM IR and goes through the normal compilation pipeline. To implement these IRs they have to build a lot of infrastructure, but it cannot be shared between the compilers. This is where MLIR comes in.

> The MLIR project aims to directly tackle these programming language design and implementation challenges—by making it very cheap to define and introduce new abstraction levels, and provide “in the box” infrastructure to solve common compiler engineering problems.

* “MLIR: A Compiler Infrastructure for the End of Moore’s Law” (Chris Lattner, Mehdi Amini et al., 2020)

## Flang and MLIR

The same year MLIR debuted, Eric Schweitz gave a talk at the later US LLVM Developers’ Meeting titled “An MLIR Dialect for High-Level Optimization of Fortran”. FIR by that point was implemented as an MLIR dialect.

> That [switching FIR to be based on MLIR] happened very quickly and I never looked back.
>
> MLIR, even in its infancy, was clearly solving many of the exact same problems that we were facing building a new Fortran compiler.
* Eric Schweitz

The MLIR community were also happy to have Flang on board:

> It was fantastic to have very quickly in the early days of MLIR a non-ML [Machine Learning] frontend to exercise features we built in MLIR in anticipation. It led us to course-correct in some cases, and Flang was a motivating factor for many feature requests. It contributed significantly to establishing and validating that MLIR had the right foundations.

* Mehdi Amini

Flang did not stop there, later adding another dialect, “High Level Fortran Intermediate Representation” (HLFIR), which works at a higher level than FIR. A big target of HLFIR was array optimisations, which were more complex to handle using FIR alone.

> FIR was a compromise on both ends to some degree. It wasn’t trying to capture syntactic information from Fortran, and I assumed there would be work done on an Abstract Syntax Tree. That niche would later be filled by “High Level FIR” [HLFIR].

* Eric Schweitz

## IRs All the Way Down

The compilation process starts with Fortran source code.

```fortran
subroutine example(a, b)
  real :: a(:), b(:)
  a = b
end subroutine
```

(Compiler Explorer)

The subroutine `example` assigns array `b` to array `a`.

It is tempting to think of the IRs in a “stack” where each one is converted into the next. However, MLIR allows multiple “dialects” of MLIR to exist in the same file. (The steps shown here are the most important ones for Flang. In reality there are many more between Fortran and LLVM IR.)

In the first step, Flang produces a file that is a mixture of HLFIR, FIR and the built-in MLIR dialect `func` (function).
```mlir
module attributes {<...>} {
  func.func @_QPexample(%arg0: !fir.box<!fir.array<?xf32>> {fir.bindc_name = "a"}, %arg1: !fir.box<!fir.array<?xf32>> {fir.bindc_name = "b"}) {
    %0 = fir.dummy_scope : !fir.dscope
    %1:2 = hlfir.declare %arg0 dummy_scope %0 {uniq_name = "_QFexampleEa"} : (!fir.box<!fir.array<?xf32>>, !fir.dscope) -> (!fir.box<!fir.array<?xf32>>, !fir.box<!fir.array<?xf32>>)
    %2:2 = hlfir.declare %arg1 dummy_scope %0 {uniq_name = "_QFexampleEb"} : (!fir.box<!fir.array<?xf32>>, !fir.dscope) -> (!fir.box<!fir.array<?xf32>>, !fir.box<!fir.array<?xf32>>)
    hlfir.assign %2#0 to %1#0 : !fir.box<!fir.array<?xf32>>, !fir.box<!fir.array<?xf32>>
    return
  }
}
```

For example, the “dummy arguments” (the arguments of a subroutine) are declared with `hlfir.declare` but their type is specified with `fir.array`. As MLIR allows multiple dialects to exist in the same file, there is no need for HLFIR to have a `hlfir.array` that duplicates `fir.array`, unless HLFIR wanted to handle that differently.

The next step is to convert HLFIR into FIR:

```mlir
module attributes {<...>} {
  func.func @_QPexample(<...>) {
    <...>
    %c3_i32 = arith.constant 3 : i32
    %7 = fir.convert %0 : (!fir.ref<!fir.box<!fir.array<?xf32>>>) -> !fir.ref<!fir.box<none>>
    %8 = fir.convert %5 : (!fir.box<!fir.array<?xf32>>) -> !fir.box<none>
    %9 = fir.convert %6 : (!fir.ref<!fir.char<1,17>>) -> !fir.ref<i8>
    %10 = fir.call @_FortranAAssign(%7, %8, %9, %c3_i32) : (!fir.ref<!fir.box<none>>, !fir.box<none>, !fir.ref<i8>, i32) -> none
    return
  }
  <...>
}
```

Then this bundle of MLIR dialects is converted into LLVM IR:

```llvm
define void @example_(ptr %0, ptr %1) {
  <...>
  store { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] } %37, ptr %3, align 8
  call void @llvm.memcpy.p0.p0.i32(ptr %5, ptr %4, i32 48, i1 false)
  %38 = call {} @_FortranAAssign(ptr %5, ptr %3, ptr @_QQclX2F6170702F6578616D706C652E66393000, i32 3)
  ret void
}
<...>
```

This LLVM IR passes through the standard compilation pipeline that Clang also uses, eventually being converted into target
specific Machine IR (MIR), into assembly and finally into a binary program.

* Fortran
* MLIR (including HLFIR and FIR)
* MLIR (including FIR)
* LLVM IR
* MIR
* Assembly
* Binary

At each stage, the optimisations most suited to that stage are done. For example, while you have HLFIR you could optimise array accesses, because at that point you have the most information about how the Fortran source treats arrays. If Flang were to do this later on, in LLVM IR, it would be much more difficult. Either the information would be lost or incomplete, or you would be at a stage in the pipeline where you cannot assume that you started with a specific source language.

# OpenMP to Everyone

**Note:** Most of the points made in this section also apply to OpenACC support in Flang. In the interest of brevity, I will only describe OpenMP in this article. You can find more about OpenACC in this presentation.

## OpenMP Basics

OpenMP is a standardised API for adding parallelism to C, C++ and Fortran programs. Programmers mark parts of their code with “directives”. These directives tell the compiler how the work of the program should be distributed. Based on this, the compiler transforms the code and inserts calls to an OpenMP runtime library for certain operations.

This is a Fortran example:

```fortran
SUBROUTINE SIMPLE(N, A, B)
  INTEGER I, N
  REAL B(N), A(N)
!$OMP PARALLEL DO
  DO I=2,N
    B(I) = (A(I) + A(I-1)) / 2.0
  ENDDO
END SUBROUTINE SIMPLE
```

(from “OpenMP Application Programming Interface Examples”, Compiler Explorer)

**Note:** Fortran arrays are one-based by default, so the first element is at index 1. This example reads the previous element as well, so it starts `I` at 2.

`!$OMP PARALLEL DO` is a directive in the form of a Fortran comment (Fortran comments start with `!`). `PARALLEL DO` starts a parallel “region” that includes the code from `DO` to `ENDDO`. This tells the compiler that the work in the `DO` loop should be shared amongst all the threads available to the program.

Clang has supported OpenMP for many years now.
The equivalent C++ code is:

```cpp
void simple(int n, float *a, float *b)
{
  int i;
  #pragma omp parallel for
  for (i=1; i<n; i++)
    b[i] = (a[i] + a[i-1]) / 2.0;
}
```

(Compiler Explorer)

For C++, the directive is in the form of a `#pragma` and attached to the `for` loop.

LLVM IR does not know anything about OpenMP specifically, so Clang does all the work of converting the intent of the directives into LLVM IR. The output from Clang looks like this:

```llvm
define dso_local void @simple(int, float*, float*) (i32 noundef %n, ptr noundef %a, ptr noundef %b) <...> {
entry:
  <...>
  call void (<...>) @__kmpc_fork_call(@simple <...> (.omp_outlined) <...>)
  ret void
}

define internal void @simple(int, float*, float*) (.omp_outlined) (ptr <...> %.global_tid., ptr <...> %.bound_tid., ptr <...> %n, ptr <...> %b, ptr <...> %a) {
entry:
  <...>
  call void @__kmpc_for_static_init_4(<...>)
  <...>
omp.inner.for.body.i:
  <...>
omp.loop.exit.i:
  call void @__kmpc_for_static_fini(<...>)
  <...>
  ret void
}
```

(output edited for readability)

The body of `simple` no longer does all the work. Instead it uses `__kmpc_fork_call` to tell the OpenMP runtime library to run another function, `simple (.omp_outlined)`, to do the work. This second function is referred to as a “micro task”. The runtime library splits the work across many instances of the micro task, and each time the micro task function is called it gets a different slice of the work. The number of instances is only known at runtime, and can be controlled with settings such as `OMP_NUM_THREADS`.

The LLVM IR representation of `simple (.omp_outlined)` includes labels like `omp.loop.exit.i`, but these are not specific to OpenMP. They are just normal LLVM IR labels whose names include `omp`.

## Sharing Clang’s OpenMP Knowledge

Shortly after Flang was approved to join the LLVM Project, it was proposed that Flang should share OpenMP support code with Clang.

> This is an RFC for the design of the OpenMP front-ends under the LLVM umbrella. It is necessary to talk about this now as Flang (aka.
F18) is maturing at a very promising rate and about to become a sub-project next to Clang.
>
> TLDR; Keep AST nodes and Sema separated but unify LLVM-IR generation for OpenMP constructs based on the (almost) identical OpenMP directive level.

* “[RFC] Proposed interplay of Clang & Flang & LLVM wrt. OpenMP”, Johannes Doerfert (Lawrence Livermore National Laboratory), May 2019 (only one part of this still exists online; this quote is from a copy of the other part, which was provided to me).

For our purposes, the “TLDR” means that although both compilers have different internal representations of the OpenMP directives, they both have to produce LLVM IR from that representation. This proposal led to the creation of the `LLVMFrontendOpenMP` library in `llvm`. By using the same class, `OpenMPIRBuilder`, there is no need to repeat work in both compilers, at least for this part of the OpenMP pipeline.

As you will see in the following sections, Flang has diverged from Clang for other parts of OpenMP processing.

## Bringing OpenMP to MLIR

Early in 2020, Kiran Chandramohan (Arm) proposed an MLIR dialect for OpenMP, for use by Flang.

> We started the work for the OpenMP MLIR dialect because of Flang. … So, MLIR has an OpenMP dialect because of Flang.

* Kiran Chandramohan

This dialect would represent OpenMP specifically, unlike the generic LLVM IR you get from Clang.
If you compile the original Fortran OpenMP example without OpenMP enabled, you get this MLIR:

```mlir
module attributes {<...>} {
  func.func @_QPsimple(<...> {
    %1:2 = hlfir.declare %arg0 <...> {uniq_name = "_QFsimpleEn"} : <...>
    %3:2 = hlfir.declare %2 <...> {uniq_name = "_QFsimpleEi"} : <...>
    %10:2 = hlfir.declare %arg1(%9) <...> {uniq_name = "_QFsimpleEa"} : <...>
    %17:2 = hlfir.declare %arg2(%16) <...> {uniq_name = "_QFsimpleEb"} : <...>
    %22:2 = fir.do_loop <...> {
      <...>
      hlfir.assign %34 to %37 : f32, !fir.ref<f32>
    }
    fir.store %22#1 to %3#1 : !fir.ref<i32>
    return
  }
}
```

(output edited for readability)

Notice that the `DO` loop has been converted into `fir.do_loop`. Now enable OpenMP and compile again:

```mlir
module attributes {<...>} {
  func.func @_QPsimple(<...> {
    %1:2 = hlfir.declare %arg0 <...> {uniq_name = "_QFsimpleEn"} : <...>
    %10:2 = hlfir.declare %arg1(%9) <...> {uniq_name = "_QFsimpleEa"} : <...>
    %17:2 = hlfir.declare %arg2(%16) <...> {uniq_name = "_QFsimpleEb"} : <...>
    omp.parallel {
      %19:2 = hlfir.declare %18 {uniq_name = "_QFsimpleEi"} : <...>
      omp.wsloop {
        omp.loop_nest (%arg3) : i32 = (%c2_i32) to (%20) inclusive step (%c1_i32) {
          hlfir.assign %32 to %35 : f32, !fir.ref<f32>
          omp.yield
        }
      }
      omp.terminator
    }
    return
  }
}
```

(output edited for readability)

You will see that instead of `fir.do_loop` you have `omp.parallel`, `omp.wsloop` and `omp.loop_nest`. `omp` is an MLIR dialect that describes OpenMP.

This translation of the `PARALLEL DO` directive is much more literal than the LLVM IR produced by Clang for `parallel for`. As the `omp` dialect is specifically made for OpenMP, it can represent it much more naturally. This makes it easier to understand the code and to write optimisations.

Of course Flang needs to produce LLVM IR eventually, and to do that it uses the same `OpenMPIRBuilder` class that Clang does.
From the MLIR shown previously, `OpenMPIRBuilder` produces the following LLVM IR:

```llvm
define void @simple_ <...> {
entry:
  call void (<...>) @__kmpc_fork_call( <...> @simple_..omp_par <...>)
  ret void
}

define internal void @simple_..omp_par <...> {
omp.par.entry:
  call void @__kmpc_for_static_init_4u <...>
omp_loop.exit:
  call void @__kmpc_barrier(<...>)
  ret void
omp_loop.body:
  <...>
}
```

The LLVM IR produced by Flang and Clang is superficially different, but structurally very similar. Considering the differences in source language and compiler passes, it is not surprising that they are not identical.

## ClangIR and the Future

It is surprising that a compiler for a language as old as Fortran got ahead of Clang (the most well known LLVM based compiler) when it came to adopting MLIR. This is largely due to timing: MLIR is a recent invention, and Clang existed before MLIR arrived. Clang also has a legacy to protect, so it is unlikely to migrate to a new technology right away.

The ClangIR project is working to change Clang to use a new MLIR dialect, “Clang Intermediate Representation” (“CIR”). Much like Flang and its HLFIR/FIR dialects, ClangIR will convert C and C++ into the CIR dialect. Work on OpenMP support for ClangIR has already started, using the `omp` dialect that was originally added for Flang. Unfortunately, at time of writing, the `parallel` directive is not supported by ClangIR.
However, if you look at the CIR produced when OpenMP is disabled, you can see the `cir.for` element that the OpenMP dialect might replace:

```mlir
module <...> attributes {<...>} {
  cir.func @_Z6simpleiPfS_( <...> {
    %1 = cir.alloca <...> ["a", init] <...>
    %2 = cir.alloca <...> ["b", init] <...>
    %3 = cir.alloca <...> ["i"] <...>
    cir.scope {
      cir.for : cond {
        <...>
      } body {
        <...>
        cir.yield loc(#loc13)
      } step {
        <...>
        cir.yield loc(#loc36)
      } loc(#loc36)
    } loc(#loc36)
    cir.return loc(#loc2)
  } loc(#loc31)
} loc(#loc)
```

(on Compiler Explorer)

# Flang Takes Driving Lessons

**Note:** This section paraphrases material from “Flang Drivers”. If you want more detail please refer to that document, or Driving Compilers.

“Driver” in a compiler context means the part of the compiler that decides how to handle a set of options. For instance, when you use the option `-march=armv8a+memtag`, something in Flang knows that you want to compile for Armv8.0-a with the Memory Tagging Extension enabled.

`-march=` is an example of a “compiler driver” option. These options are what users give to the compiler. There is actually a second driver after this, confusingly called the “frontend” driver, despite being behind the scenes. In Flang’s case the “compiler driver” is `flang` and the “frontend driver” is `flang -fc1` (they are two separate tools, contained in the same binary).

They are separate tools so that the compiler driver can provide an interface suited to compiler users, with stable options that do not change over time. On the other hand, the frontend driver is suited to compiler developers, exposes internal compiler details and does not have a stable set of options.
You can see the differences if you add `-###` to the compiler command:

```shell
$ ./bin/flang /tmp/test.f90 -march=armv8a+memtag -###
 "<...>/flang" "-fc1" "-triple" "aarch64-unknown-linux-gnu"
   "-target-feature" "+v8a" "-target-feature" "+mte"
 "/usr/bin/ld" \
   "-o" "a.out" "-L/usr/lib/gcc/aarch64-linux-gnu/11"
```

(output edited for readability)

The compiler driver has split the compilation into a job for the frontend (`flang -fc1`) and the linker (`ld`). `-march=` has been converted into many arguments to `flang -fc1`. This means that if compiler developers decided to change how `-march=` was converted, existing `flang` commands would still work.

Another responsibility of the compiler driver is to know where to find libraries and header files. This differs between operating systems and even distributions of the same family of operating systems (for example Linux distributions). This created a problem when implementing the compiler driver for Flang. All these details would take a long time to get right.

Luckily, by this time Flang was in the LLVM Project alongside Clang. Clang already knew how to handle this and had been tested on all sorts of systems over many years.

> The intent is to mirror clang, for both the driver and CompilerInvocation, as much as makes sense to do so. The aim is to avoid re-inventing the wheel and to enable people who have worked with either the clang or flang entry points, drivers, and frontends to easily understand the other.

* Peter Waller (Arm)

Flang became the first in-tree project to use Clang’s compiler driver library (`clangDriver`) to implement its own compiler driver. This meant that Flang was able to handle all the targets and tools that Clang could, without duplicating large amounts of code.

# Reflections on Flang

We are almost 10 years from the first announcement of what would become LLVM Flang. In the LLVM monorepo alone there have been close to 10,000 commits from around 400 different contributors. Undoubtedly more in Classic Flang before that.
So it is time to hear from users, contributors, and supporters, past and present, about their experiences with Flang.

> Collaborating with NVIDIA and PGI on Classic Flang was crucial in establishing Arm in High Performance Computing. It has been an honour to continue investing in Flang, helping it become an integral part of the LLVM project and a solid foundation for building HPC toolchains.
>
> We are delighted to see the project reach maturity, as this was the last step in allowing us to remove all downstream code from our compiler. Look out for Arm Toolchain for Linux 20, which will be a fully open source, freely available compiler based on LLVM 20, available later this year.

* Will Lovett, Director Technology Management at Arm.

(the following quote is presented in Japanese and English; in case of differences, Japanese is the authoritative version)

> 富士通は、我々の数十年にわたるHPCの経験を通じて培ったテストスイートを用いて、Flangの改善に貢献できたことを嬉しく思います。Flangの親切で協力的なコミュニティに大変感銘を受けました。
>
> 富士通は、より高いパフォーマンスと使いやすさを実現し、我々のプロセッサを最大限に活用するために、引き続きFlangに取り組んでいきます。Flangが改善を続け、ユーザーを増やしていくことを強く願っています。
>
> Fujitsu is pleased to have contributed to the improvement of Flang with our test suite, which we have developed through our decades of HPC experience. Flang’s helpful and collaborative community really impressed us.
>
> Fujitsu will continue to work on Flang to achieve higher performance and usability, to make the best of our processors. We hope that Flang will continue to improve and gain users.

* 富士通株式会社 コンパイラ開発担当 マネージャー 鎌塚 俊 (Shun Kamatsuka, Manager of the Compiler Development Team at Fujitsu).

> Collaboration between Linaro and Fujitsu on an active CI using Fujitsu’s test suite helped find several issues and make Flang more robust, in addition to detecting any regressions early.
>
> Linaro has been contributing to Flang development for two years now, fixing a great number of issues found by the Fujitsu test suite.

* Carlos Seo, Tech Lead at Linaro.

> SciPy is a foundational Python package.
It provides easy access to scientific algorithms, many of which are written in Fortran.
>
> This has caused a long stream of problems for packaging and shipping SciPy, especially because users expect first-class support for Windows; a platform that (prior to Flang) had no license-free Fortran compilers that would work with the default platform runtime.
>
> As maintainers of SciPy and redistributors in the conda-forge ecosystem, we hoped for a solution to this problem for many years. In the end, we switched to using Flang, and that process was a minor miracle.
>
> Huge thanks to the Flang developers for removing a major source of pain for us!

* Axel Obermeier, Quansight Labs.

> At the Barcelona Supercomputing Center, like many other HPC centers, we cannot ignore Fortran.
>
> As part of our research activities, Flang has allowed us to apply our work in long vectors for RISC-V to complex Fortran applications, which we have been able to run and analyze in our prototype systems. We have also used Flang to support an in-house task-based directive-based programming model.
>
> These developments have proved to us that Flang is a powerful infrastructure.

* Roger Ferrer Ibáñez, Senior Research Engineer at the Barcelona Supercomputing Center (BSC).

> I am thrilled to see the LLVM Flang project achieve this milestone. It is a unique project in that it marries state of the art compiler technologies like MLIR with the venerable Fortran language and its large community of developers focused on high performance compute.
>
> Flang has set the standard for LLVM frontends by adopting MLIR and C++17 features earlier than others, and I am thrilled to see Clang and other frontends modernize based on those experiences.
>
> Flang also continues something very precious to me: the LLVM Project’s ability to enable collaboration by uniting people with shared interests even if they span organizations like academic institutions, companies, and other research groups.
* Chris Lattner, serving member of the LLVM Board of Directors, co-founder of the LLVM Project, the Clang C++ compiler and MLIR.

> The need for a more modern Fortran compiler motivated the creation of the LLVM Flang project and AMD fully supports that path.
>
> In following with community trends, AMD’s Next-Gen Fortran Compiler will be a downstream flavor of LLVM Flang and will in time supplant the current AMD Flang compiler, a downstream flavor of “Classic Flang”.
>
> Our mission is to allow anyone that is using and developing a Fortran HPC codebase to directly leverage the power of AMD’s GPUs. AMD’s Next-Gen Fortran Compiler’s goal is fulfilling our vision by allowing you to deploy and accelerate your Fortran codes on AMD GPUs using OpenMP offloading, and to directly interface and invoke HIP and ROCm kernels.

* AMD, “Introducing AMD’s Next-Gen Fortran Compiler”

# Getting Involved

Flang might not be new anymore, but it is definitely still improving. If you want to try Flang on your own projects, you can download it right now.

If you want to contribute, there are many ways to do so. Bug reports, code contributions, documentation improvements and so on. Flang follows the LLVM contribution process and you can find links to the forums, community calls and anything else you might need here.
# Credits

Thank you to the following people for their contributions to this article:

* Alex Bradbury (Igalia)
* Andrzej Warzyński (Arm)
* Axel Obermeier (Quansight Labs)
* Brad Richardson (Lawrence Berkeley National Laboratory)
* Carlos Seo (Linaro)
* Daniel C Chen (IBM)
* Eric Schweitz (NVIDIA)
* Hao Jin
* Jeff Hammond (NVIDIA)
* Kiran Chandramohan (Arm)
* Leandro Lupori (Linaro)
* Luis Machado (Arm)
* Mehdi Amini
* Pat McCormick (Los Alamos National Laboratory)
* Peter Waller (Arm)
* Steve Scalpone (NVIDIA)
* Tarun Prabhu (Los Alamos National Laboratory)

# Further reading

* Learn Fortran
* The ’eu’ in eucatastrophe – Why SciPy builds for Python 3.12 on Windows are a minor miracle
* Resurrecting Fortran
* The Fortran Package Manager’s First Birthday
* How to write a new compiler driver? The LLVM Flang perspective
* Flang in the Exascale Supercomputing Project
blog.llvm.org
March 14, 2025 at 7:19 PM
GSoC 2024: Improve Clang Doc
Hi, my name is Peter, and this year I was involved in Google Summer of Code 2024. I worked on improving the Clang-Doc documentation generator.

**Mentors:** Petr Hosek and Paul Kirth

## Project Background

Clang-Doc is a documentation generator developed on top of libTooling as an alternative to Doxygen. Development started in 2018 and continued through 2019; however, it has since stalled. Currently, the tool can generate HTML, YAML, and Markdown, but the generated output has usability issues. This GSoC project aimed to address the pain points in the HTML output by adding support for various C++ constructs and reworking the CSS to be more user-friendly.

## Work Done

The original scope of the project was to improve the output of Clang-Doc’s generation. However, during testing the tool was significantly slower than expected, which made developing features for it impractical. Documentation generation for the LLVM codebase was taking upwards of 10 hours on my local machine. Additionally, the tool used a lot of memory and was prone to crashing with an out-of-memory error. Similar tools such as Doxygen and Hdoc ran in comparatively less time for the same codebase. This pointed to a significant bottleneck in Clang-Doc’s code path when generating documentation for large-scale software projects.

Because of this, the project scope quickly changed to improving the runtime of Clang-Doc so that it could run much faster. Only during the latter half of the project did the scope change back to improving Clang-Doc’s generation.

### Added More Test Cases to Clang-Doc’s Test Suite

Clang-Doc previously had tests which did not cover the full scope of the HTML or Markdown output. I added more end-to-end tests to make sure that, in the process of optimizing documentation generation, we were not degrading the quality or functionality of the tool.
In summary, I added four comprehensive tests covering features that we were not previously testing, such as the generation of enums, namespaces, and records for HTML and Markdown.

### Improve Clang-Doc’s Performance by 1.58 Times

Internally, Clang-Doc works by leveraging libTooling’s ASTVisitor class to parse the source-level declarations in each translation unit (TU). The tool is architected using a map-reduce pattern. Clang-Doc parses each fragment of a declaration into an in-memory data format, which is then serialized into an internal format and stored as key-value pairs, identified by their USR (Unified Symbol Resolution). Afterwards, Clang-Doc deserializes and combines the fragment declarations back into the in-memory data format, which each backend uses to generate its results.

Many experiments were conducted to identify the source of the bottleneck. First I tried benchmarking the code with several different codebases, such as JSON and fmtlib, to identify code patterns that slowed the code path down. This didn’t really work, since the bottleneck only showed up for large codebases like LLVM. Next I leveraged the Windows profiler (since I was coding on Windows); however, the visualizations were not helpful, and my system was not capable of profiling the 10-hour run required to generate LLVM’s documentation.

Eventually, we were able to identify a major bottleneck in Clang-Doc by leveraging the TimeProfiler code (similar to `-ftime-trace` in Clang). Clang-Doc was performing redundant work when processing each declaration. We settled on a caching/memoization strategy to minimize the redundant work. For example, if we had the following project:

```cpp
// File: Base.h
class Base {};

// File: A.cpp
#include "Base.h"
...

// File: B.cpp
#include "Base.h"
...
```

In this case, the ASTVisitor invoked by Clang-Doc would visit the Base class three times: once when it is parsing Base.h, and again when visiting A.cpp and then B.cpp.
The problem was that there was no mechanism to identify declarations that had already been seen. Using a simple dictionary to keep track of the declarations Clang-Doc had visited, as a basic form of memoization, ended up being a surprisingly effective optimization. Here is a plot of the benchmarking numbers:

The benchmarks were performed on a machine with a 6th-gen Intel(R) Xeon(R) CPU @ 2.00GHz with 96 cores and 180GB of RAM. Clang-Doc is able to run concurrently; however, the benchmark here is with concurrency set to 2, because anything higher crashes the slow version of the tool with an out-of-memory error. It took around 6 hours to complete a full generation of the LLVM documentation with the previous version of the tool, whereas the current version took around 4 hours.

Here is a plot of the benchmark by number of threads:

We notice a pretty dramatic drop-off as more and more threads are utilized: the original time of 6 hours was cut down to 13 minutes at 64 threads. Considering that previous versions of the tool could not use the higher thread counts without crashing (even on a machine with 180GB of RAM), the performance gains are even more dramatic.

### Added Template Mustache HTML Backend

Clang-Doc originally used an ad-hoc method of generating HTML. I introduced a templating language as a way of reducing project complexity and improving the ease of development. Two RFCs were made before arriving at the idea of introducing Mustache as a library. Originally the idea was to introduce a custom templating language; however, upon further discussion, it was decided that the complexity of designing and implementing a new templating language was too much. An LLVM community member (@cor3ntin) suggested using Mustache as a templating language. Mustache was the ideal choice since it was very simple to implement and has a well defined spec that fit what was needed for Clang-Doc’s use case. The feedback on the RFC was generally positive.
While there was some resistance regarding the inclusion of an HTML support library in LLVM, this concern stemmed partly from a lack of awareness that HTML generation already occurs in several parts of LLVM. Additionally, the introduction of Mustache has the potential to simplify other HTML-related use cases. In terms of engineering wins, this library cut the HTML backend down significantly, dropping around 500 lines of code compared to the original Clang-Doc HTML backend. The library was also designed for general-purpose use around LLVM, since there are numerous places in LLVM where various tools generate HTML in their own way; using the Mustache templating library would be a nice way to standardize the codebase.

### Improve Clang-Doc HTML Output

The previous version of Clang-Doc’s HTML output was a minimal, bare-bones implementation. It had a sidebar that contained every single declaration within the project, which created a large, unnavigable UI. Typedef documentation was missing, and method documentation lacked details such as whether a method was const or virtual. There was no linking between declarations in the project, and no syntax highlighting on any language construct. With the new Mustache changes, an additional backend was added (using the specifier --format=mhtml) that addresses these issues. Below is a comparison of the same output between the two backends. You can also visit the output project on my github.io page here. Note: this output is still a work in progress.

## Learning Insight

I’ve learned a lot in the past few months. Thanks to GSoC, I now have a much better idea of what it’s like to participate in a large open-source project. I received a lot of feedback through PRs, making RFCs, and collaborating with other GSoC members, and I learned a lot about how to interact with the community and solicit feedback.
I also learned a lot about instrumenting and profiling code, having conducted many experiments to try to speed Clang-Doc up.

## Future Work

As my work concluded, I was named one of the maintainers of the project. I plan to keep working on Clang-Doc until an MVP can be generated and evaluated for the LLVM project. My remaining tasks include landing the Mustache support library and Clang-Doc’s Mustache backend, as well as gathering feedback from the LLVM community regarding Clang-Doc’s current output. Additionally, I intend to add test cases for the Mustache HTML backend to ensure its robustness and functionality.

## Conclusion

Overall, the current state of Clang-Doc is much healthier than it was before. It now has much better test coverage across all of its outputs (Markdown, HTML, YAML), whereas previously the e2e test cases were not comprehensive. The tool is significantly faster, especially for large-scale projects like LLVM, making documentation generation and development a much better experience. The tool also has a simplified HTML backend that will be much easier to work with than before, leading to faster development velocity.

## Acknowledgements

I’d like to thank my mentors, Paul and Petr, for their invaluable input whenever I encountered issues with the project. This year has been tough for me mentally, and I’d like to thank my mentors for being accommodating with me.
blog.llvm.org
December 23, 2024 at 7:14 PM
GSoC 2024: Adding LLVM and Clang plugin support for Windows
Hello everyone! My name is Thomas, and for GSoC I’ve been working on adding plugin support for LLVM and Clang on Windows. This mainly involved implementing proper support for building LLVM and Clang as shared libraries (known as DLLs on Windows, dylibs on macOS, or DSOs/SOs on most other Unices) on Windows.

## Background

The LLVM CMake build system had some existing support for building LLVM as a shared library on Windows, but it suffered from two limitations when trying to make symbol visibility for DLLs work the way it does on Linux.

Most of the environments that LLVM works on use ELF as the object and executable file format. Windows, however, uses PE as its executable file format and COFF as its object file format. This difference is important to highlight, as it impacts how dynamic libraries operate in these environments. ELF (and Mach-O) based targets implicitly export symbols across the module boundary, but this can be explicitly controlled via the GNU attribute applied to the symbol: `__attribute__((__visibility__("...")))`. For PE/COFF environments, two different attributes are required: symbols meant to be exposed to other modules are decorated with `__declspec(dllexport)`, and symbols imported from other modules are decorated with `__declspec(dllimport)`. Additionally, the PE format maintains a list of symbols which are public and assigns each a numerical identity known as the ordinal. This is represented in the file format as a 16-bit field, limiting the number of exported symbols to 64K.

In order to support DLL builds on MinGW, a Python script would scan the object files generated during the build and extract the symbol names from them. To remain under the 64K limit, the symbols would be filtered by pattern matching. The final set would then be used to create an import library that the consumer could use.
This technique not only potentially over-exported symbols and introduced a secondary source of truth for the code, but also relied on the linker to generate fix-up thunks, since the compiler could not determine whether a symbol originated from a shared or static library. This adds overhead to any function call that goes through the import library, as it is no longer a simple indirect call. Such a thunk is also not possible for data symbols, such as static fields of classes, except on MinGW, which uses a custom runtime fixup.

## What We Did

Some initial work I did was updating the LLVM CMake build system to be able to build an LLVM DLL using clang-cl’s /Zc:dllexportInlines- option. Inline-declared class methods are normally not compiled unless used, but when the `__declspec(dllexport)` attribute is applied to a class, all of its methods are compiled and exported by the compiler even if unused. This option negates that behaviour, preventing inline methods from being compiled and exported. It avoids emitting these methods in every translation unit that includes the declaration, greatly reducing compile times for DLL builds. More importantly, it almost halves the number of exported symbols: down to 28k for the LLVM DLL and 20k for the Clang DLL. The cost of this improvement is that DLLs built with this option cannot be consumed by code built with MSVC, as that code expects these methods to be available externally. There is a Microsoft Developer Community issue to add the option to MSVC; please consider voting for it so that it may be considered by Microsoft for addition to the MSVC toolchain.

Another major thing I worked on was extending `ids`, a Clang-tooling-based tool that Saleem Abdulrasool created to automate adding symbol visibility macros to classes, global functions, and variables in public LLVM and Clang headers.
I made its file processing multi-threaded and added a config-file system to make it simple to add exclusions for different headers and directories when adding macro annotations. I also changed it to automatically add an include of the header that defines the visibility macros whenever a file is annotated.

I managed to get plugins for Clang and LLVM working, including passing the LLVM and Clang test suites when building with clang-cl. Some of the changes to support this have already been merged into LLVM and Clang or are waiting in open PRs, but it will take some time to get all the changes merged across the whole codebase. The greatly reduced install size from using a non-statically-linked build of the LLVM tools and Clang could also help with a current limitation of the installer used for the official Windows distribution, which forced the number of targets included in the distribution to be limited: the install would shrink from over 2GB to close to 500MB.

## Future Work

Once all the symbol visibility changes have been merged into LLVM and Clang, a next step would be to use the `ids` tool to annotate newly added classes and functions, either integrated into the LLVM pre-submit actions to generate symbol visibility macros for new code, or alternatively as an after-commit action similar to gn syncbot. A buildbot will also need to be set up to make sure the Windows shared library builds continue to build and function correctly. Clang also still has some weak areas when it comes to tracking the source location of some code constructs in its AST, such as explicit function template instantiations, which `ids` would benefit from having fixed.

## Acknowledgements

I’d like to thank Tom Stellard for doing a lot of the initial work I reused and built on top of, and my mentors, Saleem Abdulrasool and Vassil Vassilev.
## Links

* GitHub issue tracking current progress on adding plugin and LLVM_BUILD_LLVM_DYLIB support for Windows
* Previous discussion of supporting LLVM_BUILD_LLVM_DYLIB on Windows
Lightstorm: minimalistic Ruby compiler
Some time ago I was talking about an ahead-of-time Ruby compiler. We started the project with certain goals and hypotheses in mind, and while the original compiler is at nearly 90% completion, there is still the other 90% that needs to be done. In the meantime, we decided to strip it down to a bare minimum and implement just enough features to validate the hypotheses. Just like the original compiler, we use MLIR to bridge the gap between the Ruby VM’s bytecode and the codegen, but instead of targeting LLVM IR directly, we go through the EmitC dialect and target the C language, as that significantly simplifies OS/CPU support. We go into a bit more detail later. The source code of our minimalistic Ruby compiler is here: https://github.com/dragonruby/lightstorm. The rest of the article covers why we decided to build it, how we approached the problem, and what we discovered along the way.

## Motivation and the use case

Our use case is pretty straightforward: we are building a cross-platform game engine that’s indie-focused, productive, and easy to use. The game engine itself is written in a mix of C and Ruby, but the main user interface is Ruby itself. As soon as the game development is done and the game is ready for “deployment,” the code is more or less static, and so we asked ourselves whether we could pre-compile it into machine code to make it run faster. But given all the dynamism of Ruby, why would compilation make it faster? Here comes our hypothesis. But first, let’s look at some high-level implementation details to see where the hypothesis comes from.

## Compilers vs Interpreters

While a language itself cannot be strictly qualified as compiled or interpreted, the typical implementations certainly are. In the case of a “compiled” language, the compiler takes the whole program, analyzes it, and produces machine code targeting specific hardware (real or virtual), while an interpreter takes the program and executes it right away, one “instruction” at a time.
_The definition above is a bit handwavy: zoom out far enough and everything is a compiler; zoom in close enough and everything is an interpreter. But you’ve got the gist._

Most Ruby implementations are interpreter-based, and in our case we are using mruby. The mruby interpreter is a lightweight register-based virtual machine (VM). Let’s look at a concrete example. The following piece of code:

```ruby
42 + 15
```

is converted into the following VM bytecode, consisting of various operations (ops for short):

```
LOADI R1 42
LOADI R2 15
ADD R1 R2
HALT
```

The VM’s interpreter loop looks as follows (pseudocode):

```c
dispatch_next:
  Op op = bytecode.next_op();
  switch (op.opcode) {
    case LOADI: {
      vstack.store(op.dest, mrb_int(op.literal));
      goto dispatch_next;
    }
    case ADD: {
      mrb_value lhs = vstack.load(op.lhs);
      mrb_value rhs = vstack.load(op.rhs);
      vstack.store(op.dest, mrb_add(lhs, rhs));
      goto dispatch_next;
    }
    // More ops...
    case HALT:
      goto halt_vm;
  }
halt_vm:
  // ...
```

For each bytecode operation the VM jumps/branches into the right opcode handler, and then branches back to the beginning of the dispatch loop. In the meantime, each opcode handler uses a virtual stack (confusingly located on the heap) to store intermediate results.
If we unroll the above bytecode manually, the code can look like this:

```c
  goto loadi_1;
loadi_1: // LOADI R1 42
  mrb_value t1 = mrb_int(42);
  vstack.store(1, t1);
  goto loadi_2;
loadi_2: // LOADI R2 15
  mrb_value t2 = mrb_int(15);
  vstack.store(2, t2);
  goto add;
add: // ADD R1 R2
  mrb_value lhs = vstack.load(1);
  mrb_value rhs = vstack.load(2);
  mrb_value t3 = mrb_add(lhs, rhs);
  vstack.store(1, t3);
  goto halt;
halt: // shutdown VM
```

Many items in this example can be eliminated: specifically, we can avoid loads/stores from/to the heap, and we can safely eliminate `goto`s/branches:

```c
  mrb_value t1 = mrb_int(42);
  mrb_value t2 = mrb_int(15);
  mrb_value t3 = mrb_add(t1, t2);
  vstack.store(1, t3);
  goto halt;
halt: // shutdown VM
```

So here goes our hypothesis:

> ### Hypothesis
>
> By precompiling/unrolling the VM dispatch loop we can eliminate many loads/stores and branches, along with branch mispredictions; this should improve the performance of the end program.

We can also try to apply some optimizations based on high-level bytecode analysis, but due to Ruby’s dynamism the static optimization surface is limited.

## Approach

As mentioned in the beginning, building a full-fledged AOT compiler is a laborious task which requires time and has certain constraints. For the minimalistic version we decided to change/relax some of the constraints as follows:

* the compiled code must be compatible with the existing ecosystem/runtime
* the existing runtime must not require any changes
* the supported language features should be easily “representable” in C

Unlike the original compiler, we are not targeting machine code directly, but C instead. This eliminates a lot of complexity, but it also means that we only support a subset of the language (e.g., blocks and exceptions are missing at the moment). This is obviously not ideal, but it serves an important purpose: **our goal at this point is to validate the hypothesis**.
A classical compilation pipeline looks as follows: to build a compiler, one needs to implement the conversions from the raw source file all the way down to machine code, plus the language runtime library. Since we are targeting an existing implementation, we have the benefit of reusing the frontend (parsing + AST) and the runtime library. Still, we need to implement the conversion from the AST to machine code. And this is where the power of MLIR kicks in: we built a custom dialect (Rite) which represents the mruby VM’s bytecode, and then use a number of builtin dialects (`cf`, `func`, `arith`, `emitc`) to convert our IR into C code. At this point, we can just use clang to compile/link the code together with the existing runtime, and that’s it. The final compilation pipeline looks as follows:

> With the benefit of MLIR we are able to build a functional compiler in just a couple of thousand lines of code!

Now let’s look at how it performs.

## Some numbers

Benchmarking is hard, so take these numbers with a grain of salt. We ran various (micro)benchmarks showing results in the range of 1% to 1200% speedups, but we are sticking to aobench, as it is very close to the game-dev workloads we are targeting. mruby also uses aobench as part of its benchmark suite, though we slightly modified it to replace `Number.each` blocks with explicit `while` loops. Next, we used the excellent simple-kpc library to capture CPU counters on an Apple M1 CPU; namely, we collect total cycles, total instruction count, branches, branch mispredictions, and loads/stores (`FIXED_CYCLES`, `FIXED_INSTRUCTIONS`, `INST_BRANCH`, `BRANCH_MISPRED_NONSPEC`, and `INST_LDST` respectively). Naturally, we also collect the total execution time. All the benchmarks compare vanilla bytecode interpretation against the “unrolled” compiled version. We are using mruby 3.0; while it’s not the most recent version at the time of writing, it was the most recent version at the time we built the compiler.
The following chart shows the results of our measurements. The three versions we compare are the baseline on the left, the compiled version without optimizations in the middle, and the compiled version plus simple escape analysis and common subexpression elimination (CSE) on the right. _The raw data and the formulas are here._

With all the current optimizations in place, both the number of cycles and the total execution time went down by roughly ~30%. We were able to eliminate ~17% of branches and ~28% of loads/stores, while branch misses were cut in half with a ~55% decrease. The numbers look promising, although the amount of loads/stores and branches will certainly go up as we implement the remaining language features, due to the way blocks and exceptions are handled. On the other hand, we didn’t touch the runtime implementation, which together with LTO should enable some more improvements due to more inlining.

## Where to next?

As mentioned in the beginning, some parts of the engine itself are written in C, some of them purely for performance reasons. We are looking into replacing those critical pieces with compiled Ruby. While we may still pay a performance penalty, we hope the ease of maintenance will be worthwhile. In the meantime, do not hesitate to give it a shot, and if you have any questions, reach out to Alex or Amir!

## Some links

* the compiler: https://github.com/DragonRuby/lightstorm
* the game engine: https://dragonruby.org
* our discord: https://discord.dragonruby.org
GSoC 2024: Out-Of-Process Execution For Clang-Repl
Hello! I’m Sahil Patidar, and this summer I had the exciting opportunity to participate in Google Summer of Code (GSoC) 2024. My project revolved around enhancing Clang-Repl by introducing Out-Of-Process Execution.

Mentors: Vassil Vassilev and Matheus Izvekov

## Project Background

Clang-Repl, part of the LLVM project, is a powerful interactive C++ interpreter using Just-In-Time (JIT) compilation. However, it faced two major issues: high resource consumption and instability. Running both Clang-Repl and the JIT in the same process consumed excessive system resources, and any crash in user code would shut down the entire session. To address these problems, **Out-Of-Process Execution** was introduced. By executing user code in a separate process, resource usage is reduced and crashes no longer affect the main session. This significantly enhances both the efficiency and stability of Clang-Repl, making it more reliable and suitable for a broader range of use cases, especially on resource-constrained systems.

## What We Accomplished

As part of my GSoC project, I focused on implementing out-of-process execution in Clang-Repl and enhancing the ORC JIT infrastructure to support this feature. Here is a breakdown of the key tasks and improvements I worked on:

### Out-Of-Process Execution Support for Clang-Repl

**PR**: #110418

One of the primary objectives of my project was to implement **out-of-process (OOP) execution** capabilities within Clang-Repl, enabling it to execute code in a separate, isolated process. This feature leverages **ORC JIT’s remote execution capabilities** to enhance code execution flexibility by isolating runtime environments. To enable OOP execution in Clang-Repl, I utilized the `llvm-jitlink-executor`, allowing Clang-Repl to offload code execution to a dedicated executor process. This setup introduces a layer of isolation between Clang-Repl’s main process and the code execution environment.
* **New Command-Line Flags**: To facilitate out-of-process execution, I added two key command-line flags:
  * **`--oop-executor`**: Starts a separate JIT executor process. The executor handles the actual code execution independently of the main Clang-Repl process.
  * **`--oop-executor-connect`**: Establishes a communication link between Clang-Repl and the out-of-process executor, allowing Clang-Repl to transmit code to the executor and retrieve the results of execution.

With these flags in place, Clang-Repl can utilize `llvm-jitlink-executor` to execute code in an isolated environment. This approach significantly enhances the separation between the compilation and execution stages, increasing flexibility and ensuring a more secure and manageable execution process.

### Issues Encountered

* **Block Dependence Calculation in ObjectLinkingLayer** Commit Link

**Code Example**

```
clang-repl> int f() {return 1;}
clang-repl> int f1() {return f();}
clang-repl> f1();
error: disconnecting
clang-repl> JIT session error: FD-transport disconnected
JIT session error: disconnecting
JIT session error: FD-transport disconnected
JIT session error: Failed to materialize symbols: { (main, { __Z2fv }) }
disconnecting
```

During my work on `clang-repl`, I encountered an issue where the JIT session would crash during incremental compilation. The root cause was a bug in `ObjectLinkingLayer::computeBlockNonLocalDeps`. The problem arose from the way the worklist was built: it was being populated within the same loop that records immediate dependencies and dependants, which caused some blocks to be missed from the worklist. This bug was fixed by **Lang Hames**.

### ORC JIT Enhancements

As part of the OOP execution work, several improvements were made to ORC JIT, the underlying framework responsible for dynamic compilation and execution of code in Clang-Repl.
These improvements target better handling of incremental execution, especially for Mach-O and ELF platforms, and ensure that initializers are properly managed across different execution environments.

1. **Incremental Initializer Execution for Mach-O and ELF** (**PRs**: #97441, #110406)

In a typical JIT execution environment, the `dlopen` function is used to handle code mapping, reference counting, and initializer execution for dynamically loaded libraries. However, this approach is often too broad for interactive environments like Clang-Repl, where we only need to execute newly introduced initializers rather than reinitializing everything. To address this, I introduced the **`dlupdate`** function in the ORC runtime. The `dlupdate` function is a targeted solution that focuses solely on running new initializers added during a REPL session. Unlike `dlopen`, which handles a variety of tasks and can lead to unnecessary overhead, `dlupdate` only triggers the execution of newly registered initializers, avoiding redundant operations. This improvement is particularly beneficial in interactive settings like Clang-Repl, where code is frequently updated in small increments. By streamlining the execution of initializers, this change significantly improves the efficiency of Clang-Repl.

2. **Push-Request Model for ELF Initializers** (**PR**: #102846)

A push-request model has been introduced to manage ELF initializers within the runtime state for each `JITDylib`, similar to how initializers are handled for Mach-O and COFF. Previously, ELF required a fresh request for initializers with each invocation of `dlopen`, but lacked mechanisms to register, deregister, or retain these initializers. This created issues during subsequent `dlopen` calls, as initializers were erased after the `rt_getInitializers` function was invoked, making further executions impossible.
To resolve these issues, the following functions were introduced:

* **`__orc_rt_elfnix_register_init_sections`**: Registers ELF initializers for the `JITDylib`.
* **`__orc_rt_elfnix_register_jitdylib`**: Registers the `JITDylib` with the ELF runtime state.

With the new push-request model, the management and tracking of initializers for each `JITDylib` state is now more efficient. By leveraging Mach-O’s `RecordSectionsTracker`, only newly registered initializers are executed, greatly improving efficiency and reliability when working with ELF targets in `clang-repl`. This update is crucial for enabling out-of-process execution in `clang-repl` on ELF platforms, offering a more effective approach to managing incremental execution.

### Additional Improvements

Beyond the main enhancements to Clang-Repl and ORC JIT, I also worked on several other improvements:

1. **Auto-loading Dynamic Libraries in ORC JIT** (**PR**: #109913, ongoing)

With this update, we’ve introduced a new feature to the ORC executor and controller: **automatic loading of dynamic libraries in the ORC JIT**. This enhancement enables efficient resolution of symbols from both loaded and unloaded libraries.

* How it works:
  * **Symbol Lookup:** When a lookup request is made, the system first attempts to resolve the symbol from already-loaded libraries.
  * **Unloaded Libraries Scan:** If the symbol is not found in any loaded library, the system then scans the unloaded dynamic libraries to locate it.
* Key addition: **Global Bloom Filter.** When a symbol cannot be resolved in the loaded libraries, the symbol tables from the scanned libraries are incorporated into this filter. If the symbol is still not found, the bloom filter’s result is returned to the controller, allowing it to skip checking for symbols that do not exist in the global table during future lookups.
Additionally, the system tracks symbols that were previously thought to be present but are actually absent in both loaded and unloaded libraries. With these enhancements, symbol resolution is significantly faster, as the bloom filter helps prevent unnecessary lookups, thereby improving efficiency for both loaded and unloaded dynamic libraries.

2. **Refactor of the `dlupdate` Function** (**PR**: #110491)

This update simplifies the `dlupdate` function by removing the `mode` argument, streamlining the function’s interface. The change enhances the clarity and usability of `dlupdate` by reducing unnecessary parameters, improving the overall maintainability of the code.

## Benchmarks: In-Process vs Out-of-Process Execution

* Prime Finder
* Fibonacci Sequence
* Matrix Multiplication
* Sorting Algorithms

## Result

With these changes, `clang-repl` now supports out-of-process execution. We can run it using the following command:

```
clang-repl --oop-executor=path/to/llvm-jitlink-executor --orc-runtime=path/to/liborc_rt.a
```

## Future Work

* **Crash Recovery and Session Continuation:** Investigate and develop ways to enhance crash recovery so that if something goes wrong, the session can seamlessly resume without losing progress. This involves exploring options for automatically restarting the executor in the event of a crash.
* **Finalize Auto Library Loading in ORC JIT:** Wrap up the feature that automatically loads libraries in ORC JIT. This will streamline symbol resolution for both loaded and unloaded dynamic libraries by ensuring that any required dylibs containing symbol definitions are loaded as needed.

## Conclusion

With this project, **Clang-Repl** now supports **out-of-process execution** for both `ELF` and `Mach-O`, making it much more efficient and stable, especially on devices with limited resources. In the future, I plan to work on automating library loading and improving ORC JIT to make Clang-Repl’s out-of-process execution even better.
## Acknowledgements

I would like to thank **Google Summer of Code (GSoC)** and the LLVM community for providing me with this amazing opportunity. Special thanks to my mentors, **Vassil Vassilev** and **Matheus Izvekov**, for their continuous support and guidance. I am also deeply grateful to **Lang Hames** for sharing their expertise on ORC JIT and helping improve `clang-repl`. This experience has been a major step in my development, and I look forward to continuing my contributions to open source.

## Related Links

* LLVM Repository
* Project Description
* My GitHub Profile
GSoC 2024: The 1001 thresholds in LLVM
Hey everyone! My name is Shourya and I worked on LLVM this summer through GSoC. My project is called The 1001 Thresholds in LLVM. The main objective of this project was to study how varying different thresholds in LLVM affects performance parameters like compile time, bitcode size, execution time, and LLVM statistics.

# Background

LLVM has lots of thresholds and flags to avoid “costly cases”. However, it is unclear whether these thresholds are useful, whether their values are reasonable, and what impact they really have. Since there are so many, one cannot do a simple exhaustive search. An example of prior work in this direction is the introduction of a C++ class that can replace hardcoded values and offers control over the threshold; e.g., one can increase the recursion limit via a command-line flag from the hardcoded “6” to a different number. As such, there is a need to explore different thresholds in LLVM, understand what it means for a threshold to be hit, profile different thresholds, and select optimal values for them.

# What We Did

This work provides a tool that can efficiently explore these knobs and understand how modifying them affects metrics like compile time, the size of the generated program, or any statistic that LLVM emits, like “Number of loops vectorized”. (Note that execution time is currently not evaluated because input-gen does not work on optimized IR; this is part of future work.) We first built a clang matcher that looks for the following patterns:

1. `const knob_name = knob_val`
2. `cl::init(knob_val)`
3. `enum { knob_name = knob_val }`

to identify the knobs in the codebase, and then used a custom Python tool (optimised to deal with I/O and cache bottlenecks) to collect the different stat values in parallel and store them in a JSON file.
After manual selection of interesting knobs, we have so far conducted three studies in which we measure compile time and bitcode size along with various other statistics, and present them in the form of interactive graphs. Two of them (on 10,000 and 100 bitcode files) look at average statistics for each knob value, while the third (on 10,000 bitcode files) studies how each file is affected individually by changing knob values. We see some very interesting patterns in these graphs. For instance, in the following two graphs, representing the jump-threading-threshold, we can observe improved statistics (top graph) and decreased average compile time (bottom graph) as the knob value is increased.

# Results

The per-file study shows that there is no single magic knob value: the optimum, with regard to compile time or code size, depends on the file being compiled. For instance, here we can see that different values of the licm-mssa-optimization-cap knob give good cumulative compile-time improvements for different files. In detail, most files benefit from a knob value of 300, while 60 is the best knob value for the second-largest group of files. We further show that an oracle that could tell the best knob value for each file would significantly improve cumulative compile time. In this project, we explored 93 thresholds in LLVM (a 100-file study for each can be found here) using the Clang matcher, and observed that the best values for these thresholds are largely file-specific. This indicates that there is no universally optimal value, or even set of values, that can be applied across different scenarios. Instead, what is needed is an adaptive mechanism within LLVM, an oracle, that can dynamically determine the appropriate threshold values during compilation. We also experimented with varying thresholds cumulatively by leveraging file-specific information through an LLVM pass.
However, after discussions with the mentors, this approach was set aside due to the significant changes it would necessitate across other parts of the LLVM codebase. As a result, we have not yet categorized different thresholds, such as identifying optimal threshold values for specific file types (e.g., I/O-intensive files). Nonetheless, we provide a tool that can efficiently collect this data (LLVM statistics, bitcode size, and compile time) and help visualize it with interactive graphs as well as histograms that examine these variations on a per-file basis. Additionally, a correlation table between knob values and performance metrics further illustrates the significant impact this study could have on improving LLVM’s overall performance.

# Future Work

The early results show that we need a better understanding of knob values to maximise various objectives. Our results provide the community with a first step towards a guided compilation model attuned to the file being compiled. We further intend to show how these knobs interact with each other and whether modifying multiple knobs together compounds the benefits. Another area of work is input-gen, which would enable us to collect and study execution time among our performance parameters.

# Acknowledgements

This project would not have been possible without my amazing mentors, Jan Hückelheim and Johannes Doerfert, the LLVM Foundation admins, and the GSoC admins.

# Links

* Code
* Studies
* GSoC Project Page
blog.llvm.org
December 23, 2024 at 7:14 PM
GSoC 2024: 3-way comparison intrinsics
Hello everyone! My name is Volodymyr, and in this post I would like to talk about the project I have been working on for the past couple of months as part of Google Summer of Code 2024. The aim of the project was to introduce 3-way comparison intrinsics to LLVM IR and add a decent level of optimizations for them.

# Background

Three-way comparison is an operation present in many high-level languages, such as C++ with its spaceship operator or Rust with the `Ord` trait. It operates on two values for which a comparison operation is defined and returns `-1` if the first operand is less than the second, `0` if they are equal, and `1` otherwise. At the moment, compilers that use LLVM express this operation using different sequences of instructions which are optimized and lowered individually rather than as a single operation. Adding an intrinsic for this operation would therefore help us generate better machine code on some targets, as well as potentially optimize patterns in the middle-end that we didn’t optimize before.

# What was done

Over the course of the project I added two new intrinsics to LLVM IR: `llvm.ucmp` for an unsigned 3-way comparison and `llvm.scmp` for a signed one. They both take two arguments that must be integers or vectors of integers and return an integer or a vector of integers with the same number of elements. The arguments and the result do not need to have the same type. In the middle-end the following passes received some support for these intrinsics:

* InstSimplify (#1, #2)
* InstCombine (#1, #2, #3, #4, #5)
* CorrelatedValuePropagation
* ConstraintElimination

I have also added folds of idiomatic ways that a 3-way comparison can be expressed to a call to the corresponding intrinsic. In the backend there are two different ways of expanding the intrinsics: as a nested select (i.e. `(x < y) ? -1 : (x > y ? 1 : 0)`) or as a subtraction of zero-extended comparisons (`zext(x > y) - zext(x < y)`).
The second option is the default one, but targets can choose to use the first one through a TLI hook.

# Results

I think that overall the project was successful and brought a small positive change to LLVM. To demonstrate its impact on a small test case, the following C++ function using the spaceship operator was compiled twice, first with Clang 18.1 and then with Clang built from the main branch of the LLVM repository:

```cpp
#include <compare>

std::strong_ordering cmp(unsigned int a, unsigned int b) {
    return a <=> b;
}
```

With Clang 18.1:

```llvm
; ====== LLVM IR ======
define i8 @cmp(i32 %a, i32 %b) {
entry:
  %cmp.lt = icmp ult i32 %a, %b
  %sel.lt = select i1 %cmp.lt, i8 -1, i8 1
  %cmp.eq = icmp eq i32 %a, %b
  %sel.eq = select i1 %cmp.eq, i8 0, i8 %sel.lt
  ret i8 %sel.eq
}

; ====== x86_64 assembly ======
cmp:
        xor     ecx, ecx
        cmp     edi, esi
        mov     eax, 0
        sbb     eax, eax
        or      al, 1
        cmp     edi, esi
        movzx   eax, al
        cmove   eax, ecx
        ret
```

With freshly built Clang:

```llvm
; ====== LLVM IR ======
define i8 @cmp(i32 %a, i32 %b) {
entry:
  %sel.eq = tail call i8 @llvm.ucmp.i8.i32(i32 %a, i32 %b)
  ret i8 %sel.eq
}

; ====== x86_64 assembly ======
cmp:
        cmp     edi, esi
        seta    al
        sbb     al, 0
        ret
```

As you can see, the number of instructions in the generated code has gone down considerably (from 8 to 3, excluding `ret`). Although this isn’t much and is a small synthetic test, it can still make a noticeable impact if code like this is found in a hot path somewhere. The impact of these changes on real-world code is much harder to quantify. Looking at llvm-opt-benchmark, there are quite a few places where the intrinsics are being used, which suggests that some improvement must have taken place, although it is unlikely to be significant in all but very few cases.

# Future Work

There are still many opportunities for optimization in the middle-end, some of which are already known and being worked on at the time of writing, while others are yet to be discovered.
I would also like to allow pointers and vectors of pointers as valid operands for the intrinsics, although that would be quite a minor change. In the backend I would also like to work on better handling of the intrinsics in GlobalISel, which is something I didn’t have enough time for and that other members of the LLVM community have helped me with.

# Acknowledgements

None of this would have been possible without my two amazing mentors, Nikita Popov and Dhruv Chawla, and the LLVM community as a whole. Thank you for helping me on this journey, and I am looking forward to working with you in the future.
blog.llvm.org
December 23, 2024 at 7:14 PM
GSoC 2024: ABI Lowering in ClangIR
ClangIR is an ongoing effort to build a high-level intermediate representation (IR) for C/C++ within the LLVM ecosystem. Its key advantage lies in its ability to retain more source code information. While ClangIR is making progress, it still lacks certain features, notably ABI handling. Currently, ClangIR lowers most functions without accounting for ABI-specific calling convention details.

## Goals

The “Build & Run SingleSource Benchmarks with ClangIR - Part 2” Google Summer of Code 2024 project builds on my contributions from GSoC 2023 by addressing one of the main issues I encountered: target-specific lowering. It focuses on extending ClangIR’s code generation capabilities, particularly ABI lowering for X86-64. Several tests rely on operations and types (e.g., `va_arg` calls and complex data types) that require target-specific information to compile correctly. The concrete steps to achieve this were:

1. **Implement foundational infrastructure** that can scale to multiple architectures while adhering to ClangIR design principles such as CodeGen parity, feature guarding, and AST backreferences.
2. **Handle basic calling convention scenarios** as a proof of concept to validate the foundational infrastructure.
3. **Add lowering for a second architecture** to further validate the infrastructure’s extensibility to multiple architectures.
4. **Unify target-specific ClangIR lowering into the library**, as there are a few isolated methods handling target-specific code lowering like `cir.va_arg`.
5. **Integrate calling convention lowering into the main pipeline** to ensure future contributions and continued development of this infrastructure.

## Contributions

The list of contributions (PRs) can be found here.

### Target Lowering Library

The most significant contribution of this project was the development of a modular `TargetLowering` library. This ensures that target-specific MLIR lowering passes can leverage this shared library for lowering logic.
The library also follows ClangIR’s feature guarding principles, ensuring that any contributor can refer to the original CodeGen for contributions, and any unimplemented feature is asserted at specific code points, making it easy to track missing functionality.

### Calling Convention Lowering Pass

As a proof of concept, the initial development of the `TargetLowering` library focused on implementing a calling convention lowering pass that targets multiple architectures. Currently, ClangIR ignores the target ABI during CodeGen to retain high-level information. For example, structs are not unraveled to improve argument-passing efficiency. ABI-specific LLVM attributes are also ignored. This pass addresses these issues by properly tagging LLVM attributes and rewriting function definitions and calls to handle unraveled structs. This was implemented for both X86-64 and AArch64, demonstrating the library’s multi-architecture support.

## Shortcomings

### Target-Specific Lowering Unification

While some target-specific lowering code was moved into the library, it was copied and pasted rather than properly integrated. This is not ideal for leveraging the library’s multi-architecture features.

### Inclusion in the Main Pipeline

This is still a work in progress, as the library is not yet mature enough to handle most pre-existing ClangIR tests. There are also feature guards with unreachable statements for many unimplemented features.

## Future Work

Now that there is a base infrastructure for handling target-agnostic to target-specific CIR code, there is a large amount of future work to be done, including:

* Improving DataLayout-related queries using MLIR’s built-in tools.
* Implementing calling convention lowering for additional types, such as pointers.
* Extending the TargetLowering library to support more architectures.
* Unifying remaining target-specific lowering code from other parts of ClangIR.
## Acknowledgements

I would like to thank my Google Summer of Code mentors, Bruno Cardoso Lopes and Nathan Lanza, for another great GSoC experience. I also want to thank the LLVM community and Google for organizing the program.
blog.llvm.org
December 23, 2024 at 7:14 PM
GSoC 2024: Statistical Analysis of LLVM-IR Compilation
Welcome! My name is Andrew and I contributed to LLVM through the 2024 Google Summer of Code program. My project is called Statistical Analysis of LLVM-IR Compilation. Its objective is to analyze how time is spent in the optimization pipeline. Generally, a drastic difference in the percentage of time spent by a pass in the pipeline is considered abnormal.

# Background

In principle, an LLVM IR bitcode file, or module, contains IR features that determine the behavior of the compiler optimization pipeline. Depending on these features, the optimization pipeline, opt, can add significantly to the compilation time or only marginally: an optimization may succeed in a microsecond, or the user may wait a few minutes. LLVM compiler developers constantly edit the pipeline, so the performance of these optimizations can vary by compiler version (sometimes significantly). Having a large IR dataset such as ComPile allows for testing the LLVM compilation pipeline on a varied sample of IR. The size of this sample is sufficient to determine outlying IR modules. By identifying and examining such files using utilities being added to the LLVM IR Dataset Utils Repo, the causes of unexpected compilation times can be determined. Developers can then modify and improve the compilation pipeline accordingly.

# Summary of Work

The utilities added in PR 37 write each IR module to a tar file corresponding to a programming language. Each file written to the tar files is indexed by its location in the HF dataset. This allows easy identification of files by tools used for data extraction and analysis in the shell, notably clang. Tar file creation potentially uses less storage space than downloading the HF dataset to disk, and it allows code to be written that does not depend on the Python interpreter to load the dataset.
The Makefile from PR 36 carries out the data collection. This data includes text segment size, user CPU instruction counts during compile time (analogous to time), IR feature counts sourced from the LLVM pass `print<func-properties>`, and the names and percentages of the passes with the maximum relative time. The data can be extracted in parallel or serially and is stored in a CSV file. An important data collection command in the Makefile is `clang -w -c -ftime-report $(lang)/bc_files/[email protected] -o /dev/null`. The output from the command is large, but the part of interest is the first `Pass execution timing report`:

```
===-------------------------------------------------------------------------===
                         Pass execution timing report
===-------------------------------------------------------------------------===
  Total Execution Time: 2.2547 seconds (2.2552 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   2.1722 ( 96.5%)   0.0019 ( 47.5%)   2.1741 ( 96.4%)   2.1745 ( 96.4%)  VerifierPass
   0.0726 (  3.2%)   0.0000 (  0.0%)   0.0726 (  3.2%)   0.0726 (  3.2%)  AlwaysInlinerPass
   0.0042 (  0.2%)   0.0015 ( 39.2%)   0.0058 (  0.3%)   0.0058 (  0.3%)  AnnotationRemarksPass
   0.0014 (  0.1%)   0.0005 ( 13.3%)   0.0019 (  0.1%)   0.0020 (  0.1%)  EntryExitInstrumenterPass
   0.0003 (  0.0%)   0.0000 (  0.0%)   0.0003 (  0.0%)   0.0003 (  0.0%)  CoroConditionalWrapper
   2.2507 (100.0%)   0.0039 (100.0%)   2.2547 (100.0%)   2.2552 (100.0%)  Total
```

A user can visualize the distribution of these passes by using a profiling tool for .json files. The .json file for a given bitcode file is obtained with `clang -c -ftime-trace <file>`. The visualization of this output can be filtered to the passes of interest, as in the following image:

The CoroConditionalWrapper pass is accounted for by the “Total CoroConditionalWrapper” block. Clearly, that pass takes a far smaller amount of time than the others, as accounted for by the pass execution timing report.
However, instead of seeing the pass as an insignificant percentage of time, the visualization allows for additional comparisons of the relative timings of each pass. The example image has the optimization passes of interest selected, but the .json file provides information on the entire compilation pipeline as well. Thus, the entire pipeline execution flow can be visualized.

# Current Status

Currently, there are three PRs that require approval to be merged. There has been ongoing discussion of their contents, so few steps should be left to merge them. In the current state, users of the utilities in PR 38 should be able to readily reproduce the quantitative results I obtained for my GSoC midterm presentation graphs. Users can also easily perform outlier analysis on the IR files (excluding Julia IR). Some of the results include the following:

Scatter plot of C IR files:

Table of outliers for C IR files:

# Future Work

It was discussed in PR 37 to consolidate the tar file creation into the dataset file writer Python script. This is a feature I wish to implement in order to speed up tar file creation by having the bitcode files written from memory to the tar instead of from memory, to disk, to tar. As mentioned, Julia IR was not analyzed; modifying the scripts to include Julia IR results is desirable to make complete use of the dataset. Adding additional documentation for demonstration purposes could help clarify ways to use the tools. Additionally, outlier analysis can be expanded by using more advanced outlier detection methods. Not all the data collected in the CSV files was used, so using those extra features, in particular the `print<func-properties>` pass, can allow for improved accuracy in outlier detection.

# Acknowledgements

I would like to thank my mentors Johannes Doerfert and Aiden Grossman for their constant support during and prior to the GSoC program.
Additionally, I would like to acknowledge the work of the LLVM Foundation admins and the GSoC admins.

# Links

* PR 38
* PR 37
* PR 36
* LLVM IR Dataset Utils Repo
* ComPile Dataset
blog.llvm.org
December 23, 2024 at 7:14 PM
GSoC 2024: Reviving NewGVN
This summer I participated in GSoC under the LLVM Compiler Infrastructure. The goal of the project was to improve the NewGVN pass so that it can replace GVN as the main value numbering pass in LLVM.

# Background

Global Value Numbering (GVN) consists of assigning value numbers such that instructions with the same value number are equivalent. NewGVN was introduced in 2016 to replace GVN. We now highlight a few aspects in which NewGVN is better than GVN.

A key advantage of NewGVN over GVN is that it is complete for loops, while GVN is only complete for acyclic code. NewGVN is complete for loops because, when it first processes a loop, it assumes that only the first iteration will be executed, later corroborating these assumptions; this is known as the optimistic assumption. In practice, the optimistic assumption boils down to assuming that backedges are unreachable and, consequently, that when evaluating phi instructions, the values carried by them can be ignored. For instance, in the example below, `%a` is optimistically evaluated to `0`. This leads to evaluating `%c` to `%x`, which in turn leads to evaluating `%a.i` to `0`. At this point, there are two possibilities: either the assumption was correct, the loop actually only executes once, and the value numbers computed so far are correct, or the instructions in the loop need to be reevaluated. Assume, for this example, that NewGVN could not prove that only one iteration is executed. Then `%a` once again evaluates to `0`, and all other registers also evaluate to the same values. Thanks to the optimistic assumption, we were able to discover that `%a` is loop-invariant and, moreover, equal to `0`.

```llvm
define i32 @optimistic(i32 %x, i32 %y) {
entry:
  br label %loop
loop:
  %a = phi i32 [0, %entry], [%a.i, %loop]
  ...
  %c = xor i32 %x, %a
  %a.i = sub i32 %x, %c
  br i1 ..., label %loop, label %exit
exit:
  ret i32 %a
}
```

On the other hand, GVN fails to detect this equivalence because it would pessimistically evaluate `%a` to itself, and the previously described evaluation steps would never take place.

Another advantage of NewGVN is the value numbering of memory operations using MemorySSA. It provides a functional view of memory where instructions that can modify memory produce a new memory version, which is then used by other memory operations. This greatly simplifies the detection of redundancies among memory operations. For example, two loads of the same type from equivalent pointers and memory versions are trivially equivalent.

```llvm
define i32 @foo(i32 %v, ptr %p) {
entry:
; 1 = MemoryDef(liveOnEntry)
  store i32 %v, ptr %p, align 4
; MemoryUse(1)
  %a = load i32, ptr %p, align 4
; MemoryUse(1)
  %b = load i32, ptr %p, align 4
; 2 = MemoryDef(1)
  call void @f(i32 %a)
; MemoryUse(2)
  %c = load i32, ptr %p, align 4
  %d = sub i32 %b, %c
  ret i32 %d
}
```

In the example above (annotated with MemorySSA), `%a` and `%b` are equivalent, while `%c` is not. All three loads are of the same type from the same pointer, but they don’t all load from the same memory state. Loads `%a` and `%b` load from the memory defined by the store (Memory `1`), while `%c` loads from the memory defined by the function call (Memory `2`). GVN can also detect these redundancies, but it relies on the more expensive and less general MemoryDependenceAnalysis.

Despite these and other improvements, NewGVN is still not widely used, mainly because it lacks partial redundancy elimination (PRE) and because it is bug-ridden.

# Implementing PRE

Our main contribution was the development of a PRE stage for NewGVN (found here). Our solution relied on generalizing Phi-of-Ops, which performs a special case of PRE where the instruction depends on a phi instruction and an equivalent value is available on every reaching path.
This is achieved in two steps: phi-translation and phi-insertion. Phi-translation consists of evaluating the original instruction in the context of each of its block’s predecessors. Phi operands are replaced by the value incoming from the predecessor. The value is available in the predecessor if the translated instruction is equivalent to a constant, a function argument, or another instruction that dominates the predecessor. Phi-insertion occurs after phi-translation if the value is available in every predecessor. At that point, a phi of the equivalent values is constructed and used to replace the original instruction. The full process is illustrated in the following example.

Our generalization eliminated the need for a dependent phi and introduced the ability to insert the missing values in cases where the instruction is partially redundant. To prevent increases in code size (ignoring the inserted phi instructions), the insertion is only made if it’s the only one required. The full process is illustrated in the following example.

Integrating PRE into the existing framework also allowed us to gain loop-invariant code motion (LICM) for free. The optimistic assumption, combined with PRE, allows NewGVN to speculatively hoist instructions out of loops. On the other hand, LICM in GVN relies on LoopInfo and can only handle very specific cases.

# Missing Features

The two main features our PRE implementation lacks are critical edge splitting and load coercion. Critical edge splitting is required to ensure that we do not insert instructions into paths where they won’t be used. Currently, our implementation simply bails in such cases. Load coercion allows us to detect equivalences of loaded values with different types, such as loads of `i32` and `float`, and then coerce the loaded type using conversion operations.
The difficulty in implementing these features is that NewGVN is designed to perform analysis and transformation in separate steps, while these features involve modifying the function during the analysis phase.

# Results

We evaluated our implementation using the automated benchmarking tool Phoronix Test Suite, from which we selected a set of 20 C/C++ applications (listed below).

| | | | |
|---|---|---|---|
| aircrack-ng | encode-flac | luajit | scimark2 |
| botan | espeak | mafft | simdjson |
| zstd | fftw | ngspice | sqlite-speedtest |
| crafty | john-the-ripper | quantlib | tjbench |
| draco | jpegxl | rnnoise | graphics-magick |

The default `-O2` pipeline was used. The only change between compilations was the value numbering pass used. Despite the missing features, we observed that our implementation, on average, performs 0.4% better than GVN. However, it is important to mention that our solution hasn’t been fine-tuned to consider the rest of the optimization pipeline, which resulted in some cases where our implementation regressed compared to both GVN and the existing NewGVN. The most severe case was jpegxl, where our implementation, on average, performed 10% worse than GVN. It’s important to note that this was an outlier; excluding jpegxl, most regressions were at most 2%. Unfortunately, due to time constraints, we were unable to study these cases in more detail.

# Future Work

In the future, we plan to implement the aforementioned missing features and fine-tune the heuristics for when to perform PRE to prevent the regressions discussed in the results section. Once these issues are addressed, we’ll upstream our implementation, bringing us a step closer to reviving NewGVN.
blog.llvm.org
December 23, 2024 at 7:14 PM
GSoC 2024: Compile GPU kernels using ClangIR
Hello everyone! I’m 7mile. My GSoC project this summer is Compile GPU kernels using ClangIR. It’s been an exciting journey in compiler development, and I’m thrilled to share the progress and insights gained along the way here.

# Background

The ClangIR project aims to establish a new IR for Clang, built on top of MLIR. As part of the ongoing effort to support heterogeneous programming models, this project focuses on integrating OpenCL C language support into ClangIR. The ultimate goal is to enable the compilation of GPU kernels written in OpenCL C into LLVM IR targeting the SPIR-V architecture, laying the groundwork for future enhancements in SYCL and CUDA support.

# What We Did

Our work involved several key areas:

1. **Address Space Support**: One of the fundamental tasks was teaching ClangIR to handle address spaces, a vital feature for languages like OpenCL. Initially, we considered mimicking LLVM’s approach, but this proved inadequate for ClangIR’s goals. After thorough discussion and an RFC, we implemented a unified address space design that aligns with ClangIR’s objectives, ensuring a clean and maintainable code structure.
2. **OpenCL Language and SPIR-V Target Integration**: We extended ClangIR to support the OpenCL language and the SPIR-V target. This involved enhancing the pipeline to accommodate the latest OpenCL 3.0 specification and implementing hooks for language-specific and target-specific customizations.
3. **Vector Type Support**: OpenCL vector types, a critical feature for GPU programming, were integrated into ClangIR. We leveraged ClangIR’s existing `cir.vector` type to generate the necessary code, ensuring consistent compilation results.
4. **Kernel and Module Metadata Emission**: We added support for emitting OpenCL kernel and module metadata in ClangIR, a necessary step for proper integration with the SPIR-V target.
This included the creation of structured attributes to represent metadata, following MLIR’s preference for well-defined structures.

5. **Global and Static Variables with Qualifiers**: We implemented support for global and static variables with qualifiers like `global`, `constant`, and `local`, ensuring that these constructs are correctly represented and lowered in the ClangIR pipeline.
6. **Calling Conventions**: We adjusted the calling conventions in ClangIR to align with SPIR-V requirements, migrating from the default `cdecl` to SPIR-V-specific conventions like `SpirKernel` and `SpirFunction`. This also enables most OpenCL built-in functions, like `barrier` and `get_global_id`.
7. **User Experience Enhancements**: Finally, we ensured that the end-to-end kernel compilation experience using ClangIR was smooth and intuitive, with minimal manual intervention required.

# Results

The project successfully met its primary goals. OpenCL kernels from the Polybench-GPU benchmark suite can now be compiled using ClangIR into LLVM IR for SPIR-V. All patches have been merged into the main ClangIR repository, and the project’s progress has been well documented in the overview issue. I believe the work not only advanced OpenCL support but also laid a solid foundation for future enhancements, such as SYCL and CUDA support in ClangIR. We have successfully compiled and executed all 20 OpenCL C benchmarks from the polybenchGpu repository, passing the built-in result validation. Please refer to our artifact evaluation repository for detailed instructions on how to experiment with our work.

# Future Work

Looking forward, there are two key areas that require further development:

1. **Function Attribute Consistency**: For example, the `convergent` function attribute is crucial for preventing misoptimizations in SIMT languages like OpenCL. ClangIR currently lacks this attribute, which could lead to issues in parallel computing contexts.
Addressing this is a priority to ensure correct optimization behavior.

2. **Support for OpenCL Built-in Types**: Another critical area for future work is support for OpenCL built-in types, such as `pipe` and `image`. These types are essential for handling data streams and image processing tasks in various specialized OpenCL applications. Supporting them will significantly enhance ClangIR’s adherence to the OpenCL standard, broadening its applicability and ensuring better compatibility with a wide range of OpenCL programs.

# Acknowledgements

This project would not have been possible without the guidance and support of the LLVM community. I extend my deepest gratitude to my mentors, Julian Oppermann, Victor Lomüller, and Bruno Cardoso Lopes, whose expertise and encouragement were instrumental throughout this journey. Additionally, I would like to thank Vinicius Couto Espindola for his collaboration on ABI-related work. This experience has been immensely rewarding, both technically and in terms of community engagement.

# Appendix

* Overview issue of OpenCL C support
* Artifact Evaluation Instructions
blog.llvm.org
December 23, 2024 at 7:14 PM
GSoC 2024: Half-precision in LLVM libc
C23 defines new floating-point types, such as `_Float16`, which corresponds to the binary16 format from IEEE Std 754, also known as “half-precision,” or FP16. C23 also defines new variants of the C standard library’s math functions accordingly, such as `fabsf16` to get the absolute value of a `_Float16`.

The “Half-precision in LLVM libc” Google Summer of Code 2024 project aimed to implement these new `_Float16` math functions in LLVM libc, making it the first known C standard library implementation to implement these C23 functions. We split math functions into two categories: basic operations and higher math functions. The current implementation status of math functions in LLVM libc can be viewed at https://libc.llvm.org/math/index.html#implementation-status.

The exact goals of this project were to:

1. Set up generated headers properly so that the `_Float16` type and `_Float16` functions can be used with various compilers and architectures.
2. Add generic implementations of `_Float16` basic operations for supported architectures.
3. Add optimized implementations of `_Float16` basic operations for specific architectures using special hardware instructions and compiler builtins whenever possible.
4. Add generic implementations of as many `_Float16` higher math functions as possible. We knew we would not have enough time to implement all of them.

## Work done

1. The `_Float16` type can now be used in generated headers, and declarations of `_Float16` math functions are generated with `#ifdef` guards to enable them when they are supported.
   * https://github.com/llvm/llvm-project/pull/93567
2. All 70 planned `_Float16` basic operations have been merged.
   * https://github.com/llvm/llvm-project/issues/93566
3. The `_Float16`, `float` and `double` variants of various basic operations have been optimized on certain architectures.
   * https://github.com/llvm/llvm-project/pull/98376
   * https://github.com/llvm/llvm-project/pull/99037
   * https://github.com/llvm/llvm-project/pull/100002
4.
Out of the 54 planned `_Float16` higher math functions, 8 have been merged and 9 have an open pull request.
   * https://github.com/llvm/llvm-project/issues/95250

We ran into unexpected issues, such as:

* Bugs in Clang 11, which is currently still supported by LLVM libc and used in post-commit CI.
* Some post-commit CI workers having old versions of compiler runtimes that are missing some floating-point conversion functions on certain architectures.
* Inconsistent behavior of floating-point conversion functions across compiler runtime vendors (GCC’s libgcc and LLVM’s compiler-rt) and CPU architectures.

Due to these issues, LLVM libc currently only enables all `_Float16` functions on x86-64 Linux. Some were disabled on AArch64 due to Clang 11 bugs, and all were disabled on 32-bit Arm and on RISC-V due to issues with compiler runtimes. Some are not available on GPUs because they take `_Float128` arguments, and the `_Float128` type is not available on GPUs. There is work in progress to work around issues with compiler runtimes by using our own floating-point conversion functions.

## Work left to do

* Implement the remaining `_Float16` higher math functions.
* Enable the `_Float16` math functions that are disabled on AArch64 once LLVM libc bumps its minimum supported Clang version.
* Enable `_Float16` math functions on 32-bit Arm and on RISC-V once issues with compiler runtimes are resolved.

## Acknowledgements

I would like to thank my Google Summer of Code mentors, Tue Ly and Joseph Huber, as well as other LLVM maintainers I interacted with, for their help. I would also like to thank Google for organizing this program.
blog.llvm.org
December 10, 2024 at 7:33 PM
GSoC 2024: GPU Libc Benchmarking
Hey everyone! My name is James and I worked on LLVM this summer through GSoC. My project is called GPU Libc Benchmarking. Its main objective was to develop microbenchmarking infrastructure for libc on the GPU.

# Background

The LLVM libc project was designed as an alternative to glibc that aims to be modular, configurable, and sanitizer-friendly. Currently, LLVM libc is being ported to Nvidia and AMD GPUs to provide libc functionality (e.g. printf(), malloc(), and math functions) on the GPU. As of March 2024, programs can use GPU libc in offloading languages (CUDA, OpenMP) or through direct compilation and linking with the libc library.

# What We Did

During this project, we developed a microbenchmarking framework that is directly compiled for and run on the GPU, using libc functions to display output to the user. As this was a short project (90 hours), we mostly focused on developing the infrastructure and writing a few example usages (isalnum(), isalpha(), and sin()). Our benchmarking infrastructure is based on Google Benchmark and measures the average cycles, minimum, maximum, and standard deviation of each benchmark. Each benchmark is run for multiple iterations to stabilize the results. Benchmark writers can measure against vendor implementations of libc functions by passing specific linking flags to the benchmark’s CMake portion and registering the corresponding vendor function from the benchmark itself.
Below is an example of our benchmarking infrastructure’s output for `sinf()`:

| Benchmark | Cycles | Min | Max | Iterations | Time / Iteration | Stddev | Threads |
|---|---|---|---|---|---|---|---|
| Sinf_1 | 764 | 369 | 2101 | 273 | 7 us | 323 | 32 |
| Sinf_128 | 721 | 699 | 744 | 5 | 913 us | 16 | 32 |
| Sinf_1024 | 661 | 650 | 689 | 9 | 7 ms | 31 | 32 |
| Sinf_4096 | 666 | 663 | 669 | 5 | 28 ms | 28 | 32 |
| SinfTwoPi_1 | 372 | 369 | 632 | 70 | 7 us | 39 | 32 |
| SinfTwoPi_128 | 379 | 379 | 379 | 4 | 895 us | 0 | 32 |
| SinfTwoPi_1024 | 335 | 335 | 338 | 5 | 7 ms | 20 | 32 |
| SinfTwoPi_4096 | 335 | 335 | 335 | 4 | 28 ms | 0 | 32 |
| SinfTwoPow30_1 | 371 | 369 | 510 | 70 | 7 us | 17 | 32 |
| SinfTwoPow30_128 | 379 | 379 | 379 | 4 | 894 us | 0 | 32 |
| SinfTwoPow30_1024 | 335 | 335 | 338 | 5 | 7 ms | 20 | 32 |
| SinfTwoPow30_4096 | 335 | 335 | 335 | 4 | 28 ms | 0 | 32 |
| SinfVeryLarge_1 | 477 | 369 | 632 | 70 | 7 us | 58 | 32 |
| SinfVeryLarge_128 | 487 | 480 | 493 | 5 | 900 us | 14 | 32 |
| SinfVeryLarge_1024 | 442 | 440 | 447 | 5 | 7 ms | 18 | 32 |
| SinfVeryLarge_4096 | 441 | 441 | 442 | 4 | 28 ms | 14 | 32 |

Users can register benchmarks similarly to Google Benchmark, using a macro:

```cpp
uint64_t BM_IsAlnumCapital() {
  char x = 'A';
  return LIBC_NAMESPACE::latency(LIBC_NAMESPACE::isalnum, x);
}
BENCHMARK(LlvmLibcIsAlNumGpuBenchmark, IsAlnumCapital, BM_IsAlnumCapital);
```

# Results

This project met its major goal of creating microbenchmarking infrastructure for the GPU. The original scope of the proposal also included a CPU component that would use vendor tools to measure GPU kernel properties; however, this was removed after discussion with the mentors, as offloading specific kernels to the GPU posed technical obstacles that would have required major changes to other parts of the code.

# Future Work

As this was a short project (90 hours), we only focused on implementing the microbenchmarking infrastructure.
Future contributors can use the benchmarking infrastructure to add additional benchmarks. In addition, there are improvements to the microbenchmarking infrastructure itself that could be added, such as more options for user input ranges, better random distributions for math functions, and a CPU element that can launch multiple kernels and compare results against functions running on the CPU. The existing code can be found in the LLVM repo.

# Acknowledgements

This project would not have been possible without my amazing mentor, Joseph Huber, the LLVM Foundation admins, and the GSoC admins.

# Links

* Landed PRs
* LLVM GitHub
* LLVM Homepage
* GSoC Project Page