#HumanEval
Karpathy: nanochat

A small training+inference pipeline for creating your own LLM from scratch

$100 will get you a somewhat functional model

$1000 is more coherent & solves math

detailed walkthrough: github.com/karpathy/nan...

repo: github.com/karpathy/nan...
October 13, 2025 at 6:06 PM
Salesforce released the open weights of CoDA-1.7B: a text-diffusion coding model that generates tokens bidirectionally and in parallel.

⚡️ Faster inference, 1.7B rivaling 7B.
📊 54.3% HumanEval | 47.6% HumanEval+ | 55.4% EvalPlus

Model: huggingface.co/Salesforce/C...
Report: github.com/SalesforceAI...
Salesforce/CoDA-v0-Instruct · Hugging Face
huggingface.co
October 5, 2025 at 2:18 PM
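A rough sketch of what "generates tokens bidirectionally and in parallel" means for diffusion-style decoding: start from a fully masked window and repeatedly commit the most confident positions anywhere in it, instead of sampling strictly left to right. This is a generic illustration, not CoDA's actual sampler; `model_logits` below is a hypothetical stand-in for the denoiser.

```python
import numpy as np

MASK = -1  # hypothetical id for the [MASK] token

def model_logits(tokens: np.ndarray) -> np.ndarray:
    """Stand-in denoiser: per-position logits over a toy vocabulary.
    A real diffusion LM would condition on the whole (partially masked)
    sequence in both directions."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(tokens), 100))  # (seq_len, vocab)

def diffusion_decode(seq_len: int = 16, steps: int = 4) -> np.ndarray:
    tokens = np.full(seq_len, MASK)
    per_step = seq_len // steps
    for _ in range(steps):
        logits = model_logits(tokens)
        probs = np.exp(logits - logits.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)
        conf, pred = probs.max(-1), probs.argmax(-1)
        conf[tokens != MASK] = -np.inf          # only fill still-masked slots
        fill = np.argsort(conf)[-per_step:]     # most confident positions anywhere
        tokens[fill] = pred[fill]               # commit several tokens at once
    return tokens

print(diffusion_decode())
```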
Qwen3-235B-A22B matches or outperforms DeepSeek R1 and Gemini 1.5 Pro across MATH, GSM8K, HumanEval+, and AGIEval with just 22B active parameters.

Top-tier results without needing a 70B+ footprint.
April 29, 2025 at 4:16 AM
Evaluation Benchmarks for Code LLMs

Popular benchmarks like HumanEval, MBPP, and MCEVAL test how well code LLMs generate and understand code across languages. Lua is a strong candidate for evaluating low-resource performance due to its niche status and balanced complexity.

#hackernews #llm #news
hackernoon.com
June 3, 2025 at 1:15 AM
Oh, I don't think they do. They run benchmarks! The benchmarks aren't deterministic at all. They use stuff like HumanEval to test code generation, MMLU for language understanding... which means the models tend to overfit the evals and datasets.
July 10, 2024 at 8:55 PM
better on pass@1 of the HumanEval benchmark. [7/7 of https://arxiv.org/abs/2505.23878v1]
June 2, 2025 at 6:00 AM
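For reference, the pass@1 and pass@k numbers quoted across these posts are normally computed with the unbiased estimator from the original HumanEval paper: draw n samples per problem, count the c that pass the unit tests, and estimate the probability that at least one of k drawn samples passes. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples per task, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 200 samples per problem, 37 passing -> estimated pass@1 ≈ 0.185
print(pass_at_k(n=200, c=37, k=1))
```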
PerfOrch, a performance-guided orchestration framework, hits 96.22% correctness on HumanEval-X and 91.37% on EffiBench-X, beating GPT-4o’s 78.66% and 49.11%. https://getnews.me/multi-model-orchestration-boosts-code-generation-accuracy-and-speed/ #perforch #codegeneration #multimodel
October 3, 2025 at 7:49 PM
Show HN: Beating GPT-4 on HumanEval with a fine-tuned CodeLlama-34B (phind.com)

Main Link | HN Post
August 25, 2023 at 10:15 PM
with the surrounding context. Evaluating our fine-tuned Qwen2.5-Coder (base and instruct) models on the HumanEval Infilling and SAFIM benchmarks demonstrates improved performance without post-processing, especially when the middle consist [4/5 of https://arxiv.org/abs/2505.18789v1]
May 27, 2025 at 6:00 AM
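HumanEval Infilling and SAFIM are fill-in-the-middle benchmarks: the model sees the code before and after a hole and must produce only the missing span. A minimal sketch of how such a prompt is usually assembled with FIM sentinel tokens; the exact sentinel strings differ per model (Qwen2.5-Coder and DeepSeek-Coder each define their own), so treat the ones below as placeholders and check the tokenizer you actually use.

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Prefix-suffix-middle (PSM) layout: the model is expected to
    generate the missing middle span after the final sentinel."""
    return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

prefix = "def fib(n):\n    if n < 2:\n        return n\n"
suffix = "\n    return a\n"
prompt = build_fim_prompt(prefix, suffix)
# completion = model.generate(prompt, stop=["<|endoftext|>"])  # hypothetical call
print(prompt)
```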
Qwen 2.5 Coder and Qwen 3 Lead in Open Source LLM Over DeepSeek and Meta

Qwen 2.5 Coder/Max is currently the top open-source model for coding, with the highest HumanEval (~70–72%), LiveCodeBench (70.7), and Elo (2056) scores among open models. DeepSee...

www.nextbigfuture.com
May 21, 2025 at 3:39 PM
improved multiple state-of-the-art LLMs, e.g., 17.1% pass@1 improvement for code-davinci-002 on HumanEval. Furthermore, we conducted a user study with 12 participants in comparison to two baselines: (1) interacting with GitHub Copilot, and (2) [5/7 of https://arxiv.org/abs/2505.07768v1]
May 13, 2025 at 6:05 AM
The benchmark results are stunning 📊

• Multi-step reasoning (GPQA, MMLU)
• Programming (HumanEval)

And all of this with a 200K-token context window (≈ 150,000 words), roughly the length of an entire novel that the AI can analyze in one go!
May 14, 2025 at 5:19 PM
Experiments on code generation and commonsense reasoning benchmarks show that GraLoRA consistently outperforms LoRA and other baselines, achieving up to +8.5% absolute gain in Pass@1 on HumanEval+. These improvements hold across model sizes and rank [5/6 of https://arxiv.org/abs/2505.20355v1]
May 28, 2025 at 5:58 AM
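For context on the baseline being beaten: plain LoRA freezes a pretrained weight matrix and learns a low-rank additive update B·A scaled by alpha/r. The sketch below is standard LoRA (assuming PyTorch), not GraLoRA's granular variant.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight W plus a trainable low-rank update (alpha/r) * B @ A."""
    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)               # freeze the pretrained weight
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, r))  # zero-init: update starts at zero
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(512, 512)
print(layer(torch.randn(2, 512)).shape)  # torch.Size([2, 512])
```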
AdaDec, an uncertainty‑aware decoding framework, boosts LLM code generation accuracy, achieving up to a 20.9% absolute gain in Pass@1 on HumanEval+, MBPP+ and DevEval benchmarks. https://getnews.me/adaptive-decoding-with-uncertainty-guidance-boosts-llm-code-generation/ #adadec #llm #codegeneration
September 23, 2025 at 12:06 AM
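The general idea behind uncertainty-aware decoding: decode greedily while token-level uncertainty (e.g., entropy of the next-token distribution) stays low, and switch to a more careful path, such as re-ranking a few candidates, when it spikes. The threshold and the re-ranking hook below are illustrative assumptions, not AdaDec's actual algorithm.

```python
import math

def entropy(probs: list[float]) -> float:
    """Shannon entropy of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def decode_step(probs: list[float], threshold: float = 1.0) -> int:
    """Greedy pick when the model is confident; otherwise consider the
    top few candidates (here we still return the argmax, but a real system
    could score each continuation with lookahead or a verifier)."""
    if entropy(probs) < threshold:
        return max(range(len(probs)), key=probs.__getitem__)
    top = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:3]
    # hypothetical re-ranking hook over `top` would go here
    return top[0]

print(decode_step([0.9, 0.05, 0.03, 0.02]))   # low entropy -> greedy path
print(decode_step([0.3, 0.3, 0.2, 0.2]))      # high entropy -> gated path
```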
Instruction‑aware Fill‑in‑the‑Middle (IFIM) raises Pass@1 on HumanEval‑infilling from 84.6% to 93.6% for Deepseek‑Coder and Qwen2.5‑Coder, reported in September 2025. https://getnews.me/instruction-aware-fill-in-the-middle-boosts-code-completion-performance/ #ifim #codecompletion #deepseekcoder
September 30, 2025 at 10:45 PM
One of the most commonly used benchmarks is MMLU, which is multiple-choice questions about knowledge across lots of fields. Others include writing programs in Python (HumanEval), solving reasoning puzzles that involve recognizing patterns (ARC-AGI), or even taking standardized tests like the LSAT.
January 27, 2025 at 4:53 PM
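Concretely, a HumanEval problem is a Python signature plus a docstring; the model writes the body, and the sample only counts if the benchmark's hidden unit tests pass when the code is executed. A toy problem in that shape (not one of the 164 real tasks), with a simplified check that mirrors the harness idea:

```python
# Prompt given to the model (signature + docstring only):
PROMPT = '''
def running_max(xs: list[int]) -> list[int]:
    """Return a list where element i is the maximum of xs[:i+1]."""
'''

# A candidate completion the model might return:
COMPLETION = '''
    out, best = [], float("-inf")
    for x in xs:
        best = max(best, x)
        out.append(best)
    return out
'''

def check(prompt: str, completion: str) -> bool:
    """Execute prompt + completion, then run a hidden test, HumanEval-style."""
    namespace: dict = {}
    exec(prompt + completion, namespace)
    candidate = namespace["running_max"]
    assert candidate([3, 1, 4, 1, 5]) == [3, 3, 4, 4, 5]
    return True

print(check(PROMPT, COMPLETION))  # True -> this sample counts toward pass@1
```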
50% on HumanEval with just 1.3B model (twitter.com)

Main Link | HN Post
June 21, 2023 at 5:30 AM
LoRA), allowing us to assess their repair capabilities on two APR benchmarks: HumanEval-Java and Defects4J.
Our results show that by using only a fraction (<1%) of the fine-tuning dataset, we can achieve improvements of up to 78% in the number of [5/8 of https://arxiv.org/abs/2505.02931v1]
May 7, 2025 at 6:00 AM
🌐 DeepSeek-V3 is THAT model.

Open-source.
671B total params (only 37B active per token).
FP8 optimized.
Beats GPT-4o & Claude 3.5 in:
✅ MMLU
✅ HumanEval
✅ DROP
✅ Math Reasoning
✅ Chinese C-Eval

🧠 Full deep dive report → deepseekagi.org/deepseek-v3-...

#DeepSeekV3 #OpenSourceLLM #GPT4 #Claude3
DeepSeek‑V3: Architecture, Performance, and Deployment - DeepSeek AGI
DeepSeek‑V3 is a Mixture-of-Experts (MoE) Transformer with 671 billion total parameters (only ~37B “active” per token)
deepseekagi.org
May 5, 2025 at 9:39 AM
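The "671B total, ~37B active" figure is a consequence of Mixture-of-Experts routing: each token is dispatched to only a few experts, so only that slice of the parameters participates in a given forward pass. A toy top-k router showing the idea; the dimensions and k below are made up, not DeepSeek-V3's configuration.

```python
import numpy as np

def moe_forward(x: np.ndarray, experts: list, gate: np.ndarray, k: int = 2) -> np.ndarray:
    """Route one token vector to its top-k experts and mix their outputs."""
    scores = x @ gate                                   # (num_experts,)
    top = np.argsort(scores)[-k:]                       # indices of the chosen experts
    weights = np.exp(scores[top]) / np.exp(scores[top]).sum()
    # Only k of the expert weight matrices are touched for this token.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d, num_experts = 8, 16
experts = [rng.normal(size=(d, d)) for _ in range(num_experts)]
gate = rng.normal(size=(d, num_experts))
print(moe_forward(rng.normal(size=d), experts, gate).shape)  # (8,)
```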
LionW outperforms AdamW in both LoRA and full fine-tuning for code models, showing stronger results across learning rates in HumanEval and related tasks. #llmfinetuning
LionW Outperforms AdamW in LoRA and Full Fine-Tuning Tasks
hackernoon.com
June 18, 2025 at 5:00 AM
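For reference, Lion (LionW adds decoupled weight decay, as in AdamW) replaces Adam's second-moment scaling with the sign of an interpolated momentum. A minimal single-step sketch of the published update rule; the hyperparameter values are only illustrative.

```python
import numpy as np

def lion_step(theta, m, grad, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.01):
    """One LionW step: sign of the interpolated momentum plus decoupled weight decay."""
    update = np.sign(beta1 * m + (1 - beta1) * grad)
    theta = theta - lr * (update + weight_decay * theta)    # decoupled decay, AdamW-style
    m = beta2 * m + (1 - beta2) * grad                      # momentum updated after the step
    return theta, m

theta, m = np.ones(4), np.zeros(4)
theta, m = lion_step(theta, m, grad=np.array([0.1, -0.2, 0.0, 0.3]))
print(theta, m)
```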
The Dark Side Of AI: Reliability, Safety, And Security In Code Generation

This section argues that traditional benchmarks like HumanEval and MBPP are insufficient. We explore the nuanced challenges in evaluating AI-generated code for readability, completeness, and the presence of errors and security vulnerabilities.

#hackernews #news
hackernoon.com
August 6, 2025 at 12:55 AM
LLMs (GPT-4o-mini and DeepSeek-R1), comparing MoT to six baseline prompting techniques across six widely used datasets, HumanEval, HumanEval-ET, HumanEval+, MBPP, MBPP-ET, and MBPP+, demonstrate that MoT significantly outperforms existing baselines [6/7 of https://arxiv.org/abs/2503.12483v1]
March 18, 2025 at 6:00 AM
generation. Extensive experimental results demonstrate that RPG substantially outperforms the best-performing baselines on the CodeRepetEval dataset as well as the HumanEval and MBPP benchmarks, effectively reducing repetitions and enhancing the quality of [6/7 of https://arxiv.org/abs/2505.10402v1]
May 16, 2025 at 6:00 AM