#HumanEval
Karpathy: nanochat

A small training+inference pipeline for creating your own LLM from scratch

$100 will get you a somewhat functional model

$1000 is more coherent & solves math

detailed walkthrough: github.com/karpathy/nan...

repo: github.com/karpathy/nan...
October 13, 2025 at 6:06 PM
Salesforce released the open weights of CoDA-1.7B: a text-diffusion coding model that generates tokens bidirectionally and in parallel.

⚡️ Faster inference, 1.7B rivaling 7B.
📊 54.3% HumanEval | 47.6% HumanEval+ | 55.4% EvalPlus

Model: huggingface.co/Salesforce/C...
Report: github.com/SalesforceAI...
Salesforce/CoDA-v0-Instruct · Hugging Face
huggingface.co
October 5, 2025 at 2:18 PM
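A rough sketch of what "generates tokens bidirectionally and in parallel" means for diffusion-style decoding: start from a fully masked window and repeatedly commit the most confident positions anywhere in it, instead of sampling strictly left to right. This is a generic illustration, not CoDA's actual sampler; `model_logits` below is a hypothetical stand-in for the denoiser.

```python
import numpy as np

MASK = -1  # hypothetical id for the [MASK] token

def model_logits(tokens: np.ndarray) -> np.ndarray:
    """Stand-in denoiser: per-position logits over a toy vocabulary.
    A real diffusion LM would condition on the whole (partially masked)
    sequence in both directions."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(tokens), 100))  # (seq_len, vocab)

def diffusion_decode(seq_len: int = 16, steps: int = 4) -> np.ndarray:
    tokens = np.full(seq_len, MASK)
    per_step = seq_len // steps
    for _ in range(steps):
        logits = model_logits(tokens)
        probs = np.exp(logits - logits.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)
        conf, pred = probs.max(-1), probs.argmax(-1)
        conf[tokens != MASK] = -np.inf          # only fill still-masked slots
        fill = np.argsort(conf)[-per_step:]     # most confident positions anywhere
        tokens[fill] = pred[fill]               # commit several tokens at once
    return tokens

print(diffusion_decode())
```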
Qwen3-235B-A22B matches or outperforms DeepSeek R1 and Gemini 1.5 Pro across MATH, GSM8K, HumanEval+, and AGIEval with just 22B active parameters.

Top-tier results without needing a 70B+ footprint.
April 29, 2025 at 4:16 AM
Evaluation Benchmarks for Code LLMs

Popular benchmarks like HumanEval, MBPP, and MCEVAL test how well code LLMs generate and understand code across languages. Lua is a strong candidate for evaluating low-resource performance due to its niche status and balanced complexity.

#hackernews #llm #news
hackernoon.com
June 3, 2025 at 1:15 AM
Oh, I don't think they do. They run benchmarks! The benchmarks aren't deterministic at all. They use stuff like HumanEval to test code generation, MMLU for language understanding... which means the models tend to overfit the evals and datasets.
July 10, 2024 at 8:55 PM
better on pass@1 of the HumanEval benchmark. [7/7 of https://arxiv.org/abs/2505.23878v1]
June 2, 2025 at 6:00 AM
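For reference, the pass@1 and pass@k numbers quoted across these posts are normally computed with the unbiased estimator from the original HumanEval paper: draw n samples per problem, count the c that pass the unit tests, and estimate the probability that at least one of k drawn samples passes. A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples per task, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 200 samples per problem, 37 passing -> estimated pass@1 ≈ 0.185
print(pass_at_k(n=200, c=37, k=1))
```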
PerfOrch, a performance-guided orchestration framework, hits 96.22% correctness on HumanEval-X and 91.37% on EffiBench-X, beating GPT-4o’s 78.66% and 49.11%. https://getnews.me/multi-model-orchestration-boosts-code-generation-accuracy-and-speed/ #perforch #codegeneration #multimodel
October 3, 2025 at 7:49 PM
Show HN: Beating GPT-4 on HumanEval with a fine-tuned CodeLlama-34B (phind.com)

Main Link | HN Post
August 25, 2023 at 10:15 PM
with the surrounding context. Evaluating our fine-tuned Qwen2.5-Coder (base and instruct) models on the HumanEval Infilling and SAFIM benchmarks demonstrates improved performance without post-processing, especially when the middle consist [4/5 of https://arxiv.org/abs/2505.18789v1]
May 27, 2025 at 6:00 AM
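HumanEval Infilling and SAFIM are fill-in-the-middle benchmarks: the model sees the code before and after a hole and must produce only the missing span. A minimal sketch of how such a prompt is usually assembled with FIM sentinel tokens; the exact sentinel strings differ per model (Qwen2.5-Coder and DeepSeek-Coder each define their own), so treat the ones below as placeholders and check the tokenizer you actually use.

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Prefix-suffix-middle (PSM) layout: the model is expected to
    generate the missing middle span after the final sentinel."""
    return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

prefix = "def fib(n):\n    if n < 2:\n        return n\n"
suffix = "\n    return a\n"
prompt = build_fim_prompt(prefix, suffix)
# completion = model.generate(prompt, stop=["<|endoftext|>"])  # hypothetical call
print(prompt)
```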
Qwen 2.5 Coder and Qwen 3 Lead in Open Source LLM Over DeepSeek and Meta

Qwen 2.5 Coder/Max is currently the top open-source model for coding, with the highest HumanEval (~70–72%), LiveCodeBench (70.7), and Elo (2056) scores among open models. DeepSee...

www.nextbigfuture.com
May 21, 2025 at 3:39 PM
improved multiple state-of-the-art LLMs, e.g., 17.1% pass@1 improvement for code-davinci-002 on HumanEval. Furthermore, we conducted a user study with 12 participants in comparison to two baselines: (1) interacting with GitHub Copilot, and (2) [5/7 of https://arxiv.org/abs/2505.07768v1]
May 13, 2025 at 6:05 AM
The benchmark results are stunning 📊

• Multi-step reasoning (GPQA, MMLU)
• Programming (HumanEval)

And all of this with a 200K-token context window (≈ 150,000 words), roughly the length of an entire novel that the AI can analyze in one go!
May 14, 2025 at 5:19 PM
Experiments on code generation and commonsense reasoning benchmarks show that GraLoRA consistently outperforms LoRA and other baselines, achieving up to +8.5% absolute gain in Pass@1 on HumanEval+. These improvements hold across model sizes and rank [5/6 of https://arxiv.org/abs/2505.20355v1]
May 28, 2025 at 5:58 AM
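For context on the baseline being beaten: plain LoRA freezes a pretrained weight matrix and learns a low-rank additive update B·A scaled by alpha/r. The sketch below is standard LoRA (assuming PyTorch), not GraLoRA's granular variant.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight W plus a trainable low-rank update (alpha/r) * B @ A."""
    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)               # freeze the pretrained weight
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, r))  # zero-init: update starts at zero
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(512, 512)
print(layer(torch.randn(2, 512)).shape)  # torch.Size([2, 512])
```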
AdaDec, an uncertainty‑aware decoding framework, boosts LLM code generation accuracy, achieving up to a 20.9% absolute gain in Pass@1 on HumanEval+, MBPP+ and DevEval benchmarks. https://getnews.me/adaptive-decoding-with-uncertainty-guidance-boosts-llm-code-generation/ #adadec #llm #codegeneration
September 23, 2025 at 12:06 AM
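The general idea behind uncertainty-aware decoding: decode greedily while token-level uncertainty (e.g., entropy of the next-token distribution) stays low, and switch to a more careful path, such as re-ranking a few candidates, when it spikes. The threshold and the re-ranking hook below are illustrative assumptions, not AdaDec's actual algorithm.

```python
import math

def entropy(probs: list[float]) -> float:
    """Shannon entropy of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def decode_step(probs: list[float], threshold: float = 1.0) -> int:
    """Greedy pick when the model is confident; otherwise consider the
    top few candidates (here we still return the argmax, but a real system
    could score each continuation with lookahead or a verifier)."""
    if entropy(probs) < threshold:
        return max(range(len(probs)), key=probs.__getitem__)
    top = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:3]
    # hypothetical re-ranking hook over `top` would go here
    return top[0]

print(decode_step([0.9, 0.05, 0.03, 0.02]))   # low entropy -> greedy path
print(decode_step([0.3, 0.3, 0.2, 0.2]))      # high entropy -> gated path
```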
Instruction‑aware Fill‑in‑the‑Middle (IFIM) raises Pass@1 on HumanEval‑infilling from 84.6% to 93.6% for Deepseek‑Coder and Qwen2.5‑Coder, reported in September 2025. https://getnews.me/instruction-aware-fill-in-the-middle-boosts-code-completion-performance/ #ifim #codecompletion #deepseekcoder
September 30, 2025 at 10:45 PM
One of the most commonly used benchmarks is MMLU, which is multiple-choice questions about knowledge across lots of fields. Others include writing programs in Python (HumanEval), solving reasoning puzzles that involve recognizing patterns (ARC-AGI), or even taking standardized tests like the LSAT.
January 27, 2025 at 4:53 PM
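Concretely, a HumanEval problem is a Python signature plus a docstring; the model writes the body, and the sample only counts if the benchmark's hidden unit tests pass when the code is executed. A toy problem in that shape (not one of the 164 real tasks), with a simplified check that mirrors the harness idea:

```python
# Prompt given to the model (signature + docstring only):
PROMPT = '''
def running_max(xs: list[int]) -> list[int]:
    """Return a list where element i is the maximum of xs[:i+1]."""
'''

# A candidate completion the model might return:
COMPLETION = '''
    out, best = [], float("-inf")
    for x in xs:
        best = max(best, x)
        out.append(best)
    return out
'''

def check(prompt: str, completion: str) -> bool:
    """Execute prompt + completion, then run a hidden test, HumanEval-style."""
    namespace: dict = {}
    exec(prompt + completion, namespace)
    candidate = namespace["running_max"]
    assert candidate([3, 1, 4, 1, 5]) == [3, 3, 4, 4, 5]
    return True

print(check(PROMPT, COMPLETION))  # True -> this sample counts toward pass@1
```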
50% on HumanEval with just 1.3B model (twitter.com)

Main Link | HN Post
June 21, 2023 at 5:30 AM
LoRA), allowing us to assess their repair capabilities on two APR benchmarks: HumanEval-Java and Defects4J.
Our results show that by using only a fraction (<1%) of the fine-tuning dataset, we can achieve improvements of up to 78% in the number of [5/8 of https://arxiv.org/abs/2505.02931v1]
May 7, 2025 at 6:00 AM
🌐 DeepSeek-V3 is THAT model.

Open-source.
671B total params (only 37B active per token).
FP8 optimized.
Beats GPT-4o & Claude 3.5 in:
✅ MMLU
✅ HumanEval
✅ DROP
✅ Math Reasoning
✅ Chinese C-Eval

🧠 Full deep dive report → deepseekagi.org/deepseek-v3-...

#DeepSeekV3 #OpenSourceLLM #GPT4 #Claude3
DeepSeek‑V3: Architecture, Performance, and Deployment - DeepSeek AGI
DeepSeek‑V3 is a Mixture-of-Experts (MoE) Transformer with 671 billion total parameters (only ~37B “active” per token)
deepseekagi.org
May 5, 2025 at 9:39 AM
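The "671B total, ~37B active" figure is a consequence of Mixture-of-Experts routing: each token is dispatched to only a few experts, so only that slice of the parameters participates in a given forward pass. A toy top-k router showing the idea; the dimensions and k below are made up, not DeepSeek-V3's configuration.

```python
import numpy as np

def moe_forward(x: np.ndarray, experts: list, gate: np.ndarray, k: int = 2) -> np.ndarray:
    """Route one token vector to its top-k experts and mix their outputs."""
    scores = x @ gate                                   # (num_experts,)
    top = np.argsort(scores)[-k:]                       # indices of the chosen experts
    weights = np.exp(scores[top]) / np.exp(scores[top]).sum()
    # Only k of the expert weight matrices are touched for this token.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d, num_experts = 8, 16
experts = [rng.normal(size=(d, d)) for _ in range(num_experts)]
gate = rng.normal(size=(d, num_experts))
print(moe_forward(rng.normal(size=d), experts, gate).shape)  # (8,)
```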
LionW outperforms AdamW in both LoRA and full fine-tuning for code models, showing stronger results across learning rates in HumanEval and related tasks. #llmfinetuning
LionW Outperforms AdamW in LoRA and Full Fine-Tuning Tasks
hackernoon.com
June 18, 2025 at 5:00 AM
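For reference, Lion (LionW adds decoupled weight decay, as in AdamW) replaces Adam's second-moment scaling with the sign of an interpolated momentum. A minimal single-step sketch of the published update rule; the hyperparameter values are only illustrative.

```python
import numpy as np

def lion_step(theta, m, grad, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.01):
    """One LionW step: sign of the interpolated momentum plus decoupled weight decay."""
    update = np.sign(beta1 * m + (1 - beta1) * grad)
    theta = theta - lr * (update + weight_decay * theta)    # decoupled decay, AdamW-style
    m = beta2 * m + (1 - beta2) * grad                      # momentum updated after the step
    return theta, m

theta, m = np.ones(4), np.zeros(4)
theta, m = lion_step(theta, m, grad=np.array([0.1, -0.2, 0.0, 0.3]))
print(theta, m)
```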
The Dark Side Of AI: Reliability, Safety, And Security In Code Generation

This section argues that traditional benchmarks like HumanEval and MBPP are insufficient. We explore the nuanced challenges in evaluating AI-generated code for readability, completeness, and the presence of errors and security vulnerabilities.

#hackernews #news
hackernoon.com
August 6, 2025 at 12:55 AM
LLMs (GPT-4o-mini and DeepSeek-R1), comparing MoT to six baseline prompting techniques across six widely used datasets, HumanEval, HumanEval-ET, HumanEval+, MBPP, MBPP-ET, and MBPP+, demonstrate that MoT significantly outperforms existing baselines [6/7 of https://arxiv.org/abs/2503.12483v1]
March 18, 2025 at 6:00 AM
generation. Extensive experimental results demonstrate that RPG substantially outperforms the best-performing baselines on the CodeRepetEval dataset as well as the HumanEval and MBPP benchmarks, effectively reducing repetitions and enhancing the quality of [6/7 of https://arxiv.org/abs/2505.10402v1]
May 16, 2025 at 6:00 AM