A small training+inference pipeline for creating your own LLM from scratch
$100 will get you a somewhat functional model
$1000 gets you a model that's more coherent & can solve math problems
detailed walkthrough: github.com/karpathy/nan...
repo: github.com/karpathy/nan...
⚡️ Faster inference, 1.7B rivaling 7B.
📊 54.3% HumanEval | 47.6% HumanEval+ | 55.4% EvalPlus
Model: huggingface.co/Salesforce/C...
Report: github.com/SalesforceAI...
Top-tier results without needing a 70B+ footprint.
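For anyone who wants to poke at a model like this, here is a minimal sketch using the standard Hugging Face transformers API. The model id below is a placeholder (the link in the post is truncated), so substitute the actual Salesforce repo name; `device_map="auto"` assumes the accelerate package is installed.

```python
# Minimal sketch: load a small code LLM from the Hugging Face Hub and generate a
# completion. MODEL_ID is a placeholder, not the real repo name from the post.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Salesforce/<model-name-here>"  # placeholder: substitute the actual model id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

prompt = 'def fibonacci(n):\n    """Return the n-th Fibonacci number."""\n'
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The reported HumanEval / HumanEval+ / EvalPlus scores come from exactly this kind of generate-completions-then-run-tests loop, scored over the benchmark's task set.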
Popular benchmarks like HumanEval, MBPP, and MCEVAL test how well code LLMs generate and understand code across languages. Lua is a strong candidate for evaluating low-resource performance due to its niche status and balanced complexity.
#hackernews #llm #news
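To make the benchmark mechanics concrete, here is an illustrative sketch (not the official evaluation harness) of how a HumanEval/MBPP-style task is scored: the model sees a signature plus docstring, its completion is executed, and hidden unit tests decide pass or fail. The task below is a simplified version of the kind of problem these suites contain.

```python
# Illustrative sketch of a HumanEval/MBPP-style check: prompt = signature + docstring,
# the model's completion is executed, and hidden unit tests decide pass/fail.

task_prompt = '''
def has_close_elements(numbers, threshold):
    """Return True if any two numbers are closer to each other than threshold."""
'''

# Stand-in for the text a code LLM would return for this prompt.
candidate_completion = '''
    for i, a in enumerate(numbers):
        for j, b in enumerate(numbers):
            if i != j and abs(a - b) < threshold:
                return True
    return False
'''

def check(candidate):
    # Hidden tests: the task counts as solved only if every assertion holds.
    assert candidate([1.0, 2.0, 3.9, 4.0], 0.3) is True
    assert candidate([1.0, 2.0, 3.0], 0.05) is False

namespace = {}
exec(task_prompt + candidate_completion, namespace)  # never run untrusted code outside a sandbox
try:
    check(namespace["has_close_elements"])
    print("pass")
except AssertionError:
    print("fail")
```

Suites like HumanEval report pass@k over many such tasks; a Lua variant would swap the Python prompts and tests for Lua ones while keeping the same execute-and-check loop.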
• Multi-step reasoning (GPQA, MMLU)
• Programming (HumanEval)
And all of that with a 200K-token context window (≈ 150,000 words) - the length of an entire novel that the AI can analyze in one go!
Our results show that by using only a fraction (<1%) of the fine-tuning dataset, we can achieve improvements of up to 78% in the number of [5/8 of https://arxiv.org/abs/2505.02931v1]
Open-source.
671B total params (only 37B active per token - MoE routing, sketch below).
FP8 optimized.
Beats GPT-4o & Claude 3.5 in:
✅ MMLU
✅ HumanEval
✅ DROP
✅ Math Reasoning
✅ Chinese C-Eval
🧠 Full deep dive report → deepseekagi.org/deepseek-v3-...
#DeepSeekV3 #OpenSourceLLM #GPT4 #Claude3
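The "671B total, only 37B active per token" line is a Mixture-of-Experts property: a router sends each token to a small top-k subset of expert FFNs, so most parameters sit idle on any given token. Here is a hedged toy sketch of that routing idea in PyTorch; the layer sizes, expert count, and top-k value are arbitrary illustration values, not DeepSeek-V3's actual configuration or code.

```python
# Toy sketch of top-k expert routing (illustrative sizes, not DeepSeek-V3's config).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    def __init__(self, d_model=64, d_ff=256, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                          # x: (num_tokens, d_model)
        scores = self.router(x)                    # (num_tokens, num_experts)
        weights, chosen = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)       # renormalize over the chosen experts
        out = torch.zeros_like(x)
        # Only the top_k chosen experts do any work for a given token.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, slot] == e        # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(4, 64)                        # 4 tokens, d_model=64
print(ToyMoELayer()(tokens).shape)                 # torch.Size([4, 64])
```

With 8 experts and top-2 routing, each token touches only a quarter of the expert parameters; scale the same idea up and you get a huge total parameter count with a much smaller per-token compute footprint.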
This section argues that traditional benchmarks like HumanEval and MBPP are insufficient. We explore the nuanced challenges in evaluating AI-generated code for readability, completeness, and the presence o…
#hackernews #news