Duc Nguyen Huu
ducnh279.bsky.social
Data Science in ♥️ Home in 🇻🇳
Reposted by Duc Nguyen Huu
Dream unlocked: I'm publishing my first book! 🎉🎉🎉

It's called "Master Machine Learning with scikit-learn: A Practical Guide to Building Better Models with Python"

Download the first 3 chapters right now:
👉 dataschool.kit.com/mlbook 👈

Thanks for your support 🙏
September 11, 2025 at 5:53 PM
Reposted by Duc Nguyen Huu
I got 3rd out of 691 in a tabular kaggle competition – with only neural networks! 🥉

My solution is short (48 LOC) and relatively general-purpose – I used skrub to preprocess string and date columns, and pytabkit to create an ensemble of RealMLP and TabM models. Link below👇
July 29, 2025 at 11:10 AM
Reposted by Duc Nguyen Huu
This work is presented at ICML next week.
• The paper arxiv.org/html/2502.05...
• The python package: pypistats.org/packages/tab... (try it out 🐍)
• The source code github.com/soda-inria/t... (100% open source, including pre-training 💞)

Longer read (5 min): gael-varoquaux.info/science/tabi...
8/9
TabICL: A Tabular Foundation Model for In-Context Learning on Large Data
arxiv.org
July 9, 2025 at 6:42 PM
Reposted by Duc Nguyen Huu
👨‍🎓🧾✨#icml2025 Paper: TabICL, A Tabular Foundation Model for In-Context Learning on Large Data
With Jingang Qu, @dholzmueller.bsky.social, and Marine Le Morvan

TL;DR: a well-designed architecture and pretraining give the best tabular learner, and a more scalable one
On top, it's 100% open source
1/9
July 9, 2025 at 6:42 PM
Reposted by Duc Nguyen Huu
My thoughts on the current state of AI progress and the most important developments in 2025:

www.dataschool.io/ai-progress-...
AI progress in 2025 📈
Thoughts on the current state of AI progress and the most important developments in 2025
www.dataschool.io
May 28, 2025 at 2:17 PM
Just shared a new article on "The State of Reinforcement Learning for LLM Reasoning"!
If you are new to reinforcement learning, this article has a generous intro section (PPO, GRPO, etc)
Also, I cover 15 recent articles focused on RL & Reasoning.

🔗 magazine.sebastianraschka.com/p/the-state-...
April 20, 2025 at 12:25 PM
Are you familiar with Token Pooling?

Models that use late interaction, like ColBERT, ColPali, and ColQwen, gain significant benefits from this pooling technique! By integrating token pooling methods, the number of vectors to store can be reduced.

Blog: www.answer.ai/posts/colber...
A little pooling goes a long way for multi-vector representations – Answer.AI
Practical AI R&D
www.answer.ai
April 4, 2025 at 11:41 PM
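To make the token pooling idea in the post above concrete, here is a minimal sketch that averages runs of consecutive token vectors so fewer vectors need to be stored. The function name `pool_tokens`, the `pool_factor` parameter, and the toy vectors are hypothetical, and grouping by position is a simplification chosen for brevity; grouping similar tokens together (for example by clustering) is another option and may be what the linked blog recommends.

```python
def pool_tokens(vectors, pool_factor=2):
    """Average each run of `pool_factor` consecutive token vectors into one vector."""
    pooled = []
    for start in range(0, len(vectors), pool_factor):
        group = vectors[start:start + pool_factor]
        dim = len(group[0])
        # Element-wise mean over the vectors in this group
        pooled.append([sum(v[i] for v in group) / len(group) for i in range(dim)])
    return pooled


# Toy multi-vector document representation: 5 tokens, 2 dimensions each (made-up values)
doc_vectors = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 2.0], [5.0, 5.0]]
pooled = pool_tokens(doc_vectors, pool_factor=2)
print(len(doc_vectors), "->", len(pooled), "vectors")
```

With a pool factor of 2, the number of stored vectors is roughly halved; the trade-off is that each pooled vector is a blurrier summary of its tokens.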
Efficiently scale long CoT models like DeepSeek when using Best-of-N or Majority Voting by early pruning reasoning chains.

Kaggle Discussion: www.kaggle.com/competitions...
AI Mathematical Olympiad - Progress Prize 2
Solve national-level math challenges using artificial intelligence models
www.kaggle.com
April 4, 2025 at 7:48 PM
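A related, much simpler idea than pruning individual reasoning chains is to stop sampling once the majority answer can no longer change. The sketch below illustrates that; it is not the method from the linked Kaggle discussion, and the function name `majority_vote_early_stop` and the sample answers are hypothetical.

```python
from collections import Counter


def majority_vote_early_stop(answers, n_total):
    """Consume answers one at a time; stop once the leader cannot be overtaken.

    `answers` yields final answers from sampled reasoning chains and
    `n_total` is the maximum number of chains we are willing to sample.
    """
    counts = Counter()
    used = 0
    for answer in answers:
        used += 1
        counts[answer] += 1
        ranked = counts.most_common(2)
        leader, lead_votes = ranked[0]
        runner_up_votes = ranked[1][1] if len(ranked) > 1 else 0
        remaining = n_total - used
        # Even if every remaining chain voted for the runner-up, the leader still wins
        if lead_votes > runner_up_votes + remaining:
            return leader, used
    return counts.most_common(1)[0][0], used


# Made-up answers from 8 sampled chains
sampled = ["42", "42", "42", "17", "42", "42", "17", "42"]
winner, chains_used = majority_vote_early_stop(iter(sampled), n_total=8)
print(winner, chains_used)
```

In this toy run the vote is decided after 6 of the 8 chains, so the last two never need to be generated.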
I find making your agents safe is just as important as making them smart. 🔒

A good read for building secure AI!

arxiv.org/pdf/2503.18813
March 31, 2025 at 12:47 PM
There will be one day ... in 🇺🇸 or 🇻🇳
March 30, 2025 at 7:37 PM
Reposted by Duc Nguyen Huu
Claude finally integrated web search into its results...

But with LangChain & LangGraph, you can build a chatbot that integrates web search into ANY model you like!

You'll learn how to do that (and much more) in my new AI course...

Sign up for EARLY ACCESS:
👉 dataschool.kit.com/agents 👈
March 27, 2025 at 11:59 AM
A practical way for students to secure jobs and earn money is by developing real-world projects. Researching or engineering LLMs often seems like a field dominated by big tech!

It's still important to learn the fundamentals from scratch for growth and problem-solving (e.g., being able to fix things)! 😁
Just finished recording my new AI course 😅

Sign up for early access: dataschool.kit.com/agents
March 24, 2025 at 4:17 PM
Reposted by Duc Nguyen Huu
My next tutorial on pretraining an LLM from scratch is now out. It starts with a step-by-step walkthrough of understanding, calculating, and optimizing the loss. After training, we update the text generation function with temperature scaling and top-k sampling: www.youtube.com/watch?v=Zar2...
March 23, 2025 at 1:38 PM
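The two decoding tricks named in the post above, temperature scaling and top-k sampling, can be sketched in a few lines of plain Python. This is an illustrative sketch, not the code from the tutorial; the function name `sample_next_token` and the example logits are hypothetical.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=None, rng=random):
    """Sample a token index from raw logits using top-k filtering and temperature."""
    if top_k is not None:
        # Keep the k largest logits and mask the rest (ties at the cutoff are all kept)
        cutoff = sorted(logits, reverse=True)[min(top_k, len(logits)) - 1]
        logits = [x if x >= cutoff else float("-inf") for x in logits]
    if temperature <= 0:
        # Treat non-positive temperature as greedy decoding
        return max(range(len(logits)), key=lambda i: logits[i])
    # Lower temperature sharpens the distribution, higher temperature flattens it
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


example_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
rng = random.Random(0)
samples = [sample_next_token(example_logits, temperature=0.8, top_k=2, rng=rng) for _ in range(50)]
print(sorted(set(samples)))
```

With `top_k=2`, only the two highest-scoring tokens can ever be sampled, regardless of temperature.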
Reposted by Duc Nguyen Huu
cuDF-pandas (%load_ext cudf.pandas) with Rapids ... works similarly, and it's super cool to see that we will be able to speed up scikit-learn
March 20, 2025 at 9:59 AM
Scikit-learn accelerated 🚀

My company has a bunch of unused T4 GPUs because the LLMs are too big for the AI teams to run experiments on them. Now the data science team finally has a reason to ask for them! 🤣

developer.nvidia.com/blog/nvidia-...
NVIDIA cuML Brings Zero Code Change Acceleration to scikit-learn | NVIDIA Technical Blog
Scikit-learn, the most widely used ML library, is popular for processing tabular data because of its simple API, diversity of algorithms, and compatibility with popular Python libraries such as pandas...
developer.nvidia.com
March 20, 2025 at 7:57 AM
Reposted by Duc Nguyen Huu
In honor of March Madness 🏀, I've got a new blog post:

www.dataschool.io/pandas-strea...

Learn how to identify & analyze scoring streaks using pandas operations:

- shift()
- cumsum()
- boolean math
- groupby()
How to calculate "scoring streaks" with pandas 🏀
Learn how to identify & analyze consecutive events in your data using advanced DataFrame methods!
www.dataschool.io
March 17, 2025 at 1:53 PM
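As a rough illustration of how the four operations listed in the post above can combine, here is a minimal sketch that labels streaks and measures their length. It is one possible approach and not necessarily the one the linked blog post takes; the DataFrame, its column names (`player`, `made`), and the values are hypothetical.

```python
import pandas as pd

# Hypothetical shot log: 1 = made, 0 = missed, in chronological order per player
df = pd.DataFrame({
    "player": ["A", "A", "A", "A", "A", "B", "B", "B", "B"],
    "made":   [1, 1, 1, 0, 1, 1, 0, 1, 1],
})

# shift(): look at the previous outcome for the same player (NaN on a player's first row)
prev_made = df.groupby("player")["made"].shift()

# boolean math: a new streak starts whenever the outcome differs from the previous one
new_streak = df["made"].ne(prev_made)

# cumsum(): turning the True/False flags into a running count gives each streak an ID
df["streak_id"] = new_streak.cumsum()

# groupby() + cumsum(): running length of made shots within each streak (0 during misses)
df["streak_len"] = df.groupby("streak_id")["made"].cumsum()

longest = df.groupby("player")["streak_len"].max()
print(longest)
```

Here `longest` holds the longest scoring streak per player for the toy data.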
Lots of good advice and best practices for missing value imputation in the paper!

I now have a much deeper appreciation for Data School's course and regard it as the best scikit-learn course.

Master Machine Learning with scikit-learn: courses.dataschool.io/master-machi...
March 18, 2025 at 3:55 PM
Reposted by Duc Nguyen Huu
"Some people today are discouraging others from learning programming on the grounds AI will automate it. This advice will be seen as some of the worst career advice ever given."

-- Andrew Ng, legendary AI researcher

Source: www.deeplearning.ai/the-batch/is...
DeepSeek-R1 Uncensored, QwQ-32B Puts Reasoning in Smaller Model, and more...
The Batch AI News and Insights: Some people today are discouraging others from learning programming on the grounds AI will automate it.
www.deeplearning.ai
March 13, 2025 at 6:05 PM
Reposted by Duc Nguyen Huu
A recent talk, fully in VS Code: 100% code on data wrangling for machine learning with @skrub-data.bsky.social
www.youtube.com/watch?v=hdWW...

Super powerful for easily assembling production-ready pipelines with simple syntax
The Future of AI & Machine Learning | The Python Exchange February 2025
YouTube video by Don't Use This Code • James Powell
www.youtube.com
March 14, 2025 at 3:15 PM
Reposted by Duc Nguyen Huu
Yesterday, Google released Gemma 3, their latest open-weight LLM. Finally, a new addition to the "Big 5" of open-weight models (Gemma, Llama, DeepSeek, Qwen, and Mistral). I just went through the Gemma 3 report and experimented a bit with the models, and there are plenty of interesting tidbits:
March 13, 2025 at 4:03 PM
Reposted by Duc Nguyen Huu
Just uploaded my "Coding Attention Mechanisms" tutorial. A 2h15m session on coding attention mechanisms to understand how the engine of LLMs works:
self-attention → parameterized self-attention → causal self-attention → multi-head self-attention
www.youtube.com/watch?v=-Ll8...
Build an LLM from Scratch 3: Coding attention mechanisms
YouTube video by Sebastian Raschka
www.youtube.com
March 11, 2025 at 4:10 PM
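To give a feel for the "causal self-attention" step in the progression above, here is a minimal single-head sketch in plain Python. It is not the tutorial's code: it skips the learned query/key/value projections and multi-head splitting, and the function names and toy inputs are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(queries, keys, values):
    """Scaled dot-product attention where position i only attends to positions 0..i."""
    d_k = len(keys[0])
    dim = len(values[0])
    outputs = []
    for i, q in enumerate(queries):
        # Causal mask: score only against the current and earlier positions
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d_k)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # Weighted sum of the visible value vectors
        outputs.append([
            sum(w * values[j][c] for j, w in enumerate(weights))
            for c in range(dim)
        ])
    return outputs


# Toy sequence of 3 token embeddings, used directly as queries, keys, and values
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = causal_self_attention(x, x, x)
print(out[0])
```

Because the first token can only see itself, its output is just its own value vector, and changing later tokens never affects earlier outputs.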
Reposted by Duc Nguyen Huu
I just shared a new article, "The State of Reasoning Models", where I am exploring 12 new research articles on improving the reasoning capabilities of LLMs (all published after the release of DeepSeek R1): magazine.sebastianraschka.com/p/state-of-l...

Happy reading!
The State of LLM Reasoning Models
Part 1: Inference-Time Compute Scaling Methods
magazine.sebastianraschka.com
March 8, 2025 at 2:37 PM
Reposted by Duc Nguyen Huu
A couple months ago @dataschool.io wrote about a tool he uses to chat with different LLM models without paying a monthly subscription to all of them.

The tool is called Typing Mind and I decided to pay $30 for lifetime access. It was well worth it.

Kevin's post 👇
www.dataschool.io/save-money-o...
Use premium AI models for pennies 💰
Learn how to access ChatGPT, Claude, and more for pennies per conversation rather than paying for expensive subscriptions!
www.dataschool.io
March 3, 2025 at 4:18 PM
Reposted by Duc Nguyen Huu
19 professionals (in a variety of fields) evaluated OpenAI's Deep Research vs Google's Deep Research.

OpenAI was the clear winner 🏆

Neat study by @binarybits.bsky.social, read more here: www.understandingai.org/p/these-expe...
March 4, 2025 at 3:55 PM
Reposted by Duc Nguyen Huu
A new tutorial in my “Build A Large Language Model From Scratch” series is now live (www.youtube.com/watch?v=341R...)
- Tokenizing raw text and converting tokens into token IDs
- Applying byte pair encoding
- Setting up data loaders in PyTorch for efficient training
Build an LLM from Scratch 2: Working with text data
YouTube video by Sebastian Raschka
www.youtube.com
March 2, 2025 at 2:45 PM
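The first bullet above, tokenizing raw text and converting tokens into token IDs, can be sketched with a simple word-level tokenizer. This is an illustrative sketch rather than the tutorial's code; it does not implement byte pair encoding, and the function names and sample text are hypothetical.

```python
import re


def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def build_vocab(tokens):
    """Map each unique token to an integer ID, in sorted order."""
    return {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}


text = "Hello, world. Hello again."
tokens = tokenize(text)
vocab = build_vocab(tokens)

# Encode: tokens -> token IDs
ids = [vocab[tok] for tok in tokens]

# Decode: token IDs -> tokens
id_to_token = {idx: tok for tok, idx in vocab.items()}
decoded = [id_to_token[i] for i in ids]

print(tokens)
print(ids)
```

A word-level vocabulary like this cannot encode words it has never seen, which is one motivation for subword schemes such as byte pair encoding.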