@anishathalye.bsky.social
If you're interested in semantic data processing, you can also check out related systems like DocETL from Shreya Shankar et al., LOTUS from Liana Patel et al., and Palimpzest from Chunwei Liu et al. (4/4)
September 11, 2025 at 3:38 PM
These operators are implemented in Semlib, a new library I built to help solve a class of semantic data processing problems that is underserved by current tools such as agents and conversational chatbots.

More on the story and use cases here: anishathalye.com/semlib/. (2/)
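A minimal sketch of what a pipeline built from these operators might look like. Caveat: Session, filter, and sort here are assumptions based on the post and the project description, not verified against Semlib's docs; see anishathalye.com/semlib/ for the actual API.

import asyncio

from semlib import Session  # assumed import path

async def main():
    papers = ["Attention Is All You Need", "Deep Residual Learning", "AlphaGo"]
    session = Session()  # assumed entry point; uses your configured LLM provider
    # Semantic filter: keep items matching a natural-language predicate.
    nlp = await session.filter(papers, by="is about natural language processing")
    # Semantic sort: order items by an LLM-judged criterion.
    ranked = await session.sort(papers, by="influence on modern LLMs")
    print(nlp, ranked)

asyncio.run(main())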
Semlib: LLM-powered Data Processing
Semlib is a Python library for building data processing and data analysis pipelines that leverage the power of large language models (LLMs).
anishathalye.com
September 11, 2025 at 3:36 PM
If you have suggestions for topics to cover in the next iteration of the course, please share them in this thread!
August 5, 2025 at 5:43 PM
Incidentally, this is how I first got interested in ML. github.com/anishathalye...
GitHub - anishathalye/neural-style: Neural style in TensorFlow! 🎨
github.com
June 21, 2025 at 3:19 PM
We did a workshop at AIUC that: (1) implements a RAG app on top of Cursor's docs, (2) reproduces the widely-publicized failure from last week, and (3) shows how to automatically catch this failure. All slides/code are open-sourced here: github.com/cleanlab/aiu... (5/5)
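For reference, step (1) boils down to something like this sketch using the OpenAI SDK (the model names and one-chunk-per-query retrieval are placeholders; the workshop repo below has the real code):

import numpy as np
from openai import OpenAI

client = OpenAI()
docs = ["<chunk of Cursor's docs>", "<another chunk>"]  # placeholder corpus

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

doc_vecs = embed(docs)

def answer(question):
    q = embed([question])[0]
    # Cosine-similarity retrieval over the doc chunks.
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    context = docs[int(np.argmax(sims))]
    chat = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer only from the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return chat.choices[0].message.content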
GitHub - cleanlab/aiuc-workshop: AI User Conference 2025 - Developer Day workshop
github.com
April 24, 2025 at 6:21 PM
What’s the solution? I believe that one ingredient will be intelligent systems that evaluate the output of these LLMs in real time and keep them in check, building on and combining techniques like LLM-as-a-judge, per-token logprobs, and statistical methods. (4/5)
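As a sketch of the per-token-logprobs ingredient (the model and threshold are placeholders, and this signal alone is noisy): flag a response when its average token logprob dips below a threshold.

import math
from openai import OpenAI

client = OpenAI()

def answer_with_confidence(question, threshold=-0.3):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
        logprobs=True,
    )
    choice = resp.choices[0]
    logprobs = [t.logprob for t in choice.logprobs.content]
    mean_lp = sum(logprobs) / len(logprobs)
    # Low average token probability is a (noisy) hallucination signal.
    return choice.message.content, math.exp(mean_lp), mean_lp > threshold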
April 24, 2025 at 6:21 PM
Why do such failures occur? These next-token-prediction models are nondeterministic and can be fragile. And they’re not getting consistently better over time: OpenAI’s latest models like o3 and o4-mini show higher hallucination rates than previous versions. (3/5)
April 24, 2025 at 6:21 PM
It’s been over a year since the well-publicized failures of Air Canada’s support bot and NYC’s MyCity bot. And these AIs are still failing spectacularly in production, with the most recent debacle being Cursor’s AI going rogue and triggering a wave of cancellations. (2/5)
April 24, 2025 at 6:21 PM
I wonder if there's anything special in the Cursor Tab completion model or system prompt that induces this behavior.
April 16, 2025 at 10:04 PM
2/2
It works surprisingly well in practice.

cleanlab.ai/blog/rag-eva...

Hoping to see more of these real-time reference-free evaluations to give end users more confidence in the outputs of AI applications.
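In the same spirit, a toy reference-free LLM-as-a-judge check (not one of the evaluation models benchmarked in the post): a second model grades whether the answer is actually supported by the retrieved context.

from openai import OpenAI

client = OpenAI()

def support_score(question, context, answer):
    prompt = (
        "Reply with only a number from 0 to 1: the probability that the "
        "answer is fully supported by the context.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer: {answer}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic-ish grading
    )
    return float(resp.choices[0].message.content.strip())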
Real-Time Evaluation Models for RAG: Who Detects Hallucinations Best?
A comprehensive benchmark of evaluation models to automatically catch incorrect responses across five RAG applications.
cleanlab.ai
April 7, 2025 at 11:06 PM
And some repos are even organically suggested by ChatGPT. (3/3)
February 17, 2025 at 6:03 PM
Some of this might be through web search / tool use, but in at least some cases, knowledge about the projects is actually part of the models’ weights. (2/3)
February 17, 2025 at 6:03 PM