Galileo.ai
@rungalileo.bsky.social
34 followers 18 following 150 posts
The fastest way to ship reliable AI apps - Evaluation, Experimentation, and Observability Platform
Reposted by Galileo.ai
✨Here's why your AI is lying!✨

The other week at @devrelcon.bsky.social I sat down to chat with Joseph Petty from @appsmith.bsky.social about AI, why you need evaluations, and how @rungalileo.bsky.social can help you.

Oh, and 🌶️ Jim's spicy take on AI 👀

youtu.be/I2vRx5Ieak8?...
Why Every AI Company Needs an AI to Test Their AI
YouTube video by Appsmith
Success for AI agents varies greatly by domain and requires nuanced, domain-specific metrics.

@erinmikail.bsky.social's new tutorial shows how to build and track tailored custom metrics using Galileo for reliable AI evaluation.

Read Erin's blog here: galileo.ai/blog/silly-s...
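For a feel of what a tailored metric looks like in practice, here's a minimal sketch of a domain-specific scorer. This is illustrative only: the function shape and the commented-out registration call are assumptions, not Galileo's actual SDK surface — Erin's tutorial covers the real API.

```python
# Hypothetical sketch of a domain-specific custom metric.
# The scorer shape and the registration call are assumptions,
# not Galileo's actual SDK -- see the tutorial for the real API.

def support_resolution_score(input_text: str, output_text: str) -> float:
    """Score how much of the user's ticket vocabulary the response
    covers; a stand-in for a real domain-specific metric."""
    keywords = {w.lower() for w in input_text.split() if len(w) > 4}
    if not keywords:
        return 1.0
    hits = sum(1 for w in keywords if w in output_text.lower())
    return hits / len(keywords)

print(support_resolution_score(
    "Login page crashes on mobile Safari",
    "We fixed the crash on mobile Safari login."))

# Illustrative registration step -- the actual Galileo call differs:
# galileo.register_metric(name="support_resolution", fn=support_resolution_score)
```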
Greg Statton of Cohesity joins Conor Bronsdon on Chain of Thought and draws a sharp analogy between AI implementation and D&D:

💬 “AI is marketed as this magic bullet… but anyone who's played D&D knows—if you're a wizard trying to harness the power of the universe, you've got a lot of studying to do. Same with AI—if you throw an LLM at it and hope it'll figure itself out, you won't get the accuracy you want.”

Deploying an LLM without the right infrastructure in place is like casting spells without a spellbook. To get the accuracy you’re looking for, you need to:
– Understand your data pipelines
– Test and evaluate continuously
– Treat infrastructure like your spellbook—essential for reliability

#AI #LLM #AIEvaluation #MLOps #DataQuality #Cohesity #GalileoAI #ChainOfThought #Podcast
We’re excited to release two new AI agent interfaces that make agent observability & evaluations even more effective.

- Timeline View: No more guessing where your agent gets stuck; see execution flow & bottlenecks quickly.

- Conversation View: Debug from the user's perspective, not just the system's.

Including Graph View, you now have three complementary ways to debug your agents:

→ Graph: Visualize decision paths and tool usage
→ Timeline: Spot performance bottlenecks instantly
→ Conversation: See the user experience end-to-end
→ Try these new views for yourself: app.galileo.ai/sign-up
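All three views are driven by the spans your agent emits. As a rough sketch of the kind of instrumentation involved — the `log_span` helper below is hypothetical, not Galileo's actual API; the real SDK handles this for you:

```python
import time
from contextlib import contextmanager

# Hypothetical span logger, for illustration only. Galileo's SDK ships
# its own instrumentation; this just shows the shape of what the views
# consume: named, typed, timed, nested spans.
spans, stack = [], []

@contextmanager
def log_span(name, span_type):
    start = time.time()
    parent = stack[-1] if stack else None
    stack.append(name)
    try:
        yield
    finally:
        stack.pop()
        spans.append({"name": name, "type": span_type, "parent": parent,
                      "start": start, "end": time.time()})

with log_span("plan_trip", "agent"):          # root node in the Graph view
    with log_span("search_flights", "tool"):  # tool-usage edge
        time.sleep(0.1)                       # stand-in for a tool call
    with log_span("summarize", "llm"):        # gaps here show up in Timeline
        time.sleep(0.05)                      # stand-in for an LLM call

print(spans)  # parent/child -> Graph; start/end -> Timeline
```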
What do LLM evals and comedy have in common? Timing.

Join @erinmikail.bsky.social at the #databricks #DataAISummit as she breaks down what it really takes to test LLMs in unexpected domains—like generating humor.

Come for the eval benchmarks. Stay for the chaos.

You’ll hear what goes wrong (a lot), what we’re still learning about task-specific evaluation, & why evaluating funny is one of the hardest prompts in the game.

🌀 Chaos-tested LLM evaluation frameworks: Why standard metrics break down & what to use instead when the output is "lol," not "true/false."

📊 Multi-tiered feedback loops in the wild: Learn how real-world reactions, iterative testing, and context-sensitive scoring reshape evaluation.

🎤 Comedy as a proving ground: See why humor is a great stress test for LLMs, and what it teaches us about creativity in AI.

#GenAI #LLMevals #AIUX #LLMops
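For a flavor of why "is it funny?" resists true/false metrics, here's a toy rubric-based judge in the spirit of the talk. The rubric and `judge_humor` function are invented for illustration; they are not Erin's actual evaluation setup, and a real version would ask an LLM judge the rubric questions instead of using heuristics.

```python
# Toy rubric-based scorer for a subjective output -- illustrative only,
# not the evaluation framework from the talk.

RUBRIC = {
    "setup_payoff": "Does the joke establish an expectation and subvert it?",
    "brevity": "Is it tight, with no wasted words?",
    "topicality": "Does it actually address the requested topic?",
}

def judge_humor(joke: str, topic: str) -> dict:
    """Score each rubric axis 0/1 with cheap stand-in heuristics; a real
    setup would pose the rubric questions to an LLM judge and aggregate."""
    scores = {
        "setup_payoff": 1 if any(c in joke for c in "?:!") else 0,
        "brevity": 1 if len(joke.split()) <= 30 else 0,
        "topicality": 1 if topic.lower() in joke.lower() else 0,
    }
    scores["overall"] = sum(scores.values()) / len(RUBRIC)
    return scores

print(judge_humor("Why did the eval fail? It couldn't take a joke.", "eval"))
```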
Make sure to stop by booth #120 to say hi to @erinmikail.bsky.social and the Galileo team during the #DataAISummit this week!
Next week, Galileo is headed to San Francisco for the Databricks Data + AI Summit!

If you’re building with LLMs, testing agents, or just trying to trust what your models are doing in production, come find us at Booth #120
Reposted by Galileo.ai
On my way to SF. If you’re attending the AI Engineer World’s Fair and want to learn why your AI needs reliability and evaluations, come say hi at the @rungalileo.bsky.social booth.
Siva Surendira, CEO of Lyzr, perfectly captures why enterprises need robust AI evaluation:

"I recommend Galileo as the antivirus equivalent for your AI system - you need these checks & balances. A MacBook is secure by nature; having that additional layer catches things the core system might miss."

Enterprise AI isn't just about building responsibly - it's about proving it works safely at scale. When something goes wrong, you need to be able to explain why and how to fix it.

Ready to add that extra layer of AI evaluation to your enterprise systems? 🛡️
🚨 Heading to the AI Engineer World’s Fair in SF next week?

I’ll (@JimBobBennett) be there with the Galileo crew—booth, talks, party, and all. I’m giving a talk on “Taming Your AI Agents with Evaluations”, aka how to stop your AI from making up entire book reports (Chicago Sun-Times, we see you 👀).
We just dropped a new walkthrough showing you how to build powerful agents by combining MongoDB Atlas with Galileo.

➡️ Learn how to set up your MongoDB Atlas account and configure it with LangChain. Then we'll guide you through ingesting your data and using the console to understand agent behavior and retriever tool performance.

📖 Read more: v2docs.galileo.ai/cookbooks/us...
MongoDB Atlas Integration for Retrieval-Augmented Generation (RAG) - Galileo
Guide to using MongoDB Atlas Vector Search with LangGraph agents logging to Galileo.
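If you want the gist before the full cookbook: the core of the setup is wiring LangChain's MongoDB Atlas vector store into a retriever. A minimal sketch, assuming you already have an Atlas cluster with a vector search index plus the `langchain-mongodb` and `langchain-openai` packages; the connection string, database, collection, and index names below are placeholders, and the cookbook covers the Galileo logging side:

```python
from langchain_mongodb import MongoDBAtlasVectorSearch
from langchain_openai import OpenAIEmbeddings

# Placeholder -- point this at your own Atlas cluster.
ATLAS_URI = "mongodb+srv://<user>:<pass>@<cluster>.mongodb.net"

store = MongoDBAtlasVectorSearch.from_connection_string(
    ATLAS_URI,
    namespace="ragdb.docs",         # "<database>.<collection>" placeholder
    embedding=OpenAIEmbeddings(),   # needs OPENAI_API_KEY in the env
    index_name="vector_index",      # your Atlas Vector Search index name
)

# Ingest a few documents, then expose the store as a retriever tool.
store.add_texts(["Galileo logs agent traces.",
                 "Atlas Vector Search powers retrieval."])
retriever = store.as_retriever(search_kwargs={"k": 2})
print(retriever.invoke("How is retrieval powered?"))
```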
The AI Agent Evaluation Blueprint: Part 1
In less than three years, your new coworker might not be human. 🤖

@poolsideai co-founders @JasoncWarner and @EisoKant believe AI will soon collaborate with teams building software for high-consequence environments such as banking, energy, and healthcare.
Agentic AI isn't just reactive; it's a proactive partner.

On the Chain of Thought podcast with @ConorBronsdon, @Amplitude_HQ's Chief Engineering Officer, Wade Chambers, explains how systems like Ask Amplitude transform AI from a tool into a team of PhDs embedded in your product.