Peter Bull
@peter.drivendata.org
Co-founder DrivenData. Celebrating a decade of data for good.

ML challenges | https://www.drivendata.org/
Data projects | https://drivendata.co/
Open source | https://github.com/pjbull
🚀 New release: cloudpathlib v0.23.0

🥧 Now with Python 3.14 (π) support!
📁 New copy & move methods mean you can reduce usage of shutil 🎉

Check out the full release and docs here:
👉 cloudpathlib.drivendata.org/stable/
October 13, 2025 at 6:36 PM
Super interesting work on a newly proposed columnar data file format called F3, with an embedded wasm binary to decode the data 🤯 (which obviates the need for 3rd party library support). Favorable comparisons on compression, throughput and random reads to existing formats.

db.cs.cmu.edu/papers/2025/...
October 10, 2025 at 6:36 PM
Very cool to see Wikimedia embracing LLM tools and launching a hybrid similarity search API and open source embeddings for Wikipedia! Also supports Q&A style queries.
www.wikidata.org/wiki/Wikidat...
October 8, 2025 at 10:27 PM
Interesting to see empirical research coming out for LLMs as education aids. In this study, active use of LLMs helped CS students debug compiler errors. Removing LLM access demonstrated no lasting learning benefit from having had access to it...

learninganalytics.upenn.edu/ryanbaker/IC...
October 6, 2025 at 6:36 PM
We just shipped two major features for cloudpathlib ✨📦 ✨ ! First, http support: treat a URL like any other path (open, read_text, join). Second, compatibility with open and os Python built-ins for seamless transition of legacy code and third-party library support.

cloudpathlib.drivendata.org
September 22, 2025 at 6:36 PM
Thought I would spot check an application someone was posting about 100% vibecoding. Can you spot the issue?

Kudos to the LLM, this is verbatim from the fastapi docs. Sometimes verbatim from the docs is not what you want for your application though....
August 13, 2025 at 10:27 PM
Enthusiastic to build on this generation of earth observation foundation embeddings like DeepMind's AlphaEarth (and more)! We already see some promising crop type (cereals vs. orchards) results and are exploring other use cases in climate resilience. deepmind.google/discover/blo...
August 8, 2025 at 6:36 PM
✨ 📦 ✨ Just released new Cookiecutter Data Science version with support for pixi and poetry as environment managers! Some of our top requested features ever. Upgrade and check it out now.

cookiecutter-data-science.drivendata.org
July 25, 2025 at 6:36 PM
Now getting organic inbound for www.zambacloud.com, our wildlife imagery processing platform, from ChatGPT! 😲
July 18, 2025 at 6:36 PM
Just in case you thought speech-to-text worked for children, the third column is what Whisper does. Somehow in the third example it accesses my inner monologue... I guess that's why we're excited about our upcoming challenge! kidsasr.drivendata.org
July 16, 2025 at 10:27 PM
BioCLIP2 looks like a stellar improvement! I'm excited to think about integrating it into Zamba for open-ended classification tasks run at scale on camera trap imagery. Definitely the potential to dramatically improve CT image utility. imageomics.github.io/bioclip-2/
June 30, 2025 at 6:36 PM
We've built so many low-fidelity prototypes in our HCD work. IMO vibecoding changes the feel of those prototypes, but doesn't change the process. Ask any designer—they'll tell you high-fidelity first iterations are often more distracting to clients than helpful.

www.semafor.com/article/06/0...
June 25, 2025 at 10:27 PM
Check out this LLM circuit trace for the text: "The statement 'this statement is false' is". It goes through a logical contradictions node, but still outputs either "true" or "false" with the highest probabilities... www.anthropic.com/research/ope...
June 23, 2025 at 6:36 PM
A new preprint shows anonymization techniques for voices make transcription accuracy substantially worse for children versus adults. This is going to be a big challenge as we work on ASR for educational settings where we emphatically need both privacy and accuracy. arxiv.org/pdf/2506.00100
June 20, 2025 at 6:36 PM
The gap between LLM prototype and production strikes again... in the worst possible place. www.propublica.org/article/trum...
June 6, 2025 at 6:36 PM
Very cool to see the multimodal conversation dataset CANDOR that we worked on used for a new paper on conversational agents! Gets agent feedback/training loops closer to the 7-38-55 rule than text alone. arxiv.org/abs/2505.15922
May 30, 2025 at 6:36 PM
Excited to be with the DD team in Utah. Heading today to talk geo data tools at #CNG2025. If you're at cloud native geo, say hi! Would love to talk ML competitions, social impact tools, foundation models, python libraries and more!
May 1, 2025 at 6:52 PM
The OpenAI debacle showed us bad governance. Let's talk more about: (1) how we got here, (2) structure and governance, and (3) where we go from here.

I believe there's still a world in which mission-driven orgs are critical players in AI. What do you think?

www.linkedin.com/pulse/openai...
November 28, 2023 at 5:27 PM