In our new paper, we survey major regulatory efforts & find they rely on benchmarking, which we know to be problematic. How did this happen & what can we do about it?
arxiv.org/pdf/2501.15693
🧠
Instruction-tuned LLMs show amplified cognitive biases — but are these new behaviors, or pretraining ghosts resurfacing?
Excited to share our new paper, accepted to CoLM 2025🎉!
See thread below 👇
#BiasInAI #LLMs #MachineLearning #NLProc
We put the number of retrieved documents in RAG to the test!
💥Preprint💥: arxiv.org/abs/2503.04388
1/3
LLMs can hallucinate - but did you know they can do so with high certainty even when they know the correct answer? 🤯
We find those hallucinations in our latest work with @itay-itzhak.bsky.social, @fbarez.bsky.social, @gabistanovsky.bsky.social and Yonatan Belinkov
Evaluation in the world of GenAI is more important than ever, so please consider submitting your amazing work.
CfP can be found at gem-benchmark.com/workshop