Yotam Perlitz
@yperlitz.bsky.social
Research Scientist at @ibmresearch #NLProc, #RL.
Opinions are my own.
Pinned
Yotam Perlitz
@yperlitz.bsky.social
· Nov 13
Save yourselves the hours (or days) of running inference on all 64K examples when using HELM.
In arxiv.org/pdf/2308.116... we show that 160 examples 🤯🤯🤯 are enough to get a very good picture, #ComputeIsForTraining.
with @lchoshen.bsky.social and more
How important are LLM evaluations to you?
A) Who cares?
B) Somewhat important (I guess?)
C) I'm an LLM, I evaluate myself.
D) Enough to join the pack
Let's talk about LLM evals here: go.bsky.app/DJpp8cy
November 18, 2024 at 8:50 PM
Save yourselves the hours (or days) of running inference on all 64K examples when using HELM.
In arxiv.org/pdf/2308.116... we show that 160 examples 🤯🤯🤯 are enough to get a very good picture, #ComputeIsForTraining.
with @lchoshen.bsky.social and more
November 13, 2024 at 6:40 PM
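A toy sketch of the idea behind the post above, assuming synthetic data rather than HELM outputs: score a few simulated models on the full example pool and on a random 160-example subset, then compare the rankings. The model names, accuracies, and helper functions are hypothetical, and this is not the procedure from the linked paper.

```python
import random


def accuracy(flags):
    """Fraction of correct answers in a list of 0/1 flags."""
    return sum(flags) / len(flags)


def rank_models(scores):
    """Model names ordered from best to worst score."""
    return sorted(scores, key=scores.get, reverse=True)


def subsample_scores(per_example, n, seed=0):
    """Score every model on the same random subset of n examples."""
    rng = random.Random(seed)
    total = len(next(iter(per_example.values())))
    idx = rng.sample(range(total), n)
    return {name: accuracy([flags[i] for i in idx])
            for name, flags in per_example.items()}


def simulate(true_acc, total, seed=0):
    """Synthetic 0/1 results: each model is right with its own fixed probability."""
    rng = random.Random(seed)
    return {name: [int(rng.random() < p) for _ in range(total)]
            for name, p in true_acc.items()}


# Hypothetical models and accuracies, chosen only for illustration.
results = simulate({"model_a": 0.45, "model_b": 0.60, "model_c": 0.75}, total=64_000)
full = {name: accuracy(flags) for name, flags in results.items()}
small = subsample_scores(results, n=160)

print("full ranking:       ", rank_models(full))
print("160-example ranking:", rank_models(small))
```

Whether the two rankings match depends on how far apart the models are; closely matched models may need more examples.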
If you haven't tried it yet:
github.com/yamadashy/re...
can turn your repo into one file,
making it super easy to feed to a chatbot and ask questions
GitHub - yamadashy/repomix: 📦 Repomix (formerly Repopack) is a powerful tool that packs your entire repository into a single, AI-friendly file. Perfect for when you need to feed your codebase to Large...
November 12, 2024 at 7:50 PM
✨ Developed a new benchmark or dataset for language models? ✨
Want the community to trust and adopt it? 🤔
Show that it (dis)agrees with common benchmarks
BenchBench makes it easy. Check it out:
👉 huggingface.co/spaces/ibm/b...
BenchBench Leaderboard - a Hugging Face Space by ibm
November 12, 2024 at 7:47 PM
Seems like it indeed measures what it claims to :)
Kudos to the authors
A faster, automatic (no annotators) alternative to the Chatbot arena https://t.co/WNk3UmXRSq
November 19, 2024 at 7:27 PM
https://t.co/TZlMiQdgWR
November 19, 2024 at 7:27 PM
We've now added the Decentralized Arena to BenchBench;
check out how it fares against other benchmarks
https://t.co/pjhtr8CPZD
November 19, 2024 at 7:27 PM
Get your benchmark game on: https://t.co/yY0swLQOHZ https://t.co/3qzkcIOd7u https://t.co/5Y7QUz0Ype
November 19, 2024 at 7:27 PM
Me trying to choose the right LLM benchmark without BenchBench:
https://t.co/TZlMiQdgWR https://t.co/DQEttklUGQ
November 19, 2024 at 7:27 PM
Shoutout to @streamlit, our framework of choice! Shoutout to @huggingface for hosting our space 🤗 https://t.co/z8LFw6ZQG7
November 19, 2024 at 7:27 PM
Use the BenchBench Leaderboard to explore and visualize how established benchmarks compare: https://t.co/yY0swLQgSr
Use our Python package to perform your own BAT analysis: https://t.co/iU8favWVT6
And read the paper: https://t.co/RvCp3R6gU5 https://t.co/poHpewZkS3
November 19, 2024 at 7:27 PM
BenchBench can prove your benchmark measures unique skills ❄️(disagreement with existing benchmarks)
Or prove it captures the essence of what others aimed at (agreement), for example, agreeing with @lmsys, but efficiently. https://t.co/KwtHtTRESc
November 19, 2024 at 7:27 PM
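One common way to quantify (dis)agreement between two benchmarks is a rank correlation over the models they both score. The sketch below computes Kendall's tau in plain Python; the scores and model names are made up, and it does not reflect the BenchBench package's actual interface.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two benchmarks over the models they share."""
    models = sorted(set(scores_a) & set(scores_b))
    if len(models) < 2:
        raise ValueError("need at least two shared models")
    concordant = discordant = 0
    for m1, m2 in combinations(models, 2):
        # Positive product: both benchmarks order this pair the same way.
        product = (scores_a[m1] - scores_a[m2]) * (scores_b[m1] - scores_b[m2])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores: a new benchmark (accuracy) and a reference (arena-style rating).
new_bench = {"m1": 0.81, "m2": 0.64, "m3": 0.70, "m4": 0.52}
reference = {"m1": 1210, "m2": 1105, "m3": 1090, "m4": 1010}

print(kendall_tau(new_bench, reference))
```

Values near 1 suggest the new benchmark ranks models much like the reference; values near 0 or below suggest it measures something different.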
✨ Developed a new benchmark or dataset for language models? ✨
Want the community to trust and adopt it? 🤔
So, demonstrate its validity by comparing it to established benchmarks!
BenchBench makes it easy. Check it out:
👉 https://t.co/yY0swLQgSr
November 19, 2024 at 7:27 PM
Shout-out to the amazing team at IBM behind Unitxt: @ElronBandel, @MatanOrbach, yoavkatz, eladv, @LChoshen, @yotamperlitz & more!
IBM is betting big on it (IBM Research AI VP 👇) https://t.co/BKfK0JriYB
November 19, 2024 at 7:28 PM
HELM just got a great upgrade!
We've integrated with Unitxt for:
Easy dataset addition
2x the datasets
Sharable & reproducible pipelines
Check out the blogpost: https://t.co/UJXwfPKzGN
And the unitxt repo
https://t.co/GeqMCoQhjv
@ElronBandel @YifanMai
November 19, 2024 at 7:28 PM
Everyone knows you never have to use the full test set
We show how right they were 🤯!
Check out our presentation at @naacl
in Efficient/Low-Resources and Evaluation Methods for NLP (18 June 2024 @ 02:12)
or watch our video here:
https://t.co/pPOpKyLbhT
See you! https://t.co/ocVvmVBBlW
November 19, 2024 at 7:28 PM
It is a great figure
and a great thing you did by sharing all your meta-data!
it has enabled a lot of great work
ours as well :)
https://t.co/9lGi8aW8IG https://t.co/Lz62fTdn7O
November 19, 2024 at 7:28 PM
Bored with all benchmarks ranking models the same?
HOLMES doesn't 💪
Probing LMs for linguistic abilities is a fresh idea, @AndreasWaldis took it to the extreme 🦸
Give it a read!
or check out the leaderboard https://t.co/Byc1Nhp3nV https://t.co/zH0RLddkID
November 19, 2024 at 7:28 PM
I've been working internally with this dataset
and let me tell you...
It's great! https://t.co/MOwn0OyVS3
November 19, 2024 at 7:28 PM
like the color scheme 🏅 https://t.co/sdAosgxypV
November 19, 2024 at 7:28 PM
Using contrastive representation for optimized human evaluation 👁️👁️👁️
Nice! https://t.co/49leLodOAQ
November 19, 2024 at 7:28 PM
Check out the paper for more insights :) https://t.co/7zhb8mGtQ0
November 19, 2024 at 7:28 PM
variance in evaluation has many sources,
this work really does a good job at profiling one of these https://t.co/nAf7zYDSd7
November 19, 2024 at 7:28 PM
these models keep changing 💩
tomorrow this figure will have no meaning https://t.co/OsA2WfiLHn
November 19, 2024 at 7:28 PM
this is a nice-to-have link :) https://t.co/DYApcasZen
November 19, 2024 at 7:28 PM