Marzena Karpinska
@markar.bsky.social
#nlp researcher interested in evaluation, including multilingual models, long-form input/output, and the processing/generation of creative texts
previous: postdoc @ umass_nlp
phd from utokyo

https://marzenakrp.github.io/
I'm not sure why people have lost the ability to do related work properly, but if you absolutely need to use AI, at least proofread it? (And they most likely edited with AI.)
www.pangram.com/history/01bf...
October 18, 2025 at 4:18 PM
Come talk with us today about the evaluation of long-form multilingual generation at the second poster session of #COLM2025!

📍4:30–6:30 PM / Room 710 – Poster #8
October 7, 2025 at 5:54 PM
Off to #COLM! Fake Fuji looks really good today.
I've only ever seen the real one from below, but today I'm happy to at least see the fake one from above.
October 6, 2025 at 3:01 PM
I feel like it was worth waking up early
October 6, 2025 at 2:35 PM
Happy to see this work accepted to #EMNLP2025! 🎉🎉🎉
August 20, 2025 at 8:49 PM
GPT-5 lands first place on NoCha, our long-context book understanding benchmark.

That said, this is a tiny improvement (~1%) over o1-preview, which was released almost one year ago. Have long-context models hit a wall?

Accuracy of human readers is >97%... Long way to go!
August 8, 2025 at 2:13 AM
We have updated #nocha (a long-context benchmark measuring how well models process book-length narratives) with #Llama4 Scout. Sadly, its performance was below random, far lower than the model's reported performance on a retrieval task (needle in a haystack). novelchallenge.github.io
April 7, 2025 at 4:50 AM
We have updated #nocha, a leaderboard for reasoning over long-context narratives 📖, with some new models, including #Gemini 2.5 Pro, which shows massive improvements over the previous version! Congrats to the #Gemini team 🪄 🧙 Check 🔗 novelchallenge.github.io for details :)
April 2, 2025 at 4:30 AM
An absolutely awesome lineup of language pairs for the 20th iteration of WMT 🍾🎉
February 21, 2025 at 12:48 AM
No matter how we tried to modify #LLM-generated text (paraphrasing, humanization), people who frequently use LLMs for writing were consistently good at detecting model-generated text, though the cues they rely on change! Congrats @jennarussell.bsky.social on your first paper!
January 28, 2025 at 3:36 PM
We've added #o1 and #Llama 3.3 70B to the #Nocha leaderboard for long-context narrative reasoning! Surprisingly, o1 performs worse than o1-preview, and Llama 3.3 70B matches proprietary models like gpt4o-mini & gemini-Flash. Check out our website for more results! More in 🧵
December 29, 2024 at 8:02 PM
I will be presenting our paper on LM performance on a long-context reasoning task at #EMNLP2024 (Tue 16:00–17:30; Riverfront Hall). Come chat with us! 🧚🦋
November 11, 2024 at 5:21 PM
I really wanted to run the NEW #nocha benchmark claims on #o1, but it won't behave 😠
- 6k reasoning tokens are often not enough to get an answer, and allowing more means only short books fit in context
- OpenAI adds something to the prompt: ~8k extra tokens → less room for book + reasoning + generation! (Back-of-the-envelope sketch below.)
November 11, 2024 at 5:11 PM
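To make the squeeze concrete, here is a minimal back-of-the-envelope sketch in Python. It is not the benchmark's code: only the ~8k prompt overhead and the 6k reasoning budget come from the post above; the 128k context window and the claim/answer budgets are assumptions for illustration.

```python
# Minimal sketch (not the authors' code): how o1's token budgets squeeze out
# room for the book text. Only the ~8k prompt overhead and the 6k reasoning
# figure come from the post; everything else is an assumption.

CONTEXT_WINDOW = 128_000        # assumed total context window for o1
PROMPT_OVERHEAD = 8_000         # ~8k extra tokens OpenAI adds to the prompt (per the post)
CLAIM_AND_INSTRUCTIONS = 500    # assumed size of the claim + task instructions
ANSWER_BUDGET = 100             # assumed room for the final answer


def max_book_tokens(reasoning_budget: int) -> int:
    """Tokens left for the book once every other budget is reserved."""
    return (CONTEXT_WINDOW - PROMPT_OVERHEAD - CLAIM_AND_INSTRUCTIONS
            - ANSWER_BUDGET - reasoning_budget)


if __name__ == "__main__":
    # Raising the reasoning budget directly shrinks the longest book that fits.
    for budget in (6_000, 12_000, 24_000):
        print(f"reasoning budget {budget:>6,} -> ~{max_book_tokens(budget):,} tokens for the book")
```

Under these assumptions, a 6k reasoning budget leaves roughly 113k tokens for the book, and every additional reasoning token comes straight out of that allowance.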