Sathvik
@sathvik.bsky.social
computational psycholinguistics @ umd
he/him
Lastly, we replicated the aggregate analysis separately on words that were and weren't split by the BPE tokenizer, and found that the predictive power of surprisal for reading times was worse for words that were split than for words that were not.
November 2, 2023 at 10:28 PM
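For readers who want to try this kind of comparison, a minimal sketch in Python, assuming a hypothetical per-word table with reading times, baseline predictors, BPE-based surprisal, and a flag for whether the tokenizer split the word (all file and column names are placeholders, not the paper's actual pipeline):

    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("reading_times.csv")  # hypothetical per-word table

    def delta_llf(subset):
        # Predictive power of surprisal = log-likelihood gained by adding
        # it on top of baseline predictors like length and log frequency.
        base = smf.ols("rt ~ length + log_freq", data=subset).fit()
        full = smf.ols("rt ~ length + log_freq + surprisal", data=subset).fit()
        return full.llf - base.llf

    for was_split, subset in df.groupby("split_by_bpe"):
        print(f"split_by_bpe={was_split}: delta log-lik = {delta_llf(subset):.1f}")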
Looking more closely, word surprisal appears to scale incrementally with the number of morphemes, consistent with the cognitive prediction that each sub-unit adds more processing work. But surprisal does not scale the same way with the number of BPE tokens.
November 2, 2023 at 10:26 PM
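A quick way to eyeball that scaling pattern, assuming a hypothetical per-word table holding each word's surprisal, morpheme count, and BPE token count:

    import pandas as pd

    df = pd.read_csv("word_surprisals.csv")  # hypothetical per-word table

    # If each sub-unit adds processing work, mean surprisal should rise
    # roughly monotonically with the number of sub-units.
    print(df.groupby("n_morphemes")["surprisal"].mean())   # rises with count
    print(df.groupby("n_bpe_tokens")["surprisal"].mean())  # no such pattern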
We then evaluated the models on eyetracking & self-paced reading data, finding in the aggregate that BPE-based surprisal predicts human behavior just as well as the morphological alternative. So far so good for BPE.
However, this isn’t the whole story. Around 95% of the tokens in both evaluation corpora were never split by the BPE tokenizer.
November 2, 2023 at 10:25 PM
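That number is easy to check for any corpus. A sketch using GPT-2's BPE tokenizer for illustration (the paper's exact tokenizer and corpora may differ):

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")  # any BPE tokenizer works similarly

    words = open("corpus.txt").read().split()  # hypothetical whitespace-split corpus
    # A word counts as "split" if it maps to more than one BPE token; the
    # leading space matters because GPT-2's word-initial tokens include it.
    n_split = sum(len(tok.encode(" " + w)) > 1 for w in words)
    print(f"{1 - n_split / len(words):.1%} of words were never split")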
Honored my paper was accepted to Findings of #EMNLP2023! Many psycholinguistics studies use LLMs to estimate the probability of words in context. But LLMs process statistically derived subword tokens, while human processing doesn't. Does this matter? (w/Philip Resnik) 🧵
arxiv.org/abs/2310.17774
November 2, 2023 at 10:20 PM
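For context, this is how word probabilities are usually read off a subword LLM: a word is a sequence of subword tokens, so its surprisal is the sum of the token surprisals. A minimal sketch with GPT-2 (for illustration only, not the paper's own models):

    import math
    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    def word_surprisal(context, word):
        ctx_ids = tok.encode(context)
        word_ids = tok.encode(" " + word)  # leading space: word-initial BPE token
        ids = torch.tensor([ctx_ids + word_ids])
        with torch.no_grad():
            logprobs = model(ids).logits.log_softmax(-1)
        # The word's probability is the product of its subword token
        # probabilities, so its surprisal is the sum of token surprisals.
        # Logits at position p predict the token at position p + 1.
        nats = -sum(logprobs[0, len(ctx_ids) + i - 1, t].item()
                    for i, t in enumerate(word_ids))
        return nats / math.log(2)  # convert to bits

    print(word_surprisal("The cat sat on the", "mat"))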