Matt Goldrick
mattgoldrick.bsky.social
Linguistics and cognitive science at Northwestern. Opinions are my own. he/him/his
Reposted by Matt Goldrick
Here are the slides: docs.google.com/presentation...

For context the intended audience is Lang Dev researchers who are attending a session on "what can LLMs tell us about human language".

If you have any thoughts I'd love to hear them!
SLD Plenary - Najoung Kim
Whence insights? The value of delineating human and machine CogSci. Najoung Kim (Boston University), Society for Language Development Annual Symposium, November 6, 2025
docs.google.com
November 18, 2025 at 9:31 PM
And it should be noted that this work might help us reimagine the nature of the computations underlying acquisition -- e.g., for 'tokenization', it's not entirely clear what the tokens should eventually be users.umiacs.umd.edu/~nhf/papers/...
November 10, 2025 at 11:04 PM
These are fantastic, and I think there's a ton of interesting work to be done here -- because tokenization/discovery of language structure is far from trivial and definitely not 'solved' in a general sense.
November 10, 2025 at 11:02 PM
I'm very excited about these models, but I think we're a long way from being able to say we have an in-principle solution for realistic training sizes
November 10, 2025 at 8:20 PM
I think this is @glupyan.bsky.social's link: ai.meta.com/blog/textles... which includes the Zero Speech benchmarks. I agree that self-supervised models are really interesting (I'm using them in my own work right now), but as far as I know these require huge amounts of training data
November 10, 2025 at 8:19 PM
My understanding, @glupyan.bsky.social (correct me if I'm wrong!), is that tokenization is very much an open research area, esp. without any access to text -- I'm not aware of any BabyLM work that examines audio-only or AV-only tokenization
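For readers unfamiliar with the audio "tokenization" being discussed in this thread: self-supervised speech pipelines often discretize continuous frame-level features into a small unit inventory, e.g., by k-means clustering, so that a sequence of acoustic frames becomes a sequence of discrete token IDs. The sketch below is purely illustrative (toy 2-D "features", a tiny hand-rolled k-means, invented names and sizes), not the procedure of any specific model mentioned above.

```python
# Illustrative sketch: turning continuous per-frame features into discrete
# "tokens" via k-means, loosely analogous to unit discovery in
# self-supervised speech models. All names and numbers are assumptions.
import numpy as np

def kmeans_tokenize(frames, n_units=4, n_iter=10):
    """Cluster frame vectors; return one discrete unit ID per frame."""
    # Farthest-point initialization (deterministic): spread the initial
    # centers so well-separated clusters each get one center.
    centers = [frames[0]]
    for _ in range(n_units - 1):
        dists = np.min(
            [np.linalg.norm(frames - c, axis=1) for c in centers], axis=0
        )
        centers.append(frames[dists.argmax()])
    centers = np.stack(centers)
    for _ in range(n_iter):
        # Assign each frame to its nearest center (Euclidean distance).
        d = np.linalg.norm(frames[:, None, :] - centers[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # Recompute each center as the mean of its assigned frames.
        for k in range(n_units):
            centers[k] = frames[labels == k].mean(axis=0)
    return labels

# Toy "utterance": 100 frames drawn around four well-separated feature
# vectors, standing in for the continuous output of a speech encoder.
rng = np.random.default_rng(1)
protos = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0], [5.0, 5.0]])
frames = np.concatenate(
    [p + 0.1 * rng.standard_normal((25, 2)) for p in protos]
)
tokens = kmeans_tokenize(frames, n_units=4)
print(len(tokens), sorted(set(tokens.tolist())))
```

The open question raised in the thread is precisely what the real analogue of `frames` and `n_units` should be when the learner has only audio (or audio-visual) input and no text to anchor the unit inventory.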
November 10, 2025 at 4:59 PM