hal
@harold.bsky.social
part-time poster | researching privacy in/and/of public data @ cornell tech and wikimedia | writing for joinreboot.org
we also open source all of our code, data, and embeddings!
paper: arxiv.org/abs/2511.09685
github: github.com/htried/wiki-...
huggingface: huggingface.co/datasets/htr...
November 17, 2025 at 4:11 PM
this is just the tip of the iceberg, and the paper contains much, much more: analyses of the top 100 domains, article subsets of elected officials and controversial topics, etc etc etc

please give it a read and let me know what you think!
November 17, 2025 at 4:11 PM
we also found troubling instances of “auto-citogenesis,” or cases where:
- an X user asks the Grok chatbot something, then publishes the answer
- Grokipedia *cites that answer* without noting that it is a chatbot output
(the attached images are real examples of this)
November 17, 2025 at 4:11 PM
- but a random sample of articles shows which topics have been heavily rewritten (history, politics, philosophy, biography) and which haven’t (STEM, sports, movies)
- grokipedia also targeted the wiki articles deemed highest quality for rewrites: the "featured article" and "good article" classes
November 17, 2025 at 4:11 PM
- the primary distinction to make is whether grokipedia pages are cc-licensed or not—non-cc-licensed pages are presumably largely rewritten by grok
- many grokipedia pages (including those without cc licenses) are basically identical to their wiki counterparts, especially short ones
November 17, 2025 at 4:11 PM
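the "basically identical" comparison above can be sketched with a simple text-similarity check. this is an illustrative approach, not necessarily the paper's exact method, and the texts and threshold here are made up:

```python
# Hypothetical sketch: flagging Grokipedia pages that are near-identical to
# their Wikipedia counterparts via a character-level similarity ratio.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two page texts."""
    return SequenceMatcher(None, a, b).ratio()

# Illustrative page texts (not real article content).
wiki_text = "The 2024 conclave elected a new pope after two days of voting."
grok_text = "The 2024 conclave elected a new pope after two days of voting."

# Pages above a high threshold are treated as effectively unchanged copies;
# 0.95 is an arbitrary cutoff for this sketch.
is_near_identical = similarity(wiki_text, grok_text) >= 0.95
```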
our paper tries to answer these questions

we find
- grokipedia pages are longer than wiki counterparts, and cite 2x more sources
- but citation standards are more lax than wiki: grok cites stormfront, infowars and many more
- non-CC licensed grokipedia pages increase blacklisted source cites 13x(!)
November 17, 2025 at 4:11 PM
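counting citations to blacklisted domains, as in the 13x finding above, can be sketched like this. the blocklist and citation URLs are made-up examples (Wikipedia's actual perennial-sources list is much longer):

```python
# Hypothetical sketch: counting how many of a page's citations point at
# deprecated/blacklisted domains.
from urllib.parse import urlparse

BLACKLIST = {"stormfront.org", "infowars.com"}  # illustrative subset

def blacklisted_count(citation_urls: list[str]) -> int:
    """Count citations whose domain is on the blocklist."""
    count = 0
    for url in citation_urls:
        host = urlparse(url).hostname or ""
        # strip a leading "www." so www.infowars.com matches infowars.com
        domain = host[4:] if host.startswith("www.") else host
        if domain in BLACKLIST:
            count += 1
    return count

cites = [
    "https://www.infowars.com/some-article",
    "https://en.wikipedia.org/wiki/Reliable_sources",
]
```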
@cameron.pfiffer.org planning to work on it soon!
May 16, 2025 at 2:28 AM
hi @alt.psingletary.com! you tagged the right person—I was working on this for a class project this semester

got it to an MVP stage about a week ago and hit pause to work on some other projects, but will keep working on it and would definitely love to hear your feedback if you have any :)
May 16, 2025 at 12:19 AM
and please remember to thank your local site reliability engineer!!!!
May 8, 2025 at 5:08 PM
english wikipedia pageviews for the conclave movie starting from oct 20 2024 (five days before release in the US)

first big spike is the academy awards, second is pope francis’ death

pageviews.wmcloud.org?project=en.w...
May 7, 2025 at 8:47 PM
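the same numbers behind that chart can be pulled from the public Wikimedia Pageviews REST API. a minimal sketch, assuming the date range and article title from the post above:

```python
# Hypothetical sketch: querying daily pageviews for the Conclave film article
# from the Wikimedia Pageviews REST API (dates are YYYYMMDD).
API = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"

def per_article_url(project: str, article: str, start: str, end: str) -> str:
    """Build a daily per-article pageviews query URL."""
    return f"{API}/{project}/all-access/user/{article}/daily/{start}/{end}"

url = per_article_url("en.wikipedia.org", "Conclave_(film)", "20241020", "20250507")

# Fetching (left commented so the sketch stays offline):
# import urllib.request, json
# with urllib.request.urlopen(url) as resp:
#     daily = json.load(resp)["items"]  # one dict per day with a "views" field
```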
There's a quickly-developing line of work on how insecure these agent systems can be, particularly when they have access to write and execute code.

The attacks on them are simple + devastating, up to and including reverse shells, data exfiltration, and more!

arxiv.org/abs/2503.12188
Multi-Agent Systems Execute Arbitrary Malicious Code
Multi-agent systems coordinate LLM-based agents to perform tasks on users' behalf. In real-world applications, multi-agent systems will inevitably interact with untrusted inputs, such as malicious Web...
arxiv.org
March 24, 2025 at 5:52 PM
Anyhow, there’s a lot more in the paper. Please read it if you’re interested and let us know if you have any thoughts, questions, concerns, etc!

arxiv.org/abs/2503.12188

12/12
March 18, 2025 at 3:23 PM
Modern Web browsers isolate untrusted content using the same-origin policy. AI agents today do not distinguish safe from unsafe content, nor data from (potentially malicious) instructions.

developer.mozilla.org/en-US/docs/W...

en.wikipedia.org/wiki/Same-or...

11/12
March 18, 2025 at 3:23 PM
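the core of the same-origin policy mentioned above fits in a few lines: two URLs share an origin only if their scheme, host, and port all match. a minimal sketch:

```python
# Minimal sketch of the same-origin policy's core check.
from urllib.parse import urlsplit

def same_origin(a: str, b: str) -> bool:
    """True iff the two URLs have the same (scheme, host, port) origin."""
    ua, ub = urlsplit(a), urlsplit(b)
    # .port is None when the URL uses the scheme's default port, so normalize.
    default = {"http": 80, "https": 443}
    pa = ua.port or default.get(ua.scheme)
    pb = ub.port or default.get(ub.scheme)
    return (ua.scheme, ua.hostname, pa) == (ub.scheme, ub.hostname, pb)

same_origin("https://example.com/a", "https://example.com:443/b")  # True
same_origin("https://example.com/a", "http://example.com/a")       # False
```

browsers use this boundary to keep untrusted pages from reading each other's data; the point of the thread is that AI agents currently have no analogous boundary between data and instructions.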
The narrative around AI safety shouldn’t be “Terminator” or “AI Chernobyl.” The right analogy is Netscape Navigator 1.0—the era when Web browsers first became a thing, and it was unclear how to protect users from potentially harmful Web content.

10/12
March 18, 2025 at 3:23 PM