Rich Harang
@rich.harang.org
Using bad guys to catch math since 2010.
Principal Security Architect (AI/ML) and AI Red Team at NVIDIA.
He/him. Personal account etc; `from std_disclaimers import *`
Safe AI starts with Secure AI.
Meanwhile, on Twitter (not "X"; their words not mine)....
(From quick inspection: mostly crypto + telegram scams -- this is about a week's worth)
May 27, 2025 at 12:57 PM
Tapping the "Models give you what you ask for, not what you want" sign yet again.
April 30, 2025 at 4:50 PM
I am begging AI Red Teams to stop killing themselves trying to prevent attacks that can be just as easily accomplished by editing client-side HTML.
For example:
February 26, 2025 at 2:10 PM
Without downloading new pictures/videos where are you mentally?
January 23, 2025 at 2:05 PM
An arcane tome filled with occult knowledge about the true workings of the world, that causes madness and despair in all who pursue its dark secrets? Yeah we've got one in the back.
January 8, 2025 at 6:02 PM
Apropos of the "models pretending to escape from their server" thing:
(from transformer-circuits.pub/2024/scaling...)
December 7, 2024 at 12:48 PM
Never change, "Answer with AI" features.
November 11, 2024 at 3:22 PM
Never change, "Answer with AI" features.
Never change, Microsoft. You're doing great.
November 9, 2024 at 3:04 PM
The team's response to _every_ LLM security finding this week:
October 29, 2024 at 5:56 PM
The good Confluence (at Harper's Ferry).
October 21, 2024 at 12:04 PM
Bucket list item complete: finally saw the northern lights in person. Had about ten minutes when you could see the green ribbons with the naked eye, no long exposure photo needed.
November 27, 2023 at 10:42 PM
So remember the "mango pudding" LLM backdooring attack? How safe do you feel using these models now?
July 3, 2023 at 1:40 PM
PS I had to see this so now you do
May 3, 2023 at 5:07 PM
permanently linked in my brain to
April 28, 2023 at 4:48 PM