lhl
@lhl.bsky.social
Easily distracted, currently building open source AI. Living online since FidoNet
Over the past couple of weeks I've been working on some Strix Halo testing in my spare time. This includes bringing up a harness for doing full pp/tg (prompt processing / token generation) sweeps across a variety of different model architectures, backends, and flags. Writeup just posted to r/LocalLLaMA: www.reddit.com/r/LocalLLaMA...
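For the curious, the harness boils down to something like the toy sketch below (not the actual scripts - the model list, flag combos, and output handling are placeholders, and backend selection is really done by pointing at different llama.cpp builds):

```python
# Toy sweep driver: run llama-bench over flag combinations and collect
# pp (prompt processing) / tg (token generation) results as JSON records.
import itertools, json, subprocess

models = ["llama-3.1-8b-q4_k_m.gguf"]   # placeholder GGUF list
flash_attn = [0, 1]                     # -fa off/on
batch_sizes = [256, 2048]               # -b values to sweep

results = []
for model, fa, b in itertools.product(models, flash_attn, batch_sizes):
    cmd = ["llama-bench", "-m", model,
           "-p", "512", "-n", "128",    # pp512 / tg128 tests
           "-fa", str(fa), "-b", str(b),
           "-o", "json"]                # machine-readable output
    proc = subprocess.run(cmd, capture_output=True, text=True, check=True)
    results.extend(json.loads(proc.stdout))

print(json.dumps(results, indent=2))    # one record per (model, flags, test)
```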
July 22, 2025 at 11:05 AM
One neat thing: experimenting with using Shisa V2 405B to regen our datasets, I'm seeing gains w/ the new chosen responses for DPO (slight boost on Qwen 3 vs the original DPO set), and for SFT+DPO, close to a 0.5-point gain on Shaberi averages for Llama 3.1 8B.
June 20, 2025 at 6:24 PM
Recently I started doing some Qwen3 testing (Shaberi, GPT-4.1 judge) and interestingly, for almost all models, reasoning yielded worse performance. Note: I need to stand multieval back up - even though the Qwen3 8B tunes appear to match the Shisa V2 12B/14B tunes, they are much worse on translation.
June 15, 2025 at 5:03 AM
Perhaps a more interesting side note is that I am still basically illiterate in Japanese, but wrote this presentation with almost no native speaker review/assistance - just many many rounds of LLM assistance (mainly GPT-4.5, but some help from Shisa V2 405B too! 😂) including for final editing.
June 3, 2025 at 5:15 AM
We're still working on a full, proper technical report (tracking down references is hard) but we have an Overview Report slide deck I posted in EN/JA here: shisa.ai/posts/shisa-...

It's my first Japanese slide deck and I super embraced the aesthetic!
June 3, 2025 at 5:11 AM
Related to an earlier observation bsky.app/profile/did:... - but since both our 70B and 405B Shisa V2 models are *stronger than GPT-4 in Japanese,* it has trouble judging them. Luckily GPT-4.1 is still able to distinguish them. 😅
June 3, 2025 at 5:08 AM
BTW, you can chat w/ an FP8 version of Shisa V2 405B online right now. If you don't speak Japanese, you can ask it to translate or even teach you some 😀 chat.shisa.ai
June 3, 2025 at 5:02 AM
Today we launched one more addition to the Shisa V2 models: Shisa V2 405B. This is a new Llama 3.1 405B post-tune that is the strongest model ever trained in Japan! It matches GPT-4o and DeepSeek-V3 on JA MT-Bench. Read more here: shisa.ai/posts/shisa-...
June 3, 2025 at 4:59 AM
OK, first JA slide deck in the books. 😅 (Thanks, ChatGPT 4.5.)
May 27, 2025 at 4:19 AM
When your model is sufficiently better than the judge model, the judge may just start throwing a lot of 10s in its scoring 😂 (based on our overall eval battery, shisa-v2 70b is a fair amount better than gpt-4 and gpt-4-turbo, but that's the standard judge used for 1:1 comparisons...)
May 23, 2025 at 5:34 AM
I've recently been poking at Strix Halo. For those interested in using it for inference, it's about what you'd expect (except for surprisingly bad llama.cpp HIP perf): www.reddit.com/r/LocalLLaMA... - but for those looking to do work (PyTorch, etc)... the current state is not good.
May 14, 2025 at 5:46 PM
Each DPO run for the 405B used all 256 H100s at our disposal and took about 3300 GPU-hours (roughly 13 hours of wall-clock time per run). By comparison, doing a full SFT+DPO on our Shisa V2 70B "only" took about 1200 H100-hours.
April 28, 2025 at 12:29 PM
Over the weekend, I finished up our Llama 405B run (4th group I know of to do an FFT?). It was a real beast to train, but beats our Shisa V2 70B (as well as GPT-4 and GPT-4 Turbo) using basically our Shisa V2 recipe. It is, I believe, the best-performing LLM (JA and EN) ever trained in Japan.
April 28, 2025 at 12:25 PM
The new Llama 4 release has been a bit of a mess. I've been busy, so I waited for a vLLM stable release blog.vllm.ai/2025/04/05/l... (w/ inference accuracy validation) to see if it's really that bad... Run on an H100 node, the models do OK on EN/JA benchmarks (including some unreleased/just-created ones).
April 7, 2025 at 10:02 AM
quasar-alpha looks... quite good
April 5, 2025 at 6:25 PM
Finally at a point where I can just kick back and wait for results...
March 29, 2025 at 4:04 AM
I never noticed this before. OpenAI Deep Research has some new tricks up its sleeve?
March 28, 2025 at 3:50 PM
I've been going through some of the RL releases from last year that I've been meaning to try out, like SPIN github.com/uclaml/SPIN - I implemented a DPO version w/ tuned hyperparameters, and despite decent trajectories, it fails hard (each iteration eval'd worse than the last).
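Roughly, my "DPO version" of the SPIN loop looked like the sketch below (heavily simplified - checkpoint names, dataset columns, and hyperparameter values are placeholders, not the tuned config):

```python
# SPIN-style self-play as iterative DPO: chosen = human SFT target,
# rejected = the current policy's own generation, retrain, repeat.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "my-sft-checkpoint"                      # placeholder SFT model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
sft_rows = [{"prompt": "...", "response": "..."}]     # placeholder SFT data

def build_spin_pairs(model, tokenizer, rows, max_new_tokens=512):
    pairs = []
    for row in rows:
        inputs = tokenizer(row["prompt"], return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True)
        gen = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                               skip_special_tokens=True)
        pairs.append({"prompt": row["prompt"],
                      "chosen": row["response"],       # ground-truth SFT response
                      "rejected": gen})                # model's own output
    return Dataset.from_list(pairs)

for it in range(3):                                    # a few self-play iterations
    train_ds = build_spin_pairs(model, tokenizer, sft_rows)
    args = DPOConfig(output_dir=f"spin-iter-{it}", beta=0.1,  # beta is a placeholder
                     per_device_train_batch_size=1, num_train_epochs=1)
    trainer = DPOTrainer(model=model, args=args, train_dataset=train_ds,
                         processing_class=tokenizer)
    trainer.train()
    model = trainer.model       # next iteration samples from the updated policy
```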
March 17, 2025 at 6:39 PM
Recently tested SimPO vs DPO and got results similar to others': DPO comes out ahead, even when using the "V2" optimized hyperparams (grey line), w/ the same ArmoRM dataset on a similar model (a llama3.1-8b SFT). Used trl 0.13.0 since there's a multi-GPU bug w/ CPOTrainer: github.com/huggingface/...
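(For reference, the SimPO side was set up roughly like this - trl exposes SimPO via CPOTrainer's loss_type; the model/dataset names and hyperparameter values below are placeholders, not the actual run config:)

```python
# SimPO via trl's CPOTrainer: loss_type="simpo" with cpo_alpha=0 drops the
# CPO SFT term, leaving the reference-free, length-normalized SimPO objective.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import CPOConfig, CPOTrainer

model_name = "my-llama3.1-8b-sft"                     # placeholder SFT checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Preference data with prompt/chosen/rejected columns (e.g. ArmoRM-scored pairs).
train_ds = load_dataset("my-org/armorm-preferences", split="train")  # placeholder repo

args = CPOConfig(
    output_dir="simpo-out",
    loss_type="simpo",
    cpo_alpha=0.0,          # pure SimPO (no behavior-cloning term)
    simpo_gamma=0.5,        # target reward margin (placeholder value)
    beta=2.0,               # reward scaling (placeholder value)
    per_device_train_batch_size=1,
)
trainer = CPOTrainer(model=model, args=args, train_dataset=train_ds,
                     processing_class=tokenizer)
trainer.train()
```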
March 14, 2025 at 5:28 AM
GPT-4.5 vs o1-pro (Reasoned for 2m 6s)
March 4, 2025 at 11:23 AM
Over the past couple days I've been running GPT-4.5 through about 3M tokens' worth of multi-lingual evals (including some newly created ones), and while it didn't blow away the other top models I tested, it did meaningfully beat everything else in almost every eval. (I also tested 4o as a reference.)
March 3, 2025 at 9:41 PM
For anyone trying to apply SGLang's sweet-looking dp-attention github.com/sgl-project/... - btw, it's currently busted: github.com/sgl-project/... Since it takes me 30min to load the server every time, I'll just wait for the mainline fix (same reason I'm not running torch.compile - it takes forever and segfaults).
February 27, 2025 at 7:23 PM
HF_TRANSFER gud
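(i.e. roughly this - the env var just needs to be set before huggingface_hub is imported; requires `pip install hf_transfer`, and the repo id is only an example:)

```python
# Faster Hugging Face Hub downloads via the Rust-based hf_transfer backend.
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"   # read by huggingface_hub at import time

from huggingface_hub import snapshot_download

snapshot_download(repo_id="Qwen/Qwen2.5-7B-Instruct")   # example repo
```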
February 27, 2025 at 8:10 AM
Posted by @vgel.me on the other site
January 25, 2025 at 5:44 PM
I've been doing some inference throughput/latency testing (focused on lowest TTFT) across various quants and engines. The bs=1-optimized (but server-capable) kernels scale pretty poorly. (Also, while vLLM and SGLang can both use Marlin kernels, SGLang's latency seems better across the board.)
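The TTFT probe is basically a streaming request with a stopwatch, along these lines (a sketch against an OpenAI-compatible endpoint - base URL, model name, and prompt are placeholders):

```python
# Measure time-to-first-token (TTFT) against a vLLM/SGLang OpenAI-compatible server.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="my-model",                                  # whatever the server is serving
    messages=[{"role": "user", "content": "Say hello."}],
    max_tokens=64,
    stream=True,
)

ttft = None
for chunk in stream:
    if not chunk.choices:                              # some chunks carry no choices
        continue
    delta = chunk.choices[0].delta.content
    if delta and ttft is None:
        ttft = time.perf_counter() - start             # first generated token arrives here
        break

print(f"TTFT: {ttft:.3f}s")
```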
January 19, 2025 at 8:52 AM