lhl
@lhl.bsky.social
Easily distracted, currently building open source AI. Living online since FidoNet
Over the past couple of weeks I've been working on some Strix Halo testing in my spare time. This includes bringing up a harness for doing full pp/tg (prompt processing / token generation) sweeps across a variety of different model architectures, backends, and flags. Writeup just posted to r/LocalLLaMA: www.reddit.com/r/LocalLLaMA...
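For the curious, the harness boils down to something like the toy sketch below (not the actual scripts - the model list, flag combos, and output handling are placeholders, and backend selection is really done by pointing at different llama.cpp builds):

```python
# Toy sweep driver: run llama-bench over flag combinations and collect
# pp (prompt processing) / tg (token generation) results as JSON records.
import itertools, json, subprocess

models = ["llama-3.1-8b-q4_k_m.gguf"]   # placeholder GGUF list
flash_attn = [0, 1]                     # -fa off/on
batch_sizes = [256, 2048]               # -b values to sweep

results = []
for model, fa, b in itertools.product(models, flash_attn, batch_sizes):
    cmd = ["llama-bench", "-m", model,
           "-p", "512", "-n", "128",    # pp512 / tg128 tests
           "-fa", str(fa), "-b", str(b),
           "-o", "json"]                # machine-readable output
    proc = subprocess.run(cmd, capture_output=True, text=True, check=True)
    results.extend(json.loads(proc.stdout))

print(json.dumps(results, indent=2))    # one record per (model, flags, test)
```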
July 22, 2025 at 11:05 AM
One neat thing: experimenting with using Shisa V2 405B to regen our datasets, I'm seeing gains w/ the new chosen responses for DPO (slight boost on Qwen 3 vs the original DPO set), and for SFT+DPO, close to a 0.5-point gain on Shaberi averages for Llama 3.1 8B.
June 20, 2025 at 6:24 PM
Recently I started doing some Qwen3 testing (Shaberi, GPT-4.1 judge) and interestingly, for almost all models, reasoning yielded worse performance. Note: I need to stand multieval back up - even though the Qwen3 8B tunes appear to match the Shisa V2 12B/14B tunes, they are much worse on translation.
June 15, 2025 at 5:03 AM
Perhaps a more interesting side note is that I am still basically illiterate in Japanese, but wrote this presentation with almost no native speaker review/assistance - just many many rounds of LLM assistance (mainly GPT-4.5, but some help from Shisa V2 405B too! 😂) including for final editing.
June 3, 2025 at 5:15 AM
We're still working on a full, proper technical report (tracking down references is hard) but we have an Overview Report slide deck I posted in EN/JA here: shisa.ai/posts/shisa-...

It's my first Japanese slide deck and I super embraced the aesthetic!
June 3, 2025 at 5:11 AM
Related to an earlier observation bsky.app/profile/did:... - but since both our 70B and 405B Shisa V2 models are *stronger than GPT-4 in Japanese,* it has trouble judging them. Luckily GPT-4.1 is still able to distinguish them. 😅
June 3, 2025 at 5:08 AM
BTW, you can chat w/ an FP8 version of Shisa V2 405B online right now. If you don't speak Japanese, you can ask it to translate or even teach you some 😀 chat.shisa.ai
June 3, 2025 at 5:02 AM
Today we launched one more addition to the Shisa V2 models: Shisa V2 405B. This is a new Llama 3.1 405B post-tune that is the strongest model ever trained in Japan! It matches GPT-4o and DeepSeek-V3 on JA MT-Bench. Read more here: shisa.ai/posts/shisa-...
June 3, 2025 at 4:59 AM
OK, first JA slide deck in the books. 😅 (Thanks, ChatGPT 4.5.)
May 27, 2025 at 4:19 AM
When your model is sufficiently better than the judge model, the judge may just start throwing a lot of 10s in its scoring 😂 (based on our overall eval battery, shisa-v2 70b is a fair amount better than gpt-4 and gpt-4-turbo, but that's the standard judge used for 1:1 comparisons...)
May 23, 2025 at 5:34 AM
I've recently been poking at Strix Halo. For those interested in using it for inference, it's about what you'd expect (except for surprisingly bad llama.cpp HIP perf): www.reddit.com/r/LocalLLaMA... - but for those looking to do work (PyTorch, etc)... the current state is not good.
May 14, 2025 at 5:46 PM
Each DPO run for the 405B used all 256 H100s at our disposal and took about 3300 GPU-hours (roughly 13 hours of wall-clock time per run). By comparison, doing a full SFT+DPO on our Shisa V2 70B "only" took about 1200 H100-hours.
April 28, 2025 at 12:29 PM
Over the weekend, I finished up our Llama 405B run (4th group I know of to do an FFT?). It was a real beast to train, but beats our Shisa V2 70B (as well as GPT-4 and GPT-4 Turbo) using basically our Shisa V2 recipe. It is, I believe, the best-performing LLM (JA and EN) ever trained in Japan.
April 28, 2025 at 12:25 PM
The new Llama 4 release has been a bit of a mess. I've been busy, so I waited for a vLLM stable release blog.vllm.ai/2025/04/05/l... (w/ inference accuracy validation) to see if it's really that bad... Run on an H100 node, the models do OK on EN/JA benchmarks (including some unreleased/just-created ones).
April 7, 2025 at 10:02 AM
quasar-alpha looks... quite good
April 5, 2025 at 6:25 PM
Finally at a point where I can just kick back and wait for results...
March 29, 2025 at 4:04 AM
I never noticed this before. OpenAI Deep Research has some new tricks up its sleeve?
March 28, 2025 at 3:50 PM
I've been going through some of the RL releases from last year that I've been meaning to try out, like SPIN github.com/uclaml/SPIN - I implemented a DPO version w/ tuned hyperparameters, and despite decent trajectories, it fails hard (each iteration eval'd worse than the last).
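Roughly, my "DPO version" of the SPIN loop looked like the sketch below (heavily simplified - checkpoint names, dataset columns, and hyperparameter values are placeholders, not the tuned config):

```python
# SPIN-style self-play as iterative DPO: chosen = human SFT target,
# rejected = the current policy's own generation, retrain, repeat.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "my-sft-checkpoint"                      # placeholder SFT model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
sft_rows = [{"prompt": "...", "response": "..."}]     # placeholder SFT data

def build_spin_pairs(model, tokenizer, rows, max_new_tokens=512):
    pairs = []
    for row in rows:
        inputs = tokenizer(row["prompt"], return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True)
        gen = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                               skip_special_tokens=True)
        pairs.append({"prompt": row["prompt"],
                      "chosen": row["response"],       # ground-truth SFT response
                      "rejected": gen})                # model's own output
    return Dataset.from_list(pairs)

for it in range(3):                                    # a few self-play iterations
    train_ds = build_spin_pairs(model, tokenizer, sft_rows)
    args = DPOConfig(output_dir=f"spin-iter-{it}", beta=0.1,  # beta is a placeholder
                     per_device_train_batch_size=1, num_train_epochs=1)
    trainer = DPOTrainer(model=model, args=args, train_dataset=train_ds,
                         processing_class=tokenizer)
    trainer.train()
    model = trainer.model       # next iteration samples from the updated policy
```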
March 17, 2025 at 6:39 PM
Recently tested SimPO vs DPO and got results similar to others': DPO comes out ahead, even when using the "V2" optimized hyperparams (grey line), w/ the same ArmoRM dataset on a similar model (a llama3.1-8b SFT). Used trl 0.13.0 since there's a multi-GPU bug w/ CPOTrainer: github.com/huggingface/...
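(For reference, the SimPO side was set up roughly like this - trl exposes SimPO via CPOTrainer's loss_type; the model/dataset names and hyperparameter values below are placeholders, not the actual run config:)

```python
# SimPO via trl's CPOTrainer: loss_type="simpo" with cpo_alpha=0 drops the
# CPO SFT term, leaving the reference-free, length-normalized SimPO objective.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import CPOConfig, CPOTrainer

model_name = "my-llama3.1-8b-sft"                     # placeholder SFT checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Preference data with prompt/chosen/rejected columns (e.g. ArmoRM-scored pairs).
train_ds = load_dataset("my-org/armorm-preferences", split="train")  # placeholder repo

args = CPOConfig(
    output_dir="simpo-out",
    loss_type="simpo",
    cpo_alpha=0.0,          # pure SimPO (no behavior-cloning term)
    simpo_gamma=0.5,        # target reward margin (placeholder value)
    beta=2.0,               # reward scaling (placeholder value)
    per_device_train_batch_size=1,
)
trainer = CPOTrainer(model=model, args=args, train_dataset=train_ds,
                     processing_class=tokenizer)
trainer.train()
```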
March 14, 2025 at 5:28 AM
GPT-4.5 vs o1-pro (Reasoned for 2m 6s)
March 4, 2025 at 11:23 AM
Over the past couple days I've been running GPT-4.5 through about 3M tokens' worth of multi-lingual evals (including some newly created ones), and while it didn't blow away the other top models I tested, it did meaningfully beat everything else in almost every eval. (I also tested 4o as a reference.)
March 3, 2025 at 9:41 PM
For anyone trying to apply SGLang's sweet-looking dp-attention github.com/sgl-project/... - btw, it's currently busted: github.com/sgl-project/... Since it takes me 30min to load the server every time, I'll just wait for the mainline fix (same reason I'm not running torch.compile - it takes forever and segfaults).
February 27, 2025 at 7:23 PM
HF_TRANSFER gud
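(i.e. roughly this - the env var just needs to be set before huggingface_hub is imported; requires `pip install hf_transfer`, and the repo id is only an example:)

```python
# Faster Hugging Face Hub downloads via the Rust-based hf_transfer backend.
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"   # read by huggingface_hub at import time

from huggingface_hub import snapshot_download

snapshot_download(repo_id="Qwen/Qwen2.5-7B-Instruct")   # example repo
```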
February 27, 2025 at 8:10 AM
Posted by @vgel.me on the other site
January 25, 2025 at 5:44 PM
I've been doing some inference throughput/latency testing (focused on lowest TTFT) across various quants and engines. The bs=1-optimized (but server-capable) kernels scale pretty poorly. (Also, while vLLM and SGLang can both use Marlin kernels, SGLang's latency seems better across the board.)
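The TTFT probe is basically a streaming request with a stopwatch, along these lines (a sketch against an OpenAI-compatible endpoint - base URL, model name, and prompt are placeholders):

```python
# Measure time-to-first-token (TTFT) against a vLLM/SGLang OpenAI-compatible server.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="my-model",                                  # whatever the server is serving
    messages=[{"role": "user", "content": "Say hello."}],
    max_tokens=64,
    stream=True,
)

ttft = None
for chunk in stream:
    if not chunk.choices:                              # some chunks carry no choices
        continue
    delta = chunk.choices[0].delta.content
    if delta and ttft is None:
        ttft = time.perf_counter() - start             # first generated token arrives here
        break

print(f"TTFT: {ttft:.3f}s")
```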
January 19, 2025 at 8:52 AM