Re: Stop false review statements

From: Theodore Tso

Date: Sun May 17 2026 - 14:57:54 EST

On Sun, May 17, 2026 at 11:17:06AM -0700, Roman Gushchin wrote:
>
> I actually tried to run it with ollama on my
> personal framework 13. Adding nominal support is trivial, but the
> whole thing is not really useful: I can get maybe few hundreds
> tokens per second using a quantified model with reduced quality; an
> average sashiko review is consuming 3.5 millions tokens (with Gemini
> 3.1 pro, it’s also model-dependent).

I'm curious. What hardware and LLM model were you using? A few
hundred tokens per second seems surprisingly high. My initial
research[1] shows that an M5 Max MacBook Pro costing 5 or 6 kilobucks
can do 31.6 tokens/second on a 27B 4-bit quantized model (Qwen 3.5).

[1] https://www.reddit.com/r/LocalLLaMA/comments/1rzkw4x/m5_max_128g_performance_tests_i_just_got_my_new/

The model matters, of course. With Gemma 3 27B and a 6-bit
quantization, it's 21 tokens/s, and with Deepseek R1 8B Q6_K, it's
72.8 tokens/second. But unless you're using a really low-end model,
or a really expensive, splufty hardware platform, I haven't seen
reports of hundreds of tokens per second on hardware costing a
reasonable amount of money. (I'll set aside the question of whether
spending $6k for a fully spec'ed out M5 Max MacBook Pro, or $15k for a
fully spec'ed out M3 Ultra Mac Studio is "reasonable".)

As a result, I'm not entirely sure how realistic it is to do reviews
using "free" (you still have to pay $$$ for the hardware) local,
open-weight LLMs if an average review requires around 3.5 million
tokens.
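
To put rough numbers on that, here is a back-of-the-envelope sketch
using only the figures quoted above. It assumes, as a simplification,
that all 3.5 million tokens go through at the quoted tokens/second
rate; in practice prompt (input) tokens are usually processed much
faster than generated ones, so treat the result as a pessimistic
bound, not a measurement. The names TOKENS_PER_REVIEW, RATES and
review_hours are made up for this illustration.

```python
# Back-of-the-envelope estimate: wall-clock hours for one review at a
# constant token rate. Assumption (simplification): every token, input
# or output, is processed at the quoted tokens/second figure.

TOKENS_PER_REVIEW = 3_500_000  # average per-review figure quoted in this thread

# tokens/second figures quoted earlier in this message
RATES = {
    "Qwen 3.5 27B 4-bit": 31.6,
    "Gemma 3 27B 6-bit": 21.0,
    "Deepseek R1 8B Q6_K": 72.8,
}


def review_hours(tokens_per_second, tokens=TOKENS_PER_REVIEW):
    """Return the hours needed to process `tokens` at a constant rate."""
    return tokens / tokens_per_second / 3600.0


for name, rate in RATES.items():
    print(f"{name:22s} {rate:6.1f} tok/s -> {review_hours(rate):5.1f} hours")
```

Under that assumption, 31.6 tokens/second works out to about 31 hours
per review, and even a hypothetical 300 tokens/second would still be
a bit over 3 hours.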

Cheers,

- Ted