Re: Introduce Sashiko (agentic review of Linux kernel changes)
From: Lorenzo Stoakes (Oracle)
Date: Wed Mar 18 2026 - 14:50:37 EST
On Wed, Mar 18, 2026 at 11:33:22AM -0700, Roman Gushchin wrote:
> "Lorenzo Stoakes (Oracle)" <ljs@xxxxxxxxxx> writes:
>
> > On Tue, Mar 17, 2026 at 03:31:11PM +0000, Roman Gushchin wrote:
> >> Hello,
> >>
> >> I'm happy to share something my colleagues and I have been working on
> >> for the last several months:
> >> Sashiko - an agentic system for Linux kernel changes.
> >>
> >> First, Sashiko is available as a service at:
> >> * https://sashiko.dev
> >>
> >
> > ...
> >
> > (For one I'm going to go fix some bugs on my series I saw reported there).
> >
> > I think over time as the approach/model is refined this will get a LOT
> > better; it seems these things can accelerate quickly.
>
> Hi Lorenzo,
>
> Thank you for the kind words!
No problem, thanks for your hard work! :)
>
> RE false positives: I think Chris's prompts were initially heavily
> biased towards avoiding false positives, but that comes at the cost of
> missing real issues (in general, I don't have hard data on % of findings).
> To my knowledge, he's now looking to relax that a bit as well.
> But then there are different models in use, different protocols, etc.
>
> I also have a notion of issue severity and I was thinking about
> e.g. sending out only reviews revealing critical & high severity bugs
> (e.g. memory corruptions & panics). Or maybe send the feedback to the
> author in any case (e.g. for fixing typos), but cc maintainers only if
> there are serious concerns.
>
> And obviously no pressure, I won't enable any public email sending
> unless there is a consensus across maintainers of the corresponding
> subsystem.
I think maybe an opt-in thing might work for some of us?
But yeah we can take our time with this, Andrew is looking, I am for sure.
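For what it's worth, the severity gating you describe seems simple enough to
sketch. Everything below (names, thresholds, severity levels) is hypothetical,
just to make the idea concrete:

```python
# Hypothetical sketch of severity-based routing for review mail:
# the author always gets the findings, maintainers are cc'd only
# when something serious (memory corruption, panic) shows up.
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}

def recipients(findings, author, maintainers, threshold="high"):
    """Return who should receive a review, given its findings' severities."""
    worst = max((SEVERITY_RANK[s] for s in findings), default=0)
    to = [author]                          # typo-level feedback stays private
    if worst >= SEVERITY_RANK[threshold]:  # escalate serious findings
        to += maintainers
    return to
```

So a typo-only review would go to the author alone, while a critical finding
would cc the maintainers too.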
Oh and one data point -
https://lore.kernel.org/linux-mm/cover.1773846935.git.ljs@xxxxxxxxxx/
Read the v3 change log for a list of the issues it correctly raised for that
series, so it's definitely useful.
It was about maybe 50/50 noise/signal I think?
But as you can see that's already very useful, thank you, and it has fixed a
bunch of bugs in that code!
I'm not sure what Chris is planning, and I keep missing the AI meetings
for various reasons (other stuff clashing/away/tired sometimes :), but I
wonder how we will sync up with Chris's review bot experiments?
>
> >>
> >> * What's next?
> >>
> >> This is our first version and it's obviously not perfect. There is a
> >> long list of fixes and improvements to make. Please don't expect it to
> >> be 100% reliable, even though we'll try hard to keep it up and running.
> >> Please file GitHub issues or email me bug reports and feature requests,
> >> or send PRs.
> >
> > Of course, it's all much appreciated!
> >
> >>
> >> As of now, Sashiko only provides a web interface;
> >> however, Konstantin Ryabitsev is already adding sashiko.dev support to b4,
> >> and SeongJae Park is adding support to hkml.
> >> That was really fast, thank you!
> >
> > Thanks to Konstantin and SJ too, but the web interface is pretty nice I
> > must say, so thanks for that! :)
> >
> >>
> >> We're working on adding an email interface to Sashiko, and soon Sashiko
> >> will be able to send out reviews over email - similar to what the bpf
> >> subsystem already has. It will be opt-in by subsystem and will have options
> >
> > Like I said, I think it's a bit premature for mm at least _at this point_
> > but I'm sure it'll get there.
>
> I'd really appreciate (and actually need) your and other maintainers' and
> developers' feedback here. Even though I can't fix every single false
> positive as a code issue, I can hopefully tackle some common themes.
Is there a way for us to point out which parts of a review are signal and
which are noise?
If you could update the web interface for feedback that'd be really handy,
though I guess there's the painful stuff of having to have users etc. for
that :)
>
> Chris did fantastic work on the bpf subsystem (and several others) by
> manually analyzing replies to the AI feedback and adjusting prompts. Now
> we need to repeat this for all other subsystems.
Yeah, I'm happy to give feedback if there's a fairly low-friction way of doing
it, but the constant workload makes it hard if it requires much more effort :)
>
> >
> > For now I think we need to get the false positive rate down a fair bit,
> > otherwise it might be a little distracting.
> >
> > But people are _already_ integrating the web interface into workflows, I
> > check it now, and Andrew is already very keen :) see:
> >
> > https://lore.kernel.org/all/20260317121736.f73a828de2a989d1a07efea1@xxxxxxxxxxxxxxxxxxxx/
> > https://lore.kernel.org/all/20260317113730.45d5cef4ba84be4df631677f@xxxxxxxxxxxxxxxxxxxx/
> >
> >> to CC only the author of the patch, maintainers, volunteers, or send a
> >> fully public reply. If you're a maintainer and have a strong preference
> >> to get reviews over email, please let me know.
> >
> > Well as maintainer I think 'not quite yet' but probably soon is the answer
> > on that one!
> >
> >>
> >> We also desperately need better benchmarks, especially when it comes to
> >> false positives. Having a decent vetted set of officially perfect
> >> commits can help with this.
> >
> > Not sure perfect commits exist in the kernel, certainly not mine :P
>
> Same here :) This is why it's so hard.
Yes, but worthwhile! LLMs are surprisingly good at figuring out issues in
things, it's a real strength.
And it's already improving the code.
>
> >
> >>
> >> Finally, some subsystems have good prompt coverage and some don't. It
> >> doesn't have to be lengthy documentation (and it might actually be
> >> counter-productive), but having a small list of things to look at - some
> >> high-level concepts which are hard to grasp from the code, etc. - can
> >> help a lot with both bug discovery and false positives.
> >
> > I guess best contributed to Chris's review-prompts repo right?
>
> Both work for me for now; we'll figure out with Chris how to sync our
> prompts. The small problem is that we're using different models, tools and
> review protocols and can barely test each other's setups. And it's all
> very fragile, so it's not exactly trivial.
> But we'll figure out something soon.
Yeah, part of the fun I guess :)
>
> In general we need to carefully separate instructions (like which tools
> to use, which prompts to load, etc.) from factual data. Then we can easily
> reuse the factual data across various tooling.
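To make sure I follow, something like this? (all names, file layout, and the
"facts" themselves are invented for illustration):

```python
# Hypothetical sketch: keep tool-specific instructions separate from
# tool-agnostic subsystem facts, so the facts can be reused by any
# reviewer setup. All names and "facts" below are made-up examples.
INSTRUCTIONS = {
    "model": "whatever-model",   # tool-specific: model, tools, protocol
    "task": "Review the patch using the subsystem facts provided.",
}

FACTS = {                        # tool-agnostic, reusable knowledge
    "mm": [
        "example fact about locking order",
        "example fact about object lifetime rules",
    ],
}

def build_prompt(subsystem):
    """Combine the instruction template with a subsystem's fact list."""
    facts = "\n".join(f"- {f}" for f in FACTS.get(subsystem, []))
    return f"{INSTRUCTIONS['task']}\n\nSubsystem facts:\n{facts}"
```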
Hopefully I find some time to contribute some mm-specific stuff too :)
So far claude + Chris's prompts are working pretty great for me; I do see
it hallucinate or get things wrong sometimes but it's generally good.
Overall I continue to find that the more 'creative' the task, the worse it
does; the more you can constrain it to a problem domain, the better it does.
>
> Thanks!
Cheers, Lorenzo