Re: [QUESTION] problems report: rcu_read_unlock_special() called in irq_exit() causes dead loop

From: Qi Xi
Date: Tue Jul 01 2025 - 05:21:06 EST


Hello everyone,

Friendly ping about this problem :)

Qi

On 2025/6/6 2:56, Joel Fernandes wrote:

On 6/4/2025 8:26 AM, Paul E. McKenney wrote:
Or just don't send subsequent self-IPIs if we just sent one for the
rdp. Chances are, if we did not get the scheduler's attention during
the first one, we may not in subsequent ones I think. Plus we do send
other IPIs already if the grace period was over extended (from the FQS
loop), maybe we can tweak that?
Thanks a lot for your reply. I think it's hard for me to fix this issue as
above without introducing new bugs. I barely understand the RCU code. But I'm
very glad to help test any code modification you need tested. I have
the VM and the syzkaller benchmark which can reproduce the problem.
Sure, I understand. This is already incredibly valuable so thank you again.
I will request your testing help soon. I also have a test module now which
can sort of reproduce this. Will keep you posted!
Oh sorry I meant to ask - could you provide the full kernel log and also is
there a standalone reproducer syzkaller binary one can run to reproduce it in a VM?
Sorry, I talked with the team that maintains the syzkaller tools. They said
I can't send the syzkaller binary outside the company. But I can help to
reproduce the problem. It's not complicated and not time-consuming.

I found the original log, which is from kernel v6.6, but it's not complete.
So I reproduced the problem using the latest kernel.
Both logs are attached.

Looking at both the v6.6 version and Joel's fix, I am forced to conclude
that this bug has been there for a very long time. Thank you for your
testing efforts and Joel for the fix!
Thanks. I am still working on polishing the fix Xiongfeng tested. I hope to have
it out next week for review. As we discussed I will split the context-tracking
API into a separate patch and will also add a separate documentation
comment-patch on why we need the irq_work.

thanks,

- Joel