Re: [PATCH] media:v4l2-async:debugfs for registered subdevices

From: luo.liu.linux

Date: Fri Mar 20 2026 - 02:54:12 EST

Hi Laurent, Sakari,

1. Current debugging tools (KASAN, CONFIG_DEBUG_LIST) operate on a passive defense model: they only alert when an explicit error occurs, such as a corrupted pointer or an illegal memory access. They are fundamentally blind to "logical omissions", that is, states where data should have been removed but was not.

Consider the scenario where a driver forgets to call v4l2_async_unregister_subdev():

List Debug remains silent: Since no list_del() is executed, the sd->async_list node remains structurally intact within subdev_list, with valid prev and next links.

KASAN remains silent: It detects access to freed memory, not the existence of dangling pointers. If the orphaned node is never traversed after the driver frees its memory, no Use-After-Free is triggered.
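
As a rough illustration of the two points above, here is a minimal userspace sketch (not kernel code). The list helpers only imitate the kernel's list_head, and fake_subdev, list_is_consistent() and leaked_after_unload() are hypothetical names invented for this example. It simulates an unload path that forgets one unregistration and shows that the list still passes a structural consistency check, so a purely structural checker has nothing to report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified userspace stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

/* Hypothetical miniature sub-device: only what the sketch needs. */
struct fake_subdev {
	const char *name;
	struct list_head async_list;
};

static void list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

static void list_add_tail(struct list_head *n, struct list_head *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

static void list_del(struct list_head *n)
{
	assert(n->next && n->prev);	/* node must currently be linked */
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->next = NULL;
	n->prev = NULL;
}

/* Structural check: every node's neighbours point back at it. */
static bool list_is_consistent(const struct list_head *head)
{
	const struct list_head *p = head;

	do {
		if (p->next->prev != p || p->prev->next != p)
			return false;
		p = p->next;
	} while (p != head);
	return true;
}

static size_t list_count(const struct list_head *head)
{
	size_t n = 0;

	for (const struct list_head *p = head->next; p != head; p = p->next)
		n++;
	return n;
}

/*
 * Register three sub-devices, then "unload" while forgetting the dphy.
 * Returns how many entries are left behind and reports whether the
 * list is still structurally consistent.
 */
size_t leaked_after_unload(bool *still_consistent)
{
	struct list_head subdev_list;
	struct fake_subdev sensor = { .name = "sensor" };
	struct fake_subdev dphy = { .name = "dphy" };
	struct fake_subdev csi2 = { .name = "mipi-csi2" };

	list_init(&subdev_list);
	list_add_tail(&sensor.async_list, &subdev_list);
	list_add_tail(&dphy.async_list, &subdev_list);
	list_add_tail(&csi2.async_list, &subdev_list);

	list_del(&sensor.async_list);
	list_del(&csi2.async_list);
	/* The dphy unregistration is deliberately omitted here. */

	*still_consistent = list_is_consistent(&subdev_list);
	return list_count(&subdev_list);
}
```

Calling leaked_after_unload() leaves one entry in the list while the consistency check still passes. Only a dump of the list contents makes the leftover entry visible.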

In the scenario I described in my previous email, the bug remains dormant and invisible during normal operation. The driver could run flawlessly for years, only revealing the issue under specific stress conditions.

2. I noticed that the implementation of v4l2_async_unregister_subdev() reveals a critical state dependency:

void v4l2_async_unregister_subdev(struct v4l2_subdev *sd)
{
	// ...
	if (!sd->async_list.next)
		return;	// Guard check implies the node must be linked to proceed
	// ...
}

This guard check (if (!sd->async_list.next)) highlights the fragility of the state machine. If a driver mismanages its lifecycle or simply omits this call, the subdev remains permanently stranded in subdev_list. This is a logical consistency error (the data structure is valid but semantically incorrect) rather than a memory corruption issue.
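
To show what the guard does and does not protect against, here is a small userspace sketch under stated assumptions. fake_subdev, fake_register() and fake_unregister() are hypothetical stand-ins, and clearing the links to NULL after unlinking is a choice made for this sketch, not a claim about what the kernel function does internally. The guard turns an unregister on a never-registered or already-unregistered node into a no-op, but it cannot notice an unregister call that never happens.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified userspace stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

/* Hypothetical miniature sub-device; zeroed links mean "not registered". */
struct fake_subdev {
	const char *name;
	struct list_head async_list;
};

/* Global list of registered sub-devices, initially empty. */
static struct list_head subdev_list = { &subdev_list, &subdev_list };

void fake_register(struct fake_subdev *sd)
{
	struct list_head *n = &sd->async_list;

	assert(!n->next);	/* double registration is a bug in this sketch */
	n->prev = subdev_list.prev;
	n->next = &subdev_list;
	subdev_list.prev->next = n;
	subdev_list.prev = n;
}

/* Returns true if a node was unlinked, false if the guard returned early. */
bool fake_unregister(struct fake_subdev *sd)
{
	struct list_head *n = &sd->async_list;

	if (!n->next)
		return false;	/* same shape as the guard quoted above */

	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->next = NULL;		/* sketch-only convention: mark as unlinked */
	n->prev = NULL;
	return true;
}

bool subdev_list_empty(void)
{
	return subdev_list.next == &subdev_list;
}
```

With two registered sub-devices, unregistering only one of them leaves subdev_list_empty() false, and nothing in fake_unregister() can flag that: the guard only reacts to calls that are made, never to calls that are missing.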

3. Together with pending_subdevs_show, this interface provides a holistic view of the subsystem's health. It enables teams to proactively identify logical flaws during the development cycle, eliminating the reliance on luck or stress tests to uncover these deep-seated state management bugs.
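
For readers without the patch at hand, the idea of dumping the list can be sketched in userspace as follows. This is not the patch code and does not use debugfs or seq_file. dump_subdevs(), fake_subdev and the local container_of() macro are assumptions made for illustration. The sketch walks the list and writes one name per line into a caller-supplied buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Simplified userspace stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

/* Local re-creation of the well-known container_of() idiom. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical miniature sub-device: only what the sketch needs. */
struct fake_subdev {
	const char *name;
	struct list_head async_list;
};

void list_add_tail(struct list_head *n, struct list_head *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/*
 * Write one sub-device name per line into buf (NUL-terminated).
 * Returns the number of entries written; stops early if buf is full.
 */
size_t dump_subdevs(const struct list_head *head, char *buf, size_t len)
{
	size_t used = 0, count = 0;

	assert(head && buf);
	if (len)
		buf[0] = '\0';

	for (const struct list_head *p = head->next; p != head; p = p->next) {
		const struct fake_subdev *sd =
			container_of(p, struct fake_subdev, async_list);
		int n = snprintf(buf + used, len - used, "%s\n", sd->name);

		if (n < 0 || (size_t)n >= len - used) {
			if (used < len)
				buf[used] = '\0';	/* drop the partial line */
			break;
		}
		used += (size_t)n;
		count++;
	}
	return count;
}
```

In this sketch, a dump taken after all drivers are unloaded should come back empty. Any name still printed identifies the driver that skipped its unregistration.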

Regards,

Luo

At 2026-03-20 04:30:37, "Laurent Pinchart" <laurent.pinchart@xxxxxxxxxxxxxxxx> wrote:
>On Tue, Mar 17, 2026 at 07:14:43PM +0800, luo.liu.linux wrote:
>>
>> Hi Sakari,
>>
>> The existing pending_async_subdevices interface provides excellent
>> visibility into the notifier_list (the 'waiter' side).
>>
>> To achieve full symmetry and complete debuggability, we should also
>> expose the subdev_list (the 'provider' side). These two views solve
>> different problems:
>>
>> 1. Notifier List: Diagnoses why binding is stalled (missing sub-devices).
>>
>> 2. Subdev List: Diagnoses state inconsistencies (e.g., sub-devices
>> present but unmatched) and verifies resource cleanup upon unbind.
>>
>> From practical experience, lacking visibility into subdev_list makes
>> it difficult to distinguish between a sub-device probe failure and an
>> async matching failure.
>>
>> Adding this interface would provide a holistic view of the async
>> engine's state, which has proven essential for rapid issue
>> localization in complex driver stacks.
>
>I agree with Sakari here. There are plenty of other debugging tools in
>the kernel that can be used to diagnose the kind of issues you've
>described. I think this patch adds more noise than value.
>
>--
>Regards,
>
>Laurent Pinchart

On Fri, Mar 13, 2026 at 09:50:56PM +0800, luo.liu.linux wrote:
>
> Hi Sakari,
>
> Apologies if my previous explanation wasn't clear enough.
>
> To clarify, the primary goal of this interface is not merely to verify if insmod/rmmod succeeds,
> but to validate the correctness of the asynchronous subdevice registration and unregistration paths,
> specifically ensuring that resource allocation and reclamation are handled properly.
>
> I would like to share a real-world scenario that motivated this patch:
>
> We had a camera subsystem pipeline (sensor -> dphy -> mipi-csi2 -> isp) whose subdevice
> drivers appeared to function perfectly for six months. insmod and rmmod completed without any errors,
> and the system seemed stable during normal operation. However, just before a major release, a QA engineer performed
> stress testing involving rapid, repeated cycles of insmod and rmmod, which eventually triggered a kernel crash.
>
> During the debugging process, I inspected the internal global lists:
>
> static LIST_HEAD(subdev_list);
> static LIST_HEAD(notifier_list);
>
> By dumping the subdev_list via this debugfs interface, I discovered that a D-PHY subdevice entry remained in the list even
> after its driver was unloaded. Crucially, the output explicitly showed the device name, allowing me to immediately pinpoint
> the D-PHY driver as the culprit, rather than blindly troubleshooting other components in the pipeline (such as the sensor or ISP).
>
> This was the critical clue that led me to the root cause:
>
> The D-PHY subdriver's remove function was missing a call to v4l2_async_cleanup(sd). Consequently, the subdevice was never properly
> unregistered from the async framework, leading to a use-after-free or stale pointer issue during the stress test.
>
> Without this debugfs interface, detecting such "silent" registration leaks is extremely difficult.
> The driver loads and unloads without reporting errors, and standard logs (dmesg) often provide
> no indication that an entry was left behind in the core framework's list until a crash occurs under specific timing conditions.
>
>
> Given this experience, I believe this interface provides a vital visibility point for engineers to:
>
> 1. Verify that subdevices are correctly removed from the global list upon driver unload.
> 2. Catch missing cleanup calls (like v4l2_async_cleanup) early in the development cycle, rather than discovering them through random crashes in stress testing.

I guess you'd have found this with either KASAN or linked list debugging?

--
Sakari Ailus