Re: [PATCH] Bluetooth: hci_core: Don't queue tx_work while draining workqueue

From: Heitor Alves de Siqueira

Date: Wed May 20 2026 - 10:43:25 EST


On Mon May 18, 2026 at 5:48 PM -03, Luiz Augusto von Dentz wrote:
> Hi Heitor,
>
> On Mon, May 18, 2026 at 11:21 AM Heitor Alves de Siqueira
> <halves@xxxxxxxxxx> wrote:
>>
>> Hi Luiz,
>>
>> On Fri May 15, 2026 at 10:33 AM -03, Luiz Augusto von Dentz wrote:
>> > Hi Heitor,
>> >
>> > On Thu, May 14, 2026 at 10:54 AM Heitor Alves de Siqueira
>> > <halves@xxxxxxxxxx> wrote:
>> >>
>> >> On Wed May 13, 2026 at 11:04 PM -03, Hillf Danton wrote:
>> >> > On Wed, 13 May 2026 15:55:23 -0300 Heitor Alves de Siqueira wrote:
>> >> >> Syzbot reported a warning when L2CAP calls queue_work() on the hdev
>> >> >> workqueue while it's being drained. This can happen during device reset or
>> >> >> close paths for hci_send_acl(), hci_send_sco() and hci_send_iso().
>> >> >>
>> >> >> The workqueue is drained in hci_dev_do_reset() and in hci_dev_close_sync():
>> >> >> - hci_dev_close_sync() clears the HCI_UP bit before draining
>> >> >> - hci_dev_do_reset() sets HCI_CMD_DRAIN_WORKQUEUE before draining
>> >> >>
>> >> >> Add these checks before queuing tx_work, and free the SKB if it's not
>> >> >> queued for transmission.
>> >> >>
>> >> >> Fixes: 3eff45eaf817 ("Bluetooth: convert tx_task to workqueue")
>> >> >> Reported-by: syzbot+97721dd81f792e838ba0@xxxxxxxxxxxxxxxxxxxxxxxxx
>> >> >> Closes: https://syzkaller.appspot.com/bug?extid=97721dd81f792e838ba0
>> >> >> Signed-off-by: Heitor Alves de Siqueira <halves@xxxxxxxxxx>
>> >> >> ---
>> >> >> net/bluetooth/hci_core.c | 18 ++++++++++++++++++
>> >> >> 1 file changed, 18 insertions(+)
>> >> >>
>> >> >> diff --git a/net/bluetooth/hci_core.c b/net/bluetooth/hci_core.c
>> >> >> index c46c1236ebfa..5d5f8ad7d1a8 100644
>> >> >> --- a/net/bluetooth/hci_core.c
>> >> >> +++ b/net/bluetooth/hci_core.c
>> >> >> @@ -3278,6 +3278,12 @@ void hci_send_acl(struct hci_chan *chan, struct sk_buff *skb, __u16 flags)
>> >> >>
>> >> >> BT_DBG("%s chan %p flags 0x%4.4x", hdev->name, chan, flags);
>> >> >>
>> >> >> + if (!test_bit(HCI_UP, &hdev->flags) ||
>> >> >> + hci_dev_test_flag(hdev, HCI_CMD_DRAIN_WORKQUEUE)) {
>> >> >> + kfree_skb(skb);
>> >> >> + return;
>> >> >> + }
>> >> >> +
>> >> >> hci_queue_acl(chan, &chan->data_q, skb, flags);
>> >> >>
>> >> >> queue_work(hdev->workqueue, &hdev->tx_work);
>> >> >>
>> >> > What you add is not enough, go and see how HCI_CMD_DRAIN_WORKQUEUE is
>> >> > checked in hci_cmd_work(), and in hci_dev_do_reset() for why.
>> >>
>> >> I see, I missed the RCU guards for the device flags. Sorry about that,
>> >> I'll add them to v2.
>> >> Thanks for the catch!
>> >
>> > Actually this whole thing might be because we try to clean the
>> > workqueue without actually closing the hdev, so I suspect that if we
>> > just remove all the code from hci_dev_do_reset with
>> > hci_dev_do_close+hci_dev_do_open, it might work better and align with
>> > how things work over the MGMT interface:
>> >
>> > diff --git a/net/bluetooth/hci_core.c b/net/bluetooth/hci_core.c
>> > index c46c1236ebfa..6782bbc9b6a7 100644
>> > --- a/net/bluetooth/hci_core.c
>> > +++ b/net/bluetooth/hci_core.c
>> > @@ -539,47 +539,13 @@ static int hci_dev_do_reset(struct hci_dev *hdev)
>> >
>> > hci_req_sync_lock(hdev);
>> >
>> > - /* Drop queues */
>> > - skb_queue_purge(&hdev->rx_q);
>> > - skb_queue_purge(&hdev->cmd_q);
>> > -
>> > - /* Cancel these to avoid queueing non-chained pending work */
>> > - hci_dev_set_flag(hdev, HCI_CMD_DRAIN_WORKQUEUE);
>> > - /* Wait for
>> > - *
>> > - * if (!hci_dev_test_flag(hdev, HCI_CMD_DRAIN_WORKQUEUE))
>> > - * queue_delayed_work(&hdev->{cmd,ncmd}_timer)
>> > - *
>> > - * inside RCU section to see the flag or complete scheduling.
>> > - */
>> > - synchronize_rcu();
>> > - /* Explicitly cancel works in case scheduled after setting the flag. */
>> > - cancel_delayed_work(&hdev->cmd_timer);
>> > - cancel_delayed_work(&hdev->ncmd_timer);
>> > -
>> > - /* Avoid potential lockdep warnings from the *_flush() calls by
>> > - * ensuring the workqueue is empty up front.
>> > - */
>> > - drain_workqueue(hdev->workqueue);
>> > -
>> > - hci_dev_lock(hdev);
>> > - hci_inquiry_cache_flush(hdev);
>> > - hci_conn_hash_flush(hdev);
>> > - hci_dev_unlock(hdev);
>> > -
>> > - if (hdev->flush)
>> > - hdev->flush(hdev);
>> > -
>> > - hci_dev_clear_flag(hdev, HCI_CMD_DRAIN_WORKQUEUE);
>> > -
>> > - atomic_set(&hdev->cmd_cnt, 1);
>> > - hdev->acl_cnt = 0;
>> > - hdev->sco_cnt = 0;
>> > - hdev->le_cnt = 0;
>> > - hdev->iso_cnt = 0;
>> > + ret = hci_dev_close_sync(hdev);
>> > + if (ret)
>> > + goto done;
>> >
>> > - ret = hci_reset_sync(hdev);
>> > + ret = hci_dev_open_sync(hdev);
>> >
>> > +done:
>> > hci_req_sync_unlock(hdev);
>> > return ret;
>> > }
>> >
>> > Seem to work here and as a side effect we get notified over the MGMT
>> > when a user uses hciconfig hci0 reset:
>> >
>> > # tools/hciconfig hci0 reset
>> > bluetoothd[34]: src/shared/mgmt.c:can_read_data() [0x0000] event 0x0006
>> > bluetoothd[34]: src/adapter.c:new_settings_callback() Settings: 0x01fc0ac0
>> > bluetoothd[34]: src/adapter.c:settings_changed() Changed settings: 0x00000001
>> > bluetoothd[34]: src/adapter.c:settings_changed() Pending settings: 0x00000000
>> > bluetoothd[34]: src/adapter.c:cancel_passive_scanning()
>> > bluetoothd[34]: src/adapter.c:adapter_stop() adapter /org/bluez/hci0
>> > has been disabled
>> > bluetoothd[34]: src/shared/mgmt.c:can_read_data() [0x0000] event 0x0008
>> > bluetoothd[34]: src/shared/mgmt.c:can_read_data() [0x0000] event 0x0006
>> > bluetoothd[34]: src/adapter.c:new_settings_callback() Settings: 0x01fc0ac1
>> > bluetoothd[34]: src/adapter.c:settings_changed() Changed settings: 0x00000001
>> > bluetoothd[34]: src/adapter.c:settings_changed() Pending settings: 0x00000000
>> > bluetoothd[34]: src/adapter.c:adapter_start() adapter /org/bluez/hci0
>> > has been enabled
>> > bluetoothd[34]: src/adapter.c:trigger_passive_scanning()
>> > bluetoothd[34]: [signal] org.freedesktop.DBus.Properties.PropertiesChanged
>>
>> Thank you for the review, and for the suggestion! There seems to be a
>> number of similar syzbot reports for the hdev workqueue, so maybe the
>> close+open approach is a simpler solution indeed. I've originally
>> considered implementing a wrapper/helper for workqueue submission in
>> hci_core.c, but if we can eliminate the race condition altogether that'd
>> be even better.
>>
>> My only concern is that there seem to be slight differences between
>> what hci_dev_do_reset() does now and what we'd get with
>> hci_dev_close_sync()+hci_dev_open_sync() (e.g. we wouldn't use
>> HCI_CMD_DRAIN_WORKQUEUE anymore). Would this alternative approach be
>> suitable for the stable kernels? And if not, do you think the checks
>> from v1 would be appropriate, in that case?
>
> I think, the HCI_Reset is a heavy hammer because it's a global reset.
> Getting all the pieces to cooperate has been causing all sort of
> issues, and in any case I think what hci_dev_do_reset does is simply
> wrong or outdated; for instance it doesn't seem to affect LE, such as
> cancelling scanning or advertising. This is likely because interfaces
> like MGMT don't use it and require a power cycle to reset.

ACK, that makes sense. Please disregard this patch, I'll send a
follow-up with the reset changes and any missing bits from the current
async reset path.

Thank you for the pointers, Luiz!