Re: [PATCH v2 05/28] docs: kdoc_re: add a C tokenizer

From: Mauro Carvalho Chehab

Date: Tue Mar 17 2026 - 03:03:56 EST

On Mon, 16 Mar 2026 16:29:37 -0700
Randy Dunlap <rdunlap@xxxxxxxxxxxxx> wrote:

> Uh, I find this review confusing.
> Do your (Jon) comments refer to the code above them?
> (more below)

I was about to comment the same thing: it sounds that b4 review did a
big mess with your comments, as it is very hard to identify what part
of the code you're referring to.

I'll reply to your comments on a separate e-mail - at least the ones I
understand.

>
>
> On 3/16/26 4:03 PM, Jonathan Corbet wrote:
> > On Thu, 12 Mar 2026 15:54:25 +0100, Mauro Carvalho Chehab <mchehab+huawei@xxxxxxxxxx> wrote:
> >> Handling C code purely using regular expressions doesn't work well.
> >>
> >> Add a C tokenizer to help doing it the right way.
> >>
> >> The tokenizer was written using as basis the Python re documentation
> >> tokenizer example from:
> >> https://docs.python.org/3/library/re.html#writing-a-tokenizer
> >>
> >> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@xxxxxxxxxx>
> >> Message-ID: <c63ad36c81fe043e9e33ca55630414893f127413.1773074166.git.mchehab+huawei@xxxxxxxxxx>
> >> Message-ID: <8541ffa469647db1a7154f274fb2d55b4c127dcb.1773326442.git.mchehab+huawei@xxxxxxxxxx>
> >
> > This is a combined effort to review this patch and to try out "b4 review",
> > we'll see how it goes :).
> >
> >> diff --git a/tools/lib/python/kdoc/kdoc_re.py b/tools/lib/python/kdoc/kdoc_re.py
> >> index 085b89a4547c0..7bed4e9a88108 100644
> >> --- a/tools/lib/python/kdoc/kdoc_re.py
> >> +++ b/tools/lib/python/kdoc/kdoc_re.py
> >> @@ -141,6 +141,240 @@ class KernRe:
> >> [ ... skip 4 lines ... ]
> >> +
> >> + @staticmethod
> >> + def __str__(val):
> >> + """Return the name of an enum value"""
> >> + return TokType._name_by_val.get(val, f"UNKNOWN({val})")
> >> +
> >
> > What is this class supposed to do?
> >
> >> [ ... skip 27 lines ... ]
> >> + _name_by_val = {v: k for k, v in dict(vars()).items() if isinstance(v, int)}
> >> +
> >> + # Dict to convert from string to an enum-like integer value.
> >> + _name_to_val = {k: v for v, k in _name_by_val.items()}
> >> +
> >> + @staticmethod
> >
> > This stuff strikes me as a bit overdone; _name_to_val is really just the
> > variable list for the class, right?
> >
> >> [ ... skip 30 lines ... ]
> >> + f"{self.brace_level}, {self.paren_level}, {self.bracket_level})"
> >> +
> >> +#: Tokens to parse C code.
> >> +TOKEN_LIST = [
> >> + (CToken.COMMENT, r"//[^\n]*|/\*[\s\S]*?\*/"),
> >> +
> >
> > So these aren't "tokens", this is a list of regexes; how is it intended
> > to be used?
> >
> >> + (CToken.STRING, r'"(?:\\.|[^"\\])*"'),
> >> + (CToken.CHAR, r"'(?:\\.|[^'\\])'"),
> >> +
> >> + (CToken.NUMBER, r"0[xX][0-9a-fA-F]+[uUlL]*|0[0-7]+[uUlL]*|"
> >
> > How does "[\s\S]*" differ from plain old "*" ?
> >
> >> [ ... skip 15 lines ... ]
> >> + (CToken.STRUCT, r"\bstruct\b"),
> >> + (CToken.UNION, r"\bunion\b"),
> >> + (CToken.ENUM, r"\benum\b"),
> >> + (CToken.TYPEDEF, r"\bkinddef\b"),
> >> +
> >> + (CToken.NAME, r"[A-Za-z_][A-Za-z0-9_]*"),
> >
> > "-" and "!" never need to be escaped.
> >
> >> +
> >> + (CToken.SPACE, r"[\s]+"),
> >> +
> >> + (CToken.MISMATCH,r"."),
> >> +]
> >> +
> >
> > "kinddef" ?
>
> What does that refer to?
>
> >
> >> +#: Handle C continuation lines.
> >> +RE_CONT = KernRe(r"\\\n")
> >> +
> >> +RE_COMMENT_START = KernRe(r'/\*\s*')
> >> +
> >
> > Don't need the [brackets] here
>
> what brackets?
>
> >
> >> [ ... skip 6 lines ... ]
> >> +
> >> + When converted to string, it drops comments and handle public/private
> >> + values, respecting depth.
> >> + """
> >> +
> >> + # This class is inspired and follows the basic concepts of:
> >
> > That seems weird, why don't you just initialize it here?
>
> I can't tell what that comments refers to.
>
> >> [ ... skip 14 lines ... ]
> >> + source = RE_CONT.sub("", source)
> >> +
> >> + brace_level = 0
> >> + paren_level = 0
> >> + bracket_level = 0
> >> +
> >
> > Do you mean "iterator" here?
>
> Ditto.
>
> >> [ ... skip 33 lines ... ]
> >> + in this particular case, it makes sense, as we can pick the name
> >> + when matching a code via re_scanner().
> >> + """
> >> + global re_scanner
> >> +
> >> + if not re_scanner:
> >
> > Putting __init__() first is fairly standard, methinks.
> >
> >> [ ... skip 15 lines ... ]
> >> +
> >> + for tok in self.tokens:
> >> + if tok.kind == CToken.BEGIN:
> >> + show_stack.append(show_stack[-1])
> >> +
> >> + elif tok.kind == CToken.END:
> >
> > I still don't understand why you do this here - this is all constant, right?
> >
> >> + prev = show_stack[-1]
> >> + if len(show_stack) > 1:
> >> + show_stack.pop()
> >> +
> >> + if not prev and show_stack[-1]:
> >
> > So you create a nice iterator structure, then just put it all together into a
> > list anyway?
> >
>

Thanks,
Mauro