Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

regexp does not set variables #310

Open
The-Markitecht opened this issue Sep 20, 2024 · 6 comments
Open

regexp does not set variables #310

The-Markitecht opened this issue Sep 20, 2024 · 6 comments

Comments

@The-Markitecht
Copy link
Contributor

Welcome to Jim version 0.82
. set re [string map {{ } {}} { ^ \s* (\S+) \s+ (\w+) \s+ (\d+ \. \d+ \. \d+ \. \d+ / \d+)+ \s* }]  ;# delete spaces used for legibility
^\s*(\S+)\s+(\w+)\s+(\d+\.\d+\.\d+\.\d+/\d+)+\s*
. set raw {wlan0            UP             192.168.60.8/24}
wlan0            UP             192.168.60.8/24
. regexp -all -inline $re $raw  ;# matches as expected
{wlan0            UP             192.168.60.8/24} wlan0 UP 192.168.60.8/24
. set devNm  ;# should not exist at this point
can't read "devNm": no such variable
[error] . puts [regexp -all $re $raw whole devNm devState devIPs]  ;# matches as expected (1)
1
. set devNm  ;# should be set!
can't read "devNm": no such variable
[error] . set whole    ;# should be set!
can't read "whole": no such variable
[error] . set devState    ;# should be set!
can't read "devState": no such variable
[error] . set devIPs    ;# should be set!
can't read "devIPs": no such variable

@The-Markitecht
Copy link
Contributor Author

The-Markitecht commented Sep 20, 2024

Removing -all avoids this issue, but might not provide the behavior any given application needs. Seems like an issue.
Even if Jim is not patched for that in the future, the docs should explicitly mention that -all is completely incompatible with match variables (or is it just expressions anchored with ^ ? no). And that command should throw an error.

@The-Markitecht
Copy link
Contributor Author

The-Markitecht commented Sep 20, 2024

Also, repeated submatches are not captured, even with -all

. set raw {wlan0            UP             192.168.60.8/24 1.2.3.4/6}
wlan0            UP             192.168.60.8/24 1.2.3.4/6
. set re [string map {{ } {}} { ^ \s* (\S+) \s+ (\w+) ( \s+ \d+ \. \d+ \. \d+ \. \d+ / \d+ )+ }]
^\s*(\S+)\s+(\w+)(\s+\d+\.\d+\.\d+\.\d+/\d+)+
. puts [regexp -all -inline $re $raw]
{wlan0            UP             192.168.60.8/24 1.2.3.4/6} wlan0 UP { 1.2.3.4/6}
. puts [regexp -inline $re $raw]
{wlan0            UP             192.168.60.8/24 1.2.3.4/6} wlan0 UP { 1.2.3.4/6}

Note how the whole match includes the repetition, but the capture group doesn't.

EDIT: while -all docs specify "last match only", it also does that regardless of use of -all. See the final example. So -all is not a factor.

EDIT: tclsh8.6 behaves the same as jimsh (final match only). That makes sense if -all is supposed to only apply to the entire expression, not capture groups. But that's a weird design. Capture groups in other regex engines (such as C# from M$ not Mono) can capture multiple matches.

EDIT: Per https://regex101.com/ the only other engine that captures repeats and reports them individually is C#. I guess my memory failed me there (I have actually used that scenario in C# before). But, if + or * quantifiers on the capture group are not supported, i.e. matched but not reported, then why accept them being in the expression at all? That's misleading. The engine rejects (with a descriptive error) other nonsensical expressions such as unbalanced brackets, but not this one. With a quantifier present, shouldn't it reject capturing groups and only accept non-capturing groups? But Jim's built-in engine is documented to not support any non-capturing groups at all. So I guess we're left with no way at all to individually report multiple matches from a capture group. Jim's docs are explicit "last match only" when describing -all as it relates to the entire expression as a whole. But that doesn't address the behavior of capture groups at all, either with or without -all (I expected to report a list of that group's matches in the capture variable, at least with -all). Nor does it mention that the same limits apply when using -inline (they do). In fact it mentions "match variables" there, implying the opposite: that those limits don't apply when using -inline. I know this is splitting hairs, but splitting hairs is what regex is for right? :-)

Is this a distinct issue separate from #310 ?

@msteveb
Copy link
Owner

msteveb commented Sep 22, 2024

I can understand why you might want this, but the general rule for Jim Tcl is to follow the behaviour of Tcl (all else being equal) so I wouldn't be inclined to change this until/unless Tcl does the same.

@msteveb msteveb closed this as completed Sep 22, 2024
@The-Markitecht
Copy link
Contributor Author

The-Markitecht commented Oct 1, 2024

Please reopen this. I know the thread text became confusing as I kept researching the issue. But given your reply, that still leaves 3 unresolved issues here:

  1. tclsh does not behave the same as jimsh for the original test case (-all). tclsh sets the variables. I re-tested that just now. Yes, the test cases are confusing. But the only jimsh behavior that does match tclsh is: when jimsh does manage to set some match or submatch variables (because I avoided -all), they are last-match-only, just like tclsh. But the -all case still does not match tclsh. tclsh does set the variables, jimsh doesn't. Its match variables are totally, entirely, completely incompatible with -all.

  2. The docs should state explicitly that match variables are totally, entirely, completely incompatible with -all. They are not set in any case having -all. Docs do not say or imply that currently. In fact they imply the opposite, which is misleading.

  3. jimsh knows it's never going to set those variables. So, does our error checking also have to be as bad as tclsh? At a minimum, jimsh should throw an error when -all was given together with one or more match variables.

I believe those fixes fit neatly within the existing doctrine of jimsh. Discussion?

I also was going to include these further points below, but actually these below are obviated by not supporting non-capturing groups at all. Capturing groups are the only kind of groups in jimsh; there is no other. So they must support quantifiers greater than 1 for any regex to be viable. So ignore these points below:

3-b) Personally I also think it would be wise to go further, throwing an error on any capturing groups having + or any other quantifier that could result in more than one match, since those are worse than useless, they're actively misleading. Zero matches or one match are OK (as long as I avoided -all) such as ? or {1,1} or {0,1}. But that fix would deviate from tclsh because I doubt it detects that misleading case either. So let's put that fix aside (3-b) and focus on (1) (2) (3) and (4). I believe their fixes fit neatly within the existing doctrine of jimsh. ~~

  1. Even if jimsh doesn't throw an error on (3-b), the docs can, and should. They should explicitly state that using submatch variables makes quantifiers greater than 1 on those capturing groups completely useless and misleading. That will save people several hours of testing each month or year as they discover and rediscover that issue.

EDIT: Jim actually does support non-capturing groups since 0.73. So (3-b) and (4) are actually relevant. But functionality in dcea42c below does adequately address those. The "last match only" behavior is documented. But I'd still like to see it more obvious there, like "Quantifiers greater than one are useful when applied to non-capturing parentheses, but can be misleading when applied to capturing parentheses, since some occurrences of the group won't be captured. For best legibility, avoid that misleading pattern unless you actually need it, which is rare." I'd put that in the description of capturing groups.

msteveb added a commit that referenced this issue Oct 1, 2024
To match the documentation and Tcl, if match vars are
given with -all, the last match is assigned.

Fixes #310

Signed-off-by: Steve Bennett <[email protected]>
@msteveb
Copy link
Owner

msteveb commented Oct 1, 2024

OK, I missed that incompatibility. Happy to be compatible with Tcl on this (and match the docs)

. regexp -all a(b(c)d|e(f)g)h abcdhaefgh a b c d
2
. puts $a,$b,$c,$d
aefgh,efg,,f

See dcea42c

@msteveb msteveb reopened this Oct 1, 2024
@The-Markitecht
Copy link
Contributor Author

Tested and works. Even with -indices. Thanks!

-all is cool because it actually provides the choice of returning the first match (by default) or the last one (-all). That can be useful when parsing the output of a shell command such as ping or ps. Or when consuming a series of matches backward (from last to first) in a loop, truncating as we go, or replacing each match as we go. I can see that would be especially handy when replacing, because that should be done last-to-first (due to the length changing) for speed.

On that note, thanks also for regsub -command. That was a noticeable omission from both interps for many years.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants