Author Topic: Regex word order question (issue?)  (Read 4541 times)

JimmieC

  • Senior Community Member
  • Posts: 490
  • Hero Points: 17
Regex word order question (issue?)
« on: March 23, 2016, 05:54:08 PM »
I have C++ classes that have the following methods:
CHost::SetAlarmMode
CHost::SetAlarm
class CAlarm { public: void SetAlarm(bool b_AlarmOn) { bAstat = b_AlarmOn; } ... }

I am searching the project with Perl regex, Match case, Match whole word.

EDIT: "List matching lines once only" is also checked.

When I search the project with "SetAlarm|SetAlarmMode" (without quotes), I only get results for SetAlarm.

When I search the project with "SetAlarmMode|SetAlarm" (without quotes), I get results for both SetAlarmMode & SetAlarm.

This is probably an inherent rule in regex that I don't know about. But it seems that the order of the OR'd search words can affect the results.

SlickEdit Pro 2015 (v20.0.0.12 64-bit)
The same occurs on SE Pro 2011 (that I am running on my XP-32 bit VM)

Regards,
Jim
« Last Edit: March 23, 2016, 05:55:50 PM by JimmieC »

Clark

  • SlickEdit Team Member
  • Senior Community Member
  • Posts: 6864
  • Hero Points: 528
Re: Regex word order question (issue?)
« Reply #1 on: March 24, 2016, 01:27:22 AM »
Definitely a limitation of regular expressions. Order does matter.

JimmieC

  • Senior Community Member
  • Posts: 490
  • Hero Points: 17
Re: Regex word order question (issue?)
« Reply #2 on: March 24, 2016, 02:54:09 PM »
That kind of puts a damper on my trust of regex search, if searching for "whole-word-1 OR whole-word-2" yields different results than "whole-word-2 OR whole-word-1".

Can someone explain why this is the case in regex? At least if I understand it, I can format my searches accordingly. I'm not saying it's bad or wrong, but I must certainly be missing out on search results because I don't get it.

Jim

Marcel

  • Senior Community Member
  • Posts: 261
  • Hero Points: 26
Re: Regex word order question (issue?)
« Reply #3 on: March 24, 2016, 03:10:35 PM »
The regex engine will compare simultaneously all alternatives from left to right, character by character, and will stop once an alternative matches. Because both of your alternatives have a common stem, the engine will stop at "SetAlarm". In your case, specifying a word boundary, such as "\bSetAlarm\b|\bSetAlarmMode\b" would have helped.
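
A minimal sketch of the word-boundary suggestion above, using Python's `re` module as a stand-in. This is an assumption: Python follows the same ordered-alternation rules as Perl for this pattern, and SlickEdit's own engine is not being run here. With `\b` on both sides of each alternative, either ordering should find the full identifier:

```python
import re

text = "CHost::SetAlarmMode(); CHost::SetAlarm();"

# Each alternative carries its own word boundaries, so "SetAlarm" cannot
# match just the first eight characters of "SetAlarmMode".
short_first = re.findall(r"\bSetAlarm\b|\bSetAlarmMode\b", text)
long_first = re.findall(r"\bSetAlarmMode\b|\bSetAlarm\b", text)

print(short_first)
print(long_first)
```

Both lists come out the same because the first alternative fails at its trailing `\b` inside SetAlarmMode, which lets the engine fall through to the second alternative.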


JimmieC

  • Senior Community Member
  • Posts: 490
  • Hero Points: 17
Re: Regex word order question (issue?)
« Reply #4 on: March 24, 2016, 03:21:53 PM »
Hi Marcel,
Thanks for the clarification.

As far as the "\b" word boundary, isn't that covered behind-the-scenes because I checked the "Match whole word" checkbox?

I have many other SetAlarmxxxx... variables and functions in this firmware. That is why I use the "whole-word" checkbox. Additionally, I checked "Match case" but I think that is beside the point.

Jim

Clark

  • SlickEdit Team Member
  • Senior Community Member
  • Posts: 6864
  • Hero Points: 528
Re: Regex word order question (issue?)
« Reply #5 on: March 24, 2016, 03:27:51 PM »
Word match is handled differently. Not the same at all. It has to support non-regex search.

mwb1100

  • Senior Community Member
  • Posts: 156
  • Hero Points: 13
Re: Regex word order question (issue?)
« Reply #6 on: March 24, 2016, 04:38:47 PM »
Quote from: Marcel
The regex engine will compare simultaneously all alternatives from left to right, character by character, and will stop once an alternative matches. Because both of your alternatives have a common stem, the engine will stop at "SetAlarm".

I don't understand why that behavior would make "SetAlarm|SetAlarmMode" match differently than "SetAlarmMode|SetAlarm".

Clark

  • SlickEdit Team Member
  • Senior Community Member
  • Posts: 6864
  • Hero Points: 528
Re: Regex word order question (issue?)
« Reply #7 on: March 24, 2016, 08:28:30 PM »
It doesn't compare simultaneously. It compares them in order and terminates at the first match. All regex engines MUST do this in order to match correctly. As for word matching (the check box), it really depends on implementation.
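
The in-order behaviour described above can be seen with a small sketch in Python's `re` module (assumed here to behave like Perl for this pattern; this is not SlickEdit's engine). At a given start position the alternatives are tried left to right, and the first one that succeeds is taken:

```python
import re

line = "CHost::SetAlarmMode"

# Shorter alternative first: it succeeds on the common stem, so the
# longer alternative is never tried at this position.
short_first = re.search(r"SetAlarm|SetAlarmMode", line).group()
print(short_first)  # SetAlarm

# Longer alternative first: it succeeds, so the whole identifier matches.
long_first = re.search(r"SetAlarmMode|SetAlarm", line).group()
print(long_first)  # SetAlarmMode
```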

b

  • Senior Community Member
  • Posts: 325
  • Hero Points: 26
Re: Regex word order question (issue?)
« Reply #8 on: March 24, 2016, 08:52:30 PM »
A better Perl RE would be: \bSetAlarm(Mode)?\b
This covers the common word-bounded prefix, with Mode being optional.

However, other tools with PCRE support do handle SetAlarm|SetAlarmMode (and the reverse), so I am puzzled why SE would fail, as REs are usually greedy and would find both. Even a Perl one-liner shows the results (whether word-bounded or not):

perl -ne 'print if /SetAlarm|SetAlarmMode/' test.txt

where test.txt was created with:
cat >test.txt <<EOF
foo
bar
baz
SetAlarm
but
SetAlarmMode
Set
SetAlarmFoo
buck
noSetAlarm
EOF
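
For comparison, a rough Python sketch over the same test data (Python's `re` is assumed to follow Perl's ordered alternation here). Filtering whole lines, as the one-liner does, gives the same lines for either order; a difference only shows up in the text each match actually covers. The last pattern is the \bSetAlarm(Mode)?\b suggestion. The helper names `matching_lines` and `matched_text` are made up for this illustration.

```python
import re

lines = ["foo", "bar", "baz", "SetAlarm", "but", "SetAlarmMode",
         "Set", "SetAlarmFoo", "buck", "noSetAlarm"]

def matching_lines(pattern):
    # Line filter, like `print if /.../`: keep any line containing a match.
    return [ln for ln in lines if re.search(pattern, ln)]

def matched_text(pattern):
    # The text the first match on each line covers (what an editor
    # would highlight), skipping lines with no match.
    found = []
    for ln in lines:
        m = re.search(pattern, ln)
        if m:
            found.append(m.group())
    return found

same_lines = (matching_lines("SetAlarm|SetAlarmMode")
              == matching_lines("SetAlarmMode|SetAlarm"))
short_first = matched_text("SetAlarm|SetAlarmMode")
long_first = matched_text("SetAlarmMode|SetAlarm")
whole_words = matched_text(r"\bSetAlarm(Mode)?\b")

print(same_lines, short_first, long_first, whole_words, sep="\n")
```

So a line-printing test cannot distinguish the two orderings; only the matched span differs, and that span is what a separate whole-word check would be looking at.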


Clark

  • SlickEdit Team Member
  • Senior Community Member
  • Posts: 6864
  • Hero Points: 528
Re: Regex word order question (issue?)
« Reply #9 on: March 24, 2016, 09:36:10 PM »
If SlickEdit special cased this by changing the users regex, then order wouldn't matter. Definitely possible. There are some word matching features not supported by the dialog (only available in macro code) which couldn't be done this way but I doubt anyone would care.
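
To illustrate the idea of rewriting the user's regex, here is a hedged sketch in Python. Both helper functions are hypothetical and are not SlickEdit's implementation: `search_then_check` assumes whole-word matching is a separate check applied to whatever the regex matched, while `search_rewritten` wraps the pattern as `\b(?:...)\b` so the boundaries become part of the regex itself.

```python
import re

def is_word_char(c):
    return c.isalnum() or c == "_"

def search_then_check(pattern, text):
    # Hypothetical model: run the regex as typed, then keep a match only
    # if it is not flanked by word characters.
    hits = []
    for m in re.finditer(pattern, text):
        before = text[m.start() - 1] if m.start() > 0 else " "
        after = text[m.end()] if m.end() < len(text) else " "
        if not is_word_char(before) and not is_word_char(after):
            hits.append(m.group())
    return hits

def search_rewritten(pattern, text):
    # Hypothetical rewrite: the boundaries are inside the regex, so the
    # engine moves on to the next alternative when one fails at \b.
    wrapped = r"\b(?:" + pattern + r")\b"
    return [m.group() for m in re.finditer(wrapped, text)]

text = "CHost::SetAlarmMode CHost::SetAlarm"

print(search_then_check("SetAlarm|SetAlarmMode", text))
print(search_then_check("SetAlarmMode|SetAlarm", text))
print(search_rewritten("SetAlarm|SetAlarmMode", text))
print(search_rewritten("SetAlarmMode|SetAlarm", text))
```

Under these assumptions the first model is order-sensitive, in the same way as the results reported at the top of the thread, while the rewritten pattern gives the same hits for either order.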

Marcel

  • Senior Community Member
  • Posts: 261
  • Hero Points: 26
Re: Regex word order question (issue?)
« Reply #10 on: March 24, 2016, 10:21:37 PM »
I think the OP was perplexed by the fact that the search for "SetAlarm|SetAlarmMode" would highlight only the "SetAlarm" part of CHost::SetAlarmMode, and not the whole of "SetAlarmMode".

Most regex NFA engines (Perl, PCRE, ...) perform ordered alternation, left to right, with the first match winning. The engine won't try to find a longer match (i.e. alternation isn't greedy). This may result in less matching text than expected. Ordered alternation is very powerful but can also be confusing the first time around.

Mastering Regular Expressions V2 pages 174+ (Is Alternation Greedy?) explains the algorithm. It also features a chapter on "Ordered alternation pitfalls", covering the problem discussed here.
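
One way to sidestep the ordered-alternation pitfall when building a search from a list of identifiers is to put the longer alternatives first and add explicit boundaries. A small Python sketch; `build_alternation` is a hypothetical helper name, not a SlickEdit feature:

```python
import re

def build_alternation(words):
    # Longest first, so no word is shadowed by a shorter word that is
    # its prefix; re.escape guards against regex metacharacters.
    ordered = sorted(words, key=len, reverse=True)
    return r"\b(?:" + "|".join(re.escape(w) for w in ordered) + r")\b"

pattern = build_alternation(["SetAlarm", "SetAlarmMode"])
found = re.findall(pattern, "CHost::SetAlarmMode CHost::SetAlarm")

print(pattern)
print(found)
```

With the `\b` wrapper in place the sort is not strictly needed for this pair; it matters mainly if the boundaries are dropped.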

JimmieC

  • Senior Community Member
  • Posts: 490
  • Hero Points: 17
Re: Regex word order question (issue?)
« Reply #11 on: March 25, 2016, 06:13:03 PM »
Thanks for all the points here.

My assumption that "Match whole word" would produce a modified regex was incorrect. Also, great test case and comments on greedy vs. non-greedy searches. I will refrain from relying on "Match whole word" and build the correct regex to enforce whole word.

Regards,
Jim