How do I find whether a column of a CSV contains another using mlr's DSL?

In other words, I have a CSV:

a,b
test and,test and more

and want to find out whether 'test and' (column a) is contained in 'test and more' (column b).

E Lisse

1 Answer

Note: I have edited my reply, using the great comment of @Kusalananda


If you have

a,b
test*and,test*and more
lorem,ipsum
whether,Finding whether a string

you can run

mlr --csv put 'if($b != ssub($b,$a,"")){$test=1}else{$test=0}' input.csv

to get

a,b,test
test*and,test*and more,1
lorem,ipsum,0
whether,Finding whether a string,1

I'm using the ssub function, which replaces a substring literally (no regex, so no characters are special), to try to remove the string in a from b.

The condition $b != ssub($b,$a,"") is true when that removal changes b, which means a is contained in b.
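
To make the difference between literal and regex matching concrete, here is a minimal sketch that recreates the sample file and computes the flag twice: once with sub (which treats a as a regular expression) and once with ssub. The column names regex_hit and literal_hit are made up for illustration, and the sketch assumes a Miller 6 install in which sub accepts its pattern from a field value. On the test*and row the two columns should disagree, because test*and read as a regex does not match the literal text "test*and".

```shell
# Recreate the sample input used in this answer.
cat > input.csv <<'EOF'
a,b
test*and,test*and more
lorem,ipsum
whether,Finding whether a string
EOF

# regex_hit:   $a is interpreted as a regular expression (sub).
# literal_hit: $a is interpreted as plain text (ssub).
# Both column names are hypothetical, chosen only for this example.
out=$(mlr --csv put '
  $regex_hit   = ($b != sub($b, $a, ""))  ? 1 : 0;
  $literal_hit = ($b != ssub($b, $a, "")) ? 1 : 0;
' input.csv)

printf '%s\n' "$out"
```

The ternary form is just a compact alternative to the if/else block above; it stays inside put, so it can be combined with other assignments in the same expression.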

If you simply want to filter the records, you can run

mlr --csv filter '$b != ssub($b,$a,"")' input.csv

to get

a,b
test*and,test*and more
whether,Finding whether a string

Thank you @Kusalananda

aborruso
  • I found the cause, but not the solution. Having an asterisk '*' in a field messes up the regex. Any ideas for that? – E Lisse Jan 14 '23 at 20:06
  • I do not understand. Does it solve your example exactly? If not, could you explain why not? – aborruso Jan 14 '23 at 20:07
  • Could you add it in your example in the question? – aborruso Jan 14 '23 at 20:09
  • The issue is that `$a` must be interpreted as a string, not as a regular expression. For example, `.*` is included in the string `It matches ".*"`, but not in `It does not match`. – Kusalananda Jan 14 '23 at 20:41
  • @ELisse I have edited my reply – aborruso Jan 14 '23 at 21:54
  • @aborruso You have now made a workaround for a specific case of regular expression, but not for others. It would be better to avoid using `$a` as a regular expression _at all_, as in `if ($b == ssub($b,$a,""))`. Also consider using `filter` instead of `put`, as the question requires filtering records, not adding or modifying fields: `mlr --csv filter '$b != ssub($b,$a,"")'` – Kusalananda Jan 15 '23 at 06:15
  • @Kusalananda as usual, you are great. I have added it to my reply, but the merit is not mine – aborruso Jan 15 '23 at 08:46
  • @ELisse please try the new way, and let us know – aborruso Jan 15 '23 at 09:01
  • Brilliant! I can't use the `filter` solution in my particular case, as I need to concatenate `$a . " " . $b . " " . $c` with an eventual `clean-whitespace`. Unless you can propose something equally elegant, of course :-)-O – E Lisse Jan 15 '23 at 13:59
  • @ELisse If you want to clarify your question (by adding restrictions or further information about the way you need to handle your data), please do so by editing the question. This answer resolves your query exactly as it is currently posed. – Kusalananda Jan 15 '23 at 15:00