ack : get the 10th (or bigger nth) matching/capturing group

Question

I think I might have just searched wrong, but I didn't find any answer. If there's a duplicate, please just let me know, and I can take this down.

Problem Background

I'm using ack, which has Perl 5 under the hood, to get n-grams, especially higher-order n-grams. I can get up to 9-grams using the syntax I know (basically up to $9), but I haven't been able to get the 10-grams. Using $10 just gives me $1 with a 0 after it. Things like $(10) and ${10} did not solve the problem. I'm NOT interested in a solution using a language-modelling toolkit; I want to use ack.

One dataset I'm using is the complete works of Mark Twain

( wget http://www.gutenberg.org/cache/epub/3200/pg3200.txt && mv pg3200.txt TWAIN_Mark_complete_orig.txt ).

I've parsed things clean (see the Parsing Note at the end of the post) and saved the parsed result as TWAIN_Mark_complete_parsed.txt.

I've been fine getting everything from 2-grams, with the code and partial results for that being

time cat TWAIN_Mark_complete_parsed.txt | \
    ack '(\S+) +(?=(\S+) +)' \
    --output '$1 $2' | \
    sort | uniq -c | \
    sort -rn > Twain_2grams.txt
## `time` info not shown
$ head -n 2 Twain_2grams.txt
  18176 of the
  13288 in the

all the way up to 9-grams, with

time cat TWAIN_Mark_complete_parsed.txt | \
    ack '(\S+) (?=(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+))' \
    --output '$1 $2 $3 $4 $5 $6 $7 $8 $9' | \
    sort | uniq -c | sort -rn > Twain_9grams.txt
## time info not shown
$ head -n 2 Twain_9grams.txt
     17 to mrs jane clemens and mrs moffett in st
     17 mrs jane clemens and mrs moffett in st louis

(N.B. I meta-program the ack commands, rather than just typing every single one.)

The Problem / What I've Tried

My first try with 10-grams, as well as the result, was

time cat TWAIN_Mark_complete_parsed.txt | \
    ack '(\S+) (?=(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+))' \
    --output '$1 $2 $3 $4 $5 $6 $7 $8 $9 $10' | \
    sort | uniq -c | sort -rn > Twain_10grams.txt

$ head -n 2 Twain_10grams.txt
     17 to mrs jane clemens and mrs moffett in st to0
     17 mrs jane clemens and mrs moffett in st louis mrs0

To better see what's happening,

cf. this SO answer (and this comment) for details of how to get that colored diff with word-by-word difference highlighting. Basically apt or yum for colordiff, then pip for diff-highlight.

Using $(10) instead of $10 gives the first two lines of output as

     17 to mrs jane clemens and mrs moffett in st $(10)
     17 mrs jane clemens and mrs moffett in st louis $(10)

(two minutes later).

Using ${10} instead of $10 gives the first two lines of output as

     17 to mrs jane clemens and mrs moffett in st ${10}
     17 mrs jane clemens and mrs moffett in st louis ${10}

That's as far as my thoughts have gone.

Expected/Desired Output

Note that there is a statistical (very non-zero and finite) possibility of the real output being different from the one shown here. The top two results for 9-grams were not distinct sequences of words. Other possible parts of a more-common 10-gram might be found by looking at the top 10 most frequent 9-grams - using head instead of head -n 2. Even so, I'm fairly certain that not even this would guarantee that we have the two most frequent 10-grams. I hope, however, that I'm making it clear enough what I'm wanting to accomplish.

~~17 to mrs jane clemens and mrs moffett in st louis~~
~~3 mrs jane clemens and mrs moffett in st louis honolulu~~

Edit: I've already found another set that changes the expected output to the following (possibly not the actual output, but different from what the simple model I used before gave).

     17 to mrs jane clemens and mrs moffett in st louis
      7 happiness in his home had been wounded and bruised almost

That would be for the head -n 2 that I've been using to show what kind of results I get.

I don't want to get it by the same process I'm going to use here.

$ grep -o "to mrs jane clemens and mrs moffett in st [^ ]\+" \
   TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
     17 to mrs jane clemens and mrs moffett in st louis

$ grep -o "mrs jane clemens and mrs moffett in st louis [^ ]\+" \
   TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
      3 mrs jane clemens and mrs moffett in st louis honolulu
      2 mrs jane clemens and mrs moffett in st louis san
      2 mrs jane clemens and mrs moffett in st louis no
      2 mrs jane clemens and mrs moffett in st louis 224
      1 mrs jane clemens and mrs moffett in st louis wash
      1 mrs jane clemens and mrs moffett in st louis wailuku
      1 mrs jane clemens and mrs moffett in st louis virginia
      1 mrs jane clemens and mrs moffett in st louis the
      1 mrs jane clemens and mrs moffett in st louis sept
      1 mrs jane clemens and mrs moffett in st louis on
      1 mrs jane clemens and mrs moffett in st louis hartford
      1 mrs jane clemens and mrs moffett in st louis carson

Edit: The code used to find the newer second-place frequency was

$ grep -o "[^ ]\+ happiness in his home had been wounded and bruised" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
      6 shelley's happiness in his home had been wounded and bruised
      1 his happiness in his home had been wounded and bruised
$ grep -o "shelley's happiness in his home had been wounded and [^ ]\+" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
      6 shelley's happiness in his home had been wounded and bruised
$ grep -o "happiness in his home had been wounded and bruised [^ ]\+" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
      7 happiness in his home had been wounded and bruised almost
$ grep -o "in his home had been wounded and bruised almost [^ ]\+" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
      7 in his home had been wounded and bruised almost to
$ grep -o "his home had been wounded and bruised almost to [^ ]\+" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
      7 his home had been wounded and bruised almost to death
$ grep -o "home had been wounded and bruised almost to death [^ ]\+" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
      1 home had been wounded and bruised almost to death thirdly
      1 home had been wounded and bruised almost to death secondly
      1 home had been wounded and bruised almost to death it
      1 home had been wounded and bruised almost to death fourthly
      1 home had been wounded and bruised almost to death first
      1 home had been wounded and bruised almost to death fifthly
      1 home had been wounded and bruised almost to death and

Edit from Comment

@Inian made a great comment:

This is documented in the release notes - github.com/beyondgrep/ack3/blob/dev/RELEASE-NOTES.md - You're now restricted to the following variables: $1 thru $9, $_, $., $&, $`, $' and $+

For future people, I'm putting a version, archived today, of the RELEASE-NOTES

The man page for ack does have the lines

$1 through $9
The subpattern from the corresponding set of capturing parentheses.
If your pattern is "(.+) and (.+)", and the string is "this and that",
then $1 is "this" and $2 is "that".

but I was hoping there was a way to get higher numbers. With the info from the RELEASE-NOTES, that hope seems mostly gone.

However, I still wonder if anyone has a work-around or hack, whether using ack or any of the more 'standard' *NIX-type terminal tools. My preference, in order, would be perl, grep, awk, sed. If there's something similar to ack (i.e. just command-line parsing, NOT an NLP-toolkit-based solution), I'm interested in that, too.

I think it might be better to pose this as a new question. If you answer here, great. If I end up posting a new question, I will put the link here: for now, this is just a link to this same question.

Parsing Note

To get my corpus ready for n-gram analysis, here was my parsing.

tr '[:upper:]' '[:lower:]' < TWAIN_Mark_complete_orig.txt | \
# upper case to lower case and avoid useless use of cat
 tr '\n' ' ' | \
# newlines into spaces, so we can later make it one line, single-spaced
 sed -E "s/[^a-z0-9 '*-]+//g" | \
# get rid of everything but letters, numbers, and a few other symbols (corpus)
 awk '{$0=$0;$1=$1}1' > TWAIN_Mark_complete_parsed.txt && \
# collapse all multiple spaces to one space (includes tabs), save to output
:

Yes, that could all be on one line (and without the trailing && :), but this makes for easier reading as well as explanation of why I'm doing what I'm doing.
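To make the effect of those four steps easier to see, here is a small sketch that wraps the same pipeline in a function and runs it on a two-line sample. The function name `clean_corpus` and the sample text are made up for illustration; the commands themselves are the ones from the note above, with the `tr` character classes quoted so the shell cannot glob them.

```shell
# Hypothetical wrapper around the same four steps, reading from stdin.
clean_corpus() {
    tr '[:upper:]' '[:lower:]' |             # upper case to lower case
        tr '\n' ' ' |                        # newlines into spaces
        sed -E "s/[^a-z0-9 '*-]+//g" |       # drop everything but the kept symbols
        awk '{$0=$0;$1=$1}1'                 # collapse runs of spaces, trim the ends
}

# Made-up sample: mixed case, punctuation, extra spaces, two lines.
printf "Hello,   World;\nIt's 5 o'clock.\n" | clean_corpus
```

On that sample the result should be the single line `hello world it's 5 o'clock`, which is the shape the n-gram patterns expect: one line, lower case, single-spaced.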

System Details

$ uname -a
CYGWIN_NT-10.0 MY_MACHINE 3.0.7(0.338/5/3) 2019-04-30 18:08 x86_64 Cygwin
$ bash --version | head -n 1
GNU bash, version 4.4.12(3)-release (x86_64-unknown-cygwin)
$ ack --version | head -n 2
ack v3.3.1 (standard build)
Running under Perl v5.26.3 at /usr/bin/perl.exe
$ systeminfo | sed -n 's/^OS\ *//p'
Name:                   Microsoft Windows 10 Enterprise
Version:                10.0.17134 N/A Build 17134
Manufacturer:           Microsoft Corporation
Configuration:          Member Workstation
Build Type:             Multiprocessor Free

This is documented in the release notes - https://github.com/beyondgrep/ack3/blob/dev/RELEASE-NOTES.md - _You're now restricted to the following variables: $1 thru $9, $_, $., $&, $` , $' and $+_ — Inian, Jun 17 '20 at 15:47
Thanks @Inian for letting me know. I hadn't thought of looking at the release notes part. Now, running `man ack | grep -n [$]` and then, as a result, `man ack | head -n 340 | tail -35`, I realize I should have looked harder. I'm going to edit the question to ask if anyone knows a hack/workaround, unless someone lets me know that it would be better to ask a new question. — bballdave025, Jun 17 '20 at 16:20
Even with the `man` stuff, it's not explicitly clear that one can't go past `$9`. Thanks for the link to the `RELEASE-NOTES`. — bballdave025, Jun 17 '20 at 16:30
Dunno abt `ack`, but in `perl` if you want capture variables you may do this: `$ten = substr($_, $-[10],$+[10]-$-[10]);` and similarly for any order. — Rakesh Sharma, Jun 18 '20 at 05:23
Could you please tell me if my answer helps or should I erase it? — , Jun 19 '20 at 06:15
Your answer is very helpful; thanks for giving the different options. — bballdave025, Jun 21 '20 at 17:03

meuh · Accepted Answer · 2020-06-17T17:45:03.073

3

Here's a possible hack, though I am not a perl expert. Looking at the all-in-one source file, it seems that ack is made to handle only a single character after $ in the output string. Changing this to accept multiple characters is no doubt feasible, but to keep the hack simple, you can extend 1..9 with a, b, c, and so on. For example, I made these changes to accept $a and $b as $10 and $11 (shown as a diff -u):

@@ -188,7 +188,7 @@
         $opt_output =~ s/\\r/\r/g;
         $opt_output =~ s/\\t/\t/g;
 
-        my @supported_special_variables = ( 1..9, qw( _ . ` & ' +  f ) );
+        my @supported_special_variables = ( 1..9, qw( a b _ . ` & ' +  f ) );
         @special_vars_used_by_opt_output = grep { $opt_output =~ /\$$_/ } @supported_special_variables;
 
         # If the $opt_output contains $&, $` or $', those vars won't be
@@ -924,6 +924,8 @@
                 # on them not changing in the process of doing the s///.
 
                 my %keep = map { ($_ => ${$_} // '') } @special_vars_used_by_opt_output;
+                $keep{a} = $10;
+                $keep{b} = $11;
                 $keep{_} = $line if exists $keep{_}; # Manually set it because $_ gets reset in a map.
                 $keep{f} = $filename if exists $keep{f};
                 my $special_vars_used_by_opt_output = join( '', @special_vars_used_by_opt_output );

If you just want to go up to the 10th match, however, you can use $+ as it shows the text matched by the last bracket of the last successful search pattern.

edited Jun 17 '20 at 17:45

answered Jun 17 '20 at 17:32

meuh


That's the kind of beautiful, hacky answer I love to see. Going into the source code is an awesome way to get things done. : ) It's also nice to be able to get at least the 10th match with the `$+`; I didn't know that trick. Thanks. – bballdave025 Jun 17 '20 at 18:33
**Not Quite Verified $+ !** Input: `time cat TWAIN_Mark_complete_parsed.txt | ack '(\S+) (?=(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+))' --output '$1 $2 $3 $4 $5 $6 $7 $8 $9 $+' | sort | uniq -c | sort -rn | head -n 2` ;;; Output: `7 in his home had been wounded and bruised almost death` \n `7 happiness in his home had been wounded and bruised to` \n\n `real 2m9.714s` \n `user 2m17.277s` \n `sys 0m1.369s` ;;; **END** It's not catching the `17 to mrs jane clemens and mrs moffett in st louis`; Maybe using letters will fix it, or maybe I misunderstood something – bballdave025 Jun 17 '20 at 20:27
**Verified `$a` ... hack** I am pretty sure this one, though I'm not sure why the `$+` doesn't work - another question, perhaps. Input: `cat TWAIN_Mark_complete_parsed.txt | ack '(\S+) (?=(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+))' --output '$1 $2 $3 $4 $5 $6 $7 $8 $9 $a' | sort | uniq -c | sort -rn | head` ;;; Output: `17 to mrs jane clemens and mrs moffett in st louis` \n `8 ---- ---- ---- ---- ---- ---- ---- ---- ---- ----` \n `7 in his home had been wounded and bruised almost to` \n `7 his home had been wounded and bruised almost to death` \n **. . .**. – bballdave025 Jun 17 '20 at 20:37
Using all of the upper-case letters and all lower-case letters except 'f', since `$f` is already set aside for something with files, I can now search for up to 60-grams (9+26+25=60). The last value of `n` for which there are repeating n-grams (in the Twain dataset) is ``. Input: `` – bballdave025 Jun 17 '20 at 20:52

score 1 · Answer 2 · 2020-06-18T07:29:36.617

Three alternate solutions:

ack version 2

It seems that in ack version 2 the variables $10 $11 etc are valid:

$ echo 'abcdefghijklmn' | 
  ack '(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)' \
  --output '$1 $2 $3 $11'

a b c k

$ ack --version
ack 2.24
Running under Perl 5.28.1 at /usr/bin/perl

To get overlapping strings, put all but the first group inside a lookahead:

echo 'abcdefghijklmn' |
    ack '(.)(?=(.)(.)(.)(.)(.)(.)(.)(.)(.)(.))' \
    --output '$1 $2 $3 $11'
a b c k
b c d l
c d e m
d e f n

Perl5

However, the same could be done in perl directly by:

echo 'abcdefghijklmn' | 
    perl -ne 'while($_ =~ /(.)(?=(.)(.)(.)(.)(.)(.)(.)(.)(.)(.))/g ){
        print $1," ",$2," ",$11," ","\n" }'
a b k
b c l
c d m
d e n

So, to find and print words (separated by one or more spaces):

echo "word1 word2 word3 word4 word5 word6" |
    perl -ne 'while($_ =~ /(\S+) +(?=(\S+) +(\S+) +(\S+))/g ){$,=" ";print $1,$2,$3,$4,"\n" }'

word1 word2 word3 word4 
word2 word3 word4 word5 
word3 word4 word5 word6

The printed lines have a trailing space (hope you don't mind).
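If the trailing space matters, or if writing out one capture group per word becomes unwieldy, a different approach (not the regex method above, just a sketch) is to let perl split each line into words and print sliding windows of `n` of them. The function name `ngrams` is hypothetical. Like the regex version, it does not build n-grams across line breaks, which is fine for a corpus stored on a single line.

```shell
# Hypothetical helper: print every n-gram of whitespace-separated words
# from stdin, one per line, with no trailing space.
# Usage: ngrams N < input
ngrams() {
    perl -lane '
        BEGIN { $n = shift @ARGV }              # window size
        for my $i (0 .. $#F - $n + 1) {         # @F holds the words of the line
            print join " ", @F[$i .. $i + $n - 1];
        }
    ' "$1"
}

echo "word1 word2 word3 word4 word5 word6" | ngrams 4
```

This should print the same three 4-grams as above, without the trailing space. For the counting step, the output can be piped into the same `sort | uniq -c | sort -rn` chain used elsewhere in this thread.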

Perl6

Or you can try Perl6 (Raku) using the :ov (overlap) modifier:

echo "one two three four five" | 
    perl6 -ne 'my @var = $_.match(/ <|w> \w+ [" "+ \w+]**2 <|w> /, :ov); say @var.join("\n") ;'

one two three
two three four
three four five

which, by changing a single number, will match other counts:

echo "one two three four five" | 
    perl6 -ne 'my @var = $_.match(/ <|w> \w+ [" "+ \w+]**3 <|w> /, :ov); say @var.join("\n") ;'

one two three four
two three four five

Results

With perl5 the result will be:

perl -ne 'while($_ =~ /(\S+) +(?=(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+))/g ){
 $,=" ";
 print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,"\n" 
}' TWAIN_Mark_complete_parsed.txt | 
    sort | 
    uniq -c | 
    sort -rn >Twain_10grams5.txt

Note that Perl6 was unable to complete (too much memory) for such a large test text (Perl6 is still too new). Using ack was considerably slower than perl5, but the output files were identical.

head -n 10 Twain_10grams5.txt
     17 to mrs jane clemens and mrs moffett in st louis 
      8 ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- 
      7 in his home had been wounded and bruised almost to 
      7 his home had been wounded and bruised almost to death 
      7 happiness in his home had been wounded and bruised almost 
      6 shelley's happiness in his home had been wounded and bruised 
      5 was by the social fireside in the time of the 
      5 thing indeed if you would like to listen to it 
      5 laughable thing indeed if you would like to listen to 
      5 it was in this way that he found out that

Very nice and complete! I will try all of those versions. I'm thinking of adding a question benchmarking these Perl/Ack methods and comparing them to any pure `grep`, `awk`, etc. — bballdave025, Jun 21 '20 at 16:57
I like the generalization to character n-grams from strings. I didn't know that Perl6 was out; that's good to know, too! — bballdave025, Jun 21 '20 at 17:00