I think I might have just searched wrong, but I didn't find any answer. If there's a duplicate, please just let me know, and I can take this down.
Problem Background
I'm using ack (link), which has Perl 5 under the hood, to get n-grams - especially higher-order n-grams. I can get up to 9-grams using the syntax I know (basically up to $9), but I haven't been able to get the 10-grams. Using $10 just gives me $1 with a 0 after it. Things like $(10) and ${10} did not solve the problem. I'm NOT interested in a solution using a language-modelling toolkit, I want to use ack.
One dataset I'm using is the complete works of Mark Twain
( wget http://www.gutenberg.org/cache/epub/3200/pg3200.txt && mv pg3200.txt TWAIN_Mark_complete_orig.txt ).
I've parsed things clean (see the Parsing Note at the end of the post) and saved the parsed result as TWAIN_Mark_complete_parsed.txt.
I've been fine getting from 2-grams, with the code and partial results for that being
time cat TWAIN_Mark_complete_parsed.txt | \
ack '(\S+) +(?=(\S+) +)' \
--output '$1 $2' | \
sort | uniq -c | \
sort -rn > Twain_2grams.txt
## `time` info not shown
$ head -n 2 Twain_2grams.txt
18176 of the
13288 in the
all the way up to 9-grams, with
time cat TWAIN_Mark_complete_parsed.txt | \
ack '(\S+) (?=(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+))' \
--output '$1 $2 $3 $4 $5 $6 $7 $8 $9' | \
sort | uniq -c | sort -rn > Twain_9grams.txt
## time info not shown
$ head -n 2 Twain_9grams.txt
17 to mrs jane clemens and mrs moffett in st
17 mrs jane clemens and mrs moffett in st louis
(N.B. I meta-program the ack commands, rather than just typing every single one.)
The Problem / What I've Tried
My first try with 10-grams, as well as the result, was
time cat TWAIN_Mark_complete_parsed.txt | \
ack '(\S+) (?=(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+) +(\S+))' \
--output '$1 $2 $3 $4 $5 $6 $7 $8 $9 $10' | \
sort | uniq -c | sort -rn > Twain_10grams.txt
$ head -n 2 Twain_10grams.txt
17 to mrs jane clemens and mrs moffett in st to0
17 mrs jane clemens and mrs moffett in st louis mrs0
To better see what's happening,
cf. this SO answer (and this comment) for details of how to get that colored diff with word-by-word difference highlighting. Basically apt or yum for colordiff, then pip for diff-highlight.
Using $(10) instead of $10 gives the first two lines of output as
17 to mrs jane clemens and mrs moffett in st $(10)
17 mrs jane clemens and mrs moffett in st louis $(10)
(two minutes later).
Using ${10} instead of $10 gives the first two lines of output as
17 to mrs jane clemens and mrs moffett in st ${10}
17 mrs jane clemens and mrs moffett in st louis ${10}
That's as far as my thoughts have gone.
Expected/Desired Output
Note that there is a statistical (very non-zero and finite) possibility of the real output being different from the one shown here. The top two results for 9-grams were not distinct sequences of words. Other possible parts of a more-common 10-gram might be found by looking at the top 10 most frequent 9-grams - using head instead of head -n 2. Even so, I'm fairly certain that not even this would guarantee that we have the two most frequent 10-grams. I hope, however, that I'm making it clear enough what I'm wanting to accomplish.
17 to mrs jane clemens and mrs moffett in st louis
3 mrs jane clemens and mrs moffett in st louis honolulu
Edit I've already found another set that changes expected output to (possibly not the actual output, but one that changes it from the simple model I used before.)
17 to mrs jane clemens and mrs moffett in st louis
7 happiness in his home had been wounded and bruised almost
That would be for the head -n 2 that I've been using to show what kind of results I get.
I don't want to get it by the same process I'm going to use here.
$ grep -o "to mrs jane clemens and mrs moffett in st [^ ]\+" \
TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
17 to mrs jane clemens and mrs moffett in st louis
$ grep -o "mrs jane clemens and mrs moffett in st louis [^ ]\+" \
TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
3 mrs jane clemens and mrs moffett in st louis honolulu
2 mrs jane clemens and mrs moffett in st louis san
2 mrs jane clemens and mrs moffett in st louis no
2 mrs jane clemens and mrs moffett in st louis 224
1 mrs jane clemens and mrs moffett in st louis wash
1 mrs jane clemens and mrs moffett in st louis wailuku
1 mrs jane clemens and mrs moffett in st louis virginia
1 mrs jane clemens and mrs moffett in st louis the
1 mrs jane clemens and mrs moffett in st louis sept
1 mrs jane clemens and mrs moffett in st louis on
1 mrs jane clemens and mrs moffett in st louis hartford
1 mrs jane clemens and mrs moffett in st louis carson
Edit The code used to find the newer second-place frequency was
$ grep -o "[^ ]\+ happiness in his home had been wounded and bruised" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
6 shelley's happiness in his home had been wounded and bruised
1 his happiness in his home had been wounded and bruised
$ grep -o "shelley's happiness in his home had been wounded and [^ ]\+" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
6 shelley's happiness in his home had been wounded and bruised
$ grep -o "happiness in his home had been wounded and bruised [^ ]\+" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
7 happiness in his home had been wounded and bruised almost
$ grep -o "in his home had been wounded and bruised almost [^ ]\+" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
7 in his home had been wounded and bruised almost to
$ grep -o "his home had been wounded and bruised almost to [^ ]\+" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
7 his home had been wounded and bruised almost to death
$ grep -o "home had been wounded and bruised almost to death [^ ]\+" TWAIN_Mark_complete_parsed.txt | sort | uniq -c | sort -rn
1 home had been wounded and bruised almost to death thirdly
1 home had been wounded and bruised almost to death secondly
1 home had been wounded and bruised almost to death it
1 home had been wounded and bruised almost to death fourthly
1 home had been wounded and bruised almost to death first
1 home had been wounded and bruised almost to death fifthly
1 home had been wounded and bruised almost to death and
Edit from Comment
@Inian made a great comment:
This is documented in the release notes - github.com/beyondgrep/ack3/blob/dev/RELEASE-NOTES.md - You're now restricted to the following variables: $1 thru $9, $, $., $&, $` , $' and $+_
For future people, I'm putting a version, archived today, of the RELEASE-NOTES
The man page for ack does have the lines
$1 through $9
The subpattern from the corresponding set of capturing parentheses.
If your pattern is "(.+) and (.+)", and the string is "this and that',
then $1 is "this" and $2 is "that".
but I was hoping there was a way to get higher numbers. With the info from the RELEASE-NOTES, that hope seems mostly gone.
However, I still wonder if anyone has a work-around or hack, whether using ack or any of the more 'standard' *NIX-type terminal tools. My preference, in order, would be perl, grep, awk, sed. If there's something similar to ack (i.e. just command-line parsing, NOT an NLP-toolkit-based solution), I'm interested in that, too.
I think it might be better to pose this as a new question. If you answer here, great. If I end up posting a new question, I will put the link here: for now, this is just a link to this same question.
Parsing Note
To get my corpus ready for n-gram analysis, here was my parsing.
tr [:upper:] [:lower:] < TWAIN_Mark_complete_orig.txt | \
# upper case to lower case and avoid useless use of cat
tr '\n' ' ' | \
# newlines into spaces, so we can later make it one line, single-spaced
sed -E "s/[^a-z0-9 '*-]+//g" | \
# get rid of everything but letters, numbers, and a few other symbols (corpus)
awk '{$0=$0;$1=$1}1' > TWAIN_Mark_complete_parsed.txt && \
# collapse all multiple spaces to one space (includes tabs), save to output
:
Yes, that could all be on one line (and without the trailing && :), but this makes for easier reading as well as explanation of why I'm doing what I'm doing.
System Details
$ uname -a
CYGWIN_NT-10.0 MY_MACHINE 3.0.7(0.338/5/3) 2019-04-30 18:08 x86_64 Cygwin
$ bash --version | head -n 1
GNU bash, version 4.4.12(3)-release (x86_64-unknown-cygwin)
$ ack --version | head -n 2
ack v3.3.1 (standard build)
Running under Perl v5.26.3 at /usr/bin/perl.exe
$ systeminfo | sed -n 's/^OS\ *//p'
Name: Microsoft Windows 10 Enterprise
Version: 10.0.17134 N/A Build 17134
Manufacturer: Microsoft Corporation
Configuration: Member Workstation
Build Type: Multiprocessor Free
