I have a big file of N-Triples/N-Quads.
I want to be left with a collection of only the subjects.
On smaller files I could have achieved this using:

awk '{print $1}' | uniq

but awk fails on files with very long lines (several MB each).

How can I achieve something similar with sed, grep -o, etc.?

EDIT:

awk fails with:

awk: program limit exceeded: maximum number of fields size=32767
    FILENAME="file.nq" FNR=308254 NR=308254
gilad hoch
    How does `awk` fail on large files? Have you tried piping the input in (`awk '{print $1}' < file`)? – Stephen Kitt Dec 21 '16 at 12:37
    @StephenKitt Yes. I accidentally wrote "large files" (it is a large file, but that's not the problem); I should have written "long lines". The question is edited. – gilad hoch Dec 21 '16 at 12:43

3 Answers

[update] Some lines have too many (blank-separated) fields. Try grep instead of awk:

grep -E -o '^[^[:space:]]+' your_input_file | uniq

I would advise against using sed for this, as it would do a lot of extra work on each line (to remove the end of the line) on a very big file. The same goes for awk: splitting each line into fields is unnecessary.
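As a small illustration of the grep approach above, here is a sketch run on made-up data. The function name `first_fields` and the example.org IRIs are assumptions for the demo, not part of the original answer.

```shell
# Hypothetical helper (not from the answer): read N-Triples/N-Quads on stdin
# and print the leading run of non-whitespace characters of each line.
first_fields() {
  grep -E -o '^[^[:space:]]+' | uniq
}

# Made-up sample statements; the first two share a subject.
printf '%s\n' \
  '<http://example.org/a> <http://example.org/p> "x" .' \
  '<http://example.org/a> <http://example.org/q> "y" .' \
  '_:b0 <http://example.org/p> "z" .' |
  first_fields
```

Note that uniq only collapses adjacent duplicates, so if identical subjects are not grouped together in the file, `sort -u` may be a better final stage.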

xhienne

With sed:

sed 's/^ *\([^ ]*\) .*$/\1/g' | uniq

This replaces each line with the first sequence of non-spaces.

A faster variant using two greps (to handle lines with leading spaces, as AWK does):

grep -o "^[[:space:]]*[^[:space:]]*" | grep -o "[^[:space:]]*" | uniq
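
To see how the two-grep variant treats indented lines, here is a minimal sketch on invented input. The wrapper name `first_fields_ws` and the sample triples are assumptions for illustration, and it relies on a grep whose `-o` prints only non-empty matches (GNU grep documents this behaviour).

```shell
# Hypothetical wrapper around the two-grep pipeline; assumes a grep that
# supports -o and prints only non-empty matches (GNU grep does).
first_fields_ws() {
  grep -o "^[[:space:]]*[^[:space:]]*" | grep -o "[^[:space:]]*" | uniq
}

# Made-up input: the first line is indented with spaces, the third with a tab.
printf '   <s1> <p> <o> .\n<s1> <p> <o2> .\n\t<s2> <p> <o> .\n' | first_fields_ws
```

The first grep keeps any leading whitespace plus the first field, and the second grep strips that whitespace again, so indented and non-indented lines yield the same subject.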
Stephen Kitt

Answering the edited question, with long lines.

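A trick you can use in this case is to have tr interchange spaces and newlines. Every original line break then becomes a space, so the first field of each line (except the very first) ends up right after a space. There are various ways to get the first record out of the first line. Your problem then becomes one of finding lines that contain a space: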

 { echo # output a newline to get the first record
   cat file
 } | tr ' \n' '\n ' |
   sed -n '/ /s/.* //p' 

Or

 tr ' \n' '\n ' < file | sed -ne '1p' -e '/ /s/.* //p'

The idea is that you change

this is a long line
and this is another

to

this
is
a
long
line and
this
is
another

so that tools with line-length limits don't have problems. If you have tab characters between the fields, then you probably want tr ' \t\n' '\n\n ' instead.
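
The transformation described above can be tried on the two sample lines. This is only a sketch: the function name `first_words` is an assumption, and it wraps the second of the two pipelines shown earlier.

```shell
# Hypothetical wrapper: swap spaces and newlines, then keep the first record
# (line 1) and, on every line that contains a space, whatever follows it.
first_words() {
  tr ' \n' '\n ' | sed -n -e '1p' -e '/ /s/.* //p'
}

# The two sample lines from the explanation above.
printf 'this is a long line\nand this is another\n' | first_words
```

After the swap, no line handed to sed holds more than two fields, which is why the field and line-length limits no longer matter. Appending `| uniq` would reproduce the deduplication step from the question.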

icarus