
I have a process that generates output mostly in lexicographically sorted order according to a (timestamp) field, but occasionally the lines will be output in the wrong order:

2014-08-14 15:42:02.019220203 ok
2014-08-14 15:42:03.523164367 ok
2014-08-14 15:42:04.525655832 ok
2014-08-14 15:42:06.523324269 ok
2014-08-14 15:42:05.930966407 oops
2014-08-14 15:42:07.643347946 ok
2014-08-14 15:42:07.567283110 oops

How can I identify each location where the data are "unsorted"?

Expected output (or similar):

2014-08-14 15:42:05.930966407 oops
2014-08-14 15:42:07.567283110 oops

I need a solution that works as the data are generated (e.g. in a pipeline); it's less useful if it only operates on complete files. `sort --check` would be ideal, but it only reports the first point of disorder; I need a full listing.

ecatmur

2 Answers

awk 'NR>1 && $0"" < last; {last=$0}'

Prints the lines that sort before the preceding line. The `$0""` is to force lexical comparison (on the output of `seq 10` it would spot `10` as sorting before `9`).
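As a quick sanity check (a sketch, using the question's sample lines and `seq` as stand-in input):

```shell
# Feed the question's out-of-order sample through the one-liner;
# each line that compares lexically less than its predecessor is printed.
printf '%s\n' \
  '2014-08-14 15:42:06.523324269 ok' \
  '2014-08-14 15:42:05.930966407 oops' \
  '2014-08-14 15:42:07.643347946 ok' \
  '2014-08-14 15:42:07.567283110 oops' |
awk 'NR>1 && $0"" < last; {last=$0}'
# prints the two "oops" lines

# Why the "": on seq output both operands look numeric, so awk compares
# numerically and 10 < 9 is false; appending "" forces a string comparison.
seq 10 | awk 'NR>1 && $0 < last; {last=$0}'     # prints nothing
seq 10 | awk 'NR>1 && $0"" < last; {last=$0}'   # prints 10
```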

Stéphane Chazelas

I think that shell string comparisons should respect lexicographical order (according to the current locale, of course), so perhaps you could do something like

#!/bin/bash

last=""
while IFS= read -r line; do
  [[ "$line" < "$last" ]] && printf '%s\n' "$line"
  last="$line"
done < <(your process)
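For a quick test of the comparison logic (a sketch; this needs bash for `[[ ]]` and process substitution, and `seq 10` stands in for your process):

```shell
# Sketch: the same loop driven by `seq 10` instead of the real process.
last=""
while IFS= read -r line; do
  # `<` inside [[ ]] compares strings lexicographically
  [[ "$line" < "$last" ]] && printf '%s\n' "$line"
  last="$line"
done < <(seq 10)
# prints 10 ("10" sorts before "9" as a string)
```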
steeldriver
    On the output of `seq 1000000`, I find it's about 50 times as slow as the `gawk` equivalent (12x for `ksh93`, 20x for `zsh` and `mksh`, and `mawk` is twice as fast as `gawk`). IMO at least, using `while read` loops to process text is bad practice. At least here, you didn't fall into the usual pitfalls though. – Stéphane Chazelas Aug 14 '14 at 15:22