
I have a data-gathering device that records once per second. Sometimes it randomly drops a second or more. For example:

166,1.09424,240,76,132
168,1.10088,215,76,132
169,1.10765,213,78,131
170,1.11458,198,79,131

It can gather 1M data points in a session and might miss more than 100.

Looking around, I found this awk command

awk '{ while (NR + shift < $1) { print (NR + shift) " NA"; shift++ }; print } END { shift++; while (NR + shift < 13) { print (NR + shift) " NA"; shift++ } }' /tmp/test1

from this answer

but that inserts new lines where they aren't needed:

166 NA
167 NA
168 NA
169 NA
170 NA

What am I doing wrong?

tink
Steve Shiny

1 Answer


A few issues with your command line:

  1. Your CSV uses comma delimiters, but your awk command assumes the default space delimiter. This is the primary reason you get unnecessary fills.
  2. The END case in your command exists only to fill the end result to a total minimum of 12 lines. This should therefore be removed, or the test should be changed to a higher minimum, perhaps user-specified.
  3. Most tools used to process CSV don't like lines with varying numbers of fields. Your padding lines should therefore contain the same number of fields as normal lines.

Putting it all together, and assuming your first record starts at second 1, we have this:

awk -F, -v min_secs=2000 '
{ 
  while (NR + shift < $1) {
    print (NR + shift) ",NA,NA,NA,NA"
    shift++ 
  }
  print
}
END {
  shift++
  while (NR + shift <= min_secs) {
    print (NR + shift) ",NA,NA,NA,NA"
    shift++
  }
}
' /tmp/test1
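As a quick sanity check (not part of the original answer), here is the same script run on a reduced, hypothetical sample that starts at second 1 with second 3 missing; `min_secs` is set to 5 so the tail is padded only through second 5. Note that `shift` is bumped once at the start of `END` so the last real second isn't printed a second time.

```shell
# Hypothetical 3-row sample: seconds 1, 2, 4 (second 3 is missing)
printf '%s\n' \
  '1,0.10,10,50,100' \
  '2,0.20,11,51,101' \
  '4,0.40,12,52,102' > /tmp/gapdemo.csv

awk -F, -v min_secs=5 '
{
  # fill any gap before this row with NA padding lines
  while (NR + shift < $1) {
    print (NR + shift) ",NA,NA,NA,NA"
    shift++
  }
  print
}
END {
  shift++   # move past the last printed second before padding the tail
  while (NR + shift <= min_secs) {
    print (NR + shift) ",NA,NA,NA,NA"
    shift++
  }
}
' /tmp/gapdemo.csv
```

which prints:

1,0.10,10,50,100
2,0.20,11,51,101
3,NA,NA,NA,NA
4,0.40,12,52,102
5,NA,NA,NA,NA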
Adrian
  • I tried to make it into one line `awk -F, -v min_secs=2000 ' {while (NR + shift < $1) {print (NR + shift) ",NA,NA,NA,NA" shift++}print} END {while (NR + shift <= min_secs) {print (NR + shift) ",NA,NA,NA,NA" shift++}} ' wrong.txt > fixed.txt` and then ran it, however it didn't add the new line: `166,1.09424,240,76,132 168,1.10088,215,76,132 169,1.10765,213,78,131 170,1.11458,198,79,131`, but it also added a bunch at the end. As a test I used a small sample file that stops at 300 lines: `300,2.08307,331,58,111 301,2.08754,361,60,112 301,NA,NA,NA,NA1 302,NA,NA,NA,NA2` – Steve Shiny Feb 16 '19 at 02:57
  • Oh that doesn't format very well - you can see I'm a noob! – Steve Shiny Feb 16 '19 at 02:58
  • @SteveShiny If you're squeezing it all into a one-liner (which I don't recommend for clarity reasons), you have to put the semicolons between the `print` and `shift++` statements back in. That's why you're seeing increasing serial numbers at the end of each dummy line: `print` thinks `shift++` is another argument to it, instead of an entirely separate statement. – Adrian Feb 16 '19 at 03:02
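A minimal illustration of that statement-separator pitfall, using a stripped-down hypothetical `print` (not from the thread): without the semicolon, awk string-concatenates the post-increment value of `shift` (initially 0) onto the printed line.

```shell
# Missing semicolon: shift++ becomes part of print's expression,
# and its (pre-increment) value 0 is concatenated onto the line.
echo | awk '{ print "a,NA" shift++ }'     # prints "a,NA0"

# With the semicolon, the two statements are separate.
echo | awk '{ print "a,NA"; shift++ }'    # prints "a,NA"
```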
  • I'm doing something wrong. I copied and pasted the command, and got the same result. I'm on Mac if that makes a difference? – Steve Shiny Feb 16 '19 at 03:39
  • Actually I partially take that back... it found one missing point and added a line for that, but not another: `223,1.59792,373,68,110 224,1.60427,373,66,109 225,NA,NA,NA,NA 226,1.61923,373,63,110 227,1.62166,373,60,110 165,1.08769,223,77,131 166,1.09424,240,76,132 168,1.10088,215,76,132 169,1.10765,213,78,131` grrr how to do new lines? – Steve Shiny Feb 16 '19 at 03:43
  • @SteveShiny Your `awk` script assumes a _monotonically increasing_ serial field, i.e. `n(row+1)` must equal `n(row) + 1` for all rows. This is a standard expectation for almost all data sets, but the sample you just posted does _not_ meet that constraint. If this is a case that you need to be handled, edit your question to specify _exactly_ what should happen. A set of "before" and "after" samples, highlighting **all** the possible data quality issues and showing the sanitized final result, would be best. – Adrian Feb 16 '19 at 03:53
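The monotonicity constraint described above can be seen with a tiny hypothetical two-row sample: when a serial number goes backwards, the gap-filling loop never fires (the condition `NR + shift < $1` is already false), so the row is printed unchanged and the output ends up out of order.

```shell
# Second row's serial (3) is lower than the running counter,
# so it is passed through as-is rather than resynchronized.
printf '5,x\n3,y\n' |
  awk -F, '{ while (NR + shift < $1) { print (NR + shift) ",NA"; shift++ }; print }'
# Output: 1,NA 2,NA 3,NA 4,NA 5,x 3,y  (one value per line)
```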
  • OK, I'll start a new question that's better formed – Steve Shiny Feb 16 '19 at 05:18
  • https://unix.stackexchange.com/questions/500998/add-lines-if-they-are-missing-from-a-data-series – Steve Shiny Feb 16 '19 at 05:41