
I have a file in which some data points are missing; the missing values are shown as ****. I need to select the rows that have 7 consecutive columns with values less than 10. When I run my script, it also returns rows that have **** in consecutive columns.

I could solve this easily by replacing every **** with a higher value, but I don't want to change my input file. I want my script to treat **** as a number greater than 10 (for example, **** = 100). How can I do that?

Sample input, consecutive7pointDown10.input:

2     3    4    5    6    7    8   0  12   14   23
2     3    4    12   6    7    8   0  1     2   23
**** **** **** **** **** **** **** 8 ****  **** 12

My script's result, consecutive7pointDown10.output:

2     3    4    5    6    7    8    0    12    14   23
**** **** **** **** **** **** ****  8   ****  ****  12

But the expected output is:

2     3    4    5    6    7    8    0    12  14   23

My script, consecutive7pointDown10, is as follows:

#!/bin/bash
########################################################################################################################
# This script prints rows that have at most 10°C at 7 or more consecutive points.
# input = scriptname.input
# output = scriptname.output
########################################################################################################################
input=`basename "$0"`.input
output=`basename "$0"`.output
awk '{
    for(i=4;i<=34-6;i++)
        {   
            if($i<=10 && $(i+1)<=10 && $(i+2)<=10 && $(i+3)<=10 && $(i+4)<=10 && $(i+5)<=10 && $(i+6)<=10)
            {
                print
                next
            }
        }
}' $input > $output
αғsнιη
alhelal
  • how about completely skipping lines having `****`? or is there a case where such lines can still have 7 consecutive columns less than 10 and you want that in output? – Sundeep Oct 23 '17 at 15:38
  • you could temporarily change the input line (this won't change the input file) by using `gsub(/\*{4}/,100)` before the for-loop – Sundeep Oct 23 '17 at 15:42
  • @Sundeep yes, I am finding this type. I will try this. I will notify you. – alhelal Oct 23 '17 at 15:58
  • @Sundeep doesn't give right answer. [script image](https://imgur.com/a/HlCbu) [output image](https://imgur.com/a/Cyd75) – alhelal Oct 23 '17 at 16:01

2 Answers


You can use awk as follows. Rather than spelling out the check for all 7 consecutive columns, keep a counter that is incremented each time a column meets the condition and reset to 0 as soon as one does not.

awk '{c=0; n=split($0,arr,/ +/);
    # loop by index: for(x in arr) does not guarantee element order
    for(x=1;x<=n;x++) if(arr[x]<10 && arr[x]>=0) {
        if(++c==7){ print $0; next } }else{c=0} }' infile

Here we use awk's split function, «split(string, array [, fieldsep [, seps ] ])», to split the line ($0 represents the whole line in awk) into the array named arr, using one or more spaces as the separator.

Next, loop over the array elements and check whether each value is at least 0 and less than 10. If so, increment the counter c, and print the line once c reaches 7 (meaning 7 consecutive elements (columns) meet the condition); otherwise reset the counter to 0.
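The `>=0` half of that test is what keeps `****` out of the count. A minimal sketch of why, assuming a POSIX-style awk: a field that does not look like a number is compared as a string, and `*` sorts before the digits, so `****` passes `<10` but fails `>=0`. `LC_ALL=C` is set here only to make the string comparison bytewise; the three-field input line is made up for illustration.

```shell
# Print, for each field, whether it passes each half of the test.
# "****" is compared as a string: it counts as "< 10" but not as ">= 0".
result=$(printf '%s\n' '**** 8 12' | LC_ALL=C awk '{
    for (i = 1; i <= NF; i++)
        printf "%s lt10=%d ge0=%d\n", $i, ($i < 10), ($i >= 0)
}')
printf '%s\n' "$result"
# **** lt10=1 ge0=0
# 8 lt10=1 ge0=1
# 12 lt10=0 ge0=1
```

This also suggests why a test with only an upper bound, such as `$i<=10`, lets rows full of `****` through.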


Or do the same without splitting the line into an array:

awk '{c=0; for(i=1;i<=NF;i++) if($i<10 && $i>=0) {
    if(++c==7){ print $0; next } }else{c=0} }' infile

In your case, since you want to start filtering from column 4 through to the end, you would need something like the following. NF is the number of fields (columns) in each line processed by awk.

$ time awk '{c=0; for(i=4;i<=NF;i++) if($i<10 && $i>=0) {
    if(++c==7) {print $0; next} }else{c=0} }' infile
real    0m0.317s
user    0m0.156s
sys     0m0.172s
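If the literal request in the question is preferred, that is, treating `****` as a large number without touching the input file, one possible sketch is a small awk function that does the mapping at comparison time. The helper name `val` and the value 100 are illustrative assumptions, and the loop starts at field 1 only because the short sample has 11 columns (the real file would start at column 4). Unlike the commands above, this version has no lower bound, so negative values would count as "less than 10".

```shell
# Sample rows from the question, held in a variable so no file is needed.
sample='2     3    4    5    6    7    8   0  12   14   23
2     3    4    12   6    7    8   0  1     2   23
**** **** **** **** **** **** **** 8 ****  **** 12'

# val() is a hypothetical helper: it maps the missing-data marker to 100
# and forces every other field to a number. The input itself is unchanged.
result=$(printf '%s\n' "$sample" | awk '
function val(s) { return (s == "****") ? 100 : s + 0 }
{
    c = 0                      # length of the current run of values < 10
    for (i = 1; i <= NF; i++) {
        if (val($i) < 10) {
            if (++c == 7) { print; next }
        } else {
            c = 0
        }
    }
}')
printf '%s\n' "$result"
# 2     3    4    5    6    7    8   0  12   14   23
```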

Or use a regex. Applied to your original file, which contains only floating-point numbers, the grep command below is more efficient and about 6 times faster than awk when used with the -P flag (see Grep -E, Sed -E - low performance when '[x]{1,9999}' is used, but why?). The awk solution is more flexible, though: you can change the ranges, and it works for integers, floats, or a mix of both.

$ time grep -P '([^\d]\d\.\d[^\d]){7}' infile
real    0m0.060s
user    0m0.016s
sys     0m0.031s

Or in another way:

$ time grep -P '(\s+\d\.\d\s+){7}' infile
real    0m0.057s
user    0m0.000s
sys     0m0.031s

Or, for a pattern that works in grep, sed and awk alike:

$ time grep -E '([^0-9][0-9]\.[0-9][^0-9]){7}' infile
real    0m0.419s
user    0m0.375s
sys     0m0.063s
$ time sed -En '/([^0-9][0-9]\.[0-9][^0-9]){7}/p' infile
real    0m0.367s
user    0m0.172s
sys     0m0.203s
$ time awk '/([^0-9][0-9]\.[0-9][^0-9]){7}/' infile
real    0m0.361s
user    0m0.219s
sys     0m0.172s
αғsнιη
  • Let us [continue this discussion in chat](http://chat.stackexchange.com/rooms/67580/discussion-between-s-and-minimax). – αғsнιη Oct 24 '17 at 12:49
  • I think your answer should be this `awk '{ for(i=4;i<=34;i++) if($i<10 && $i>=0){if(++c==7) print $0;next}else{c=0}}' infile` instead of `awk '{ for(i=4;i<=34;i++) if($i<10 && $i>=0){if(++c==7) print $0}else{c=0}}' infile` – alhelal Oct 24 '17 at 16:26
  • and `.......;next;c=0`. Am I right? – alhelal Oct 24 '17 at 16:44
  • Nop, in this case `c=0` will never run. – αғsнιη Oct 24 '17 at 16:54
  • I add `c=0` in starting for each row, otherwise it increments previous `c's` value when `c<7` in previous row. `awk '{for(i=4;i<=NF;i++)if($i<10 && $i>=0){if(++c==7) {print $0;c=0;next}}else{c=0}}' infile` result one row, but it is not true for this [data](https://paste.ubuntu.com/25810903/) – alhelal Oct 24 '17 at 17:23
  • @alhelal, please check my updates; it now works even for your [other shared data sample](https://paste.ubuntu.com/25810903/). – αғsнιη Oct 24 '17 at 18:51
  • You didn't explain the `time`? – alhelal Oct 25 '17 at 06:48
  • That's used for [getting time cost once output of a command resulted](https://unix.stackexchange.com/q/10745/72456). – αғsнιη Oct 25 '17 at 07:34
  • why [this script](https://paste.ubuntu.com/25822785/)(although about similar as you) results nothing for [this data](https://raw.githubusercontent.com/al2helal/ThesisWork/master/consecutive7pointDown10.input)?. There may be a silly error that was not found by me. – alhelal Oct 26 '17 at 11:26
  • Hi, it's not same, in yours the `else` statement is for inner `if` while it must be for outer `if` statement. see mine above. – αғsнιη Oct 26 '17 at 13:49
awk '/(\<[0-9]\s+){7}/{print}' input.txt

or

sed -rn '/(\b[0-9]\s{1,}){7}/p' input.txt

will do the job.

Explanation of the awk command (the same logic applies to the sed one):

  • /(\<[0-9]\s+){7}/{print} - print lines containing the pattern.

  • \< - Matches a word boundary; that is it matches if the character to the right is a “word” character and the character to the left is a “non-word” character.

  • [0-9]\s+ - one digit from 0 to 9, then one or more spaces.
  • (\<[0-9]\s+){7} - matches, if the \<[0-9]\s+ pattern is repeated seven times.

Input

2     3    4    5    6    7    8   0  12   14   23
2     3    4    12   6    7    8   0  1     2   23
**** **** **** **** **** **** **** 8 ****  **** 12

Output

2     3    4    5    6    7    8   0  12   14   23
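`\<` and `\s` are GNU extensions, so the commands above may not work in every awk or sed. A rough equivalent using only POSIX ERE syntax, offered as a sketch rather than a drop-in replacement, puts `(^|[[:space:]])` in place of the word boundary and `[[:space:]]` in place of `\s`. Like the original pattern, it expects whitespace after the seventh digit.

```shell
# Sample rows from the question, held in a variable so no file is needed.
sample='2     3    4    5    6    7    8   0  12   14   23
2     3    4    12   6    7    8   0  1     2   23
**** **** **** **** **** **** **** 8 ****  **** 12'

# Seven single digits in a row, each followed by whitespace; the first one
# must sit at the start of the line or right after whitespace.
result=$(printf '%s\n' "$sample" |
    LC_ALL=C grep -E '(^|[[:space:]])([0-9][[:space:]]+){7}')
printf '%s\n' "$result"
# 2     3    4    5    6    7    8   0  12   14   23
```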

EDIT:

For floating-point numbers with one digit after the decimal point (9.2, 8.1, 7.5, etc.):

awk '/(\<[0-9]\.[0-9](\s+|$)){7}/{print}' input.txt
MiniMax
  • Clever way! But it assumes the values are always single digits, and the OP's [original file](https://raw.githubusercontent.com/al2helal/ThesisWork/master/consecutive7pointDown10.input) doesn't guarantee that, so this doesn't answer his/her question. – αғsнιη Oct 23 '17 at 20:46
  • @αғsнιη How it is not guaranteed? OP said: "I need to select rows having consecutive 7 columns with value less than 10." If the number is less than 10, then one digit is guaranteed. May be negative numbers will be the problem, like `-7`, `-5`. – MiniMax Oct 23 '17 at 21:02
  • @αғsнιη Oh, file. I look at it, now. Then, yes, pattern should be corrected. But it is easy, I think. I thought, only integers (whole numbers) are possible. – MiniMax Oct 23 '17 at 21:05
  • @MiniMax then write for floating. – alhelal Oct 24 '17 at 03:49
  • @alhelal See EDIT section. – MiniMax Oct 24 '17 at 11:08