Parse a text file using shell script

Question

I am stuck with this activity. I have a text file like the one below:

0112 00000 34 JOB RECOVERY status poll (ORDERID 2N000, RUNNO 0001) ACCEPTED, OWNER
0112 00000 35 JOB RECOVERY status poll (ORDERID 2N000, RUNNO 0001)STARTED , APPL TYPE
0112 00000 36 JOB PROCESS Kafka(ORDERID 2N001, RUNNO 0001) ACCEPTED , OWNER
0112 00001 37 JOB PROCESS Kafka (ORDERID 2N001, RUNNO 0001) STARTED, APPL_TYPE
0112 00001 38 JOB RECOVERY  status poll(ORDERID 2N000, RUNNO 0001) ENDED OK ,ELAPSED - 0.02 SEC
0112 00003 39 JOB PROCESS (ORDERID 2N001, RUNNO 0001) ENDED OK, ELAPSED - 2.28 SEC

I need to get the ELAPSED value for each orderid and job. For example, if the orderid is 2N000, then the elapsed value I should get is -0.02 sec. I need to get this for each orderid in the file using a shell script.

I need the output to look like this:

orderid    jobname           ELAPSED
2N000      RECOVERY status   0.02
2NOO1      PROCESS  Kafka   2.28

What have you tried so far and where are you stuck? Is the text really this haphazardly formatted, with random spaces inserted around commas and other spaces seemingly missing in front of parentheses? — Kusalananda, Jan 14 '22 at 06:19
yes, I have tried to get the orderid ,jobname and elapsed time using awk command ...awk '/elapsed/' {print $5,$14,$7). but the problem is the job , we can't always take the same column number like 5 , since the jobname is having space in between also — user510083, Jan 14 '22 at 06:25
1. Why only two lines of results for six lines of data? If you're coalescing please explain how. If you're filtering please explain why — roaima, Jan 14 '22 at 07:44
2. Why double space in `PROCESS Kafka` output when there is a single space in every similar source line? — roaima, Jan 14 '22 at 07:45
3. You say that for order id `2N000` you want an elapsed time of `-0.02`, but you don't show that in the example output — roaima, Jan 14 '22 at 07:47
4. I see that someone has tried to fix your formatting. Did you really want the output as originally written, `orderid jobname ELAPSED 2N000 RECOVERY status 0.02 2NOO1 PROCESS Kafka 2.28`? — roaima, Jan 14 '22 at 07:50
I will be having single or double spaces in between jobname , that is dynamic .in the six lines of data I need only orderid , jobname and elapsedtime @roaima — user510083, Jan 14 '22 at 08:16
"_i want the ouput like this shown above_", in the comment or in the question? Both are "above". — roaima, Jan 14 '22 at 08:40
Please [edit your question](https://unix.stackexchange.com/posts/686294/edit) to provide clarifications. I want to be able to delete my comments once your question addresses the issues I've raised. Do not put your responses in the comments as they can be lost, deleted, or simply not seen by people wanting to help you — roaima, Jan 14 '22 at 08:41
Judging from the way you treat punctuation in your comments here (spaces before and after dots and commas, sometimes), I'd say you have manually modified the data that you present in the question (it shows the same hallmark use of spacing around punctuation characters). How can we assume that the data will be uniformly formatted (and therefore easily parsable) when you show data that has been so obviously manipulated? — Kusalananda, Jan 14 '22 at 19:58
If by "using shell script" you mean "only using shell builtins and without using mandatory POSIX text processing tools like sed, awk, etc." then you should read [why-is-using-a-shell-loop-to-process-text-considered-bad-practice](https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice). — Ed Morton, Jan 15 '22 at 12:00

score 0 · Answer 1 · answered Jan 14 '22 at 10:26

This sed script should do what you are looking for:

sed '/ELAPSED/!d;s/.*JOB \([^(]*\)(ORDERID \([^,]*\).*- \([0-9.]*\).*/\2 \1 \3/'

You may need to adapt it to your actual data, so here is what it does:

/ELAPSED/!d deletes all lines that don't (!) include ELAPSED, because the ELAPSED lines include all information you need. If that string could appear elsewhere, you need to adapt the script accordingly
The following substitute command contains a complex regular expression that is supposed to identify the correct parts you want to extract from the line:
- .*JOB matches everything up to the JOB keyword and the space after it. Again, if JOB can also occur inside the jobname, you'll need additional criteria, but I can't know that from your sample
- [^(]* matches everything before the opening ( of the orderid. This part is surrounded by \( and \) so we can place it in the replacement with \1. Please note that you will get the full job name RECOVERY status poll, not leaving out the poll like your output does!
- (ORDERID matches what it says, so the next part will be the orderid
- [^,]* matches everything before the next comma. This is again surrounded by \( and \), so it can be referred to as \2
- .*- matches everything up to and including the last dash and the following space. This is hopefully eating up everything before the elapsed time
- [0-9.]* is a run of digits and dots. This should fit the elapsed time and is substring number \3
- .* matches the rest of the line, probably only SEC
The replacement string \2 \1 \3 puts the three elements in the desired order, separated by a single space. Adapt this as needed.
If you want the column headers in the first row, add them yourself.
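
As a rough sketch of how the command could be put to use, the snippet below wraps the same sed expression in a small shell function and prints a header line first. The function name `elapsed_report` is made up for illustration, the header text is borrowed from the question, and the `tr -s ' '` step is an addition (not part of the sed script) that squeezes the repeated spaces left over from the haphazard input.

```shell
#!/bin/sh
# Hypothetical wrapper around the sed command above.
# Reads the named files, or standard input when no file is given.
elapsed_report() {
    # Header line (wording taken from the question)
    printf '%s\n' 'orderid jobname ELAPSED'
    # Keep only ELAPSED lines and reorder them as: orderid, jobname, elapsed.
    # tr -s is an extra step that collapses runs of spaces into one.
    sed '/ELAPSED/!d;s/.*JOB \([^(]*\)(ORDERID \([^,]*\).*- \([0-9.]*\).*/\2 \1 \3/' "$@" |
        tr -s ' '
}

# Usage with two lines from the question's sample data
elapsed_report <<'EOF'
0112 00001 37 JOB PROCESS Kafka (ORDERID 2N001, RUNNO 0001) STARTED, APPL_TYPE
0112 00001 38 JOB RECOVERY  status poll(ORDERID 2N000, RUNNO 0001) ENDED OK ,ELAPSED - 0.02 SEC
EOF
```

Note that the job name is taken from the ELAPSED line itself, so a line such as the last one in the sample, where only PROCESS appears before the parenthesis, yields PROCESS rather than PROCESS Kafka.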

I'm trying hard to believe you, but I can't. `sed` is an essential part even of the smallest embedded linux systems, mandatory even for buildroot. — Philippos, Jan 14 '22 at 10:33
@user510083 if sed is not installed on your system then you're not on a POSIX-compliant Unix system since sed is one of the mandatory POSIX tools that MUST exist on such systems. Given that we'd need more information about the tools you DO have than just tagging your question with `linux` and `text-processing`. — Ed Morton, Jan 15 '22 at 11:58

score 0 · Answer 2 · answered Jan 14 '22 at 10:55

I also had overlooked "using shell script". So I tried with awk:

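# Requires GNU awk: the three-argument match() is a gawk extension.
BEGIN {
  OFS = "\t"                       # tab-separated output
  print "orderid", "jobname", "Elapsed"
}
/ ACCEPTED/ {
   # Remember the job name for each order id
   if (match($0, /^.... ..... .. ... ([A-Za-z ]*).*ORDERID (.....)/, A)) {
     sub(/ +$/, "", A[1])          # drop trailing spaces from the job name
     O[A[2]] = A[1]
   }
}
/ELAPSED/ {
   # Print order id, remembered job name and elapsed time
   if (match($0, /ORDERID (.....).*ELAPSED - (.*) SEC$/, A))
     print A[1], O[A[1]], A[2]
}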

This returns tab-separated output:

orderid jobname Elapsed
2N000   RECOVERY status poll    0.02
2N001   PROCESS Kafka   2.28
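
The three-argument form of match() is a GNU awk extension, so the script above will not run under other awk implementations. Where only a POSIX awk is available, a similar result can probably be had with index(), substr() and sub(). The following is a sketch under assumptions, not the script from this answer: it assumes the job name always sits between `JOB ` and `(ORDERID`, it remembers the name from the first line seen for each order id (rather than from the ACCEPTED line specifically), and the function name `elapsed_table` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical POSIX-awk variant; reads the named files or standard input.
elapsed_table() {
    awk '
    BEGIN { OFS = "\t"; print "orderid", "jobname", "Elapsed" }
    {
        j = index($0, "JOB ")          # start of "JOB "
        o = index($0, "(ORDERID ")     # start of "(ORDERID "
        if (j == 0 || o == 0 || o < j) next

        # Job name: the text between "JOB " and "(ORDERID"
        name = substr($0, j + 4, o - j - 4)
        gsub(/^ +| +$/, "", name)      # trim leading and trailing spaces
        gsub(/ +/, " ", name)          # squeeze repeated spaces

        # Order id: the text after "(ORDERID " up to the next comma or ")"
        id = substr($0, o + 9)
        sub(/[,)].*/, "", id)

        # Keep the first job name seen for each order id
        if (!(id in names)) names[id] = name

        if ($0 ~ /ELAPSED - /) {
            t = $0
            sub(/.*ELAPSED - /, "", t) # drop everything before the value
            sub(/ .*/, "", t)          # drop " SEC" and anything after it
            print id, names[id], t
        }
    }' "$@"
}

# Usage with three lines from the sample data
elapsed_table <<'EOF'
0112 00000 36 JOB PROCESS Kafka(ORDERID 2N001, RUNNO 0001) ACCEPTED , OWNER
0112 00001 38 JOB RECOVERY  status poll(ORDERID 2N000, RUNNO 0001) ENDED OK ,ELAPSED - 0.02 SEC
0112 00003 39 JOB PROCESS (ORDERID 2N001, RUNNO 0001) ENDED OK, ELAPSED - 2.28 SEC
EOF
```

Remembering the first name seen per order id is what lets the last sample line, which only shows PROCESS, still be reported as PROCESS Kafka.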
