Extract data between two patterns from a huge (forced) text file

Question

I have filename.json. If I parse it in terminal with

file filename.json

output is:

filename.json: UTF-8 Unicode text, with very long lines  

wc -l filename.json    
1 filename.json

If I parse it as JSON using jq, I have to name the field I want it to print, such as id, summary, author, etc. I have thousands of JSON files that are similar in structure, but the field holding the data I want is called "summary", "description", "review", etc., depending on the file. Since there are thousands of JSON files, I don't want to check each one of them. But I know that the data I want sits between two patterns:

"title": and "url":

$ cat filename.json

gives:

{"source":"PhoneArena","author":"","title":"Apple's US Black Friday shopping event has gift cards galore for select iPhones, iPads, and more","description":"As confirmed earlier this week, a four-day Black Friday and Cyber Monday shopping event is underway, offering Apple Store gift cards with purchases of select iPhone models, three iPad variants, an assortment of Macs, the entire Apple Watch Series 3 family, as well as the HomePod, both Apple TV versions, and select Beats headphones.That ...","url":"https://www.phonearena.com/news/Apples-US-Black-Friday-shopping-event-has-gift-cards-galore-for-select-iPhones-iPads-and-more_id111287","urlToImage":"https://i-cdn.phonearena.com/images/article/111287-two_lead/Apples-US-Black-Friday-shopping-event-has-gift-cards-galore-for-select-iPhones-iPads-and-more.jpg","publishedAt":"2018-11-23T09:05:00Z","dataRefreshedTime":"2018-11-23T09:43:09Z","category":"phone_news_reviews","resource":"PhoneArena"},{"source":"PhoneArena","author":"","title":"Verizon's top Black Friday bargain is a free Moto G6, no trade-in required","description":"That made it virtually impossible for retailers like Best Buy and B&H Photo Video to outdo themselves come the actual Black Friday frenzy, but luckily, that’s what carriers are (sometimes) good for.Enter Verizon, which revealed a wide range of killer deals on popular high-end ...","url":"https://www.phonearena.com/news/Verizons-top-Black-Friday-bargain-is-a-free-Moto-G6-no-trade-in-required_id111285","urlToImage":"https://i-cdn.phonearena.com/images/article/111285-two_lead/Verizons-top-Black-Friday-bargain-is-a-free-Moto-G6-no-trade-in-required.jpg","publishedAt":"2018-11-23T07:54:00Z","dataRefreshedTime":"2018-11-23T09:43:09Z","category":"phone_news_reviews","resource":"PhoneArena"},

So I want to print everything between the patterns, but the file is a single line and the patterns appear multiple times. The only way I could think of is to print between the two patterns until the end of the file.

I tried using sed:

sed -n '^/title/,/^url/p' filename.json

but it prints blank.

I want to feed the extracted data into language analysis using machine learning techniques.

Any suggestions on other ways to print the text between the patterns? The patterns repeat multiple times, so I want the text between each pair of them printed.

The expected result is CSV or TSV output:

1 "As confirmed earlier this week, a four-day Black Friday and Cyber Monday shopping event is underway, offering Apple Store gift cards with purchases of select iPhone models, three iPad variants, an assortment of Macs, the entire Apple Watch Series 3 family, as well as the HomePod, both Apple TV versions, and select Beats headphones.That ..."

2 "That made it virtually impossible for retailers like Best Buy and B&H Photo Video to outdo themselves come the actual Black Friday frenzy, but luckily, that’s what carriers are (sometimes) good for.Enter Verizon, which revealed a wide range of killer deals on popular high-end ..."

and so on, until the end of the file.

This does _not_ look like JSON to me. Did you for some reason omit all `{}`, `[]` and `:,`? — Kusalananda, Nov 23 '18 at 09:13
That makes it *very* difficult to help you as we can't test on real data. — Kusalananda, Nov 23 '18 at 09:29
How about just testing on the above data? Because I am printing it as it is shown above, removing {}, [] and :,?. I can link to the JSON too — CCC, Nov 23 '18 at 09:34
Why not supply the JSON in the question? That would allow for a more robust solution than having to do haphazard parsing of nearly free form text. — Kusalananda, Nov 23 '18 at 09:36
Link to the file https://drive.google.com/open?id=1rR7H6rQozr6GKyix_Ekgz1yM_mIDu6ID . There are many files like this, where it is "description" or "information" or "somethingelse" — CCC, Nov 23 '18 at 09:47
Possible duplicate of [How to extract data from a JSON file](https://unix.stackexchange.com/questions/243428/how-to-extract-data-from-a-json-file) — pLumo, Nov 23 '18 at 10:02
@RoVo I am sorry but it is not a duplicate. May be you missed that the data I want to extract doesn't go by same name or same number of line. The sample in google drive has same name to the data I want but there are many JSON with different name. Hence, I pasted the sample here instead of giving the link to the file. So please don't down vote as duplicate if it sounds similar, it is not. — CCC, Nov 23 '18 at 17:18
It's your question, but it seems --to me-- to be the wrong direction to take this from structured text to unstructured text. If you have json to begin with, post *representative* examples and see if there's a solution. If not, you can always fall back to parsing unstructured text. I'll vote to reopen if you explain what happened to the other two examples in the input (by maybe showing them in the sample output). — Jeff Schaller, Nov 24 '18 at 04:27
Hi, I agree that there are better ways to parse JSON. I tried 'jq' but to get the data, I have to know what that section is called like "description","review" etc,. There are thousands of JSON and in each the data I want is called by different names. I want to automate the script regardless of what the section name is, the text has to be extracted. I saw the data I want always resides between "title" and "url". So, I just want a text between them printed. I am able to print the pattern but not the text and id of the data between two patterns. Am I making it clear or worse? — CCC, Nov 24 '18 at 04:55
Could you post the result you expect from that file in your question? — , Nov 25 '18 at 01:52
What you are posting as expected output seems to follow the `"description":` tag (not `"title":` as you asked). Which one should be used?. — , Nov 25 '18 at 23:05

score 1 · Accepted Answer · 2018-11-25T23:52:32.387

TL;DR

In ksh, bash, zsh:

sed -e $'s,"title":,\1,g' -e $'s,"url":,\2,g' -e $'s,^[^\1]*,,' -e $'s,\1\\([^\2]*\\)\2[^\1]*,\\1\\\n,g' infile

sed

One character delimiters.

The canonical solution for one-character delimiters, assuming @ and # as an example, is:

sed 's,^[^@]*,,;s,@\([^#]*\)#[^@]*,\1 ,g' infile

That will, for each line of the input file infile:

- remove every character from the start that is not a @
- extract the characters between each @ and the first # that follows it.
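
As a minimal sketch of what those two substitutions do, here is the same command run on a made-up one-line input (the string in `line` is purely illustrative and not taken from the question's data):

```shell
# Hypothetical input: junk, then two @...# spans with junk between and after them.
line='junk@first#noise@second#tail'

# 1st command: drop everything before the first @.
# 2nd command: for each @...# span keep only the inside plus one space,
#              and discard whatever follows up to the next @.
result=$(printf '%s\n' "$line" | sed 's,^[^@]*,,;s,@\([^#]*\)#[^@]*,\1 ,g')

# Brackets make the trailing space visible.
printf '[%s]\n' "$result"    # prints: [first second ]
```

Each extracted span is followed by a space, including the last one, which is why the output ends in a blank before the closing bracket.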

General delimiters.

Any other pair of delimiters can be reduced to the case above by first converting each delimiter string to a single character:

sed -e 's,"title":,@,g' -e 's,"url":,#,g' -e 's/^[^@]*//;s/@\([^#]*\)#[^@]*/\1 /g' infile

Instead of a space after each match (\1 followed by a space), in your case you can use a newline, which GNU sed lets you write simply as \1\n:

sed -e 's,"title":,@,g' -e 's,"url":,#,g' -e 's/^[^@]*//;s/@\([^#]*\)#[^@]*/\1\n/g' infile

For other (older) seds, add an explicit newline:

sed -e 's,"title":,@,g' -e 's,"url":,#,g' -e 's/^[^@]*//;s/@\([^#]*\)#[^@]*/\1\
/g' infile
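
To illustrate, here is a small sketch that runs the explicit-newline variant on a shortened, made-up input. The records and the key names "description" and "review" are assumptions chosen to mimic files where the wanted field has a different name in each record:

```shell
# Hypothetical single-line input with two records; the field between
# "title": and "url": is named differently in each of them.
line='{"author":"","title":"A","description":"x y","url":"u1","category":"c"},{"author":"","title":"B","review":"z","url":"u2"},'

# Same three steps as above: tags -> @ and #, drop the head, keep each @...# span.
# The replacement ends in backslash + real newline, which older seds also accept.
result=$(printf '%s\n' "$line" |
  sed -e 's,"title":,@,g' -e 's,"url":,#,g' \
      -e 's/^[^@]*//;s/@\([^#]*\)#[^@]*/\1\
/g')

printf '%s\n' "$result"
# prints:
# "A","description":"x y",
# "B","review":"z",
```

The command never needs to know whether the field is called "description" or "review"; it only relies on the two surrounding tags.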

If there is a risk that the delimiters used above could appear inside the file, choose other ones that are guaranteed not to exist in the file. If that seems to be a problem, the start and end delimiters could be control characters like Ctrl-A (encoded as ^A, as hex 0x01, or as octal \001). You can enter one in a shell console by typing Ctrl-V Ctrl-A. You will see a ^A in the command line:

sed -e 's,"title":,^A,g' -e 's,"url":,^B,g' -e 's,^[^^A]*,,;s,^A\([^^B]*\)^B[^^A]*,\1\n,g' infile

Or, if that is too cumbersome to type, use ANSI-C quoting (ksh, bash, zsh):

sed -e $'s,"title":,\1,g' -e $'s,"url":,\2,g' -e $'s,^[^\1]*,,' -e $'s,\1\\([^\2]*\\)\2[^\1]*,\\1\\\n,g' infile

Or, if your sed supports \o octal escapes (GNU sed does):

sed -e 's,"title":,\o001,g' -e 's,"url":,\o002,g' -e 's,^[^\o001]*,,' -e 's,\o001\([^\o002]*\)\o002[^\o001]*,\1\o012,g' infile
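
If neither $'...' quoting nor \o escapes are available, one possible workaround (a sketch, not part of the commands above) is to build the control characters with printf and interpolate them into a double-quoted sed script. The variable names `soh` and `stx` and the sample record are made up; the record deliberately contains @ and # to show why printable delimiters can be unsafe. With a single record there is nothing to separate, so the replacement is just \1 without a newline:

```shell
# Build the two control characters portably (assumed helper variables).
soh=$(printf '\001')   # Ctrl-A, stands in for "title":
stx=$(printf '\002')   # Ctrl-B, stands in for "url":

# Hypothetical record whose title itself contains @ and #.
line='{"title":"mail me @ home #1","description":"d","url":"u"},'

# Double quotes let the shell expand ${soh}/${stx}; backslashes are doubled
# so that sed still receives \( \) and \1.
result=$(printf '%s\n' "$line" |
  sed -e "s,\"title\":,${soh},g" -e "s,\"url\":,${stx},g" \
      -e "s,^[^${soh}]*,," \
      -e "s,${soh}\\([^${stx}]*\\)${stx}[^${soh}]*,\\1,g")

printf '%s\n' "$result"    # prints: "mail me @ home #1","description":"d",
```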

if delimiter is "description":

If the starting tag is actually "description": (as your example output suggests), just use it instead of "title": in the commands above.

The output of the commands above with "title": as the start tag (run on the file you linked):

"Black Friday deal: Palm companion phone is $150 off at Verizon, but there's a catch","description":"",
"LG trademarks potential names for its foldable phone, one fits a crazy concept found in patents","description":"",
"Blackview's Black Friday promo discounts the BV9500 Pro and other rugged phones on Amazon","description":"Advertorial by Blackview: the opinions expressed in this story may not reflect the positions of PhoneArena! disclaimer   amzn_assoc_tracking_id = 'phone0e0d-20';amzn_assoc_ad_mode = 'manual';amzn_assoc_ad_type ...",

If you need to number the lines, pipe the result through sed -n '=;p;g;p':

| sed -n '=;p;g;p'
1
"Black Friday deal: Palm companion phone is $150 off at Verizon, but there's a catch","description":"",

2
"LG trademarks potential names for its foldable phone, one fits a crazy concept found in patents","description":"",

3
"Blackview's Black Friday promo discounts the BV9500 Pro and other rugged phones on Amazon","description":"Advertorial by Blackview: the opinions expressed in this story may not reflect the positions of PhoneArena! disclaimer   amzn_assoc_tracking_id = 'phone0e0d-20';amzn_assoc_ad_mode = 'manual';amzn_assoc_ad_type ...",

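The two points of this section can be combined in one sketch: use "description": as the start tag, then number the result. The input below is made up and shortened. The extra `sed -e 's/,$//' -e '/^$/d'` step, which strips the trailing comma and the empty last line, is an addition for tidiness and not part of the commands above:

```shell
# Hypothetical single-line input with two records.
line='{"title":"A","description":"first text","url":"u1"},{"title":"B","description":"second text","url":"u2"},'

# Extract what lies between "description": and "url":, one match per line,
# then drop each trailing comma and the empty final line.
extracted=$(printf '%s\n' "$line" |
  sed -e 's,"description":,@,g' -e 's,"url":,#,g' \
      -e 's/^[^@]*//;s/@\([^#]*\)#[^@]*/\1\
/g' |
  sed -e 's/,$//' -e '/^$/d')

# Number the lines: line number, the line itself, then an empty line.
numbered=$(printf '%s\n' "$extracted" | sed -n '=;p;g;p')

printf '%s\n' "$numbered"
# prints:
# 1
# "first text"
#
# 2
# "second text"
```
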
AWK

Similar logic implemented in awk:

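awk -v one=$'\1' -v two=$'\2' '{
    gsub(/"title":/, one)             # start tag -> Ctrl-A
    gsub(/"url":/, two)               # end tag   -> Ctrl-B
    sub("^[^" one "]*" one, "")       # drop everything up to the first start tag
    gsub(two "[^" one "]*" one, ORS)  # end tag through next start tag -> newline
    sub(two "[^" two "]*$", "")       # drop everything after the last end tag
} 1' infile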
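
A quick way to try the awk version is to feed it a short, made-up line. In this sketch the control characters are passed as `-v one='\001'`, relying on awk's own escape processing of -v assignments instead of the shell's $'...' quoting; that substitution and the sample records are assumptions, not part of the command above:

```shell
# Hypothetical single-line input with two records.
line='{"title":"A","description":"x","url":"u1"},{"title":"B","review":"z","url":"u2"},'

result=$(printf '%s\n' "$line" |
  awk -v one='\001' -v two='\002' '{
      gsub(/"title":/, one)             # start tag -> Ctrl-A
      gsub(/"url":/, two)               # end tag   -> Ctrl-B
      sub("^[^" one "]*" one, "")       # drop the head
      gsub(two "[^" one "]*" one, ORS)  # gap between records -> newline
      sub(two "[^" two "]*$", "")       # drop the tail
  } 1')

printf '%s\n' "$result"
# prints:
# "A","description":"x",
# "B","review":"z",
```

Unlike the sed version, this one removes the tail after the last end tag explicitly, so no empty trailing line is produced.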
