
I can fetch a URL and search its page for text that starts with `file:`, but I'm having trouble parsing it from there.

Example:

wget -qO- http://website.com/site/ | tr \" \\n | grep -w file:\* > output.txt

The `wget` command gives me the output:

file: 'http://website.com/site/myStream/playlist.m3u8?wmsAuthSign=c2VydmVyXs',

I'm trying to get the output to look like:

http://website.com/site/myStream/playlist.m3u8?wmsAuthSign=c2VydmVyXs

My goal is a bash script that loops through several source URLs (a list), with each processed/grep'd output URL on its own line:

http://website.com/site/myStream/playlist.m3u8?wmsAuthSign=c2VydmVyXs
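The loop described above can be sketched like this. It is a minimal sketch: `urls.txt` and the helper name `extract_file_url` are my assumptions, and the extraction is demonstrated on the literal sample from the question so the snippet runs as-is without network access.

```shell
#!/bin/sh
# Hypothetical helper: pull the URL out of a line shaped like
#   file: 'http://...',
# The quoting/comma layout is assumed from the sample in the question.
extract_file_url() {
  sed -n "s/^[[:space:]]*file:[[:space:]]*'\(.*\)',*[[:space:]]*$/\1/p"
}

# In the real script you would loop over a list of source URLs, e.g.:
#   while read -r url; do
#     wget -qO- "$url" | extract_file_url
#   done < urls.txt

# Demonstration on the sample the site returns:
extract_file_url <<'EOF'
player.setup({
  file: 'http://website.com/site/myStream/playlist.m3u8?wmsAuthSign=c2VydmVyXs',
  width: "100%",
  aspectratio: "16:9",
});
EOF
# prints: http://website.com/site/myStream/playlist.m3u8?wmsAuthSign=c2VydmVyXs
```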

As requested:

Here's an example of what `wget -qO- http://website.com/site/` sends back.

player.setup({
  file: 'http://website.com/site/myStream/playlist.m3u8?wmsAuthSign=c2VydmVyXs',
  width: "100%",
  aspectratio: "16:9",

});
Rick T
  • If you want to parse html, I recommend to use something [made for parsing html](https://unix.stackexchange.com/questions/6389/how-to-parse-hundred-html-source-code-files-in-shell) instead of `grep`. Depending on the actual HTML files you want to parse, you may get away with `grep`, but there'll be plenty of variants your regular expression won't catch. – dirkt Aug 25 '19 at 07:20
  • 2
    can you show an example of the ACTUAL output of the `wget` command **before** any processing with tr or grep? – cas Aug 25 '19 at 07:22
  • @cas ok I updated the question to include just what the `wget -qO- http://website.com/site/` with no processing outputs. – Rick T Aug 25 '19 at 07:58
  • ok, so it's not returning HTML. looks like it's returning a function call with embedded json. `lynx -dump` won't work for that at all. – cas Aug 25 '19 at 08:01

1 Answer


This will do what you want:

wget -qO- http://website.com/site/ | \
  sed -n -e "/^ *file: */ { s/^ *file: *'//; s/', *$//p}" > output.txt
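If GNU grep is available, a lookbehind with `grep -oP` is a shorter alternative; this is a sketch under that assumption (the `-P` flag is GNU-specific, and the sample line is taken from the question):

```shell
# Extract everything after "file: '" up to the closing quote.
# Requires GNU grep built with PCRE support (-P).
printf "%s\n" "  file: 'http://website.com/site/myStream/playlist.m3u8?wmsAuthSign=c2VydmVyXs'," \
  | grep -oP "(?<=file: ')[^']*"
# prints: http://website.com/site/myStream/playlist.m3u8?wmsAuthSign=c2VydmVyXs
```

In the real pipeline the `printf` would be replaced by `wget -qO- http://website.com/site/`.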
cas
  • I installed it unfortunately it creates a blank file... the line `wget -qO- http://website.com/site/ | tr \" \\n | grep -w file:\* > output.txt` works it just doesn't fully parse it the way I need it. – Rick T Aug 25 '19 at 07:33
  • then please add a sample of the raw output of `wget` to your question. Without that, only rough guesses are possible. – cas Aug 25 '19 at 07:35
  • It's highlighted in the question under `Example`? is it not showing up in the question? – Rick T Aug 25 '19 at 07:36
  • also, does the `lynx` command above **without** the `grep` show a list of links? does your `http://website.com/site/` require authentication? if so, you might need to use the `-auth=ID:PASSWD` option. or visit the site manually with lynx using the `--cookie_file` option and then use the same cookie file with the `lynx -dump ...` later. – cas Aug 25 '19 at 07:38
  • @RickT no, it's not highlighted under example. the only thing that's there is the wget command. I'm asking for a sample of the **OUTPUT** of the wget command, not the wget command itself. – cas Aug 25 '19 at 07:41
  • It was poor wording on my part. the output is `file: 'http://website.com/site/myStream/playlist.m3u8?wmsAuthSign=c2VydmVyXs',` I've also labelled it as such in the question. – Rick T Aug 25 '19 at 07:45
  • if the output of `wget` is large, then pick a small, **relevant** section of the output to use as the sample. 5 or 10 lines will probably be enough, depending on how fugly the HTML is. – cas Aug 25 '19 at 07:45
  • again, no. That is not the **raw** output - that's the output after being piped through `tr` and `grep`. – cas Aug 25 '19 at 07:45
  • Thanks that did it!! – Rick T Aug 25 '19 at 08:07
  • 1
    yeah, it works at the moment. It's fragile, though - any change in the data returned by the web site could and probably will break it. e.g. just removing all linefeeds from the data will a) probably be valid for the official/expected client software, but b) break the above until you edit the sed script to cope with the new situation. – cas Aug 25 '19 at 08:11
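One way to be less dependent on line structure, as the comment above suggests, is to split the response on the single quotes first and then keep only the tokens that look like URLs. A minimal sketch (the one-line input below simulates a response with its linefeeds stripped; the URL shape is an assumption):

```shell
# Split on single quotes, then keep quoted tokens that start with http.
# Works even if the whole player.setup(...) call arrives on one line.
printf "%s" "player.setup({file: 'http://website.com/site/a.m3u8?x=y',width:\"100%\"});" \
  | tr "'" '\n' | grep '^http'
# prints: http://website.com/site/a.m3u8?x=y
```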