
curl //website// will get me the source code, but from there, how would I filter out every unique path and obtain the number of them?

The question:

Use cURL from your machine to obtain the source code of the "https://www.inlanefreight.com" website and filter all unique paths of that domain. Submit the number of these paths as the answer.

From the question I do not know the meaning of "UNIQUE PATHS", but I think it means something similar to what you get from executing wget -p.
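For reference, the call I have in mind would be something like the following; as far as I understand, -p only downloads the assets a page needs to render, it does not list paths:

    # --page-requisites: fetch the page plus the inlined images, CSS,
    # and scripts it references; this downloads files rather than listing paths
    wget -p https://www.inlanefreight.com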


ctrl-alt-delor
YnSkn

5 Answers

This is what I came up with:

 curl https://www.inlanefreight.com/ | grep -Po 'https://www.inlanefreight.com/\K[^"\x27]+' | sort -u  | wc -l

I don't know if it's intended to be solved using regex, though.
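In case the pattern is unclear, here is the same pipeline split up, one stage per line; the flags and escapes are standard GNU grep -P (PCRE) features:

    curl -s https://www.inlanefreight.com/ |                 # -s hides the progress meter
      grep -Po 'https://www.inlanefreight.com/\K[^"\x27]+' | # -o prints only matches; \K drops the prefix; \x27 is '
      sort -u |                                              # de-duplicate
      wc -l                                                  # count the remaining paths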

AdminBee
mr mojo

I used this method, and somehow it worked:

$ wget --spider --recursive https://www.inlanefreight.com

This will show

Found 10 broken links.

https://www.inlanefreight.com/wp-content/themes/ben_theme/fonts/glyphicons-halflings-regular.svg
https://www.inlanefreight.com/wp-content/themes/ben_theme/fonts/glyphicons-halflings-regular.eot
https://www.inlanefreight.com/wp-content/themes/ben_theme/images/testimonial-back.jpg
https://www.inlanefreight.com/wp-content/themes/ben_theme/css/grabbing.png
https://www.inlanefreight.com/wp-content/themes/ben_theme/fonts/glyphicons-halflings-regular.woff
https://www.inlanefreight.com/wp-content/themes/ben_theme/fonts/glyphicons-halflings-regular.woff2
https://www.inlanefreight.com/wp-content/themes/ben_theme/images/subscriber-back.jpg
https://www.inlanefreight.com/wp-content/themes/ben_theme/fonts/glyphicons-halflings-regular.eot?
https://www.inlanefreight.com/wp-content/themes/ben_theme/images/fun-back.jpg
https://www.inlanefreight.com/wp-content/themes/ben_theme/fonts/glyphicons-halflings-regular.ttf

FINISHED --2020-12-06 05:34:58--
Total wall clock time: 2.5s
Downloaded: 23 files, 794K in 0.1s (5.36 MB/s)

at the bottom. Now, assuming the 23 downloaded files and the 10 broken links together make up the unique paths, I got 33, and it was the correct answer.
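If you would rather not add the two numbers up by hand, the log of the same spider run can be counted directly. A minimal sketch (wget logs to stderr by default, hence the 2>&1; the exact count may differ slightly depending on how your wget formats its log lines):

    # crawl without downloading, then pull every URL of the domain out of
    # the log, de-duplicate, and count; broken links appear in the log too
    wget --spider --recursive https://www.inlanefreight.com 2>&1 |
      grep -oE 'https://www\.inlanefreight\.com[^ ]*' |
      sort -u |
      wc -l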

YnSkn

Using only cURL and the filtering tools grep, tr, sort, cut, and wc, plus one additional tool, uniq: my first result was incorrect (34 instead of the correct 33), and at first I was not sure which path was duplicated. :(

curl https://www.inlanefreight.com --insecure > ilf

cat ilf | grep "https://www.inlanefreight.com" > ilf.1

cat ilf.1 | tr " " "\n" | sort | grep "inlanefreight.com" | cut -d'"' -f2 | sort | cut -d"'" -f2 | sort | uniq -c > ilf.2

cat ilf.2 | wc -l

$> 34

I suspect this is the source of the duplication (run cat ilf.2 and look at these lines):

<snip>
1 https://www.inlanefreight.com/index.php/wp-json/oembed/1.0/embed?url=https%3A%2F%2Fwww.inlanefreight.com%2F
1 https://www.inlanefreight.com/index.php/wp-json/oembed/1.0/embed?url=https%3A%2F%2Fwww.inlanefreight.com%2F&#038;format=xml
<snip>

To fix this, cut on "?":

cat ilf.1 | tr " " "\n" | sort | grep "inlanefreight.com" | cut -d'"' -f2 | sort | cut -d"'" -f2 | sort | cut -d"?" -f1 | uniq -c | wc -l
$> 33

The correct answer is 33.
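In hindsight, the duplicated path could also have been found directly instead of by eyeballing ilf.2: uniq -d prints only the lines that occur more than once, so after the same normalisation it names the culprit on its own. A sketch reusing the ilf.1 file from above:

    # same normalisation as before, but uniq -d keeps only repeated lines,
    # i.e. the path that collapses into a duplicate once "?..." is cut off
    cat ilf.1 | tr " " "\n" | grep "inlanefreight.com" |
      cut -d'"' -f2 | cut -d"'" -f2 | cut -d"?" -f1 |
      sort | uniq -d

This prints the index.php/wp-json/oembed/1.0/embed path that appeared twice in the snippet above.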

Happy Monk

This is an answer based only on what you've learned in that module:

curl https://www.inlanefreight.com > htb.txt && cat htb.txt | tr " " "\n" | cut -d"'" -f2 | cut -d'"' -f2 | grep "www.inlanefreight.com" | sort -u | wc -l
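The same one-liner, spread over several lines so each stage is visible (the behaviour is unchanged):

    curl -s https://www.inlanefreight.com > htb.txt   # -s just hides the progress meter
    cat htb.txt |
      tr " " "\n" |                    # one token per line
      cut -d"'" -f2 |                  # unwrap single-quoted attribute values
      cut -d'"' -f2 |                  # unwrap double-quoted attribute values
      grep "www.inlanefreight.com" |   # keep only links to this domain
      sort -u |                        # unique paths only
      wc -l                            # count them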
emmiller

TL;DR: you can't.

From the wget manpage:

“-p This option causes Wget to download all the files that are necessary to properly display a given HTML page. This includes such things as inlined images, sounds, and referenced stylesheets.”

This is a feature of wget. curl is (simplified) a piece of software and a library for executing single HTTP commands. wget has extra features, such as downloading whole websites, that require interpreting the content it fetches. While this worked in the days of Web 1.0, the feature is not very useful anymore, because modern websites load additional files via JavaScript, which wget does not interpret at all. The website https://www.inlanefreight.com is a WordPress site with a theme from https://themeansar.com/, so you could buy the theme there, study it, write a script, and hope that you got it right.
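To make the "single HTTP command" point concrete: one curl invocation maps to exactly one request and one response (unless you script around it), and curl never parses the HTML it receives, so nothing referenced by the page is fetched:

    # one request, one response; the page's images, CSS, and scripts are not touched
    curl -s -o /dev/null -w 'status: %{http_code}, bytes: %{size_download}\n' \
      https://www.inlanefreight.com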

But come on: https://www.inlanefreight.com has 6 pages and a single PDF; you can count those faster by clicking through the site than it took me to figure out that it runs WordPress.

blaimi