
I've found only puf (Parallel URL fetcher) but I couldn't get it to read urls from a file; something like

 puf < urls.txt

does not work either.

The operating system installed on the server is Ubuntu.

Moonwalker
  • This could be done with Python and pycurl library and a little bit of glue logic in a script. But I don't know of a "canned" tool for that. – Keith Apr 08 '12 at 08:29
  • @Keith Is this approach better than using some async library such as gevent with urllib? – Moonwalker Apr 08 '12 at 18:37
  • urllib is not designed to be used asynchronously. libcurl has its own async loop and can be set up to do at least 1000 simultaneous fetches using the "multi" interface. – Keith Apr 09 '12 at 05:02
  • @Keith I like your answer best so could you write it as a "real" answer to take due credit for it? – Moonwalker Apr 10 '12 at 02:21

6 Answers


Using GNU Parallel,

$ parallel -j ${jobs} wget < urls.txt

or xargs from GNU Findutils,

$ xargs -n 1 -P ${jobs} wget < urls.txt

where ${jobs} is the maximum number of wget processes you want to allow to run concurrently (-n 1 gives one wget invocation per line of urls.txt). Without -j/-P, parallel will run as many jobs at a time as there are CPU cores (which doesn't necessarily make sense for wget, which is bound by network I/O rather than CPU), and xargs will run one at a time.

One nice feature that parallel has over xargs is keeping the output of the concurrently-running jobs separated, but if you don't care about that, xargs is more likely to be pre-installed.

ephemient

aria2 does this.

http://sourceforge.net/apps/trac/aria2/wiki/UsageExample#Downloadfileslistedinafileconcurrently

Example, downloading the URLs listed in urls.txt with up to five concurrent downloads:

 aria2c -j5 -i urls.txt

user17591
  • This answer would be improved with an actual example that solves the asked problem; as it stands, this qualifies as a link-only answer. https://meta.stackexchange.com/questions/225370/your-answer-is-in-another-castle-when-is-an-answer-not-an-answer – Jeff Schaller Sep 12 '17 at 09:58

Part of GNU Parallel's man page contains an example of a parallel recursive wget.

https://www.gnu.org/software/parallel/man.html#example-breadth-first-parallel-web-crawler-mirrorer

HTML is downloaded twice: once for extracting links and once for downloading to disk. Other content is only downloaded once.

If you do not need the recursion, ephemient's answer seems the obvious choice.

Ole Tange
  • Just a late FYI that any parallel-plus-wget "solution" is inherently inefficient because it requires downloading content *twice*, slow because of all the multiphase downloading, and also not nice to sysops who have to pay for the wasted bandwidth. – dhchdhd Sep 18 '17 at 08:23
  • There is no longer an example for recursive wget at the end of that link. This is a great example why link only responses are bad. – Mikko Rantalainen Jun 29 '22 at 08:03
  • @MikkoRantalainen It was still there, just further down the page. – Ole Tange Jun 30 '22 at 17:36
  • Thanks. That implementation still loads all pages twice, once by `lynx` and another time by `wget`. – Mikko Rantalainen Jul 13 '22 at 10:42

You can implement that using Python and the pycurl library. pycurl has a "multi" interface that implements its own event loop, enabling multiple simultaneous connections.

However the interface is rather C-like and therefore a bit cumbersome as compared to other, more "Pythonic", code.

I wrote a wrapper for it that builds a more complete browser-like client on top of it. You can use that as an example. See the pycopia.WWW.client module. The HTTPConnectionManager wraps the multi interface.
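The pycopia wrapper itself isn't reproduced here, but the overall shape of the "glue logic" can be sketched with nothing beyond the standard library. This hypothetical `fetch_all` helper (its name and the `fetch_one` parameter are assumptions of this sketch, not part of pycurl or pycopia) uses a thread pool where pycurl's "multi" interface would instead drive many connections from a single event loop:

```python
# Sketch only: a stdlib stand-in for the glue logic described above.
# pycurl's "multi" interface would replace the thread pool with one
# event loop driving many connections; here each worker thread
# handles one URL at a time.
import concurrent.futures
import urllib.request

def fetch(url, timeout=30):
    """Download a single URL and return (url, body) with body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return url, resp.read()

def fetch_all(urls, jobs=8, fetch_one=fetch):
    """Fetch every URL with at most `jobs` downloads in flight."""
    results = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=jobs) as pool:
        futures = [pool.submit(fetch_one, u) for u in urls]
        for fut in concurrent.futures.as_completed(futures):
            url, body = fut.result()
            results[url] = body
    return results

# Typical use, reading urls.txt like the other answers do:
#     with open("urls.txt") as f:
#         pages = fetch_all(line.strip() for line in f if line.strip())
```

The thread-pool version is simpler but scales less far than libcurl's multi interface, which multiplexes all transfers over one loop instead of one thread per in-flight download.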

Keith

This works, and with proper adjustments won't DoS the local machine or the remote servers:

(bandwidth=5000 jobs=8; \
 parallel      \
   --round     \
   -P $jobs    \
   --nice +5   \
   --delay 2   \
   --pipepart  \
   --cat       \
   -a urls.txt \
     wget                                \
       --limit-rate=$((bandwidth/jobs))k \
       -w 1                              \
       -nv                               \
       -i {}                             \
)
dhchdhd

You can also try:

#!/bin/bash
cat urls.txt | xargs -n 1 -P 2 wget -q

or in a loop, with wget's -b (background) option:

#!/bin/bash
while IFS= read -r file; do
    wget -b "${file}"
done < urls.txt
svp
    thank you for down voting!!! cheers!!! – svp Jun 22 '22 at 05:00
  • I cannot understand why this was downvoted either, so I added one upvote to balance things. I still think that this answer is redundant except for the `wget -b` part because `xargs -n 1 -P` was already mentioned by @ephemient a year earlier. – Mikko Rantalainen Jun 29 '22 at 08:01
  • Repeat of another answer on a question with lots of solutions and accepted answers. – number9 Jun 29 '22 at 17:23
  • @number9 is that a rule of the forum? I don't think the forum forbids adding alternative answers to a question that already has an accepted answer. The answer I have given is not a repeat but an alternate way to do it. – svp Jun 30 '22 at 06:14
  • @dare_devils I don't think the forum has such a rule, but experienced people like `number9` can downvote answers. Anyway, the answer is useful to people like me who are beginners in the field. – svp Jun 30 '22 at 06:21
  • @dare_devils, You are arguing that changing _a_ command line _switch_ from a variable ${jobs} to 2 is an "alternate" answer. I do not think that adds to the discussion. – number9 Jun 30 '22 at 12:47
  • @number9 `cat urls.txt` and `< urls.txt` are two different commands, in my view. Also, you did not notice the second option in my answer. Is that also an argument? I am fine with your downvote. People who use my command can upvote... – svp Jul 01 '22 at 04:41