You may already know how handy the command-line tool wget is for grabbing a particular file over HTTP, but wget has many options you may not know about, including recursive retrieval, mirroring-specific options, a slew of ways to handle connection issues, and a few ways to deal with websites that assume you're an interactive user. Here's how to turn wget from a one-trick pony into a whole circus of performing horses.
If all you want is a single page, you've probably already used a command like
wget http://www.example.com/thatpage.html
to save a copy of the linked page locally, as thatpage.html. However, by default, wget won't save any image files or stylesheets that might be a part of your page, so if you try to view it, the page might not look the way you expect it to. To get the page and all its supporting elements, use
wget --page-requisites http://www.example.com/thatpage.html
As well as reading options from the command line, wget also looks for default options in the global wgetrc file (usually found at /etc/wgetrc or /usr/local/etc/wgetrc, depending on your system setup), and in the local file ~/.wgetrc. Any of the options discussed here can be added to one of the wgetrc files in order to apply to every download by default.
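For instance, a ~/.wgetrc containing lines like these (the values are purely illustrative) would set quieter output, a one-second pause between retrievals, and fewer retries by default:
verbose = off
wait = 1
tries = 3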
You can also read in a stack of URLs at once by listing them in a file:
wget -i urllist.txt
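The file is simply a list of URLs, one per line, so urllist.txt might look something like this (the URLs are only examples):
http://www.example.com/thatpage.html
http://www.example.com/anothertest.html
http://www.example.com/bigfile.tgz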
Because wget is non-interactive, you can kick it off with the -b argument and log out, and it'll get on with your job in the background:
wget -b -o wget.out http://www.example.com/thatpage.html
-o file sends wget's log output to the specified file. (If you're running in the background and don't specify an output file with -o, the output goes to the file wget-log.) -O filename does something a little different – it saves the downloaded document to filename instead of to the name it has on the server. If you've specified multiple source URLs, the documents will be concatenated and all written to this one file.
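To make the distinction concrete, the following hypothetical invocation writes wget's log messages to fetch.log while saving the downloaded page itself as front.html (both filenames are made up for the example):
wget -o fetch.log -O front.html http://www.example.com/thatpage.html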
The default output is verbose; to minimize output, use -nv. You can also change the appearance of wget's progress bar with --progress=dot.
If you want to fetch a large chunk of a website, or if you want to mirror a website, the --recursive (-r) option is your friend. If you just specify -r, wget will download the page you point to, and every (internal) page linked to from that page, and so on, up to five links away from your starting point. (So if page 1 links to page 2 which links to page 3, and so on, you'll download pages 1-6 but not page 7.) You can change the number of links wget should follow from the default of 5 by using -l n, or use -l inf to turn the limit off altogether. You can also use --no-parent (-np) to avoid ascending a directory level, meaning that you download only a subsite of a particular website.
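As an illustration (the URL and the depth of 2 are arbitrary), the following fetches a subsection of a site, following links no more than two levels deep and never wandering up into the parent directory:
wget -r -l 2 --no-parent http://www.example.com/docs/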
Recursion is even more useful if you add the -k option, which turns all absolute internal links into relative links. In other words, if you're downloading http://www.example.com/test.html and it has a link to http://www.example.com/anothertest.html, that link will be edited within the local file to read anothertest.html. This enables you to browse the site entirely locally and offline. To turn on this option together with a number of other options that are likely to be useful for mirroring, use the -m option.
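For example, a command along these lines mirrors a site, pulls in page requisites, and rewrites links for offline browsing (the URL is just a placeholder):
wget -m -k -p http://www.example.com/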
wget handles duplicate filenames differently depending on whether you're downloading recursively. Ordinarily, when you download a file whose name already exists locally, wget keeps the original copy and saves the new one as filename.html.1 (then filename.html.2 the next time, and so on). With wget -r, however, re-downloading a file simply overwrites the old version with the new one. To avoid this, use -r -nc, which preserves the existing version and stops the newer one from being downloaded from the server at all. With non-recursive wget, -nc likewise prevents a new version from being downloaded, so you won't get file.html.1 at all. (So in non-recursive mode it's not really "no clobbering" so much as "no versioning.")
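So to re-run a recursive download later without overwriting the files you already have, you might use something like:
wget -r -nc http://www.example.com/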
Here are a few directory-related options you may find useful when downloading recursively:
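-nd (--no-directories) saves every file straight into the current directory rather than recreating the site's directory tree.
-x (--force-directories) does the opposite, creating the full local directory hierarchy even when it wouldn't otherwise be needed.
-nH (--no-host-directories) skips the top-level directory named after the host (www.example.com/, for instance).
--cut-dirs=n additionally ignores the first n remote directory components when building local paths.
-P prefix (--directory-prefix=prefix) saves the whole retrieval tree under prefix instead of the current directory.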
Sometimes you may find that a file only partially downloads – perhaps your connection flaked out halfway through. When that happens, you can use wget's --continue (-c) option:
wget -c http://www.example.com/bigfile.tgz
If there's a file in the local directory called bigfile.tgz, wget will try to fetch the rest of it, picking up from where the local copy leaves off. wget is actually smart enough to do this by itself within a single session; you only need the -c option when you're starting a new invocation of wget (for instance, in a new terminal window). Be aware also that if wget can't get the rest of the file (perhaps because the server doesn't support partial downloads), it will refuse to start a new download rather than clobber the existing content. In that case, if you really want a fresh download, you have to remove the partial file and start over. Remember as well that wget isn't entirely magic: it can't tell whether the file has changed on the server since your first download attempt. If it has, you'll end up with a garbled file and will have to start over.
wget will automatically retry downloads that fail entirely, but by default it retries immediately. It can be more useful to have it wait a short while before retrying, so that any problems at the server end have a chance of being fixed. --waitretry=seconds does this using a linear backoff strategy: if you specify five seconds, wget waits one second between the first and second tries, two seconds between the second and third tries, and so on, up to a maximum of five seconds between any later retries. Note that --waitretry only controls the pause between retries; the number of retries is set separately with --tries (-t), which defaults to 20. This option is usually set to a default of 10 in the global wgetrc file that's provided with the standard wget package.
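For instance (the numbers here are arbitrary), the following retries a flaky download up to ten times, backing off by up to five seconds between attempts:
wget --waitretry=5 --tries=10 http://www.example.com/bigfile.tgz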
If you have a slow connection, you may wish to limit the amount of bandwidth wget is allowed to use, which you can do with --limit-rate=amount, where amount is in bytes per second (you can append k or m for kilobytes or megabytes). For example, to limit the rate to 20KB per second, run:
wget --limit-rate=20k http://www.example.com/bigpage.html
You can add this setting to your .wgetrc (or to the global one), so it's used as a default, with the line:
limit_rate = 20k
If you're downloading recursively or otherwise fetching a large number of files, it's considered polite to use the --wait=n option, which tells wget to wait n seconds between retrievals and thus helps avoid server overload at the other end. Some websites look for particular usage patterns and use them to block automated retrieval. To get around them, you can use --random-wait together with --wait to vary the time between requests more randomly (between 0 and 2 * wait seconds).
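Putting those together, a politer recursive fetch might look something like this (the two-second wait is only an example):
wget -r --wait=2 --random-wait http://www.example.com/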
Finally, a few more miscellaneous useful wget options:
-N (--timestamping) downloads a file only if the copy on the server is newer than the one you already have locally, which makes repeated fetches much cheaper:
wget -N http://www.example.com/file.html
--user-agent="user agent string" makes wget identify itself to the server with the string you supply instead of its own default, which can help with sites that refuse requests from anything announcing itself as wget.
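For example, the following (with a deliberately generic, made-up user-agent string) fetches a page while announcing itself as an ordinary browser:
wget --user-agent="Mozilla/5.0 (X11; Linux x86_64)" http://www.example.com/thatpage.html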
With a little experimentation, wget can make your online life easier – and far more straightforward to automate.