Grab the Internet with wget

Posted by Juliet Kemp on Fri, Jan 13, 2012
  
  

You may already know how handy the command-line tool wget is for grabbing a particular file over HTTP, but wget has many options you may not know about, including recursive retrieval, mirroring-specific options, a slew of ways to handle connection issues, and a few ways to deal with websites that assume you're an interactive user. Here's how to turn wget from a one-trick pony into a whole circus of performing horses.



If all you want is a single page, you've probably already used a command like



wget http://www.example.com/thatpage.html


to save a copy of the linked page locally, as thatpage.html. However, by default, wget won't save any image files or stylesheets that might be a part of your page, so if you try to view it, the page might not look the way you expect it to. To get the page and all its supporting elements, use



wget --page-requisites http://www.example.com/thatpage.html


As well as reading options from the command line, wget also looks for default options in the global wgetrc file (usually found at /etc/wgetrc or /usr/local/etc/wgetrc, depending on your system setup), and in the local file ~/.wgetrc. Any of the options discussed here can be added to one of the wgetrc files in order to apply to every download by default.
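As an illustration, here is a minimal sketch of a per-user configuration. It writes to a scratch file called sample.wgetrc (a made-up name) rather than touching your real ~/.wgetrc, and the values are arbitrary examples rather than recommendations.

```shell
# Write an example configuration to a scratch file (hypothetical name),
# leaving any existing ~/.wgetrc alone.
cat > sample.wgetrc <<'EOF'
# Wait one second between retrievals
wait = 1
# Cap the download speed at 20 kilobytes per second
limit_rate = 20k
# Try each file at most five times
tries = 5
EOF
```

To try the file without installing it, point the WGETRC environment variable at it, for example WGETRC=./sample.wgetrc wget http://www.example.com/thatpage.html. Once you're happy with the settings, copy the lines into ~/.wgetrc.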



You can also read in a stack of URLs at once by listing them in a file:



wget -i urllist.txt


Because wget is non-interactive, you can kick it off with the -b argument and log out, and it'll get on with your job in the background:



wget -b -o wget.out http://www.example.com/thatpage.html


-o file sends wget's log messages to the specified file instead of to the terminal. (If running in the background and no output file is specified with -o, the messages will go to the file wget-log.) -O filename does something a little different – it saves the downloaded content to filename instead of to its own name. If you've specified multiple source files, they'll be concatenated and all written to this file.





The default output is verbose; to minimize output, use -nv. You can also change the appearance of wget's progress bar with --progress=dot.
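The options so far combine naturally. The sketch below writes a small URL list and defines a helper that fetches it in the background with terse logging. The name fetch_list and the second URL are made up for illustration.

```shell
# Write a list of URLs, one per line (the second URL is a made-up example).
printf '%s\n' \
    'http://www.example.com/thatpage.html' \
    'http://www.example.com/otherpage.html' > urllist.txt

# Hypothetical helper: fetch every URL in the list in the background,
# with terse output written to a log file named after the list.
fetch_list() {
    wget -b -nv -i "$1" -o "${1%.txt}.log"
}
```

Running fetch_list urllist.txt starts the download and returns straight away; the progress messages end up in urllist.log.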



Mirroring and Recursive Download



If you want to fetch a large chunk of a website, or if you want to mirror a website, the --recursive (-r) option is your friend. If you just specify -r, wget will download the page you point to, and every (internal) page linked to from that page, and so on, up to five links away from your starting point. (So if page 1 links to page 2 which links to page 3, and so on, you'll download pages 1-6 but not page 7.) You can change the number of links wget should follow from the default of 5 by using -l n, or use -l inf to turn the limit off altogether. You can also use --no-parent (-np) to avoid ascending a directory level, meaning that you download only a subsite of a particular website.



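Recursion is even more useful if you add the -k (--convert-links) option, which rewrites the links in the downloaded pages once the download has finished: links to files that wget has fetched become relative links, and links to files it has not fetched become full absolute URLs. In other words, if you're downloading http://www.example.com/test.html and it has a link to http://www.example.com/anothertest.html, which is also downloaded, that link will be edited within the local file to read anothertest.html. This enables you to browse the site entirely locally and offline. For mirroring, the -m (--mirror) option turns on recursion and timestamping and removes the depth limit; it is shorthand for -r -N -l inf --no-remove-listing. It does not turn on -k, so pass both (wget -m -k) if you want a mirror you can browse offline.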
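A typical offline-mirror command can be wrapped in a small helper. This is only a sketch: mirror_site is a made-up name, and it passes --convert-links explicitly so that link conversion does not depend on what --mirror implies.

```shell
# Hypothetical helper: mirror one part of a site for offline browsing.
#   --mirror           recursive, timestamped, unlimited-depth download
#   --convert-links    rewrite links so the local copy can be browsed offline
#   --page-requisites  also fetch the images and stylesheets each page needs
#   --no-parent        never climb above the starting directory
mirror_site() {
    wget --mirror --convert-links --page-requisites --no-parent "$1"
}
```

For example, mirror_site http://www.example.com/dir/ would fetch everything under dir/ into www.example.com/dir/ in the current directory.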



wget behaves differently when it comes to duplicate files when you're downloading recursively. As a rule, when you download a file whose filename already exists locally, wget keeps the original copy and names the new one filename.html.1 (and filename.html.2 next time, and so on). However, if you use wget -r, re-downloading a file will simply overwrite the old version with the new one. To avoid this, use -r -nc to preserve the older version and keep the newer one from being downloaded from the server. If using non-recursive wget, -nc will also prevent a new version from being downloaded, so you won't get file.html.1 downloaded at all. (So in non-recursive mode, it's not really "no clobbering" but "no versioning.")



Here are a few directory-related options you may find useful when downloading recursively:




    • --no-directories (-nd): forces wget not to create a directory hierarchy, so all files are downloaded in the same directory.


    • -nH: removes the default host directory prefix, so http://www.example.com/test is stored in the directory test/ rather than in www.example.com/test.


    • --cut-dirs=n: this enables you to better control where files are saved locally. If you recursively retrieved http://www.example.com/dir/subdir/mydirectory/, it would be stored in www.example.com/dir/subdir/mydirectory. With -nH, it would be stored in dir/subdir/mydirectory. However, with -nH --cut-dirs=2, it would be saved in mydirectory.
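The interaction between -nH and --cut-dirs is easier to see worked through. The function below is not part of wget; it is a made-up helper (local_dir) that only mimics the rule described above, so you can predict where a directory will land before you download it.

```shell
# Hypothetical helper: print the local directory for a URL, given
# whether -nH is in effect ("yes" or "no") and the --cut-dirs count.
local_dir() {
    url=$1
    no_host=$2
    cut=$3
    rest=${url#*://}        # drop the scheme
    host=${rest%%/*}        # everything up to the first slash
    path=${rest#"$host"}    # what is left after the host name
    path=${path#/}
    path=${path%/}
    # Remove one leading directory component per --cut-dirs level.
    while [ "$cut" -gt 0 ] && [ -n "$path" ]; do
        case $path in
            */*) path=${path#*/} ;;
            *)   path= ;;
        esac
        cut=$((cut - 1))
    done
    if [ "$no_host" = yes ]; then
        printf '%s\n' "$path"
    else
        printf '%s\n' "$host${path:+/$path}"
    fi
}
```

With the URL from the example, local_dir http://www.example.com/dir/subdir/mydirectory/ no 0 prints www.example.com/dir/subdir/mydirectory, the same call with yes 0 prints dir/subdir/mydirectory, and with yes 2 it prints mydirectory.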



Handling Connection Issues



Sometimes you may find that a file only partially downloads – perhaps your connection flaked out halfway through. When that happens, you can use wget's --continue (-c) option:



wget -c http://www.example.com/bigfile.tgz


If there's a file in the local directory called bigfile.tgz, wget will try to fetch the rest of it, starting from where the earlier download left off. wget is actually smart enough to do this itself if the connection drops partway through a single run; you only need the -c option if you're starting a new invocation of wget (for instance in a new terminal window). Be aware also that if wget can't get the rest of the file (perhaps the server doesn't support partial downloads), it will refuse to start a new download so as not to clobber the existing content. In that case, if you really want a new download, you have to remove the partial file and start over. Remember as well that wget isn't entirely magic; it can't tell if a file has been changed on the server since your first download attempt. If that has happened, you'll get a garbled file and will have to start over.



wget will automatically retry downloads that failed entirely, up to the number of attempts set with --tries (-t), which defaults to 20. It is usually more useful to have it wait for a short while before each retry, so that any problems at the server end have a chance of being fixed. Using --waitretry=seconds will do this, using a linear backoff strategy. This means that if you specify five seconds, wget will wait one second after the first failure on a file, two seconds after the second, and so on up to five seconds after the fifth; any later retries also wait five seconds. The value caps the length of the wait, not the number of retries. This option is usually set to default to 10 in the global wgetrc file that's provided with the standard wget package.
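To make the backoff arithmetic concrete, here is a small sketch that prints the pause before each retry. It follows the wget manual's description of linear backoff: the pause grows by one second per failure until it reaches the --waitretry value and then stays there, while the number of attempts is governed separately by --tries. backoff_waits is a made-up helper, not something wget provides.

```shell
# Hypothetical helper: list the pause (in seconds) before each retry
# for a given --waitretry value and number of retries.
backoff_waits() {
    max=$1
    retries=$2
    n=1
    out=
    while [ "$n" -le "$retries" ]; do
        # The pause equals the failure count, capped at the maximum.
        if [ "$n" -lt "$max" ]; then w=$n; else w=$max; fi
        out="${out:+$out }$w"
        n=$((n + 1))
    done
    printf '%s\n' "$out"
}
```

Calling backoff_waits 5 5 prints 1 2 3 4 5, which is 15 seconds of waiting in total across five retries of one file.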



If you have a slow connection, you may wish to limit the amount of bandwidth that wget is allowed, which you can do with --limit-rate=amount, where amount is in bytes per second unless you add a k suffix for kilobytes or an m suffix for megabytes. For example, to limit the rate to 20 kilobytes per second, run:

wget --limit-rate=20k http://www.example.com/bigpage.html


You can add this setting to your .wgetrc (or to the global one), so it's used as a default, with the line:



limit_rate=20k


If you're downloading recursively or otherwise fetching a large number of files, it's considered polite to use the --wait=n option, which tells wget to wait n seconds between retrievals and thus helps avoid server overload at the other end. Some websites look for particular usage patterns and use them to block automated retrieval. To get around them, you can use --random-wait together with --wait to vary the time between requests (between 0.5 and 1.5 * wait seconds).
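Put together, a considerate recursive download might look like the sketch below. The helper name polite_fetch and the specific numbers are illustrative assumptions; pick values that suit the server you're talking to.

```shell
# Hypothetical helper: recursive download that goes easy on the server.
#   -r -np            recurse, but stay below the starting directory
#   --wait=2          pause about two seconds between retrievals
#   --random-wait     vary that pause so requests are less regular
#   --limit-rate=20k  cap the transfer speed
polite_fetch() {
    wget -r -np --wait=2 --random-wait --limit-rate=20k "$1"
}
```

You would call it as polite_fetch http://www.example.com/dir/.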



Other Options



Finally, a few more miscellaneous useful wget options:




    • --timestamping (-N) offers timestamping of downloaded files. It sets the last modified date of the local file to be the same as it was on the server. You can then use -N in a subsequent wget operation to retrieve only files that have changed since the last download. Thus wget -N file.html would get file.html only if it had been modified on the server since your last download.


    • --server-response (-S) prints headers and responses, as well as retrieving files, which can be useful for basic debugging if there's a problem.


    • --referer=url is useful when dealing with sites that require a specific Referer page. You're most likely to discover this by experimentation (e.g. trying to download a page with wget and discovering that it doesn't download correctly). You can also use --user-agent="user agent string" to fake looking like a browser, which is another problem that can arise when you try to automate downloads. (See my recent cURL article for more on user agent strings, among other things.)


    • --http-user=username --http-password=password allows you to specify a username and password for a site (some older releases of wget spelled the second option --http-passwd). Of course that's not very secure – anyone with access to the process list on your machine would be able to see your password. Saving them instead in .wgetrc as http_user and http_password (and setting that file's permissions to hide it from other users) is safer. For further security, you can add those lines only just before you start the download, then delete them again once it has begun.
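As a sketch of the safer approach to credentials, the snippet below creates a configuration file that only you can read and defines a helper that uses it. The file name private.wgetrc, the helper name fetch_protected, and the credentials are all made up for illustration.

```shell
# Create the file inside a subshell with a restrictive umask, so it is
# never readable by other users, even briefly.
(
    umask 077
    cat > private.wgetrc <<'EOF'
http_user = alice
http_password = not-a-real-password
EOF
)

# Hypothetical helper: fetch a protected page using that file, printing
# the server's headers (-S) and skipping files that are unchanged (-N).
fetch_protected() {
    WGETRC=private.wgetrc wget -N -S "$1"
}
```

Because the password lives in the file rather than on the command line, it doesn't show up in the process list. Delete private.wgetrc when you no longer need it.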



With a little experimentation, wget can make your online life easier – and far more straightforward to automate.

This work is licensed under a Creative Commons Attribution 3.0 Unported License
Tags: Technical, Utility, wget

© 2013 OpenLogic, Inc. All rights reserved.