Grab the Internet with wget

By Juliet Kemp on Friday, January 13th, 2012 in Technical

You may already know how handy the command-line tool wget is for grabbing a particular file over HTTP, but wget has many options you may not know about, including recursive retrieval, mirroring-specific options, a slew of ways to handle connection issues, and a few ways to deal with websites that assume you’re an interactive user. Here’s how to turn wget from a one-trick pony into a whole circus of performing horses.

If all you want is a single page, you’ve probably already used a command like

wget http://www.example.com/thatpage.html

to save a copy of the linked page locally, as thatpage.html. However, by default, wget won’t save any image files or stylesheets that might be a part of your page, so if you try to view it, the page might not look the way you expect it to. To get the page and all its supporting elements, use

wget --page-requisites http://www.example.com/thatpage.html

As well as reading options from the command line, wget also looks for default options in the global wgetrc file (usually found at /etc/wgetrc or /usr/local/etc/wgetrc, depending on your system setup), and in the local file ~/.wgetrc. Any of the options discussed here can be added to one of the wgetrc files in order to apply to every download by default.
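
As a rough sketch of what such a file might contain, the snippet below writes a sample startup file to a scratch location rather than touching your real ~/.wgetrc. The file name sample.wgetrc and the values chosen are illustrative assumptions; the setting names are the wgetrc spellings of command-line options covered in this article.

```shell
# Write a sample startup file to a scratch directory (hypothetical name),
# so nothing in your real ~/.wgetrc is touched.
WGETRC_SAMPLE="${TMPDIR:-/tmp}/sample.wgetrc"

cat > "$WGETRC_SAMPLE" <<'EOF'
# Fetch images and stylesheets along with each page (--page-requisites)
page_requisites = on
# Pause one second between retrievals (--wait=1)
wait = 1
# Cap bandwidth at 20 kilobytes per second (--limit-rate=20k)
limit_rate = 20k
EOF

echo "Wrote $WGETRC_SAMPLE"
```

To try the file for a single command without installing it, you can point the WGETRC environment variable at it, for example WGETRC=/tmp/sample.wgetrc wget http://www.example.com/thatpage.html.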

You can also read in a stack of URLs at once by listing them in a file:

wget -i urllist.txt
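
The list is just a plain text file with one URL per line. Here is a minimal sketch that builds one; the example.com addresses are placeholders and the scratch path is an assumption.

```shell
# Build a URL list, one address per line (placeholder example.com URLs).
URL_LIST="${TMPDIR:-/tmp}/urllist.txt"

printf '%s\n' \
  'http://www.example.com/thatpage.html' \
  'http://www.example.com/otherpage.html' \
  'http://www.example.com/bigfile.tgz' > "$URL_LIST"

# Count the entries wget would work through.
echo "$(wc -l < "$URL_LIST" | tr -d ' ') URLs listed in $URL_LIST"
```

You would then pass the file to wget with -i, as above. Giving -i a single hyphen (wget -i -) makes wget read the list from standard input instead.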

Because wget is non-interactive, you can kick it off with the -b argument and log out, and it’ll get on with your job in the background:

wget -b -o wget.out http://www.example.com/thatpage.html

-o file sends wget’s messages to the specified log file. (If wget is running in the background and no log file is specified with -o, the messages go to the file wget-log.) -O filename does something different: it writes the downloaded content to filename instead of to a file named after the source. If you’ve specified multiple URLs, their contents are concatenated and all written to this one file.

The default output is verbose; to minimize output, use -nv. You can also change the appearance of wget’s progress bar with --progress=dot.
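
If you find yourself combining these flags often, a small wrapper can save typing. This is only a sketch: fetch_in_background is a made-up name, and the default log name simply mirrors the example above.

```shell
# Hypothetical helper: start a quiet background download with its own log.
# Usage: fetch_in_background URL [LOGFILE]
fetch_in_background() {
  log=${2:-wget.out}
  # -b: go to the background; -nv: terse output; -o: write messages to "$log"
  wget -b -nv -o "$log" "$1"
}
```

Calling fetch_in_background http://www.example.com/thatpage.html job.log starts the download and leaves wget’s messages in job.log.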

Mirroring and Recursive Download

If you want to fetch a large chunk of a website, or if you want to mirror a website, the --recursive (-r) option is your friend. If you just specify -r, wget will download the page you point to, and every (internal) page linked to from that page, and so on, up to five links away from your starting point. (So if page 1 links to page 2 which links to page 3, and so on, you’ll download pages 1-6 but not page 7.) You can change the number of links wget should follow from the default of 5 by using -l n, or use -l inf to turn the limit off altogether. You can also use --no-parent (-np) to avoid ascending a directory level, meaning that you download only a subsite of a particular website.
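
Putting the depth limit and --no-parent together might look like the sketch below. The function name fetch_subsite and the /docs/ URL in the usage note are hypothetical.

```shell
# Hypothetical helper: fetch one section of a site to a limited depth.
# Usage: fetch_subsite URL [DEPTH]
fetch_subsite() {
  depth=${2:-5}   # 5 matches wget's default recursion depth
  # -r: recurse; -l: maximum link depth; -np: never ascend to the parent directory
  wget -r -l "$depth" -np "$1"
}
```

For example, fetch_subsite http://www.example.com/docs/ 2 would follow links at most two levels deep and stay at or below /docs/.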

Recursion is even more useful if you add the --convert-links (-k) option, which rewrites links to pages you have downloaded as relative links. In other words, if you’re downloading http://www.example.com/test.html and it has a link to http://www.example.com/anothertest.html, that link will be edited within the local file to read anothertest.html. This enables you to browse the site entirely locally and offline. For mirroring, the --mirror (-m) option is shorthand for -r -N -l inf --no-remove-listing: recursion with no depth limit, timestamping, and retained FTP directory listings. It does not turn on -k, so add -k as well if you want a copy you can browse locally.
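
A typical "offline copy" invocation might be wrapped up as follows. Treat it as a sketch rather than a recipe: mirror_site is a made-up name, and because the wget manual describes --mirror as equivalent to -r -N -l inf --no-remove-listing, link conversion is requested explicitly here.

```shell
# Hypothetical helper: make a locally browsable copy of a site section.
# Usage: mirror_site URL
mirror_site() {
  # --mirror          : -r -N -l inf --no-remove-listing
  # --convert-links   : rewrite links for local browsing (-k)
  # --page-requisites : also fetch images and stylesheets
  # --no-parent       : stay at or below the starting directory
  wget --mirror --convert-links --page-requisites --no-parent "$1"
}
```

You would call it as mirror_site http://www.example.com/docs/ and then open the saved pages in a browser.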

wget behaves differently when it comes to duplicate files when you’re downloading recursively. As a rule, when you download a file whose filename already exists locally, wget keeps the original copy and names the new one filename.html.1 (and filename.html.2 next time, and so on). However, if you use wget -r, re-downloading a file will simply overwrite the old version with the new one. To avoid this, use -r -nc to preserve the older version and keep the newer one from being downloaded from the server. If using non-recursive wget, -nc will also prevent a new version from being downloaded, so you won’t get file.html.1 downloaded at all. (So in non-recursive mode, it’s not really “no clobbering” but “no versioning.”)

Here are a few directory-related options you may find useful when downloading recursively:

  • --no-directories (-nd): forces wget not to create a directory hierarchy, so all files are downloaded in the same directory.
  • -nH: removes the default host directory prefix, so http://www.example.com/test is stored in the directory test/ rather than in www.example.com/test.
  • --cut-dirs=n: this enables you to better control where files are saved locally. If you recursively retrieved http://www.example.com/dir/subdir/mydirectory/, it would be stored in www.example.com/dir/subdir/mydirectory. With -nH, it would be stored in dir/subdir/mydirectory. However, with -nH --cut-dirs=2, it would be saved in mydirectory.
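
To make the effect of -nH and --cut-dirs concrete, here is a small illustration that computes the directory names described above without downloading anything. It is not part of wget: local_dir is a hypothetical function, and it assumes the URL points at a directory.

```shell
# Illustration only: approximate the local directory wget would choose.
# Usage: local_dir URL NO_HOST(0|1) CUT_DIRS
local_dir() (
  url=$1 no_host=$2 cut=$3
  rest=${url#*://}       # drop the scheme
  host=${rest%%/*}       # www.example.com
  rel=${rest#"$host"}    # /dir/subdir/mydirectory/
  rel=${rel#/}           # drop the leading slash
  rel=${rel%/}           # drop the trailing slash
  # Remove one leading path component per --cut-dirs level.
  while [ "$cut" -gt 0 ] && [ -n "$rel" ]; do
    case $rel in
      */*) rel=${rel#*/} ;;
      *)   rel= ;;
    esac
    cut=$((cut - 1))
  done
  if [ "$no_host" -eq 1 ]; then
    echo "${rel:-.}"
  else
    echo "$host${rel:+/$rel}"
  fi
)

local_dir http://www.example.com/dir/subdir/mydirectory/ 0 0  # www.example.com/dir/subdir/mydirectory
local_dir http://www.example.com/dir/subdir/mydirectory/ 1 0  # dir/subdir/mydirectory
local_dir http://www.example.com/dir/subdir/mydirectory/ 1 2  # mydirectory
```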

Handling Connection Issues

Sometimes you may find that a file only partially downloads – perhaps your connection flaked out halfway through. When that happens, you can use wget’s --continue (-c) option:

wget -c http://www.example.com/bigfile.tgz

If there’s a file in the local directory called bigfile.tgz, wget will try to fetch the rest of it, starting from where the earlier download left off. wget is actually smart enough to do this itself if you’re still within the same session; you only need the -c option if you’re starting a new invocation of wget (for instance in a new terminal window). Be aware also that if wget can’t get the rest of the file (perhaps the server doesn’t support partial downloads), it will refuse to start a new download so as not to clobber the existing content. In that case, if you really want a new download, you have to remove the partial file and start over. Remember as well that wget isn’t entirely magic; it can’t tell if a file has been changed on the server since your first download attempt. If that has happened, you’ll get a garbled file and will have to start over.
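
Because each new invocation needs -c to pick up a partial file, a script that re-runs wget after a failure should pass it every time. The loop below is a minimal sketch under that assumption; resume_until_done and its attempt limit are invented for illustration.

```shell
# Hypothetical helper: re-run wget until the file is complete or we give up.
# Usage: resume_until_done URL [MAX_ATTEMPTS]
resume_until_done() {
  attempts=${2:-3}
  i=1
  while [ "$i" -le "$attempts" ]; do
    # Each pass is a fresh wget process, so -c is what lets it
    # continue from the partial file left by the previous pass.
    if wget -c "$1"; then
      return 0
    fi
    i=$((i + 1))
  done
  return 1
}
```

Something like resume_until_done http://www.example.com/bigfile.tgz 5 would try up to five times. The attempt limit matters: if the server can’t resume, every pass fails the same way.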

wget automatically retries downloads that fail, and --waitretry=seconds controls how long it pauses between attempts on the same file, so that any problems at the server end have a chance of being fixed. It uses a linear backoff strategy, with seconds as the ceiling. If you specify five seconds, wget will wait one second after the first failure, two seconds after the second, and so on up to five seconds after the fifth; any later retries also wait five seconds. The option caps only the wait: the number of attempts is set separately with --tries (-t), which defaults to 20. According to the wget manual, wget assumes a --waitretry value of 10 seconds if you don’t set one.
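
The linear backoff is easy to picture with a few lines of shell. This sketch only prints a schedule and never calls wget; backoff_schedule is a hypothetical name, and the model (pause grows by one second per failure, capped at the --waitretry value) follows the wget manual’s description.

```shell
# Illustration of linear backoff: print the pause before each retry.
# Usage: backoff_schedule MAX_WAIT RETRIES
backoff_schedule() {
  max_wait=$1
  retries=$2
  n=1
  while [ "$n" -le "$retries" ]; do
    wait_s=$n
    # The pause grows by one second per failure but never exceeds MAX_WAIT.
    if [ "$wait_s" -gt "$max_wait" ]; then
      wait_s=$max_wait
    fi
    echo "retry $n: wait ${wait_s}s"
    n=$((n + 1))
  done
}

# Schedule for --waitretry=5 over seven retries: 1, 2, 3, 4, 5, 5, 5 seconds.
backoff_schedule 5 7
```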

If you have a slow connection, you may wish to limit the amount of bandwidth that wget is allowed, which you can do with --limit-rate=amount, where amount is in bytes per second and can take a k (kilobytes) or m (megabytes) suffix. For example, to limit the rate to 20 kilobytes per second, run:

wget --limit-rate=20k http://www.example.com/bigpage.html

You can add this setting to your .wgetrc (or to the global one), so it’s used as a default, with the line:

limit_rate=20k
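
It can help to work out what a given limit means in practice before you set it. The arithmetic below is a back-of-the-envelope sketch: estimate_seconds is a hypothetical helper, it treats a megabyte as 1,024 kilobytes, and it ignores latency and protocol overhead.

```shell
# Rough estimate: seconds needed to fetch SIZE_MB megabytes
# at RATE_KB kilobytes per second (integer arithmetic).
# Usage: estimate_seconds SIZE_MB RATE_KB
estimate_seconds() {
  size_mb=$1
  rate_kb=$2
  echo $(( size_mb * 1024 / rate_kb ))
}

estimate_seconds 10 20   # a 10 MB file at --limit-rate=20k; prints 512
```

So a 10 MB file at 20k takes a little under nine minutes, which is worth knowing before you apply the limit as a global default.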

If you’re downloading recursively or otherwise fetching a large number of files, it’s considered polite to use the --wait=n option, which tells wget to wait n seconds between retrievals and thus helps avoid server overload at the other end. Some websites look for particular usage patterns and use them to block automated retrieval. To get around them, you can use --random-wait together with --wait to vary the time between requests (between 0.5 and 1.5 times the --wait value in current versions of wget).
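
A recursive fetch that paces itself might be wrapped like this. As before, it is a sketch: polite_fetch is a made-up name and the two-second default pause is an arbitrary choice, not a recommendation from the wget documentation.

```shell
# Hypothetical helper: recursive download with a randomized pause between requests.
# Usage: polite_fetch URL [PAUSE_SECONDS]
polite_fetch() {
  pause=${2:-2}   # base delay in seconds (arbitrary illustrative default)
  # --wait sets the base delay; --random-wait varies it from request to request
  wget -r --wait="$pause" --random-wait "$1"
}
```

For instance, polite_fetch http://www.example.com/docs/ 3 spaces requests roughly three seconds apart.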

Other Options

Finally, a few more miscellaneous useful wget options:

  • --timestamping (-N) offers timestamping of downloaded files. It sets the last modified date of the local file to be the same as it was on the server. You can then use -N in a subsequent wget operation to retrieve only files that have changed since the last download. Thus wget -N file.html would get file.html only if it had been modified on the server since your last download.
  • --server-response (-S) prints headers and responses, as well as retrieving files, which can be useful for basic debugging if there’s a problem.
  • --referer=url is useful when dealing with sites that require a specific Referer page. You’re most likely to discover this by experimentation (e.g. trying to download a page with wget and discovering that it doesn’t download correctly). You can also use --user-agent="user agent string" to fake looking like a browser, which is another problem that can arise when you try to automate downloads. (See my recent cURL article for more on user agent strings, among other things.)
  • --http-user=username --http-password=password allows you to specify a username and password for a site. Of course that’s not very secure: anyone with access to the process list on your machine would be able to see your password. Saving them instead in .wgetrc (and setting that file’s permissions to hide it from other users) is safer. For further security, you can add those lines only just before you start the download, then delete them again once it has begun.
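
For the last point, keeping credentials out of the process list, one possible approach is a separate startup file that only you can read. The sketch below writes it to a scratch path; the file name auth.wgetrc and the username and password values are placeholders.

```shell
# Store credentials in a private startup file instead of on the command line.
# The path, user name, and password below are placeholders.
AUTH_RC="${TMPDIR:-/tmp}/auth.wgetrc"

(
  # umask 077 inside a subshell: new files are readable by their owner only,
  # and the change does not leak into the rest of your session.
  umask 077
  cat > "$AUTH_RC" <<'EOF'
http_user = username
http_password = password
EOF
  # Tighten permissions explicitly in case the file already existed.
  chmod 600 "$AUTH_RC"
)

ls -l "$AUTH_RC"
```

You could then run a single download with WGETRC pointing at that file and delete the file once the transfer has started.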

With a little experimentation, wget can make your online life easier – and far more straightforward to automate.

Juliet Kemp

Juliet Kemp has been messing around with Linux systems, for financial reward and otherwise, for about a decade. She is also the author of Linux System Administration Recipes: A Problem-Solution Approach (Apress, 2009).
