Getting a full offline travel copy using wget on Mac OS
A guide to using standard wget to download a full offline archive of a website on macOS High Sierra, including a description of all the arguments used.
TL;DR: wget -r -E -k -p -np -nc --random-wait http://example.com
There are cases when you want to download a website as an offline copy. For me it's usually because it's a site I want to be able to use while I'm offline. That happens either by choice or because of planes that don't have Wi-Fi available (Gen-1 planes ;)).
When we download a website for local offline use, we need it in full: not just the HTML of a single page, but all required links, sub-pages, etc. Of course any cloud-based functionality won't work, but documentation sites in particular are usually mostly static.
wget is (of course) the tool for the job.
To install wget on your Mac (if you don't have it yet), you can use Homebrew in a Terminal window:
$ brew install wget
Just as with all terminal commands, you can get an overview of the (very extensive) list of options using:
$ wget --help
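And if you want to double-check that the install worked, wget can report its own version:
$ wget --version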
In general, when you want to use wget, you just use:
$ wget example.com
But that only gets you so far. You now have the HTML of the root page of that website, but you can't really use it to view the site on your laptop: you're missing all the supporting CSS files, images, JavaScript, etc. In addition, all the links still point to the old 'context' of the online website.
So we'll use some options to get wget to do what we want.
$ wget -r -E -k -p -np -nc --random-wait http://example.com
Let's go through those different command line options to see what they do:
-r or --recursive
Make sure wget uses recursion to download the sub-pages of the requested site.

-E or --adjust-extension
Let wget adjust the local file extension to match the type of data it received (based on the Content-Type HTTP header).

-k or --convert-links
Have wget rewrite the references in the pages to local references, for offline use.

-p or --page-requisites
Don't just follow links to other pages, but also download everything a page needs to display properly (images, stylesheets, scripts).

-np or --no-parent
Restrict wget to the provided sub-folder; don't let it escape to parent or sibling pages.

-nc or --no-clobber
Don't retrieve files that have already been downloaded.

--random-wait
Have wget wait a random interval between downloading different files, in order not to hammer the host server.
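To make this concrete, here's a sketch of how I'd grab just the documentation section of a site; example.com/docs/ is a placeholder for whatever docs you actually want offline. wget mirrors everything into a folder named after the host, so afterwards you can point your browser at the local copy:
$ wget -r -E -k -p -np -nc --random-wait http://example.com/docs/
$ open example.com/docs/index.html
Thanks to -np, wget stays inside the /docs/ sub-folder and won't crawl the rest of the site.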