Tools to Browse Any Website Offline

August 1, 2012


– Pushkar Gupta (Dept of CS & IS, BITS, Pilani) and Sumit Goswami (Dte of MIST, DRDO, New Delhi)

We often come across a website whose content seems interesting, or whose data may come in handy in the near future. As websites evolve, they are becoming richer in both content and size. Repeatedly browsing such websites to reach the required data is not only tedious but also somewhat uneconomical. Many tools and methods are available on both Windows and Linux that can download a website for offline viewing, or for generating a corpus for text mining and business-intelligence applications.

Snapshot

Applies to: Internet users, web developers

USP: Learn the various methods to download a website and make it available for offline viewing.

Primary Links: Securing Your Website's CMS http://ld2.in/486

Search engine keywords: offline browsing

For Linux users

1. Wget: This is probably one of the most useful non-interactive tools available for web crawling on the Linux platform. Its non-interactive nature allows the program to run in the background, so the user need not remain present. Other essential features include the ability to resume interrupted downloads and to work with HTML, CSS and XML pages. The recursion depth is configurable, so with a proper knowledge of its attributes one can easily customise a download and fetch exactly the type of files one needs. Wget can be downloaded from http://ftp.gnu.org/gnu/wget or ftp://ftp.gnu.org/gnu/wget (using FTP).

Using wget: The basic syntax of the wget utility is $ wget [attribute] [url of the website]. If the need arises, more than one attribute can be used in the same command. Attributes are preceded by '-' when used in the short form and by '--' when used in the long form. The various attributes that can be used are:

– -r: To download files from a website recursively. This attribute is very helpful when trying to download an entire website with its contents. E.g. $ wget -r www.abcd.com will download all the files from the website (www.abcd.com) one by one.

– -O: Used to specify the name under which the download is saved. By default wget names the file after the last component of the URL. E.g. $ wget -O page.html www.abcd.com will save the downloaded page as page.html.

– -b: Makes the program run in the background. Progress is written to a log file (wget-log by default), which can be checked with tail -f wget-log.

– --spider: This option checks whether the given URL exists without downloading it, and can also check the links contained in the page.

– --mirror: Downloads the website fully and makes it available for offline viewing. It combines well with --convert-links, which rewrites the saved pages so they link to each other locally.

– -Q: This option lets you set a download quota, so that recursive retrieval stops once the total size exceeds a limit. E.g. $ wget -Q10m -r www.abcd.com will stop downloading once 10 MB have been fetched.

– --tries: As mentioned earlier, wget retries a failed download several times by default; this option sets the maximum number of retries. E.g. $ wget --tries=100 www.abcd.com limits the number of attempts to 100.

– -l: This attribute specifies the number of levels to crawl. E.g. $ wget -r -l3 www.abcd.com tells wget to crawl up to three levels deep.

 

– Specify the file type: You can also specify the type of files to download; for example, you may only want images or PDFs. The syntax to be used is:

$ wget -r -A.extension www.abcd.com

Here extension is the extension of the files to be downloaded. E.g. $ wget -r -A.jpg www.abcd.com will download all the JPEG images from the website www.abcd.com.


– Download from a list of links: Put the links in a text file, one link per line, and let wget download the files with the command:

$ wget --input-file=filename

For example, if your links are in a file named links.txt, the command would be $ wget --input-file=links.txt.
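The last few attributes can be combined in one short sketch. Here www.abcd.com and the file names are placeholders, and the wget call itself is left commented out since it needs network access; only the preparation of links.txt actually runs:

```shell
# Write one URL per line into links.txt, the format --input-file expects:
printf '%s\n' \
  'http://www.abcd.com/page1.html' \
  'http://www.abcd.com/photos/pic1.jpg' > links.txt

cat links.txt

# Then let wget fetch every entry (up to 100 retries per file, 10 MB quota):
#   wget --input-file=links.txt --tries=100 -Q10m
```

The same links.txt can be reused across runs; files wget has already completed are simply fetched again unless -c (continue) is added.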

2. Curl: Curl is another tool that helps transfer data (documents/files) using HTTP, HTTPS and other supported protocols. Like wget, the command has a large number of tricks in its kitty, such as proxy support, FTP upload and many others. Working entirely without user interaction is one of curl's design features. All of curl's transfer-related features are provided by the 'libcurl' library. Curl can be downloaded from www.curl.haxx.se/download.html; at the time of writing, curl 7.26.0 is the latest version available (released on 24th May 2012). Other libraries that curl can use can be found at www.curl.haxx.se/docs/libs.html

Using curl: The general syntax of curl is curl [attributes] [URL]. By default the requested page (the HTML source code) is written to the terminal window. Like wget, multiple attributes can be used in the same command line, with short attributes preceded by '-' and long ones by '--'. Curl also displays information such as the amount of data transferred, the speed of transfer and the estimated time left. Multiple URLs can be specified on the command line, and curl will fetch them one after another, working out which protocol to use for each. Some useful attributes of curl are:

– --verbose: A very useful attribute that displays the commands exchanged between curl and the server. --verbose (or -v) comes in very handy for debugging and for understanding how curl works.

– --anyauth: This lets curl find out the appropriate authentication method on its own and use the most secure one that the accessed site supports.

– --connect-timeout: Specifies the maximum time, in seconds, that curl may spend establishing the connection, which stops curl from wasting time on unreachable servers. (To limit the duration of the whole transfer, use --max-time instead.)

– -C: This allows curl to continue/resume a transfer from a particular point. It skips the specified number of bytes of the source file and then continues the download; -C - tells curl to work the resume offset out by itself.

– -h: This provides help regarding the usage of curl. $ curl -h lists the options needed to make the most out of curl.
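These attributes are often combined in a small download loop. A minimal sketch, using placeholder URLs; the curl line is commented out so the logic can be followed without network access. Each URL's last path component becomes the local file name, and -C - would let curl resume a partial file:

```shell
for url in 'http://www.abcd.com/a.pdf' 'http://www.abcd.com/docs/b.pdf'
do
    # Derive the local file name from the last path component of the URL
    file=${url##*/}
    echo "would save $url as $file"
    # curl --connect-timeout 30 -C - -o "$file" "$url"
done
```

The ${url##*/} expansion strips everything up to the final '/', so no external tool is needed to name the output files.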


For those using Windows

Like Linux, Windows can also download a complete website. A large number of third-party software tools, both free and paid, are available. Some of the tools with the best operational capabilities are:

1. Wget: As seen earlier, wget is one of the most popular utilities for downloading websites. Given its importance, a Windows port that works in the same manner as wget on Linux is available and can be downloaded from www.users.ugent.be/~bpuype/wget. The version available on the website is 1.11.4; it works well on both Windows XP and Windows 7 and supports 32-bit as well as 64-bit systems. To use it, open the Command Prompt and change to the directory containing wget.exe, or better, add that directory to the Path variable under Advanced System Properties. Once the Path is set, wget can be used at the Command Prompt in exactly the same manner as described for Linux, with all its attributes and features.

2. HTTrack: HTTrack is a free web-crawling tool for Windows. Its easy-to-use GUI makes downloading websites a simple task. The various options let us download files conveniently and also specify the number of levels to crawl. The software is available for all versions of Windows (2000/XP/Vista/7) and can be downloaded from www.httrack.com/page/2/en/index.html

 

3. SurfOffline 2.1: SurfOffline is another Windows program for downloading websites. Apart from downloading all the files from a website, it also links the downloads together so the site is available offline as a complete unit; the entire website and the links it contains can be browsed just as if they were online. The software can be bought, or downloaded as a free trial, from www.surfoffline.com


4. Webripper: Webripper proves quite useful when the website to be downloaded has many levels and different types of content. It supports crawling up to five levels and, just like wget, lets the user choose which types of files to download. The software is freeware from Calluna Software and can be downloaded from www.calluna-software.com/Webripper

5. WINWSD Website Downloader: Another good program in terms of ease of use and functionality. Its features include automatically shutting the system down when a download completes, and letting users view the downloaded site from within the software. This free utility can be downloaded from www.download.cnet.com/WinWsd-website-downloader.

6. Custom Code: This is a Java program that extracts all the links from a web page in an orderly manner (one link per line). Once the links are extracted, the pages can easily be downloaded using wget. With minor tweaks the code can prove very useful for extracting and downloading websites. It requires an additional library, the jsoup package, which can be downloaded from www.jsoup.org. Copy the downloaded jar file to the lib folder of your JDK, or pass it on the classpath with -cp.

The following code extracts all links from the page www.google.com and prints them as a list.

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Links
{
    public static void main(String[] args)
    {
        new Links("http://www.google.com");
    }

    private Links(String url)
    {
        System.out.println("Links from " + url + "\n");
        Document doc;
        try
        {
            // Fetch the page and parse it into a DOM
            doc = Jsoup.connect(url).get();
        }
        catch (IOException e)
        {
            e.printStackTrace();
            return; // nothing to list if the fetch failed
        }
        // Select every anchor tag that has an href attribute
        Elements links = doc.select("a[href]");
        for (Element link : links)
        {
            // abs:href resolves relative links against the page URL
            System.out.println(link.text() + " -> " + link.attr("abs:href"));
        }
    }
}
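Once the class above is compiled, its output can be fed straight to wget. A minimal sketch, assuming the jsoup jar sits in the current directory and that each output line has the form 'text -> url' as printed by the code above; the network-dependent commands are shown as comments, and the Links output is simulated so the filtering step itself runs:

```shell
# Compile and run (requires jsoup on the classpath and network access):
#   javac -cp jsoup.jar Links.java
#   java  -cp .:jsoup.jar Links > raw.txt     (use ; instead of : on Windows)

# Simulated output of the Links class, standing in for raw.txt:
printf '%s\n' \
  'Images -> http://www.google.com/imghp' \
  'News -> http://news.google.com/' > raw.txt

# Keep only the URL part of each line, then hand the list to wget:
sed 's/.* -> //' raw.txt > links.txt
cat links.txt
#   wget --input-file=links.txt
```

The sed expression simply deletes everything up to the last ' -> ' separator, leaving one bare URL per line, which is exactly the format wget's --input-file option expects.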


Conclusion

We have seen a number of methods to download a website and make it available for offline viewing. On the one hand, Linux offers a set of command-line utilities which, despite being non-interactive, are very powerful; on the other, Windows provides a host of software with interactive ways to download a website. The article also showed how, using pre-defined library packages, we can write simple code to extract a website in whatever manner we desire. Used in the right combination, the above tools can download just about any website we want.
