April 13, 2002



In the story Get from Web. Stream on Intranet, page 34, we talked about a Java news parser program. This program retrieves news files from sites, parses them for news items (headlines) and displays the headlines. In this article we walk through the code that does the work.

Objective
Many news sites publish a file with the .rdf extension that contains their latest news. This file is in XML format. On the right is an extract from one such news file, slashdot.rdf, hosted at slashdot.org.

In the RDF file, the <title> </title> tags enclosed within <item> </item> tags contain the news headlines, like Domain Name Dispute, Process Called Into Question. The <link> </link> tags contain a URL that points to the full story for the corresponding headline. Our objective is to write a program that extracts the contents of the <title> </title> tags. We must make sure that we read only the <title> </title> tags enclosed within <item> </item> tags, because there is also a <title> </title> pair within the <image> </image> tags, which does not contain a news headline and is irrelevant to our use.

Set up tools
We have provided a Java source file named NewsParser.java on this month’s CD in the directory cdrom\Dev Lab\Source\newsparser. This file contains the Java code that accomplishes the above objective. The same directory also holds a file named sites.txt, which lists the URLs of the news files, one per line. The NewsParser application reads this file line by line and retrieves the news file from each URL. Copy these two files to a directory on your hard disk.

To compile NewsParser.java, you need the J2SDK (Java 2 Software Development Kit), also known as the JDK, and the Java XML Pack (a suite of Java API libraries for developing XML applications). The suite includes JAXP (Java API for XML Processing), which we need to process the retrieved RDF file. By processing we mean extracting the news headlines from within the <title> </title> tags.

Install the JDK (found in the Dev Labs>SDK/IDE section on this month’s CD). It will be installed in a directory named j2sdk1.4.0. Extract the zipped archive of the Java XML Pack, java_xml_pack-winter01_01-dev.zip, found in the same CD-ROM section, to your hard disk (say, c:\). Copy the file xerces.jar (found in the jaxp-1.2-ea1 subdirectory of the extracted archive) to the j2sdk1.4.0/lib directory. Open an MS-DOS window or command prompt, change to the directory where you copied NewsParser.java and sites.txt, and issue the following command (on a single line).

c:\j2sdk1.4.0\bin\javac -classpath c:\j2sdk1.4.0\lib\xerces.jar;. NewsParser.java

This compiles the program. Make sure you are connected to the Internet, and then run the program as:

c:\j2sdk1.4.0\bin\java NewsParser

After a couple of seconds, the news items are displayed on the screen, as shown in the screenshot.

The code behind
We assume that you are familiar with the basics of Java, so we will concentrate only on the part of the code that uses the XML API for parsing. The main() method calls getNews( ). The getNews( ) method reads the file sites.txt line by line for the news URLs. Each URL string is then passed to the retrieveRDF( ) method.
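The line-by-line reading of sites.txt can be sketched roughly as follows. This is a minimal illustration and not the code shipped in NewsParser.java: the class SiteListReader, the method readUrls() and the example.org URLs are hypothetical names made up for this sketch, and it uses try-with-resources and generics, so it assumes Java 7 or later rather than the JDK 1.4 used in the article.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class SiteListReader {

    // Hypothetical helper: collects one URL per line and skips blank lines.
    static List<String> readUrls(Reader source) throws IOException {
        List<String> urls = new ArrayList<String>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (!line.isEmpty()) {
                    urls.add(line);
                }
            }
        }
        return urls;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the file contents; with the real file the source
        // would be new FileReader("sites.txt").
        String sample = "http://example.org/a.rdf\n\nhttp://example.org/b.rdf\n";
        for (String url : readUrls(new StringReader(sample))) {
            // In the article's program each URL goes on to retrieveRDF().
            System.out.println(url);
        }
    }
}
```

Taking a Reader instead of a file name keeps the sketch runnable without a sites.txt on disk.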

The retrieveRDF( ) method retrieves the XML news file from the site and produces a DOM (Document Object Model) representation of the XML document. The following lines of code do this job.

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(rdfURL);
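For readers who want to try these three lines in isolation, here is a self-contained sketch with the imports and checked exceptions they need. The class RdfLoader, the method load() and the tiny XML string are assumptions made for illustration. The sketch parses from an in-memory string so that it runs without a network connection; for a live feed, new InputSource(rdfURL) would play the same role as db.parse(rdfURL).

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class RdfLoader {

    // Builds a DOM tree from an InputSource (hypothetical helper).
    static Document load(InputSource source)
            throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        DocumentBuilder db = dbf.newDocumentBuilder();
        return db.parse(source);
    }

    public static void main(String[] args) throws Exception {
        // Made-up stand-in for a downloaded news file.
        String xml = "<rdf><item><title>Sample headline</title></item></rdf>";
        Document doc = load(new InputSource(new StringReader(xml)));
        // Prints the name of the root element of the parsed tree.
        System.out.println(doc.getDocumentElement().getTagName());
    }
}
```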

There are two widely used specifications for XML parsing: SAX (Simple API for XML) and DOM. In DOM, all the elements of an XML document (<item>, <title>, <url>, etc.) are represented in a tree-like structure. We don’t have to worry about the intricacies of these parsing APIs, as the code that parses the XML news, in the method retrieveNewsItems( ), is simple.
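For contrast, the same headline extraction could be written against SAX, which reports elements as events while reading the file instead of building a tree. The sketch below is not part of NewsParser.java. The class name SaxHeadlines and the sample XML are made up, and it assumes the feed uses unprefixed item and title element names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxHeadlines extends DefaultHandler {
    private final List<String> headlines = new ArrayList<String>();
    private final StringBuilder text = new StringBuilder();
    private boolean inItem = false;
    private boolean inTitle = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equals("item")) {
            inItem = true;
        } else if (inItem && qName.equals("title")) {
            inTitle = true;          // only titles inside an item count
            text.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inTitle) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (inTitle && qName.equals("title")) {
            headlines.add(text.toString().trim());
            inTitle = false;
        } else if (qName.equals("item")) {
            inItem = false;
        }
    }

    // Parses the XML text and returns the headlines found inside items.
    static List<String> parse(String xml) throws Exception {
        SaxHeadlines handler = new SaxHeadlines();
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.headlines;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<rdf><image><title>Site logo</title></image>"
                   + "<item><title>First headline</title></item>"
                   + "<item><title>Second headline</title></item></rdf>";
        for (String headline : parse(xml)) {
            System.out.println(headline);
        }
    }
}
```

The SAX version has to track by hand whether it is inside an item, which the DOM tree gives us for free. The article's own code, in retrieveNewsItems( ), takes the DOM route.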

The first line

NodeList items = doc.getElementsByTagName("item");

gets a collection of all the <item> </item> tag pairs in the document. We then iterate through this collection with a for loop:

for (int i = 0; i < items.getLength(); i++)
{
    Element itemElement = (Element) items.item(i);
    NodeList titles = itemElement.getElementsByTagName("title");
    Node title = titles.item(0);
    NodeList titleValues = title.getChildNodes();
    Node titleValue = titleValues.item(0);
    System.out.println("" + titleValue.getNodeValue() + ".");
}

All the tags, such as <item>, <title> and <url>, are called elements in DOM lingo. During the first iteration of the for loop, with the line:

Element itemElement = (Element) items.item(i); 

we retrieve the first item element in the document. The second statement gets a collection of all the <title> </title> tag pairs within that <item> </item> pair. We know that there is exactly one <title> element within each <item> </item> tag pair. Hence we retrieve it as:

Node title = titles.item(0);
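To see why the search for <title> is run on each item element rather than on the whole document, the following sketch counts the title elements both ways. The class TitleScope, the method countTitles() and the sample XML are hypothetical and exist only for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TitleScope {

    // Returns { titles anywhere in the document, titles inside item elements }.
    static int[] countTitles(String xml) throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = db.parse(new InputSource(new StringReader(xml)));

        // Document-wide search: also picks up the title inside <image>.
        int everywhere = doc.getElementsByTagName("title").getLength();

        // Item-scoped search: only descendants of each <item> are considered.
        int insideItems = 0;
        NodeList items = doc.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            insideItems += item.getElementsByTagName("title").getLength();
        }
        return new int[] { everywhere, insideItems };
    }

    public static void main(String[] args) throws Exception {
        String xml = "<rdf><image><title>Site logo</title></image>"
                   + "<item><title>First headline</title></item>"
                   + "<item><title>Second headline</title></item></rdf>";
        int[] counts = countTitles(xml);
        // Prints: 3 title elements in the document, 2 inside item elements
        System.out.println(counts[0] + " title elements in the document, "
                + counts[1] + " inside item elements");
    }
}
```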

The news item that interests us lies within the <title> </title> tags. There is only one news item within each <title> </title> pair, held as a child (text) node of the title element. Hence we get a reference to the news item as:

NodeList titleValues = title.getChildNodes();
Node titleValue = titleValues.item(0);

Finally, we display this news item on the screen with the statement:

System.out.println("" + titleValue.getNodeValue() + ".");
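Put together, the steps above can be tried on a small in-memory document. The sketch below follows the same DOM calls as the walkthrough but adds checks for an item with a missing or empty title, which the walkthrough's loop assumes never happens. The class DomHeadlines, its method names and the sample XML are assumptions made for this illustration, not part of NewsParser.java.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomHeadlines {

    // Collects the text of the first <title> inside every <item>.
    static List<String> headlines(Document doc) {
        List<String> result = new ArrayList<String>();
        NodeList items = doc.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            NodeList titles = item.getElementsByTagName("title");
            if (titles.getLength() == 0) {
                continue;                      // item without a title
            }
            Node text = titles.item(0).getFirstChild();
            if (text == null || text.getNodeValue() == null) {
                continue;                      // empty title, nothing to show
            }
            result.add(text.getNodeValue());
        }
        return result;
    }

    static Document parse(String xml) throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return db.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<rdf><image><title>Site logo</title></image>"
                   + "<item><title>First headline</title></item>"
                   + "<item><title/></item>"
                   + "<item><title>Second headline</title></item></rdf>";
        for (String headline : headlines(parse(xml))) {
            System.out.println(headline + ".");
        }
    }
}
```

Run as is, this prints the two headlines and skips both the image title and the empty item.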

Collect news in a text file
If you want to save the extracted news in a text file, redirect the output of the NewsParser program to a text file:

c:\j2sdk1.4.0\bin\java NewsParser > news.txt

This will create a file news.txt, containing the news items, in your working directory. If you want to append more news to an existing news.txt file, then issue:

c:\j2sdk1.4.0\bin\java NewsParser >> news.txt
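If you would rather write the file from within Java than rely on shell redirection, something along these lines could work. It is only a sketch: the class HeadlineFile and the method save() are hypothetical and not part of NewsParser.java, and the append flag mirrors the difference between > and >> above.

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Arrays;
import java.util.List;

class HeadlineFile {

    // Writes one headline per line; append=false overwrites the file
    // (like >), append=true adds to the end of it (like >>).
    static void save(List<String> headlines, File target, boolean append) throws IOException {
        try (PrintWriter out = new PrintWriter(new FileWriter(target, append))) {
            for (String headline : headlines) {
                out.println(headline);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        File news = new File("news.txt");
        // Made-up headlines standing in for the parser's output.
        save(Arrays.asList("First headline.", "Second headline."), news, false);
        save(Arrays.asList("Third headline."), news, true);
        System.out.println("Wrote " + news.getAbsolutePath());
    }
}
```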

Shekhar Govindarajan
