In the story Get from Web. Stream on Intranet, page 34, we talked about a Java news parser program. This program retrieves news files from sites, parses them for news items (headlines) and displays the headlines. In this article, we walk through the code that does the work.
Objective
Many news sites publish a file with the .rdf extension that contains their latest news. This file is in XML format. On the right is an extract from one such news file, slashdot.rdf, hosted at slashdot.org.
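In the RDF file, each news item is enclosed in an <item> element, and its headline sits in the <title> element inside it. Our objective is to extract these headlines and display them.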
Set up tools
We have provided a Java source file named NewsParser.java on this month’s CD in the directory cdrom\Dev Lab\Source\newsparser. This file contains the Java code to accomplish the above objective. The same directory also holds a file named sites.txt, which lists the URLs of the news files, one per line. The NewsParser application reads this file line by line and retrieves the news file from each URL. Copy these two files to a directory on your hard disk.
To compile NewsParser.java, you need the J2SDK (Java 2 Software Development Kit), also known as the JDK, and the Java XML Pack (a suite of Java API libraries for developing XML applications). The suite includes JAXP (Java API for XML Processing), which we need to process the retrieved RDF file. By processing, we mean extracting the news headline within the <title> element of each news item.
Install the JDK (found in the Dev Labs>SDK/IDE section on this month’s CD). It will be installed in a directory named j2sdk1.4.0. Extract the Java XML Pack archive, java_xml_pack-winter01_01-dev.zip, found in the same CD-ROM section, to your hard disk (say, c:\). Copy the file xerces.jar (found in the subdirectory jaxp-1.2-ea1 of the extracted archive) to the j2sdk1.4.0\lib directory. Open an MS-DOS window or command prompt, change to the directory where you copied NewsParser.java and sites.txt, and issue the following command (typed on a single line):
c:\j2sdk1.4.0\bin\javac -classpath c:\j2sdk1.4.0\lib\xerces.jar;. NewsParser.java
This compiles the program. Make sure you are connected to the Internet and run the program as:
c:\j2sdk1.4.0\bin\java NewsParser
After a couple of seconds news items will be displayed on the screen as shown in the screenshot.
The code behind
We assume that you know the basics of Java, so we will concentrate only on the part of the code that uses the XML API for parsing. The main() method calls getNews( ). The getNews( ) method reads the file sites.txt line by line to get each news URL. Subsequently, it passes the URL string to the retrieveRDF( ) method.
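The line-by-line reading described above might look roughly like the following sketch. It is not taken from NewsParser.java: the class name SiteListSketch, the method readUrls and the example.org URLs are made up for illustration, and the code is written for a current JDK (generics, try-with-resources) rather than J2SDK 1.4.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; the real getNews() in NewsParser.java may differ.
class SiteListSketch {

    // Reads one URL per line and skips blank lines.
    static List<String> readUrls(Reader source) {
        List<String> urls = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (!line.isEmpty()) {
                    urls.add(line);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return urls;
    }

    public static void main(String[] args) {
        // Stand-in for sites.txt; a real run would pass new FileReader("sites.txt").
        String sample = "http://example.org/a.rdf\n\nhttp://example.org/b.rdf\n";
        for (String url : readUrls(new StringReader(sample))) {
            // In the article's program, this is where retrieveRDF( ) would be called.
            System.out.println("Would fetch: " + url);
        }
    }
}
```

Taking a Reader instead of a file name is a choice made here so that the sketch can be tried without a sites.txt on disk.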
The retrieveRDF( ) method retrieves the XML news file from the site and produces a DOM (Document Object Model) representation of the XML document. The following lines of code do this job.
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(rdfURL);
There are two widely used specifications for XML parsing: SAX (Simple API for XML) and DOM. In DOM, all the elements of an XML document are represented as nodes of a tree held in memory, which the program can then traverse.
The first line
NodeList items = doc.getElementsByTagName("item");
gets a collection of all the <item> elements in the document. We then loop through this collection:
for (int i = 0; i < items.getLength(); i++)
{
    Element itemElement = (Element) items.item(i);
    NodeList titles = itemElement.getElementsByTagName("title");
    Node title = titles.item(0);
    NodeList titleValues = title.getChildNodes();
    Node titleValue = titleValues.item(0);
    System.out.println("" + titleValue.getNodeValue() + ".");
}
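All the tags like <item> and <title> are represented as element nodes in the DOM tree. With the statement
Element itemElement = (Element) items.item(i);
we retrieve the current item element (on the first pass through the loop, the first item element in the document). With the second statement, we again get a collection, this time of all the <title> elements within that item. We take the first of them with: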
Node title = titles.item(0);
Now, within the <title> element, the headline text is itself a child node. We get it with:
NodeList titleValues = title.getChildNodes();
Node titleValue = titleValues.item(0);
Finally, we display this news item on the screen with the statement:
System.out.println("" + titleValue.getNodeValue() + ".");
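To try the parsing steps in isolation, the fragments above can be assembled into a small standalone program. The following is only a sketch under a few assumptions: the class name HeadlineSketch, the method extractTitles and the inline sample document are invented for illustration (the sample is a simplified stand-in, not real slashdot.rdf content), the input is parsed from a string instead of a URL so that no Internet connection is needed, and the code targets a current JDK rather than J2SDK 1.4. It also skips items that have no <title> or an empty one, a check the listing above does not make.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical class; it mirrors the article's loop but is not NewsParser.java.
class HeadlineSketch {

    // Parses the XML text into a DOM tree and collects the text of the
    // first <title> inside every <item>.
    static List<String> extractTitles(String xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            DocumentBuilder db = dbf.newDocumentBuilder();
            Document doc = db.parse(new InputSource(new StringReader(xml)));

            List<String> headlines = new ArrayList<>();
            NodeList items = doc.getElementsByTagName("item");
            for (int i = 0; i < items.getLength(); i++) {
                Element itemElement = (Element) items.item(i);
                NodeList titles = itemElement.getElementsByTagName("title");
                if (titles.getLength() == 0) {
                    continue; // this item has no <title>
                }
                // The headline text is the first child node of <title>.
                Node titleValue = titles.item(0).getFirstChild();
                if (titleValue != null) {
                    headlines.add(titleValue.getNodeValue());
                }
            }
            return headlines;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the document", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample; the <title> under <channel> is outside any <item>
        // and is therefore not reported as a headline.
        String sample =
            "<rdf><channel><title>Sample feed</title></channel>"
            + "<item><title>First headline</title><link>http://example.org/1</link></item>"
            + "<item><title>Second headline</title></item>"
            + "<item><link>http://example.org/3</link></item></rdf>";
        for (String headline : extractTitles(sample)) {
            System.out.println(headline + ".");
        }
    }
}
```

Run as is, the sketch prints the two sample headlines, each followed by a full stop. To parse a live news file instead, the URL string could be passed straight to db.parse( ), as retrieveRDF( ) does.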
Collect news in a text file
If you want to save the extracted news in a text file, redirect the output of the NewsParser program to a text file:
c:\j2sdk1.4.0\bin\java NewsParser > news.txt
This will create a file news.txt, containing the news items, in your working directory. If you want to append more news to an existing news.txt file, then issue:
c:\j2sdk1.4.0\bin\java NewsParser >> news.txt
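If redirection is not convenient, much the same effect could be had from within Java. This is a different technique from the shell redirection shown above, and the sketch below is not part of NewsParser.java: the class NewsFileSketch and the method saveHeadline are hypothetical names, and the code assumes a current JDK. The boolean flag passed to FileWriter chooses between overwriting the file (like >) and appending to it (like >>).

```java
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.UncheckedIOException;

// Hypothetical helper for illustration only.
class NewsFileSketch {

    // append == false overwrites the file (like ">");
    // append == true adds to the end of it (like ">>").
    static void saveHeadline(String fileName, String headline, boolean append) {
        try (PrintWriter out = new PrintWriter(new FileWriter(fileName, append))) {
            out.println(headline);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Creates (or overwrites) news.txt in the working directory, then appends to it.
        saveHeadline("news.txt", "First sample headline.", false);
        saveHeadline("news.txt", "Second sample headline.", true);
    }
}
```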
Shekhar Govindarajan