Machine Learning Using WEKA

PCQ Bureau

03 Sep 2009 11:57 IST

We deal with gigabytes of data in our lives, and the volume keeps growing with
each passing day. The gap between the data generated and the data analyzed is
growing too, so you ought to look for techniques that make things easier.
Machine learning is one such technique: it searches a very large space of
possible hypotheses to determine the one that best fits the observed data and
any prior knowledge held by the learning system. Data mining augments the
search for, and understanding of, electronically stored data.

What is WEKA?

Waikato Environment for Knowledge Analysis (WEKA), developed at the University
of Waikato, New Zealand, is a collection of machine learning algorithms with

data preprocessing tools to provide input to these algorithms. The tool was

developed in Java and runs on Linux as well as Windows. It can also be used to

develop and analyze new machine learning algorithms. It is open source software,
distributed under the terms of the GNU General Public License. The input to the
machine learning algorithms takes the form of a relational table in the ARFF
format. Weka comes with API documentation generated using Javadoc. More

details on Weka and its usage are available across a few chapters in the book

written by Ian H Witten and Eibe Frank, 'Data Mining: Practical Machine Learning

Tools and Techniques,' 2nd Edition, Morgan Kaufmann Series, San Francisco, 2005.

How it helps

Some key features of WEKA include:

Preprocess — Weka has file format converters for spreadsheets, C4.5

file formats and serialized instances. It can also open a URL and use HTTP to

download an ARFF file from the Web or open a database using JDBC, and retrieve

instances using SQL. It also provides a list of filters to delete specified

attributes from a dataset.

Direct Hit!

Applies To: Researchers in Data Mining and Artificial Intelligence

USP: Applying machine learning algorithms for data mining

Primary Link: www.cs.waikato.ac.nz/ml/weka

Search Engine Keywords: Machine Learning, Data Mining

Classify — Weka trains and tests learning schemes that perform

classification or regression. The classifiers are grouped into Bayesian, trees,
rules, functions and lazy. Weka can also build a linear regression model, lets
users build their own classifiers interactively, and provides options for a
number of meta learners.

Cluster — Weka shows the clusters and the number of instances in the

cluster. Thereafter it determines the majority class in each cluster and gives

the confusion matrix.

Associate — Weka contains three algorithms for determining

association rules: Apriori, Predictive Apriori and filtered associators.

It has no methods for evaluating such rules.

Attribute Selection — Weka gives access to several methods for

attribute selection, which involves an attribute evaluator and a search method.

Attribute selection can be performed using the full training set or

cross-validation.


Figure: In the Preprocess tab, you can view the attributes in the input file, the properties of the selected attribute, and a visualization of the class distribution for each attribute.

Figure: Building a Naïve Bayes classifier with 10-fold cross-validation. The correctly classified instances can be viewed by right-clicking on Classifier in the Results window.

Visualization - It displays a matrix of two-dimensional scatter plots

of each pair of attributes.
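
The cluster evaluation mentioned under Cluster above (finding the majority class in each cluster and tabulating a confusion matrix) can be sketched in a few lines of Python. This is only an illustration of the idea and not Weka's own code; the function name, cluster ids and age-group labels are made up for the example.

```python
from collections import Counter, defaultdict


def classes_to_clusters(cluster_ids, class_labels):
    """Map each cluster to its majority class and build a confusion matrix."""
    # Count how often each class label occurs inside each cluster.
    counts = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, class_labels):
        counts[cluster][label] += 1

    # The majority class of a cluster is its most frequent label.
    majority = {c: ctr.most_common(1)[0][0] for c, ctr in counts.items()}

    # confusion[(actual, assigned)] = number of instances with that pairing.
    confusion = Counter()
    for cluster, label in zip(cluster_ids, class_labels):
        confusion[(label, majority[cluster])] += 1
    return majority, confusion


# Hypothetical clustering of seven instances with known age-group labels.
clusters = [0, 0, 0, 1, 1, 1, 1]
labels = ["teen", "teen", "adult", "adult", "adult", "adult", "teen"]
majority, confusion = classes_to_clusters(clusters, labels)
print(majority)
print(dict(confusion))
```

Instances whose actual class differs from the majority class of their cluster show up in the off-diagonal entries of the matrix.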

Preparing input

A major part of the effort in data mining/machine learning goes into the
preparation of input. In order to analyze data using Weka, you need to prepare
it in the Attribute-Relation File Format (ARFF) and then load it in the
Explorer. Spreadsheets, Comma Separated Value (CSV) files and databases can be
converted to ARFF. In ARFF, the @relation, @attribute and @data tags represent
the dataset name, the attribute information and the values
respectively.
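
To make the three ARFF tags concrete, here is a minimal Python sketch that turns CSV text with a header row into ARFF text. It is not Weka's own converter: the helper names, the word attributes and the sample rows are hypothetical, and it handles only numeric and nominal columns, without missing values or quoting.

```python
import csv
import io


def is_number(value):
    """Return True if the string parses as a number."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def csv_to_arff(csv_text, relation):
    """Convert CSV text (first row = attribute names) into ARFF text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]

    lines = [f"@relation {relation}", ""]
    for i, name in enumerate(header):
        column = [row[i] for row in data]
        if all(is_number(v) for v in column):
            # Every value is a number: declare a numeric attribute.
            lines.append(f"@attribute {name} numeric")
        else:
            # Otherwise list the distinct values as a nominal attribute.
            values = ",".join(sorted(set(column)))
            lines.append(f"@attribute {name} {{{values}}}")

    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines) + "\n"


# Hypothetical word counts per 10,000 words, with an age-group class.
sample = "lol,homework,mortgage,age_group\n12,30,0,teen\n1,0,25,adult\n"
arff = csv_to_arff(sample, "age")
print(arff)
```

The printed text has one @relation line, one @attribute line per column, and the rows unchanged under @data.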

Classifying data

Weka is best used through its graphical user interface, called 'Explorer',
rather than through the command-line interface. The other two interfaces are
the 'Knowledge Flow Interface,' which supports design configuration for
streamed data processing, and the 'Experimenter,' which helps users compare a
variety of learning techniques. In this example, we use an ARFF file named
age.arff, in which the attributes are a few selected words and the @data
section contains their number of occurrences per 10,000 words in a dataset of
blogs written by bloggers belonging to various age groups.

1. Open the file you want to analyze using the Open file option in the
Preprocess tab of the Weka Explorer, i.e. open the age data file, age.arff.

2. Once the input file has been opened, all its attributes are shown in the
Attributes window. Properties of the selected attribute, such as attribute
name, attribute type and number of missing values, are displayed in the
'Selected Attribute' window. Here, you can select the attributes that you want
to include in the working relation, e.g. for age prediction.

3. Select the classifier algorithm in the Classify tab. In this example, we
selected Naïve Bayes with 10-fold cross-validation. Next, click on Start. The
result is displayed in the Classifier Output window, as shown in the figure.
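
For readers unfamiliar with the 10-fold cross-validation selected in step 3, the following Python sketch shows how a dataset can be split into ten train/test partitions. It is a simplified illustration with a hypothetical function name, not what the Explorer runs internally; a real run would typically also shuffle, and often stratify, the data first.

```python
def k_fold_indices(n_instances, k=10):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_instances))
    # Fold i takes every k-th instance, starting at offset i.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        # All remaining folds together form the training set.
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


# Split a hypothetical dataset of 25 instances into 10 folds.
splits = list(k_fold_indices(25, k=10))
for train, test in splits:
    print(len(train), "training instances,", len(test), "test instances")
```

Each instance is used for testing exactly once and for training in the other nine rounds, and the reported performance is aggregated over all ten rounds.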


Figure: Output of the Naïve Bayes classifier in terms of errors, accuracy by class and confusion matrix, on the age dataset.

Figure: View of an ARFF dataset, which consists of a list of instances, with the attribute values for each instance separated by commas.

Analyzing the result

The result displays a summary of the dataset, followed by the algorithm used
to analyze it. It also gives the predictive performance of the machine learning
algorithm applied to the dataset. The confusion matrix then shows the number of
instances classified correctly and the number misclassified. The classification
error is reported as the mean absolute error and the root mean squared error of
the class probability
estimates.
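
The quantities named above can be reproduced by hand on a toy example. The Python sketch below computes a confusion matrix, the accuracy, and the mean absolute and root mean squared errors of the class probability estimates, using one common definition (errors averaged over every instance and class). Weka's exact formulas may differ in detail, and the function name and all probabilities here are invented.

```python
import math


def evaluate(actual, prob_estimates, classes):
    """Summarize predictions given per-class probability estimates."""
    confusion = {a: {p: 0 for p in classes} for a in classes}
    abs_err = sq_err = 0.0
    for truth, probs in zip(actual, prob_estimates):
        # The predicted class is the one with the highest probability.
        predicted = max(classes, key=lambda c: probs[c])
        confusion[truth][predicted] += 1
        for c in classes:
            # Ideal estimate: 1 for the true class, 0 for the others.
            target = 1.0 if c == truth else 0.0
            diff = probs[c] - target
            abs_err += abs(diff)
            sq_err += diff * diff
    terms = len(actual) * len(classes)
    correct = sum(confusion[c][c] for c in classes)
    return {
        "accuracy": correct / len(actual),
        "mae": abs_err / terms,
        "rmse": math.sqrt(sq_err / terms),
        "confusion": confusion,
    }


# Hypothetical class probability estimates for four instances.
classes = ["teen", "adult"]
actual = ["teen", "teen", "adult", "adult"]
estimates = [
    {"teen": 0.9, "adult": 0.1},
    {"teen": 0.4, "adult": 0.6},
    {"teen": 0.2, "adult": 0.8},
    {"teen": 0.3, "adult": 0.7},
]
result = evaluate(actual, estimates, classes)
print(result)
```

Here three of the four instances are classified correctly, and the one misclassified 'teen' instance appears in the 'adult' column of the 'teen' row of the confusion matrix.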

Processing huge datasets

If the dataset is very large, running to a few thousand attributes and a few
hundred thousand records, Weka can run into an 'OutOfMemory' exception. Most
Java virtual machines allocate Java programs a maximum amount of memory that is
much less than the available RAM. However, we can extend the memory available
to the virtual machine by setting appropriate options, such as the JVM's -Xmx
maximum heap size option. Alternatively, Weka offers several filters for
re-sampling a dataset and generating a new, smaller dataset. Besides, some
schemes can be trained incrementally rather than only in batch mode, unlike
most classifiers, which require all the data before they can be trained. With
such a scheme, the dataset is loaded incrementally and fed instance by
instance to the classifier.
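
The idea of incremental training can be illustrated with a small Python sketch that reads ARFF data one line at a time and updates the counts a Naïve Bayes style model would need, without ever holding the whole dataset in memory. This is not how Weka implements it: the class and function names and the tiny weather dataset are hypothetical, all attributes are assumed to be nominal, and the last attribute is assumed to be the class.

```python
import io
from collections import Counter, defaultdict


class IncrementalCounts:
    """Holds class and attribute-value counts, updated per instance."""

    def __init__(self):
        self.class_counts = Counter()
        # (attribute index, class label) -> Counter of attribute values
        self.value_counts = defaultdict(Counter)

    def update(self, values, label):
        self.class_counts[label] += 1
        for i, value in enumerate(values):
            self.value_counts[(i, label)][value] += 1


def train_from_stream(lines, model):
    """Feed the @data rows of an ARFF stream to the model one by one."""
    in_data = False
    for line in lines:
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and ARFF comments
        if line.lower() == "@data":
            in_data = True
            continue
        if in_data:
            # Assumption: the last attribute is the class.
            *values, label = line.split(",")
            model.update(values, label)
    return model


# Hypothetical ARFF content; a real file would be opened and read lazily.
stream = io.StringIO(
    "@relation weather\n"
    "@attribute outlook {sunny,rainy}\n"
    "@attribute windy {yes,no}\n"
    "@attribute play {yes,no}\n"
    "@data\n"
    "sunny,no,yes\n"
    "sunny,yes,no\n"
    "rainy,no,yes\n"
)
model = train_from_stream(stream, IncrementalCounts())
print(model.class_counts)
```

Because only the running counts are kept, memory use depends on the number of distinct attribute values and classes rather than on the number of records.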

Conclusion

It is difficult for a single machine learning tool to suit all data mining
requirements, and the universal learner is still a distant dream. In order to
obtain an accurate model of real datasets, the learning algorithm must match
the domain. Data mining is an experimental science, and Weka provides a
workbench of data preprocessing tools and machine learning algorithms for it.
Weka helps in realizing the goal of data mining by predicting missing values
and validating that the predicted values are correct.

Abhinav Gupta & Sumit Goswami
