We deal with gigabytes of data in our lives, and the volume keeps increasing
with each passing day. The gap between the data generated and the data analyzed
is also growing, so you ought to look for techniques that make things easier.
Machine learning is one such technique: it searches a very large space of
possible hypotheses to determine the one that best fits the observed data and
any prior knowledge held by the learning system. Data mining augments the
search and understanding of electronically stored data.
What is WEKA?
Waikato Environment for Knowledge Analysis (WEKA), developed at the University
of Waikato, New Zealand, is a collection of machine learning algorithms with
data preprocessing tools to provide input to these algorithms. The tool was
developed in Java and runs on Linux as well as Windows. It can also be used to
develop and analyze new machine learning algorithms. It is open source software,
distributed under the terms of the GNU General Public License. The input to the
machine learning algorithms is a relational table in the ARFF
format. Weka comes with API documentation generated using Javadoc. More
details on Weka and its usage are available across a few chapters in the book
written by Ian H Witten and Eibe Frank, 'Data Mining: Practical Machine Learning
Tools and Techniques,' 2nd Edition, Morgan Kaufmann Series, San Francisco, 2005.
How it helps
Some key features of WEKA include:
Preprocess — Weka has file format converters for spreadsheets, C4.5
file formats and serialized instances. It can also open a URL and use HTTP to
download an ARFF file from the Web or open a database using JDBC, and retrieve
instances using SQL. It also provides a list of filters to delete specified
attributes from a dataset.
Direct Hit!
Applies To: Researchers in Data Mining and Artificial Intelligence
USP: Applying machine learning algorithms for data mining
Primary Link: www.cs.waikato.ac.nz/ml/weka
Search Engine Keywords: Machine Learning, Data Mining
Classify — Weka trains and tests learning schemes that perform
classification or regression. The classifiers are grouped into Bayesian,
trees, rules, functions and lazy. Weka can also build a linear regression
model, lets users build their own classifiers interactively and provides
options for a number of meta learners.
Cluster — Weka shows the clusters and the number of instances in each
cluster. Thereafter it determines the majority class in each cluster and gives
the confusion matrix.
Associate — Weka contains three algorithms for determining
association rules: Apriori, Predictive Apriori and Filtered Associator.
It has no methods for evaluating such rules.
Attribute Selection — Weka gives access to several methods for
attribute selection, which involves an attribute evaluator and a search method.
Attribute selection can be performed using the full training set or
cross-validation.
In the Preprocess tab, you can view attributes in the input file, properties of the selected attribute, and visualisation of class distribution for each attribute.
Building a Naïve Bayes Classifier with 10-fold cross-validation. The correctly classified instances can be viewed by right-clicking on Classifier in the Results Window.
Visualization - Weka displays a matrix of two-dimensional scatter plots,
one for each pair of attributes.
Preparing input
A major part of the effort in data mining/machine learning goes into the
preparation of input. In order to analyze data using Weka, you need to prepare
it in the Attribute Relation File Format (ARFF) and then load it in the
Explorer. Spreadsheets, Comma Separated Value (CSV) files and databases can be
converted to ARFF. An ARFF file uses the @relation tag to name the dataset,
@attribute tags to declare each attribute and its type, and the @data tag to
introduce the values, one instance per line.
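To make the three tags concrete, here is a minimal sketch in Python (not part of Weka, which is a Java tool) that writes a tiny table out as ARFF text. The relation name, the word attributes and the age-group values are made up for illustration, and the helper to_arff is hypothetical; it is not how Weka's own converters work.

```python
import io

def to_arff(relation, attributes, rows):
    """Render a small table as ARFF text.

    attributes: list of (name, kind) pairs, where kind is either the
    string 'numeric' or a list of allowed nominal values.
    rows: one tuple of values per instance, in attribute order.
    """
    out = io.StringIO()
    out.write("@relation %s\n\n" % relation)
    for name, kind in attributes:
        if isinstance(kind, str):
            out.write("@attribute %s %s\n" % (name, kind))
        else:
            # Nominal attributes list their allowed values in braces.
            out.write("@attribute %s {%s}\n" % (name, ",".join(kind)))
    out.write("\n@data\n")
    for row in rows:
        out.write(",".join(str(value) for value in row) + "\n")
    return out.getvalue()

# Hypothetical word-frequency attributes and age groups (illustration only).
attributes = [
    ("homework", "numeric"),
    ("office", "numeric"),
    ("age_group", ["teens", "twenties", "thirties"]),
]
rows = [
    (12, 0, "teens"),
    (1, 9, "thirties"),
]
arff_text = to_arff("age", attributes, rows)
print(arff_text)
```

The sketch skips quoting of names with spaces and the marker for missing values, so treat it as a way to read the format rather than a full converter.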
Classifying data
Weka is best used through its graphical user interface, called 'Explorer',
rather than the command-line interface. The other two interfaces are
'Knowledge Flow Interface,' which supports design configuration for streamed
data processing and 'Experimenter,' which helps users compare a variety of
learning techniques. In this example, we use an ARFF file named age.arff. Its
attributes are a few selected words, and its @data section contains the number
of occurrences of each word per 10,000 words in a dataset of blogs written by
bloggers belonging to various age groups.
1. Open the file you want to analyze using the 'Open file' option in the
Preprocess tab of the Weka Explorer, ie open the age data file, age.arff.
2. Once the input file has been opened, all attributes in the input file are
shown in the Attributes Window. Properties of the selected attributes like
Attribute Name, Attribute Type, number of missing values, etc are displayed in
the 'Selected Attribute' window. Here, you can select attributes that you want
to include in working relations, eg age prediction.
3. Select the classifier algorithm in the Classify tab. In this example, we
selected Naïve Bayes with 10-fold cross-validation. Next, click on Start. The
result is displayed in the Classifier Output window, as shown in the figure.
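If 10-fold cross-validation is new to you, the idea is that the data is split into ten parts, and each part is used once for testing while the other nine are used for training. The Python sketch below shows only that splitting step. The function name k_fold_indices and the seed are our own choices, and the sketch does not balance the classes across folds, which a real tool may do.

```python
import random

def k_fold_indices(n_instances, k=10, seed=1):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# 25 instances is an arbitrary size chosen for the illustration.
splits = list(k_fold_indices(25, k=10))
for train, test in splits:
    print(len(train), "training instances,", len(test), "test instances")
```

Every instance lands in exactly one test fold, so the performance figures reported at the end are based on predictions for the whole dataset, none of them made by a model that saw the instance during training.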
Output of the Naïve Bayes Classifier in terms of errors, accuracy by class and confusion matrix, on Age dataset.
View of an ARFF dataset which consists of a list of instances, and the attribute values for each instance separated by commas.
Analyzing the result
The result displays a summary of the dataset followed by the algorithm
used to analyze it. It also gives the predictive performance of the
machine-learning algorithm applied to the dataset. The confusion
matrix shows the number of instances classified correctly and the number
misclassified, broken down by class. The classification error is reported as
the mean absolute error and the root mean squared error of the class
probability estimates.
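The figures in the output are easier to read once you have computed them by hand. The Python sketch below builds a confusion matrix and the two error measures from a handful of made-up predictions. It assumes the errors are averaged over all instances and all classes, comparing each estimated class probability with 1 for the true class and 0 for the others; the labels and probabilities are invented for illustration and are not taken from the age dataset.

```python
import math

def evaluate(class_labels, actual, predicted_probs):
    """Return (confusion matrix, accuracy, MAE, RMSE).

    actual: list of true labels.
    predicted_probs: list of dicts mapping each label to its estimated probability.
    """
    index = {label: i for i, label in enumerate(class_labels)}
    n = len(class_labels)
    confusion = [[0] * n for _ in range(n)]  # rows: actual, columns: predicted
    abs_err = 0.0
    sq_err = 0.0
    for truth, probs in zip(actual, predicted_probs):
        predicted = max(class_labels, key=lambda c: probs[c])
        confusion[index[truth]][index[predicted]] += 1
        for c in class_labels:
            # Compare the estimate with the ideal 1/0 indicator for this class.
            diff = probs[c] - (1.0 if c == truth else 0.0)
            abs_err += abs(diff)
            sq_err += diff * diff
    total = len(actual) * n
    correct = sum(confusion[i][i] for i in range(n))
    return confusion, correct / len(actual), abs_err / total, math.sqrt(sq_err / total)

# Made-up labels and probability estimates (illustration only).
labels = ["young", "old"]
actual = ["young", "young", "old", "old"]
probs = [
    {"young": 0.9, "old": 0.1},
    {"young": 0.4, "old": 0.6},
    {"young": 0.2, "old": 0.8},
    {"young": 0.3, "old": 0.7},
]
confusion, accuracy, mae, rmse = evaluate(labels, actual, probs)
print(confusion, accuracy, mae, rmse)
```

Here one 'young' instance is predicted as 'old', so it shows up off the diagonal of the matrix, and three of the four instances are classified correctly.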
Processing huge datasets
If the dataset is very large, running to a few thousand attributes and a few
lakh (hundred thousand) records, Weka can run into an 'OutOfMemory' exception.
By default, most Java virtual machines cap the memory available to Java
programs at a maximum that is much less than the amount of RAM. However, we can
extend the memory available to the virtual machine by setting appropriate
options, such as the JVM's -Xmx maximum heap size setting.
Alternatively, Weka offers several filters for re-sampling a dataset and
generating a new dataset reduced in size. Besides, some schemes can be
trained incrementally rather than only in batch mode, unlike most
classifiers, which require all the data before they can be trained. With such a
scheme, the dataset is loaded incrementally and fed to the classifier
instance by instance.
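To see why incremental training keeps memory use low, consider the following Python sketch of a Naïve Bayes learner that only keeps counts and updates them one instance at a time. It is an illustration of the idea, not Weka's implementation; the class IncrementalNaiveBayes, the two nominal attributes and the tiny stream of instances are all hypothetical.

```python
from collections import defaultdict

class IncrementalNaiveBayes:
    """Naive Bayes over nominal attributes, updated one instance at a time."""

    def __init__(self):
        self.class_counts = defaultdict(int)
        # (attribute position, value, label) -> number of times seen
        self.value_counts = defaultdict(int)
        # attribute position -> set of distinct values seen so far
        self.values_seen = defaultdict(set)

    def update(self, instance, label):
        """Fold a single labelled instance into the counts."""
        self.class_counts[label] += 1
        for pos, value in enumerate(instance):
            self.value_counts[(pos, value, label)] += 1
            self.values_seen[pos].add(value)

    def predict(self, instance):
        """Return the most probable label, using add-one (Laplace) smoothing."""
        total = sum(self.class_counts.values())
        best_label, best_score = None, 0.0
        for label, count in self.class_counts.items():
            score = count / total
            for pos, value in enumerate(instance):
                seen = self.value_counts.get((pos, value, label), 0)
                score *= (seen + 1) / (count + len(self.values_seen[pos]))
            if best_label is None or score > best_score:
                best_label, best_score = label, score
        return best_label

def instance_stream():
    """Stand-in for reading a large file line by line (made-up data)."""
    # Attributes: (uses_slang, mentions_work), both hypothetical.
    yield ("yes", "no"), "teens"
    yield ("yes", "no"), "teens"
    yield ("no", "yes"), "thirties"
    yield ("no", "yes"), "thirties"
    yield ("yes", "yes"), "thirties"

model = IncrementalNaiveBayes()
for instance, label in instance_stream():
    model.update(instance, label)  # only the counts stay in memory

print(model.predict(("yes", "no")))
print(model.predict(("no", "yes")))
```

Because each instance is discarded after its counts are recorded, memory use depends on the number of distinct attribute values and classes, not on the number of records.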
Conclusion
It is difficult for a single machine learning tool to suit all data mining
requirements, and the universal learner is still a distant dream. In order to
obtain an accurate model of real datasets, the learning algorithm must match the
domain. Data mining is an experimental science, and Weka provides a workbench of
data preprocessing tools and machine learning algorithms for it. Weka helps in
realizing the goal of data mining, by predicting missing values and validating
that the predicted values are correct.
Abhinav Gupta & Sumit Goswami