Spectral Clustering Using WEKA for Big Data Analysis

Big Data mining using WEKA is the process of analysing data from different perspectives and summarising it into useful information

PCQ Bureau

Magesh Kasthuri, Senior Technical Consultant-Java-Technology Practices Group, Wipro & Dr B. Thangaraju, Talent Transformation, Wipro Technologies

Big data refers to data sets that are very large and complex, both structured and unstructured, and that are characterised by the three Vs: volume, velocity and variety. The challenge is to collect and store the data, filter it for reliability, search it for specific information, analyse it against a requirement and present the result in an easily understandable format. Data mining is the process of analysing existing data from different perspectives for a given problem in order to produce new information, typically a suggestion or a solution. This may help to increase revenue, cut costs or indicate directions for the growth of a business. Conventional data analysis methods are inadequate in most such cases. Because the volume of data is high, clustering, in which we group the data for predictive analysis, is one of the key processes. Spectral clustering is one such type of clustering, in which the resulting groups can be represented as a connected graph of data points that are related in all directions.

Big data and lead generation

In the data mining world, lead generation is a data searching technique used to collect relevant customer information (leads). One example of this technique is contextual advertising. You might have noticed that as soon as you search for something on Google, it displays advertisements or sponsored links along with the search results. These sponsored links are typically based on the search text, the logged-in user (for example, a Google account), location and browser, to name a few. Preparing customised advertisements and sponsored links in this way is called contextual advertising, and it is an example of lead generation. It is an easy and painless way of attracting users and cultivating prospective customers from them.

Lead nurturing

Once the leads are gathered by a suitable data collection algorithm (also called a lead nurturing technique), the raw leads are ready to be processed and distributed to advertisers. They can be processed manually or with data mining tools such as WEKA (Waikato Environment for Knowledge Analysis). WEKA is open source machine learning software written in Java, with user-friendly visualisation tools and algorithms for data analysis and predictive modelling. It is developed by the Machine Learning Group at the University of Waikato, New Zealand (http://www.cs.waikato.ac.nz/ml/weka/bigdata.html).

Data mining applications

Applications of data mining include predicting the effectiveness of procedures, analysing tests and results, and discovering relationships among historical and current data to predict trends in data flow and growth. The underlying databases normally hold huge amounts of information about users and their data, history and responses. Data mining techniques applied to these databases find relationships, which helps in studying progression and producing predictive results. In this article, we discuss a case study that shows how data mining helps to classify and analyse huge volumes of data with machine learning techniques.

Big data analysis involves several steps, starting with data collection and data cleansing, continuing through classification, and ending with pattern evaluation and trend report generation.

Data mining and retail industry

Predictive analytics and market basket analysis (MBA) are examples of how extensively and effectively retailers today use data-driven strategies to increase profits. Retailers have access to enormous quantities of customer data and to powerful statistical techniques and software for deriving actionable information. We take the data mining process in the retail industry as our example, where it is used to improve marketing strategy and find effective ways of expanding the business.
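To make market basket analysis concrete, here is a minimal Python sketch that computes the two standard measures, support and confidence, for pairs of items. The function name pair_rules and the sample baskets are hypothetical and are not taken from any retailer's data or from WEKA.

```python
from collections import Counter
from itertools import combinations


def pair_rules(baskets):
    """Return {(a, b): (support, confidence)} for the rule "a implies b".

    support    = share of all baskets that contain both a and b
    confidence = share of the baskets containing a that also contain b
    """
    total = len(baskets)
    item_count = Counter()
    pair_count = Counter()
    for basket in baskets:
        items = sorted(set(basket))  # ignore duplicates within one basket
        item_count.update(items)
        pair_count.update(combinations(items, 2))

    rules = {}
    for (a, b), together in pair_count.items():
        support = together / total
        rules[(a, b)] = (support, together / item_count[a])
        rules[(b, a)] = (support, together / item_count[b])
    return rules


# Hypothetical shopping baskets, for illustration only.
sample_baskets = [
    ["bread", "milk"],
    ["bread", "butter"],
    ["bread", "milk", "butter"],
    ["milk"],
]
for rule, (support, confidence) in sorted(pair_rules(sample_baskets).items()):
    print(rule, round(support, 2), round(confidence, 2))
```

A retailer would normally keep only the rules whose support and confidence exceed chosen thresholds, and would use a dedicated algorithm rather than this brute-force count once the number of items grows.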

Spectral Clustering

Spectral clustering is a graph-theoretic technique that modifies the metric used to compare data points, so that it captures a much more global notion of similarity than other clustering methods such as k-means. It is a widely used data mining technique for big data analysis, particularly in social computing, and was later introduced in other fields such as medical science, customer relationship management (CRM), retail, manufacturing and health care. Clustering the nodes of a graph is a useful general technique in mining large network data sets.
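The graph-based idea can be illustrated with a small Python sketch of the simplest case, splitting a graph into two clusters. It is only a sketch under stated assumptions: the graph is connected, undirected and has no self-loops, and the function name spectral_bipartition and the sample graph are hypothetical. It builds the graph Laplacian L = D - W implicitly, approximates its second eigenvector (the Fiedler vector) by power iteration, and splits the nodes by sign. A full spectral clustering implementation would normally use several eigenvectors and then run k-means on them.

```python
def spectral_bipartition(weights, iterations=200):
    """Split a connected, undirected graph into two clusters.

    weights is a symmetric matrix (list of lists) of edge weights with a
    zero diagonal. Returns one label (0 or 1) per node.
    """
    n = len(weights)
    degree = [sum(row) for row in weights]
    # Twice the largest degree is an upper bound on the eigenvalues of
    # L = D - W, so (shift * I - L) has only non-negative eigenvalues.
    shift = 2 * max(degree)

    def apply_shifted(x):
        # Computes (shift * I - L) x = shift * x - D x + W x.
        return [
            (shift - degree[i]) * x[i]
            + sum(weights[i][j] * x[j] for j in range(n))
            for i in range(n)
        ]

    x = [float(i) for i in range(n)]  # deterministic, non-constant start
    for _ in range(iterations):
        mean = sum(x) / n
        x = [v - mean for v in x]  # project out the constant eigenvector
        x = apply_shifted(x)
        norm = sum(v * v for v in x) ** 0.5
        x = [v / norm for v in x]

    # x now approximates the Fiedler vector; its signs give the clusters.
    return [0 if v < 0 else 1 for v in x]


# Hypothetical graph: two triangles (0-1-2 and 3-4-5) joined by edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
graph = [[0.0] * 6 for _ in range(6)]
for a, b in edges:
    graph[a][b] = graph[b][a] = 1.0

print(spectral_bipartition(graph))
```

The split falls on the single edge that joins the two triangles, which is the "global" view of similarity the text describes: nodes are grouped by how the whole graph connects them, not only by pairwise distance.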

We can understand this with the example of Facebook. Facebook shows suggested friends and suggested posts, which are typically based on clustered information about a user. This clustered information draws on location, age, school or college, and friend links, and it is mined to produce the suggested friends or suggested posts through predictive analysis.

WEKA

WEKA is a landmark system in the history of the data mining and machine learning research communities, and the toolkit has gained widespread adoption. It is open source, freely available and platform-independent. WEKA offers a range of algorithms that can be selected for different kinds of predictive results and clustering information, such as the SimpleKMeans clusterer and the ZeroR baseline classifier.
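For readers unfamiliar with k-means, the following Python sketch shows the basic loop such a clusterer runs: assign each point to its nearest centroid, then move each centroid to the mean of its points. This is not WEKA's implementation. The function name kmeans, the choice of the first k points as starting centroids and the sample data are all assumptions made for illustration.

```python
def kmeans(points, k, iterations=20):
    """Basic k-means on a list of equal-length numeric tuples.

    The first k points are used as starting centroids so that the result
    is reproducible; real tools usually pick starting points at random.
    """
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical two-dimensional customer data, for illustration only.
data = [(1, 1), (1.5, 2), (8, 8), (9, 9.5), (1, 0.5), (8.5, 9)]
labels, centroids = kmeans(data, 2)
print(labels)
print(centroids)
```

In WEKA the same choices, such as the number of clusters, are set through the clusterer's options rather than written by hand.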

Data mining using WEKA is the process of analysing data from different perspectives and summarising it into useful information. The non-trivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data is called knowledge discovery.

The WEKA tool accepts data as records. The data is evaluated and approved either for processing as the actual data set or for a further run in which a provisional training data set is prepared.

WEKA accepts input data in the ARFF file format, which can be prepared using WEKA itself.

Data Collection in WEKA

ARFF (Attribute-Relation File Format) is an ASCII text file format that describes a list of instances sharing a set of attributes (http://www.cs.waikato.ac.nz/ml/weka/arff.html). WEKA also provides import and export tools that convert data from other formats, such as MS Excel, databases and text files, into ARFF, which can then be used in WEKA for classification and attribute evaluation. Once the data is fed into WEKA and classified, we can obtain a clustered representation.
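As an illustration of the format, the Python sketch below writes a small ARFF document by hand. An ARFF file has a header with an @relation line and one @attribute line per column, followed by an @data section with one comma-separated record per line, and a missing value is written as a question mark. The function name to_arff, the relation name and the sample records are hypothetical. The sketch also assumes that names and values contain no spaces or commas, which would otherwise need quoting.

```python
def to_arff(relation, attributes, rows):
    """Render records as ARFF text.

    attributes is a list of (name, kind) pairs, where kind is either a
    type keyword such as "numeric" or a list of allowed nominal values.
    A value of None in a row is written as "?", ARFF's missing marker.
    """
    lines = ["@relation " + relation, ""]
    for name, kind in attributes:
        if isinstance(kind, str):
            lines.append("@attribute " + name + " " + kind)
        else:
            # Nominal attribute: list the allowed values in braces.
            lines.append("@attribute " + name + " {" + ",".join(kind) + "}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join("?" if v is None else str(v) for v in row))
    return "\n".join(lines) + "\n"


# Hypothetical customer records, for illustration only.
text = to_arff(
    "customers",
    [("age", "numeric"), ("region", ["north", "south"]), ("visits", "numeric")],
    [(34, "north", 12), (51, "south", None)],
)
print(text)
```

Saving the returned text to a file with the .arff extension gives an input that can then be opened in WEKA.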

Data mining and marketing analysis

From such analysis, the marketing team can derive the following ideas for identifying trends and improving market reach:

  • Quickly identify potential customers.
  • Mine customer data for insights that drive new strategies for customer acquisition, retention, campaign optimization and next best offers.
  • Send tailored recommendations to mobile devices at just the right time, while customers are in the right location to take advantage of offers.
  • Generate discount coupons at the point of sale based on the customer’s current and past orders, ensuring a higher redemption rate.
  • Recalculate entire risk portfolios on the fly and understand future possibilities to mitigate risk.
  • Analyze data from social media to detect new market trends and changes in demand.
  • Use click stream analysis and data mining to detect fraudulent behaviour.

Data collection methodology

Data can be collected in various ways. One is to identify openly available online data sources that were collected for experimental purposes. There are various data mining related study sites such as TechTarget, one of the largest lead generation providers in the IT industry. It has as many as 41 websites, which publish content freely to users. Registered users can freely browse and download articles of interest from these sites. TechTarget collects relevant data from users, which is later sold to advertisers based on interest (lead nurturing). It provides real-time big data for analysis, collected from forums covering technical, musical, social, cultural and historical interests. It also offers data mining reports that use spectral clustering and analytics processing to gather customer information by regional and cultural background. Studying such reports and results suggests that clustering algorithms are well suited to data mining when the scale and classification of the data are diverse and hard to control.

Conclusion

Data mining involves clustered information gathered as ‘raw data’ from customers through channels such as social networking, browsing trends, search trends and pages visited. Analysing such raw clusters of data, which are huge in volume, involves not only analytics algorithms and techniques but also filtering for the required preference sets that form the basis of data analysis. Clustering algorithms help here: we focus on our area of analysis, collect the required subset of the data gathered from lead generation, and process it to filter the preference set and produce the required results as reports, diagrams, trend analyses and statistical data points.
