Start-up Builds India's Largest Big Data Repository of Election Data

Vipul

Until now, elections in India have meant heat, dust, dirt, drama, traditional wisdom, opinion polls, speeches, processions, door-to-door visits, sweat and toil. The 2014 parliamentary elections registered two irreversible trends: a very large young voter base and advances in technology. Campaigning became smart, led by digital and social media technologies. Obama's 2008 presidential campaign ushered in the use of social media, and 2012 brought big data analytics to the forefront. The world's largest democracy went one step further by integrating social media and big data analytics for the first time.

Hyderabad-based Modak Analytics, a data analytics start-up, built India's first Big Data-based electoral data repository, covering 81.4 crore voters, for the just-concluded elections, a project any IT major would dream of. It pioneered the application of big data analytics to Indian politics, quietly generating actionable insights for its client.

"We were confronted with Volume, Variety and Velocity of Data (common for most Big Data problems). We knew difficulties and complexities" informed Milind Chitgupakar, chief analytics officer. Aarti Joshi, co-founder and executive Vice President from Modak Analytics added, "High Automation, Domain expertise powered with technology, tailor made proprietary tools developed were what put Modak ahead of our peers. Milind Further adds, "It's 12 month long 3240 hours of toil and 65 years of cumulative experience of 10 scientists in Data Analytics that made it possible."

Big data analytics helps design tailored communication to target a select few, rework advertisements and create innovative models for voter engagement, said Milind, who has 16 years of domain expertise and holds six patents. He has built some of the largest data warehousing platforms for credit bureaus, banks and insurance companies in the USA.

The massive exercise involved 81.4 crore voters, the largest electorate ever on the planet. By comparison, the USA has 19.36 crore voters, Indonesia 17.1 crore, Brazil 13.58 crore and the UK 4.55 crore.

Complexities included:

- 543 Parliamentary and 4120 assembly constituencies

- 9.3 lakh polling booths

- Voter Rolls in PDF in 12 languages

- 9 lakh PDFs, amounting to 2.5 crore pages to be deciphered

- Diverse range of Voter Names and Information
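
To give a feel for the scale these figures imply, here is a minimal back-of-the-envelope sketch. It uses only the numbers quoted in the article (1 crore = 10 million, 1 lakh = 100 thousand); the variable and function names are illustrative and not part of Modak's tooling. The averages it derives (roughly 875 voters per booth and about 28 pages per PDF) are simple divisions, not figures reported by the company.

```python
# Indian numbering units
CRORE = 10_000_000  # 1 crore = 10 million
LAKH = 100_000      # 1 lakh = 100 thousand

# Figures quoted in the article
voters = 81.4 * CRORE
booths = 9.3 * LAKH
pdf_files = 9 * LAKH
pdf_pages = 2.5 * CRORE


def scale_summary():
    """Derive simple averages from the quoted totals."""
    return {
        "voters_in_millions": voters / 1_000_000,
        "voters_per_booth": voters / booths,
        "pages_per_pdf": pdf_pages / pdf_files,
        "voters_per_page": voters / pdf_pages,
    }


for name, value in scale_summary().items():
    print(f"{name}: {value:,.1f}")
```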

The real challenge was extracting voter information from 2.5 crore PDF pages and transliterating it into English so that it could be fused with other sources. Technology was a big hurdle. The infrastructure, built especially for the project, included a 64-node Hadoop cluster, PostgreSQL and servers that processed a master file containing over 8 terabytes of data. Testing and validation was another big task. 'First of a kind' heuristic (machine learning) algorithms were developed to classify people based on name, geography and other attributes, which helps in identifying religion, caste and even ethnicity.

"Data from multiple sources like Census, Economic and Social surveys were mapped to polling booths. Simultaneously, external and propriety data sources had to be fused with individual voters' data. Because of this complex nature, no big IT company ever ventured into this", informed Aarti Joshi.

Modak Analytics developed potentially patentable proprietary technologies: 'Rapid Extraction, Transformation and Loading (RapidETL)', transliteration of Indian languages, and data dictionaries for all individual states (except Orissa). These were used to process unstructured data and transliterate it quickly, while fuzzy logic helped match records on Indian names, addresses and villages. RapidETL automated the transformation and merging of massive datasets, which significantly reduced time and cost. Further, math quants produced actionable information to develop individual booth strategies.
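
The article does not describe how the fuzzy matching works, so the following is only a minimal sketch of the general idea, using the `SequenceMatcher` class from Python's standard `difflib` module as a stand-in. The helper names, the 0.8 threshold and the sample names are all assumptions chosen for illustration; transliterated Indian names often differ by a letter or two, which is the case this kind of similarity score is meant to tolerate.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lower-case a name and collapse repeated whitespace."""
    return " ".join(name.lower().split())


def similarity(a, b):
    """Return a similarity score between 0.0 and 1.0 for two names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def best_match(query, candidates, threshold=0.8):
    """Return (candidate, score) for the closest candidate, or None if none reaches the threshold."""
    if not candidates:
        return None
    scored = [(c, similarity(query, c)) for c in candidates]
    candidate, score = max(scored, key=lambda pair: pair[1])
    return (candidate, score) if score >= threshold else None


# Hypothetical names, for illustration only.
known = ["Venkateswara Rao", "Lakshmi Devi", "Mohammed Irfan"]
print(best_match("Venkateshwara Rao", known))  # spelling variant of the first entry
print(best_match("Unrelated Name", known))     # nothing close enough
```

A production system would likely add phonetic rules and per-language spelling variants on top of a plain character-level score like this one.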

The Indian context is heterogeneous, diverse, non-uniform and complex, and its data is hard to collect, whereas in the USA high-quality data already exists. Finding and reaching a voter is easy and less expensive in the US, but in India it is a huge challenge. "Delhi and Andhra Pradesh have good quality electoral data, while UP comes last," said Milind.

Businesses can learn many analytics lessons from the project. The technology and methodologies built can be leveraged outside politics, in FMCG, retail, banking, insurance and telecoms, to micro-target specific areas and adopt data-driven approaches to marketing and advertising. The fuzzy logic matching technology could also be immensely useful to police in stopping criminal and terrorist activities.
