Do you ever wonder how Aadhaar data belonging to more than 1.32 billion Indian citizens is stored? How one million Aadhaar numbers are generated in a day by performing 600 trillion matches? How UIDAI (the Unique Identification Authority of India) handles 100 million authentications a day, each one establishing the identity of a person?
This article aims to answer these questions. Along the way, it outlines why Aadhaar was needed and the two essential tasks of UIDAI: enrollment and authentication. To handle and process this vast amount of data, UIDAI relies on an open scale-out architecture built from open-source software, cheap commodity hardware, and distributed computing.
Aadhaar a necessity?
The Indian Government was spending about 25 to 40 billion dollars on direct subsidies. According to the CIA World Factbook, the GDP of North Korea was 40 billion dollars for the year 2014.
We are spending the equivalent of North Korea’s GDP on direct subsidies.
The problem is not the subsidy, but its leakage. Most programs suffered because of ghost identities and duplicate identities. Indians did not have a standard identity document. They held many documents, such as a driving license, PAN card, or voter card, issued by central and state government authorities. Each of these was restricted to its own domain, so it was difficult to establish a person's identity with any one of them.
There was therefore a need for a document that could uniquely determine the identity of a person. Thus one of the most challenging projects ever undertaken was born: providing identification to one billion people, about one-sixth of the world's population.
Tasks performed by UIDAI
Two critical tasks performed by the UIDAI are enrollment and authentication. Enrollment is the process of providing a new Aadhaar number to a citizen. Authentication is the process of establishing the identity of a person. Both are entirely different beasts with their peculiar challenges.
Enrollment is an asynchronous process. An Aadhaar number is not provided instantaneously; it is generated some days after the data is collected. Processing each enrollment requires matching ten fingerprints, both irises, and demographics against every existing record in the database. Currently, UIDAI is generating one million Aadhaar numbers a day. With the Aadhaar database at 600 million records, processing one million enrollments a day translates to roughly 600 trillion matches every day.
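The arithmetic behind that figure can be checked with a short Python sketch. It uses only the two numbers quoted above and makes a simplifying assumption: each new enrollment counts as one match against each existing record, and the growth of the database during the day is ignored. The function names are illustrative.

```python
# Back-of-the-envelope check of the daily match count quoted in the article.
EXISTING_RECORDS = 600_000_000   # size of the Aadhaar database cited above
DAILY_ENROLLMENTS = 1_000_000    # enrollments processed per day
SECONDS_PER_DAY = 24 * 60 * 60


def daily_matches(existing_records: int, daily_enrollments: int) -> int:
    """Every new enrollment is compared against every existing record."""
    return existing_records * daily_enrollments


def matches_per_second(total_matches: int) -> float:
    """Average rate needed to finish one day's matches within that day."""
    return total_matches / SECONDS_PER_DAY


total = daily_matches(EXISTING_RECORDS, DAILY_ENROLLMENTS)
print(f"{total:,} matches per day")
print(f"{matches_per_second(total):,.0f} matches per second on average")
```

Spread evenly over 24 hours, 600 trillion matches a day works out to several billion matches every second.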
The number game
Do you know how many years one trillion seconds make? More than 31,000 years. Can you imagine the height of a tower created by stacking one trillion pennies on top of each other? It would be more than 870,000 miles. One trillion ants would weigh more than 3,000 tons. Six hundred trillion is a six followed by fourteen zeros. Storing such a humongous amount of data is hard enough; processing 600 trillion biometric matches in a day is beyond anyone's wildest dreams.
On the other hand, imagine that a person wants to open a bank account. He approaches a bank employee, who wants to check that this person is who he claims to be before opening the account. This check cannot run forever, or no customer would be willing to open an account with that bank. Authentication is expected to complete within a few seconds, even when the volume runs to a few hundred million requests every day. Authentication is synchronous and needs to happen very fast.
Now let us see how the architectural principles adopted by UIDAI help it carry out enrollment and authentication efficiently.
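The contrast between the two workloads can be sketched in a few lines of Python. This is a toy illustration, not UIDAI's implementation. It assumes that enrollment compares a new record against every stored record, as described earlier, and that authentication checks only the record for the Aadhaar number the person claims, which the article does not spell out. The `RECORDS` data, the function names, and the use of string equality in place of real biometric matching are all hypothetical.

```python
from typing import Dict, Optional

# Toy database: identity number -> stored template.
# A plain string stands in for fingerprint and iris data.
RECORDS: Dict[str, str] = {
    "1001": "template-A",
    "1002": "template-B",
    "1003": "template-C",
}


def deduplicate(new_template: str, records: Dict[str, str]) -> Optional[str]:
    """Enrollment-style 1:N check: scan every record for a duplicate."""
    for record_id, template in records.items():
        if template == new_template:
            return record_id  # this person is already enrolled
    return None  # no duplicate found, a new number can be issued


def authenticate(claimed_id: str, presented_template: str,
                 records: Dict[str, str]) -> bool:
    """Authentication-style 1:1 check: fetch one record and compare."""
    stored = records.get(claimed_id)
    return stored is not None and stored == presented_template


print(deduplicate("template-B", RECORDS))
print(authenticate("1001", "template-A", RECORDS))
```

The work done by `deduplicate` grows with the size of the database, while `authenticate` does one lookup and one comparison. This is why the first can be run as a batch job over days and the second can answer in seconds.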
Up until the 90s, information technology systems were monolithic, involving both technology and vendor lock-in. Once an investment was made, it was difficult to break away from a particular vendor and technology. Organizations could not take advantage of advances in technology or of falling hardware and other costs. The only option was to 'scale up' with the same vendor and technology.
From the 90s to the mid-2000s, software with horizontal scaling capability at the application server layer came into existence. Even though it was possible to scale horizontally, the system was still tied to a particular database vendor or application vendor. There was no technology lock-in, but vendor lock-in remained. The computing environment, i.e. the hardware and OS, was typically the same across all application server nodes.
A Love Story Begins with Open Scale-Out
This phase started in the mid-2000s. Here the system architecture is vendor and technology neutral, with no lock-in to either. There is practically unlimited scope for scaling and interoperability. UIDAI achieved open scale-out with the help of cheap commodity hardware.
Commodity hardware is simply hardware that is affordable and widely available. It has none of the specialized components typically found in enterprise systems. The entire UIDAI hardware infrastructure is composed of cheap Linux-based personal computers and blade servers. The advantage of commodity hardware is that the cost and the initial investment are low. The architecture can be scaled when the need arises: equipment can be purchased from any vendor and plugged in, and any future drop in prices can be exploited while scaling the infrastructure. The open-source technology used to cluster the commodity hardware is Hadoop.
Distributed Computing & Open Source
Imagine a monolithic system doing all the processing required to generate Aadhaar numbers. How large would that system have to be? How many processing cores would be needed for 600 trillion matches a day? Could it be expanded if the number of matches required rose from 600 trillion to 1,200 trillion? How costly would that be?
For all these reasons, Aadhaar was implemented on distributed commodity hardware rather than on a monolithic system. The processing happens on many nodes at once, so computation that would take days on a traditional monolithic system finishes many times faster. The file system used in conventional sequential computing would not work for distributed computing.
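A rough sense of scale for these questions can be had from a short calculation. The per-core matching rate below is a purely hypothetical figure chosen for illustration; the article does not state one, and real biometric matchers vary widely. Only the 600 trillion and 1,200 trillion daily totals come from the text.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# HYPOTHETICAL assumption: one core performs one million matches per second.
ASSUMED_MATCHES_PER_CORE_PER_SECOND = 1_000_000


def cores_needed(matches_per_day: int, matches_per_core_per_second: int) -> int:
    """Smallest number of cores that finishes the daily workload in one day."""
    per_core_per_day = matches_per_core_per_second * SECONDS_PER_DAY
    return -(-matches_per_day // per_core_per_day)  # ceiling division


for trillions in (600, 1200):
    workload = trillions * 10**12
    cores = cores_needed(workload, ASSUMED_MATCHES_PER_CORE_PER_SECOND)
    print(f"{trillions} trillion matches/day -> {cores:,} cores")
```

Whatever the real per-core rate is, the core count grows in direct proportion to the workload, so doubling the matches means doubling the cores.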
A distributed platform requires a specially designed file system.
The Hadoop Distributed File System (HDFS) is one such distributed file system. Special software is also needed to spread the workload across the nodes and, once the nodes have finished processing, to aggregate their results. MapReduce is an open-source framework that does exactly this: it distributes the work and finally aggregates the processed results. Hive is a tool used to query the data distributed across the commodity hardware. Its query language is very similar to SQL.
Open-source technologies such as Hadoop, HDFS, MapReduce, and Hive all fall under the umbrella of big data technologies. Because of them, computation that would otherwise take days can be reduced to mere minutes, and at very low cost. UIDAI leveraged these technologies fully. The system was implemented in a completely open scale-out fashion, without dependence on any one vendor or technology.
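The distribute-then-aggregate idea behind MapReduce can be illustrated with a minimal, single-machine Python sketch. It does not use Hadoop or its API. The enrollment records, the `CHUNKS` layout, and all function names are hypothetical, and the chunks are processed one after another here, whereas a real cluster would process them in parallel on separate nodes.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical enrollment records (state of each enrollee), already split
# into chunks as if each chunk were stored on a different node.
CHUNKS: List[List[str]] = [
    ["Kerala", "Punjab", "Kerala"],
    ["Punjab", "Assam"],
    ["Kerala", "Assam", "Assam"],
]


def map_phase(chunk: List[str]) -> List[Tuple[str, int]]:
    """Map: emit a (key, 1) pair for every record in one chunk."""
    return [(state, 1) for state in chunk]


def shuffle_phase(mapped: List[List[Tuple[str, int]]]) -> Dict[str, List[int]]:
    """Shuffle: group the emitted values by key across all chunks."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            grouped[key].append(value)
    return dict(grouped)


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce: aggregate the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def run_job(chunks: List[List[str]]) -> Dict[str, int]:
    mapped = [map_phase(chunk) for chunk in chunks]  # parallel on a cluster
    return reduce_phase(shuffle_phase(mapped))


print(run_job(CHUNKS))
```

In Hive, the same per-state count would be written as a SQL-style `GROUP BY` query.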
Kudos Team UIDAI!
Petabytes of data related to the identity of the citizens of a country with a population of more than one billion are processed using open-source technologies in a distributed fashion on commodity hardware. This is an astonishing feat of engineering, and UIDAI achieved it successfully. Team UIDAI deserves thunderous applause for pulling off this seemingly impossible feat.
The government should now think of creative ways to use this data to plug the leaks in its various direct subsidy programs. It should also use it to bring more transparency to financial transactions, prevent tax evasion, provide banking facilities to the poor, and carry out other such crucial tasks. Then we can achieve the status of a real 'welfare nation'.