In the happy carefree days of the pre-COVID era, I used to travel a lot on the traffic-riddled roads of Bangalore. With the journeys taking an hour to cover a 2 km distance, I had ample free time sitting in the traffic. Needless to say, I was looking at new ways of whiling away that time which is when I picked up the habit of looking at the number plates in front of me and trying to add them as quickly as possible in mind. With time I started getting quicker and quicker at it which is when I decided to talk it up a notch and add the numbers in the sum further and further till I reached a single digit number.
This weekend when I was reminiscing about this mental game of mine, I began to think of the significance of finding that number. What could be a practical advantage of reducing a large number to a single unit number? Being from a CSE background, hashing seemed to be the most natural application for this. This mindless musing on the process of repeatedly adding numbers also revealed a few interesting mathematical patterns. Building on top of this I decide to use this property to try and create an effective hashing mechanism that can be used for practical uses like, password verification and generating a message digest for large text files.
Some Interesting Mathematics
Lets begin with the mathematical patterns that I mentioned earlier. We'll begin by taking any 4 digit number and add up its numbers repeatedly till we reach a single digit sum.
Now, lets play with this function a bit. Imagine adding a 0 to this number anywhere in between. Does that affect the unit sum value obtained from this function?
The core logic is pretty simple. Given a large number, break up the number into equal length segments. Then apply the unit-sum function to recursively add the numbers on each segment and combine all segments to form a single hash value. I feel the name "unithash" seems befitting here as the hash is obtained by recursively reducing each segment to the unit sum. The length of the segments can also vary depending on the type of content we are trying to hash.
There are 2 major functions defined in this library -
+ find.get_unithash (arg1: integer)
+ find.set_unithash (arg1: string, arg2: integer)
The first function is used to get the unit sum value of a given number. It expects only 1 argument of type integer as input, which would be the number for which we wish to calculate the unit sum.
The second function is actual hashing algorithm implementation. It takes in 2 arguments. Argument 1 is the content to be hashed in the form of a string. Argument 2 is used to set the segment distribution length. Varying the segmentation length, would generate varying hash digests for the same text.
We'll take a look at a real life use cases where this hashing algorithm might be useful.
Lets say there is a terms and conditions document that has to be sent from company A to company B. The company B wishes to make sure that the terms received are actually the correct terms and have not been addled with by an attacker. In order to do so the company A will send the hashed message digest of the terms to company B via a secure channel. Thus when the company B receives the actual terms document they can verify its veracity by re-computing the hash value and seeing if the 2 match.
Now let's say, some attacker gets hold of this agreement during the transfer and decides to modify the agreements to now read "not free of charge". The company B can now detect this change by recomputing the hash on the received message and comparing it with the hash for the original agreement.
Comparing the 2 message digests we can see that the hashed values for both the agreements are different and hence the received message has been tampered with. Phew!
Using the library
In order to use this hashing algorithm in your code, do a pip install for the library.
> pip install unithash
Check the link here for more details on the library. Then in the python code import the library and its corresponding functions and that's it! Here's a sample code -
This hashing algorithm is the result of weekend musings and time to kill thanks to the lock-down and covid. The primary reason for implementing this algorithm as a python library was to contribute to open source and for the thrill of creating a publicly available python of my own. Exploring this mathematical pattern and working on this algorithm helped me keep in touch with my analytical side. Any coding enthusiast viewing this post are welcome to compare the merits and demerits of this algorithm with existing algorithms along with the collision vulnerability rates, time and space complexity
0 Comments