Creating my own python library | Unitsum Hashing Algorithm

In the happy carefree days of the pre-COVID era, I used to travel a lot on the traffic-riddled roads of Bangalore. With the journeys taking an hour to cover a 2 km distance, I had ample free time sitting in the traffic. Needless to say, I was looking at new ways of whiling away that time which is when I picked up the habit of looking at the number plates in front of me and trying to add them as quickly as possible in mind. With time I started getting quicker and quicker at it which is when I decided to talk it up a notch and add the numbers in the sum further and further till I reached a single digit number.

This weekend when I was reminiscing about this mental game of mine, I began to think of the significance of finding that number. What could be a practical advantage of reducing a large number to a single unit number? Being from a CSE background, hashing seemed to be the most natural application for this. This mindless musing on the process of repeatedly adding numbers also revealed a few interesting mathematical patterns. Building on top of this I decide to use this property to try and create an effective hashing mechanism that can be used for practical uses like, password verification and generating a message digest for large text files.

Some Interesting Mathematics

Lets begin with the mathematical patterns that I mentioned earlier. We'll begin by taking any 4 digit number and add up its numbers repeatedly till we reach a single digit sum.


 Now, lets play with this function a bit. Imagine adding a 0 to this number anywhere in between. Does that affect the unit sum value obtained from this function?

We see that the unit sum value remains unchanged. Which feels pretty intuitive, given that this is primarily just an addition operation albeit done iteratively. Now, here comes the interesting part. What happens if you add a 9 on the original number? Does the function then still remain unchanged? Does the position of where we insert the 9 have any effects?

As seen above adding a 9 to the number when calculating the unit sum, irrespective of its position has no effect on the unit sum function! This is a property that we'll utilize later when creating the hash, but for the time being let us look at another aspect of this function. The function for the most part requires at most 2 iterations to reach the unit sum. In order for the iteration round count to go up, the sum of the digits should add up to greater than 10. This progression isn't a linear one though.

This however, enunciate the point that irrespective of the size of the number to be hashed, the rounds of iterations to calculate the unit sum would rarely go beyond 2-3 rounds.

Using this to create a hashing library on python
Having taken a look at the underlying mathematical property used in the hash, lets move forward to creating the hashing mechanism itself.

The core logic is pretty simple. Given a large number, break up the number into equal length segments. Then apply the unit-sum function to recursively add the numbers on each segment and combine all segments to form a single hash value. I feel the name "unithash" seems befitting here as the hash is obtained by recursively reducing each segment to the unit sum. The length of the segments can also vary depending on the type of content we are trying to hash.

Another important aspect here is handling of alphabetic and non-alphanumeric characters. Lets say there is a terms and conditions agreement, that contains a long paragraph, and is being sent over from the client to the server. How do we create a message digest for this text to ensure that the agreements text received on the other end matches the text that was sent?

In order to handle these situations, we convert the character to their ASCII values and then convert them to corresponding mappings to the positions in the alphabets series and use them to create the hash. Thus the mappings become a = 1, z = 26 and likewise for the other characters. A similar logic is extended to non-alphanumeric characters with the only distinction being only absolute values are considered.

There are 2 major functions defined in this library -

            + find.get_unithash (arg1: integer)

            + find.set_unithash (arg1: string, arg2: integer)

The first function is used to get the unit sum value of a given number. It expects only 1 argument of type integer as input, which would be the number for which we wish to calculate the unit sum.

The second function is actual hashing algorithm implementation. It takes in 2 arguments. Argument 1 is the content to be hashed in the form of a string. Argument 2 is used to set the segment distribution length. Varying the segmentation length, would generate varying hash digests for the same text.

We'll take a look at a real life use cases where this hashing algorithm might be useful. 

Lets say there is a terms and conditions document that has to be sent from company A to company B. The company B wishes to make sure that the terms received are actually the correct terms and have not been addled with by an attacker. In order to do so the company A will send the hashed message digest of the terms to company B via a secure channel. Thus when the company B receives the actual terms document they can verify its veracity by re-computing the hash value and seeing if the 2 match.

Now let's say, some attacker gets hold of this agreement during the transfer and decides to modify the agreements to now read "not free of charge". The company B can now detect this change by recomputing the hash on the received message and comparing it with the hash for the original agreement.

Comparing the 2 message digests we can see that the hashed values for both the agreements are different and hence the received message has been tampered with. Phew!

Using the library

In order to use this hashing algorithm in your code, do a pip install for the library.

> pip install unithash

Check the link here for more details on the library. Then in the python code import the library and its corresponding functions and that's it! Here's a sample code -

This hashing algorithm is the result of weekend musings and time to kill thanks to the lock-down and covid. The primary reason for implementing this algorithm as a python library was to contribute to open source and for the thrill of creating a publicly available python of my own. Exploring this mathematical pattern and working on this algorithm helped me keep in touch with my analytical side. Any coding enthusiast viewing this post are welcome to compare the merits and demerits of this algorithm with existing algorithms along with the collision vulnerability rates, time and space complexity


Post a Comment