Idea
 
I'm thinking of a method to create unique, roughly sequential identifiers that fit in a signed 64-bit integer.
Fields of this type ("bigint") are commonly used in relational databases to store per-table unique identifiers.
Entering the world of clusters or NoSQL, we discover that something we take for granted at first sight, such as
an identity or auto-increment column for those coming from the MySQL world, actually hides complex
sequence-management problems, which may require global locking.

The hypothesized method should produce numbers that are reasonably sequential and collision-free enough
to be usable in the real world as a replacement for auto-increments.

To do this, the function would work on three kinds of arguments:
- a time component, in order to provide sequentiality and to limit the window in which
  collisions can occur;
- the digest of a key passed by the caller;
- a random part, which could reduce the number of collisions.

The caller must provide the year of entry into service; this value cannot be changed later.
The function will keep working for a certain number of years, after which it runs out of space and can no longer be used.

Furthermore, to generate new numbers the caller should provide a fairly short unique logical key on which the hash will be calculated.

Internally the algorithm would take care of packaging the various fields into a signed long.

This value is returned to the caller, who can consider it reasonably unique and expect a low number of collisions.
The function cannot be considered monotonic, however, since within the same time slot numbers can come out of sequence.
Still, the tendency is to maintain an increasing order; depending on the scenario this may not be a problem.


Prototype

I prototyped a version of this algorithm and released it under the LGPL license.
During the implementation, based on experimental tests, I noted that the random part could not be used, because the available space is limited to just 64 bits:
shrinking the portion of bits reserved for the digest in order to fit a pseudo-random sequence decreases the effectiveness of the function.

In this version I have packed the current date into 36 bits (one of which, the sign, goes unused) with one-second precision;
the remaining 28 bits hold a 32-bit digest, of which the first 4 bits are discarded.

Recall that a bit is lost because databases typically use a signed int64.
In theory, using SQL Server for example, we could start using identities with negative values,
for instance with a seed equal to -2147483648:

Identity(-2147483648, 1)

In practice, I have never actually seen this done; usually the first id starts from 1.

The tests show a collision rate of around 0.2% when generating new numbers continuously.
In reality this is not the main use case the algorithm was designed for:
in a network scenario with a limited number of requests per second, the function could prove adequate.

However, it is necessary to implement a collision-management policy:
a simple way would be to declare the value unique in the table, delegating the check to the database.
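
One such policy, as a minimal sketch (TryInsert is a hypothetical data-access helper, not part of the library, that returns false on a unique-constraint violation):

   long id = Identity.GetNew(year, key);
   int attempt = 0;
   while (!TryInsert(id))
   {
       // The same key in the same second would reproduce the same id,
       // so vary the key (or wait for the next second) before retrying.
       id = Identity.GetNew(year, key + "#" + (++attempt));
   }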

I tried inserting a random part, combining it in xor with a part of the digest, but the number of collisions tends to increase.
After some attempts I came to the conclusion that to maintain the constraints imposed by 64 bits, the current solution is valid.

This is the 64-bit layout according to the current version of the project:

+-[ sign bit, cannot be used because databases work with signed long ]
|
|    +-[ the year is 9 bits long, enough for storing 512 values between 0 and 511 ]
|    |
|    |      +-[ the month is 4 bits long, it stores values between 0 and 11 ]
|    |      |
|    |      |    +-[ the day is 5 bits long, stores values between 0 and 30 ]
|    |      |    |
|    |      |    |     +-[ the hour is 5 bits long, stores values between 0 and 23 ]
|    |      |    |     |
|    |      |    |     |     +-[ the minute is 6 bits long, stores values between 0 and 59 ]
|    |      |    |     |     |
|    |      |    |     |     |      +-[ the second is 6 bits long, stores values between 0 and 59 ]
|    |      |    |     |     |      |
|    |      |    |     |     |      |        +-[ the remaining 28 bits are taken from the 32 bit digest ]
|    |      |    |     |     |      |        |
|    |      |    |     |     |      |        |
| +--+---+  ++  ++-+  +++   ++-+  +-+-+  +---+-----------------------+
|/        \/  \/    \/   \ /    \/     \/                             \
-YYYYYYY YYMMMMDD DDDhhhhh mmmmmmss sssstttt tttttttt tttttttt tttttttt              
xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx xxxxxxxx
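
For illustration, this packing could be sketched in C# as below; the field positions follow
the diagram, but the names are mine and the year is assumed to be stored as an offset from
the start year (a sketch of the layout, not the library's actual internals):

   static long PackFat(DateTime utc, int startYear, uint digest)
   {
       long id = 0;
       id |= (long)(utc.Year - startYear) << 54; // 9 bits: year offset, 0..511
       id |= (long)(utc.Month - 1)        << 50; // 4 bits: 0..11
       id |= (long)(utc.Day - 1)          << 45; // 5 bits: 0..30
       id |= (long)utc.Hour               << 40; // 5 bits: 0..23
       id |= (long)utc.Minute             << 34; // 6 bits: 0..59
       id |= (long)utc.Second             << 28; // 6 bits: 0..59
       id |= digest & 0x0FFFFFFF;                // low 28 bits of the 32-bit digest
       return id;                                // the sign bit stays 0
   }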

Once operational, the function continues to produce identifiers for 512 years,
after which it is no longer usable.
This interval seems plausible: imagine that the function stopped working in 2024
because it had been started in the year 1512.

Every second there is a theoretical space of 2^28 (about 268 million) identifiers available.

Ultimately I think that, however simple, this solution can find real applications,
and it is one more tool added to my body of knowledge.


API
The library provides the static method GetNew(short year, string key) to generate new identifiers:

   long id = Identity.GetNew(year, key);

There is an equivalent object-oriented version, shown below:

   Identity identity = new Identity(year);
   long id = identity.Next(key);

Finally, once the IDs have been generated, the library makes some utilities available;
the first recovers the UTC timestamp at which a given id was generated:

   Identity identity = new Identity(year);
   DateTime birth = identity.Birth(id);

The second tells us, given an identifier and the original key,
whether that key was used for the generation or not:

   Identity identity = new Identity(year);
   bool matches = identity.Check(id, key);
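
Putting the pieces together (2024 and "order:42" are merely example values):

   Identity identity = new Identity(2024);
   long id = identity.Next("order:42");
   DateTime birth = identity.Birth(id);            // UTC instant of generation
   bool matches = identity.Check(id, "order:42");  // true: the key matches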

Conclusion
This little project helped me learn more about SQL Server's identity function,
and to develop an alternative solution.

News

I tried to improve the algorithm described in the previous article in order to produce fewer collisions.
I had hypothesized that I could use milliseconds to increase timing precision, reducing the space dedicated to the hash.
Unfortunately this method doesn't pay off, and the net effect is to produce more collisions.

I then went in the opposite direction, keeping all 32 bits dedicated to the digest.
I reused the UNIX concept of an epoch, calculating the seconds from a starting date.
So when we set the year, for example 2024, we endow the software with an epoch equal to 2024-01-01 00:00:00.


The seconds elapsed since this date are calculated and packed into the upper 32 bits of the returned long, including the sign bit.
The remaining 32 bits are dedicated to holding the hash.
The code thus becomes extremely simple:

identifier = seconds << 32 | digest;
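
A fuller C# sketch of this variant (illustrative, not the library's exact code):

   static long PackEpoch(DateTime utcNow, int startYear, uint digest)
   {
       DateTime epoch = new DateTime(startYear, 1, 1, 0, 0, 0, DateTimeKind.Utc);
       long seconds = (long)(utcNow - epoch).TotalSeconds; // fits in 31 bits for ~68 years
       return seconds << 32 | digest;                      // low 32 bits hold the digest
   }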


This solution shows improvements both in performance, since fewer bit operations need to be performed,
and in collisions, whose number improves significantly, going from 0.180% to 0.012%.

I have updated the test program to support this option, which has become the default choice;
the previous mode can be activated from the command line with the "-f" switch.

Internally I called these two variants FAT and Epoch: the former was influenced by the way
the File Allocation Table stores time references (further information can be found at https://wiki.osdev.org/FAT),
while the latter is named after the way dates are stored under UNIX.

Below is a comparison between the two versions.

 

>identity.exe 2024 -f --test 40000000
Generated 40000000 identifiers in 0:46.723ms.
Detected 64221 collisions (0,1605525%).

>identity.exe 2024 -f --test 40000000
Generated 40000000 identifiers in 0:41.093ms.
Detected 71915 collisions (0,1797875%).

>identity.exe 2024 -f --test 40000000
Generated 40000000 identifiers in 0:41.988ms.
Detected 70367 collisions (0,1759175%).

>identity.exe 2024 --test 40000000
Generated 40000000 identifiers in 0:37.983ms.
Detected 4987 collisions (0,0124675%).

>identity.exe 2024 --test 40000000
Generated 40000000 identifiers in 0:37.593ms.
Detected 4842 collisions (0,012105%).

>identity.exe 2024 --test 40000000
Generated 40000000 identifiers in 0:37.850ms.
Detected 4810 collisions (0,012025%).

This version of the algorithm suffers from the same problem that we encounter with the UNIX epoch:
the lifespan of the generator is only about 68 years.
This value is given by the maximum number representable by a signed 32-bit integer, 2147483647,
divided by the number of seconds in an average year, 31556926 (2147483647 / 31556926 ≈ 68).
The site https://www.epochconverter.com/ covers the epoch extensively and describes its limitations.

The original version of the algorithm offers a validity range of 512 years, therefore much longer;
however, the benefits brought by this variant have made it the preferred option.

 
Md5

I tried using the last 4 bytes of the MD5 hash in order to obtain a 32-bit digest.
However, its performance is about six times worse than that of the MaPrime2c algorithm I used previously;
the original code of that algorithm can be found at http://www.amsoftware.narod.ru/algo2.html.
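
For reference, such a 32-bit value could be derived in C# roughly as follows (a sketch, assuming the key is hashed as UTF-8 bytes):

   using System;
   using System.Security.Cryptography;
   using System.Text;

   static uint Md5Digest32(string key)
   {
       using MD5 md5 = MD5.Create();
       byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(key)); // MD5 yields 16 bytes
       return BitConverter.ToUInt32(hash, hash.Length - 4);        // keep the last 4
   }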
Initially, the worse performance actually led to fewer collisions being generated,
simply because in one second fewer identifiers could be created.
To verify this, I artificially slowed down the MaPrime2c algorithm until the computation times were similar;
at that point the number of collisions was similar too, from which I deduced that MD5 is of no use for this project.
For now this attempt remains in the code, but I plan to remove it in the future.
However, I like having included a mechanism that allows different hashes to be plugged in.
 
Conclusions

I have come to the conclusion that 64 bits are in fact too few for truly unique identifiers;
still, it seems to me that this new version of the algorithm makes good use of the available bits.