For the purpose of this post, I will use UUID (Universally Unique Identifier) to mean both UUID and GUID (Globally Unique Identifier, Microsoft's implementation of UUID).
I've been thinking a lot about UUID's lately. The system I've inherited at work is plagued by their usage. To many developers, the UUID seems like a totally awesome way to establish the identity of a record in a system. I mean, how cool is it that you can generate an ID unique to every system in the world?.
The purpose of this post is to discuss appropriate and inappropriate uses of UUIDs. My goal is to encourage engineers to think about the general consequences of data type selection for identifiers in their architectures.
To most databases, UUID's are just 36 character strings.
Databases like MySQL do not have a native implementation of the data structure. This means the column that carries the value must at least be 36 characters (
VARCHAR(36)). When you consider the text encoding (e.g. MySQL character set) used to represent strings, this could mean 2-3 bytes per character (if using UTF-8). That means at least 72 bytes per identifier!
While this doesn't seem like a big deal, consider that for every 13,889 identifiers, your database will consume 1MB of storage. If you are using Foreign Keys, and they are also UUID's, that's another 1MB of storage for each Foreign Key.
UUIDs degrade database performance.
If you use a UUID as an identifier for a table, you're going to have to index it. Unfortunately, UUID's don't index well. The problem is their size and randomness. Indexes are trees that grow and branch as you add more data. Sequential values tend to index well because they don't require large sections of the index to be realigned (e.g. splitting a value into a new branch of values - refer to B-tree). UUID's are designed to be non-sequential and they are very large compared to an integer. This means the more UUIDs you insert, the larger the insertion penalty will be.
Don't take my word for it, just look at the statistics: http://kccoder.com/mysql/uuid-vs-int-insert-performance/. Notice the nearly consistent insertion time of a long integer vs. the UUID. More importantly, note that the scale on the left is "insert time in hours".
You will very certainly see better performance on queries using integers as well, though this might not be as pronounced. Auto incremented integers will almost always be smaller, meaning scans against tables/records will be more efficient. On the other hand, the index that stores UUID identifiers will grow much larger and at a faster rate than integers. This means a UUID index will become disk bound (because it can't fit completely into memory) more quickly than an integer index.
Are UUID's the right data structure for the task?
The point of a UUID is to have a universally unique identifier. There's generally two reason to use UUIDs:
- You do not want a database (or some other authority) to centrally control the identity of records.
- There's a chance that multiple components may independently generate a non-unique identifier.
These concerns should generally only arise when you are in a concurrent or distributed environment. The first reason is about avoiding an unnecessary call to an external system. You don't want a high-throughput message broker like RabbitMQ asking MySQL for an identifier every time it publishes a message. The second is generally a concern of a distributed system that requires a high write throughput and uses the key (generally a UUID) to determine the instance/server/partition it needs to write to.
The question is, does your system actually have the same requirements?
The following are use cases in which I'd argue you don't need a UUID.
1. Application writes records to a single database (or a cluster in master/slave configuration).
If there is only one database accepting writes, you don't have a need to route records using a unique key.
If portions of the data structure are going to update more than one table (and you need the ID before inserting records into those other tables in one single transaction), you could encapsulate the mutations in a stored procedure. I will admit that this may not be true in a NoSQL database with limited macro/remote execution functionality. In that case it may make sense to generate the ID outside of the database an perform simultaneous mutations. If you do this you are probably not doing it within a transaction and will need to account for inconsistency in your data stores if the operation fails mid-update.
2. There's a Natural Key for your record.
A natural key is a property or group of properties that make your record unique. A phone number or social security number could serve as a unique identifier. In a distributed system, this could be the combination of server IP, application, and timestamp. The point is, there may not be a need to have a UUID to establish record identity in a database.
If you need to distribute writes across servers, you could simply hash the natural key (concatenating fields if it's a composite) for a unique identifier. If the hashing strategy is well know amongst components in the architecture, you will not need to store the hash (since components will know how to regenerate it on the fly when performing lookups).
3. When you need to present an ID to a user.
Consider an HR application for a small company (30 employees). Does a UUID make sense for an employee ID? Could you imagine if your employee number was:
2a6db8e1-8967-4511-9839-a7cb3a895710? Of course not! We're accustomed to a more friendly number we can remember.
There's an argument for this same kind of recognizability in the URI's we present users and application developers:
I can probably readily remember the second URL if that user identifier was mine. The first one? Forget it!
UUID's should be a last resort tactic,employed only after you have exhausted other strategies for determining uniqueness in an architecture. In a database, they are generally considered an anti-pattern and should be avoided due to the performance limitations in indexing their values. UUID's also not particularly friendly to users. Instead, engineers should use natural keys as much as possible, and auto incremented integers as an alternative (when it makes sense).