-
Notifications
You must be signed in to change notification settings - Fork 255
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Uniqueness broken since 2.2.9? #153
Comments
If indeed this package has no active maintainers, I'd like to volunteer to pick it up and restore the original short-and-unique functionality. There's a place for both this package and |
Finally, I found an expert to talk about the unique ID topic! I like the topic, but it is hard to find a person to talk about my passion :). Collision ProbabilityTLDR; old shortid didn’t use a unique machine ID. As a result, old way had a bigger collision probability that the current one. Yeap, you are right, we moved from mixing environment data to using hardware randomness. But we made it for the reason. Uniqueness guarantee is a very complex problem. And it became even more complex when we start to generate random IDs across multiple computers in the cluster (at the same time). The problematic was studied very well in different UUID versions.
In that commit, I removed the old bytes packing system to:
Major vs Patch ReleaseYeap, I thought about the major releases. But here are my reasons why I didn’t do it:
DeprecationYeap, we can do it more properly. Let’s discussed after the collision topic. SolutionI believe that the whole way of mixing “UUID v1” and “UUID v4” was wrong. “UUID v1” is very very hard to implement. Right now “UUID v4” is a much safer way since it is easy to implement and using >= 122 bits of hardware random generator gives extremely low collision probability:
UUID v4 is an industry-standard right now. It is safe to use. I made a new library based on good “UUID v4” way: Nano ID. It is faster, smaller and has less collision probability than old and new |
Thanks for the detailed response, appreciate it. It sounds like your main complaint about shortid @ 2.2.8 was that it wasn't clear that "worker" ID should be set differently on each process, and that in some environments (browser) it's barely even possible, so users would encounter collisions if they were trying to use it as a replacement for UUIDv4. However, assuming a user did read the docs, did know what they were doing and did set worker ID correctly (or only ever had one process generating IDs), then shortid was cast-iron guaranteed unique within the scope of your cluster. I personally ran a 5-node backend service with it for years, and I never had to guess what my demand would be, never had to customise ID length, never had to check for collisions, and could've gone for years more without thinking about it, at least assuming I never updated to 2.2.9. ID lengthThere's also the space efficiency consideration - for a random ID generator to be useful it has to have a very sparsely populated range of potential values, so that a new random ID has a very low probability of collision. UUIDv4 has a colossal range and no two will ever be generated the same, but it has to be very long to achieve that. It's entirely dependent on the size of my record set vs the size of my id space. If my IDs are based on worker + timestamp + counter, then it doesn't matter how many records I have already. Every new second of time guarantees a big empty chunk of id space, and we know how to fill it efficiently. I could generate a shortid every 1ms across half a dozen workers from now for the next 10 years and I'd never need more than 7 characters, and there'd be zero probability of collision. My id space is relatively efficiently filled, because I'm filling it with an algorithm and I know where the gaps are. So in the case of ID generation at a fairly constant rate over a very long time span, ComparisonOf course UUIDv4 is also practically unique, but because it aims to be globally unique across all instances of everything, it's definitely not short, and this matters when you're conscious of things like index size or, device storage, document size, power drain...
Consider that |
How do you set worker ID for now? |
On that multi-instance system (on AWS) I think I just used the last part of a private IP address since I had one node process per instance/IP on a small subnet. On a manually scaled system it's obviously fairly easy to set an ID as an environment variable or similar as part of provisioning. On an auto scaling GCP system I'm working on now I could probably set an integer instance ID as instance metadata, and when a new instance boots that instance could allocate itself the lowest unused integer (the cluster has access to cluster metadata via an http API). Lots of ways to tackle that problem - more of a devops thing really. The point is that if you can determine an ID per worker, this library would give you unique IDs. |
Hm. The current API doesn’t explain how cluster-ID is important. We can revert random-based generation and add new API which will be clear about node ID: let shortid = require('shortid')
shortid.generate()
//=> Error, You didn’t set a unique worker ID. Read github.com/dylang/shortid#worker-id show to set it properly
shortid.uniqueWorkerId(getIPAddress())
shortid.generate() //=> "PPBqWA9" |
I guess it you could improve the function naming, but it's all in the readme. |
Yeap, I forget to explain my reasons behind this idea:
|
I don't really think that's necessary but it is subjective, if you want to make it idiot-proof at the expense of simplicity in the majority case I guess that's your call. What's not subjective at all is the fact that 0e9c560 should not have been approved as a patch release, or at all frankly - that was obviously a mistake. Hopefully it hasn't caused a massive problem to too many people. |
It is much more complicated. 2.2.9 worse for people with the correct worker ID. But most of the users use Users who don’t set worker ID in 2.2.9 have lower collision probability. But I agree that people who set worker ID now have bigger collision probability (but now very big). Because more people use This is why we can’t just revert 2.2.8. It will make better for you, but worse for most of the users. This is why I suggest a better way. Change API to force everyone to set correct worker ID and revert old method. |
If you think about how this is used, anyone who has been storing ids with >=2.2.9 now has random ids in their database. There’s no way to revert things so that we won’t collide with those random ids - the id space is of all current users of this package is permanently polluted. I’m not sure where you go from here. |
Why we can’t change |
Because we don’t encode the version any more - we now set the first character completely randomly because |
Hm, yeap, you are right. We uses |
You can only really publish a new major version 3, which restores 2.2.8 behaviour, and explain that v3 ids may collide with v2 ids, because unfortunately v2 ids were non-uniquely generated contrary to the docs. There’s nothing that can be done for people who relied on the package behaving as documented after 2.2.9, unfortunately. They may now have an ID migration to have to go through. |
We can’t just restore 2.2.8 without solving the problem of forgetting a set worker ID. But we can do both: restore 2.2.8 and fix API to force everyone to set worker ID. |
You can make the API whatever you like in v3, obviously. The behaviour of 2.2.8 is the important part - ie zero-collision guarantee. |
You mentioned that you want to become a maintainer. I can help you with 3.0 with reviewing PRs and then I will give you access. But it is important for me to find a solution for everyone, not only for people who set worker ID correctly, since 2.2.9 was released to fix problem of worker ID bad practices. |
Forgetting about v3 for the moment, I do really think you should make it very clear (shout it, big letters!) in the readme that 2.2.9 and onwards do not offer uniqueness, to the extent of publishing a warning/deprecation notice to npm. The readme at the moment is considerably out of step with what this package now does. I don't mean to be dramatic but I really feel sorry for people who have used this package long-term thinking it was a good choice for time-based unique IDs - now assuming they've been patching their dependencies they have unknowingly introduced an insidious bug - an increasing possibility of collisions - and no way to fix/avoid collisions with the random IDs they have already accidentally generated. In the short term I'll maintain a fork, https://www.npmjs.com/package/@rh389/shortid, which will always guarantee uniqueness. CC @dylang, if he's still out there! |
Sure, send PR to docs |
literally went here because of this example of "trusting docs without checking": https://www.youtube.com/watch?v=SLpUKAGnm-g i was curious and ended up here. docs should probably point out that id's can in fact collide and uniquenes should always be checked. |
Hi,
From https://github.com/dylang/shortid/blob/master/README.md#shortid---
I've been trying to figure out how
shortid
guarantees uniqueness, and I've come to the conclusion that in fact it doesn't at all (any more).The way this module worked as at https://github.com/dylang/shortid/tree/2.2.8/lib is that it encoded a number of variables (version, worker, seconds offset, counter), mixing in some randomness so that the ID wasn't guessable - however the original data was still hidden between those random bits, and it could be decoded by bit-masking. Given different input the ID was therefore guaranteed to be different (at least subject to #105).
Since 0e9c560 that's all broken. The number to be "encoded" now only influences the string length, but doesn't actually make it into the string at all
shortid/lib/generate.js
Line 14 in 1403944
The string is actually completely random, and has at least as much chance of collision as any random string of its length and alphabet. None of version, worker ID, or time have any impact whatsoever on what is generated.
Please tell me I'm wrong!
Edit: I notice @ai has been closing issues with the assertion that this package is deprecated, but in anticipation of that I'll just note -
nanoid
, though it seems good at what it does, is not a like-for-like replacement - that's a random generator, not a unique generator.It seems to me like when maintenance changed hands after 2.2.8 there might've been an honest misunderstanding about what this package did / was meant to do / how it worked.
The text was updated successfully, but these errors were encountered: