Skip to content

Fix GH-14807 ext/standard levenshtein overflow on 3rd, 4th and 5th arguments. #14809

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 1 commit into
base: PHP-8.2
Choose a base branch
from

Conversation

devnexen
Copy link
Member

@devnexen devnexen commented Jul 4, 2024

No description provided.

@devnexen devnexen changed the title ext/standard levenshtein overflow on 3rd, 4th and 5th arguments. Fix GH-14807 ext/standard levenshtein overflow on 3rd, 4th and 5th arguments. Jul 4, 2024
@devnexen devnexen marked this pull request as ready for review July 4, 2024 06:37
@devnexen devnexen requested a review from bukka as a code owner July 4, 2024 06:37
@nielsdos
Copy link
Member

nielsdos commented Jul 4, 2024

This doesn't solve the problem in general, also the weights are allowed to be negative

@devnexen
Copy link
Member Author

devnexen commented Jul 4, 2024

Oh I see you already work on it he might be better if you continue with it :)

@nielsdos
Copy link
Member

nielsdos commented Jul 4, 2024

I didn't really come up with a general solution, see #13823 (comment)

@devnexen devnexen marked this pull request as draft July 5, 2024 11:50
@nielsdos
Copy link
Member

nielsdos commented Jul 5, 2024

@devnexen I thought about this more.

I think there's two ways to go about this:

  1. Check beforehand if overflow can happen. If we set a number L to the largest cost (i.e. MAX(cost_ins, cost_rep, cost_del)) then overflow can only happen if MAX(ZSTR_LEN(string1), ZSTR_LEN(string2)) * L overflows, i.e. if L > ZEND_LONG_MAX / MAX(ZSTR_LEN(string1), ZSTR_LEN(string2)). And same reasoning for the lowest cost. If we check both of these cases, then we could throw beforehand. The disadvantage is that this can be overly defensive, e.g. using a PHP_INT_MAX weight to disallow one of the 3 operations is a legitimate trick and that would break even though the overflow will never happen in practice.
  2. Use overflow-checked math in the computation of the distance. This is a bit harder / more annoying. It should be cheap for the CPU as the overflow flag is computed anyway when performing arithmetic. The advantage is that the function only throws when there actually is overflow, although then if you use extreme weights then depending on some input strings you get an exception and user code may not be prepared to deal with this. Then again, overflow should be rare.

@devnexen
Copy link
Member Author

devnexen commented Jul 5, 2024

Thanks I ll look at it within the next days.

@cmb69
Copy link
Member

cmb69 commented Jul 14, 2024

I think there's two ways to go about this:

Both appear to be well-thought out and analyzed, and unfortunately neither seems to be fully backward compatible. I consider the current weighted levenshtein() functionality rather half-baked anyway; originally, the function signature supported callbacks (but not even general callables) for the costs, but since that was never implemented, it had been removed. If callables were properly supported, the function might be way more useful (albeit supposedly even much slower), but as is I still see not much use for this function, so it might indeed be "best" to just document the issue and let users deal with it. More elaborate support for levenshtein (and maybe other edit distance algorithms) could still be offered as PECL package (maybe something like this already exists). These are just my 2cents, though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants