Add vstrtonum util as replacement for atof/strtod, etc. #19309

Draft
wants to merge 18 commits into
base: develop

Conversation

markcmiller86
Member

@markcmiller86 markcmiller86 commented Feb 14, 2024

Description

This is a draft PR for developers to see where I am headed with this. It defines vstrtonum<sometype>() with some special sauce to handle things like default values, error checking, range checking, etc. Read the comment in StringHelpers.h for an overview...

// ****************************************************************************
// Function: vstrtonum
//
// Purpose: Replacement for strtoX() and atoX() methods.
//
// Instead of any of these kinds of uses...
//
// int k = atoi(numstr);
// float f1 = atof(numstr);
//
// unsigned u = (unsigned) strtoul(numstr, 0, 10);
// if (errno != 0) u = 0xFFFFFFFF; // set to default val
//
// float f2 = (float) strtod(numstr, 0);
// if (errno != 0) {
// f2 = -1.0; // set to default value
// debug5 << numstr << " bad value" << endl; // log error
// }
//
// ...do this...
//
// int k = vstrtonum<int>(numstr);
// float f1 = vstrtonum<float>(numstr);
//
// unsigned u = vstrtonum<unsigned>(numstr, 0xFFFFFFFF);
// float f = vstrtonum<float>(numstr, -1.0, debug5);
//
// Templatized methods to convert strings to language-native typed
// numeric values, perform some minimal error checking and optionally
// emit error messages with potentially useful context when errors
// are encountered.
//
// This should always be used in place of strtoX() or atoX() when
// reading ascii numerical data.
//
// We do a minimal amount of error checking for an unsigned conversion
// by checking if the first non-whitespace character is a minus sign and
// artificially setting errno to EDOM (not something strtoX/atoX
// would ever do). We could add more error checking for different
// cases too by, for example, checking value read for int type and
// seeing if it is too big to fit in an int. We currently do not
// do this but it would be easy to add. We could also easily add
// logic for other, less frequently used types such as shorts or
// maybe int64_t, etc.
//
// The default method treats all ascii as long double for the strtoX
// conversion and then casts it to correct type. We specialize some
// cases for slightly better behavior.
//
// I ran performance tests on macOS doing 1 million conversions with
// these methods (including error checking) and 1 million directly
// with strtoX and atoX methods and observed no significant diffs
// in performance. In addition, keep in mind that these methods are
// typically being used in conjunction with file I/O, which almost
// certainly dominates performance. The only time this might not be
// true is for memory resident "files", mmaps, and/or SSDs.
//
// At some point, it would make sense to enhance this to use locale
// so we can handle different character encodings as well as regional
// specific interpretations (e.g. European 1.234.456,89). The idea
// would be to set the locale VisIt is using (a pref. maybe) and then
// these methods would just use that locale in calls to strtoX. That
// could be achieved globally in VisIt with a call to setlocale().
// However, when in the United States reading an ascii data file
// formatted for human readability in Germany the desire would be
// to specify "de_DE" for the locale during the read of just that
// file suggesting something more complicated than just a global
// setting.

@markcmiller86 markcmiller86 marked this pull request as draft February 14, 2024 00:11
template<typename T> inline T _vstrtonum(char const *numstr, char **eptr, int /* unused */) { return static_cast<T>(strtold(numstr, eptr)); }

// Specialize int/long cases to use int conversion strtol which with base of 0 can handle octal and hex also
#define _VSTRTONUMI(T,F) template<> inline T _vstrtonum<T>(char const *numstr, char **eptr, int base) { return static_cast<T>(F(numstr, eptr, base)); }
Member

instead of macros, we can use std::function along with the template to provide a generic way to forward and apply for each type

Member Author

@markcmiller86 markcmiller86 Sep 17, 2024

I don't think the macro is being used quite the way your comment suggests. Consumers of this interface use only the vstrtonum templatized function...never a macro.

strtold() would work for everything if we didn't care about handling bases other than 10 in conversions (e.g. octal or hex) and we would not need the specializations for integer data which include a base argument.

This macro here is simply being used to easily instantiate several specializations for both signed and unsigned integer data to also support conversions involving bases other than 10. The specialization for unsigned cases adds error detection for negative input values.

Member

Sure, why don't we just do a normal instantiation? Templates should be the tool; I'm not sure why we need both templates and macros.

Member Author

Not sure what the issue is with using macros.

That said, my reason for using them is that they a) make for a lot less typing and reading and, in particular, b) spare the reader from wading through a lot of substantially similar text to identify the key difference between the instantiations.

IMHO, macros make it very clear that the only difference in the three instantiations is the type and function called to perform the conversion.
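
For comparison, one macro-free way to select the conversion routine from the target type is a set of `enable_if`-constrained overloads. This is only a sketch of the alternative being debated, not code from the PR; the `sketch` namespace and the `convert` name are hypothetical, and it sticks to C++11 features since the project is not yet on C++17.

```cpp
#include <cstdlib>
#include <type_traits>

namespace sketch {

// Floating-point targets: the base argument is ignored.
template <typename T>
typename std::enable_if<std::is_floating_point<T>::value, T>::type
convert(char const *s, char **eptr, int /* base */)
{
    return static_cast<T>(std::strtold(s, eptr));
}

// Signed integer targets: base 0 lets strtoll accept octal and hex too.
template <typename T>
typename std::enable_if<std::is_integral<T>::value && std::is_signed<T>::value, T>::type
convert(char const *s, char **eptr, int base)
{
    return static_cast<T>(std::strtoll(s, eptr, base));
}

// Unsigned integer targets.
template <typename T>
typename std::enable_if<std::is_integral<T>::value && std::is_unsigned<T>::value, T>::type
convert(char const *s, char **eptr, int base)
{
    return static_cast<T>(std::strtoull(s, eptr, base));
}

} // namespace sketch
```

The trade-off is the one raised above: the three bodies are spelled out in full, so the differences between them are less compact than in the macro form, but no preprocessor is involved.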

@markcmiller86
Member Author

It occurs to me...maybe we should be thinking about unicode string data as well?

@markcmiller86
Member Author

markcmiller86 commented Mar 20, 2025

From March 20, 2025 special topics meeting conversation...

  • replace macros with templates
  • consider c++ 17 from_chars
  • add ci to enforce use

@markcmiller86
Member Author

Ok, @JustinPrivitera I looked at the from_chars feature of C++17, and it uses only the C locale. I think that is a good reason not to use it, definitely in places where data is being read from a user-provided string (GUI, CLI, ascii database readers).

Given that new information, what do you think?

@JustinPrivitera
Member

Hi @markcmiller86. That makes sense then to not use from_chars. My main issue with this work was adding complexity where I don't think complexity is needed. Adding new features that do the same thing as features built in to C++ seems like a non-idiomatic way of doing things. Maybe error handling and the interface are unified, but adding something like this makes it harder to work on a project like VisIt. The more special ways we have of doing things that are non-idiomatic to C++, the more difficult it is to maintain. If we can add some kind of checker or CI action that can complain about using the standard C++ converters, then that would help mitigate. But I'm not convinced this is worth solving in the first place.

I was overruled, however, by the collective wisdom of the team, so you should go ahead.

@markcmiller86
Member Author

@JustinPrivitera the concerns regarding "complexity" certainly resonate with me. More on that below. Speaking to the ...adding new features that do the same thing as features built in to C++ seems like..., what is proposed here does more than what C++ does.

If you've ever read large ascii data files into VisIt and have them fail for some obscure reason way down in the middle of the file read, it can be next to impossible to figure out where (e.g. which line and text in the input file) VisIt is having trouble. Maybe the whole read fails and nothing is plottable or maybe something small fails, something is plottable but it gets the plot wrong due to a problem with some data that was read (and a reader that didn't bother to error check the input). Getting this right means each and every ascii reader needs to have been coded to account for all the possible ways ascii reads can fail, repeating that error detection and messaging code EVERYWHERE. Hardly anyone is willing to do that and so few readers do any of that.

The work proposed here is meant to SIMPLIFY this for ANY cases where we read ASCII data into some internal integer/float/double data. And, if we're lucky, maybe even handle variations in conventions for handling things like decimal point, comma separator, octal or hex data, etc. That last part should come almost for free if we play our cards right here. As an aside, it would be most useful to report input file line numbers where failures occurred, and this potentially requires many adjustments to the existing loops in readers where ascii data is read. Finally, all of this code needs to be fast because it's in the critical path of problem-sized data reads.

Ok, back to the "complexity" argument...I am confused about how using from_chars instead of strtod (and friends) has any impact on complexity. How does using something "built-in" to C++ address that complexity question any more than using other reasonable options?

@JustinPrivitera
Member

@markcmiller86 Thanks for your detailed and thorough response!

I understand much better what we are after now. Making reading large ASCII files easier is great.

Why is using this new utility an advantage when it comes to simple string to number conversions that occur incidentally in source code? Unrelated to reading data files.

> Ok, back to the "complexity" argument...I am confused about how using from_chars instead of strtod (and friends) has any impact on complexity. How does using something "built-in" to C++ address that complexity question any more than using other reasonable options?

In my view, building any tools on top of built-in language features is adding complexity. Sometimes that complexity is warranted, other times it is not. You've made an excellent case for having a utility for reading large ASCII data files, in which case the complexity is warranted. For simple conversions, it seems to me that built-in language features are more accessible to the average programmer than using a special utility unique to our project, that we have to educate all developers about.

For example, I was confused by intVector and friends for a while in our codebase, and I still don't have a solid grasp of why we had to use those some places instead of std::vector<int>. I'm guessing it has something to do with XML code generation tools.

@markcmiller86
Member Author

I am now rethinking from_chars...

Great follow-up — std::from_chars is significantly faster than the traditional strto* family, especially for integer parsing. For floating-point, it’s still quite fast and beats strtod/strtold in most modern implementations.


std::from_chars — Performance Summary

| Task            | strto*     | std::from_chars (C++17)    |
|-----------------|------------|----------------------------|
| Integer parsing | 20–80 ns   | 3–10 ns ✔️ (5–10× faster)  |
| Float parsing   | 100–300 ns | 15–80 ns ✔️ (2–5× faster)  |

Benchmarks from platforms like x86-64 with GCC ≥ 10, Clang ≥ 11, MSVC ≥ 2019. Actual numbers vary with string length and format.


🚀 Why Is std::from_chars Faster?

  • No locale support (unlike strtol/strtod, which must obey locale rules).
  • No memory allocation, no errno, no thread-local state.
  • Designed for high performance: it's a low-level, parse-only tool (not format-aware like std::stringstream).
  • Can be inlined and SIMD-optimized by the standard library.

🧪 Example: Integer Parsing

#include <charconv>
#include <string_view>
#include <iostream>

void parse_fast(std::string_view input) {
    int value;
    auto result = std::from_chars(input.data(), input.data() + input.size(), value);
    if (result.ec == std::errc()) {
        std::cout << "Parsed value = " << value << "\n";
    } else {
        std::cerr << "Parse failed\n";
    }
}
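
One caveat about the example above: `std::from_chars` reports success as soon as a valid prefix has been parsed, so input such as "12abc" yields 12 without an error. It also does not skip leading whitespace. A stricter variant, sketched here with a hypothetical helper name `parse_int_strict`, additionally checks that the whole input was consumed:

```cpp
#include <charconv>
#include <optional>
#include <string_view>
#include <system_error>

// Hypothetical helper, shown only to illustrate the two checks.
std::optional<int> parse_int_strict(std::string_view input)
{
    int value = 0;
    char const *first = input.data();
    char const *last = first + input.size();
    auto const result = std::from_chars(first, last, value);

    if (result.ec != std::errc())   // invalid_argument or result_out_of_range
        return std::nullopt;
    if (result.ptr != last)         // trailing characters such as "12abc"
        return std::nullopt;
    return value;
}
```

Whether trailing text should count as an error depends on the reader; a tokenizing reader would instead continue from `result.ptr`.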

🧪 Example: Floating Point (C++17/20)

double d;
auto [ptr, ec] = std::from_chars(str.data(), str.data() + str.size(), d);
  • Fast and non-allocating
  • On failure, ec == std::errc::invalid_argument (no number found at the start of the input) or ec == std::errc::result_out_of_range (the number does not fit in the target type)

⚠️ Caveats

  • std::from_chars for floating-point (float, double) is part of C++17, but standard library support arrived late:

    • GCC 11+ (libstdc++)
    • Clang with libc++: later still; check your libc++ release
    • MSVC 2019+
  • from_chars for floating-point may not be available or optimized in older compilers; check your STL version.

  • The C++17 standard declares overloads for float, double, and long double, though library support for long double varies; there is no overload for half (binary16).


🧠 Bottom Line

  • For integer parsing, std::from_chars is a clear winner: 5–10× faster than strtol.
  • For floating-point, it's still 2–5× faster than strtod, but performance depends on STL version.
  • If you're parsing large volumes of ASCII numeric data and care about speed, use from_chars wherever supported.

Would you like a ready-to-run benchmark comparing strtol vs from_chars on your platform?

@markcmiller86
Member Author

However, in the context of ASCII file I/O...

Fantastic and practical question — and you're thinking in exactly the right direction.

✅ Short Answer:

Yes, in many real-world scenarios, ASCII file I/O is slow enough that even the slower strtoXXX or std::from_chars parsing functions won't be the bottleneck — especially if you're reading line-by-line with std::ifstream.

But… it depends on:

  • File size and structure
  • Disk/SSD speed or whether you're streaming over a network
  • How efficient your line-reading and parsing logic is
  • How many values you parse per line

Let’s dig in with actual numbers and some rules of thumb.


⏱️ Ballpark Timing Estimates

| Operation                  | Approx. speed (modern hardware)         |
|----------------------------|-----------------------------------------|
| std::getline() from SSD    | 0.5–2 µs per line (depending on length) |
| std::strtol per integer    | 20–80 ns                                |
| std::from_chars per int    | 5–15 ns ✔️ fast                         |
| strtod / strtold per float | 100–300 ns                              |
| std::from_chars per float  | 20–80 ns ✔️ fast                        |

Example:

If you're reading 1 million lines, each with 5 numbers:

  • File I/O (from SSD) ≈ 1M × 1.5µs = 1.5 seconds
  • Parsing with from_chars (5 ints/line) ≈ 5M × 10ns = ~50 ms
  • Parsing with strtod (5 floats/line) ≈ 5M × 200ns = ~1 second

🔍 So:

  • With strtod, parsing can be a significant chunk of runtime.
  • With from_chars, I/O clearly dominates unless you're doing additional processing.

🔍 Line-by-line I/O is Not Free

Reading line-by-line with std::ifstream and std::getline() is:

  • Easy to write
  • But can be inefficient due to:
    • Multiple memory allocations per line
    • Synchronization with C stdio, which applies to the standard streams (std::cin, std::cout) unless std::ios::sync_with_stdio(false) has been called
    • Lack of buffering if not done carefully

✅ Optimization Tips

If performance starts to matter:

1. Use std::ifstream::rdbuf()->sgetn() for bulk reads (or mmap on Linux)

You can then scan a buffer yourself line-by-line and parse inline.

2. Disable iostream sync with C I/O

std::ios::sync_with_stdio(false);

3. Use from_chars for numeric parsing

It avoids locales, mallocs, and is easily inlined.

4. Parse from a buffer, not line-by-line

Avoids per-line overhead. You can keep track of line numbers manually by counting \n.
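
As a rough illustration of tips 3 and 4 together, the sketch below scans an in-memory buffer of whitespace-separated integers with `std::from_chars`, counting `\n` characters so that a failure can be reported with its line number. The `ParseResult` type and `parse_ints` name are hypothetical, and a real reader would handle floats, comments, and its own file format.

```cpp
#include <charconv>
#include <cstddef>
#include <string_view>
#include <system_error>
#include <vector>

// Hypothetical types and names, for illustration only.
struct ParseResult
{
    std::vector<int> values;
    std::size_t error_line = 0;   // 1-based; 0 means no error
};

ParseResult parse_ints(std::string_view buf)
{
    ParseResult out;
    std::size_t line = 1;
    char const *p = buf.data();
    char const *const end = p + buf.size();

    while (p != end)
    {
        if (*p == '\n') { ++line; ++p; continue; }                  // track line numbers
        if (*p == ' ' || *p == '\t' || *p == '\r') { ++p; continue; }

        int v = 0;
        auto const r = std::from_chars(p, end, v);
        if (r.ec != std::errc())
        {
            out.error_line = line;   // remember where the bad token is
            return out;
        }
        out.values.push_back(v);
        p = r.ptr;                   // continue right after the parsed number
    }
    return out;
}
```

For the input "1 2\n3 x 4\n5\n" this stops at the `x` with `error_line == 2`, having collected the three values before it.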


🧠 Rule of Thumb

| Situation                                        | Best parser                                |
|--------------------------------------------------|--------------------------------------------|
| Fast bulk parse, known format                    | std::from_chars                            |
| Locale-specific input                            | strtod / std::istringstream + locale       |
| High-volume integer parsing                      | std::from_chars                            |
| Reading ~10 values per line, moderate-size files | I/O dominates, don’t worry too much        |
| Massive input (e.g., GBs)                        | Optimize I/O & use from_chars              |

✅ Final Answer:

If you're doing typical line-by-line parsing of scientific ASCII data, and not parsing thousands of numbers per line, then:

Yes — I/O performance will dominate. Use from_chars for its speed and precision, but don’t over-optimize until you have profiling data showing parsing is your bottleneck.


Let me know if you'd like a fast file reader pattern using mmap, buffer-based scanning, or multi-threaded parsing!

@JustinPrivitera
Member

Some late night ChatGPT adventures... sounds like from_chars is the winner then? We could always leave vstrtonum implemented how it is now and change to from_chars when we switch to C++17.

@cyrush
Member

cyrush commented Apr 2, 2025

Having a wrapper method that standardizes our use cases is helpful.

Say C++27 includes a better mouse trap, we update our wrapper implementation and don't have to update the entire code base.

(from_chars existence already demonstrates this)
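
A sketch of what that could look like: the caller-facing signature stays fixed while the body picks a backend by language level. The function name `sketch_parse_long` is hypothetical, and the two backends are not perfectly equivalent (strtol skips leading whitespace and accepts a leading '+', from_chars does neither), so a real wrapper would need to decide which behavior it guarantees.

```cpp
#include <cerrno>
#include <cstdlib>
#include <cstring>
#if __cplusplus >= 201703L
#include <charconv>
#include <system_error>
#endif

// Hypothetical wrapper; callers never see which backend is in use.
// Returns true and sets `out` only if all of `s` is a valid base-10 long.
inline bool sketch_parse_long(char const *s, long &out)
{
#if __cplusplus >= 201703L
    char const *const end = s + std::strlen(s);
    long tmp = 0;
    auto const r = std::from_chars(s, end, tmp);
    if (r.ec != std::errc() || r.ptr != end)
        return false;
    out = tmp;
    return true;
#else
    char *eptr = nullptr;
    errno = 0;  // strtol only sets errno on failure
    long const tmp = std::strtol(s, &eptr, 10);
    if (errno != 0 || eptr == s || *eptr != '\0')
        return false;
    out = tmp;
    return true;
#endif
}
```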

Labels
None yet
Projects
None yet
3 participants