-
Notifications
You must be signed in to change notification settings - Fork 109
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Parsing can become accidentally quadratic because of sscanf #40
Comments
First of all, I'm happy that as a matter of course ryml is already expected to be faster than alternatives by at least 10x, to the point that a mere 3x speedup is considered an issue! But of course, this is a nasty issue, so thanks for detecting it and reporting. The stringifying landscape in C/C++ is bleak, even after 50 years of language life. Stringifying/destringifying floats is really hard, and it took until C++17 to have that with a non-zero-terminated bounded string. Are you using C++17 or later? I've been meaning to use But anyway, ryml has to support everything down to and including C++11, and here we have no recourse in the standard. You've likely looked into these c4::atof() functions, and seen the song and dance they have to perform to guarantee boundedness. And now it looks as glibc's sscanf() is looking for a non-existing terminating zero character, so it fails (and is likely incurring a read-after-end bug). I'm going to look for alternatives to snprintf()/sscanf() for float types; I was already thinking about doing this. The first one is hard to find (because it's really hard to do) and I won't even attempt to write it, but the second one is easier. Maybe I'll implement snscanf() and just keep the calls to snprintf(). Also, bear in mind that this issue is with the c4core repo. I'm tracking the issue there, and will close this after the c4core submodule is updated here. |
Oh right, I forgot about the C++17 or later only std::from_chars/to_chars... Yeah it's unfortunate that there doesn't seem to be any standard function that can handle non-zero terminated strings until C++17. As a workaround I am currently getting the value from the NodeRef manually and since I am already using abseil I simply used SimpleAtod (which internally uses a port of the C++17 from_chars). |
Quick note to say I have not forgotten this and I mean to return to it when I can. |
@leoetlino I finally found something suitable, and also exceedingly faster than sscanf: fast_float. It should fix your problem. |
I'll ask the stupid question: |
@dataf3l exactly something like that is in effect: see the code here |
While investigating a performance issue in a library I am working on, I noticed that parsing a particular document (warning: very large file) causes the program to spend a lot of time in
memchr
(roughly 70% of its time). Pausing execution in gdb gives the following stack trace: (this is an image to keep the colors)to the point that parsing is only ~3x faster than doing it with PyYAML. The speedup I get for other files is >10x.
It looks like the problem is that reading floats is done using c4::from_chars -> c4::atof -> sscanf which ultimately calls _IO_str_init_static_internal in the glibc implementation. As far as I understand, this essentially causes strlen to be called on the string that rapidyaml passes to it, essentially causing strlen to be called O(n^2) times.
Since there are a lot of floats throughout the entire document, parsing appears to become accidentally quadratic.
Not sure how to fix this. Even using the standard atof might not help since that function doesn't take the length either...
The text was updated successfully, but these errors were encountered: