Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parsing is slow for large arrays #278

Open
BrouxtForce opened this issue Oct 17, 2024 · 2 comments
Open

Parsing is slow for large arrays #278

BrouxtForce opened this issue Oct 17, 2024 · 2 comments

Comments

@BrouxtForce
Copy link

I have recently tried to switch from toml++ (seeing as toml++ does not preserve whitespace or comments), but toml11 is too slow at parsing large arrays (it seems that computational complexity is O(N2) with array size). For whatever reason, I have cursed myself with these long arrays in my TOML files, so now my application stalls for several seconds when I load such a file.

Here's a simple example with a array with length 5000:

#include <iostream>
#include <sstream>
#include <chrono>
#include <toml.hpp>

int main(int argc, char** argv)
{
    std::stringstream ss;

    ss << "[Numbers]\narray = [";
    for (int i = 0; i < 5000; i++)
    {
        if (i != 0) ss << ", ";
        ss << i;
    }
    ss << "]\n";

    std::cout << ss.str() << std::endl;;

    std::cout << "Start parsing..." << std::endl;

    auto start = std::chrono::steady_clock::now();
    toml::value input = toml::parse_str(ss.str());
    auto end = std::chrono::steady_clock::now();

    std::cout << "Finished parsing!" << std::endl;
    std::cout << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() << "ms" << std::endl;
}
@ToruNiina
Copy link
Owner

ToruNiina commented Oct 17, 2024

The problems reported here are different from how I usually feel when I use this library, so I took additional benchmarks.

The code I used is below. In addition to your example, I measured the speed with a looong array having line breaks for each element. 100 elements to 5000 elements were plotted. The results are as follows.

#include <toml.hpp>
#include <iostream>

int main()
{
    for(std::size_t i=1; i<=50; ++i)
    {
        std::string with_newlines;
        with_newlines += "[table]\n";
        with_newlines += "array = [\n";
        for(std::size_t e=0; e<i*100; ++e)
        {
            with_newlines += std::to_string(e);
            with_newlines += ",\n";
        }
        with_newlines += "]\n";

        std::string without_newlines;
        without_newlines += "[table]\n";
        without_newlines += "array = [";
        for(std::size_t e=0; e<i*100; ++e)
        {
            without_newlines += std::to_string(e);
            without_newlines += ", ";
        }
        without_newlines += "]\n";

        const auto w_start = std::chrono::system_clock::now();
        const auto data1 = toml::parse_str(with_newlines);
        const auto w_stop = std::chrono::system_clock::now();

        const auto w_dur = std::chrono::duration_cast<std::chrono::microseconds>(w_stop - w_start).count() * 0.001;

        const auto wo_start = std::chrono::system_clock::now();
        const auto data2 = toml::parse_str(without_newlines);
        const auto wo_stop = std::chrono::system_clock::now();

        const auto wo_dur = std::chrono::duration_cast<std::chrono::microseconds>(wo_stop - wo_start).count() * 0.001;

        std::cout << i * 100 << " " << w_dur << " " << wo_dur << std::endl;
    }
}

bench

I see, you are right, it seems to be O(n^2) computations in case of one loooong line.
At the same time, it appears to be O(n) in case of one element per line.

This result narrows down the cause of this slowdown. Probably the part that parses indentation depth or column numbering retraces or searches the whole line per element. I think I can resolve this in this weekend.

If you want to solve it without waiting for that, I recommend inserting a line break every certain number of elements.

(edit: I forgot to paste the code I used.)

@ToruNiina
Copy link
Owner

I found the cause of the slowdown.

  • toml::parse_integer calls type_config::parse_int to allow custom integer type.
  • type_config::parse_int takes toml::source_location for easier error message generation.
  • toml::source_location saves the whole line that include the region of interest and counts the column number of the region in the line

Here I modified toml::detail::location type to track the column number and omitted unrelated part of the line in toml::source_location. Now it seems to be O(n) without any LF.

bench_new

But it changes the error message a bit. I think I need to check if the new error message looks okay. After checking it, I will merge the branch and will close this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants