ctre gives different result compared with icu and rust #286

Open
DamonsJ opened this issue Jun 26, 2023 · 6 comments

Comments

@DamonsJ

DamonsJ commented Jun 26, 2023

Here is the test code:

int test2()
{
    using namespace std::literals;
    //std::string original = "𝔾𝕠𝕠𝕕 𝕞𝕠𝕣𝕟𝕚𝕟𝕘 𝔾𝕠𝕠𝕕 𝕞𝕠𝕣𝕟𝕚𝕟𝕘";
    std::string original = "戦場のヴァルキュリア3";
    auto bdata = original.data();
    static constexpr auto  pattern = ctll::fixed_string{"\\w+|[^\\w\\s]+"};
    auto matcher = ctre::search<pattern>;
    
    std::string_view cur_data((char*)original.data(),original.size());
    
    std::vector<std::pair<std::pair<int32_t, int32_t>, bool>> splits;
    splits.reserve(original.size());
    int prev = 0;
    bool is_matched =false;
    do {
        auto matched = matcher(cur_data);
        is_matched = matched;
        if (is_matched){
            
            auto start_byte_index =  matched.begin() - original.data();
            auto end_byte_index =  matched.end() - original.data();
            
            
            if (prev != start_byte_index) {
                std::pair<int32_t, int32_t> p(prev, start_byte_index);
                splits.push_back(
                                 std::pair<std::pair<int32_t, int32_t>, bool>(p, false));
            }
            std::pair<int32_t, int32_t> p(start_byte_index, end_byte_index);
            splits.push_back(std::pair<std::pair<int32_t, int32_t>, bool>(p,
                                                                          true));
            prev = end_byte_index;
            int pos = matched.end() - cur_data.data();
            cur_data.remove_prefix(pos);
        }
    } while(is_matched);
    return 0;
}

Rust and ICU give the same result: the matched string is "戦場のヴァルキュリア3".
ctre gives two parts, "戦場のヴァルキュリア" and "3".
Why does that happen?

@hanickadot
Owner

Can you minimize it?

@DamonsJ
Author

DamonsJ commented Jun 26, 2023

Yes!

int test2()
{
   
    std::string original = "戦場のヴァルキュリア3";
    int size_of_str = original.size(); // size_of_str = 31;
    auto bdata = original.data();
    static constexpr auto  pattern = ctll::fixed_string{"\\w+|[^\\w\\s]+"};
    auto matcher = ctre::search<pattern>;
    std::string_view cur_data((char*)original.data(),original.size());

    int prev = 0;
    bool is_matched =false;
    do {
        auto matched = matcher(cur_data);
        is_matched = matched;
        if (is_matched){
            int pos = matched.end() - cur_data.data();
            cur_data.remove_prefix(pos);
        }
    } while(is_matched);
    return 0;
}

This code gives me two matches: one is "戦場のヴァルキュリア" and the other is "3".

But when I run the same regex search using the ICU library and Rust, they give me one match: "戦場のヴァルキュリア3".
So why does that happen?

@DamonsJ
Author

DamonsJ commented Jun 26, 2023

By the way, if I use this string:
std::string original = "Media.Vision";
ctre, the ICU library, and Rust all give the same three matches:

  1. "Media"
  2. "."
  3. "Vision"

@iulian-rusu

In Rust, \w+ is Unicode-aware: it matches word characters in any script (roughly equivalent to [\p{L}\p{N}_]).
In PCRE, \w by default matches only ASCII letters, digits, and underscore.

https://regex101.com/r/jVmHsw/1
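
To make the difference concrete, here is a minimal sketch that emulates a repeated search for \w+|[^\w\s]+ over raw UTF-8 bytes with an ASCII-only \w and \s. It does not use ctre; the helper names (is_ascii_word, is_ascii_space, ascii_word_split) are made up for illustration. Every byte of a multi-byte UTF-8 sequence is 0x80 or above, so it is neither an ASCII word character nor ASCII whitespace and ends up in the [^\w\s]+ branch.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// ASCII-only notion of \w: letters, digits, underscore.
constexpr bool is_ascii_word(unsigned char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9') || c == '_';
}

// ASCII-only notion of \s: space, \t, \n, \v, \f, \r.
constexpr bool is_ascii_space(unsigned char c) {
    return c == ' ' || (c >= '\t' && c <= '\r');
}

// Hypothetical helper: returns the successive matches that a repeated
// search for \w+|[^\w\s]+ would produce under the ASCII-only definitions.
std::vector<std::string> ascii_word_split(std::string_view input) {
    std::vector<std::string> pieces;
    std::size_t i = 0;
    while (i < input.size()) {
        const unsigned char c = static_cast<unsigned char>(input[i]);
        if (is_ascii_space(c)) {  // whitespace matches neither alternative
            ++i;
            continue;
        }
        const bool word = is_ascii_word(c);  // which alternative starts here
        std::size_t j = i;
        while (j < input.size()) {
            const unsigned char d = static_cast<unsigned char>(input[j]);
            if (is_ascii_space(d) || is_ascii_word(d) != word) break;
            ++j;
        }
        pieces.emplace_back(input.data() + i, j - i);
        i = j;
    }
    return pieces;
}
```

Under these assumptions, "戦場のヴァルキュリア3" splits into the 30 non-ASCII bytes followed by "3", while "Media.Vision" splits into "Media", ".", and "Vision", which lines up with the results reported above.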

@marzer

marzer commented Jun 28, 2023

For a compile-time regex library to be fully Unicode-aware is a huge ask, FYI @DamonsJ. Unicode is incredibly complex, requiring lots of very large lookup-tables and other short-circuiting mechanisms to implement all the code point identification logic correctly and efficiently.
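
As a rough illustration of the kind of lookup involved, here is a sketch of classifying a code point by binary search over a sorted table of ranges. The table below (kSampleLetterRanges) is a tiny, deliberately incomplete sample, and is_sample_letter is a hypothetical name; a real implementation needs many hundreds of ranges generated from the Unicode Character Database, and this is not how ctre itself does it.

```cpp
#include <algorithm>
#include <array>

struct CodePointRange {
    char32_t first;
    char32_t last;  // inclusive
};

// Sorted, non-overlapping sample ranges. NOT a complete \p{L} table.
constexpr std::array<CodePointRange, 5> kSampleLetterRanges{{
    {U'A', U'Z'},
    {U'a', U'z'},
    {0x3041, 0x3096},  // Hiragana letters
    {0x30A1, 0x30FA},  // Katakana letters
    {0x4E00, 0x9FFF},  // CJK Unified Ideographs block
}};

// Hypothetical helper: true if cp falls inside one of the sample ranges.
bool is_sample_letter(char32_t cp) {
    // Find the first range that starts after cp; the candidate is the one before it.
    auto it = std::upper_bound(
        kSampleLetterRanges.begin(), kSampleLetterRanges.end(), cp,
        [](char32_t value, const CodePointRange& r) { return value < r.first; });
    if (it == kSampleLetterRanges.begin()) return false;
    --it;
    return cp <= it->last;
}
```

Even this toy version shows why the full problem is large: every property such as \p{L}, \p{N}, or \p{M} needs its own table, and a compile-time library has to carry them through constant evaluation.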

@DamonsJ
Author

DamonsJ commented Jun 29, 2023

Thanks very much, @marzer and @iulian-rusu.

I know it is hard to fully support Unicode in a regex library!

To solve my problem, I wrote the pattern like this:

static constexpr auto pattern = ctll::fixed_string{
        "[\\p{L}\\p{N}\\p{M}\\p{Pc}]+|[^\\p{L}\\p{N}\\p{M}\\p{Pc}\\p{Zs}\\u{A}\\u{B}\\u{C}\\u{D}"
        "\\u{85}\\u{2028}\\u{2029}\\u{DA}]+"};

It works for me, although it is not exactly the same as:

static constexpr auto  pattern = ctll::fixed_string{"\\w+|[^\\w\\s]+"};

I hope this helps others who have the same problem!
