Fix empty community report due to community id mismatch #1437

Open · wants to merge 5 commits into main
Conversation

LevickCG

Description

The current local search module fails to integrate community reports because of a mismatched community_id. This PR changes graphrag/query/structured_search/local_search/mixed_context.py so that the community id used in mixed_context actually matches the reports.

Empty selected communities due to the previously unmatched community id:
[image]

Related Issues

This pull request aims to fix the problem mentioned in the following issue:
issue #1391

Proposed Changes

Change the mismatched community_id lookup (previously a uuid was compared against a human-readable id; now a human-readable id is compared against a human-readable id) so that the selected community reports are no longer empty and the structured information from the community reports is actually used.
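For illustration, here is a minimal sketch of the kind of change involved; the function and attribute names (select_community_reports, report.community_id) are placeholders, not the exact identifiers used in mixed_context.py. The point is that both sides of the lookup are keyed by the same human-readable id before reports are selected:

# Hypothetical sketch; names are illustrative, not the real graphrag API.
def select_community_reports(community_reports, matched_community_ids):
    """Return reports whose human-readable community id is in the matched set."""
    # Key reports by their human-readable id rather than their uuid, so the ids
    # produced during entity/community matching actually hit the lookup.
    reports_by_id = {str(r.community_id): r for r in community_reports}
    return [
        reports_by_id[str(cid)]
        for cid in matched_community_ids
        if str(cid) in reports_by_id
    ]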

Here are the different outputs in comparison:

Previous: [image]
After: [image]

Checklist

  • I have tested these changes locally.
  • I have reviewed the code changes.
  • I have updated the documentation (if necessary).
  • I have added appropriate unit tests (if applicable).

Additional Notes

I have checked the code to make sure it won't affect other modules that might use mixed_context, but I'm not fully aware of every detail of the codebase given the complexity of the project. Please reach out if you notice any interactions or if there are specific areas you'd like me to double-check.

@LevickCG LevickCG requested review from a team as code owners November 24, 2024 07:31
@marianaavelino-bertelsmann

Please check that _cut_batch() is working correctly in community_context.py (line 157). Even when the community ids match, _cut_batch() can also cause no communities to be used during local search.

_init_batch()

for report in selected_reports:
    new_context_text, new_context = _report_context_text(report, attributes)
    new_tokens = num_tokens(new_context_text, token_encoder)

    if batch_tokens + new_tokens > max_tokens:
        # add the current batch to the context data and start a new batch if we are in multi-batch mode
        _cut_batch()  # <-- problem here
        if single_batch:
            break
        _init_batch()

    # add current report to the current batch
    batch_text += new_context_text
    batch_tokens += new_tokens
    batch_records.append(new_context)  # <-- because this doesn't get called first

Why does _cut_batch() cause problems, you ask?

When max_tokens is a small value, e.g. 1200, batch_records.append(new_context) (line 167) is never run in the first place. Consequently, batch_records remains empty and _cut_batch() returns an empty DataFrame. This also results in no communities being used during local search.

Is it possible for max_tokens to be such a small value, you also ask?

Yes, especially during local search, because of the default values in settings: 12_000 * 0.1 = 1200

local_context_params = {
    "community_prop": 0.1,
    "max_tokens": 12_000,
    ...,
}
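
To make the edge case concrete, here is a standalone toy version of that loop (not the actual graphrag code; num_tokens is stubbed with a word count). When the very first report already exceeds max_tokens, the batch is cut while batch_records is still empty:

# Toy reproduction of the edge case, not the project's code.
def num_tokens(text):
    return len(text.split())  # crude stand-in for the tokenizer-based count

reports = ["word " * 2000, "word " * 50]  # the first report alone exceeds the budget
max_tokens = 1200
batch_records, batch_tokens, batches = [], 0, []

for report in reports:
    new_tokens = num_tokens(report)
    if batch_tokens + new_tokens > max_tokens:
        batches.append(list(batch_records))  # _cut_batch() fires with an empty batch
        break  # single-batch mode
    batch_records.append(report)
    batch_tokens += new_tokens

print(batches)  # [[]]  -> no community records make it into the context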

@LevickCG
Author

> Please check that _cut_batch() is working correctly in community_context.py (line 157). Even when the community ids match, _cut_batch() can also cause no communities to be used during local search. […]

If I understand correctly, the code you mention is graphrag/graphrag/query/context_builder/community_context.py
(there is another community_context.py in graphrag/graphrag/query/structured_search/global_search/).

For this case:

What you mean (as I understand it): the selected community may end up empty because _cut_batch() is triggered when the community context exceeds the max token limit.

If that were the case here, the answer (see the "before" and "after" images under "Proposed Changes") would not have changed after my commit. If the context exceeded the token limit, the "after" answer would not have shown report(6). (In my commit, the community context actually grows from zero once the ids are matched.)

For the overall project:

What if we end up with an empty community context due to _cut_batch() anyway?

_cut_batch() in community_context.py is the original author's design choice to avoid exceeding the max token limit. If the community report section is empty because of _cut_batch(), that is not a bug; the code's behavior aligns with the design. In this PR I don't intend to change that design (we don't know the right proportion of each context's contribution to the final answer: edges? nodes? communities? That's yet to be found out).

This PR is a bug fix: before it, the community context could not be added to the context, even when it did not exceed the max token limit, because of the community id mismatch.
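
If it helps, the distinction could also be pinned down with a test along these lines (purely a sketch; build_context and the fixtures are hypothetical placeholders, not the project's real test API): with matching ids and a token budget large enough for at least one report, the selected community reports should be non-empty.

# Hypothetical test sketch; names are placeholders, not the real graphrag test API.
def test_reports_selected_when_ids_match(build_context, sample_reports):
    # Budget comfortably larger than any single report, so _cut_batch()
    # trimming cannot be the reason for an empty result.
    context_text, context_records = build_context(
        community_reports=sample_reports,
        max_tokens=12_000,
    )
    assert len(context_records["reports"]) > 0  # ids match -> reports are selected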
