Fix empty community report due to community id mismatch #1437

Open · wants to merge 5 commits into main
Conversation

LevickCG

Description

The current local search module fails to integrate community reports because of a mismatched community_id. This PR changes graphrag/query/structured_search/local_search/mixed_context.py so that the community id used in mixed_context actually matches the reports.

Empty selected communities due to the previously unmatched community id:
[image]

Related Issues

This pull request aims to fix the problem mentioned in the following issue:
issue #1391

Proposed Changes

Change the mismatched community_id lookup (previously a uuid was compared against a human-readable id; now a human-readable id is compared against a human-readable id) so that the selected community reports are no longer empty and the structured information from the community reports is actually used.
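For illustration, here is a minimal sketch of the kind of change involved; the function and attribute names (select_community_reports, report.community_id) are placeholders, not the exact identifiers used in mixed_context.py. The point is that both sides of the lookup are keyed by the same human-readable id before reports are selected:

# Hypothetical sketch; names are illustrative, not the real graphrag API.
def select_community_reports(community_reports, matched_community_ids):
    """Return reports whose human-readable community id is in the matched set."""
    # Key reports by their human-readable id rather than their uuid, so the ids
    # produced during entity/community matching actually hit the lookup.
    reports_by_id = {str(r.community_id): r for r in community_reports}
    return [
        reports_by_id[str(cid)]
        for cid in matched_community_ids
        if str(cid) in reports_by_id
    ]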

Here are the different outputs in comparison:

Previous: [image]
After: [image]

Checklist

  • I have tested these changes locally.
  • I have reviewed the code changes.
  • I have updated the documentation (if necessary).
  • I have added appropriate unit tests (if applicable).

Additional Notes

I have checked the code to make sure it won't affect other modules that might use mixed_context, but I'm not fully aware of every detail of the codebase given the complexity of the project. Please reach out if you notice any interactions or if there are specific areas you'd like me to double-check.

@LevickCG LevickCG requested review from a team as code owners November 24, 2024 07:31
@marianaavelino-bertelsmann

Please check that _cut_batch() is working correctly in community_context.py (line 157). Even when the community ids match, _cut_batch() can also cause no communities to be used during local search.

_init_batch()

for report in selected_reports:
    new_context_text, new_context = _report_context_text(report, attributes)
    new_tokens = num_tokens(new_context_text, token_encoder)

    if batch_tokens + new_tokens > max_tokens:
        # add the current batch to the context data and start a new batch if we are in multi-batch mode
        _cut_batch()  # <-- problem here
        if single_batch:
            break
        _init_batch()

    # add current report to the current batch
    batch_text += new_context_text
    batch_tokens += new_tokens
    batch_records.append(new_context)  # <-- because this doesn't get called first

Why does _cut_batch() cause problems, you ask?

When max_tokens is a small value, e.g. 1200, batch_records.append(new_context) (line 167) is never run in the first place. Consequently, batch_records remains empty and _cut_batch() returns an empty DataFrame. This also results in no communities being used during local search.

Is it possible for max_tokens to be such a small value, you also ask?

Yes, especially during local search, because of the default values in settings: 12_000 * 0.1 = 1200

local_context_params = {
    "community_prop": 0.1,
    "max_tokens": 12_000,
    ...,
}
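
To make the edge case concrete, here is a standalone toy version of that loop (not the actual graphrag code; num_tokens is stubbed with a word count). When the very first report already exceeds max_tokens, the batch is cut while batch_records is still empty:

# Toy reproduction of the edge case, not the project's code.
def num_tokens(text):
    return len(text.split())  # crude stand-in for the tokenizer-based count

reports = ["word " * 2000, "word " * 50]  # the first report alone exceeds the budget
max_tokens = 1200
batch_records, batch_tokens, batches = [], 0, []

for report in reports:
    new_tokens = num_tokens(report)
    if batch_tokens + new_tokens > max_tokens:
        batches.append(list(batch_records))  # _cut_batch() fires with an empty batch
        break  # single-batch mode
    batch_records.append(report)
    batch_tokens += new_tokens

print(batches)  # [[]]  -> no community records make it into the context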

@LevickCG
Author

> Please check that _cut_batch() is working correctly in community_context.py (line 157). Even when the community ids match, _cut_batch() can also cause no communities to be used during local search. […]

If I understand correctly, the code you mention is graphrag/graphrag/query/context_builder/community_context.py
(there is another community_context.py in graphrag/graphrag/query/structured_search/global_search/).

For this case:

What you mean (as I understand it): the selected community may end up empty because _cut_batch() is triggered when the community context exceeds the max token limit.

If that were the case here, the answer (see the "before" and "after" images under "Proposed Changes") would not have changed after my commit. If the context exceeded the token limit, the "after" answer would not have shown report(6). (In my commit, the community context actually grows from zero once the ids are matched.)

For the overall project:

What if we end up with an empty community context due to _cut_batch() anyway?

_cut_batch() in community_context.py is the original author's design choice to avoid exceeding the max token limit. If the community report section is empty because of _cut_batch(), that is not a bug; the code's behavior aligns with the design. In this PR I don't intend to change that design (we don't know the right proportion of each context's contribution to the final answer: edges? nodes? communities? That's yet to be found out).

This PR is a bug fix: before it, the community context could not be added to the context, even when it did not exceed the max token limit, because of the community id mismatch.
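
If it helps, the distinction could also be pinned down with a test along these lines (purely a sketch; build_context and the fixtures are hypothetical placeholders, not the project's real test API): with matching ids and a token budget large enough for at least one report, the selected community reports should be non-empty.

# Hypothetical test sketch; names are placeholders, not the real graphrag test API.
def test_reports_selected_when_ids_match(build_context, sample_reports):
    # Budget comfortably larger than any single report, so _cut_batch()
    # trimming cannot be the reason for an empty result.
    context_text, context_records = build_context(
        community_reports=sample_reports,
        max_tokens=12_000,
    )
    assert len(context_records["reports"]) > 0  # ids match -> reports are selected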
