
[Bug]: Errors in local search #451

Closed
CCzzzzzzz opened this issue Jul 9, 2024 · 27 comments

Labels
community_support Issue handled by community members

Comments

@CCzzzzzzz

Describe the bug

I successfully ran the global search, but I encountered an error when running the local search.

Error embedding chunk {'OpenAIEmbedding': 'Error code: 400 - {'error': "'input' field must be a string or an array of strings"}'}
Traceback (most recent call last):
  File "C:\Users\cpdft\.conda\envs\myconda\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\cpdft\.conda\envs\myconda\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\cpdft\.conda\envs\myconda\lib\site-packages\graphrag\query\__main__.py", line 75, in <module>
    run_local_search(
  File "C:\Users\cpdft\.conda\envs\myconda\lib\site-packages\graphrag\query\cli.py", line 154, in run_local_search
    result = search_engine.search(query=query)
  File "C:\Users\cpdft\.conda\envs\myconda\lib\site-packages\graphrag\query\structured_search\local_search\search.py", line 118, in search
    context_text, context_records = self.context_builder.build_context(
  File "C:\Users\cpdft\.conda\envs\myconda\lib\site-packages\graphrag\query\structured_search\local_search\mixed_context.py", line 139, in build_context
    selected_entities = map_query_to_entities(
  File "C:\Users\cpdft\.conda\envs\myconda\lib\site-packages\graphrag\query\context_builder\entity_extraction.py", line 55, in map_query_to_entities
    search_results = text_embedding_vectorstore.similarity_search_by_text(
  File "C:\Users\cpdft\.conda\envs\myconda\lib\site-packages\graphrag\vector_stores\lancedb.py", line 118, in similarity_search_by_text
    query_embedding = text_embedder(text)
  File "C:\Users\cpdft\.conda\envs\myconda\lib\site-packages\graphrag\query\context_builder\entity_extraction.py", line 57, in <lambda>
    text_embedder=lambda t: text_embedder.embed(t),
  File "C:\Users\cpdft\.conda\envs\myconda\lib\site-packages\graphrag\query\llm\oai\embedding.py", line 96, in embed
    chunk_embeddings = np.average(chunk_embeddings, axis=0, weights=chunk_lens)
  File "C:\Users\cpdft\.conda\envs\myconda\lib\site-packages\numpy\lib\function_base.py", line 550, in average
    raise ZeroDivisionError(
ZeroDivisionError: Weights sum to zero, can't be normalized

Steps to reproduce

No response

Expected Behavior

No response

GraphRAG Config Used

llm:
  api_key: ollama
  type: openai_chat # or azure_openai_chat
  model: gemma2
  model_supports_json: true # recommended if this is available for your model.
  api_base: http://localhost:11434/v1

embeddings:
  llm:
    api_key: lm-studio
    type: openai_embedding # or azure_openai_embedding
    model: nomic-ai\nomic-embed-text-v1.5-GGUF\nomic-embed-text-v1.5.Q4_K_M.gguf
    api_base: http://localhost:1234/v1

Logs and screenshots

No response

Additional Information

  • GraphRAG Version:
  • Operating System:
  • Python Version:
  • Related Issues:
@CCzzzzzzz CCzzzzzzz added bug Something isn't working triage Default label assignment, indicates new issue needs reviewed by a maintainer labels Jul 9, 2024
@di3n0

di3n0 commented Jul 9, 2024

Same here. It looks like a base64-vs-string problem. The screenshot is from LM Studio.
[screenshot of LM Studio]

@CCzzzzzzz
Author

Same here. It looks like a base64-vs-string problem. The screenshot is from LM Studio. […]

Why is this problem only in local search? How did you solve?

@di3n0

di3n0 commented Jul 9, 2024

Why is this problem only in local search? How did you solve?

Sorry, I haven't solved it yet.
But I've been looking at information related to LM Studio:
langchain-ai/langchain#21318
From what you describe, it does feel like a local-search issue.
Thank you, that gave me some inspiration.

@Kingatlas115

It's something in the community extraction scripts or the LLM parser scripts; I just can't nail down what.

@CCzzzzzzz
Author

Same here. It looks like a base64-vs-string problem. […]

It really does seem to be a problem with LM Studio. I don't know how to fix it there, but it worked for me with xinference.

@812406210

It really does seem to be a problem with LM Studio. I don't know how to fix it there, but it worked for me with xinference.

Hello, I ran into the same problem using ollama. If I use xinference, how do I get the API URL?

@CCzzzzzzz
Author

If I use xinference, how do I get the API URL?

http://localhost:"ollama_or_xinference_default_port"/v1
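For reference: Ollama's default port is 11434 (as in the config above) and Xinference's local default is typically 9997, so the embeddings block would look roughly like this (a sketch; the model name is a placeholder, verify the ports against your own deployment):

embeddings:
  llm:
    type: openai_embedding
    model: nomic-embed-text
    api_base: http://localhost:11434/v1  # Ollama default port
    # api_base: http://localhost:9997/v1 # Xinference default port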

@812406210

http://localhost:"ollama_or_xinference_default_port"/v1

Thanks, but with xinference I now get this error:
ValueError: Query vector size 768 does not match index column size 1536

@KylinMountain
Contributor

Thanks, but with xinference I now get this error: ValueError: Query vector size 768 does not match index column size 1536

It looks like you are using different embedding models at index time and at query time?
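A quick way to confirm a mismatch is to request one embedding from the query-time endpoint and compare its length with the dimension the index was built with. A minimal sketch, assuming an OpenAI-compatible server; the base_url, api_key, and model below are placeholders:

import tiktoken  # not needed here; see the openai client below
from openai import OpenAI

# Placeholder values -- point these at the embedding server you query with.
client = OpenAI(base_url="http://localhost:9997/v1", api_key="xinference")
resp = client.embeddings.create(model="nomic-embed-text", input="dimension check")
print(len(resp.data[0].embedding))  # e.g. 768; must equal the index column size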

@KylinMountain
Contributor

I am using llama.cpp to serve the embedding API; it is more stable. You could try that.

@goodmaney

Same here. I use the Python script app.py. Maybe it's an int-vs-str issue:

Error embedding chunk {'OpenAIEmbedding': "Error code: 422 - {'detail': [{'type': 'string_type', 'loc': ['body', 'input', 0], 'msg': 'Input should be a valid string', 'input': 3923, 'url': 'https://errors.pydantic.dev/2.7/v/string_type'}, {'type': 'string_type', 'loc': ['body', 'input', 1], 'msg': 'Input should be a valid string', 'input': 527, 'url': 'https://errors.pydantic.dev/2.7/v/string_type'}, {'type': 'string_type', 'loc': ['body', 'input', 2], 'msg': 'Input should be a valid string', 'input': 279, 'url': 'https://errors.pydantic.dev/2.7/v/string_type'}
.......................
ZeroDivisionError: Weights sum to zero, can't be normalized

@goodmaney

Thanks, but with xinference I now get this error: ValueError: Query vector size 768 does not match index column size 1536

I re-ran [python -m graphrag.index --init --root ./ragtest] with xinference embeddings and it no longer reports that error. The "768 does not match index column size 1536" error appears when I build the index with the Python script and then query via xinference. But local search returns nothing 😂, with no error.

@unxd9c

unxd9c commented Jul 10, 2024

The issue is that --method local does not work out of the box with open-source embedding models.
This is because of the way OpenAI's text-embedding-3-small model works: it takes token IDs as input, while open-source models like nomic-embed-text take text as input.
So you need to convert the token IDs back to text before using an open-source model.

The solution is to add one line to the package's graphrag/query/llm/oai/embedding.py "embed" function:

...
def embed(self, text: str, **kwargs: Any) -> list[float]:
        """
        Embed text using OpenAI Embedding's sync function.

        For text longer than max_tokens, chunk texts into max_tokens, embed each chunk, then combine using weighted average.
        Please refer to: https://github.com/openai/openai-cookbook/blob/main/examples/Embedding_long_inputs.ipynb
        """
        token_chunks = chunk_text(
            text=text, token_encoder=self.token_encoder, max_tokens=self.max_tokens
        )
        chunk_embeddings = []
        chunk_lens = []
        for chunk in token_chunks:
            # decode chunk from token ids to text (added line after row 83)
            chunk = self.token_encoder.decode(chunk)
            try:
                embedding, chunk_len = self._embed_with_retry(chunk, **kwargs)
                chunk_embeddings.append(embedding)
                chunk_lens.append(chunk_len)
            # TODO: catch a more specific exception
            except Exception as e:  # noqa BLE001
                self._reporter.error(
                    message="Error embedding chunk",
                    details={self.__class__.__name__: str(e)},
                )

                continue
        chunk_embeddings = np.average(chunk_embeddings, axis=0, weights=chunk_lens)
        chunk_embeddings = chunk_embeddings / np.linalg.norm(chunk_embeddings)
        return chunk_embeddings.tolist()
...
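For illustration of why that one line helps: chunk_text yields lists of token IDs (which is why token_encoder.decode is applicable), and the added decode call turns each chunk back into plain text before it is sent to the server. A minimal sketch, assuming the cl100k_base encoding; the example sentence is made up:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
chunk = enc.encode("What are the top themes in this data?")
print(chunk[:3])          # token IDs (integers) -- what the server was receiving
print(enc.decode(chunk))  # plain text -- what open-source embedding servers expect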

@goodmaney

It really does seem to be a problem with LM Studio. I don't know how to fix it there, but it worked for me with xinference.

Did local search return anything for you? It returns nothing for me, with no error.

@Atarasin

The issue is that --method local does not work out of the box with open-source embedding models. […] (see the fix from @unxd9c above)

I can use local search this way now, thank you so much.

@karthik-codex

karthik-codex commented Jul 14, 2024

The issue is that --method local does not work out of the box with open-source embedding models. […] (see the fix from @unxd9c above)

Could you show how you modified the _embed_with_retry function in embedding.py?

I got the embedding to work, but later got an error that says "Error: Query vector size 768 does not match index column size 3072". 768 is the length of my embedding vector for the query; I'm not sure where 3072 comes from. I use nomic-embed-text from Ollama.

@unxd9c

unxd9c commented Jul 14, 2024

I got the embedding to work, but later got an error that says "Error: Query vector size 768 does not match index column size 3072". […]

I had a similar-looking problem when I tried to use different models for indexing (text-embedding-3-small, which generates 1536-dimensional vectors) and search (nomic-embed-text, which generates 768-dimensional vectors). 3072 looks like text-embedding-3-large.

By the way, my _embed_with_retry is untouched. And I have not yet found a way to use Ollama's embedding models for search (I use LM Studio).

    def _embed_with_retry(
        self, text: str | tuple, **kwargs: Any
    ) -> tuple[list[float], int]:
        try:
            retryer = Retrying(
                stop=stop_after_attempt(self.max_retries),
                wait=wait_exponential_jitter(max=10),
                reraise=True,
                retry=retry_if_exception_type(self.retry_error_types),
            )
            for attempt in retryer:
                with attempt:
                    embedding = (
                        self.sync_client.embeddings.create(  # type: ignore
                            input=text,
                            model=self.model,
                            **kwargs,  # type: ignore
                        )
                        .data[0]
                        .embedding
                        or []
                    )
                    return (embedding, len(text))
        except RetryError as e:
            self._reporter.error(
                message="Error at embed_with_retry()",
                details={self.__class__.__name__: str(e)},
            )
            return ([], 0)
        else:
            # TODO: why not just throw in this case?
            return ([], 0)
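To confirm which dimension an existing index was actually built with, you can also inspect the LanceDB table schema directly. A minimal sketch — the database path and table name here are assumptions based on common graphrag defaults, so adjust them to your own output folder:

import lancedb

# Path and table name are assumptions -- check your settings and output folder.
db = lancedb.connect("./lancedb")
tbl = db.open_table("entity_description_embeddings")
print(tbl.schema)  # the vector column's fixed-size-list length is the index dimension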

@1997shp

1997shp commented Jul 18, 2024

The issue is that --method local does not work out of the box with open-source embedding models. […] (see the fix from @unxd9c above)

I can use local search this way too, thank you so much.

@karthik-codex

karthik-codex commented Jul 18, 2024

I fixed this as well. You can find my repo for local indexing and search here. https://medium.com/@karthik.codex/microsofts-graphrag-autogen-ollama-chainlit-fully-local-free-multi-agent-rag-superbot-61ad3759f06f https://github.com/karthik-codex/autogen_graphRAG

@sdjd93dj

sdjd93dj commented Jul 19, 2024

@karthik-codex

I fixed this as well. You can find my repo for local indexing and search here. […]

Apologies for going off-topic, but since you've successfully run global search, did you have to make any hotfixes for that, or was it all smooth sailing?

I ran into this JSON issue, which has this fix and this fix.

Perhaps there's no answer, but I'm a bit curious why you might not have run into the issue, unless it simply isn't discussed in the blog.

@karthik-codex

Apologies for going off-topic, but since you've successfully run global search, did you have to make any hotfixes for that, or was it all smooth sailing? […]

No, I did not hit any of these issues. I also used Mistral instead of Llama, which a YouTuber suggested for its longer context window.


@adirsingh96

adirsingh96 commented Jul 21, 2024

The issue is that --method local does not work out of the box with open-source embedding models. […] (see the fix from @unxd9c above)

This works, but it gives completely out-of-context answers.

@natoverse
Collaborator

Consolidating alternate model issues here: #657

@natoverse natoverse closed this as not planned Won't fix, can't repro, duplicate, stale Jul 22, 2024
@natoverse natoverse added community_support Issue handled by community members and removed bug Something isn't working triage Default label assignment, indicates new issue needs reviewed by a maintainer labels Jul 22, 2024
@yurochang

The issue is that --method local does not work out of the box with open-source embedding models. […] (see the fix from @unxd9c above)

Cool, it solved the problem!

@adirsingh96

Hey, are you getting relevant answers?

@lyyf2002

I fixed it and created PR 568. I hope it will be merged soon.
