Add Cosmos DB storage/cache option #1431

Merged on Dec 19, 2024 (130 commits)
6cae56c
added cosmosdb constructor and database methods
KennyZhang1 Nov 4, 2024
80bd11a
added rest of abstract method headers
KennyZhang1 Nov 4, 2024
e846edc
added cosmos db container methods
KennyZhang1 Nov 4, 2024
439409f
implemented has and delete methods
KennyZhang1 Nov 5, 2024
b010d14
finished implementing abstract class methods
KennyZhang1 Nov 5, 2024
e1beb4d
Merge branch 'main' of github.com:microsoft/graphrag into add-cosmosd…
KennyZhang1 Nov 5, 2024
923a791
integrated class into storage factory
KennyZhang1 Nov 6, 2024
8cd45cb
integrated cosmosdb class into cache factory
KennyZhang1 Nov 6, 2024
b041e51
added support for new config file fields
KennyZhang1 Nov 6, 2024
66ace0d
Merge branch 'main' of github.com:microsoft/graphrag into add-cosmosd…
KennyZhang1 Nov 6, 2024
b263569
Merge branch 'main' of github.com:microsoft/graphrag into add-cosmosd…
KennyZhang1 Nov 7, 2024
a76eb54
replaced primary key cosmosdb initialization with connection strings
KennyZhang1 Nov 8, 2024
73d1e42
Merge branch 'main' of github.com:microsoft/graphrag into add-cosmosd…
KennyZhang1 Nov 11, 2024
5436166
Merge branch 'main' of github.com:microsoft/graphrag into add-cosmosd…
KennyZhang1 Nov 12, 2024
e0a0546
modified cosmosdb setter to require json
KennyZhang1 Nov 12, 2024
6d2427e
Fix non-default emitters
AlonsoGuevara Nov 14, 2024
d206e67
Format
AlonsoGuevara Nov 14, 2024
ea7a404
Ruff
AlonsoGuevara Nov 14, 2024
297066c
ruff
AlonsoGuevara Nov 14, 2024
0982efe
Merge remote-tracking branch 'origin/fix/non-default-emitters' into a…
KennyZhang1 Nov 14, 2024
d6c3afc
first successful run of cosmosdb indexing
KennyZhang1 Nov 14, 2024
d1fc4f0
removed extraneous container_name setting
KennyZhang1 Nov 14, 2024
716bfa4
require base_dir to be typed as str
KennyZhang1 Nov 14, 2024
65c93bb
reverted merged changed from closed branch
KennyZhang1 Nov 14, 2024
6d1a4d9
Merge branch 'main' of github.com:microsoft/graphrag into add-cosmosd…
KennyZhang1 Nov 14, 2024
66641d6
removed nested try statement
KennyZhang1 Nov 15, 2024
5e5f76d
readded initial non-parquet emitter fix
KennyZhang1 Nov 15, 2024
0d93d0d
added basic support for parquet emitter using internal conversions
KennyZhang1 Nov 15, 2024
dac0b86
merged with main and resolved conflicts
KennyZhang1 Nov 18, 2024
594f332
Merge branch 'main' of github.com:microsoft/graphrag into add-cosmosd…
KennyZhang1 Nov 18, 2024
6eb6134
fixed more merge conflicts
KennyZhang1 Nov 18, 2024
31c0a7a
added cosmosdb functionality to query pipeline
KennyZhang1 Nov 18, 2024
c5281bb
tested query for cosmosdb
KennyZhang1 Nov 19, 2024
76511d0
collapsed cosmosdb schema to use minimal containers and databases
KennyZhang1 Nov 19, 2024
232cd07
simplified create_database and create_container functions
KennyZhang1 Nov 20, 2024
297373b
ruff fixes and semversioner
KennyZhang1 Nov 21, 2024
4ddff36
spellcheck and ci fixes
KennyZhang1 Nov 21, 2024
15d9e62
updated pyproject toml and lock file
KennyZhang1 Nov 21, 2024
93afe22
Merge branch 'main' into add-cosmosdb-to-storage
jgbradley1 Nov 29, 2024
48b7bd9
apply fixes after merge from main
jgbradley1 Dec 2, 2024
c6c7494
add temporary comments
jgbradley1 Dec 3, 2024
aa4a996
refactor cache factory
jgbradley1 Dec 3, 2024
2113de6
refactored storage factory
jgbradley1 Dec 3, 2024
11b5b62
minor formatting
jgbradley1 Dec 3, 2024
8f3e44c
update dictionary
jgbradley1 Dec 3, 2024
447cbe7
fix spellcheck typo
jgbradley1 Dec 3, 2024
6f91dfb
fix default value
jgbradley1 Dec 3, 2024
c79e63c
fix pydantic model defaults
jgbradley1 Dec 3, 2024
aeceacb
update pydantic models
jgbradley1 Dec 3, 2024
b3feb44
fix init_content
jgbradley1 Dec 3, 2024
03e7d35
cleanup how factory passes parameters to file storage
jgbradley1 Dec 4, 2024
f2a5a4d
remove unnecessary output file type
jgbradley1 Dec 4, 2024
dbd6737
update pydantic model
jgbradley1 Dec 4, 2024
a059333
cleanup code
jgbradley1 Dec 4, 2024
a0272a6
implemented clear method
KennyZhang1 Dec 4, 2024
b0e3369
Merge branch 'main' into add-cosmosdb-to-storage
jgbradley1 Dec 6, 2024
10b8996
fix merge from main
jgbradley1 Dec 6, 2024
91d0348
add test stub for cosmosdb
jgbradley1 Dec 6, 2024
7f62af2
regenerate lock file
jgbradley1 Dec 6, 2024
4231189
modified set method to collapse parquet rows
KennyZhang1 Dec 6, 2024
110b10e
modified get method to collapse parquet rows
KennyZhang1 Dec 6, 2024
8e7a1e3
updated has and delete methods and docstrings to adhere to new schema
KennyZhang1 Dec 6, 2024
2db7f83
added prefix helper function
KennyZhang1 Dec 6, 2024
4726bbf
replaced delimiter for prefixed id
KennyZhang1 Dec 6, 2024
49a3639
verified empty tests are passing
jgbradley1 Dec 6, 2024
0429f2f
Merge branch 'main' into add-cosmosdb-to-storage
jgbradley1 Dec 6, 2024
731b5db
fix merges from main
jgbradley1 Dec 6, 2024
e0865f9
add find test
jgbradley1 Dec 7, 2024
22429fd
update cicd step name
jgbradley1 Dec 7, 2024
f06ddfd
tested querying for new schema
KennyZhang1 Dec 9, 2024
c290521
Merge branch 'add-cosmosdb-to-storage' of github.com:microsoft/graphr…
KennyZhang1 Dec 9, 2024
778d67f
resolved errors from merge conflicts
KennyZhang1 Dec 9, 2024
7e47f44
refactored set method to handle cache in new schema
KennyZhang1 Dec 9, 2024
2c3d076
refactored get method to handle cache in new schema
KennyZhang1 Dec 9, 2024
231ad57
force unique ids to be written to cosmos for nodes
KennyZhang1 Dec 10, 2024
02bd07e
found bug with has and delete methods
KennyZhang1 Dec 13, 2024
d390b51
modified has and delete to work with cache in new schema
KennyZhang1 Dec 13, 2024
d61e131
Merge branch 'main' into add-cosmosdb-to-storage
jgbradley1 Dec 16, 2024
09037b6
fix the merge from main
jgbradley1 Dec 16, 2024
b959e2b
minor typo fixes
jgbradley1 Dec 16, 2024
cb98d7a
Merge branch 'main' into add-cosmosdb-to-storage
jgbradley1 Dec 16, 2024
9852d8a
update lock file
jgbradley1 Dec 16, 2024
cccd6bd
spellcheck fix
jgbradley1 Dec 16, 2024
1f8be94
fix init function signature
jgbradley1 Dec 16, 2024
07899b8
minor formatting updates
jgbradley1 Dec 16, 2024
61894a2
remove https protocol
jgbradley1 Dec 17, 2024
070013e
change localhost to 127.0.0.1 address
jgbradley1 Dec 17, 2024
82f05c0
update pytest to use bacj engine
jgbradley1 Dec 17, 2024
3ab938f
Merge branch 'main' into add-cosmosdb-to-storage
jgbradley1 Dec 17, 2024
a1571f3
verified cache tests
KennyZhang1 Dec 17, 2024
7c7f85c
Merge branch 'add-cosmosdb-to-storage' of github.com:microsoft/graphr…
KennyZhang1 Dec 17, 2024
aac3c62
improved speed of has function
KennyZhang1 Dec 17, 2024
2eb102a
resolved pytest error with find function
KennyZhang1 Dec 17, 2024
3fc6001
added test for child method
KennyZhang1 Dec 17, 2024
0036fc4
make container_name variable private as _container_name
jgbradley1 Dec 17, 2024
3982a3e
Merge branch 'add-cosmosdb-to-storage' of github.com:microsoft/graphr…
KennyZhang1 Dec 17, 2024
7480cff
minor variable name fix
jgbradley1 Dec 17, 2024
20d867d
cleanup cosmos pytest and make the cosmosdb storage class operations …
jgbradley1 Dec 18, 2024
fb945b2
update cicd to use different cosmosdb emulator
jgbradley1 Dec 18, 2024
1cddde1
test with http protocol
jgbradley1 Dec 18, 2024
8d34c2d
Merge branch 'add-cosmosdb-to-storage' of github.com:microsoft/graphr…
KennyZhang1 Dec 18, 2024
74320c9
added pytest for clear()
KennyZhang1 Dec 18, 2024
16932b3
add longer timeout for cosmosdb emulator startup
jgbradley1 Dec 18, 2024
dbf2be2
revert http connection back to https
jgbradley1 Dec 18, 2024
6a702bf
add comments to cicd code for future dev usage
jgbradley1 Dec 18, 2024
2ca43ce
set to container and database clients to none upon deletion
KennyZhang1 Dec 18, 2024
189b2be
Merge branch 'add-cosmosdb-to-storage' of github.com:microsoft/graphr…
KennyZhang1 Dec 18, 2024
e2b3cf0
ruff changes
KennyZhang1 Dec 18, 2024
944b9d3
add comments to cicd code
jgbradley1 Dec 18, 2024
e8e85f0
Merge branch 'add-cosmosdb-to-storage' of github.com:microsoft/graphr…
jgbradley1 Dec 18, 2024
26bba96
removed unneeded None statements and ruff fixes
KennyZhang1 Dec 18, 2024
45260f3
Merge branch 'add-cosmosdb-to-storage' of github.com:microsoft/graphr…
KennyZhang1 Dec 18, 2024
3ba1516
more ruff fixes
KennyZhang1 Dec 18, 2024
cc9e977
Update test_run.py
KennyZhang1 Dec 18, 2024
0c62cf7
remove unnecessary call to delete container
jgbradley1 Dec 18, 2024
9bac7ef
ruff format updates
jgbradley1 Dec 18, 2024
9c342c2
Reverted test_run.py
KennyZhang1 Dec 18, 2024
7b36265
Merge branch 'add-cosmosdb-to-storage' of github.com:microsoft/graphr…
jgbradley1 Dec 18, 2024
998e216
Merge branch 'add-cosmosdb-to-storage' of github.com:microsoft/graphr…
jgbradley1 Dec 18, 2024
0e50eef
fix ruff formatter errors
jgbradley1 Dec 18, 2024
25c281b
cleanup variable names to be more consistent
jgbradley1 Dec 18, 2024
57ae36a
Merge branch 'main' into add-cosmosdb-to-storage
jgbradley1 Dec 18, 2024
04bf8c6
remove extra semversioner file
jgbradley1 Dec 18, 2024
cf165ac
Merge branch 'main' into add-cosmosdb-to-storage
jgbradley1 Dec 18, 2024
b059367
Merge branch 'main' into add-cosmosdb-to-storage
jgbradley1 Dec 19, 2024
a6a4694
revert pydantic model changes
jgbradley1 Dec 19, 2024
143e231
revert pydantic model change
jgbradley1 Dec 19, 2024
6926dfe
revert pydantic model change
jgbradley1 Dec 19, 2024
52dc631
re-enable inline formatting rule
jgbradley1 Dec 19, 2024
c5b4c78
update documentation in dev guide
jgbradley1 Dec 19, 2024
15 changes: 13 additions & 2 deletions .github/workflows/python-integration-tests.yml
Expand Up @@ -23,7 +23,7 @@ permissions:

concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
# Only run the for the latest commit
# only run the for the latest commit
cancel-in-progress: true

env:
Expand All @@ -37,7 +37,7 @@ jobs:
matrix:
python-version: ["3.10"]
os: [ubuntu-latest, windows-latest]
fail-fast: false # Continue running all jobs even if one fails
fail-fast: false # continue running all jobs even if one fails
env:
DEBUG: 1

Expand Down Expand Up @@ -84,6 +84,17 @@ jobs:
id: azuright
uses: potatoqualitee/[email protected]

+      # For more information on installation/setup of Azure Cosmos DB Emulator
+      # https://learn.microsoft.com/en-us/azure/cosmos-db/how-to-develop-emulator?tabs=docker-linux%2Cpython&pivots=api-nosql
+      # Note: the emulator is only available on Windows runners. It can take longer than the default to initially startup so we increase the default timeout.
+      # If a job fails due to timeout, restarting the cicd job usually resolves the problem.
+      - name: Install Azure Cosmos DB Emulator
+        if: runner.os == 'Windows'
+        run: |
+          Write-Host "Launching Cosmos DB Emulator"
+          Import-Module "$env:ProgramFiles\Azure Cosmos DB Emulator\PSModules\Microsoft.Azure.CosmosDB.Emulator"
+          Start-CosmosDbEmulator -Timeout 500
+
- name: Integration Test
run: |
poetry run poe test_integration
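
The workflow comment above notes that the emulator can be slow to start, which is why the step raises the startup timeout. As a rough Python-side illustration of the same idea, a test fixture could poll the emulator's port before running any tests. This is a minimal sketch and not part of the PR: `wait_for_port` is a hypothetical helper, and the port used in the usage note (8081, the emulator's default) is an assumption about the runner setup.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout.

    Hypothetical helper: it only checks that something is listening,
    not that the service behind the port is fully initialized.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means a listener is accepting connections.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

A fixture might call `wait_for_port("127.0.0.1", 8081, timeout=500)` and skip the Cosmos DB tests when it returns `False`, which would mirror the `-Timeout 500` value used in the workflow step.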
4 changes: 4 additions & 0 deletions .semversioner/next-release/patch-20241121202210026640.json
@@ -0,0 +1,4 @@
{
"type": "patch",
"description": "Implement cosmosdb storage option for cache and output"
}
1 change: 0 additions & 1 deletion DEVELOPING.md
Expand Up @@ -45,7 +45,6 @@ graphrag
├── config # configuration management
├── index # indexing engine
| └─ run/run.py # main entrypoint to build an index
├── llm # generic llm interfaces
├── logger # logger module supporting several options
│   └─ factory.py # └─ main entrypoint to create a logger
├── model # data model definitions associated with the knowledge graph
Expand Down
1 change: 1 addition & 0 deletions dictionary.txt
Expand Up @@ -28,6 +28,7 @@ ints

# Azure
abfs
cosmosdb
Hnsw
odata

Expand Down
3 changes: 3 additions & 0 deletions graphrag/cache/factory.py
Expand Up @@ -9,6 +9,7 @@

from graphrag.config.enums import CacheType
from graphrag.storage.blob_pipeline_storage import BlobPipelineStorage
from graphrag.storage.cosmosdb_pipeline_storage import create_cosmosdb_storage
from graphrag.storage.file_pipeline_storage import FilePipelineStorage

if TYPE_CHECKING:
Expand Down Expand Up @@ -50,6 +51,8 @@ def create_cache(
)
case CacheType.blob:
return JsonPipelineCache(BlobPipelineStorage(**kwargs))
case CacheType.cosmosdb:
return JsonPipelineCache(create_cosmosdb_storage(**kwargs))
case _:
if cache_type in cls.cache_types:
return cls.cache_types[cache_type](**kwargs)
Expand Down
3 changes: 3 additions & 0 deletions graphrag/config/create_graphrag_config.py
Expand Up @@ -362,6 +362,7 @@ def hydrate_parallelization_params(
storage_account_blob_url=reader.str(Fragment.storage_account_blob_url),
container_name=reader.str(Fragment.container_name),
base_dir=reader.str(Fragment.base_dir) or defs.CACHE_BASE_DIR,
cosmosdb_account_url=reader.str(Fragment.cosmosdb_account_url),
)
with (
reader.envvar_prefix(Section.reporting),
Expand All @@ -383,6 +384,7 @@ def hydrate_parallelization_params(
storage_account_blob_url=reader.str(Fragment.storage_account_blob_url),
container_name=reader.str(Fragment.container_name),
base_dir=reader.str(Fragment.base_dir) or defs.STORAGE_BASE_DIR,
cosmosdb_account_url=reader.str(Fragment.cosmosdb_account_url),
)

with (
Expand Down Expand Up @@ -667,6 +669,7 @@ class Fragment(str, Enum):
concurrent_requests = "CONCURRENT_REQUESTS"
conn_string = "CONNECTION_STRING"
container_name = "CONTAINER_NAME"
cosmosdb_account_url = "COSMOSDB_ACCOUNT_URL"
deployment_name = "DEPLOYMENT_NAME"
description = "DESCRIPTION"
enabled = "ENABLED"
Expand Down
4 changes: 4 additions & 0 deletions graphrag/config/enums.py
Expand Up @@ -19,6 +19,8 @@ class CacheType(str, Enum):
"""The none cache configuration type."""
blob = "blob"
"""The blob cache configuration type."""
cosmosdb = "cosmosdb"
"""The cosmosdb cache configuration type"""

def __repr__(self):
"""Get a string representation."""
Expand Down Expand Up @@ -60,6 +62,8 @@ class StorageType(str, Enum):
"""The memory storage type."""
blob = "blob"
"""The blob storage type."""
cosmosdb = "cosmosdb"
"""The cosmosdb storage type"""

def __repr__(self):
"""Get a string representation."""
Expand Down
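
Both enums derive from `str` and `Enum`, so the new `cosmosdb` member can be looked up directly from the string a user writes in the settings file. A minimal sketch of that behavior follows. `parse_storage_type` is a hypothetical helper introduced only for illustration, and the member list is reconstructed from the types mentioned in this diff.

```python
from enum import Enum


class StorageType(str, Enum):
    """Storage types mentioned in the diff."""

    file = "file"
    memory = "memory"
    blob = "blob"
    cosmosdb = "cosmosdb"


def parse_storage_type(raw: str) -> StorageType:
    """Hypothetical helper: map a config string to its enum member.

    Lookup by value raises ValueError for strings that match no member.
    """
    return StorageType(raw.strip().lower())
```

Because the enum mixes in `str`, a member also compares equal to its plain string value, so `StorageType.cosmosdb == "cosmosdb"` holds.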
4 changes: 2 additions & 2 deletions graphrag/config/init_content.py
Expand Up @@ -63,15 +63,15 @@
## connection_string and container_name must be provided

cache:
type: {defs.CACHE_TYPE.value} # or blob
type: {defs.CACHE_TYPE.value} # one of [blob, cosmosdb, file]
base_dir: "{defs.CACHE_BASE_DIR}"

reporting:
type: {defs.REPORTING_TYPE.value} # or console, blob
base_dir: "{defs.REPORTING_BASE_DIR}"

storage:
type: {defs.STORAGE_TYPE.value} # or blob
type: {defs.STORAGE_TYPE.value} # one of [blob, cosmosdb, file]
base_dir: "{defs.STORAGE_BASE_DIR}"

## only turn this on if running `graphrag index` with custom settings
Expand Down
1 change: 1 addition & 0 deletions graphrag/config/input_models/cache_config_input.py
Expand Up @@ -16,3 +16,4 @@ class CacheConfigInput(TypedDict):
connection_string: NotRequired[str | None]
container_name: NotRequired[str | None]
storage_account_blob_url: NotRequired[str | None]
cosmosdb_account_url: NotRequired[str | None]
1 change: 1 addition & 0 deletions graphrag/config/input_models/storage_config_input.py
Expand Up @@ -16,3 +16,4 @@ class StorageConfigInput(TypedDict):
connection_string: NotRequired[str | None]
container_name: NotRequired[str | None]
storage_account_blob_url: NotRequired[str | None]
cosmosdb_account_url: NotRequired[str | None]
3 changes: 3 additions & 0 deletions graphrag/config/models/cache_config.py
Expand Up @@ -27,3 +27,6 @@ class CacheConfig(BaseModel):
storage_account_blob_url: str | None = Field(
description="The storage account blob url to use.", default=None
)
cosmosdb_account_url: str | None = Field(
description="The cosmosdb account url to use.", default=None
)
3 changes: 3 additions & 0 deletions graphrag/config/models/storage_config.py
Expand Up @@ -28,3 +28,6 @@ class StorageConfig(BaseModel):
storage_account_blob_url: str | None = Field(
description="The storage account blob url to use.", default=None
)
cosmosdb_account_url: str | None = Field(
description="The cosmosdb account url to use.", default=None
)
28 changes: 27 additions & 1 deletion graphrag/index/config/cache.py
@@ -1,7 +1,7 @@
# Copyright (c) 2024 Microsoft Corporation.
# Licensed under the MIT License

"""A module containing 'PipelineCacheConfig', 'PipelineFileCacheConfig', 'PipelineMemoryCacheConfig', 'PipelineBlobCacheConfig' models."""
"""A module containing 'PipelineCacheConfig', 'PipelineFileCacheConfig', 'PipelineMemoryCacheConfig', 'PipelineBlobCacheConfig', 'PipelineCosmosDBCacheConfig' models."""

from __future__ import annotations

Expand Down Expand Up @@ -71,9 +71,35 @@ class PipelineBlobCacheConfig(PipelineCacheConfig[Literal[CacheType.blob]]):
"""The storage account blob url for cache"""


+class PipelineCosmosDBCacheConfig(PipelineCacheConfig[Literal[CacheType.cosmosdb]]):
+    """Represents the cosmosdb cache configuration for the pipeline."""
+
+    type: Literal[CacheType.cosmosdb] = CacheType.cosmosdb
+    """The type of cache."""
+
+    base_dir: str | None = Field(
+        description="The cosmosdb database name for the cache.", default=None
+    )
+    """The cosmosdb database name for the cache."""
+
+    container_name: str = Field(description="The container name for cache.", default="")
+    """The container name for cache."""
+
+    connection_string: str | None = Field(
+        description="The cosmosdb primary key for the cache.", default=None
+    )
+    """The cosmosdb primary key for the cache."""
+
+    cosmosdb_account_url: str | None = Field(
+        description="The cosmosdb account url for cache", default=None
+    )
+    """The cosmosdb account url for cache"""
+
+
 PipelineCacheConfigTypes = (
     PipelineFileCacheConfig
     | PipelineMemoryCacheConfig
     | PipelineBlobCacheConfig
     | PipelineNoneCacheConfig
+    | PipelineCosmosDBCacheConfig
 )
36 changes: 34 additions & 2 deletions graphrag/index/config/storage.py
@@ -1,7 +1,7 @@
# Copyright (c) 2024 Microsoft Corporation.
# Licensed under the MIT License

"""A module containing 'PipelineStorageConfig', 'PipelineFileStorageConfig' and 'PipelineMemoryStorageConfig' models."""
"""A module containing 'PipelineStorageConfig', 'PipelineFileStorageConfig','PipelineMemoryStorageConfig', 'PipelineBlobStorageConfig', and 'PipelineCosmosDBStorageConfig' models."""

from __future__ import annotations

Expand Down Expand Up @@ -66,6 +66,38 @@ class PipelineBlobStorageConfig(PipelineStorageConfig[Literal[StorageType.blob]]
"""The storage account blob url."""


+class PipelineCosmosDBStorageConfig(
+    PipelineStorageConfig[Literal[StorageType.cosmosdb]]
+):
+    """Represents the cosmosdb storage configuration for the pipeline."""
+
+    type: Literal[StorageType.cosmosdb] = StorageType.cosmosdb
+    """The type of storage."""
+
+    connection_string: str | None = Field(
+        description="The cosmosdb storage primary key for the storage.", default=None
+    )
+    """The cosmosdb storage primary key for the storage."""
+
+    container_name: str = Field(
+        description="The container name for storage", default=""
+    )
+    """The container name for storage."""
+
+    base_dir: str | None = Field(
+        description="The base directory for the storage.", default=None
+    )
+    """The base directory for the storage."""
+
+    cosmosdb_account_url: str | None = Field(
+        description="The cosmosdb account url.", default=None
+    )
+    """The cosmosdb account url."""
+
+
 PipelineStorageConfigTypes = (
-    PipelineFileStorageConfig | PipelineMemoryStorageConfig | PipelineBlobStorageConfig
+    PipelineFileStorageConfig
+    | PipelineMemoryStorageConfig
+    | PipelineBlobStorageConfig
+    | PipelineCosmosDBStorageConfig
 )
42 changes: 42 additions & 0 deletions graphrag/index/create_pipeline_config.py
Expand Up @@ -20,6 +20,7 @@
from graphrag.index.config.cache import (
PipelineBlobCacheConfig,
PipelineCacheConfigTypes,
PipelineCosmosDBCacheConfig,
PipelineFileCacheConfig,
PipelineMemoryCacheConfig,
PipelineNoneCacheConfig,
Expand All @@ -44,6 +45,7 @@
)
from graphrag.index.config.storage import (
PipelineBlobStorageConfig,
PipelineCosmosDBStorageConfig,
PipelineFileStorageConfig,
PipelineMemoryStorageConfig,
PipelineStorageConfigTypes,
Expand Down Expand Up @@ -420,6 +422,26 @@ def _get_storage_config(
base_dir=storage_settings.base_dir,
storage_account_blob_url=storage_account_blob_url,
)
+        case StorageType.cosmosdb:
+            cosmosdb_account_url = storage_settings.cosmosdb_account_url
+            connection_string = storage_settings.connection_string
+            base_dir = storage_settings.base_dir
+            container_name = storage_settings.container_name
+            if cosmosdb_account_url is None:
+                msg = "CosmosDB account url must be provided for cosmosdb storage."
+                raise ValueError(msg)
+            if base_dir is None:
+                msg = "Base directory must be provided for cosmosdb storage."
+                raise ValueError(msg)
+            if container_name is None:
+                msg = "Container name must be provided for cosmosdb storage."
+                raise ValueError(msg)
+            return PipelineCosmosDBStorageConfig(
+                cosmosdb_account_url=cosmosdb_account_url,
+                connection_string=connection_string,
+                base_dir=storage_settings.base_dir,
+                container_name=container_name,
+            )
case _:
# relative to the root_dir
base_dir = storage_settings.base_dir
Expand Down Expand Up @@ -457,6 +479,26 @@ def _get_cache_config(
base_dir=settings.cache.base_dir,
storage_account_blob_url=storage_account_blob_url,
)
+        case CacheType.cosmosdb:
+            cosmosdb_account_url = settings.cache.cosmosdb_account_url
+            connection_string = settings.cache.connection_string
+            base_dir = settings.cache.base_dir
+            container_name = settings.cache.container_name
+            if base_dir is None:
+                msg = "Base directory must be provided for cosmosdb cache."
+                raise ValueError(msg)
+            if container_name is None:
+                msg = "Container name must be provided for cosmosdb cache."
+                raise ValueError(msg)
+            if connection_string is None and cosmosdb_account_url is None:
+                msg = "Connection string or cosmosDB account url must be provided for cosmosdb cache."
+                raise ValueError(msg)
+            return PipelineCosmosDBCacheConfig(
+                cosmosdb_account_url=cosmosdb_account_url,
+                connection_string=connection_string,
+                base_dir=base_dir,
+                container_name=container_name,
+            )
case _:
# relative to root dir
return PipelineFileCacheConfig(base_dir="./cache")
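
The two cosmosdb branches in this file validate their settings slightly differently: the storage branch insists on an account url, while the cache branch accepts either a connection string or an account url. The sketch below restates those checks side by side so the difference is easy to see. It is an illustration only: `CosmosSettings`, `check_cosmosdb_storage`, and `check_cosmosdb_cache` are hypothetical names, and the real code reads these fields from the project's settings objects.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class CosmosSettings:
    """Hypothetical stand-in for the storage/cache settings objects."""

    base_dir: str | None = None
    container_name: str | None = None
    connection_string: str | None = None
    cosmosdb_account_url: str | None = None


def check_cosmosdb_storage(s: CosmosSettings) -> None:
    """Mirror the storage branch: the account url is mandatory."""
    if s.cosmosdb_account_url is None:
        raise ValueError("CosmosDB account url must be provided for cosmosdb storage.")
    if s.base_dir is None:
        raise ValueError("Base directory must be provided for cosmosdb storage.")
    if s.container_name is None:
        raise ValueError("Container name must be provided for cosmosdb storage.")


def check_cosmosdb_cache(s: CosmosSettings) -> None:
    """Mirror the cache branch: a connection string or an account url is enough."""
    if s.base_dir is None:
        raise ValueError("Base directory must be provided for cosmosdb cache.")
    if s.container_name is None:
        raise ValueError("Container name must be provided for cosmosdb cache.")
    if s.connection_string is None and s.cosmosdb_account_url is None:
        raise ValueError(
            "Connection string or cosmosDB account url must be provided for cosmosdb cache."
        )
```

With these checks, a settings object that carries only a connection string passes the cache validation but is rejected by the storage validation, which matches the behavior of the two branches as written in the diff.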
6 changes: 5 additions & 1 deletion graphrag/index/run/workflow.py
Expand Up @@ -95,8 +95,12 @@ async def _process_workflow(
         return None
 
     context.stats.workflows[workflow_name] = {"overall": 0.0}
+
     await _inject_workflow_data_dependencies(
-        workflow, workflow_dependencies, dataset, context.storage
+        workflow,
+        workflow_dependencies,
+        dataset,
+        context.storage,
     )
 
     workflow_start_time = time.time()
Expand Down
3 changes: 1 addition & 2 deletions graphrag/query/input/loaders/utils.py
Expand Up @@ -165,8 +165,7 @@ def to_optional_float(data: pd.Series, column_name: str | None) -> float | None:
         if value is None:
             return None
         if not isinstance(value, float):
-            msg = f"value is not a float: {value} ({type(value)})"
-            raise ValueError(msg)
+            return float(value)
     else:
         msg = f"Column {column_name} not found in data"
         raise ValueError(msg)
Expand Down
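
The change to `to_optional_float` replaces a hard failure on non-float values with a coercion through `float()`. The resulting behavior can be sketched without pandas as shown below. This is an approximation under stated assumptions: the function here takes a plain mapping rather than a `pd.Series`, and the handling of `column_name is None` is assumed from the signature, since that part of the function is outside the hunk.

```python
from __future__ import annotations

from collections.abc import Mapping
from typing import Any


def to_optional_float(data: Mapping[str, Any], column_name: str | None) -> float | None:
    """Return the column value as a float, or None when it is absent.

    Sketch only: uses a plain mapping in place of a pandas Series.
    """
    if column_name is None:
        # Assumption: no column requested means no value.
        return None
    if column_name not in data:
        msg = f"Column {column_name} not found in data"
        raise ValueError(msg)
    value = data[column_name]
    if value is None:
        return None
    # Coerce ints, numeric strings, and similar values instead of rejecting them.
    return float(value)
```

Values that cannot be converted, such as a non-numeric string, still raise `ValueError`, but the error now comes from `float()` itself rather than from an explicit type check.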