Make it possible to generate a DBT docs subset #5096
Replies: 11 comments
-
@bashyroger Thanks for opening the issue for this! I was just having this conversation again yesterday. I'm completely on board.
What do you think of using YAML selectors for this in particular? As of v0.19.0, the To me, that feels like the right lever to pull. Each selector is a meaningful grouping, defined and saved in advance (and in version control!), with the full power of node selection syntax. The question is how:
I think the use case will need to determine the specific implementation. If indeed the motivation is privacy / legal reasons, then having the sensitive information still populate in |
-
Using YAML selectors for this indeed makes sense, @jtcohen6. It should achieve the same results, only better (version controlled, documented, etc.). Then, for now, we can just copy / serve that data from the subdir to a client 'as is'. Later, in DBT Cloud, for users that have a read-only account, you could add config data specifying which 'output directory' subdir they have access to (i.e. root, client_x, client_y).
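To make the subdir idea concrete, here is a rough sketch of what a post-processing step could look like today. It assumes the docs artifacts (manifest.json, catalog.json and index.html) sit in one target directory and that both JSON files keep a "nodes" mapping keyed by unique_id. The function name write_docs_subset and the subdir layout are made up for illustration, and it is not verified here that the docs site tolerates a manifest pruned this way.

```python
import json
import shutil
from pathlib import Path


def write_docs_subset(target_dir, subdir, keep_ids):
    """Copy the docs artifacts into target_dir/subdir, keeping only keep_ids.

    Hypothetical helper: not part of dbt itself.
    """
    target = Path(target_dir)
    out = target / subdir
    out.mkdir(parents=True, exist_ok=True)

    for name in ("manifest.json", "catalog.json"):
        data = json.loads((target / name).read_text())
        # Assumption: both artifacts key their "nodes" mapping by unique_id.
        data["nodes"] = {
            uid: node
            for uid, node in data.get("nodes", {}).items()
            if uid in keep_ids
        }
        (out / name).write_text(json.dumps(data))

    # The static docs page is copied unchanged.
    shutil.copy(target / "index.html", out / "index.html")
    return out
```

Each client would then be served only its own subdir, e.g. write_docs_subset("target", "client_x", ids_for_client_x). Note that only "nodes" is pruned; other sections of the artifacts (sources, lineage maps) are left untouched in this sketch.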
-
In addition, as an alternative to generating multiple DBT docs manifests, another way to achieve similar results would be to change the docs visibility based on the logged-in user, by adding a model selector (or YAML selector) as a parameter / filter in the user config screen. For example, when I add user client_X to dbt Cloud, the same metadata I mentioned in my first post, attached to that user, would then restrict them 'under the hood' to the incoming nodes, all outgoing nodes and the parents of the children as the only nodes they can browse / search:
-
We currently use dbt Cloud and would love to see this improved there as well. Our teams only need to see the specific set of docs relevant to their roles, but dbt docs generate produces docs for every single model in our project. As a result, they cannot use the docs effectively because of information overload. We currently have a workflow similar to the following:
This only generates a handful of tables in our destination bigquery, but the docs generated in dbt cloud by clicking the generate docs button (which uses |
-
i am kind of confused, because
Still, they don't seem to work. Is it just an error in the command line documentation?
-
I think it's an error in the I'm interested in doing a PR in that direction IMO another PR should be in the future to make selection available in the future in |
-
Just to clarify that point: The If you skip all compilation, |
-
I made a simple CLI to help me modify the |
-
I'd like to second this: our DBT ETL has 32679 sources and creates a 60MB manifest.json.
-
When the manifest grows to even >100MB, it really has problems... - the python CLI referenced is great, but only for tags. Ideally I need something that can do a Is there any work planned on this coming up? |
-
This would be very nice to have. When I am developing, I like to run the docs to verify the documentation looks good, and generating them for the whole project adds a lot of extra time.
-
Describe the feature
As we are building a data platform for multiple different clients, we would like to generate a subset of the complete DBT docs for each of them.
Reasons being:
- Why expose documentation that will never be relevant for a client?
- ETL logic / privacy / legal reasons
My initial thought is to make it possible to generate 'sub-sites' using the already existing model selection syntax:
Examples:
This should generate docs based on a tag and all further outgoing nodes:
dbt docs --models tag:client_x+
This should generate docs based on a tag, the incoming nodes, all outgoing nodes and the parents of the children:
dbt docs --models +tag:client_x@
etc..
Support for exclude is also important for us here.
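To illustrate what the selectors above would have to resolve to, here is a minimal sketch that works on an already loaded manifest dict. It assumes the manifest carries "parent_map" and "child_map" adjacency maps keyed by unique_id and that each node has a "tags" list. The function select_with_graph and the demo data are hypothetical, and the "parents of the children" part of the @ operator is not covered.

```python
from collections import deque


def select_with_graph(manifest, tag, parents=False, children=False, exclude=()):
    """Return ids of nodes tagged `tag`, optionally widened along the DAG."""
    seeds = {
        uid
        for uid, node in manifest["nodes"].items()
        if tag in node.get("tags", [])
    }

    def walk(start, edges):
        # Breadth-first traversal over a map of unique_id -> [unique_id].
        seen, queue = set(start), deque(start)
        while queue:
            for nxt in edges.get(queue.popleft(), []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    selected = set(seeds)
    if children:  # roughly "tag:client_x+"
        selected |= walk(seeds, manifest.get("child_map", {}))
    if parents:  # roughly "+tag:client_x"
        selected |= walk(seeds, manifest.get("parent_map", {}))
    return selected - set(exclude)


# Tiny made-up manifest: stg -> orders (tagged) -> report, plus an unrelated model.
demo_manifest = {
    "nodes": {
        "model.demo.stg": {"tags": []},
        "model.demo.orders": {"tags": ["client_x"]},
        "model.demo.report": {"tags": []},
        "model.demo.other": {"tags": []},
    },
    "child_map": {
        "model.demo.stg": ["model.demo.orders"],
        "model.demo.orders": ["model.demo.report"],
    },
    "parent_map": {
        "model.demo.orders": ["model.demo.stg"],
        "model.demo.report": ["model.demo.orders"],
    },
}

downstream = select_with_graph(demo_manifest, "client_x", children=True)
```

Here downstream contains the tagged orders model and the report built on it, but neither stg nor other. Passing exclude=["model.demo.report"] would drop the report again, which is the kind of exclude support we need.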
Describe alternatives you've considered
James Weakley showed this DIY method on DBT slack:
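import json

# Load the manifest produced by dbt.
with open('target/manifest.json') as f:
    data = json.load(f)

# Iterate over a copy of the keys, since deleting from a dict while
# iterating over it raises a RuntimeError.
for node in list(data['nodes']):
    print(f"Checking node: {node}")
    # Manifest nodes store their tags under the 'tags' key.
    if 'some_tag' in data['nodes'][node].get('tags', []):
        del data['nodes'][node]

# Overwrite the manifest with the pruned version.
with open('target/manifest.json', 'w') as f:
    json.dump(data, f)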
While that could partially work, it will not be as complete as what we would like.
Additional context
This is not database specific. I guess the metadata needed to make this possible is already there in DBT.
Who will this benefit?
Anyone that has the same use case as us: reducing the amount of info shown to a consumer, and privacy / legal reasons.
Are you interested in contributing this feature?
My team can code in Python, so they could potentially help.