Make it possible to generate a DBT docs subset #5096

bashyroger · 2021-03-23T11:34:27Z

bashyroger
Mar 23, 2021

Describe the feature

As we are building a data platform for multiple, different clients, we would like to generate a subset of the complete DBT docs for each of them.
The reasons are:
- Why expose documentation that will never be relevant for a client?
- ETL logic / privacy / legal reasons

My initial thought is to make it possible to generate 'sub-sites' using the already existing model selector methods syntax:

Examples:
This should generate docs based on a tag and all further outgoing nodes:
dbt docs --models tag:client_x+

This should generate docs based on a tag, the incoming nodes, all outgoing nodes and the parents of the children:
dbt docs --models +tag:client_x@

etc..

Support for exclude is also important for us here.
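
To make the intended semantics concrete, here is a minimal sketch of how such a selection could be resolved against an already generated manifest.json. It assumes the manifest exposes top-level `nodes`, `parent_map` and `child_map` keys and that each node carries a `tags` list. The function names `walk` and `select_by_tag` are hypothetical, and this is not dbt's own selector implementation.

```python
from collections import deque


def walk(start, edges):
    """Return every id reachable from `start` by following `edges` (start included)."""
    seen = set(start)
    queue = deque(start)
    while queue:
        current = queue.popleft()
        for neighbour in edges.get(current, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


def select_by_tag(manifest, tag, mode="+"):
    """Resolve a tag selection against a loaded manifest dict.

    mode "+": tagged nodes plus all of their descendants
    mode "@": the above plus every ancestor of those nodes
    Assumes `nodes`, `parent_map` and `child_map` exist in the manifest.
    """
    tagged = {
        uid for uid, node in manifest["nodes"].items()
        if tag in node.get("tags", [])
    }
    selected = walk(tagged, manifest["child_map"])
    if mode == "@":
        selected = walk(selected, manifest["parent_map"])
    return selected
```

Calling `select_by_tag(manifest, "client_x")` on a manifest loaded with `json.load` would return the set of unique ids to keep; an exclude option could be sketched as a set difference between two such results.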

Describe alternatives you've considered

James Weakley showed this DIY method on DBT slack:
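import json

# Load the manifest written by a previous dbt run.
with open('target/manifest.json') as f:
    data = json.load(f)

# Iterate over a copy of the keys so nodes can be deleted while looping.
for node in list(data['nodes']):
    print(f"Checking node: {node}")
    # Each node stores its tags as a list under the 'tags' key.
    if 'some_tag' in data['nodes'][node]['tags']:
        del data['nodes'][node]

# Overwrite the manifest with the reduced set of nodes.
with open('target/manifest.json', 'w') as f:
    json.dump(data, f)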

While that could partially work, it will not be as complete as what we would like.

Additional context

This is not database specific; I guess the metadata needed to make this possible is already there in DBT.

Who will this benefit?

Anyone that has the same use case as us: reducing the amount of info for a consumer and privacy / legal reasons

Are you interested in contributing this feature?

My team can code in Python, so they could potentially help.

jtcohen6 · 2021-03-23T14:10:12Z

jtcohen6
Mar 23, 2021
Maintainer

@bashyroger Thanks for opening the issue for this! I was just having this conversation again yesterday. I'm completely on board.

My initial thought is to make it possible to generate 'sub-sites' using the already existing model selector methods syntax:

What do you think of using YAML selectors for this in particular? As of v0.19.0, the selectors dict is populated in manifest.json, and we even did some extra work to make sure it's done in a consistent and expressive way.

To me, that feels like the right lever to pull. Each selector is a meaningful grouping, defined and saved in advance (and in version control!), with the full power of node selection syntax. The question is how:

- Should the docs site make available a top-level selection pane that would alternately show or hide each group of resources?
- Should dbt docs generate --selector finance write a manifest.json that disables or otherwise flags unselected resources (with the docs site index.html encoding the logic that ignores flagged resources)?
  - What if a selected resource refs an unselected resource? This would need to happen after most other manifest construction, i.e. after checking DAG validity.
- Should there be a command or script (similar to the one James shared in Slack) that takes a --selector argument, reads a full manifest.json, and outputs a "partial" manifest of only the selected resources? That could live inside or outside of dbt.

I think the use case will need to determine the specific implementation. If indeed the motivation is privacy / legal reasons, then having the sensitive information still populate in manifest.json, but simply be ignored by logic in index.html, may not be sufficient. Of course, figuring out a way to write a partial manifest is going to be more complex than simply adding one more boolean flag.
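
As an illustration of the "partial manifest" option, the sketch below prunes a loaded manifest dict down to a set of selected unique ids and reports every kept node that still depends on an unselected one, which is the ref problem raised above. It assumes the manifest has `nodes` entries with `depends_on.nodes` lists, plus optional `sources`, `exposures`, `parent_map` and `child_map` sections. `build_partial_manifest` is a hypothetical helper, not an existing dbt command.

```python
import copy


def build_partial_manifest(manifest, selected_ids):
    """Return (partial_manifest, dangling) without mutating the input.

    `dangling` maps each kept node id to the parents it depends on that
    were not selected, i.e. a selected resource referencing an unselected one.
    """
    selected = set(selected_ids)
    partial = copy.deepcopy(manifest)

    # Drop unselected entries from the resource sections that are present.
    for section in ("nodes", "sources", "exposures"):
        if section in partial:
            partial[section] = {
                uid: res for uid, res in partial[section].items() if uid in selected
            }

    # Keep the lineage maps consistent with the remaining resources.
    for map_name in ("parent_map", "child_map"):
        if map_name in partial:
            partial[map_name] = {
                uid: [n for n in neighbours if n in selected]
                for uid, neighbours in partial[map_name].items()
                if uid in selected
            }

    # Report kept nodes whose dependencies were filtered out.
    dangling = {}
    for uid, node in partial.get("nodes", {}).items():
        missing = [
            dep for dep in node.get("depends_on", {}).get("nodes", [])
            if dep not in selected
        ]
        if missing:
            dangling[uid] = missing
    return partial, dangling
```

For the privacy / legal use case this is only a starting point: other sections of the manifest, and the compiled SQL of kept nodes, may still mention excluded resources by name, so they would need the same treatment.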

bashyroger · 2021-03-24T09:14:03Z

bashyroger
Mar 24, 2021
Author

Using YAML selectors for this indeed makes sense, @jtcohen6. It should achieve the same result, only better (version controlled, documented, etc.).
For our use case, we have multiple clients that each need a subsection, BUT there will also be parts that are 'shared'. Solving this in one 'site' would require being able to log in to DBT docs with different access rights, which I think is a bridge too far.
So I would say it is no problem that docs content is duplicated across multiple sites. That is, really generate a different manifest.json and index.html in a web subdirectory, based on the selectors and an 'output directory' name.

Then, for now we can just copy / serve that data from the subdir to a client 'as is'.

And later, in DBT Cloud, for users that have a read-only account, you could add config data specifying which 'output directory' subdirs they have access to (i.e. root, client_x, client_y).
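
The "one subdirectory per client" idea can be sketched as a post-processing step over the artifacts of a normal docs run. The example assumes that the target directory already contains index.html, manifest.json and catalog.json, and that the ids to keep were resolved beforehand. `write_client_site` and the directory names are hypothetical.

```python
import json
import shutil
from pathlib import Path


def write_client_site(target_dir, output_dir, keep_ids):
    """Write a standalone docs directory containing only `keep_ids`.

    Assumes `target_dir` already holds index.html, manifest.json and
    catalog.json from a normal docs generation run.
    """
    target, out = Path(target_dir), Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    keep = set(keep_ids)

    # The static page is identical for every client, so copy it unchanged.
    shutil.copy(target / "index.html", out / "index.html")

    # Filter the resource dictionaries in both JSON artifacts.
    for name in ("manifest.json", "catalog.json"):
        data = json.loads((target / name).read_text(encoding="utf-8"))
        for section in ("nodes", "sources"):
            if section in data:
                data[section] = {
                    uid: res for uid, res in data[section].items() if uid in keep
                }
        (out / name).write_text(json.dumps(data), encoding="utf-8")
    return out
```

A call such as `write_client_site("target", "target/client_x", ids_for_client_x)` would produce a directory that can be copied or served as is. The sketch leaves all other manifest sections untouched, so names of excluded resources may still appear there.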

bashyroger · 2021-06-08T14:34:43Z

bashyroger
Jun 8, 2021
Author

In addition, as an alternative to generating multiple DBT docs manifests, another way to achieve similar results would be to change the docs visibility based on the logged-in user, by adding model selectors (or YAML selectors) as a parameter / filter in the user config screen.

As an example, when I add user client_X to dbt Cloud, the same selector I mentioned in my first post would be attached to that user. 'Under the hood' it would then restrict the nodes that user can browse and search to the tagged nodes, their incoming nodes, all outgoing nodes and the parents of the children:

client_x --models +tag:client_x@

avaitla · 2021-08-29T07:45:07Z

avaitla
Aug 29, 2021

We currently use DBT Cloud and would love to see this improved there as well. Our teams only need to see the specific set of docs relevant to their roles, but dbt docs generate builds docs for every single model in our project. This means they are unable to use the docs effectively due to information overload. We currently have a workflow similar to the following:

dbt seed --select tag:marketing
dbt run -m tag:marketing
dbt test -m tag:marketing

This only generates a handful of tables in our destination BigQuery, but the docs generated in dbt Cloud by clicking the generate docs button (which uses dbt docs generate under the hood) cover hundreds of model files. So when our team clicks view documentation, there is way too much information for the docs to be useful. The only workaround would be to split the entire dbt repository into much smaller ones, which seems extremely heavy-handed just for the sake of docs generation (we'd have possibly 10 different repos at that point just to have simpler docs generated).

itajaja · 2021-10-12T20:22:32Z

itajaja
Oct 12, 2021

I am kind of confused, because dbt docs generate already has some available command line options:

> p run dbt docs generate --help
usage: dbt docs generate [-h] [--project-dir PROJECT_DIR] [--profiles-dir PROFILES_DIR] [--profile PROFILE] [-t TARGET] [--vars VARS]
                         [--bypass-cache] [--no-compile] [--threads THREADS] [--no-version-check] [-m MODELS [MODELS ...]]
                         [--exclude EXCLUDE [EXCLUDE ...]] [--selector SELECTOR_NAME] [--state STATE]

optional arguments:
  -h, --help            show this help message and exit
  --project-dir PROJECT_DIR
                        Which directory to look in for the dbt_project.yml file. Default is the current working directory and its
                        parents.
  --profiles-dir PROFILES_DIR
                        Which directory to look in for the profiles.yml file. Default = /Users/gtagliabue/.dbt
  --profile PROFILE     Which profile to load. Overrides setting in dbt_project.yml.
  -t TARGET, --target TARGET
                        Which target to load for the given profile
  --vars VARS           Supply variables to the project. This argument overrides variables defined in your dbt_project.yml file. This
                        argument should be a YAML string, eg. '{my_variable: my_value}'
  --bypass-cache        If set, bypass the adapter-level cache of database state
  --no-compile          Do not run "dbt compile" as part of docs generation
  --threads THREADS     Specify number of threads to use while executing models. Overrides settings in profiles.yml.
  --no-version-check    If set, skip ensuring dbt's version matches the one specified in the dbt_project.yml file ('require-dbt-
                        version')
  -m MODELS [MODELS ...], --models MODELS [MODELS ...]
                        Specify the models to include.
  --exclude EXCLUDE [EXCLUDE ...]
                        Specify the models to exclude.
  --selector SELECTOR_NAME
                        The selector name to use, as defined in selectors.yml
  --state STATE         If set, use the given directory as the source for json files to compare with this project.

Still, they don't seem to work. Is it just an error in the command line documentation?

jb-delafosse · 2022-01-12T09:23:17Z

jb-delafosse
Jan 12, 2022

I think it's an error in the generate method in the CLI. A quick improvement would be to at least remove the "selection options" from the dbt docs generate command.

I'm interested in doing a PR in that direction.

IMO another PR should follow in the future to make selection available in dbt docs generate but, as @jtcohen6 explained, there are design decisions to consider about how it should behave.

jtcohen6 · 2022-01-12T12:43:31Z

jtcohen6
Jan 12, 2022
Maintainer

Just to clarify that point: The --select argument to dbt docs generate does perform a function, by selecting which resources will be compiled. It does not affect which resources will be included in manifest.json (all enabled resources in the project + packages), but it does affect which of those resources include their compiled_sql.

If you skip all compilation, dbt docs generate --no-compile, then the inclusion of a --select flag wouldn't have any effect.
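
One way to observe this behaviour is to check which models in a generated manifest actually carry compiled SQL. The sketch below assumes each node has a `resource_type` field and stores compiled output under the `compiled_sql` key named above; other dbt versions may use a different field name. `compiled_summary` is a hypothetical helper.

```python
def compiled_summary(manifest):
    """Split model ids by whether the manifest holds compiled SQL for them."""
    compiled, not_compiled = [], []
    for uid, node in manifest["nodes"].items():
        # Only models are of interest here; skip tests, seeds and so on.
        if node.get("resource_type") != "model":
            continue
        # A missing or empty `compiled_sql` value counts as not compiled.
        if node.get("compiled_sql"):
            compiled.append(uid)
        else:
            not_compiled.append(uid)
    return compiled, not_compiled
```

Running it on manifests produced with and without a --select argument should show every model in both cases, with only the compiled list changing.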

jb-delafosse · 2022-01-19T12:22:33Z

jb-delafosse
Jan 19, 2022

I made a simple CLI to help me modify the manifest.json and catalog.json for this use case.
Let me know if that helps: https://github.com/jb-delafosse/dbt-subdocs

jmesterh · 2022-12-08T21:26:20Z

jmesterh
Dec 8, 2022

I'd like to second this: our DBT ETL has 32679 sources and creates a 60MB manifest.json.

scottsoithongsuk · 2023-03-10T19:18:08Z

scottsoithongsuk
Mar 10, 2023

When the manifest grows beyond 100MB, it really has problems. The Python CLI referenced above is great, but only for tags.

Ideally I need something that can do a --select <model>; a bonus would be support for the + prefix/suffix as well.
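
A rough sketch of what selecting by model name with an optional + prefix/suffix could look like when post-processing an existing manifest. It assumes each node has a `name` field and that the manifest contains `parent_map` and `child_map`; it handles only this simplified `[+]model_name[+]` form and is not dbt's selector parser. `resolve_selector` and `_reachable` are hypothetical names.

```python
def _reachable(start, edges):
    """Return `start` plus every id reachable from it via `edges`."""
    seen, stack = set(start), list(start)
    while stack:
        for nxt in edges.get(stack.pop(), []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def resolve_selector(manifest, selector):
    """Resolve a simplified `[+]model_name[+]` selector to unique ids."""
    include_parents = selector.startswith("+")
    include_children = selector.endswith("+")
    name = selector.strip("+")

    # Match on the node's short name rather than its full unique id.
    selected = {
        uid for uid, node in manifest["nodes"].items()
        if node.get("name") == name
    }
    result = set(selected)
    if include_parents:
        result |= _reachable(selected, manifest["parent_map"])
    if include_children:
        result |= _reachable(selected, manifest["child_map"])
    return result
```

The returned ids could then be used to trim manifest.json and catalog.json before serving the docs.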

Is there any work planned on this coming up?

culpgrant · 2023-07-10T19:46:05Z

culpgrant
Jul 10, 2023

This would be very nice to have. When I am developing, I like to generate the docs to verify the documentation looks good, and it adds a lot of extra time.
