Data Vault powered by dbtVault and Greenplum

Assignment TODO

⚠️ Attention! Always delete resources after you finish your work!

1. Configure Developer Environment

  1. Choose one of three setup options:

    Start with GitHub Codespaces / Dev Container:

    Open in GitHub Codespaces:

    GitHub Codespaces

    Or open in a local Dev Container (VS Code):

    Dev Container

    Set up Docker containers manually:

    Install Docker and run commands:

    # build & run container
    docker-compose build
    docker-compose up -d
    
    # alias docker exec command
    alias dbt="docker-compose exec dev dbt"

    Alternatively, install on a local machine:

    1. Install dbt

      Configure the profile manually. By default, dbt expects the profiles.yml file to be located in the ~/.dbt/ directory. Use this template and enter your own credentials.

    2. Install yc CLI

    3. Install Terraform

  2. Populate .env file

    .env is used to store secrets as environment variables.

    Copy the template file .env.template to .env:

    cp .env.template .env

    Open the file in an editor and set your own values.

    ❗️ Never commit secrets to git
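
As a minimal sketch, the values in .env can also be loaded into the current shell with the `set -a` (auto-export) mode, which tolerates comment lines and quoted values containing spaces. The function name `load_env_file` is hypothetical and not part of this repo, and the sketch assumes .env contains only plain `KEY=value` shell assignments.

```bash
# Hypothetical helper: export every KEY=value assignment from an env file.
# Assumes the file is valid shell syntax (comments and quoted values are fine).
load_env_file() {
  local env_file="${1:-.env}"
  if [ ! -f "$env_file" ]; then
    echo "env file not found: $env_file" >&2
    return 1
  fi
  set -a          # auto-export every variable assigned from here on
  . "$env_file"   # source the file in the current shell
  set +a          # restore the default behaviour
}
```

Calling `load_env_file` with no argument reads `.env` from the current directory.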

2. Deploy Infrastructure

  1. Get familiar with Managed Service for Greenplum

  2. Install and configure yc CLI: Getting started with the command-line interface by Yandex Cloud

    yc init
  3. Set environment variables:

    export YC_TOKEN=$(yc iam create-token)
    export YC_CLOUD_ID=$(yc config get cloud-id)
    export YC_FOLDER_ID=$(yc config get folder-id)
    export $(xargs <.env)
  4. Deploy using yc CLI

Add network, greenplum, egress NAT (s3)

```bash
yc managed-greenplum cluster create gp_datavault \
--network-name default \
--zone-id ru-central1-a \
--environment prestable \
--master-host-count 2 \
--segment-host-count 2 \
--master-config resource-id=s3-c2-m8,disk-size=30,disk-type=network-ssd \
--segment-config resource-id=s3-c2-m8,disk-size=30,disk-type=network-ssd \
--segment-in-host 1 \
--user-name greenplum \
--user-password "${TF_VAR_greenplum_password}" \
--greenplum-version 6.22 \
--assign-public-ip
```
  5. Deploy using Terraform

    terraform init
    terraform validate
    terraform fmt
    terraform plan
    terraform apply

    Store Terraform output values as environment variables:

    export DBT_HOST=$(terraform output -raw greenplum_host_fqdn)
    export DBT_USER='greenplum'
    export DBT_PASSWORD=${TF_VAR_greenplum_password}
    
    # Alternatively, set the values manually (example values):
    export DBT_HOST='rc1b-j9injttb11tl6ohd.mdb.yandexcloud.net,rc1b-o0tu24372qtf0qko.mdb.yandexcloud.net'
    export DBT_USER='greenplum'
    export DBT_PASSWORD='greenplum'

    [EN] Reference: Getting started with Terraform by Yandex Cloud

    [RU] Reference: Начало работы с Terraform by Yandex Cloud

! To connect to external sources, set up a NAT gateway for the subnet hosting the Managed Service for Greenplum® cluster. https://cloud.yandex.com/en/docs/vpc/operations/create-nat-gateway

Check database connection
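
One possible way to check the connection is to build a libpq connection string from the DBT_* variables set above and pass it to psql. This is only a sketch: the helper name `gp_conninfo` and the override variables `DBT_PORT` and `DBT_DBNAME` are hypothetical, and the defaults of port 6432 and database `postgres` are assumptions that should be verified against your cluster settings.

```bash
# Hypothetical helper: print a libpq connection string built from DBT_* variables.
# Port 6432 and dbname "postgres" are assumed defaults, not values from this repo.
gp_conninfo() {
  if [ -z "${DBT_HOST:-}" ] || [ -z "${DBT_USER:-}" ]; then
    echo "DBT_HOST and DBT_USER must be set" >&2
    return 1
  fi
  printf 'host=%s port=%s dbname=%s user=%s\n' \
    "$DBT_HOST" "${DBT_PORT:-6432}" "${DBT_DBNAME:-postgres}" "$DBT_USER"
}

# Example usage (requires psql; the password is passed via PGPASSWORD):
#   PGPASSWORD="$DBT_PASSWORD" psql "$(gp_conninfo)" -c 'select version();'
```

Once the dbt profile is configured, `dbt debug` performs a similar connection test from dbt's side.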

WIP Populate Data Vault day-by-day

  1. First read the official guide:

dbtVault worked example

  2. Install dependencies

The original repo is intended to run on Snowflake only.

I have forked it and adapted it to run on Greenplum/PostgreSQL. Check out what has been changed: 47e0261cea67c3284ea409c86dacdc31b1175a39

packages.yml:

packages:
  # - package: Datavault-UK/dbtvault
  #   version: 0.7.3
  - git: "https://github.com/kzzzr/dbtvault.git"
    revision: master
    warn-unpinned: false

Install package:

dbt deps
  3. Adapt models to Greenplum/PostgreSQL

Check out the commit history.

  • a97a224 - adapt prepared staging layer for greenplum - Artemiy Kozyr (HEAD -> master, kzzzr/master)
  • dfc5866 - configure raw layer for greenplum - Artemiy Kozyr
  • bba7437 - configure data sources for greenplum - Artemiy Kozyr
  • aa25600 - configure package (adapted dbt_vault) for greenplum - Artemiy Kozyr
  • eafed95 - configure dbt_project.yml for greenplum - Artemiy Kozyr
  4. Run models step-by-step

Load one day to Data Vault structures:

dbt run -m tag:raw
dbt run -m tag:stage
dbt run -m tag:hub
dbt run -m tag:link
dbt run -m tag:satellite
dbt run -m tag:t_link
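
The same sequence can be sketched as a loop that stops at the first failing layer, so later layers are not built on top of a broken one. The function name `run_layers` and the `DBT_CMD` override variable are hypothetical; `DBT_CMD` exists because shell aliases such as the `dbt` alias defined earlier are not expanded inside functions or scripts.

```bash
# Hypothetical helper: run each Data Vault layer in order, stop on first failure.
# DBT_CMD may be set to e.g. "docker-compose exec dev dbt"; it defaults to "dbt".
run_layers() {
  local dbt_cmd="${DBT_CMD:-dbt}"
  local tag
  for tag in raw stage hub link satellite t_link; do
    echo "==> ${dbt_cmd} run -m tag:${tag}"
    # $dbt_cmd is intentionally unquoted so a multi-word command is split
    $dbt_cmd run -m "tag:${tag}" "$@" || return 1
  done
}
```

Any extra arguments given to `run_layers` are passed through to every `dbt run` call.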
  5. Load next day

Simulate the next day's load by incrementing the load_date variable:

# dbt_project.yml

vars:
  load_date: '1992-01-08' # increment by one day '1992-01-09'
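
Instead of editing the file for every day, the date can be computed in the shell and passed on the command line with dbt's `--vars` option, which takes precedence over the value in the project file. The sketch below assumes GNU `date` (its `-d` option is not available in BSD/macOS `date`); the function names `next_day` and `load_day` and the `DBT_CMD` variable are hypothetical.

```bash
# Hypothetical helper: print the day after a YYYY-MM-DD date (GNU date only).
next_day() {
  date -u -d "$1 +1 day" +%F
}

# Hypothetical helper: run dbt for a single load_date passed via --vars.
# DBT_CMD defaults to "dbt"; set it to a multi-word command if dbt runs in Docker.
load_day() {
  local day="$1"
  ${DBT_CMD:-dbt} run --vars "{load_date: '${day}'}"
}

# Example usage:
#   load_day "$(next_day 1992-01-08)"
```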

Build Business Vault on top of Data Vault

Create PR and pass CI tests
