Add azure storage account export target (#11)
* Add .vscode to .gitignore

* Extract data_types, ignore_keys, rename_cols into separate files to prepare for using ConfigMaps

* Added azure dependencies

* Fixed wrong content for allocation keys

* Implement Azure-specific env vars, create a new test for Azure, add factory pattern for storage backend

* Run & fix all pylint issues

* aws_s3_storage: change to original upload procedure, azure_storage: use to_parquet

* Remove unnecessary print statements

* Added new ENV vars + respective tests. Implemented mechanism for conditional adding of query parameters

* Adding tests for load_config_file

* Added environment variable for json_normalize separator char.

* Added new ENV vars to README. Added section for required permissions on Storage Account and S3

* Added short docs on necessary Azure permissions

* Change back to original window param

* Add files to Dockerfile as per review.

* Add 'command' field to allow for changes in the Dockerfile's 'ENTRYPOINT'.
cklingspor authored Sep 7, 2024
1 parent 6ee0c91 commit 1c0e817
Showing 16 changed files with 550 additions and 135 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -158,3 +158,6 @@ cython_debug/
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/

# VSCode
.vscode
3 changes: 3 additions & 0 deletions Dockerfile
@@ -10,6 +10,9 @@ RUN apt-get update && apt-get -y upgrade && apt-get -y clean
RUN useradd --create-home --shell /bin/sh --uid 8000 opencost
COPY --from=builder /app /app
COPY src/opencost_parquet_exporter.py /app/opencost_parquet_exporter.py
COPY src/data_types.json /app/data_types.json
COPY src/rename_cols.json /app/rename_cols.json
COPY src/ignore_alloc_keys.json /app/ignore_alloc_keys.json
RUN chmod 755 /app/opencost_parquet_exporter.py && chown -R opencost /app/
USER opencost
ENV PATH="/app/.venv/bin:$PATH"
20 changes: 20 additions & 0 deletions README.md
@@ -20,6 +20,26 @@ The script supports the following environment variables:
* OPENCOST_PARQUET_FILE_KEY_PREFIX: This is the prefix used for the export; by default it is '/tmp'. The export is saved inside this prefix in the following structure: year=window_start.year/month=window_start.month/day=window_start.day, e.g. tmp/year=2024/month=1/day=15
* OPENCOST_PARQUET_AGGREGATE: This is the set of dimensions used to aggregate the data. By default we use "namespace,pod,container", which are the same dimensions used for the native CSV export.
* OPENCOST_PARQUET_STEP: This is the step for the export. By default we use 1h steps, which results in 24 steps per day and makes it easier to match the exported data to AWS CUR, since CUR also exports on an hourly basis.
* OPENCOST_PARQUET_RESOLUTION: Duration to use as resolution in Prometheus queries. Smaller values (i.e. higher resolutions) will provide better accuracy, but worse performance (i.e. slower query time, higher memory use). Larger values (i.e. lower resolutions) will perform better, but at the expense of lower accuracy for short-running workloads.
* OPENCOST_PARQUET_ACCUMULATE: If `"true"`, sum the entire range of time intervals into a single set. Default value is `"false"`.
* OPENCOST_PARQUET_INCLUDE_IDLE: Whether to return the calculated __idle__ field for the query. Default is `"false"`.
* OPENCOST_PARQUET_IDLE_BY_NODE: If `"true"`, idle allocations are created on a per-node basis, which will result in different values when shared and more idle allocations when split. Default is `"false"`. (A sketch after this list shows how these optional toggles might be forwarded as query parameters.)
* OPENCOST_PARQUET_STORAGE_BACKEND: The storage backend to use. Supports `aws`, `azure`. See below for Azure specific variables.
* OPENCOST_PARQUET_JSON_SEPARATOR: The OpenCost API returns nested objects. The [JSON normalization method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html) used to flatten them allows for a custom separator. Use this to specify the separator of your choice.
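
As a rough illustration of how these optional toggles might be forwarded to the OpenCost API (the `build_query_params` helper and the exact parameter names are assumptions for this sketch, not code from the commit):

```python
import os

def build_query_params(window: str) -> dict:
    """Hypothetical sketch: assemble the OpenCost allocation query,
    adding optional parameters only when their env var is set."""
    params = {
        "window": window,
        "aggregate": os.environ.get(
            "OPENCOST_PARQUET_AGGREGATE", "namespace,pod,container"),
        "step": os.environ.get("OPENCOST_PARQUET_STEP", "1h"),
    }
    optional = {
        "OPENCOST_PARQUET_RESOLUTION": "resolution",
        "OPENCOST_PARQUET_ACCUMULATE": "accumulate",
        "OPENCOST_PARQUET_INCLUDE_IDLE": "includeIdle",
        "OPENCOST_PARQUET_IDLE_BY_NODE": "idleByNode",
    }
    # Conditional adding of query parameters: an unset env var simply
    # leaves the corresponding parameter out of the request.
    for env_var, param in optional.items():
        value = os.environ.get(env_var)
        if value is not None:
            params[param] = value
    return params
```

The separator variable, in turn, can be passed straight through as `pandas.json_normalize(response_json, sep=separator)` when flattening the API response.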

## Azure Specific Environment Variables
* OPENCOST_PARQUET_AZURE_STORAGE_ACCOUNT_NAME: Name of the Azure Storage Account you want to export the data to.
* OPENCOST_PARQUET_AZURE_CONTAINER_NAME: The container within the storage account you want to save the data to. The service principal requires write permissions on the container.
* OPENCOST_PARQUET_AZURE_TENANT: Your Azure Tenant ID
* OPENCOST_PARQUET_AZURE_APPLICATION_ID: Client ID of the Service Principal
* OPENCOST_PARQUET_AZURE_APPLICATION_SECRET: Secret of the Service Principal
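
For orientation, a minimal sketch of how these variables could be wired up with the `azure-identity` and `azure-storage-blob` packages pinned in `requirements.txt` (function and variable names are illustrative assumptions, not the commit's actual code):

```python
import os

import pandas as pd
from azure.identity import ClientSecretCredential
from azure.storage.blob import BlobServiceClient

def upload_parquet_to_azure(df: pd.DataFrame, blob_path: str) -> None:
    """Authenticate as a service principal and upload a DataFrame
    to the configured container as a parquet blob (sketch only)."""
    credential = ClientSecretCredential(
        tenant_id=os.environ["OPENCOST_PARQUET_AZURE_TENANT"],
        client_id=os.environ["OPENCOST_PARQUET_AZURE_APPLICATION_ID"],
        client_secret=os.environ["OPENCOST_PARQUET_AZURE_APPLICATION_SECRET"],
    )
    account = os.environ["OPENCOST_PARQUET_AZURE_STORAGE_ACCOUNT_NAME"]
    service = BlobServiceClient(
        account_url=f"https://{account}.blob.core.windows.net",
        credential=credential,
    )
    blob = service.get_blob_client(
        container=os.environ["OPENCOST_PARQUET_AZURE_CONTAINER_NAME"],
        blob=blob_path,
    )
    # to_parquet() with no path returns the parquet file contents as bytes.
    blob.upload_blob(df.to_parquet(), overwrite=True)
```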

# Prerequisites
## AWS IAM

## Azure RBAC
The current implementation allows for authentication via [Service Principals](https://learn.microsoft.com/en-us/entra/identity-platform/app-objects-and-service-principals?tabs=browser) on the Azure Storage Account. Therefore, to use the Azure storage backend you need an existing service principal with the appropriate role assignments. Azure RBAC has built-in roles for Storage Account Blob Storage operations. The [Storage-Blob-Data-Contributor](https://learn.microsoft.com/en-us/azure/role-based-access-control/built-in-roles/storage#storage-blob-data-contributor) role allows writing data to an Azure Storage Account container. A less permissive custom role can be built and is encouraged!
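
For reference, a write-only custom role could look roughly like the following (illustrative placeholder values; verify the data actions against the current Azure documentation before use):

```json
{
  "Name": "OpenCost Parquet Blob Writer",
  "Description": "Illustrative write-only role for blob exports.",
  "Actions": [],
  "DataActions": [
    "Microsoft.Storage/storageAccounts/blobServices/containers/blobs/add/action",
    "Microsoft.Storage/storageAccounts/blobServices/containers/blobs/write"
  ],
  "AssignableScopes": ["/subscriptions/<your-subscription-id>"]
}
```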


# Usage:

1 change: 1 addition & 0 deletions examples/k8s_cron_job.yaml
@@ -54,6 +54,7 @@ spec:
runAsUser: 1000
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
command: ["/app/.venv/bin/python3"] # Update this if the ENTRYPOINT changes
dnsConfig:
options:
- name: single-request-reopen
2 changes: 2 additions & 0 deletions requirements-dev.txt
@@ -7,6 +7,8 @@ pytz==2023.3.post1
six==1.16.0
tzdata==2023.4
pyarrow==14.0.1
azure-storage-blob==12.19.1
azure-identity==1.15.0
# The dependencies below are only used for development.
freezegun==1.4.0
pylint==3.0.3
2 changes: 2 additions & 0 deletions requirements.txt
@@ -7,3 +7,5 @@ pytz==2023.3.post1
six==1.16.0
tzdata==2023.4
pyarrow==14.0.1
azure-storage-blob==12.19.1
azure-identity==1.15.0
39 changes: 39 additions & 0 deletions src/data_types.json
@@ -0,0 +1,39 @@
{
"cpuCoreHours": "float",
"cpuCoreRequestAverage": "float",
"cpuCoreUsageAverage": "float",
"cpuCores": "float",
"cpuCost": "float",
"cpuCostAdjustment": "float",
"cpuEfficiency": "float",
"externalCost": "float",
"gpuCost": "float",
"gpuCostAdjustment": "float",
"gpuCount": "float",
"gpuHours": "float",
"loadBalancerCost": "float",
"loadBalancerCostAdjustment": "float",
"networkCost": "float",
"networkCostAdjustment": "float",
"networkCrossRegionCost": "float",
"networkCrossZoneCost": "float",
"networkInternetCost": "float",
"networkReceiveBytes": "float",
"networkTransferBytes": "float",
"pvByteHours": "float",
"pvBytes": "float",
"pvCost": "float",
"pvCostAdjustment": "float",
"ramByteHours": "float",
"ramByteRequestAverage": "float",
"ramByteUsageAverage": "float",
"ramBytes": "float",
"ramCost": "float",
"ramCostAdjustment": "float",
"ramEfficiency": "float",
"running_minutes": "float",
"sharedCost": "float",
"totalCost": "float",
"totalEfficiency": "float"
}

3 changes: 3 additions & 0 deletions src/ignore_alloc_keys.json
@@ -0,0 +1,3 @@
{
"keys": ["pvs", "lbAllocations"]
}
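
Together with `data_types.json` and `rename_cols.json`, these config files could plausibly be consumed by the `load_config_file` helper mentioned in the commit message, along these lines (the loader's signature and the post-processing steps are assumptions for illustration; `rename_cols.json` is assumed to hold a flat old-name-to-new-name mapping):

```python
import json

import pandas as pd

def load_config_file(path: str) -> dict:
    """Load one of the JSON config files copied into /app by the Dockerfile."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)

def apply_configs(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch: drop ignored allocation keys, rename columns, and cast
    columns to the dtypes configured in data_types.json."""
    ignore_keys = load_config_file("/app/ignore_alloc_keys.json")["keys"]
    rename_cols = load_config_file("/app/rename_cols.json")
    data_types = load_config_file("/app/data_types.json")

    df = df.drop(columns=[k for k in ignore_keys if k in df.columns])
    df = df.rename(columns=rename_cols)
    return df.astype({c: t for c, t in data_types.items() if c in df.columns})
```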