What is Databricks?
Databricks is a unified data analytics platform built on Apache Spark that helps teams process, analyze, and build machine learning models on massive datasets. It’s become the go-to solution for data engineering and analytics teams who need to handle complex data workflows at scale.
Here’s what makes it powerful:
- Unified workspace – Combines data engineering, data science, and business analytics in one platform so teams can collaborate without switching tools
- Built on Apache Spark – Handles massive data processing jobs that would be impossible with traditional databases
- Cloud-native – Runs on AWS, Azure, or GCP, integrating seamlessly with your existing cloud infrastructure
- Notebooks & workflows – Interactive notebooks for development plus production-grade job scheduling for automated data pipelines
What does this integration do?
Our Databricks integration gives you complete visibility into where your DBU (Databricks Unit) costs are actually going, so you can optimize spending instead of guessing. We unify your Databricks and AWS costs in one place, making it easy to see the full picture of your cloud infrastructure spend. Here’s what you get:
- Unified AWS + Databricks view – Stop switching between platforms—see your complete cloud cost story in CloudForecast alongside your existing AWS infrastructure
- DBU-level cost tracking – See exactly how much each workspace, cluster, and job is costing you in real-time
- Query-level analysis – Identify which specific queries are burning through your budget so you can optimize the expensive ones
- Team & workspace allocation – Break down costs by team, project, or workspace to enable accurate chargeback and accountability
- More coming soon!
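To make the DBU-level tracking concrete: cost figures like these come from joining Databricks usage records to list prices in the system billing tables. The query below is an illustrative sketch, not one of CloudForecast's actual queries — the join keys and the `pricing.default` column follow the documented `system.billing` schema:

```sql
-- Illustrative only: estimated list-price spend per workspace and SKU
-- over the last 30 days, joining usage records to list prices.
SELECT
  u.workspace_id,
  u.sku_name,
  SUM(u.usage_quantity * p.pricing.default) AS estimated_list_cost
FROM system.billing.usage AS u
JOIN system.billing.list_prices AS p
  ON  u.sku_name = p.sku_name
  AND u.cloud = p.cloud
  AND u.usage_start_time >= p.price_start_time
  AND (p.price_end_time IS NULL OR u.usage_start_time < p.price_end_time)
WHERE u.usage_date >= date_sub(current_date(), 30)
GROUP BY u.workspace_id, u.sku_name
ORDER BY estimated_list_cost DESC;
```

Note that `pricing.default` is the public list price; your actual invoice may differ if you have negotiated discounts.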
Setup Requirements
What we need:
To set up the integration, we’ll need a few things on your end:
- CloudForecast workspace – A dedicated workspace where we can run our cost analysis queries
- Small serverless SQL warehouse – For running our queries efficiently without spinning up dedicated infrastructure
- Service principal with specific permissions:
  - Read access to the billing and metadata tables listed below
  - Ability to manage the SQL warehouse (this lets us shut it down quickly after we’re done using it)
Tables we’ll read from:
system.billing.usage
system.billing.list_prices
system.compute.clusters
system.compute.warehouses
system.access.workspaces_latest
system.query.history
system.lakeflow.jobs
system.compute.node_timeline
system.access.table_lineage
system.lakeflow.pipelines

Setup Process
There are five key steps to setting up Databricks as a Datasource for CloudForecast:
- Create Service Principal
- Create workspace
- Configure the SQL Warehouse
- Grant access to System Tables
- Run queries in SQL editor
1. Create Service Principal:
- Visit: https://accounts.cloud.databricks.com/user-management/serviceprincipals
- Create a Service Principal with a name (e.g. cloudforecast)
- Go to the Secrets tab and generate a secret with a 730-day lifetime
- 🔥 Important note: Copy both the APPLICATION_ID and the SECRET VALUE for later use
2. Create Workspace:
- Visit: https://accounts.cloud.databricks.com/
- Click on Workspaces, then Create Workspace
- Enter a workspace name (e.g. cloudforecast), select N. Virginia (if possible), and select "Use serverless compute with default storage"
- Once created, click on the Permissions tab
- Click on Add Permissions and add the recently created service principal as a User
- 🔥 Important note: Copy the WORKSPACE_URL
3. Configure SQL Warehouse
- Open the workspace and go to SQL Warehouses under SQL
- Select Create SQL warehouse and enter a name (e.g. cloudforecast)
- Cluster size: 2X-Small
- Auto stop: 5 minutes
- Add your service principal with Can Manage permission – this allows us to turn off the warehouse as soon as possible
- 🔥 Important note: Copy the Warehouse ID
4. Grant access to the System Tables:
- system.billing.usage – Provides SKU-level usage records by workspace for cost and consumption analysis.
- system.billing.list_prices – Provides official Databricks list pricing for each SKU.
- system.compute.clusters – Provides metadata and configuration details for Databricks clusters.
- system.compute.warehouses – Provides metadata and configuration details for Databricks SQL warehouses.
- system.access.workspaces_latest – Provides the latest identifiers and human-readable names for workspaces.
- system.query.history – Provides execution history and performance details for queries.
- system.lakeflow.jobs – Provides metadata and configuration details for Lakeflow jobs.
- system.compute.node_timeline – Provides lifecycle and timing information for compute nodes.
- system.access.table_lineage – Provides upstream and downstream lineage relationships between tables.
- system.lakeflow.pipelines – Provides metadata and configuration details for Lakeflow pipelines, including definitions and execution settings
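As an illustration of what query-level analysis can draw from system.query.history, a sketch like the following surfaces the longest-running statements. Treat it as an example rather than a CloudForecast query; column names such as `total_duration_ms`, `statement_text`, and `end_time` follow the documented schema:

```sql
-- Illustrative only: the ten longest-running statements in the last 7 days.
SELECT
  executed_by,
  total_duration_ms,
  left(statement_text, 100) AS statement_preview
FROM system.query.history
WHERE end_time >= date_sub(current_date(), 7)
ORDER BY total_duration_ms DESC
LIMIT 10;
```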
5. Run queries in SQL editor
Run the following queries in the SQL editor (you must be a metastore admin):
```sql
-- Step 1: Grant USE_SCHEMA on each system schema
GRANT USE_SCHEMA ON SCHEMA system.billing TO `<service-principal-application-id>`;
GRANT USE_SCHEMA ON SCHEMA system.compute TO `<service-principal-application-id>`;
GRANT USE_SCHEMA ON SCHEMA system.access TO `<service-principal-application-id>`;
GRANT USE_SCHEMA ON SCHEMA system.query TO `<service-principal-application-id>`;
GRANT USE_SCHEMA ON SCHEMA system.lakeflow TO `<service-principal-application-id>`;

-- Step 2: Grant SELECT on specific tables
GRANT SELECT ON TABLE system.billing.usage TO `<service-principal-application-id>`;
GRANT SELECT ON TABLE system.billing.list_prices TO `<service-principal-application-id>`;
GRANT SELECT ON TABLE system.compute.clusters TO `<service-principal-application-id>`;
GRANT SELECT ON TABLE system.compute.warehouses TO `<service-principal-application-id>`;
GRANT SELECT ON TABLE system.compute.node_timeline TO `<service-principal-application-id>`;
GRANT SELECT ON TABLE system.access.workspaces_latest TO `<service-principal-application-id>`;
GRANT SELECT ON TABLE system.access.table_lineage TO `<service-principal-application-id>`;
GRANT SELECT ON TABLE system.query.history TO `<service-principal-application-id>`;
GRANT SELECT ON TABLE system.lakeflow.jobs TO `<service-principal-application-id>`;
GRANT SELECT ON TABLE system.lakeflow.pipelines TO `<service-principal-application-id>`;
```

Go back to CloudForecast and enter all the information needed in a new Databricks datasource.
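A quick sanity check, run in the SQL editor while authenticated as the service principal, can confirm the grants took effect — a sketch using one of the granted tables:

```sql
-- Should return a row count rather than a permission error
-- if USE_SCHEMA and SELECT were granted correctly.
SELECT count(*) AS recent_usage_rows
FROM system.billing.usage
WHERE usage_date >= date_sub(current_date(), 7);
```

If this query fails with a permission error, re-check the grants above and confirm they were issued against the service principal's APPLICATION_ID.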