What is Databricks?
Databricks is a unified data analytics platform built on Apache Spark that helps teams process, analyze, and build machine learning models on massive datasets. It’s become the go-to solution for data engineering and analytics teams who need to handle complex data workflows at scale.
Here’s what makes it powerful:
- Unified workspace – Combines data engineering, data science, and business analytics in one platform so teams can collaborate without switching tools
- Built on Apache Spark – Handles massive data processing jobs that would be impossible with traditional databases
- Cloud-native – Runs on AWS, Azure, or GCP, integrating seamlessly with your existing cloud infrastructure
- Notebooks & workflows – Interactive notebooks for development plus production-grade job scheduling for automated data pipelines
What does this integration do?
Our Databricks integration gives you complete visibility into where your DBU (Databricks Unit) costs are actually going, so you can optimize spending instead of guessing. We unify your Databricks and AWS costs in one place, making it easy to see the full picture of your cloud infrastructure spend. Here’s what you get:
- Unified AWS + Databricks view – Stop switching between platforms—see your complete cloud cost story in CloudForecast alongside your existing AWS infrastructure
- DBU-level cost tracking – See exactly how much each workspace, cluster, and job is costing you in real-time
- Query-level analysis – Identify which specific queries are burning through your budget so you can optimize the expensive ones
- Team & workspace allocation – Break down costs by team, project, or workspace to enable accurate chargeback and accountability
- More coming soon!
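To make the DBU-level tracking concrete: cost figures like these come from joining Databricks usage records to list prices in the system billing tables. The query below is an illustrative sketch, not one of CloudForecast's actual queries — the join keys and the `pricing.default` column follow the documented `system.billing` schema:

```sql
-- Illustrative only: estimated list-price spend per workspace and SKU
-- over the last 30 days, joining usage records to list prices.
SELECT
  u.workspace_id,
  u.sku_name,
  SUM(u.usage_quantity * p.pricing.default) AS estimated_list_cost
FROM system.billing.usage AS u
JOIN system.billing.list_prices AS p
  ON  u.sku_name = p.sku_name
  AND u.cloud = p.cloud
  AND u.usage_start_time >= p.price_start_time
  AND (p.price_end_time IS NULL OR u.usage_start_time < p.price_end_time)
WHERE u.usage_date >= date_sub(current_date(), 30)
GROUP BY u.workspace_id, u.sku_name
ORDER BY estimated_list_cost DESC;
```

Note that `pricing.default` is the public list price; your actual invoice may differ if you have negotiated discounts.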
Setup Requirements
What we need:
To set up the integration, we’ll need a few things on your end:
- CloudForecast workspace – A dedicated workspace where we can run our cost analysis queries
- Small serverless SQL warehouse – For running our queries efficiently without spinning up dedicated infrastructure
- Service principal with specific permissions:
  - Read access to the billing and metadata tables listed below
  - Ability to manage the SQL warehouse (this lets us shut it down quickly after we’re done using it)
Tables we’ll read from:
system.billing.usage
system.billing.list_prices
system.compute.clusters
system.compute.warehouses
system.access.workspaces_latest
system.query.history
system.lakeflow.jobs
system.compute.node_timeline
system.access.table_lineage
system.lakeflow.pipelines

Setup Process
There are five key steps to setting up Databricks as a Datasource for CloudForecast:
- Create Service Principal
- Create workspace
- Configure the SQL Warehouse
- Grant access to System Tables
- Run queries in SQL editor
1. Create Service Principal:
- Visit: https://accounts.cloud.databricks.com/user-management/serviceprincipals
- Create a Service Principal with a name (e.g. cloudforecast)
- Go to the Secrets tab and generate a secret with a 730-day lifetime
- 🔥 Important note: Copy both the APPLICATION_ID and the SECRET VALUE for later use
2. Create Workspace:
- Visit: https://accounts.cloud.databricks.com/
- Click on Workspaces, then Create Workspace
- Enter a workspace name (e.g. cloudforecast), select N. Virginia (if possible), and select "Use serverless compute with default storage"
- Once created, click on the Permissions tab
- Click on Add Permissions and add the recently created service principal as a User
- 🔥 Important note: Copy the WORKSPACE_URL
3. Configure SQL Warehouse
- Open the workspace and go to SQL Warehouses under SQL
- Select Create SQL warehouse and enter a name (e.g. cloudforecast)
- Cluster size: 2X-Small
- Auto stop: 5 minutes
- Add your service principal with Can Manage permission – this allows us to turn off the warehouse as soon as possible
- 🔥 Important note: Copy the Warehouse ID
4. Grant access to the System Tables:
- system.billing.usage – Provides SKU-level usage records by workspace for cost and consumption analysis.
- system.billing.list_prices – Provides official Databricks list pricing for each SKU.
- system.compute.clusters – Provides metadata and configuration details for Databricks clusters.
- system.compute.warehouses – Provides metadata and configuration details for Databricks SQL warehouses.
- system.access.workspaces_latest – Provides the latest identifiers and human-readable names for workspaces.
- system.query.history – Provides execution history and performance details for queries.
- system.lakeflow.jobs – Provides metadata and configuration details for Lakeflow jobs.
- system.compute.node_timeline – Provides lifecycle and timing information for compute nodes.
- system.access.table_lineage – Provides upstream and downstream lineage relationships between tables.
- system.lakeflow.pipelines – Provides metadata and configuration details for Lakeflow pipelines, including definitions and execution settings
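As an illustration of what query-level analysis can draw from system.query.history, a sketch like the following surfaces the longest-running statements. Treat it as an example rather than a CloudForecast query; column names such as `total_duration_ms`, `statement_text`, and `end_time` follow the documented schema:

```sql
-- Illustrative only: the ten longest-running statements in the last 7 days.
SELECT
  executed_by,
  total_duration_ms,
  left(statement_text, 100) AS statement_preview
FROM system.query.history
WHERE end_time >= date_sub(current_date(), 7)
ORDER BY total_duration_ms DESC
LIMIT 10;
```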
5. Run queries in SQL editor
Run the following queries in the SQL editor (you must be a metastore admin):
```sql
-- Step 1: Grant USE_SCHEMA on each system schema
GRANT USE_SCHEMA ON SCHEMA system.billing TO `<service-principal-application-id>`;
GRANT USE_SCHEMA ON SCHEMA system.compute TO `<service-principal-application-id>`;
GRANT USE_SCHEMA ON SCHEMA system.access TO `<service-principal-application-id>`;
GRANT USE_SCHEMA ON SCHEMA system.query TO `<service-principal-application-id>`;
GRANT USE_SCHEMA ON SCHEMA system.lakeflow TO `<service-principal-application-id>`;

-- Step 2: Grant SELECT on specific tables
GRANT SELECT ON TABLE system.billing.usage TO `<service-principal-application-id>`;
GRANT SELECT ON TABLE system.billing.list_prices TO `<service-principal-application-id>`;
GRANT SELECT ON TABLE system.compute.clusters TO `<service-principal-application-id>`;
GRANT SELECT ON TABLE system.compute.warehouses TO `<service-principal-application-id>`;
GRANT SELECT ON TABLE system.compute.node_timeline TO `<service-principal-application-id>`;
GRANT SELECT ON TABLE system.access.workspaces_latest TO `<service-principal-application-id>`;
GRANT SELECT ON TABLE system.access.table_lineage TO `<service-principal-application-id>`;
GRANT SELECT ON TABLE system.query.history TO `<service-principal-application-id>`;
GRANT SELECT ON TABLE system.lakeflow.jobs TO `<service-principal-application-id>`;
GRANT SELECT ON TABLE system.lakeflow.pipelines TO `<service-principal-application-id>`;
```

Go back to CloudForecast and enter all the information needed in a new Databricks datasource.
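A quick sanity check, run in the SQL editor while authenticated as the service principal, can confirm the grants took effect — a sketch using one of the granted tables:

```sql
-- Should return a row count rather than a permission error
-- if USE_SCHEMA and SELECT were granted correctly.
SELECT count(*) AS recent_usage_rows
FROM system.billing.usage
WHERE usage_date >= date_sub(current_date(), 7);
```

If this query fails with a permission error, re-check the grants above and confirm they were issued against the service principal's APPLICATION_ID.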