Implementing CI/CD in Databricks Using Repos API
I recently started working on a new project that uses Databricks as its data platform. While I have worked with Databricks in the past, the last time I used it in a production setting was probably a few years ago. Whenever I work with a platform or tool, I naturally like to explore its CI/CD workflow to understand how deployments and automation can be structured.
In this article, I’ll explore the CI/CD options within Databricks, focusing on setting up a pipeline using Git folders and the Repos API. Through this process, I aim to understand how Databricks handles CI/CD and whether this approach aligns with best practices I’ve followed in other data platforms.
Through reading the documentation, I discovered two primary approaches:
- Using Git folders (Repos API)
- Using Databricks Asset Bundles (DAB), a more recent approach
In this article, I’ll focus on the first approach using Git folders and the Repos API. In a future article, I plan to explore the DAB approach.
Prerequisites
Before implementing this setup, ensure you have:
- Familiarity with Databricks CLI
- A Databricks personal access token (which you can generate in your Databricks workspace under “User” → “Developer” → “Access tokens”)
- Three GitHub environments corresponding to your Databricks environments (`dev`, `test`, `main`)

Each GitHub environment should contain the following secrets:
- `DATABRICKS_HOST`: Your Databricks workspace URL
- `DATABRICKS_TOKEN`: Your personal access token
- `DATABRICKS_REPO_PATH`: Path to the repository in Databricks
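If you prefer to script the secret setup, the GitHub CLI can write environment secrets directly. The sketch below uses placeholder values, and the same commands would be repeated for the `test` and `main` environments:

```bash
# Create the environment in the repository settings first, then add its secrets.
# All values below are placeholders; replace them with your workspace URL, token and repo path.
gh secret set DATABRICKS_HOST --env dev --body "https://<your-workspace>.cloud.databricks.com"
gh secret set DATABRICKS_TOKEN --env dev --body "<your-personal-access-token>"
gh secret set DATABRICKS_REPO_PATH --env dev --body "/Repos/<user>/<repo>-dev"
```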

Setup
For this project, I’m using a Databricks free trial, which means I have to simulate environment separation by creating folders within the same workspace rather than using a separate workspace for each environment.
Workflows
- Diagram 1: Illustrates my free trial setup, where the different environments (`dev`, `test`, `main`) are represented as folders inside a single workspace.

- Diagram 2: Shows the recommended setup, where each environment resides in its own Databricks workspace.

🔹 Note: If you have multiple Databricks workspaces (one for each environment), you can keep the folder or repo path uniform across all workspaces. This ensures consistency and simplifies deployment. However, in my free trial setup, all environments exist within the same workspace, so we need to define three different repo paths to separate them.
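To make the distinction concrete, the `DATABRICKS_REPO_PATH` secret in each GitHub environment might look something like this (the paths themselves are illustrative, not prescribed by Databricks):

```text
# Single workspace (free trial): one Git folder per environment
dev  -> /Repos/ci-cd-demo/project-dev
test -> /Repos/ci-cd-demo/project-test
main -> /Repos/ci-cd-demo/project-main

# Separate workspaces (recommended): the same path in every workspace
all  -> /Repos/ci-cd-demo/project
```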
Branching and Folder Structure
To maintain separate environments, I’m creating a branch for each environment in GitHub:
- `dev` branch
- `test` branch
- `main` branch (production)
💡 Important: The branch name must match the environment name set up in GitHub, as we will reference the branch name dynamically in our GitHub Actions workflow.
On Databricks, I’ll create three folders in the workspace, each corresponding to an environment, and check out the respective branch within each folder.
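These folders can be created through the UI, or scripted with the Databricks CLI as in the sketch below. The repository URL and workspace paths are placeholders, and the exact arguments may vary slightly between CLI versions:

```bash
# Create one Git folder per environment and check out the matching branch.
# URL and paths are placeholders; adjust them to your repository and workspace.
databricks repos create https://github.com/<org>/<repo>.git gitHub --path /Repos/ci-cd-demo/project-dev
databricks repos update /Repos/ci-cd-demo/project-dev --branch dev

databricks repos create https://github.com/<org>/<repo>.git gitHub --path /Repos/ci-cd-demo/project-test
databricks repos update /Repos/ci-cd-demo/project-test --branch test

databricks repos create https://github.com/<org>/<repo>.git gitHub --path /Repos/ci-cd-demo/project-main
databricks repos update /Repos/ci-cd-demo/project-main --branch main
```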
CI/CD Pipeline Setup
This approach effectively implements both Continuous Integration (CI) and Continuous Deployment (CD):
Continuous Integration (CI):
- Any push to `dev`, `test`, or `main` triggers the CI pipeline.
- The pipeline syncs the latest changes from GitHub to the respective Databricks folder (environment).
Continuous Deployment (CD):
- Since each Databricks environment folder is directly linked to a Git branch, pushing to `main` automatically deploys changes to the `main` folder in Databricks.
- There are no manual promotion steps; deployment happens as soon as code is merged into a branch.
This setup ensures that environments are always updated with the latest code while following a structured branching strategy.
GitHub Actions Workflow
Here’s the workflow that automates the sync between GitHub and Databricks:
```yaml
name: CI Databricks Sync

on:
  push:
    branches:
      - dev
      - test
      - main
    paths:
      - 'src/**'

jobs:
  sync-databricks:
    runs-on: ubuntu-latest
    environment: ${{ github.ref_name }} # Dynamically selects the GitHub environment based on the branch name
    env:
      DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
      DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
    steps:
      - name: Checkout GitHub repo
        uses: actions/checkout@v4

      # Download the Databricks CLI. See https://github.com/databricks/setup-cli
      - uses: databricks/setup-cli@main

      - name: Extract branch name
        id: extract_branch
        shell: bash
        # Write the branch name to the step's outputs ($GITHUB_OUTPUT replaces the deprecated set-output command)
        run: echo "branch=${GITHUB_REF#refs/heads/}" >> "$GITHUB_OUTPUT"

      # DEBUG
      - name: Print GitHub branch name
        run: |
          echo "This workflow was triggered by branch: ${{ steps.extract_branch.outputs.branch }}"

      - name: Pull latest changes into Databricks Repo
        run: |
          databricks repos update ${{ secrets.DATABRICKS_REPO_PATH }} --branch ${{ steps.extract_branch.outputs.branch }}
```
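Before relying on the workflow, it can be worth running the same sync once from a local terminal to confirm the token and repo path behave as expected. A minimal check, with placeholder values:

```bash
# Authenticate the CLI with the same values the workflow will use
export DATABRICKS_HOST="https://<your-workspace>.cloud.databricks.com"
export DATABRICKS_TOKEN="<your-personal-access-token>"

# Sanity-check authentication, then pull the latest commit of the dev branch into the dev folder
databricks current-user me
databricks repos update /Repos/ci-cd-demo/project-dev --branch dev
```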
Final Thoughts
While this Git folder approach provides a straightforward way to implement CI/CD, it has several limitations:
Lack of Environment Isolation
- Since all environments reside in the same Databricks workspace, there is no true separation between them.
- If you need isolated environments (e.g., to enforce stricter access controls or to run environment-specific tests), using separate workspaces per environment is the recommended approach.
Git-Connected Higher Environments
- I personally consider connecting higher environments (test/prod) directly to Git an anti-pattern.
- Typically, I expect only the dev environment to be Git-connected, while promotion to test/prod is handled via CI/CD pipelines (a sketch of that style follows this list).
- However, Databricks’ official documentation explicitly promotes this approach, which was surprising to me.
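As a rough illustration of that promotion style, the command below copies the project source into a plain, non-Git-linked workspace folder rather than updating a Git folder. It is only a sketch under assumptions: `databricks sync` stands in for whatever deployment step your pipeline actually uses, and the target path is hypothetical.

```bash
# Promote code to prod by pushing validated files into a plain workspace folder,
# instead of linking the prod environment to a Git branch.
# Assumes DATABRICKS_HOST / DATABRICKS_TOKEN point at the prod workspace;
# the target path is a placeholder.
databricks sync ./src /Users/<you>@<company>.com/project-prod --full
```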
No Approval Mechanisms
- In this setup, there is no approval step before deploying changes to production.
- If your organisation requires manual approvals before promoting code, you may need to introduce additional approval workflows in GitHub Actions or use DAB for more structured deployments.
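Because the workflow already selects a GitHub environment per branch, one lightweight option is to add required reviewers to the `main` environment, which pauses any run targeting it until someone approves. This can be configured in the repository settings, or scripted via the GitHub API as in the sketch below (owner, repo and reviewer ID are placeholders):

```bash
# Require a manual approval on the 'main' environment
# (Repository settings > Environments does the same thing).
# Owner, repo and the numeric reviewer ID below are placeholders.
gh api --method PUT /repos/<owner>/<repo>/environments/main --input - <<'EOF'
{
  "reviewers": [
    { "type": "User", "id": 123456 }
  ]
}
EOF
```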
Next Steps: Exploring Databricks Asset Bundles (DAB)
This article covered Git folders and Repos API for CI/CD in Databricks. In the next article, I plan to explore Databricks Asset Bundles (DAB), which provide a more structured and flexible way to manage Databricks deployments.
If you’d like to see the full implementation, feel free to explore my GitHub repo here.