Implementing CI/CD in Databricks Using Repos API

melbdataguy
4 min read · Mar 28, 2025

I recently started working on a new project that uses Databricks as its data platform. While I have worked with Databricks in the past, the last time I used it in a production setting was probably a few years ago. Whenever I work with a platform or tool, I naturally like to explore its CI/CD workflow to understand how deployments and automation can be structured.

In this article, I’ll explore the CI/CD options within Databricks, focusing on setting up a pipeline using Git folders and the Repos API. Through this process, I aim to understand how Databricks handles CI/CD and whether this approach aligns with best practices I’ve followed in other data platforms.

While reading the documentation, I found two primary approaches:

  1. Using Git folders (Repos API)
  2. Using Databricks Asset Bundles (DAB), a more recent approach

In this article, I’ll focus on the first approach using Git folders and the Repos API. In a future article, I plan to explore the DAB approach.

Prerequisites

Before implementing this setup, ensure you have:

  • Familiarity with the Databricks CLI (a quick sanity check using it is sketched after this list)
  • A Databricks personal access token (which you can generate in your Databricks workspace under your user settings: “Developer” → “Access tokens”)
  • Three GitHub environments corresponding to your Databricks environments (dev, test, main).
    Each GitHub environment should contain the following secrets:
    - DATABRICKS_HOST: Your Databricks workspace URL
    - DATABRICKS_TOKEN: Your personal access token
    - DATABRICKS_REPO_PATH: Path to the repository in Databricks
GitHub environments
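
To confirm the host and token work before wiring them into GitHub, you can point the Databricks CLI at the same values locally. Here is a minimal sketch, assuming the unified Databricks CLI is installed; the host and token values are placeholders.

# Export the same values you will later store as GitHub environment secrets (placeholders)
export DATABRICKS_HOST="https://<your-workspace-url>"
export DATABRICKS_TOKEN="dapi..."

# The CLI picks up DATABRICKS_HOST/DATABRICKS_TOKEN from the environment;
# if authentication works, this prints details about the current user
databricks current-user me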

Setup

For this project, I’m using a Databricks free trial, meaning I had to simulate environment separation by creating folders within the same workspace, rather than using separate workspaces for each environment.

Workflows

  • Diagram 1: Illustrates my free trial setup, where different environments (dev, test, main) are represented as folders inside a single workspace.
Diagram 1. My setup using free trial account
  • Diagram 2: Shows the recommended setup, where each environment resides in its own Databricks workspace.
Diagram 2. Recommended setup using multiple workspaces

🔹 Note: If you have multiple Databricks workspaces (one for each environment), you can keep the folder or repo path uniform across all workspaces. This ensures consistency and simplifies deployment. However, in the setup using a free trial, all environments exist within the same workspace, so we need to define three different repo paths to separate them.
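
Concretely, the only secret that differs between the three GitHub environments in my trial setup is DATABRICKS_REPO_PATH. The paths below are hypothetical examples of how the values could be laid out:

# Hypothetical DATABRICKS_REPO_PATH values, one per GitHub environment
# dev  -> /Repos/dev/databricks-cicd-demo
# test -> /Repos/test/databricks-cicd-demo
# main -> /Repos/main/databricks-cicd-demo
#
# With one workspace per environment (the recommended setup), a single shared
# path such as /Repos/cicd/databricks-cicd-demo would work everywhere.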

Branching and Folder Structure

To maintain separate environments, I’m creating a branch for each environment in GitHub:

  • dev branch
  • test branch
  • main branch (production)

💡 Important: The branch name must match the environment name set up in GitHub, as we will reference the branch name dynamically in our GitHub Actions workflow.

On Databricks, I’ll create three folders in the workspace, each corresponding to an environment, and check out the respective branch within each folder.
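
If you prefer to script this step rather than click through the UI, the Databricks CLI (which wraps the Repos API) can create the Git folders and check out the branches. This is a sketch only: the repo URL and workspace paths are hypothetical, and depending on your setup the parent folders under /Repos may need to exist first.

# Hypothetical repo URL; replace with your own
REPO_URL="https://github.com/<your-org>/<your-repo>"

# Create one Git folder (repo clone) per environment
databricks repos create $REPO_URL gitHub --path /Repos/dev/databricks-cicd-demo
databricks repos create $REPO_URL gitHub --path /Repos/test/databricks-cicd-demo
databricks repos create $REPO_URL gitHub --path /Repos/main/databricks-cicd-demo

# Check out the matching branch in each folder
databricks repos update /Repos/dev/databricks-cicd-demo --branch dev
databricks repos update /Repos/test/databricks-cicd-demo --branch test
databricks repos update /Repos/main/databricks-cicd-demo --branch main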

CI/CD Pipeline Setup

This approach effectively implements both Continuous Integration (CI) and Continuous Deployment (CD):

Continuous Integration (CI):

  • Any push to dev, test, or main triggers the CI pipeline.
  • The pipeline syncs the latest changes from GitHub to the respective Databricks folder (environment).

Continuous Deployment (CD):

  • Since each Databricks environment folder is directly linked to a Git branch, pushing to main automatically deploys changes to the main folder in Databricks.
  • No manual promotion steps exist — deployment happens instantly when code is merged into a branch.

This setup ensures that environments are always updated with the latest code while following a structured branching strategy.

GitHub Actions Workflow

Here’s the workflow that automates the sync between GitHub and Databricks:

name: CI Databricks Sync

on:
  push:
    branches:
      - dev
      - test
      - main
    paths:
      - 'src/**'

jobs:
  sync-databricks:
    runs-on: ubuntu-latest
    environment: ${{ github.ref_name }} # Dynamically selects the environment based on branch name
    env:
      DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
      DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}

    steps:
      - name: Checkout GitHub repo
        uses: actions/checkout@v4

      # Download the Databricks CLI. See https://github.com/databricks/setup-cli
      - uses: databricks/setup-cli@main

      - name: Extract branch name
        id: extract_branch
        shell: bash
        # set-output is deprecated; write the value to GITHUB_OUTPUT instead
        run: echo "branch=${GITHUB_REF#refs/heads/}" >> "$GITHUB_OUTPUT"

      # DEBUG
      - name: Print GitHub branch name
        run: |
          echo "This workflow was triggered by branch: ${{ steps.extract_branch.outputs.branch }}"

      - name: Pull latest changes into Databricks Repo
        run: |
          databricks repos update ${{ secrets.DATABRICKS_REPO_PATH }} --branch ${{ steps.extract_branch.outputs.branch }}
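
The databricks repos update step is a thin wrapper over the Repos API. If you wanted to call the API directly instead, the equivalent requests look roughly like this (the workspace path and repo ID are placeholders):

# Look up the repo ID for a given workspace path
curl -s -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  "$DATABRICKS_HOST/api/2.0/repos?path_prefix=/Repos/dev/databricks-cicd-demo"

# Check out a branch in that Git folder (use the id returned above)
curl -s -X PATCH \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"branch": "dev"}' \
  "$DATABRICKS_HOST/api/2.0/repos/<repo_id>"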

Final Thoughts

While this Git folder approach provides a straightforward way to implement CI/CD, it has several limitations:

Lack of Environment Isolation

  • Since all environments reside in the same Databricks workspace, there is no true separation between them.
  • If you need isolated environments (e.g., to enforce stricter access controls or to run environment-specific tests), using separate workspaces per environment is the recommended approach.

Git-Connected Higher Environments

  • I personally consider connecting higher environments (test/prod) directly to Git an anti-pattern.
  • Typically, I expect only the dev environment to be Git-connected, while promotion to test/prod is handled via CI/CD pipelines.
  • However, Databricks’ official documentation explicitly promotes this approach, which was surprising to me.

No Approval Mechanisms

  • In this setup, there is no approval step before deploying changes to production.
  • If your organisation requires manual approvals before promoting code, you may need to introduce additional approval workflows in GitHub Actions (see the sketch below) or use DAB for more structured deployments.
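
One low-effort option is GitHub’s environment protection rules. Because the job already declares environment: ${{ github.ref_name }}, configuring required reviewers on the main environment (repository Settings → Environments → main) makes the job pause for manual approval before any step runs against production. No workflow changes are strictly needed; the relevant lines are repeated below for reference.

jobs:
  sync-databricks:
    runs-on: ubuntu-latest
    # With required reviewers configured on the "main" GitHub environment,
    # this job waits for approval whenever it targets that environment
    environment: ${{ github.ref_name }}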

Next Steps: Exploring Databricks Asset Bundles (DAB)

This article covered Git folders and Repos API for CI/CD in Databricks. In the next article, I plan to explore Databricks Asset Bundles (DAB), which provide a more structured and flexible way to manage Databricks deployments.

If you’d like to see the full implementation, feel free to explore my GitHub repo here.
