Implementing CI/CD in Databricks Using Repos API

melbdataguy
4 min read · Mar 28, 2025

I recently started working on a new project that uses Databricks as its data platform. While I have worked with Databricks in the past, the last time I used it in a production setting was probably a few years ago. Whenever I work with a platform or tool, I naturally like to explore its CI/CD workflow to understand how deployments and automation can be structured.

In this article, I’ll explore the CI/CD options within Databricks, focusing on setting up a pipeline using Git folders and the Repos API. Through this process, I aim to understand how Databricks handles CI/CD and whether this approach aligns with best practices I’ve followed in other data platforms.

While reading the documentation, I found two primary approaches:

  1. Using Git folders (Repos API)
  2. Using Databricks Asset Bundles (DAB), a more recent approach

In this article, I’ll focus on the first approach using Git folders and the Repos API. In a future article, I plan to explore the DAB approach.

Prerequisites

Before implementing this setup, ensure you have:

  • Familiarity with the Databricks CLI (a quick sanity check using it is sketched after this list)
  • A Databricks personal access token (which you can generate in your Databricks workspace under your user settings: “Developer” → “Access tokens”)
  • Three GitHub environments corresponding to your Databricks environments (dev, test, main).
    Each GitHub environment should contain the following secrets:
    - DATABRICKS_HOST: Your Databricks workspace URL
    - DATABRICKS_TOKEN: Your personal access token
    - DATABRICKS_REPO_PATH: Path to the repository in Databricks
GitHub environments
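
To confirm the host and token work before wiring them into GitHub, you can point the Databricks CLI at the same values locally. Here is a minimal sketch, assuming the unified Databricks CLI is installed; the host and token values are placeholders.

# Export the same values you will later store as GitHub environment secrets (placeholders)
export DATABRICKS_HOST="https://<your-workspace-url>"
export DATABRICKS_TOKEN="dapi..."

# The CLI picks up DATABRICKS_HOST/DATABRICKS_TOKEN from the environment;
# if authentication works, this prints details about the current user
databricks current-user me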

Setup

For this project, I’m using a Databricks free trial, meaning I had to simulate environment separation by creating folders within the same workspace, rather than using separate workspaces for each environment.

Workflows

  • Diagram 1: Illustrates my free trial setup, where different environments (dev, test, main) are represented as folders inside a single workspace.
Diagram 1. My setup using free trial account
  • Diagram 2: Shows the recommended setup, where each environment resides in its own Databricks workspace.
Diagram 2. Recommended setup using multiple workspaces

🔹 Note: If you have multiple Databricks workspaces (one for each environment), you can keep the folder or repo path uniform across all workspaces. This ensures consistency and simplifies deployment. However, in the setup using a free trial, all environments exist within the same workspace, so we need to define three different repo paths to separate them.
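
Concretely, the only secret that differs between the three GitHub environments in my trial setup is DATABRICKS_REPO_PATH. The paths below are hypothetical examples of how the values could be laid out:

# Hypothetical DATABRICKS_REPO_PATH values, one per GitHub environment
# dev  -> /Repos/dev/databricks-cicd-demo
# test -> /Repos/test/databricks-cicd-demo
# main -> /Repos/main/databricks-cicd-demo
#
# With one workspace per environment (the recommended setup), a single shared
# path such as /Repos/cicd/databricks-cicd-demo would work everywhere.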

Branching and Folder Structure

To maintain separate environments, I’m creating a branch for each environment in GitHub:

  • dev branch
  • test branch
  • main branch (production)

💡 Important: The branch name must match the environment name set up in GitHub, as we will reference the branch name dynamically in our GitHub Actions workflow.

On Databricks, I’ll create three folders in the workspace, each corresponding to an environment, and check out the respective branch within each folder.
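
If you prefer to script this step rather than click through the UI, the Databricks CLI (which wraps the Repos API) can create the Git folders and check out the branches. This is a sketch only: the repo URL and workspace paths are hypothetical, and depending on your setup the parent folders under /Repos may need to exist first.

# Hypothetical repo URL; replace with your own
REPO_URL="https://github.com/<your-org>/<your-repo>"

# Create one Git folder (repo clone) per environment
databricks repos create $REPO_URL gitHub --path /Repos/dev/databricks-cicd-demo
databricks repos create $REPO_URL gitHub --path /Repos/test/databricks-cicd-demo
databricks repos create $REPO_URL gitHub --path /Repos/main/databricks-cicd-demo

# Check out the matching branch in each folder
databricks repos update /Repos/dev/databricks-cicd-demo --branch dev
databricks repos update /Repos/test/databricks-cicd-demo --branch test
databricks repos update /Repos/main/databricks-cicd-demo --branch main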

CI/CD Pipeline Setup

This approach effectively implements both Continuous Integration (CI) and Continuous Deployment (CD):

Continuous Integration (CI):

  • Any push to dev, test, or main triggers the CI pipeline.
  • The pipeline syncs the latest changes from GitHub to the respective Databricks folder (environment).

Continuous Deployment (CD):

  • Since each Databricks environment folder is directly linked to a Git branch, pushing to main automatically deploys changes to the main folder in Databricks.
  • No manual promotion steps exist — deployment happens instantly when code is merged into a branch.

This setup ensures that environments are always updated with the latest code while following a structured branching strategy.

GitHub Actions Workflow

Here’s the workflow that automates the sync between GitHub and Databricks:

name: CI Databricks Sync

on:
  push:
    branches:
      - dev
      - test
      - main
    paths:
      - 'src/**'

jobs:
  sync-databricks:
    runs-on: ubuntu-latest
    environment: ${{ github.ref_name }} # Dynamically selects the environment based on branch name
    env:
      DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
      DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}

    steps:
      - name: Checkout GitHub repo
        uses: actions/checkout@v4

      # Download the Databricks CLI. See https://github.com/databricks/setup-cli
      - uses: databricks/setup-cli@main

      - name: Extract branch name
        id: extract_branch
        shell: bash
        # set-output is deprecated; write the value to GITHUB_OUTPUT instead
        run: echo "branch=${GITHUB_REF#refs/heads/}" >> "$GITHUB_OUTPUT"

      # DEBUG
      - name: Print GitHub branch name
        run: |
          echo "This workflow was triggered by branch: ${{ steps.extract_branch.outputs.branch }}"

      - name: Pull latest changes into Databricks Repo
        run: |
          databricks repos update ${{ secrets.DATABRICKS_REPO_PATH }} --branch ${{ steps.extract_branch.outputs.branch }}
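
The databricks repos update step is a thin wrapper over the Repos API. If you wanted to call the API directly instead, the equivalent requests look roughly like this (the workspace path and repo ID are placeholders):

# Look up the repo ID for a given workspace path
curl -s -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  "$DATABRICKS_HOST/api/2.0/repos?path_prefix=/Repos/dev/databricks-cicd-demo"

# Check out a branch in that Git folder (use the id returned above)
curl -s -X PATCH \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"branch": "dev"}' \
  "$DATABRICKS_HOST/api/2.0/repos/<repo_id>"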

Final Thoughts

While this Git folder approach provides a straightforward way to implement CI/CD, it has several limitations:

Lack of Environment Isolation

  • Since all environments reside in the same Databricks workspace, there is no true separation between them.
  • If you need isolated environments (e.g., to enforce stricter access controls or to run environment-specific tests), using separate workspaces per environment is the recommended approach.

Git-Connected Higher Environments

  • I personally consider connecting higher environments (test/prod) directly to Git an anti-pattern.
  • Typically, I expect only the dev environment to be Git-connected, while promotion to test/prod is handled via CI/CD pipelines.
  • However, Databricks’ official documentation explicitly promotes this approach, which was surprising to me.

No Approval Mechanisms

  • In this setup, there is no approval step before deploying changes to production.
  • If your organisation requires manual approvals before promoting code, you may need to introduce additional approval workflows in GitHub Actions (see the sketch below) or use DAB for more structured deployments.
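
One low-effort option is GitHub’s environment protection rules. Because the job already declares environment: ${{ github.ref_name }}, configuring required reviewers on the main environment (repository Settings → Environments → main) makes the job pause for manual approval before any step runs against production. No workflow changes are strictly needed; the relevant lines are repeated below for reference.

jobs:
  sync-databricks:
    runs-on: ubuntu-latest
    # With required reviewers configured on the "main" GitHub environment,
    # this job waits for approval whenever it targets that environment
    environment: ${{ github.ref_name }}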

Next Steps: Exploring Databricks Asset Bundles (DAB)

This article covered Git folders and Repos API for CI/CD in Databricks. In the next article, I plan to explore Databricks Asset Bundles (DAB), which provide a more structured and flexible way to manage Databricks deployments.

If you’d like to see the full implementation, feel free to explore my GitHub repo here.
