Real-Time End-to-End Data Streaming with Postgres, Kafka, MinIO, Prometheus, and Grafana

melbdataguy
Jul 9, 2024

As data engineering evolves, proficiency in both batch and streaming architectures is increasingly essential. Batch processing handles large datasets in periodic chunks, suited for non-time-sensitive tasks, while real-time streaming processes data as it arrives, enabling immediate analytics and responsiveness. As a data engineer experienced primarily in batch processing, I embarked on this Proof of Concept (POC) project to explore real-time data streaming using Kafka and other essential technologies.

Apache Kafka excels in real-time data streaming due to its scalability, fault tolerance, and robust ecosystem. Kafka’s architecture, including topics, partitions, and consumer groups, efficiently manages streaming data at scale. Complementary tools like Kafka Connect for seamless data integration, and monitoring solutions such as Prometheus and Grafana, further enhance the ability to build resilient streaming pipelines. Together, these technologies empower data engineers to optimise workflows across diverse use cases.

This project demonstrates setting up and using these technologies to achieve seamless real-time data flow from a Postgres database to Kafka and MinIO. By showcasing this end-to-end process, it aims to provide a foundational understanding of modern data streaming practices and support data engineers transitioning into real-time data environments.

To explore the complete implementation and access the project repository, please visit my GitHub repo.

Pre-requisites

Before diving into the project, ensure the following pre-requisites are met:

  • Docker and Docker Compose installed on your machine
  • Python virtual environment set up with dependencies from requirements.txt
  • Connector JAR files downloaded and placed under the connectors/ directory (e.g. the Debezium Postgres source connector and the Aiven S3 sink connector for MinIO)
  • An API testing tool (Postman, or the Thunder Client extension for VS Code)

Docker Compose Breakdown

Our project will utilise Docker and Docker Compose to orchestrate the following essential services:

  • postgres: Acts as the source database for our streaming data.
  • zookeeper: Essential for Kafka as a coordination service.
  • kafka broker: Core component of Kafka responsible for message storage and retrieval.
  • schema registry: Manages schema metadata for Kafka topics, ensuring compatibility and consistency.
  • kafka connect: Facilitates seamless integration of data between Kafka and other data sources or sinks.
  • minio: Mocks AWS S3 storage, enabling testing of data flows into object storage.
  • prometheus: Enables monitoring and alerting functionalities within our system.
  • grafana: Offers visualisation and monitoring dashboards to complement Prometheus.
  • ksqldb (optional): While not used for stream processing in this demo, ksqlDB allows SQL-like querying and exploration of Kafka topics. Alternatives include kafkacat for topic inspection and management.
  • kafka control center (optional): Provides a web-based interface for managing Kafka clusters and topics. Alternatives such as akhq, CMAK (Kafka Manager), or kpow offer similar functionalities.
  • pgAdmin (optional): Provides a UI for database management and data visualisation. It can be omitted if CLI access suffices for you.

Project Architecture

At a high level, data flows from Postgres into Kafka via Kafka Connect with the Debezium source connector, then from Kafka into MinIO via the S3 sink connector, while Prometheus and Grafana monitor the Kafka services.

Preparation Steps

  • Start all services using Docker Compose.
  • Verify that all services are running.
  • Check the endpoints of the services to ensure they are accessible (the commands are sketched below).
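
A minimal sketch of these commands, assuming the docker-compose.yml sits in the project root and that the host ports follow the stack's usual defaults (8083 for Kafka Connect, 8081 for Schema Registry, 9090 for Prometheus):

    # Start every service defined in docker-compose.yml in the background
    docker compose up -d

    # List the containers and confirm they are all up
    docker compose ps

    # Spot-check a few HTTP endpoints; the ports are the assumptions noted above
    curl -s http://localhost:8083/connectors     # Kafka Connect REST API
    curl -s http://localhost:8081/subjects       # Schema Registry
    curl -s http://localhost:9090/-/ready        # Prometheus readiness endpoint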

Actual Development

Part 1: Postgres to Kafka

  1. Access the Postgres container

Once inside the container, open a psql session to confirm you can reach the database.
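
A minimal sketch, where the container name, user, and database are assumptions based on the compose service names:

    # Open a shell in the Postgres container (service name assumed to be "postgres")
    docker exec -it postgres bash

    # Inside the container, open a psql session (user and database names are assumptions)
    psql -U postgres -d postgres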

2. Initialise the database and tables using the pg_initialise.py script in the repo.

  • This creates two tables (employees and products) and sets REPLICA IDENTITY FULL.
  • It also populates the tables with initial data (20 records each).
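
The column definitions below are illustrative assumptions (the article only names the two tables), but the SQL the script effectively issues looks roughly like this, shown via psql in case you want to replicate it by hand:

    # Run the initialisation script from the host, inside your Python virtual environment
    python pg_initialise.py

    # Roughly equivalent SQL (illustrative columns); REPLICA IDENTITY FULL makes Debezium
    # emit full row images for updates and deletes
    docker exec -it postgres psql -U postgres -c "
      CREATE TABLE IF NOT EXISTS employees (id SERIAL PRIMARY KEY, name TEXT, role TEXT);
      CREATE TABLE IF NOT EXISTS products  (id SERIAL PRIMARY KEY, name TEXT, price NUMERIC);
      ALTER TABLE employees REPLICA IDENTITY FULL;
      ALTER TABLE products  REPLICA IDENTITY FULL;"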

3. Access the ksqlDB container to inspect topics and connectors.

4. Check the existing topics and connectors.
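
The container and server hostnames below are assumptions based on a typical Confluent-style compose file; SHOW TOPICS and SHOW CONNECTORS are standard ksqlDB statements:

    # Open the ksql CLI against the ksqlDB server (names and port are assumptions)
    docker exec -it ksqldb-cli ksql http://ksqldb-server:8088

    # Then, at the ksql> prompt:
    #   SHOW TOPICS;
    #   SHOW CONNECTORS;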

5. Configure Kafka Connect to stream data from Postgres using the Debezium connector.

  • Prepare postgres-source-connector.json, which can be found under the kafka-connect-config/ directory.
  • Use Postman, the Thunder Client VS Code extension, or a similar tool to send a POST request with the connector configuration (a curl equivalent is sketched below).
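
If you prefer the command line to Postman, the same registration can be done with curl. The repo's postgres-source-connector.json is the source of truth; the values below (credentials, database name, plugin) are illustrative assumptions, with the streaming topic prefix matching the topic names shown later. Note that topic.prefix is the Debezium 2.x property; older releases use database.server.name instead.

    # Register the Debezium Postgres source connector with Kafka Connect
    curl -X POST http://localhost:8083/connectors \
      -H "Content-Type: application/json" \
      -d '{
        "name": "postgres-source-connector",
        "config": {
          "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
          "database.hostname": "postgres",
          "database.port": "5432",
          "database.user": "postgres",
          "database.password": "postgres",
          "database.dbname": "postgres",
          "plugin.name": "pgoutput",
          "topic.prefix": "streaming"
        }
      }'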

6. Verify the connector status. We can use our running ksqldb container to check.

You can also verify the status with a GET request to localhost:8083/connectors/; curl equivalents are shown below.
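
For reference, the equivalent checks from the command line (the connector name assumes the example above):

    # Status of a specific connector
    curl -s http://localhost:8083/connectors/postgres-source-connector/status

    # List all registered connectors
    curl -s http://localhost:8083/connectors/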

7. Verify that new topics have been created on the Kafka broker.
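
One way to check, assuming the broker container is named broker and its internal listener is on 9092:

    # List topics from inside the broker container; the CDC topics should appear as
    # streaming.public.employees and streaming.public.products
    docker exec -it broker kafka-topics --bootstrap-server localhost:9092 --list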

8. In a separate terminal, run pg_streaming_insert.py to insert random records into the Postgres tables.
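
To run the generator, and optionally fire a single manual insert to trace one change end to end (the column names in the manual insert are the same illustrative assumptions as earlier):

    # Continuously insert random records into the employees and products tables
    python pg_streaming_insert.py

    # Optional: a one-off manual insert to watch a single change flow through
    docker exec -it postgres psql -U postgres \
      -c "INSERT INTO products (name, price) VALUES ('test-product', 9.99);"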

9. Monitor the streamed messages in ksqlDB, for example by printing the topic streaming.public.products.
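
ksqlDB's PRINT statement streams raw records from a topic; kcat (mentioned earlier as an alternative) works too if you have it installed. Ports and converter settings are assumptions:

    # At the ksql> prompt, stream CDC messages as they arrive (Ctrl+C to stop):
    #   PRINT 'streaming.public.products' FROM BEGINNING;

    # Or, from the host, consume the same topic with kcat
    kcat -b localhost:9092 -t streaming.public.products -C -o beginning
    # If the topic is Avro-encoded, add: -s value=avro -r http://localhost:8081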

We have successfully demonstrated the flow of data from Postgres to Kafka in real-time. Leveraging Kafka Connect with the Debezium connector ensures seamless integration, facilitating continuous streaming of data updates from the database to Kafka topics.

Part 2: Kafka to MinIO

In this section, we focus on setting up a Kafka-to-MinIO data pipeline to store streaming data into MinIO, a mock AWS S3 service.

  1. Ensure MinIO is running by accessing localhost:9001 in your web browser. Use the credentials:
  • Username: minio
  • Password: minio123

2. Create a bucket named raw in MinIO and adjust its access policy to Public to facilitate data accessibility.
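
The same can be done from the command line with the MinIO client (mc). The API port 9000 is MinIO's default and an assumption here (the console runs on 9001), and the policy subcommand differs between mc versions:

    # Point the MinIO client at the local server and create the bucket
    mc alias set localminio http://localhost:9000 minio minio123
    mc mb localminio/raw

    # Make the bucket publicly readable (newer mc releases; older ones use "mc policy set public")
    mc anonymous set public localminio/raw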

3. Configure the Kafka sink connector. Utilise the Aiven S3 Sink Connector to establish the Kafka-to-MinIO connection; it transfers data from Kafka topics to the MinIO bucket.


Initiate a POST request using the configuration file minio-sink-connector.json, located in the kafka-connect-config/ directory. This file specifies the settings and mappings required for the sink connector to operate effectively.
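
As before, the repo's JSON file is authoritative. A hedged curl equivalent is sketched below, using the MinIO credentials and bucket from this article; the remaining values (topics, endpoint, region, output format) are illustrative assumptions based on the Aiven S3 sink connector's documented properties:

    # Register the Aiven S3 sink connector, pointed at MinIO instead of AWS S3
    curl -X POST http://localhost:8083/connectors \
      -H "Content-Type: application/json" \
      -d '{
        "name": "minio-sink-connector",
        "config": {
          "connector.class": "io.aiven.kafka.connect.s3.AivenKafkaConnectS3SinkConnector",
          "topics": "streaming.public.employees,streaming.public.products",
          "aws.access.key.id": "minio",
          "aws.secret.access.key": "minio123",
          "aws.s3.bucket.name": "raw",
          "aws.s3.endpoint": "http://minio:9000",
          "aws.s3.region": "us-east-1",
          "format.output.type": "json",
          "file.name.template": "{{topic}}/{{timestamp:unit=yyyy}}-{{timestamp:unit=MM}}-{{timestamp:unit=dd}}/{{timestamp:unit=HH}}/{{partition:padding=true}}-{{start_offset:padding=true}}.json"
        }
      }'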

4. Again, using ksql, confirm the status of the connector to ensure it has been successfully deployed and is actively transferring data from Kafka to MinIO.

5. While the pg_streaming_insert.py script continues to run, monitor the raw bucket in MinIO. The data is organised in JSON format within the bucket, following the hierarchy specified in the sink connector configuration:

/{{topic}}/{{timestamp:unit=yyyy}}-{{timestamp:unit=MM}}-{{timestamp:unit=dd}}/{{timestamp:unit=HH}}/{{partition:padding=true}}-{{start_offset:padding=true}}.json

A separate folder is created per topic inside the raw bucket.

By completing this section, we establish a robust end-to-end streaming pipeline from Postgres through Kafka into MinIO, demonstrating the practical application and integration of these technologies in a real-time data environment.

Part 3: Monitoring with Prometheus and Grafana

Prometheus and Grafana are widely used for monitoring Kafka clusters thanks to their robust feature sets and ease of integration. They provide comprehensive insights into the performance and health of Kafka components, helping operators and administrators keep the system running smoothly.

How Prometheus Works

Prometheus is an open-source monitoring and alerting toolkit designed for reliability and scalability. It works by scraping metrics exposed by monitored services at regular intervals. This pull-based model allows Prometheus to collect time-series data, storing it locally in a time-series database. Prometheus supports flexible querying and visualisation of metrics, making it suitable for monitoring dynamic environments like Kafka clusters.

How Grafana Works

Grafana complements Prometheus by offering visualisation and dashboarding capabilities. It connects to Prometheus as a data source, enabling users to create customised dashboards that display metrics collected from Kafka components. Grafana’s intuitive interface supports real-time updates and interactive charts, empowering users to monitor Kafka cluster performance effectively.

Monitoring Setup in Our Project

For monitoring purposes, our project includes three directories:

  1. grafana

Contains a provisioning/ folder with subdirectories datasources/ and dashboards/. This setup allows Grafana to automatically configure Prometheus as a datasource and preload sample dashboards in JSON format.

We are utilising a sample dashboard from Grafana Dashboard 11962 for Kafka metrics. Alternatively, comprehensive dashboard examples are available from Confluent's GitHub.
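
For orientation, a sketch of what a datasource provisioning file under provisioning/datasources/ might contain; the file name and the Prometheus URL are assumptions matching the compose service name:

    # provisioning/datasources/prometheus.yml (illustrative)
    apiVersion: 1

    datasources:
      - name: Prometheus
        type: prometheus
        access: proxy
        url: http://prometheus:9090
        isDefault: true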

2. jmx-exporter

This directory integrates with Prometheus by leveraging the JMX Exporter. Download the JMX Exporter JAR file from the Prometheus JMX Exporter releases page and place it here. The directory holds the Java agent (jmx_prometheus_javaagent-1.0.1.jar), which is mounted into services such as Schema Registry, Kafka Connect, Zookeeper, and the Kafka broker.

Each service has an exporter config YAML file (obtained from Prometheus JMX Exporter Example Configs) defining metrics to be exposed.

3. prometheus

Contains prometheus.yml, defining scrape configurations (scrape jobs) for each monitored service.
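
A sketch of the kind of scrape jobs it defines; the job names, target hostnames, and the metrics port (1234) are assumptions that must match the JMX exporter ports configured in docker-compose:

    global:
      scrape_interval: 15s

    scrape_configs:
      - job_name: "kafka-broker"
        static_configs:
          - targets: ["broker:1234"]
      - job_name: "schema-registry"
        static_configs:
          - targets: ["schema-registry:1234"]
      - job_name: "kafka-connect"
        static_configs:
          - targets: ["connect:1234"]
      - job_name: "zookeeper"
        static_configs:
          - targets: ["zookeeper:1234"]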

In the docker-compose configuration, specific services such as the schema registry, Kafka Connect, Zookeeper, and the Kafka broker are individually configured with KAFKA_JMX_OPTS in their environment section. This setting defines the Java agent path and the exporter YAML file location needed to expose JMX metrics.

Additionally, the jmx-exporter directory is mounted into each container at /usr/share/jmx_exporter, ensuring that Prometheus can access and collect metrics seamlessly from these designated services.
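
To confirm the wiring, you can inspect a container's environment; the container name, agent port, and config file name below are assumptions, while the jar path matches the /usr/share/jmx_exporter mount described above:

    # Check that the JMX exporter agent flag is present in the broker container
    docker exec broker env | grep KAFKA_JMX_OPTS

    # Expected shape of the value:
    # -javaagent:/usr/share/jmx_exporter/jmx_prometheus_javaagent-1.0.1.jar=1234:/usr/share/jmx_exporter/kafka_broker.yml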

Prometheus Jobs

By utilising Prometheus and Grafana in our monitoring setup, we enhance visibility into Kafka’s operational metrics, enabling proactive management and optimisation of our streaming pipelines. This integrated approach ensures robust performance monitoring and facilitates troubleshooting in real-time data environments.

Grafana dashboard

In summary, this article has showcased the seamless integration of Postgres, Kafka, MinIO, Prometheus, and Grafana into a robust real-time data streaming pipeline. By harnessing Kafka’s scalability and Kafka Connect’s integration capabilities, alongside Prometheus and Grafana for monitoring, it illustrates effective management of dynamic data environments. This practical guide equips data engineers with essential skills for implementing and optimising modern data streaming architectures.

For more insights and updates on data engineering innovations, subscribe to stay tuned for upcoming articles and projects!

Written by melbdataguy

Just a curious data engineer based in Melbourne | Self-learner
