Before going into the various types of Postgres CDC let us understand the two components – Postgres and CDC as two different standalone entities.
Postgres
Often referred to as PostgreSQL, Postgres is an open-source relational database platform. It can be used for various activities such as carrying out OLTP workloads, analytics, and data warehousing. Businesses typically prefer relational databases for handling transactional functions after which the reports are aggregated and analyzed through a separate data warehouse.
However, it is essential that the data warehouse used is updated with the latest data in its transactional database. This is because reporting and analytics are impacted adversely if the periodical batch requirements are not synced between the databases. To get around this problem, the Change Data Capture feature comes in handy in Postgres CDC to continuously sync data between the instances.
Change Data Capture (CDC)
The Change Data Capture feature is an in-built software design pattern in databases. It tracks and monitors all changes made in the source database so that action be taken on them at a later stage. Data integration is another aspect of the CDC in cases of identification, capture, and delivery of the changes made to data in businesses. In a nutshell, CDC captures data, transforms it into a structure that is supported by the target database, and finally uploads it to the destination database.
Benefits of Postgres CDC
Let us now briefly go through some benefits of implementing Postgres CDC in databases.
- When data warehouses and other downstream systems are in sync with PostgreSQL, it is possible to capture data in real-time.
- Since only changes that are linked to Postgres are processed, database loads are reduced with Postgres CDC.
- It is not necessary to change the application code for efficiently implementing use cases that need access to any changes made to Postgres.
Types of Postgres CDC and their Pros and Cons
There are three types of Postgres CDC that will now be discussed in detail along with their respective pros and cons.
# Trigger-based Postgres CDC
In this model, changes such as Insert, Delete, and Update taking place in the table of interest can be identified. A changelog can be built by inserting a row for every change made into a change table. This Postgres trigger function is supported by all 9.1 versions or later of Postgres where all the change events are stored in the audit.logged actions.
This trigger-based approach of Postgres CDC stores events captured within PostgreSQL only. If users want to sync change events to another data warehouse or other data systems, they have to constantly query the PostgreSQL table that contains the change events. Not only is this process time-consuming but also quite complex.
Pros
- Useful metadata is automatically added to the change events with the PostgreSQL trigger approach. This metadata includes the session user name, a statement that triggered the change, and transaction ID.
- Since any change is captured immediately by trigger-based Postgres CDC, it is possible to process the changes in real-time.
Cons
- There is an increase in the time of execution of the original statement as the triggers affect the performance and speed of Postgres.
- Separate data pipelines have to be set up that poll the tables populated by the trigger function. This is especially so when change events need to be synced with a database or a data warehouse outside the Postgres database.
- Users have to undertake complex activities to create and manage triggers in this form of Postgres CDC.
# Query-based Postgres CDC
In this type of Postgres CDC, there is a timestamp column in the tracked database that indicates the time of the last change of a row. To access the information, that column in Postgres has to be recurringly queried. This action provides records of all changes that have taken place since the last query. Query-based Postgres CDC can only record Update and Insert events only and not Delete changes.
Pros
- To implement query-based CDC, changes need not be made to the PostgreSQL database. However, this benefit is limited to the schema having a timestamp column indicating the time of modification of the rows.
Cons
- PostgreSQL database performance is often affected since the query-based Postgres CDC uses the query layer for extracting data.
- Resources are wasted when recurring polling of the monitored table does not throw up any changes.
- There must be an additional column in query-based Postgres CDC to track the time of the last change in records.
- Unless it is a soft activity, “delete” changes cannot be captured by Postgres CDC.
# Logical Replication-based Postgres CDC
The Logical Replication-based Postgres CDC is the latest offering of PostgreSQL and was released with the 9.4 version. It can replicate data between various PostgreSQL instances running on separate systems. In a nutshell, this form of Postgres CDC is a write-ahead log-on disc consisting of all changes made to data like Delete, Insert, and Update in the PostgreSQL database.
The logical replication model of Postgres CDC is enabled through changes made to the configuration file. While a decoding plugin automatically implements logical replication, the same has to be performed manually for all versions older than 10.
Pros
- Downstream applications can access the current data from PostgreSQL since log-based Postgres CDC captures data changes in real time.
- While the above two versions do not capture all forms of changes, the logical replication model captures all three – Insert, Update, and Delete.
- Changes implemented through the logical replication Postgres CDC provides direct access to file systems. Hence, the performance and speed of the Postgres database are not affected in any way.
Cons
- Only versions of PostgreSQL that are 9.4 and later support this form of Postgres CDC.
An analysis of all three shows logical replication to have an edge over the others.