Monitoring v4.3.8
Monitoring replication setups is important to ensure that your system performs optimally and does not run out of disk space or encounter other faults that may halt operations.
It is important to have automated monitoring in place to ensure that if, for example, replication slots start falling badly behind, the administrator is alerted and can take proactive action.
EDB provides Postgres Enterprise Manager (PEM), which supports BDR from version 8.1. Alternatively, tools or users can make their own calls into BDR using the facilities discussed below.
Monitoring Overview
A BDR Group consists of multiple servers, often referred to as nodes. All of the nodes need to be monitored to ensure the health of the whole group.
The bdr_monitor role may execute the bdr.monitor functions to provide an
assessment of BDR health using one of three levels:
- OK- often shown as Green
- WARNING- often shown as Yellow
- CRITICAL- often shown as Red
- as well as UNKNOWN- for unrecognized situations, often shown as Red
BDR also provides dynamic catalog views that show the instantaneous state of various internal metrics and also BDR metadata catalogs that store the configuration defaults and/or configuration changes requested by the user. Some of those views and tables are accessible by bdr_monitor or bdr_read_all_stats, but some contain user or internal information that has higher security requirements.
BDR allows you to monitor each of the nodes individually, or to monitor the
whole group by access to a single node. If you wish to monitor each node individually,
simply connect to each node and issue monitoring requests. If you wish to monitor
the group from a single node then use the views starting with bdr.group since these
requests make calls to other nodes to assemble a group-level information set.
If you have been granted access to the bdr.run_on_all_nodes() function by
bdr_superuser then you may make your own calls to all nodes.
Monitoring Node Join and Removal
By default, the node management functions wait for the join or part
operation to complete.  This can be turned off using the respective
wait_for_completion function argument.  If waiting is turned off,
then to see when a join or part operation finishes,
check the node state indirectly via bdr.node_summary and
bdr.state_journal_details.
When called, the helper function bdr.wait_for_join_completion() will cause
a PostgreSQL session to pause until all outstanding node join operations
complete.
Here is an example output of a SELECT query from bdr.node_summary that
indicates that two nodes are active and another one is joining:
# SELECT node_name, interface_connstr, peer_state_name, # node_seq_id, node_local_dbname # FROM bdr.node_summary; -[ RECORD 1 ]-----+----------------------------------------- node_name | node1 interface_connstr | host=localhost dbname=postgres port=7432 peer_state_name | ACTIVE node_seq_id | 1 node_local_dbname | postgres -[ RECORD 2 ]-----+----------------------------------------- node_name | node2 interface_connstr | host=localhost dbname=postgres port=7433 peer_state_name | ACTIVE node_seq_id | 2 node_local_dbname | postgres -[ RECORD 3 ]-----+----------------------------------------- node_name | node3 interface_connstr | host=localhost dbname=postgres port=7434 peer_state_name | JOINING node_seq_id | 3 node_local_dbname | postgres
Also, the table bdr.node_catchup_info will give information
on the catch-up state, which can be relevant to joining nodes or parting nodes.
When a node is parted, it could be that some nodes in the cluster did not receive all the data from that parting node. So it will create a temporary slot from a node that already received that data and can forward it.
The catchup_state can be one of the following:
10 = setup 20 = start 30 = catchup 40 = done
Monitoring Replication Peers
There are two main views used for monitoring of replication activity:
- bdr.node_slotsfor monitoring outgoing replication
- bdr.subscription_summaryfor monitoring incoming replication
Most of the information provided by bdr.node_slots can be also obtained by querying
the standard PostgreSQL replication monitoring views
pg_catalog.pg_stat_replication
and
pg_catalog.pg_replication_slots.
Each node has one BDR group slot which should never have a connection to it
and will very rarely be marked as active. This is normal, and does not imply
something is down or disconnected. See Replication Slots created by BDR.
Monitoring Outgoing Replication
There is an additional view used for monitoring of outgoing replication activity:
- bdr.node_replication_ratesfor monitoring outgoing replication
The bdr.node_replication_rates view gives an overall picture of the outgoing
replication activity along with the catchup estimates for peer nodes,
specifically.
# SELECT * FROM bdr.node_replication_rates; -[ RECORD 1 ]----+----------- peer_node_id | 112898766 target_name | node1 sent_lsn | 0/28AF99C8 replay_lsn | 0/28AF99C8 replay_lag | 00:00:00 replay_lag_bytes | 0 replay_lag_size | 0 bytes apply_rate | 822 catchup_interval | 00:00:00 -[ RECORD 2 ]----+----------- peer_node_id | 312494765 target_name | node3 sent_lsn | 0/28AF99C8 replay_lsn | 0/28AF99C8 replay_lag | 00:00:00 replay_lag_bytes | 0 replay_lag_size | 0 bytes apply_rate | 853 catchup_interval | 00:00:00
The apply_rate above refers to the rate in bytes per second. It is the rate
at which the peer is consuming data from the local node. The replay_lag when
a node reconnects to the cluster is immediately set to zero. We are working on
fixing this information; as a workaround, we suggest you use the catchup_interval
column that refers to the time required for the peer node to catch up to the
local node data. The other fields are also available via the bdr.node_slots
view, as explained below.
Note
This catalog is only present when bdr-enteprise extension is installed.
Administrators may query bdr.node_slots for outgoing replication from the
local node. It shows information about replication status of all other nodes
in the group that are known to the current node, as well as any additional
replication slots created by BDR on the current node.
# SELECT node_group_name, target_dbname, target_name, slot_name, active_pid, # catalog_xmin, client_addr, sent_lsn, replay_lsn, replay_lag, # replay_lag_bytes, replay_lag_size # FROM bdr.node_slots; -[ RECORD 1 ]---+---------------------------- node_group_name | bdrgroup target_dbname | postgres target_name | node3 slot_name | bdr_postgres_bdrgroup_node3 active_pid | 15089 catalog_xmin | 691 client_addr | 127.0.0.1 sent_lsn | 0/23F7B70 replay_lsn | 0/23F7B70 replay_lag | [NULL] replay_lag_bytes| 120 replay_lag_size | 120 bytes -[ RECORD 2 ]---+---------------------------- node_group_name | bdrgroup target_dbname | postgres target_name | node2 slot_name | bdr_postgres_bdrgroup_node2 active_pid | 15031 catalog_xmin | 691 client_addr | 127.0.0.1 sent_lsn | 0/23F7B70 replay_lsn | 0/23F7B70 replay_lag | [NULL] replay_lag_bytes| 84211 replay_lag_size | 82 kB
Note that because BDR is a mesh network, to get full view of lag in the cluster, this query has to be executed on all nodes participating.
replay_lag_bytes reports the difference in WAL positions between the local
server's current WAL write position and replay_lsn, the last position
confirmed replayed by the peer node. replay_lag_size is just a human-readable
form of the same.  It is important to understand that WAL usually contains a lot
of writes that are not replicated but still count in replay_lag_bytes,
including VACUUM activity, index changes, writes associated with other
databases on the same node, writes for tables that are not part of a
replication set, etc. So the lag in bytes reported here is not the amount of
data that must be replicated on the wire to bring the peer node up to date,
only the amount of server-side WAL that must be processed.
Similarly, replay_lag is not a measure of how long the peer node will take to
catch up, or how long it will take to replay from its current position to the
write position at the time bdr.node_slots was queried. It measures the delay
between when the peer confirmed the most recent commit and the current
wall-clock time. We suggest that you monitor replay_lag_bytes and replay_lag_size
or catchup_interval in bdr.node_replication_rates, as this column is set to
zero immediately after the node reconnects.
The lag in both bytes and time does not advance while logical replication is streaming a transaction. It only changes when a commit is replicated. So the lag will tend to "sawtooth", rising as a transaction is streamed, then falling again as the peer node commits it, flushes it, and sends confirmation. The reported LSN positions will "stair-step" instead of advancing smoothly, for similar reasons.
When replication is disconnected (active = 'f'), the active_pid column
will be NULL, as will client_addr and other fields that only make sense
with an active connection. The state field will be 'disconnected'. The
_lsn fields will be the same as the confirmed_flush_lsn, since that is the
last position that the client is known for certain to have replayed to and saved.
The _lag fields will show the elapsed time between the most recent confirmed
flush on the client and the current time, and the _lag_size and _lag_bytes
fields will report the distance between confirmed_flush_lsn and the local
server's current WAL insert position.
Note: It is normal for restart_lsn to be behind the other lsn columns;
this does not indicate a problem with replication or a peer node lagging. The
restart_lsn is the position that PostgreSQL's internal logical decoding must
be reading WAL at if interrupted, and generally reflects the position of the
oldest transaction that is not yet replicated and flushed. A very old
restart_lsn can make replication slow to restart after disconnection and
force retention of more WAL than is desirable, but will otherwise be harmless.
If you are concerned, look for very long running transactions and forgotten
prepared transactions.
Monitoring Incoming Replication
Incoming replication (also called subscription) can be monitored by querying
the bdr.subscription_summary view. This shows the list of known subscriptions
to other nodes in the EDB Postgres Distributed cluster and the state of the replication worker, e.g.:
# SELECT node_group_name, origin_name, sub_enabled, sub_slot_name, # subscription_status # FROM bdr.subscription_summary; -[ RECORD 1 ]-------+---------------------------- node_group_name | bdrgroup origin_name | node2 sub_enabled | t sub_slot_name | bdr_postgres_bdrgroup_node1 subscription_status | replicating -[ RECORD 2 ]-------+---------------------------- node_group_name | bdrgroup origin_name | node3 sub_enabled | t sub_slot_name | bdr_postgres_bdrgroup_node1 subscription_status | replicating
Monitoring WAL senders using LCR
If the Decoding Worker is enabled, information about the
current LCR (Logical Change Record) file for each WAL sender can be monitored
via the function bdr.wal_sender_stats,
e.g.:
postgres=# SELECT * FROM bdr.wal_sender_stats(); pid | is_using_lcr | decoder_slot_name | lcr_file_name ---------+--------------+-------------------------------+------------------------------------------ 2059904 | f | | 2059909 | t | bdr_postgres_bdrgroup_decoder | 0000000000000000000000140000000000000000 2059916 | t | bdr_postgres_bdrgroup_decoder | 0000000000000000000000140000000000000000 (3 rows)
If is_using_lcr is FALSE, decoder_slot_name/lcr_file_name will be NULL.
This will be the case if the Decoding Worker is not enabled, or the WAL sender is
serving a logical standby.
Additionally, information about the Decoding Worker can be monitored via the function bdr.get_decoding_worker_stat, e.g.:
postgres=# SELECT * FROM bdr.get_decoding_worker_stat(); pid | decoded_upto_lsn | waiting | waiting_for_lsn ---------+------------------+---------+----------------- 1153091 | 0/1E5EEE8 | t | 0/1E5EF00 (1 row)
Monitoring BDR Replication Workers
All BDR workers show up in the system view bdr.stat_activity,
which has the same columns and information content as
pg_stat_activity.
So this view offers these insights into the state of a BDR system:
- The wait_event column has enhanced information, if the reason for waiting is related to BDR.
- The querycolumn will be blank in BDR workers, except when a writer process is executing DDL
The bdr.workers view shows BDR worker specific details, that are not
available from bdr.stat_activity.
The view bdr.worker_errors shows last error (if any) reported by any worker
which has a problem continuing the work. This is persistent information, so
it's important to note the time of the error not just the existence of one,
because most errors are transient in their nature and BDR workers will retry
the failed operation.
Monitoring BDR Writers
There is another system view bdr.writers to monitor writer activities.
This views shows the current status of only writer workers. It includes:
- sub_nameto identify the subscription which the writer belongs to
- pidof the writer process
- streaming_allowedto know if the writer supports application of in-progress streaming transactions
- is_streamingto know if the writer is currently applying a streaming transaction
- commit_queue_positionto check the position of the writer in the commit queue.
BDR honours commit ordering by following the same commit order as
happened on the origin. In case of parallel writers, multiple writers
could be applying different transactions at the same time. The
commit_queue_position shows in which order they will commit. Value 0
means that the writer is the first one to commit. Value -1 means that
the commit position is not yet known. This can happen for a streaming
transaction or when the writer is not applying any transaction at the
moment.
Monitoring Global Locks
The global lock, which is currently only used for DDL replication, is a heavyweight lock that exists across the whole BDR group.
There are currently two types of global locks:
- DDL lock, used for serializing all DDL operations on permanent (not temporary) objects (i.e. tables) in the database
- DML relation lock, used for locking out writes to relations during DDL operations that change the relation definition
Either or both entry types may be created for the same transaction, depending on
the type of DDL operation and the value of the bdr.ddl_locking setting.
Global locks held on the local node are visible in the bdr.global_locks
view.  This view shows the type of the lock; for
relation locks it shows which relation is being locked, the PID holding the
lock (if local), and whether the lock has been globally granted or not. In case
of global advisory locks, lock_type column shows GLOBAL_LOCK_ADVISORY and
relation column shows the advisory key(s) on which the lock is acquired.
The following is an example output of bdr.global_locks while running an
ALTER TABLE statement with bdr.ddl_locking = on:
# SELECT lock_type, relation, pid FROM bdr.global_locks; -[ RECORD 1 ]-------------- lock_type | GLOBAL_LOCK_DDL relation | [NULL] pid | 15534 -[ RECORD 2 ]-------------- lock_type | GLOBAL_LOCK_DML relation | someschema.sometable pid | 15534
See the catalog documentation for details on all fields including lock timing information.
Monitoring Conflicts
Replication conflicts can arise when multiple nodes make changes that affect the same rows in ways that can interact with each other. The BDR system should be monitored to ensure that conflicts are identified and, where possible, application changes are made to eliminate them or make them less frequent.
By default, all conflicts are logged to bdr.conflict_history. Since this
contains full details of conflicting data, the rows are protected by
row-level security to ensure they are visible only by
owners of replicated tables. Owners should expect conflicts and analyze them
to see which, if any, might be considered as problems to be resolved.
For monitoring purposes use bdr.conflict_history_summary, which does
not contain user data. An example query to count the number of conflicts
seen within the current day using an efficient query plan is:
SELECT count(*) FROM bdr.conflict_history_summary WHERE local_time > date_trunc('day', current_timestamp) AND local_time < date_trunc('day', current_timestamp + '1 day');
External Monitoring
User supplied metadata can be stored to allow monitoring tools to understand and monitor the EDB Postgres Distributed cluster. By centralizing this information, external tools can access any single node and read details about the whole cluster, such as network cost and warning/alarm thresholds for specific connections.
bdr_superuser has the privileges on these functions and tables.
The view bdr.network_monitoring is also accessible by the
bdr_read_all_stats role.
bdr.set_node_location
This function inserts node metadata into bdr.node_location
Synopsis
bdr.set_node_location( node_group_name text, node_name text, node_region text, node_location text);
Parameters
- node_group_name - name of the BDR group
- node_name - name of the node
- node_region - the datacenter site or Region
- node_location - the server name, availability zone etc..
bdr.set_network_path_info
This function inserts network path metadata for network paths between nodes
into the table bdr.network_path_info.
Synopsis
bdr.set_network_path_info( node_group_name text, region1 text, region2 text, location1 text, location2 text, network_cost numeric, warning_threshold numeric, alarm_threshold numeric)
Parameters
- node_group_name - name of the BDR group
- region1 - the origin server name
- region2 - the remote server name
- location1 - the origin datacente name
- location2 - the remote datacenter name
- network_cost - an abstract value representing the cost of network transfer
- warning_threshold - a delay above which a threshold should be raised
- alarm_threshold - a delay above which an alarm should be raised
bdr.network_monitoring view
This view collects information about the network path between nodes.
The configuration of logging is defined by the bdr.alter_node_set_log_config
function.
Apply Statistics
BDR collects statistics about replication apply, both for each subscription and for each table.
Two monitoring views exist: bdr.stat_subscription for subscription statistics
and bdr.stat_relation for relation statistics. These views both provide:
- Number of INSERTs/UPDATEs/DELETEs/TRUNCATEs replicated
- Block accesses and cache hit ratio
- Total I/O time for read/write
- Number of in-progress transactions streamed to file
- Number of in-progress transactions streamed to writers
- Number of in-progress streamed transactions committed/aborted
and for relations only, these statistics:
- Total time spent processing replication for the relation
- Total lock wait time to acquire lock (if any) for the relation (only)
and for subscriptions only, these statistics:
- Number of COMMITs/DDL replicated for the subscription
- Number of times this subscription has connected upstream
Tracking of these statistics is controlled by the BDR GUCs
bdr.track_subscription_apply and bdr.track_relation_apply
respectively.
The example output from these would look like this:
# SELECT sub_name, nconnect, ninsert, ncommit, nupdate, ndelete, ntruncate, nddl FROM bdr.stat_subscription; -[ RECORD 1 ]---------------------------------- sub_name | bdr_regression_bdrgroup_node1_node2 nconnect | 3 ninsert | 10 ncommit | 5 nupdate | 0 ndelete | 0 ntruncate | 0 nddl | 2
In this case the subscription connected 3 times to the upstream, inserted 10 rows and did 2 DDL commands inside 5 transactions.
Stats counters for these views can be reset to zero using the functions
bdr.reset_subscription_stats and bdr.reset_relation_stats.
Standard PostgreSQL Statistics Views
Statistics on table and index usage are updated normally by the downstream master. This is essential for the correct function of autovacuum. If there are no local writes on the downstream master and statistics have not been reset, these two views should show corresponding results between upstream and downstream:
- pg_stat_user_tables
- pg_statio_user_tables
Note
We don't necessarily expect the upstream table statistics to be similar to the downstream ones; we only expect them to change by the same amounts. Consider the example of a table whose statistics show 1M inserts and 1M updates; when a new node joins the BDR group, the statistics for the same table in the new node will show 1M inserts and zero updates. However, from that moment, the upstream and downstream table statistics will change by the same amounts, because all changes on one side will be replicated to the other side.
Since indexes are used to apply changes, the identifying indexes on the
downstream side may appear more heavily used with workloads that perform
UPDATEs and DELETEs than non-identifying indexes are.
The built-in index monitoring views are:
- pg_stat_user_indexes
- pg_statio_user_indexes
All these views are discussed in detail in the PostgreSQL documentation on the statistics views.
Monitoring BDR Versions
BDR allows running different Postgres versions as well as different BDR versions across the nodes in the same cluster. This is useful for upgrading.
The view bdr.group_versions_details uses the function
bdr.run_on_all_nodes() to retrieve Postgres and BDR versions from all nodes
at the same time. For example:
bdrdb=# SELECT node_name, postgres_version, bdr_version FROM bdr.group_versions_details; node_name | postgres_version | bdr_version -----------+------------------+------------- node1 | 14.1 | 4.0.0 node2 | 14.1 | 4.0.0
The recommended setup is to try to have all nodes running the same latest versions as soon as possible. It is recommended that the cluster does not run different BDR versions for too long.
For monitoring purposes, we recommend the following alert levels:
- status=UNKNOWN, message=This node is not part of any BDR group
- status=OK, message=All nodes are running same BDR versions
- status=WARNING, message=There is at least 1 node that is not accessible
- status=WARNING, message=There are node(s) running different BDR versions when compared to other nodes
The described behavior is implemented in the function
bdr.monitor_group_versions(), which uses BDR version
information returned from the view bdr.group_version_details
to provide a cluster-wide version check. For example:
bdrdb=# SELECT * FROM bdr.monitor_group_versions(); status | message --------+----------------------------------------- OK | All nodes are running same BDR versions
Monitoring Raft Consensus
Raft Consensus should be working cluster-wide at all times. The impact of running a EDB Postgres Distributed cluster without Raft Consensus working might be as follows:
- BDR data changes replication may still be working correctly
- Global DDL/DML locks will not work
- Galloc sequences will eventually run out of chunks
- Eager Replication will not work
- Cluster maintenance operations (join node, part node, promote standby) are still allowed but they might not finish (simply hang)
- Node statuses might not be correctly synced among the BDR nodes
- BDR group replication slot does not advance LSN, thus keeps WAL files on disk
The view bdr.group_raft_details uses the functions
bdr.run_on_all_nodes() and bdr.get_raft_status() to retrieve Raft
Consensus status from all nodes at the same time. For example:
bdrdb=# SELECT node_id, node_name, state, leader_id FROM bdr.group_raft_details; node_id | node_name | state | leader_id ------------+-----------+---------------+------------ 1148549230 | node1 | RAFT_LEADER | 1148549230 3367056606 | node2 | RAFT_FOLLOWER | 1148549230
We can say that Raft Consensus is working correctly if all below conditions are met:
- A valid state (RAFT_LEADERorRAFT_FOLLOWER) is defined on all nodes
- Only one of the nodes is the RAFT_LEADER
- The leader_idis the same on all rows and must match thenode_idof the row wherestate = RAFT_LEADER
From time to time, Raft Consensus will start a new election to define a
new RAFT_LEADER. During an election, there might be an intermediary
situation where there is no RAFT_LEADER and some of the nodes consider
themselves as RAFT_CANDIDATE. The whole election should not take longer
than bdr.raft_election_timeout (by default it is set to 6 seconds). If
the query above returns an in-election situation, then simply wait for
bdr.raft_election_timeout and run the query again. If after
bdr.raft_election_timeout has passed and some the conditions above are
still not met, then Raft Consensus is not working.
Raft Consensus might not be working correctly on a single node only;
for example one of the nodes does not recognize the current leader and
considers itself as a RAFT_CANDIDATE. In this case, it is important to
make sure that:
- All BDR nodes are accessible to each other through both regular and
replication connections (check file pg_hba.conf)
- BDR versions are the same on all nodes
- bdr.raft_election_timeoutis the same on all nodes
In some cases, especially if nodes are geographically distant from each
other and/or network latency is high, the default value of
bdr.raft_election_timeout (6 seconds) might not be enough. If Raft
Consensus is still not working even after making sure everything is
correct, consider increasing bdr.raft_election_timeout to, say, 30
seconds on all nodes. From BDR 3.6.11 onwards, setting
bdr.raft_election_timeout requires only a server reload.
Given how Raft Consensus affects cluster operational tasks, and also as Raft Consensus is directly responsible for advancing the group slot, we can define monitoring alert levels as follows:
- status=UNKNOWN, message=This node is not part of any BDR group
- status=OK, message=Raft Consensus is working correctly
- status=WARNING, message=There is at least 1 node that is not accessible
- status=WARNING, message=There are node(s) as RAFT_CANDIDATE, an election might be in progress
- status=WARNING, message=There is no RAFT_LEADER, an election might be in progress
- status=CRITICAL, message=There is a single node in Raft Consensus
- status=CRITICAL, message=There are node(s) as RAFT_CANDIDATE while a RAFT_LEADER is defined
- status=CRITICAL, message=There are node(s) following a leader different than the node set as RAFT_LEADER
The described behavior is implemented in the function
bdr.monitor_group_raft(), which uses Raft Consensus status
information returned from the view bdr.group_raft_details
to provide a cluster-wide Raft check. For example:
bdrdb=# SELECT * FROM bdr.monitor_group_raft(); status | message --------+------------------------------------- OK | Raft Consensus is working correctly
Monitoring Replication Slots
Each BDR node keeps:
- One replication slot per active BDR peer
- One group replication slot
For example:
bdrdb=# SELECT slot_name, database, active, confirmed_flush_lsn FROM pg_replication_slots ORDER BY slot_name; slot_name | database | active | confirmed_flush_lsn --------------------------+----------+--------+--------------------- bdr_bdrdb_bdrgroup | bdrdb | f | 0/3110A08 bdr_bdrdb_bdrgroup_node2 | bdrdb | t | 0/31F4670 bdr_bdrdb_bdrgroup_node3 | bdrdb | t | 0/31F4670 bdr_bdrdb_bdrgroup_node4 | bdrdb | t | 0/31F4670
Peer slot names follow the convention bdr_<DATABASE>_<GROUP>_<PEER>,
while the BDR group slot name follows the convention
bdr_<DATABASE>_<GROUP>, which can be accessed using the function
bdr.local_group_slot_name().
Peer replication slots should be active on all nodes at all times. If a peer replication slot is not active, then it might mean:
- The corresponding peer is shutdown or not accessible; or
- BDR replication is broken.
Grep the log file for ERROR or FATAL and also check bdr.worker_errors on
all nodes.  The root cause might be, for example, an incompatible DDL was
executed with DDL replication disabled on one of the nodes.
The BDR group replication slot is however inactive most of the time. BDR maintains this slot and advances its LSN when all other peers have already consumed the corresponding transactions. Consequently it is not necessary to monitor the status of the group slot.
The function bdr.monitor_local_replslots() provides a summary of whether all
BDR node replication slots are working as expected, e.g.:
bdrdb=# SELECT * FROM bdr.monitor_local_replslots(); status | message --------+------------------------------------------------- OK | All BDR replication slots are working correctly
One of the following status summaries will be returned:
- UNKNOWN:- This node is not part of any BDR group
- OK:- All BDR replication slots are working correctly
- OK:- This node is part of a subscriber-only group
- CRITICAL:- There is at least 1 BDR replication slot which is inactive
- CRITICAL:- There is at least 1 BDR replication slot which is missing
Monitoring Transaction COMMITs
By default, BDR transactions commit only on the local node. In that case,
transaction COMMIT will be processed quickly.
BDR can be used with standard PostgreSQL synchronous replication, while
BDR also provides two new transaction commit modes: CAMO and Eager
replication. Each of these modes provides additional robustness
features, though at the expense of additional latency at COMMIT.
The additional time at COMMIT can be monitored dynamically using the
bdr.stat_activity catalog, where processes report different wait_event
states.  A transaction in COMMIT waiting for confirmations from one or
more synchronous standbys reports a SyncRep wait event, whereas the
two new modes report EagerRep.
- On this page
- Monitoring Overview
- Monitoring Node Join and Removal
- Monitoring Replication Peers
- Monitoring BDR Replication Workers
- Monitoring BDR Writers
- Monitoring Global Locks
- Monitoring Conflicts
- External Monitoring
- Apply Statistics
- Standard PostgreSQL Statistics Views
- Monitoring BDR Versions
- Monitoring Raft Consensus
- Monitoring Replication Slots
- Monitoring Transaction COMMITs