ERROR: requested WAL segment xxxx has already been removed 错误处理

碰到 ERROR: requested WAL segment xxxx has already been removed 的错误。出现这个报错信息主要原因是主库checkpoint了，但wal日志还没有发送到备库，主库wal日志被清理，导致备库请求恢复的wal和主库实际存在的wal日志不一致。后面三种生产上都会用上，有必要实验看看效果。

24K老游

2301人浏览 · 2024-01-15 10:51:50

24K老游 · 2024-01-15 10:51:50 发布

碰到 ERROR: requested WAL segment xxxx has already been removed 的错误。

出现这个报错信息主要原因是主库checkpoint了，但wal日志还没有发送到备库，主库wal日志被清理，导致备库请求恢复的wal和主库实际存在的wal日志不一致。
可能的原因是：

备库停机时间过长，超过了 checkpoint 的默认间隔，历史wal已经被清理。导致 wal 还未发送到备库却被 remove 掉了。
主库大事务产生了大量 wal，因为 PostgreSQL 在执行事务过程中，直到提交时才会发送到备库。由于该事务需要执行的时间过长，超过了 checkpoint 的默认间隔，所以导致有的 wal 还未发送到备库却被 remove 掉了。

解决办法有如下几种：

设置wal_keep_segments 为更大的值
- 这个值调大以后可以在主库堆积更多的wal日志，测试环境数据量小可以用用，但生产环境系统操作频繁，数据量大，这不能解决根本问题。
重做备机
- 没有开启归档和replication slots的时候用这个方法，终极大法，缺点是如果数据量大，耗时太长。
提前开启归档
- 将wal归档，可以保留更多的wal日志以便后期恢复使用。
启用 replication slots
- 为每个备机准备一个插槽，大概就是找个地方记录备机当前同步的状态，如果新生成的wal没有成功同步，就一直保留，在这之后的wal都不会被清理，直到备机恢复同步。如果不及时处理故障的备机，wal日志可能会写满磁盘。

后面三种生产上都会用上，有必要实验看看效果。

模拟故障

修改主库参数
[pg@localhost ~]$ vi /data/db1/postgresql.conf
---------------------------------------
synchronous_commit = local
checkpoint_timeout = 30s
max_wal_size = 100MB
min_wal_size = 10MB
----------------------------------------

重载配置
[pg@localhost ~]$ pg_ctl -D /data/db1/ reload

关闭备机
[pg@localhost ~]$ pg_ctl -D /data/db2/ stop

人工切换wal日志
[pg@localhost ~]$ psql -h localhost -p 15431 -U pg postgres
postgres=# select pg_switch_wal();
 pg_switch_wal
---------------
 0/2C000158
(1 row)

postgres=# create table t6(a int);
CREATE TABLE
postgres=# insert into t6(a) values (generate_series(1,100));
INSERT 0 1
postgres=# select pg_switch_wal();
 pg_switch_wal
---------------
 0/2D0142F8
(1 row)

postgres=# checkpoint;
CHECKPOINT
postgres=#

启动备库
[pg@localhost ~]$ pg_ctl -D /data/db2/ -l /data/db2/server.log start

查看日志，报错
[pg@localhost ~]$ tail -f /data/db2/log/*
2024-01-12 14:36:17.218 CST [11689] LOG:  entering standby mode
2024-01-12 14:36:17.219 CST [11689] LOG:  redo starts at 0/2C000060
2024-01-12 14:36:17.219 CST [11689] LOG:  consistent recovery state reached at 0/2C000108
2024-01-12 14:36:17.219 CST [11689] LOG:  invalid record length at 0/2C000140: wanted 24, got 0
2024-01-12 14:36:17.220 CST [11687] LOG:  database system is ready to accept read only connections
2024-01-12 14:36:17.224 CST [11693] LOG:  started streaming WAL from primary at 0/2C000000 on timeline 1
2024-01-12 14:36:17.224 CST [11693] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 00000001000000000000002C has already been removed
2024-01-12 14:36:17.228 CST [11695] LOG:  started streaming WAL from primary at 0/2C000000 on timeline 1
2024-01-12 14:36:17.228 CST [11695] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 00000001000000000000002C has already been removed
2024-01-12 14:36:22.242 CST [11717] LOG:  started streaming WAL from primary at 0/2C000000 on timeline 1
2024-01-12 14:36:22.242 CST [11717] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 00000001000000000000002C has already been removed

重做备机

两种方式重做备机：

pg_basebackup 程序，用普通文件或创建tar包的方式进行基础备份，数据量小的时候很方便。

干掉原来的备库(注意备份postgresql.conf和recovery.conf)
[pg@localhost ~]$ rm -rf /data/db2

创建备库(创建完成后将postgresql.conf和recovery.conf复制回来)
[pg@localhost ~]$ pg_basebackup -Pv -Upg -h localhost -p 15431 -D /data/db2

重启备库，再查看日志，无报错。

也可以使用 pg_start_backup 和 pg_stop_backup 创建基础备份。
主要三个步骤：
- 执行 pg_start_backup命令开始执行备份；
- 使用命令创建数据库目录的副本；
- 执行pg_stop_backup命令结束备份。

执行 pg_start_backup命令
[pg@localhost ~]$ psql -h localhost -p 15431 -U pg postgres -c "select pg_start_backup('initial_backup');"

同步基础备份文件到备库
[pg@localhost ~]$ rsync -cva --inplace --exclude=*pg_wal* --exclude=postgresql.conf /data/db1/* /data/db2

执行pg_stop_backup命令
[pg@localhost ~]$ psql -h localhost -p 15431 -U pg postgres -c "select pg_stop_backup();"

重启备库，再查看日志，无报错。

提前开启归档

通过归档复制的流程

新建归档目录
主库开启归档，配置归档命令archive_command，用于将wal文件复制到归档目录
备库配置归档恢复命令restore_command，用于将归档文件复制到备库的wal目录

新建归档目录
[pg@localhost ~]$ mkdir /data/arc

主库开启归档
[pg@localhost ~]$ vi /data/db1/postgresql.conf
---------------------------------------
archive_mode = on
archive_command = 'test ! -f /data/arc/%f && cp %p /data/arc/%f'
archive_timeout = 1
---------------------------------------

使用模拟故障相关操作，模拟故障，启动备库查看日志

2024-01-12 16:37:01.977 CST [12695] LOG:  entering standby mode
2024-01-12 16:37:01.979 CST [12695] LOG:  redo starts at 0/48000028
2024-01-12 16:37:01.979 CST [12695] LOG:  consistent recovery state reached at 0/49000060
2024-01-12 16:37:01.979 CST [12695] LOG:  invalid record length at 0/49000060: wanted 24, got 0
2024-01-12 16:37:01.980 CST [12693] LOG:  database system is ready to accept read only connections
2024-01-12 16:37:01.984 CST [12699] LOG:  started streaming WAL from primary at 0/49000000 on timeline 1
2024-01-12 16:37:01.984 CST [12699] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000000000000049 has already been removed
2024-01-12 16:37:01.988 CST [12701] LOG:  started streaming WAL from primary at 0/49000000 on timeline 1
2024-01-12 16:37:01.988 CST [12701] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000000000000049 has already been removed
2024-01-12 16:37:07.006 CST [12703] LOG:  started streaming WAL from primary at 0/49000000 on timeline 1
2024-01-12 16:37:07.006 CST [12703] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000000000000049 has already been removed

备库配置归档恢复命令
[pg@localhost ~]$ vi /data/db2/recovery.conf
---------------------------------------
restore_command = 'cp /data/arc/%f %p'
---------------------------------------

重新启动备库，查看备库日志，可以看到备库启动后先从归档中恢复数据，然后再通过WAL流复制数据。

2024-01-12 16:39:04.646 CST [12742] LOG:  entering standby mode
2024-01-12 16:39:04.681 CST [12742] LOG:  restored log file "000000010000000000000048" from archive
2024-01-12 16:39:04.933 CST [12742] LOG:  redo starts at 0/48000028
2024-01-12 16:39:04.948 CST [12742] LOG:  restored log file "000000010000000000000049" from archive
2024-01-12 16:39:05.209 CST [12742] LOG:  consistent recovery state reached at 0/49000060
2024-01-12 16:39:05.210 CST [12740] LOG:  database system is ready to accept read only connections
2024-01-12 16:39:05.218 CST [12742] LOG:  restored log file "00000001000000000000004A" from archive
2024-01-12 16:39:05.480 CST [12742] LOG:  restored log file "00000001000000000000004B" from archive
2024-01-12 16:39:05.767 CST [12742] LOG:  restored log file "00000001000000000000004C" from archive
2024-01-12 16:39:06.077 CST [12742] LOG:  restored log file "00000001000000000000004D" from archive
2024-01-12 16:39:06.427 CST [12742] LOG:  restored log file "00000001000000000000004E" from archive
2024-01-12 16:39:06.727 CST [12742] LOG:  restored log file "00000001000000000000004F" from archive
cp: cannot stat ‘/data/arc/000000010000000000000050’: No such file or directory
2024-01-12 16:39:07.044 CST [12755] LOG:  started streaming WAL from primary at 0/50000000 on timeline 1

启用 replication slots

这个实验得关闭归档，否则pg会优先使用归档。

将主库 archive_mode 设置为 off，重新加载主库
修改配置 max_replication_slots 的值，设置值需大于订阅节点的数量
创建 slot
为备库配置 slot
模拟故障并观察日志

pg10的 max_replication_slots默认为10，所以主库配置不用调整。直接创建slot即可

[pg@localhost ~]$ psql -h localhost -p 15431 -U pg postgres
psql (10.23)
Type "help" for help.

postgres=# SELECT * FROM pg_create_physical_replication_slot('s1_slot');
 slot_name | lsn
-----------+-----
 s1_slot   |
(1 row)

备库配置 slot
[pg@localhost ~]$ vi /data/db2/recovery.conf
---------------------------------------
primary_slot_name = 's1_slot'
---------------------------------------

[pg@localhost ~]$ psql -h localhost -p 15431 -U pg postgres
postgres=# select * from pg_replication_slots;
 slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_ls
n | confirmed_flush_lsn
-----------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-----------
--+---------------------
 s1_slot   |        | physical  |        |          | f         | t      |      14535 |      |              | 0/7F000060
  |
(1 row)

启用replication slots后，备机发生故障时不会清理WAL，所以先查看故障前wal文件情况

[pg@localhost pg_wal]$ ll
total 65536
-rw-------. 1 pg pg       16777216 Jan 15 09:37 00000001000000000000006C
-rw-------. 1 pg postgres 16777216 Jan 15 09:38 00000001000000000000006D
-rw-------. 1 pg pg       16777216 Jan 15 09:38 00000001000000000000006E
-rw-------. 1 pg postgres 16777216 Jan 15 09:37 00000001000000000000006F
drwx------. 2 pg pg              6 Jan 15 09:32 archive_status

关闭备库模拟故障，查看看故障时wal文件情况

[pg@localhost pg_wal]$ ll
total 180224
-rw-------. 1 pg pg       16777216 Jan 15 10:08 00000001000000000000006E
-rw-------. 1 pg postgres 16777216 Jan 15 10:08 00000001000000000000006F
-rw-------. 1 pg pg       16777216 Jan 15 10:08 000000010000000000000070
-rw-------. 1 pg postgres 16777216 Jan 15 10:08 000000010000000000000071
-rw-------. 1 pg postgres 16777216 Jan 15 10:08 000000010000000000000072
-rw-------. 1 pg postgres 16777216 Jan 15 10:08 000000010000000000000073
-rw-------. 1 pg postgres 16777216 Jan 15 10:09 000000010000000000000074
-rw-------. 1 pg postgres 16777216 Jan 15 10:09 000000010000000000000075
-rw-------. 1 pg postgres 16777216 Jan 15 10:09 000000010000000000000076
-rw-------. 1 pg postgres 16777216 Jan 15 10:09 000000010000000000000077
-rw-------. 1 pg postgres 16777216 Jan 15 10:09 000000010000000000000078
drwx------. 2 pg pg              6 Jan 15 09:32 archive_status

在主库上执行checkpoint，wal文件仍然不会被清理。

可以看到wal没有被清理，一直在增加。

重新启动备库，数据从slot标记的位置开始恢复，无报错。

[pg@localhost ~]$ pg_ctl -D /data/db2/ -l /data/db2/server.log start

查看日志
2024-01-15 10:11:45.044 CST [14528] LOG:  entering standby mode
cp: cannot stat ‘/data/arc/00000001000000000000006D’: No such file or directory
2024-01-15 10:11:45.048 CST [14528] LOG:  redo starts at 0/6D000060
cp: cannot stat ‘/data/arc/00000001000000000000006E’: No such file or directory
2024-01-15 10:11:45.050 CST [14528] LOG:  consistent recovery state reached at 0/6E000060
2024-01-15 10:11:45.050 CST [14528] LOG:  invalid record length at 0/6E000060: wanted 24, got 0
2024-01-15 10:11:45.050 CST [14526] LOG:  database system is ready to accept read only connections
2024-01-15 10:11:45.055 CST [14534] LOG:  started streaming WAL from primary at 0/6E000000 on timeline 1

在主库上执行checkpoint，再查看主库的wal目录，同步过的文件已被清理。

[pg@localhost pg_wal]$ ll
total 98304
-rw-------. 1 pg postgres 16777216 Jan 15 10:10 00000001000000000000007E
-rw-------. 1 pg postgres 16777216 Jan 15 10:28 00000001000000000000007F
-rw-------. 1 pg postgres 16777216 Jan 15 10:08 000000010000000000000080
-rw-------. 1 pg postgres 16777216 Jan 15 10:08 000000010000000000000081
-rw-------. 1 pg postgres 16777216 Jan 15 10:09 000000010000000000000082
-rw-------. 1 pg postgres 16777216 Jan 15 10:09 000000010000000000000083
drwx------. 2 pg pg              6 Jan 15 09:32 archive_status