Setting up a Startup Probe in Kubernetes
Issue
Liquibase runs when CPS starts up and can take a relatively long time to load the necessary changesets. Kubernetes uses liveness probes to check whether the CPS pod is still running. If Kubernetes does not receive a successful liveness probe response within the configured time, it restarts the pod.
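To illustrate the failure mode, here is a minimal sketch of a liveness-only probe configuration. The /healthz path and the liveness-port name are the ones used in the startup probe on this page; the threshold values are assumptions chosen for illustration, not known CPS defaults.

```yaml
# Illustrative only: liveness probe with no startup probe in front of it.
livenessProbe:
  httpGet:
    path: /healthz
    port: liveness-port
  failureThreshold: 3   # assumed value: restart after 3 consecutive failures
  periodSeconds: 10     # assumed value: probe every 10 seconds
```

With values like these, the pod would be restarted after roughly 30 seconds of failed checks, which is far shorter than a long Liquibase run.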
Liquibase places a ChangeLogLock on the database while it is running the changesets. If the pod is restarted while the changesets are still running, Liquibase does not remove the ChangeLogLock. The restarted CPS instance is then unable to acquire a ChangeLogLock of its own, because the previous one was never released, and the following appears in the log:
ChangeLogLock error in CPS
2022-04-04T18:37:36.740Z|main|| com.zaxxer.hikari.HikariDataSource - HikariPool-1 - Starting...
2022-04-04 18:37:37.151 INFO [cps-application,,] 1 --- [ main] com.zaxxer.hikari.HikariDataSource : HikariPool-1 - Start completed.
2022-04-04T18:37:37.151Z|main|| com.zaxxer.hikari.HikariDataSource - HikariPool-1 - Start completed.
2022-04-04 18:37:37.857 INFO [cps-application,,] 1 --- [ main] liquibase.database : Set default schema name to public
2022-04-04T18:37:37.857Z|main|| liquibase.database - Set default schema name to public
2022-04-04 18:37:38.940 INFO [cps-application,,] 1 --- [ main] liquibase.lockservice : Waiting for changelog lock....
2022-04-04T18:37:38.940Z|main|| liquibase.lockservice - Waiting for changelog lock....
2022-04-04 18:37:48.973 INFO [cps-application,,] 1 --- [ main] liquibase.lockservice : Waiting for changelog lock....
2022-04-04T18:37:48.973Z|main|| liquibase.lockservice - Waiting for changelog lock....
2022-04-04 18:37:59.054 INFO [cps-application,,] 1 --- [ main] liquibase.lockservice : Waiting for changelog lock....
Recommended Fix
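We recommend using a startup probe, which has been a stable feature since Kubernetes 1.20. While a startup probe is configured, Kubernetes does not run liveness checks until the startup probe has succeeded. The probe is added to the container spec in the Kubernetes deployment file. The Liquibase setup time is environment dependent and, in our experience, can take up to 5 minutes. The time allowed for startup is roughly failureThreshold multiplied by periodSeconds, so we recommend choosing values whose product is above 5 minutes (300 seconds).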
Startup Probe in deployment.yaml
startupProbe:
  httpGet:
    path: /healthz
    port: liveness-port
  failureThreshold: 30
  periodSeconds: 10
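For context, the following sketch shows where the startup probe might sit in a container spec alongside a liveness probe. The container name, image, container port number, and liveness thresholds are hypothetical placeholders, not values taken from the CPS deployment. The failureThreshold of 36 is also an assumption: the snippet above allows 30 x 10 s = 300 s, which is exactly 5 minutes, so a slightly higher value leaves some headroom.

```yaml
# Sketch of a Deployment fragment (spec.template.spec); adjust to your environment.
containers:
  - name: cps                        # hypothetical container name
    image: example/cps:latest        # hypothetical image reference
    ports:
      - name: liveness-port          # named port referenced by both probes
        containerPort: 8080          # assumed port number
    startupProbe:
      httpGet:
        path: /healthz
        port: liveness-port
      failureThreshold: 36           # assumed: 36 x 10 s = 360 s allowed for startup
      periodSeconds: 10
    livenessProbe:                   # only starts once the startup probe has succeeded
      httpGet:
        path: /healthz
        port: liveness-port
      failureThreshold: 3            # assumed value
      periodSeconds: 10              # assumed value
```

Keeping the liveness thresholds tight while giving the startup probe a generous window means a slow Liquibase run is tolerated at startup, while a hung pod is still detected quickly afterwards.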