Setting up a Startup Probe in Kubernetes

Issue

Liquibase runs when CPS starts up and can take a relatively long time to load the necessary changesets. Kubernetes uses liveness probes to check whether the CPS pod is still running. If the liveness probe does not succeed within the configured time, Kubernetes restarts the pod.

Liquibase holds a ChangeLogLock in the database while it is running the changesets. If the pod is restarted during that time, Liquibase does not remove the ChangeLogLock. The restarted CPS instance is then unable to acquire a ChangeLogLock because the previous one is still in place, and the following messages appear in the log:

ChangeLogLock error in CPS
2022-04-04T18:37:36.740Z|main|| com.zaxxer.hikari.HikariDataSource - HikariPool-1 - Starting...
2022-04-04 18:37:37.151 INFO [cps-application,,] 1 --- [ main] com.zaxxer.hikari.HikariDataSource : HikariPool-1 - Start completed.
2022-04-04T18:37:37.151Z|main|| com.zaxxer.hikari.HikariDataSource - HikariPool-1 - Start completed.
2022-04-04 18:37:37.857 INFO [cps-application,,] 1 --- [ main] liquibase.database : Set default schema name to public
2022-04-04T18:37:37.857Z|main|| liquibase.database - Set default schema name to public
2022-04-04 18:37:38.940 INFO [cps-application,,] 1 --- [ main] liquibase.lockservice : Waiting for changelog lock....
2022-04-04T18:37:38.940Z|main|| liquibase.lockservice - Waiting for changelog lock....
2022-04-04 18:37:48.973 INFO [cps-application,,] 1 --- [ main] liquibase.lockservice : Waiting for changelog lock....
2022-04-04T18:37:48.973Z|main|| liquibase.lockservice - Waiting for changelog lock....
2022-04-04 18:37:59.054 INFO [cps-application,,] 1 --- [ main] liquibase.lockservice : Waiting for changelog lock....

Recommended Fix

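We recommend using a startup probe, which has been a stable feature since Kubernetes 1.20. It is configured on the container in the Kubernetes deployment file. Until the startup probe succeeds, Kubernetes does not run the liveness probe, so the pod is not restarted while Liquibase is still working. The Liquibase setup time is environment dependent and, in our experience, can take up to 5 minutes. The startup window is failureThreshold multiplied by periodSeconds, and we recommend a window above this value. With failureThreshold 30 and periodSeconds 10 the window is 300 seconds, which is exactly 5 minutes, so increase failureThreshold if Liquibase needs longer in your environment.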

Startup Probe in deployment.yaml
startupProbe:
  httpGet:
    path: /healthz
    port: liveness-port
  failureThreshold: 30
  periodSeconds: 10
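
To show where the probe sits relative to the liveness probe, here is a minimal Deployment sketch. It is an illustration under stated assumptions, not the actual CPS deployment file: the Deployment name, labels, image, container port number, and liveness probe thresholds are all hypothetical. Only the /healthz path, the liveness-port name, and periodSeconds: 10 are taken from the example above. failureThreshold is raised to 36 here purely as one way to leave headroom over 5 minutes.

```yaml
# Illustrative sketch only. Names, image, port number and liveness
# thresholds are assumptions, not values from the CPS deployment.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cps                        # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cps                     # hypothetical label
  template:
    metadata:
      labels:
        app: cps
    spec:
      containers:
        - name: cps
          image: example.org/cps:latest    # placeholder image
          ports:
            - name: liveness-port          # port name referenced by both probes
              containerPort: 8080          # assumed port number
          # Liveness checks are held back until this probe succeeds once.
          startupProbe:
            httpGet:
              path: /healthz
              port: liveness-port
            failureThreshold: 36           # 36 x 10 s = 360 s startup window
            periodSeconds: 10
          # Takes over after startup; a short window keeps failure detection fast.
          livenessProbe:
            httpGet:
              path: /healthz
              port: liveness-port
            failureThreshold: 3            # assumed value
            periodSeconds: 10
```

Keeping the long window on the startup probe means the liveness probe can stay strict, so a pod that hangs after startup is still restarted quickly.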