GetJobId failed to cleanup old checkpoints retrying after next checkpoint – How to solve this Elasticsearch error

Opster Team

Aug-23, Version: 8.3-8.9

Briefly, this error occurs when an Elasticsearch transform tries to clean up its old checkpoints and the delete request fails. The cause is often temporary, such as a network glitch, an unhealthy cluster, or insufficient permissions. As the message itself states, the cleanup is retried after the next checkpoint, so an isolated occurrence usually needs no action. If the warning keeps recurring, check the cluster’s health, ensure there are no network issues, and verify that the transform has the permissions it needs to delete these checkpoints.
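The cluster-health check mentioned above can be scripted. Because the log comes from TransformIndexer, looking at the statistics of the affected transform is a reasonable second step. The following is a minimal sketch that only builds the two requests, using Elasticsearch's cluster health endpoint (`GET /_cluster/health`) and get transform statistics endpoint (`GET /_transform/<transform_id>/_stats`). The class name, the base URL and the transform ID `my-transform` are assumptions made for illustration, and authentication is left out.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper, not part of Elasticsearch: builds diagnostic requests.
final class TransformDiagnostics {

    // GET /_cluster/health reports the overall cluster status (green, yellow, red).
    static HttpRequest clusterHealth(String baseUrl) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/_cluster/health")).GET().build();
    }

    // GET /_transform/<id>/_stats reports state and checkpointing details for one transform.
    static HttpRequest transformStats(String baseUrl, String transformId) {
        URI uri = URI.create(baseUrl + "/_transform/" + transformId + "/_stats");
        return HttpRequest.newBuilder(uri).GET().build();
    }

    public static void main(String[] args) {
        String baseUrl = "http://localhost:9200"; // assumed address of a local node
        HttpRequest health = clusterHealth(baseUrl);
        HttpRequest stats = transformStats(baseUrl, "my-transform"); // assumed transform ID

        // Only print the requests here; sending them needs a reachable cluster, e.g.
        // HttpClient.newHttpClient().send(health, HttpResponse.BodyHandlers.ofString()).
        System.out.println(health.method() + " " + health.uri());
        System.out.println(stats.method() + " " + stats.uri());
    }
}
```

On a secured cluster, each request would also need an authorization header for a user that is allowed to call these endpoints.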

This guide will help you check for common problems that cause the log “[” + getJobId() + “] failed to cleanup old checkpoints; retrying after next checkpoint” to appear. To understand the issues related to this log, read the explanation below about the following Elasticsearch concept: plugin.

Log Context

The log “[” + getJobId() + “] failed to cleanup old checkpoints; retrying after next checkpoint” is emitted from TransformIndexer.java.
We extracted the following from the Elasticsearch source code for those seeking in-depth context:

                ActionListener.wrap(deletes -> {
                    logger.debug("[{}] deleted [{}] outdated checkpoints", getJobId(), deletes);
                    listener.onResponse(null);
                    lastCheckpointCleanup = context.getCheckpoint();
                }, e -> {
                    logger.warn(() -> "[" + getJobId() + "] failed to cleanup old checkpoints; retrying after next checkpoint", e);
                    auditor.warning(
                        getJobId(),
                        "Failed to cleanup old checkpoints; retrying after next checkpoint. Exception: " + e.getMessage()
                    );
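The excerpt above is cut off, but it shows the core idea: `lastCheckpointCleanup` is only advanced in the success callback, so a failed cleanup leaves it unchanged and a later checkpoint can try again. The following is a simplified, self-contained sketch of that pattern, not the Elasticsearch implementation. The `CheckpointStore` interface, the class name and the method names are hypothetical, and the real code is asynchronous while this sketch is synchronous.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of "retry after next checkpoint"; not Elasticsearch code.
final class CheckpointCleanupSketch {

    // Hypothetical stand-in for the delete call; returns the number of deleted checkpoints.
    interface CheckpointStore {
        long deleteOlderThan(long checkpoint) throws Exception;
    }

    private final CheckpointStore store;
    private final List<String> warnings = new ArrayList<>();
    private long lastCheckpointCleanup = 0; // last checkpoint whose cleanup succeeded

    CheckpointCleanupSketch(CheckpointStore store) {
        this.store = store;
    }

    // Called once per completed checkpoint; returns true if the cleanup succeeded.
    boolean onCheckpoint(long checkpoint) {
        try {
            store.deleteOlderThan(checkpoint);
            lastCheckpointCleanup = checkpoint; // advanced only on success
            return true;
        } catch (Exception e) {
            // The marker is left untouched, so the next checkpoint attempts the cleanup again.
            warnings.add("failed to cleanup old checkpoints; retrying after next checkpoint: " + e.getMessage());
            return false;
        }
    }

    long lastCheckpointCleanup() {
        return lastCheckpointCleanup;
    }

    List<String> warnings() {
        return warnings;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated store: the first delete fails, later ones succeed.
        CheckpointCleanupSketch sketch = new CheckpointCleanupSketch(cp -> {
            if (calls[0]++ == 0) {
                throw new Exception("simulated network glitch");
            }
            return 3;
        });
        System.out.println("checkpoint 1 cleaned: " + sketch.onCheckpoint(1));
        System.out.println("checkpoint 2 cleaned: " + sketch.onCheckpoint(2));
        System.out.println("warnings logged: " + sketch.warnings().size());
    }
}
```

In this model a single failure produces one warning and no lasting effect, which matches why an isolated occurrence of the log is usually harmless while a warning repeated at every checkpoint points to a persistent problem.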