What it means
If the OpenSearch cluster starts to reject search requests, there could be a number of causes. Generally, it indicates that one or more nodes cannot keep up with the volume of search requests, resulting in a queue building up on that node. Once the queue exceeds the maximum size of the search queue, the node will start to reject requests.
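The maximum size of the search queue on each node is controlled by the thread_pool.search.queue_size setting (typically 1,000 entries by default). As a quick sketch, you can confirm the value configured on each node with the node info API; the filter_path parameter below is only there to trim the response:

GET /_nodes/thread_pool?filter_path=nodes.*.name,nodes.*.thread_pool.search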
How to resolve it
Check the state of the search thread pool to find out whether the search rejections are always occurring on the same node or are spread across all of the nodes.
GET /_cat/thread_pool/search
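If you want to see queue depth and rejection counters per node in a single view, the same cat API accepts explicit column headers, for example:

GET /_cat/thread_pool/search?v&h=node_name,name,active,queue,rejected,completed

A rejected count that keeps climbing on a single node suggests that node is the bottleneck, while rejections spread evenly across all nodes usually point to a cluster-wide capacity or query problem.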
- If rejections are only happening on specific data nodes, then you may have a load balancing or sharding issue (see the allocation check after this list). See Loaded Data Nodes – Important OpenSearch Guide for more information.
- If the rejections are associated with high CPU, this is generally the consequence of JVM garbage collection, which in turn is caused by configuration or query-related issues (see the node check after this list). For a discussion of JVM garbage collection, see: Heap Size Usage and JVM Garbage Collection in OS – A Detailed Guide.
- Queue rejections associated with high CPU may also be a symptom of memory swapping to disk if swapping has not been properly disabled on the node (see the memory lock check after this list). For a deeper understanding and practical steps to resolve the issue, read: The Bootstrap Memory Lock Setting is Set to False – An OpenSearch Guide.
- If you have a large number of shards on your cluster, then you may have an issue with oversharding (see the shard count check after this list). To learn more about how to fix this, read Shards Too Small (Oversharding) – A Detailed Guide. You can also read our case study about this.
- It may be useful to activate slow search query logging, as described in this guide and sketched after this list.
- For a discussion on optimizing slow or expensive search queries, please see: 10 Important Tips to Improve Search in OpenSearch.
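To check whether shards (and therefore search load) are evenly distributed across the data nodes, a quick starting point is the allocation cat API, which lists the shard count and disk usage per node:

GET /_cat/allocation?v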
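To relate the rejections to CPU and heap pressure, you can compare the nodes at a glance; the columns below are standard _cat/nodes columns:

GET /_cat/nodes?v&h=name,cpu,load_1m,heap.percent,ram.percent

Nodes that combine high CPU with persistently high heap usage are likely spending much of their time in garbage collection.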
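To verify whether the memory lock is actually in effect on every node, you can check the mlockall flag reported by the node info API:

GET /_nodes?filter_path=**.mlockall

If any node reports false, swapping may still be possible on that node; the guide linked above covers how to fix it.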
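To get a quick sense of whether the cluster is oversharded, compare the total number of shards with the number of indices and data nodes, for example:

GET /_cluster/stats?filter_path=indices.count,indices.shards.total,indices.shards.primaries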
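As a sketch of what activating slow search query logging looks like, the index-level slowlog thresholds can be set dynamically; my-index and the threshold values below are placeholders to adjust for your own indices and workload:

PUT /my-index/_settings
{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.query.info": "5s",
  "index.search.slowlog.threshold.fetch.warn": "1s"
}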
To minimize the impact of a distressed node on your search queries, make sure adaptive replica selection is enabled on your cluster with the following setting (supported in all OpenSearch versions, and in Elasticsearch 6.1 and above):
PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.use_adaptive_replica_selection": true
  }
}