Overview
High CPU usage is often a symptom of other underlying issues; as such, there are a number of possible causes for it.
Causes of high CPU should be investigated and fixed, because a distressed node will, at best, slow down query response times, resulting in timeouts for clients, and, at worst, disconnect and be lost from the cluster altogether.
How to resolve it
To minimize the impact of distressed nodes on your search queries, make sure you have the following setting on your cluster (version 6.1 and above):
PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.use_adaptive_replica_selection": true
  }
}
Check JVM garbage collection
High CPU is generally a consequence of JVM garbage collection, which in turn is caused by configuration or query-related issues.
In a healthy JVM, garbage collection should ideally meet the following conditions (these can be checked with the node stats request shown after the list):
- Young GC is processed quickly (within 50 ms).
- Young GC is not executed frequently (about once every 10 seconds).
- Old GC is processed quickly (within 1 second).
- Old GC is not executed frequently (no more than about once every 10 minutes).
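As a rough check, per-node GC counts and total collection times are exposed by the node stats API; dividing collection time by collection count gives the average pause. The filter_path parameter below is optional and only trims the response:

GET _nodes/stats/jvm?filter_path=nodes.*.name,nodes.*.jvm.gc

In the response, look at jvm.gc.collectors.young and jvm.gc.collectors.old for each node, specifically collection_count and collection_time_in_millis.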
There are a variety of reasons why heap memory usage can increase:
- Oversharding
- Large aggregation sizes
- Excessive bulk index size
- Mapping issues
- Heap being set incorrectly (see the example below)
- JVM new ratio set incorrectly
To learn how to correct memory usage issues related to JVM garbage collection, see: Heap Size Usage and JVM Garbage Collection in OpenSearch – A Detailed Guide.
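As an illustration of the heap-related items above, the heap is configured in config/jvm.options on each node. The 8g value below is only a placeholder; a common guideline is to give the heap the same minimum and maximum size and no more than half of the machine's RAM:

# config/jvm.options
# Set the minimum and maximum heap to the same value
-Xms8g
-Xmx8g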
Check load on data nodes
If CPU is high only on specific data nodes, or noticeably higher on some nodes than others, then you may have a load balancing or sharding issue.
This can occasionally be caused by applications that are not load balancing correctly across the data nodes and are making all their HTTP calls to just one or a few of the nodes. This should be fixed in the application.
However, it is more frequently caused by “hot” indices being located on just a small number of nodes. A typical example would be a logging application creating daily indices with just one shard per index. In this case, although you may have many indices spread across all of the nodes, you may find that all of the indexing is being done on a single shard, on the one node that holds today’s logging index.
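To confirm whether load is skewed, you can compare per-node CPU and then check where the shards of the busiest index actually live. The index name below is a placeholder:

GET _cat/nodes?v&h=name,cpu,load_1m,heap.percent&s=cpu:desc

GET _cat/shards/logs-2024-05-01?v&h=index,shard,prirep,node,store

If today’s index has a single primary shard, the second request will show it sitting on one node, which explains why that node is doing all of the indexing work.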
Check memory swapping to disk
High CPU may also be a symptom of memory swapping to disk, if swapping has not been properly disabled on the node.
OpenSearch performance can be heavily penalized if the node is allowed to swap memory to disk. OpenSearch can be configured to automatically prevent memory swapping on its host machine by adding the bootstrap.memory_lock: true setting to opensearch.yml. If bootstrap checks are enabled, OpenSearch will not start when memory locking has been requested but cannot be enforced.
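A minimal sketch of that configuration in opensearch.yml:

bootstrap.memory_lock: true

After restarting, you can verify that the lock actually took effect on each node (mlockall should be reported as true):

GET _nodes?filter_path=**.mlockall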
Check number of shards for oversharding
If you have a large number of shards on your cluster, then you may have an issue with oversharding.
Oversharding is a condition in which you have too many shards, each of which is too small. While there is no minimum limit for an OpenSearch shard size, a larger number of shards on an OpenSearch cluster requires extra resources, since the cluster needs to maintain metadata on the state of all the shards in the cluster.
If your shards are too small, then you have three options (see the example after this list):
- Eliminate empty indices
- Delete or close indices with old or unnecessary data
- Re-index into bigger indices
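As a sketch, the first request below lists indices sorted by size so that small and empty ones are easy to spot, and _reindex can then combine several small indices into a larger one. The index names are placeholders, and the result should be verified before the originals are deleted:

GET _cat/indices?v&h=index,pri,rep,docs.count,store.size&s=store.size:asc

POST _reindex
{
  "source": { "index": ["logs-2024-05-01", "logs-2024-05-02"] },
  "dest": { "index": "logs-2024-05" }
}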
Check indexing efficiency
Inefficient indexing, for example poorly sized bulk requests or overly frequent refreshes, can also drive up CPU on the data nodes.
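One common, low-risk adjustment on write-heavy indices is to lengthen the refresh interval so that fewer segments are created; the index name and the 30s value below are only examples:

PUT /logs-2024-05-01/_settings
{
  "index": {
    "refresh_interval": "30s"
  }
}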
Optimize slow and expensive search queries
Slow and expensive search queries, such as heavy aggregations or wildcard queries, are another common source of high CPU.
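To find the queries responsible, you can enable the search slow log on the relevant indices and then review the slow log files on the data nodes. The thresholds and index name below are illustrative:

PUT /logs-2024-05-01/_settings
{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.query.info": "2s",
  "index.search.slowlog.threshold.fetch.warn": "1s"
}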