Briefly, this error occurs when Elasticsearch is unable to allocate shards to nodes, possibly due to insufficient resources, network issues, or configuration errors. To resolve this, you can increase the system resources, check the network connectivity between nodes, or review the shard allocation settings. Also, ensure that the Elasticsearch cluster is properly configured and that there are no issues with the underlying hardware. Lastly, check the Elasticsearch logs for more specific error messages that can help identify the root cause.
Before you dig into reading this guide, have you tried asking OpsGPT what this log means? You’ll receive a customized analysis of your log.
Try OpsGPT now for step-by-step guidance and tailored insights into your Elasticsearch/OpenSearch operation.
For a complete solution to your search operation and to understand why all shards failed, try AutoOps for Elasticsearch & OpenSearch for free. With AutoOps and Opster’s proactive support, you don’t have to worry about your search operation – we take charge of it.
What this error means
The exception “all shards failed” arises when at least one shard fails. This can occur for various reasons: text fields are being used for document aggregations or metric aggregations; a search failed on a shard in an unrecoverable way, so no response could be returned for that shard (even though the shard itself is fine); or special aggregations (such as global and reverse nested aggregations) are not used in the proper order.
Possible causes
Below are 5 reasons why this error may appear, including troubleshooting steps for each.
1. Text fields are not optimized for operations
This error sometimes occurs because text fields are not optimized for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default.
Quick troubleshooting steps
To overcome this error, you can enable fielddata on the text field, but beware – it can cause performance issues.
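As a minimal sketch (the index name my_index and the field name are assumptions for illustration, not part of the original guide), fielddata can be enabled on an existing text field with a mapping update:

PUT my_index/_mapping
{
  "properties": {
    "name": {
      "type": "text",
      "fielddata": true // allows aggregations and sorting on this text field
    }
  }
}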
If you are not using any explicit index mapping definition, then you can simply use the .keyword sub-field in order to run aggregation on it.
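For example (a sketch only – the index name my_index and the field name are illustrative), a terms aggregation on the dynamically created keyword sub-field would look like this:

GET my_index/_search
{
  "size": 0,
  "aggs": {
    "names": {
      "terms": {
        "field": "name.keyword"
      }
    }
  }
}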
However, if you have defined an explicit index mapping and it does not include a keyword sub-field, you can use a multi-field, which lets you index the same field in different ways for different purposes. You can also change the data type of the name field from text to keyword in the index mapping definition (to enable aggregation on it), as shown below:
{ "mappings": { "properties": { "name":{ "type":"keyword" // note this } } } }
2. Metric aggregations can’t be performed on text fields
Metric aggregations perform mathematical calculations (such as sum, min and max) on the documents present in the parent bucket. You cannot perform metric aggregations on text fields; if you try, you will get the “all shards failed” exception.
Quick troubleshooting steps
- Metric aggregations such as sum/max/min can work on a script instead of a field. The script would transform the text into a numeric value (e.g. Integer.parseInt(doc.cost.value)), or, starting with ES 7.11, you can use a runtime field, which can be used in both queries and aggregations (see the sketch after the mapping example below).
- If you want to avoid scripts in the search query, you can change the data type of the cost field to a numeric type. The index mapping definition would look like this:
{ "mappings": { "properties": { "cost":{ "type":"integer" // note this } } } }
3. At least one shard has failed
The aforementioned exception may arise when at least one shard has failed. Upon restarting the remote server, some shards may not recover, causing the cluster to stay red. You can check the health status of the cluster by using the Elasticsearch Check-Up or the cluster health API:
GET _cluster/health
One way to resolve the error is to delete the index completely (but it’s not an ideal solution).
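Before resorting to that, it can help to ask the cluster why a shard is unassigned. The allocation explain API (a standard Elasticsearch API, shown here as an optional extra diagnostic step rather than part of the original guide) reports the reason for an unassigned shard; called without a body, it picks an arbitrary unassigned shard to explain:

GET _cluster/allocation/explain

You can also pass a body with "index", "shard" and "primary" to target a specific shard.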
4. Global aggregations are not defined as top-level
Global aggregation is a special kind of aggregation that is executed globally on all the documents without being influenced by the query. If global aggregations are not defined as top-level aggregations, then you’ll get the “all shards failed” exception.
Quick troubleshooting steps
To avoid this error, you should ensure that global aggregations are defined only as top-level aggregations and not as sub-level aggregation.
For example, you should structure the search query as follows (note that the global aggregation is defined as a top-level aggregation):
{ "size": 0, "aggs": { "all_products": { "global": {}, "aggs": { "genres": { "terms": { "field": "cost" } } } } } }
5. Reverse nested aggregation is not used inside a nested aggregation
Reverse nested aggregation is a single-bucket aggregation that enables aggregating on parent docs from nested documents.
The reverse_nested aggregation must be defined inside a nested aggregation; if it is not, you’ll see this exception.
Quick troubleshooting steps
To avoid this error, you should ensure that the reverse_nested aggregation is defined inside a nested aggregation.
The modified search query will look like this:
{ "aggs": { "comments": { "nested": { "path": "comments" }, "aggs": { "top_usernames": { "terms": { "field": "comments.username" }, "aggs": { "comment_issue": { "reverse_nested": {}, "aggs": { "top_tags": { "terms": { "field": "tags" } } } } } } } } } }
Overview
Data in an Elasticsearch index can grow to massive proportions. In order to keep it manageable, it is split into a number of shards. Each Elasticsearch shard is an Apache Lucene index, with each individual Lucene index containing a subset of the documents in the Elasticsearch index. Splitting indices in this way keeps resource usage under control. An Apache Lucene index has a limit of 2,147,483,519 documents.
Examples
The number of shards is set when an index is created, and this number cannot be changed later without reindexing the data. When creating an index, you can set the number of shards and replicas as properties of the index using:
PUT /sensor { "settings" : { "index" : { "number_of_shards" : 6, "number_of_replicas" : 2 } } }
The ideal number of shards should be determined based on the amount of data in an index. Generally, an optimal shard should hold 30-50GB of data. For example, if you expect to accumulate around 300GB of application logs in a day, having around 10 shards in that index would be reasonable.
During their lifetime, shards can go through a number of states, including:
- Initializing: An initial state before the shard can be used.
- Started: A state in which the shard is active and can receive requests.
- Relocating: A state that occurs when shards are in the process of being moved to a different node. This may be necessary under certain conditions, such as when the node they are on is running out of disk space.
- Unassigned: The state of a shard that has failed to be assigned. A reason is provided when this happens. For example, if the node hosting the shard is no longer in the cluster (NODE_LEFT) or due to restoring into a closed index (EXISTING_INDEX_RESTORED).
In order to view all shards, their states, and other metadata, use the following request:
GET _cat/shards
To view shards for a specific index, append the name of the index to the URL, for example for the sensor index:
GET _cat/shards/sensor
This command produces output, such as in the following example. By default, the columns shown include the name of the index, the name (i.e. number) of the shard, whether it is a primary shard or a replica, its state, the number of documents, the size on disk, the IP address, and the node ID.
sensor 5 p STARTED 0 283b 127.0.0.1 ziap
sensor 5 r UNASSIGNED
sensor 2 p STARTED 1 3.7kb 127.0.0.1 ziap
sensor 2 r UNASSIGNED
sensor 3 p STARTED 3 7.2kb 127.0.0.1 ziap
sensor 3 r UNASSIGNED
sensor 1 p STARTED 1 3.7kb 127.0.0.1 ziap
sensor 1 r UNASSIGNED
sensor 4 p STARTED 2 3.8kb 127.0.0.1 ziap
sensor 4 r UNASSIGNED
sensor 0 p STARTED 0 283b 127.0.0.1 ziap
sensor 0 r UNASSIGNED
Notes and good things to know
- Having shards that are too large is simply inefficient. Moving huge indices across machines is both time- and labor-intensive: Lucene merges take longer to complete and require greater resources, moving shards across nodes for rebalancing takes longer, and recovery time is extended. By splitting the data and spreading it across a number of machines, it can be kept in manageable chunks and the risks are minimized.
- Having the right number of shards is important for performance. It is thus wise to plan in advance. When queries are run across different shards in parallel, they execute faster than an index composed of a single shard, but only if each shard is located on a different node and there are sufficient nodes in the cluster. At the same time, however, shards consume memory and disk space, both in terms of indexed data and cluster metadata. Having too many shards can slow down queries, indexing requests, and management operations, and so maintaining the right balance is critical.
How to reduce your Elasticsearch costs by optimizing your shards
Watch the video below to learn how to save money on your deployment by optimizing your shards.
Overview
Search refers to the searching of documents in an index or multiple indices. A simple search is just a GET API request to the _search endpoint. The search query can be provided either as a query string or through a request body.
Examples
When searching an index without providing any search parameters, every document is a hit, and by default 10 hits are returned.
GET my_documents/_search
A JSON object is returned in response to a search query. A 200 response code means the request was completed successfully.
{ "took" : 1, "timed_out" : false, "_shards" : { "total" : 2, "successful" : 2, "failed" : 0 }, "hits" : { "total" : 2, "max_score" : 1.0, "hits" : [ ... ] } }
Notes and good things to know
- Distributed search is challenging: every shard of the index needs to be searched for hits, and those hits are then combined into a single sorted list as the final result.
- There are two phases of search: the query phase and the fetch phase.
- In the query phase, the query is executed on each shard locally and top hits are returned to the coordinating node. The coordinating node merges the results and creates a global sorted list.
- In the fetch phase, the coordinating node brings the actual documents for those hit IDs and returns them to the requesting client.
- A coordinating node needs enough memory and CPU in order to handle the fetch phase.
Log Context
The class name associated with the log “all shards failed” is SearchScrollAsyncAction.java. We extracted the following from the Elasticsearch source code for those seeking in-depth context:
addShardFailure(new ShardSearchFailure(failure, searchShardTarget));
int successfulOperations = successfulOps.decrementAndGet();
assert successfulOperations >= 0 : "successfulOperations must be >= 0 but was: " + successfulOperations;
if (counter.countDown()) {
    if (successfulOps.get() == 0) {
        listener.onFailure(new SearchPhaseExecutionException(phaseName, "all shards failed", failure, buildShardFailures()));
    } else {
        SearchPhase phase = nextPhaseSupplier.get();
        try {
            phase.run();
        } catch (Exception e) {