Elasticsearch Indexing Performance

By Opster Team

Updated: Aug 28, 2023

Introduction

Elasticsearch is a highly scalable and flexible system that can handle a large volume of data. However, when dealing with high-volume data ingestion, it’s crucial to optimize the indexing performance to ensure efficient and fast data processing. This article will delve into advanced strategies to enhance Elasticsearch indexing performance.

Bulk indexing

Bulk indexing is a method that allows you to index multiple documents in a single request. This approach reduces the overhead of indexing each document individually, thus improving the overall indexing speed.

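Here’s an example of how to use the bulk API. The request body is newline-delimited JSON: each action line is followed by its document source on the next line (except for `delete`, which has none), and the final line must end with a newline: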

json
POST _bulk
{ "index" : { "_index" : "test", "_id" : "1" } }
{ "field1" : "value1" }
{ "delete" : { "_index" : "test", "_id" : "2" } }
{ "create" : { "_index" : "test", "_id" : "3" } }
{ "field1" : "value3" }
{ "update" : {"_id" : "1", "_index" : "test"} }
{ "doc" : {"field2" : "value2"} }

Refresh interval adjustment

The refresh interval is how often Elasticsearch makes newly indexed documents available for search. By default, it’s set to one second. However, for high-volume data ingestion, you can increase the refresh interval, or even disable refresh during the indexing process by setting it to `-1`, to enhance performance.

Here’s how to update the refresh interval to 30 seconds instead of the default of 1 second:

json
PUT /my_index/_settings
{
  "index" : {
    "refresh_interval" : "30s"
  }
}
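
As a sketch of the option to disable refresh during the indexing process, the requests below turn periodic refresh off before a bulk load and restore the default afterwards. The index name `my_index` is carried over from the example above. Setting `refresh_interval` to `-1` disables periodic refresh, and setting it to `null` resets it to the default.

```json
# Before the bulk load: disable periodic refresh
PUT /my_index/_settings
{
  "index" : {
    "refresh_interval" : "-1"
  }
}

# After the bulk load: restore the default refresh interval
PUT /my_index/_settings
{
  "index" : {
    "refresh_interval" : null
  }
}
```

Documents indexed while refresh is disabled may not show up in searches until a refresh happens, so this is typically only suitable when the index isn’t being queried during the load.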

Indexing buffer size tuning

Elasticsearch allocates a portion of the heap to the indexing buffer, which holds newly indexed documents before they’re written to disk. The buffer is shared by all shards on a node and defaults to 10% of the heap. If you’re dealing with high-volume data ingestion, consider increasing the indexing buffer size.

The buffer size is controlled by the static, node-level setting `indices.memory.index_buffer_size`, so it can’t be changed through the index settings API. Set it in `elasticsearch.yml` on each data node and restart the node for the change to take effect:

yaml
indices.memory.index_buffer_size: 30%

Use of concurrent indexing

Elasticsearch can process multiple indexing requests concurrently, so sending bulk requests from several threads or processes usually improves throughput. However, too many concurrent requests can overwhelm the cluster and degrade performance. Increase concurrency gradually and settle on the level that suits your specific use case.
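
One way to look for that balance, offered here as a suggestion rather than a prescribed method, is to watch the `write` thread pool while you raise the number of concurrent bulk requests. The cat thread pool API reports active threads, queued requests, and rejections per node:

```json
GET /_cat/thread_pool/write?v=true&h=node_name,name,active,queue,rejected
```

A growing `rejected` count, usually accompanied by HTTP 429 responses to the client, suggests the cluster is receiving more concurrent indexing requests than it can keep up with.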

Optimizing mappings

Mapping is the process of defining how a document and its fields are stored and indexed. By optimizing your mappings, you can significantly improve indexing performance. For instance, use the `keyword` type instead of `text` for string fields that don’t require full-text search, and avoid nested types and parent-child relationships where possible.
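
As an illustration, the mapping below follows these suggestions. It assumes the index `my_index` doesn’t exist yet, and the field names (`status`, `message`, `session_id`) are hypothetical, chosen only for this sketch:

```json
PUT /my_index
{
  "mappings" : {
    "properties" : {
      "status" : { "type" : "keyword" },
      "message" : { "type" : "text" },
      "session_id" : { "type" : "keyword", "index" : false }
    }
  }
}
```

Here `status` is only ever matched exactly, so it’s mapped as `keyword`. `message` keeps the `text` type because it needs full-text search. `session_id` is assumed never to be searched, so `"index" : false` skips building an inverted index for it.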

Disk I/O optimization

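Disk I/O is often a bottleneck in high-volume data ingestion. To mitigate this, you can use SSDs, which offer faster disk I/O than traditional hard drives. Additionally, you can use a RAID 0 configuration to stripe data across multiple disks and increase I/O throughput. Keep in mind that RAID 0 provides no redundancy: if one disk fails, the data on that node is lost, so rely on replica shards for fault tolerance.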

Disabling replicas

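On initial loads, it can be useful to disable replica shards entirely by setting `index.number_of_replicas` to 0, so that documents are indexed only on the primary shards. While replicas are disabled, losing a node can mean losing data, so keep the source data available until the load is complete. When the initial load is done, you can add replicas again.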
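
A minimal sketch of that workflow, assuming an index named `my_index` as in the earlier examples and a target of one replica once loading is finished:

```json
# Before the initial load: drop replicas
PUT /my_index/_settings
{
  "index" : {
    "number_of_replicas" : 0
  }
}

# After the initial load: add replicas back
PUT /my_index/_settings
{
  "index" : {
    "number_of_replicas" : 1
  }
}
```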
