How to Ingest Data into Elasticsearch: A Comprehensive Guide

By Opster Team

Updated: Jul 23, 2023

Introduction

Ingesting data into Elasticsearch is a crucial step in setting up a powerful search and analytics engine. This article will provide a detailed guide on various methods to ingest data into Elasticsearch, including Logstash, Beats, Elasticsearch Ingest Node, and the Elasticsearch Bulk API. We will also discuss the pros and cons of each method and provide examples when relevant.

1. Logstash

Logstash is a popular open-source data processing pipeline that can ingest data from various sources, transform it, and then send it to Elasticsearch. It supports a wide range of input plugins, filters, and output plugins, making it a versatile choice for data ingestion.

Pros:

  • Supports a wide range of input sources and formats
  • Provides powerful data transformation capabilities
  • Can handle complex data pipelines

Cons:

  • Can be resource-intensive
  • Requires JVM to run

Step-by-step instructions:

a. Install Logstash on your system by following the official installation guide.

b. Create a Logstash configuration file that specifies the input source, filters, and output destination. For example:

# Read the log file from the beginning the first time it is seen
input {
  file {
    path => "/path/to/your/logfile.log"
    start_position => "beginning"
  }
}

# Parse each line as an Apache combined access log entry
filter {
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
}

# Send the parsed events to a local Elasticsearch cluster
output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "my-index"
  }
}

c. Run Logstash with the configuration file:

bin/logstash -f /path/to/your/logstash.conf
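
Before starting a long-running pipeline, you can ask Logstash to validate the configuration file and exit, which catches syntax errors early. This uses the same hypothetical path as above:

bin/logstash -f /path/to/your/logstash.conf --config.test_and_exit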

2. Beats

Beats are lightweight data shippers that can collect various types of data and send it directly to Elasticsearch or Logstash. There are several types of Beats available, such as Filebeat for log files, Metricbeat for metrics, and Packetbeat for network data.

Pros:

  • Lightweight and resource-efficient
  • Easy to set up and configure
  • Supports various data types

Cons:

  • Limited data transformation capabilities

Step-by-step instructions:

a. Install the desired Beat on your system by following the official installation guide.

b. Configure the Beat by editing its configuration file (e.g., filebeat.yml for Filebeat). Specify the input source and output destination:

filebeat.inputs:
  # Read lines from the specified log file
  - type: log
    paths:
      - /path/to/your/logfile.log

# Ship events directly to a local Elasticsearch cluster
output.elasticsearch:
  hosts: ["http://localhost:9200"]
  index: "my-index"

c. Start the Beat service:

sudo service filebeat start
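
Before relying on the service, you can verify that the configuration is valid and that Filebeat can reach Elasticsearch using its built-in test subcommands:

filebeat test config
filebeat test output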

3. Elasticsearch Ingest Node

Elasticsearch Ingest Node is a built-in feature that allows you to perform simple data transformations directly within Elasticsearch. You can define an ingest pipeline with a series of processors to modify the data before indexing it.

Pros:

  • No additional software required
  • Suitable for simple data transformations

Cons:

  • Limited data processing capabilities compared to Logstash

Step-by-step instructions:

a. Define an ingest pipeline with the desired processors:

PUT _ingest/pipeline/my_pipeline
{
  "description": "My custom pipeline",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": ["%{IP:client_ip} %{WORD:method} %{URIPATHPARAM:request}"]
      }
    },
    {
      "geoip": {
        "field": "client_ip",
        "target_field": "geo"
      }
    }
  ]
}

b. Index your data using the defined pipeline:

PUT my_index/_doc/1?pipeline=my_pipeline
{
  "message": "23.23.11.10 GET /search?q=elasticsearch"
}

After going through the ingest pipeline, the enriched document will look like this:

{
  "client_ip": "23.23.11.10",
  "message": "23.23.11.10 GET /search?q=elasticsearch",
  "method": "GET",
  "request": "/search?q=elasticsearch",
  "geo": {
    "continent_name": "North America",
    "region_iso_code": "US-VA",
    "city_name": "Ashburn",
    "country_iso_code": "US",
    "country_name": "United States",
    "region_name": "Virginia",
    "location": {
      "lon": -77.4903,
      "lat": 39.0469
    }
  }
}
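
If you want to verify a pipeline's behavior before indexing real data, the _simulate endpoint runs sample documents through the pipeline without writing anything to an index. A minimal check of my_pipeline, reusing the sample message from above, looks like this:

POST _ingest/pipeline/my_pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "message": "23.23.11.10 GET /search?q=elasticsearch"
      }
    }
  ]
}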

4. Elasticsearch Bulk API

The Elasticsearch Bulk API allows you to perform multiple index, update, or delete operations in a single request. This can significantly improve indexing performance when ingesting large amounts of data.

Pros:

  • High-performance data ingestion
  • Suitable for large-scale data indexing

Cons:

  • Requires manual data formatting

Step-by-step instructions:

a. Format your data in the bulk API's newline-delimited JSON (NDJSON) format. Each action line is immediately followed by its document source on the next line, and the body must end with a final newline:

{ "index" : { "_index" : "my-index", "_id" : "1" } }
{ "field1" : "value1", "field2" : "value2" }
{ "index" : { "_index" : "my-index", "_id" : "2" } }
{ "field1" : "value3", "field2" : "value4" }

b. Send the bulk request to Elasticsearch:

POST _bulk
{ "index" : { "_index" : "my-index", "_id" : "1" } }
{ "field1" : "value1", "field2" : "value2" }
{ "index" : { "_index" : "my-index", "_id" : "2" } }
{ "field1" : "value3", "field2" : "value4" }

When using the bulk API, consider the following best practices:

  • Keep the bulk request size reasonable, typically 5 to 15 MB.
  • Monitor the indexing performance and adjust the bulk request size accordingly.
  • Use multiple threads or processes to send bulk requests concurrently.
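
Also check each response rather than assuming success: the bulk API returns HTTP 200 even when individual operations fail. The top-level errors flag tells you whether any item failed, and each entry in items carries its own status. A truncated successful response looks roughly like this:

{
  "took": 30,
  "errors": false,
  "items": [
    { "index": { "_index": "my-index", "_id": "1", "result": "created", "status": 201 } },
    { "index": { "_index": "my-index", "_id": "2", "result": "created", "status": 201 } }
  ]
}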

Conclusion

There are several methods to ingest data into Elasticsearch, each with its own advantages and limitations. Depending on your specific use case and data processing requirements, you can choose the most suitable method to ensure efficient and reliable data ingestion.
