What is Elasticsearch?
Elasticsearch is a NoSQL database and analytics engine, which can process any type of data, structured or unstructured, textual or numerical. Developed by Elasticsearch N.V. (now Elastic) and based on Apache Lucene, it is free, open-source, and distributed in nature.
Elasticsearch is the main component of ELK Stack (also known as the Elastic Stack), which stands for Elasticsearch, Logstash, and Kibana. These free tools combine to offer scalability, speed and simple REST APIs for data analysis, visualization and storage. ELK Stack also offers a range of Beats (lightweight agents) to send data from various IT systems to Elasticsearch.
Elasticsearch is most commonly used for:
- Enterprise search—enabling fast text search of large, constantly updated data sets
- Observability—performing log analysis and assisting with monitoring of IT systems, identifying trends and problems
- Security—analyzing security events from multiple IT systems and security tools to identify threats and perform forensic analysis
In this article, you will learn:
- Elasticsearch Concepts
- Elasticsearch Deployment Options
- Elasticsearch Best Practices
- Elasticsearch Storage with Cloud Volumes ONTAP
Elasticsearch Concepts
Fields are the basic unit of data in Elasticsearch. You can customize fields, and use them to include information like raw text, document title, date, author, project, etc. Fields can have several categories of data types, including:
- Common types—numbers, strings, dates, Booleans
- Object types—JSON objects, nested objects, flattened objects and join relationships
- Text search types—plain text, auto completions, token count, etc.
- Geospatial types—points and shapes
- Document ranking types—dense vector, sparse vector, feature ranks, etc.
Documents are data structures serialized as JSON objects, which are used to store data. When documents are stored in Elasticsearch, they are added to the index and become searchable in near real time. Documents store data as keys and values, where the key is the name of the field, and the value is the data itself, which can be a string, a number, another object, or an array of values.
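As a rough illustration of the key/value structure described above, the sketch below builds a document in Python and serializes it to JSON. The field names and values are hypothetical and are not taken from any real index.

```python
import json

# A hypothetical document: each key is a field name, each value is the field's data.
document = {
    "title": "Quarterly report",                    # text field
    "author": "J. Smith",                           # string field
    "published": "2021-03-15",                      # date, written as an ISO 8601 string
    "page_count": 42,                               # integer field
    "tags": ["finance", "internal"],                # array of values
    "project": {"name": "Atlas", "active": True},   # nested object
}


def to_json(doc: dict) -> str:
    """Serialize a document to the JSON text that would be sent for indexing."""
    return json.dumps(doc, sort_keys=True)


if __name__ == "__main__":
    print(to_json(document))
```

The serialized string is what a client would send as the request body when indexing the document.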
Indices represent groupings of documents, like individual databases in a relational database system. You can define as many indices as you need, and each index holds a unique set of documents. The index name identifies the index when you perform operations on its documents, such as search, update, or delete.
Elasticsearch is based on Lucene, the open-source search engine, and each shard is a Lucene index. Shards split an index horizontally so that its data can be spread across nodes, which helps prevent performance issues and crashes when a single index grows too large. Plan for this before an index approaches the capacity of a single node, because the number of primary shards is set when the index is created.
Replicas are copies of index shards. They are useful to back up index data, so you don’t lose it if a node crashes. Replicas are not just backups: they can also serve read requests in parallel, so they improve performance. You can change the number of replicas at any time, adding more to boost parallel read performance and resilience.
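The relationship between shards and replicas can be sketched as request bodies for the index settings API. This is a minimal sketch: the helper names are hypothetical, and it assumes the standard `number_of_shards` and `number_of_replicas` index settings, the first fixed at index creation and the second adjustable on a live index.

```python
def index_settings(primary_shards: int, replicas: int) -> dict:
    """Body for creating an index (PUT /<index>) with explicit shard counts."""
    if primary_shards < 1 or replicas < 0:
        raise ValueError("need at least 1 primary shard and 0 or more replicas")
    return {
        "settings": {
            "number_of_shards": primary_shards,
            "number_of_replicas": replicas,
        }
    }


def replica_update(replicas: int) -> dict:
    """Body for changing the replica count later (PUT /<index>/_settings)."""
    if replicas < 0:
        raise ValueError("replicas cannot be negative")
    return {"index": {"number_of_replicas": replicas}}


def total_shard_copies(primary_shards: int, replicas: int) -> int:
    """Every primary shard has `replicas` copies, so total = primaries * (1 + replicas)."""
    return primary_shards * (1 + replicas)


if __name__ == "__main__":
    print(index_settings(3, 1))
    print(total_shard_copies(3, 1))
```

With 3 primary shards and 1 replica, the cluster holds 6 shard copies in total, which is the number to keep in mind when spreading data across nodes.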
A node is a machine (physical or virtual) that holds some or all of the data in Elasticsearch. Elasticsearch uses the computing resources on nodes to index and search data.
Within an Elasticsearch cluster, a node can serve several roles:
- Data nodes store data and execute operations like aggregation and search
- Master nodes manage the cluster
- Client nodes forward cluster-level requests to the master node and data requests to data nodes
- Tribe nodes act as client nodes that can perform read/write operations across multiple clusters
- Ingest nodes pre-process documents before they are indexed
A cluster is a group of Elasticsearch nodes, which provide indexing and search for one or more datasets. A cluster has a unique identifier, and any node must specify this identifier to join the cluster.
The master node manages the cluster, performing tasks like adding and configuring nodes, and removing nodes. If the master node fails, another master can be elected by the remaining nodes. Elasticsearch clusters are highly scalable and can support up to thousands of nodes.
Related content: read our guide to the Elasticsearch architecture
Elasticsearch Deployment Options
While many companies deploy Elasticsearch directly on machines in their local data center, it is increasingly common to deploy Elasticsearch in the public cloud or using container orchestrators. We’ll see how you can deploy Elasticsearch on the Amazon, Azure, and Google public clouds, and via Kubernetes.
Elasticsearch on AWS
Amazon Elasticsearch Service (Amazon ES) manages Elasticsearch clusters in the AWS Cloud, making them easier to deploy and operate. Elasticsearch is popular for use cases such as log analytics, clickstream analysis and real-time application monitoring. Amazon ES provides direct access to Elasticsearch APIs, enabling existing applications and code to work seamlessly with Elasticsearch. It also reduces the overhead associated with a self-managed infrastructure. The service:
- Provisions your cluster resources for you
- Launches and manages your Elasticsearch cluster
- Automatically detects failed Elasticsearch nodes and replaces them
To use Amazon ES:
- Create an Amazon ES domain, which is equivalent to an Elasticsearch cluster
- Specify settings, including storage resources, instance types and number of instances (equivalent to Elasticsearch nodes). You can use either the AWS console, AWS Command Line Interface (CLI), or the SDK.
- Scale your cluster programmatically using API calls, or via the AWS console
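The steps above can be sketched with the AWS SDK for Python. This is a minimal sketch under stated assumptions: the domain name, instance type, volume size and version are placeholder values, the helper functions are hypothetical, and the final call requires the third-party `boto3` package plus valid AWS credentials.

```python
def domain_request(name: str, instance_type: str, instance_count: int,
                   volume_gib: int, version: str = "7.10") -> dict:
    """Keyword arguments for boto3's `es` client call create_elasticsearch_domain."""
    return {
        "DomainName": name,
        "ElasticsearchVersion": version,
        "ElasticsearchClusterConfig": {
            "InstanceType": instance_type,     # instance type of each node
            "InstanceCount": instance_count,   # number of nodes in the domain
        },
        "EBSOptions": {
            "EBSEnabled": True,
            "VolumeType": "gp2",
            "VolumeSize": volume_gib,          # storage per node, in GiB
        },
    }


def scale_request(name: str, instance_count: int) -> dict:
    """Keyword arguments for update_elasticsearch_domain_config (scaling the domain)."""
    return {
        "DomainName": name,
        "ElasticsearchClusterConfig": {"InstanceCount": instance_count},
    }


def create_domain(params: dict) -> dict:
    """Send the request to AWS. Not executed on import; needs boto3 and credentials."""
    import boto3  # third-party dependency, imported lazily
    return boto3.client("es").create_elasticsearch_domain(**params)


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(domain_request("logs-demo", "m5.large.elasticsearch", 3, 20))
    print(scale_request("logs-demo", 5))
```

Keeping the request construction separate from the network call makes it possible to review or test the settings before anything is provisioned.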
Read more in our detailed guide to Elasticsearch on AWS
Elasticsearch on Azure
You can run Elasticsearch on Azure using an Azure Marketplace solution, which provides a preconfigured template for deploying Elasticsearch clusters, with the full ELK stack.
This solution lets you provide several parameters via a web UI, Azure CLI or Azure PowerShell commands, and deploys an Elasticsearch cluster based on your inputs. Behind the scenes, the solution uses an Azure Resource Manager (ARM) template, which deploys the Elasticsearch cluster into an Azure resource group.
Read more in our detailed guide to Elasticsearch on Azure
Elasticsearch on Google Cloud
Elastic on Google Cloud is a hosted solution that provides four deployment options. Three of them, Elastic Enterprise Search, Elastic Observability, and Elastic Security, are optimized for specific use cases. The fourth, Elastic Stack, is a plain Elasticsearch deployment you can use for any purpose.
Elastic Cloud lets you deploy Elasticsearch as a hosted service or manage it with orchestration tools hosted in the Google Cloud environment.
The Elastic Cloud solution provides:
- Easy creation of dashboards and visualizations to analyze and act on data trends. The Elastic Observability deployment option lets you easily integrate logs, metrics, and APM tracking at scale to gain unified visibility of technology stacks.
- Enterprise-scale search that lets you capture and store data in any format from any source and enable indexing and real-time search.
- Special support for security data with SIEM integration and a customized machine-learning detection engine. Elasticsearch can automatically detect attacks and configuration errors in log data using advanced correlations and preset rules.
Read more in our detailed guide to Elasticsearch on Google Cloud
Elasticsearch on Kubernetes
Elastic Cloud on Kubernetes (ECK) supports the deployment of the ELK stack on Kubernetes (including Elasticsearch, Logstash, Kibana and Beats). ECK takes advantage of Kubernetes orchestration capabilities.
ECK allows you to streamline critical operations, including managing and scaling clusters and storage, monitoring multiple clusters, securing clusters, and applying configuration changes safely through rolling upgrades. To distribute Elasticsearch resources across availability zones in the cloud, you can enable zone awareness.
You can also set up hot-warm-cold architectures for data storage. ECK lets you tier your data to meet different needs and conserve costs. Hot data is frequently accessed, warm data is infrequently accessed, and cold data is archival or backup storage—you can use lower-cost archive cloud storage tiers for warm and cold data.
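A tiered ECK deployment can be sketched by generating the Elasticsearch custom resource programmatically. This is an illustrative sketch, not a production manifest: the cluster name, version and node counts are placeholders, and the `node.attr.data` attribute used to label tiers is one possible convention that is assumed here. The resulting JSON can be passed to `kubectl apply -f -`, since kubectl accepts JSON as well as YAML.

```python
import json


def eck_manifest(name: str, version: str, node_sets: dict) -> dict:
    """Build an ECK Elasticsearch resource; node_sets maps tier name to node count."""
    return {
        "apiVersion": "elasticsearch.k8s.elastic.co/v1",
        "kind": "Elasticsearch",
        "metadata": {"name": name},
        "spec": {
            "version": version,
            "nodeSets": [
                {
                    "name": tier,
                    "count": count,
                    # Custom node attribute that marks the tier (an assumed convention).
                    "config": {"node.attr.data": tier},
                }
                for tier, count in node_sets.items()
            ],
        },
    }


if __name__ == "__main__":
    manifest = eck_manifest("demo", "7.10.0", {"hot": 3, "warm": 2})
    print(json.dumps(manifest, indent=2))
```

Each node set would normally also declare its own storage class and volume size, which is where lower-cost storage for the warm and cold tiers would be configured.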
Read more in our detailed guide to Elasticsearch on Kubernetes
Elasticsearch Best Practices
The following best practices can help you operate and maintain Elasticsearch more effectively.
It's important to prepare for sizing by determining the amount of data you need to store in Elasticsearch and the speed and volume of new data entering the system. You should also determine the amount of RAM required for each data node and master node in the cluster.
There are no specific guidelines for the required capacity. The best way to estimate it is to perform a simulation: create an Elasticsearch cluster and feed it data at roughly the same rate you expect in a production environment. Start big, provisioning more resources than you actually need, and scale down until the resources closely match your data rate.
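Before running a simulation, a back-of-the-envelope storage estimate can give a starting point for how much to provision. The formula below is a rough rule of thumb, not an official guideline: it multiplies the daily data volume by the retention period and the number of copies (primary plus replicas), then adds an overhead factor. The 25% default overhead is an assumption and should be replaced with figures measured in your own tests.

```python
def estimated_storage_gb(daily_gb: float, retention_days: int,
                         replicas: int = 1, overhead: float = 0.25) -> float:
    """Estimate disk needs: raw data * retention * copies * (1 + overhead)."""
    if daily_gb < 0 or retention_days < 0 or replicas < 0 or overhead < 0:
        raise ValueError("all inputs must be non-negative")
    copies = 1 + replicas  # one primary copy plus its replicas
    return daily_gb * retention_days * copies * (1 + overhead)


if __name__ == "__main__":
    # 10 GB/day kept for 30 days with 1 replica and 25% overhead:
    # 10 * 30 * 2 * 1.25 = 750 GB
    print(estimated_storage_gb(10, 30))
```

The estimate only covers disk space. RAM and CPU requirements depend on query patterns and are best determined by the simulation itself.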
Organizing Data in Elasticsearch Indices
The structure of data entered into your Elasticsearch indexes is crucial—it will affect the accuracy and flexibility of search queries and will affect the way the data is analyzed and visualized.
A particular challenge is data with a similar structure, coming from different sources and indexed to different fields. Elasticsearch can find it difficult to search across these different data fields.
A good way to improve usability of indices is to create index mappings. Elasticsearch can infer data types based on the input data it receives, but this is based on small samples of data sets and may not be accurate. By explicitly creating the mapping, you can help Elasticsearch avoid data type conflicts in the index.
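An explicit mapping can be sketched as the request body sent when creating an index. The field names below are hypothetical, the helper function is illustrative, and the body assumes the typeless mapping format used by Elasticsearch 7.x and later.

```python
import json


def log_index_mapping() -> dict:
    """Body for PUT /<index> that declares field types instead of relying on inference."""
    return {
        "mappings": {
            "properties": {
                "@timestamp": {"type": "date"},       # event time
                "message": {"type": "text"},          # analyzed for full-text search
                "host": {"type": "keyword"},          # exact-match value, good for filters
                "status_code": {"type": "integer"},   # numeric, supports range queries
                "client_ip": {"type": "ip"},          # IPv4 or IPv6 address
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(log_index_mapping(), indent=2))
```

Declaring `status_code` as an integer up front avoids the kind of conflict that arises when one source sends `200` and another sends `"200"` to the same field.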
The Elastic Common Schema, available in Elasticsearch version 7.x and later, defines a standard set of field names and data types, making it easier to find and visualize similar data from different data sources. This gives users a single unified view across heterogeneous systems.
Elasticsearch data is stored in one or more indices. At a larger scale, each index is split into shards, making storage easier to manage.
For very old indices that are rarely accessed, it is advisable to completely free the memory they use. Elasticsearch versions 6.6 and later provide a Freeze API for this. When an index is “frozen”, it becomes read-only, and its in-memory resources are released.
The downside is that frozen indices provide slower search performance, because memory resources must be allocated on demand and then released again. For this reason, searches skip frozen indices by default. To include them, tell Elasticsearch explicitly by adding the query parameter ignore_throttled=false to the search request.
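The two requests involved can be sketched as follows. The helper functions and the index name are hypothetical. The sketch only builds the HTTP method and path for the freeze endpoint and for a search that opts in to frozen indices, and it sends nothing over the network.

```python
from urllib.parse import urlencode


def freeze_request(index: str) -> tuple:
    """Method and path for freezing an index."""
    return ("POST", f"/{index}/_freeze")


def search_request(index: str, include_frozen: bool = False) -> tuple:
    """Method and path for a search, optionally opting in to frozen indices."""
    path = f"/{index}/_search"
    if include_frozen:
        # ignore_throttled=false asks the search to include frozen indices.
        path += "?" + urlencode({"ignore_throttled": "false"})
    return ("GET", path)


if __name__ == "__main__":
    print(freeze_request("logs-2019"))
    print(search_request("logs-2019", include_frozen=True))
```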
Backing up Elasticsearch Data
Scheduling regular backups of your Elasticsearch data should be part of your disaster recovery strategy. Ensure that the backup reflects the latest state of the cluster and is not corrupted—otherwise, the backup is useless.
Elasticsearch includes a Snapshot and Restore module that allows you to create and restore snapshots of your data for specific indexes and data streams, and save them to local or remote storage. The module supports storage systems including Amazon S3, HDFS, Microsoft Azure, and Google Cloud Storage.
Elasticsearch snapshots provide the following capabilities:
- Application consistency—snapshots reflect the latest state of the database, by writing all pending in-memory operations and data to disk.
- Resource efficiency—each new snapshot is created incrementally, only storing unsaved data from the previous snapshot. This also means snapshots can be created quickly with minimal overhead.
- Lifecycle management—allows you to automatically capture and manage snapshots.
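The snapshot workflow can be sketched as three REST requests: register a repository, take a snapshot of specific indices, and restore it. The helper functions, repository name, snapshot name and filesystem location are all hypothetical. A shared filesystem (`fs`) repository is assumed here, which also requires the location to be allowed in each node's `path.repo` setting.

```python
def register_repository(repo: str, location: str) -> tuple:
    """Register a shared-filesystem repository: PUT /_snapshot/<repo>."""
    body = {"type": "fs", "settings": {"location": location}}
    return ("PUT", f"/_snapshot/{repo}", body)


def create_snapshot(repo: str, snapshot: str, indices: list) -> tuple:
    """Snapshot specific indices: PUT /_snapshot/<repo>/<snapshot>."""
    body = {
        "indices": ",".join(indices),     # comma-separated index names
        "include_global_state": False,    # skip cluster-wide state in this sketch
    }
    path = f"/_snapshot/{repo}/{snapshot}?wait_for_completion=true"
    return ("PUT", path, body)


def restore_snapshot(repo: str, snapshot: str) -> tuple:
    """Restore a snapshot: POST /_snapshot/<repo>/<snapshot>/_restore."""
    return ("POST", f"/_snapshot/{repo}/{snapshot}/_restore", {})


if __name__ == "__main__":
    print(register_repository("my_backup", "/mnt/backups/es"))
    print(create_snapshot("my_backup", "snapshot_1", ["logs-2021.03", "metrics"]))
    print(restore_snapshot("my_backup", "snapshot_1"))
```

Because snapshots are incremental, running `create_snapshot` on a schedule with a new snapshot name each time stores only the data added since the previous run.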
Optimize Thread Pools
Elasticsearch nodes use multiple thread pools to process data within the node. However, be aware that the amount of data each thread can process is limited. When there are more incoming requests to a node than it can handle using available thread pools, those requests enter a queue.
You can limit request queues using the property threadpool.bulk.queue_size (named thread_pool.write.queue_size in recent versions). This tells Elasticsearch how many shard requests can be queued on the node when no threads are available to process them. When the number of queued tasks exceeds this value, the request is rejected and a RemoteTransportException is thrown. Make sure to handle this exception in your code.
The higher the value you set, the more JVM heap space the node needs to hold the queued requests.
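One common way to handle rejected requests on the client side is to retry with exponential backoff, giving the queue time to drain. The sketch below is generic: `RejectedError` is a hypothetical stand-in for whatever error your client library raises when a request is rejected (over HTTP this typically surfaces as status 429), and `send` is any callable that performs the request.

```python
import time


class RejectedError(Exception):
    """Hypothetical stand-in for a request rejected because the queue is full."""


def send_with_retry(send, max_attempts: int = 5, base_delay: float = 0.5,
                    sleep=time.sleep):
    """Call send(); on rejection wait base_delay * 2**attempt seconds and retry."""
    for attempt in range(max_attempts):
        try:
            return send()
        except RejectedError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the rejection
            sleep(base_delay * (2 ** attempt))


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_send():
        # Simulated request that is rejected twice before it succeeds.
        state["calls"] += 1
        if state["calls"] < 3:
            raise RejectedError("queue full")
        return "indexed"

    print(send_with_retry(flaky_send, base_delay=0.01))
```

Backing off on the client is usually preferable to raising the queue size, since a larger queue only moves the pressure into the node's heap.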
Enabling TLS Encryption
SSL/TLS encryption helps prevent threats such as man-in-the-middle (MitM) attacks, and other attempts to compromise Elasticsearch nodes and gain unauthorized access to data.
If encryption is disabled, Elasticsearch sends data between nodes and clients in plain text. This includes data that may contain sensitive information and credentials such as passwords. It also exposes the cluster to attacks in which attackers create malicious nodes, attempt to join your cluster, and replicate data to gain access to it.
To prevent these issues, it is crucial to enable TLS for all production environments. This ensures Elasticsearch nodes must use a certificate from a specified certificate authority (CA) when communicating with each other. Each node must identify itself and cannot access the cluster without a valid certificate.
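Node-to-node TLS is configured on the Elasticsearch side, but clients should verify certificates too. As a minimal sketch of the client half, the function below builds a TLS context with Python's standard `ssl` module that checks the server certificate against a CA and verifies the hostname. The function name and the optional CA file path are hypothetical; when no file is given, the system's default trust store is used.

```python
import ssl
from typing import Optional


def build_tls_context(ca_file: Optional[str] = None) -> ssl.SSLContext:
    """Client-side TLS context that verifies the server certificate and hostname."""
    # create_default_context enables certificate and hostname verification.
    ctx = ssl.create_default_context(cafile=ca_file)
    # Refuse protocol versions older than TLS 1.2.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx


if __name__ == "__main__":
    context = build_tls_context()
    print(context.verify_mode, context.check_hostname)
```

The context can then be handed to whichever HTTP client you use to reach the cluster, so that connections to a node presenting an untrusted certificate fail instead of silently proceeding.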
Learn more lessons about running Elasticsearch in our guide to Elasticsearch in production
Elasticsearch Storage with Cloud Volumes ONTAP
NetApp Cloud Volumes ONTAP, the leading enterprise-grade storage management solution, delivers secure, proven storage management services on AWS, Azure and Google Cloud. Cloud Volumes ONTAP supports up to a capacity of 368TB, and supports various use cases such as file services, databases, DevOps or any other enterprise workload, with a strong set of features including high availability, data protection, storage efficiencies, Kubernetes integration, and more.
Cloud Volumes ONTAP supports advanced features for managing SAN storage in the cloud, catering for NoSQL database systems, as well as NFS shares that can be accessed directly from cloud big data analytics clusters.
In addition, Cloud Volumes ONTAP provides storage efficiency features, including thin provisioning, data compression, and deduplication, reducing the storage footprint and costs by up to 70%.
For more on optimizing Elasticsearch deployment with NetApp, download our free eBook Optimize Elasticsearch Performance and Costs with Cloud Volumes ONTAP today.
Learn More About Elasticsearch
Elasticsearch Architecture: 7 Key Components
Elasticsearch is distributed and supports all data types, including numerical, textual, structured, unstructured, and geospatial data. Learn how Elasticsearch works and discover the 7 key components of the Elasticsearch architecture.
Read more: Elasticsearch Architecture: 7 Key Components
Elasticsearch on AWS: Deploying Your First Managed Cluster
Amazon’s managed Elasticsearch Service (Amazon ES) simplifies the deployment, scaling, and operation of clusters in the Amazon Web Services (AWS) cloud environment. The service runs the same code and APIs as on-premises Elasticsearch, ensuring that existing applications are fully compatible. Understand Elasticsearch on AWS and learn how to set up a new Elasticsearch cluster, feed data into it, and perform a basic search.
Elasticsearch on Azure: A Quick Start Guide
Elastic Stack on Azure is a deployment template that lets you deploy Elasticsearch clusters, including the full ELK Stack (Elasticsearch, Logstash and Kibana) in a fully automated manner. Learn about the Elasticsearch on Azure solution provided via the Azure marketplace and learn to set up your first ES cluster on Azure.
Read more: Elasticsearch on Azure: A Quick Start Guide
Elasticsearch on Kubernetes: DIY vs. Elasticsearch Operator
Elasticsearch is a distributed database using a clustered architecture. Kubernetes, the world’s most popular container orchestrator, makes it easier to deploy, scale, and manage Elasticsearch clusters at large scale. Learn two ways to deploy Elasticsearch on Kubernetes: directly using StatefulSets and Deployments, or automatically using the Elasticsearch Kubernetes Operator.
Elasticsearch vs MongoDB: 6 Key Differences
Discover how Elasticsearch and MongoDB compare on six dimensions including architecture, paid features, backups, and language support.
Read more: Elasticsearch vs MongoDB: 6 Key Differences
Elasticsearch on Google Cloud: Your First Managed Cluster
Elasticsearch is a popular NoSQL database based on the open source Lucene search engine, which facilitates fast search across large datasets. Elasticsearch provides the Elastic on Google Cloud solution, which lets you deploy Elasticsearch clusters on the Google Cloud Platform. Learn how to deploy Elasticsearch on Google Cloud, step by step: deployment options, customizing deployment, and analyzing data with Kibana.
Elasticsearch Optimization with Cloud Volumes ONTAP
With the growth in the volume and increasing complexity of production deployments and operational pipelines, managing logs and metrics is a crucial component of successful IT deployment. DevOps teams and administrators need logs and metrics to gain operational insights, meet their SLA obligations, prevent unauthorized access, and to identify errors, anomalies, or suspicious activity. For that, many companies turn to Elasticsearch. Elasticsearch optimization challenges include designing for high availability, lowering cloud costs, and protecting data. NetApp Cloud Volumes ONTAP can help.
How to Deploy Elasticsearch with Cloud Volumes ONTAP
As Elasticsearch grows in popularity, finding a way to make it work with your existing NetApp Cloud Volumes ONTAP deployment may become a critical goal for your IT operations. This post will give you a step-by-step walkthrough on setting up an Elasticsearch deployment using Cloud Volumes ONTAP, using AWS as an example. We’ll cover everything you need to get started, including all the prerequisites that you need, how to prepare your EC2 instances, provisioning your Cloud Volumes ONTAP LUNs, and tying both together.
Read more in How to Deploy Elasticsearch with Cloud Volumes ONTAP
Managed Service or Self-Managed?: Comparing the Two AWS Deployment Options for Elasticsearch
With two deployment options for AWS-based Elasticsearch—the AWS managed service, Amazon Elasticsearch, or self-managed Elasticsearch built on native AWS EC2 compute instances—enterprises looking to leverage the free-form search engine’s capabilities should be aware of the pros and cons of both approaches.
This post looks at the managed service and self-managed options, breaks them down, and shows you how Cloud Volumes ONTAP can help.
Self-Managed Elasticsearch vs. Elastic Cloud Managed Service
Elasticsearch is becoming more widely used by organizations thanks to its simple and effective search and catalog features which can enhance efforts to analyze data. But with two available deployment options—running native Elasticsearch vs Elastic Cloud managed service—potential users have an important choice to make.
In this blog we look at the differences between these two deployment models. Read on to compare the efforts involved, costs, and customizability of the Elasticsearch vs. Elastic Cloud managed service deployment models to help you decide.