Deprecated: See https://untergeek.com/2014/06/13/curator-1-1-0-released/
In my last post I mentioned curator, an update to the logstash_index_cleaner.py
script I’d written so very long ago (okay, is 2.5 years a long time? It sure seems like it in sysops time…). I even linked to my blog post at elasticsearch.org about it. It hasn’t been quite a month, yet, but there have been some changes since then so I thought I’d write another blog post about it.
Installation
Curator is now in PyPI! Yay! This makes it so much easier to install:
pip install elasticsearch-curator
However, if you are using a version of Elasticsearch less than 1.0.0, you will need to use a different version of curator:
pip install elasticsearch-curator==0.6.2
Why specify a specific version? We’re branching curator to accommodate changes in the Elasticsearch API for version 1.0 (which have corresponding changes in the python elasticsearch module).
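You can confirm exactly which versions pip installed (a quick sketch; the version numbers shown are only examples and will vary):
pip freeze | grep -i elasticsearch
# elasticsearch==0.4.5
# elasticsearch-curator==0.6.2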
Upgrading
Already using a version of curator? Upgrading is easy!
pip install -U elasticsearch-curator
The same pattern applies if you need to upgrade to a specific version (==X.Y.Z).
Usage
If you’ve installed via pip, then you’re all ready to go. You no longer need to append .py as before, and it installs to /usr/local/bin, so if that’s in your path, you don’t have to change a thing to use it:
curator -h
This will show you the help output, which is rather long. I will touch on a few of the features and configuration options.
Delete
This is by far the most common use for curator. But did you know you can delete by space or by date?
Date
Deleting by date is simple! To delete indices older than 30 days,
curator --host my-host -d 30
You can even delete by date + hour! If you have indices defined like logstash-%{+YYYY.MM.dd.HH}, you can delete indices older than 48 hours like this:
curator --host my-host --time-unit hours -d 48
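If you want to verify which indices exist before and after a run, you can ask the cluster directly. A minimal sketch, assuming Elasticsearch 1.0 or later for the _cat API (my-host is a placeholder):
curl 'http://my-host:9200/_cat/indices?v'
For a preview of what curator itself would do without deleting anything, see the --dry-run flag described below.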
Space
You can delete by space if you need to, but with some provisos and warnings.
- If you close indices you will not get an accurate count of bytes. Elasticsearch cannot report on the space consumed by closed indices.
- If you choose this method (and keep a large number of daily indices as a result) you may eventually exhaust the portion of your Elasticsearch heap space reserved for indexing. I’ll revisit this later (in another blog post), but the short answer is you could wind up with too many simultaneously open indices. One way to fix that is to close indices you’re not actively using, but then you loop right back to the first proviso.
- Deleting by space across a cluster is more complicated because the index size reported will be divided among all of your nodes. Since Elasticsearch tries to balance storage equally across all nodes, you’ll need to calculate accordingly; see the worked example below.
To delete indices beyond 1TB (1024GB) of space:
curator --host my-host --curation-style space -g 1024
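As a hypothetical worked example of that per-node math: with 3 data nodes and roughly 1TB of usable disk on each, you would specify the cluster-wide total rather than the per-node figure:
# 3 nodes x 1024GB per node = 3072GB cluster-wide
curator --host my-host --curation-style space -g 3072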
Optimize (or rather forceMerge)
The term “optimize” will not die, unfortunately. Optimizing a hard drive is something you used to have to do every so often to defragment and re-order things for the best performance. Businesses optimize constantly to improve efficiency and save money. But in the Elasticsearch world, optimizing is something you never have to do. It is completely optional. Truthfully, it does yield a measurable but nearly negligible performance benefit for searches (1% – 2%). So why do it? I’m glad you asked!
In technical terms, when you perform an optimize API call in Elasticsearch you’re asking it to do what’s known as a forceMerge. This takes a bit of background, so google that if you want to know the deep-down details. The short version is that an Elasticsearch index is made up of shards, and each shard is itself a Lucene index. Each shard, in turn, is composed of segments. While indexing, Elasticsearch merges segments to keep from having too many open simultaneously (which can have an impact on the availability of file handles, etc.). Once you are no longer indexing to a given index, it won’t need all of those segments any more. One of the best reasons to optimize is that recovery time during rolling restarts and outages can be dramatically reduced, as far fewer segments have to be verified. One of the worst reasons to optimize is the slight performance boost to searches: as stated, a mere 1% – 2% increase. The cost in terms of disk I/O is tremendous. It is ill advised to optimize indices on busy clusters, as both search and indexing can take a performance hit. It is absolutely unnecessary to optimize an index that is currently indexing. Don’t do it.
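If you’re curious how many segments a given index is carrying, the segments API will show you (a sketch; the index name here is only an example):
curl 'http://my-host:9200/logstash-2014.02.16/_segments?pretty'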
With these caveats out of the way, you can optimize indices older than 1 day down to 1 segment per shard (fully optimized) like this:
curator --host my-host --timeout 7200 --max_num_segments 1 -o 1
If unspecified, --max_num_segments defaults to 2 segments per shard. Notice that the --timeout directive is set to 7200 seconds (2 hours). Even small indices can take a long time to optimize. My personal 2-node cluster on spinning disks, with around 2.5G of data per index, takes 45 minutes to optimize a single index. I run it in the middle of the night, at 2am, via cron.
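For reference, that nightly run is nothing more than a crontab entry. A minimal sketch, assuming curator was installed to /usr/local/bin (check where pip placed it on your system):
# Fully optimize indices older than 1 day, nightly at 2am
0 2 * * * /usr/local/bin/curator --host my-host --timeout 7200 --max_num_segments 1 -o 1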
Disable bloom filter cache
This is one of the best new features in curator! It only works, however, if your Elasticsearch version is 0.90.9 or higher. After you learn what it does, you’ll hopefully find it an incentive to upgrade if you haven’t already.
The bloom filter cache speeds the indexing process. With it disabled, indexing can continue, but at a roughly 40% – 50% speed penalty. But what about time-series indices, like those from Logstash? Today is 2014.02.18 and I’m currently writing to an index called logstash-2014.02.18. But I am not writing to logstash-2014.02.16 any more, so why should I keep the bloom filter cache loaded there? By disabling the bloom filter cache on “cold” indices I can reclaim valuable resources to benefit the whole cluster.
You can disable the bloom filter cache for indices older than 1 day like this:
curator --host my-host -b 1
Simple, no? The creator of Elasticsearch, Shay Banon, was very keen to get this into curator as soon as possible as it is one of the easiest ways for Logstash users to get a lot of benefit, very quickly.
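Under the hood, this amounts to an index settings update. A hedged curl equivalent, assuming the index.codec.bloom.load setting introduced in Elasticsearch 0.90.9 (the index name is only an example):
curl -XPUT 'http://my-host:9200/logstash-2014.02.16/_settings' -d '{"index.codec.bloom.load": false}'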
Close
One of the earliest requests for curator was for staged expiration of indices: that is to say, close indices older than 15 days and delete them after 30 days. This is a big deal because an open index consumes resources whether you’re searching through it or not, while a closed index only consumes disk space. If you typically aren’t searching past 1 week, then closing older indices is a fantastic way to free up valuable resources for your cluster. And if you’re obliged to keep 30 days of data but rarely, if ever, search past 2 weeks, you can meet that requirement easily this way.
To close indices older than 15 days:
curator --host my-host -c 15
So simple!
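For reference, closing a single index by hand is one API call; curator simply automates it across every matching index (a sketch; the index name is only an example):
curl -XPOST 'http://my-host:9200/logstash-2014.01.20/_close'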
Combining flags
Of course, you don’t need to run one command at a time. If I wanted to close indices older than 15 days, delete older than 30, and disable bloom filters on indices older than 1 day:
curator --host my-host -b 1 -c 15 -d 30
One important limit is that you can’t delete by space and combine with any other operation. That one needs to fly solo.
Order of operations
When combining flags it’s important to know that the script forces a particular order of operations to prevent unneeded API calls. Why optimize an index that’s closed? (hint: it’ll fail if you try anyway) Why close an index that’s slated for deletion?
The order is as follows:
- Delete (by space or time)
- Close
- Disable bloom filters
- Optimize
Other flags
--prefix
With the recent release of Elasticsearch Marvel, the --prefix flag will get some frequent usage! Marvel stores its data in indices with a similar naming pattern to Logstash: .marvel-%{+YYYY.MM.dd}, so if you’re using Marvel and want to prune those older indices, curator will be happy to oblige!
To perform operations on indices with a different prefix (the default is logstash-), specify it with the --prefix flag:
curator --host my-host -d 30 --prefix .marvel-
The prefix should be everything right up to the date, including the hyphen in the example above.
--separator
If you format your date differently for some reason, e.g. %{+YYYY-MM-dd} (with hyphens instead of periods), then you can specify the separator like this:
curator --host my-host -d 30 --separator -
--ssl
If you are accessing Elasticsearch through a proxy which is protected by SSL, you can specify the --ssl flag on your command line.
--url_prefix
If you are employing a proxy to isolate your Elasticsearch and are redirecting things through a path you might need this feature.
For example, if your Elasticsearch cluster were behind host foo.bar and had a URL prefix of backend, your API call to check settings would look like this:
http://foo.bar/backend/_settings
Your curator command-line would then include these options:
curator --host foo.bar --url_prefix "backend"
Combining the --ssl and --url_prefix options would allow you to access a proxied, SSL-protected Elasticsearch instance like this:
https://foo.bar/backend/_settings
with these command-line options:
curator --host foo.bar --port 443 --ssl --url_prefix "backend"
--dry-run
Adding the --dry-run flag to your curator command line will show you what actions would have been taken, without actually performing them.
--debug
This should be self-explanatory: Increased log verbosity 🙂
--logfile
If you do not specify a log file with this flag, all log messages will be directed to stdout. If you put curator into a crontab without specifying this (or redirecting stdout and stderr) you run the risk of noisy emails every time curator runs.
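Putting it all together, a hedged crontab sketch that combines the flags from earlier and keeps cron quiet (the log path and install path are assumptions):
# Nightly curator run; log to a file instead of emailing cron output
0 2 * * * /usr/local/bin/curator --host my-host -b 1 -c 15 -d 30 --logfile /var/log/curator.log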
Conclusion (and future features!)
Curator has come a long way from its humble beginnings, but the best is yet to come! Some of the feature requests we’re examining right now include:
- Shard (index) allocation: Using Elasticsearch’s shard allocation awareness, use curator to move older indices from higher performance nodes (e.g. with SSD drives) to slower nodes (e.g. with spinning disks).
- Custom actions: The ability for users to register their own actions to be run at given thresholds.
- Index pre-allocation: On busy clusters, it can be taxing at UTC rollover to create new indices, create new mappings, etc. on the fly. Use curator to create these in advance (complete with mappings, if provided).
- Snapshot/backup: Use Elasticsearch 1.0’s new snapshot and restore feature to backup indices.
Do you have an idea of something you’d like to see included in curator? Please submit a feature request, or better yet, fork the repository and add it yourself! We accept pull requests!
Hi Aaron,
I just upgraded my test Elasticsearch cluster to 1.0 and wanted to give Curator a try, but after running the “pip install elasticsearch-curator” as per this blog post it doesn’t seem to have given me a compatible version for Elasticsearch 1.0.
When I try running a simple command like “curator --host localhost -d 20” I get the following message back:
Expected Elasticsearch version range > 0.19.4 <1.0.0
Downloading/unpacking elasticsearch>=0.4.4,<1.0.0 (from elasticsearch-curator)
Downloading elasticsearch-0.4.5.tar.gz (45Kb): 45Kb downloaded
Running setup.py egg_info for package elasticsearch
Thanks,
Will.
Okay, ignore me. I removed the package and installed again, and this time I’ve got the 1.0 stuff. Very strange, but at least now it works!
Cheers,
Will
I know what happened: I downloaded the master branch of your git repository before I re-installed, and it’s taken version 1.0.0 from there. So the PyPI version is still 0.6.2, and if you want version 1.0 you still have to use the git repository.
That’s correct. We haven’t released 1.0 to PyPI officially, yet. This will happen soon, perhaps in conjunction with the Logstash 1.4 release.
Hi,
I struggle with the installation of curator 1.0. If I try to start the script, it fails with the exception:
pkg_resources.DistributionNotFound: elasticsearch>=1.0.0,<2.0.0
Downloading/unpacking elasticsearch>=1.0.0,<2.0.0 (from elasticsearch-curator==1.0.0-dev)
Downloading/unpacking urllib3>=1.5 (from elasticsearch>=1.0.0,<2.0.0->elasticsearch-curator==1.0.0-dev)
Downloading urllib3-1.7.1.tar.gz (67kB): 67kB downloaded
Running setup.py (path:/tmp/pip_build_root/urllib3/setup.py) egg_info for package urllib3
Installing collected packages: elasticsearch, elasticsearch-curator, urllib3
Running setup.py install for elasticsearch-curator
warning: no previously-included files matching '__pycache__' found under directory '*'
warning: no previously-included files matching '*.py[co]' found under directory '*'
Installing curator script to /usr/bin
Running setup.py install for urllib3
Successfully installed elasticsearch elasticsearch-curator urllib3
Cleaning up…
Any help would be appreciated.
Sorry to hear you’re having a problem. 1.0 is not yet officially released. How did you install this? The way to install the dev version would be to clone the github repository and do
python setup.py install
After installation, curator should be installed to /usr/bin (or /usr/local/bin) and you should be able to call
curator -h
from the command line.

Thanks for your response. I’ve installed it with “pip install .”.
I’ve also tried your suggested installation method, but that does not work either. Here is the output:
running install
running bdist_egg
running egg_info
writing requirements to elasticsearch_curator.egg-info/requires.txt
writing elasticsearch_curator.egg-info/PKG-INFO
writing top-level names to elasticsearch_curator.egg-info/top_level.txt
writing dependency_links to elasticsearch_curator.egg-info/dependency_links.txt
writing entry points to elasticsearch_curator.egg-info/entry_points.txt
reading manifest file 'elasticsearch_curator.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
warning: no previously-included files matching '__pycache__' found under directory '*'
warning: no previously-included files matching '*.py[co]' found under directory '*'
writing manifest file 'elasticsearch_curator.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
creating build/bdist.linux-x86_64/egg
creating build/bdist.linux-x86_64/egg/curator
copying build/lib/curator/curator.py -> build/bdist.linux-x86_64/egg/curator
copying build/lib/curator/__init__.py -> build/bdist.linux-x86_64/egg/curator
byte-compiling build/bdist.linux-x86_64/egg/curator/curator.py to curator.pyc
byte-compiling build/bdist.linux-x86_64/egg/curator/__init__.py to __init__.pyc
creating build/bdist.linux-x86_64/egg/EGG-INFO
copying elasticsearch_curator.egg-info/PKG-INFO -> build/bdist.linux-x86_64/egg/EGG-INFO
copying elasticsearch_curator.egg-info/SOURCES.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying elasticsearch_curator.egg-info/dependency_links.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying elasticsearch_curator.egg-info/entry_points.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying elasticsearch_curator.egg-info/requires.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying elasticsearch_curator.egg-info/top_level.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
zip_safe flag not set; analyzing archive contents…
creating 'dist/elasticsearch_curator-1.0.0_dev-py2.6.egg' and adding 'build/bdist.linux-x86_64/egg' to it
removing 'build/bdist.linux-x86_64/egg' (and everything under it)
Processing elasticsearch_curator-1.0.0_dev-py2.6.egg
creating /usr/lib/python2.6/site-packages/elasticsearch_curator-1.0.0_dev-py2.6.egg
Extracting elasticsearch_curator-1.0.0_dev-py2.6.egg to /usr/lib/python2.6/site-packages
Adding elasticsearch-curator 1.0.0-dev to easy-install.pth file
Installing curator script to /usr/bin
Installed /usr/lib/python2.6/site-packages/elasticsearch_curator-1.0.0_dev-py2.6.egg
Processing dependencies for elasticsearch-curator==1.0.0-dev
Searching for elasticsearch>=1.0.0,<2.0.0
I'll create an issue on GitHub for this.
Thanks for the article! One question about running curator, though: should it be constantly running, or run from time to time, say daily?
Rick,
Sorry for the delay in responding. I’ve been at DevOpsDays, Austin, TX.
Typically I run it once a day, in the most off-peak hours (to reduce the disk I/O impact of an optimize). I use cron to accomplish this.