My current Logstash template — 2015-08-31

I figured it was time to share my current template again, as much has changed since Logstash 1.2.  The changes include:

  1. doc_values everywhere applicable
  2. Defaults for all numeric types, using doc_values
  3. Proper mapping for the raw sub-field
  4. Leaving the message field analyzed, and with no raw sub-field
  5. Added ip, latitude, and longitude fields to the geoip mapping, using doc_values

If you couldn’t tell, I’m crazy about doc_values.  Using doc_values (where permitted) prevents your Elasticsearch Java heap from growing out of control when performing large aggregations—for example, a month’s worth of data with Kibana—with very little upfront cost in additional storage.

This is mostly generic, but it does have a few things which are specific to my use case (like the Nginx entry).  Feel free to adapt to your needs.

{
  "template" : "logstash-*",
  "settings" : {
    "index.refresh_interval" : "5s"
  },
  "mappings" : {
    "_default_" : {
       "_all" : {"enabled" : true, "omit_norms" : true},
       "dynamic_templates" : [ {
         "message_field" : {
           "match" : "message",
           "match_mapping_type" : "string",
           "mapping" : {
             "type" : "string", "index" : "analyzed", "omit_norms" : true
           }
         }
       }, {
         "string_fields" : {
           "match" : "*",
           "match_mapping_type" : "string",
           "mapping" : {
             "type" : "string", "index" : "analyzed", "omit_norms" : true,
               "fields" : {
                 "raw" : {"type": "string", "index" : "not_analyzed", "doc_values" : true, "ignore_above" : 256}
               }
           }
         }
       }, {
         "float_fields" : {
           "match" : "*",
           "match_mapping_type" : "float",
           "mapping" : { "type" : "float", "doc_values" : true }
         }
       }, {
         "double_fields" : {
           "match" : "*",
           "match_mapping_type" : "double",
           "mapping" : { "type" : "double", "doc_values" : true }
         }
       }, {
         "byte_fields" : {
           "match" : "*",
           "match_mapping_type" : "byte",
           "mapping" : { "type" : "byte", "doc_values" : true }
         }
       }, {
         "short_fields" : {
           "match" : "*",
           "match_mapping_type" : "short",
           "mapping" : { "type" : "short", "doc_values" : true }
         }
       }, {
         "integer_fields" : {
           "match" : "*",
           "match_mapping_type" : "integer",
           "mapping" : { "type" : "integer", "doc_values" : true }
         }
       }, {
         "long_fields" : {
           "match" : "*",
           "match_mapping_type" : "long",
           "mapping" : { "type" : "long", "doc_values" : true }
         }
       }, {
         "date_fields" : {
           "match" : "*",
           "match_mapping_type" : "date",
           "mapping" : { "type" : "date", "doc_values" : true }
         }
       } ],
       "properties" : {
         "@timestamp": { "type": "date", "doc_values" : true },
         "@version": { "type": "string", "index": "not_analyzed", "doc_values" : true },
         "clientip": { "type": "ip", "doc_values" : true },
         "geoip"  : {
           "type" : "object",
           "dynamic": true,
           "properties" : {
             "ip": { "type": "ip", "doc_values" : true },
             "location" : { "type" : "geo_point", "doc_values" : true },
             "latitude" : { "type" : "float", "doc_values" : true },
             "longitude" : { "type" : "float", "doc_values" : true }
           }
         }
       }
    },
    "nginx_json" : {
      "properties" : {
        "duration" : { "type" : "float", "doc_values" : true },
        "status" : { "type" : "short", "doc_values" : true }
      }
    }
  }
}

 
You can also find this in a GitHub gist.
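If you would rather push the template manually than let the Logstash elasticsearch output manage it (the output also has template and template_overwrite options for that), a one-liner like the following should do it. This is just a sketch: it assumes you saved the JSON above as logstash-template.json and that Elasticsearch is listening on localhost:9200.

curl -XPUT 'http://localhost:9200/_template/logstash' -d @logstash-template.json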
 
Feel free to add any suggestions, or adaptations you may have used in the comments below!
 

Curator 1.1.0 Released

Hi all!

I have been busy working on Curator 1.1.0 since Elasticsearch released version 1.0, with Snapshot/Restore capability. It’s taken some time to get things to work the way I wanted them, but the results are good!

I wrote a full blog post about it on elasticsearch.com.

I did a huge workup of Curator version 1.0.0 in a previous blog post, but the commands are different now, so I went to the trouble of creating a documentation wiki to make things easier.

No, really:

READ THE DOCUMENTATION WIKI

Important: A Brand New Command-Line Structure

Changes to the command-line structure mean that your older cron entries will not work with Curator 1.1.0. Please remember to update your commands when upgrading.
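As an illustration (my own sketch, not from the wiki), a 1.1.0-style cron entry might look like this, assuming curator was installed to /usr/local/bin:

# Delete indices older than 30 days, nightly at 01:00
0 1 * * * /usr/local/bin/curator delete --older-than 30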

So if you missed my cue to read the new documentation wiki, here are some of the highlights.

Add/Remove indices from an Alias

Add indices older than 30 days to alias last_month:

 curator alias --alias-older-than 30 --alias last_month

Remove indices older than 60 days from alias last_month:

 curator alias --unalias-older-than 60 --alias last_month

Delete indices

Delete indices older than 30 days:

curator delete --older-than 30

Delete by space. Keep 1024GB (1TB) of data in elasticsearch:

curator delete --disk-space 1024

Note that when using size to determine which indices to keep, having closed indices will cause inaccuracies since they cannot be added to the overall size. This is only an issue if you have closed some indices that are not your oldest ones.

Close indices

Close indices older than 14 days:

curator close --older-than 14

Disable bloom filter for indices

Disable bloom filter for indices older than 1 day:

curator bloom --older-than 1

Optimize (Lucene forceMerge) indices

Optimize is a bit of a misnomer. It is in actuality a Lucene forceMerge operation. With time-series data in a per-day index, Lucene does a good job of keeping the number of segments low. However, if no new data is being ingested, no further segment merging will happen. There are some minor performance benefits from merging segments down to a smaller count, but a greater benefit when it comes to restarts [e.g. version upgrades, etc.] after a shutdown: with fewer segments to have to validate, the cluster comes back up sooner.

Optimize (Lucene forceMerge) indices older than 2 days to 2 segments per shard (the default is 2):

curator optimize --older-than 2

Optimize (Lucene forceMerge) indices older than 2 days to 1 segment per shard:

curator optimize --older-than 2 --max_num_segments 1

Please note that --timeout is no longer required, as it was in versions older than 1.1.0. A default of 6 hours (21600 seconds) will be applied for optimize and snapshot operations. Since an optimize can take a long time, curator may disconnect and fail to continue with further operations if the timeout is not set high enough. This number may need to be higher, or could be reduced, depending on your scenario. The log file will tell you how long previous operations took, which you can use as a guideline.
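For example, to give a long-running forceMerge an explicit six-hour window, you could set the timeout yourself. This is a sketch that assumes --timeout is a global flag and goes before the subcommand:

curator --timeout 21600 optimize --older-than 2 --max_num_segments 1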

Shard/index allocation

You can use curator to apply routing tags to your indices. This is useful for migrating stale indices from your heavy-duty indexing boxes to slower-hardware search boxes. Read more here about the index.routing.allocation.require.* settings. In order for the index-level settings to work, you must also have corresponding node-level settings (see the elasticsearch.yml sketch below).

Apply setting index.routing.allocation.require.tag=done_indexing to indices older than 2 days:

curator allocation --older-than 2 --rule tag=done_indexing
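The node-level half of that equation lives in elasticsearch.yml on the nodes that should receive the stale indices. A minimal sketch, assuming a custom node attribute named tag:

# elasticsearch.yml on the slower, search-only boxes (hypothetical)
node.tag: done_indexing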

Snapshots

You can use curator to capture snapshots to a pre-defined repository. To create a repository you can use the API, or the es_repo_mgr script (included with curator 1.1.0). There are other tools available.

One snapshot will be created per index, and it will take its name from the index, e.g. an index named logstash-2014.06.10 will yield a snapshot named logstash-2014.06.10. The only index in each snapshot will be that matching index. This means that if you are trying to snapshot multiple indices, curator will loop through them one at a time, and it could take a while. You may need to set the initial timeout to something ridiculously large if you are just starting to capture snapshots and have a backlog of indices. Another potential solution would be to snap them incrementally by changing the --older-than setting. Snapshots can also be captured by --most-recent count, and can be deleted with --delete-older-than:

Snapshot indices older than 20 days to REPOSITORY_NAME:

curator snapshot --older-than 20 --repository REPOSITORY_NAME     

Snapshot most recent 3 indices matching prefix .marvel- to REPOSITORY_NAME:

 curator snapshot --most-recent 3 --prefix .marvel- --repository REPOSITORY_NAME

Delete snapshots older than 365 days from REPOSITORY_NAME:

 curator snapshot --delete-older-than 365 --repository REPOSITORY_NAME

As with optimize, a default timeout of 6 hours (21600 seconds) will be applied for snapshots. Since a snapshot can take a long time, curator may disconnect and fail to continue with further operations if the timeout is not set high enough. This number may need to be higher, or could be reduced, depending on your scenario. The log file will tell you how long previous operations took, which you can use as a guideline.
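Putting it together, a nightly cron entry along these lines (my own sketch; the path, schedule, and repository name are assumptions) would keep snapshots flowing and pruned:

# Snapshot indices older than 1 day, then prune snapshots older than a year
30 0 * * * /usr/local/bin/curator snapshot --older-than 1 --repository REPOSITORY_NAME && /usr/local/bin/curator snapshot --delete-older-than 365 --repository REPOSITORY_NAME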

Display indices and snapshots matching prefix

Display a list of all indices matching prefix (logstash- by default):

curator show --show-indices

Display a list of all snapshots in REPOSITORY_NAME matching prefix (logstash- by default):

curator show --show-snapshots --repository REPOSITORY_NAME

Conclusion

Curator 1.1.0 has awesome new features!  Please, go forth and make your time-series index management more awesome!  Happy Curating!

New collectd codec (Logstash 1.4.1+) configuration

With the advent of Logstash 1.4.1, I wanted to make sure everyone knows about the new collectd codec.

In Logstash 1.3.x, we introduced the collectd input plugin.  It was awesome!  We could process metrics in Logstash, store them in Elasticsearch and view them with Kibana.  The only downside was that you could only get around 3100 events per second through the plugin.  With Logstash 1.4.0 we introduced a newly revamped UDP input plugin which was multi-threaded and had a queue.  I refactored the collectd input plugin to be a codec (with some help from my co-workers and the community) to take advantage of this huge performance increase.  Now with only 3 threads on my dual-core Macbook Air I can get over 45,000 events per second through the collectd codec!

So, I wanted to provide some quick examples you could use to change your plugin configuration to use the codec instead.

The old way:

input {
  collectd {}
}

The new way:

input {
  udp {
    port => 25826         # Must be specified. 25826 is the default for collectd
    buffer_size => 1452   # Should be specified. 1452 is the default for recent versions of collectd
    codec => collectd { } # This will invoke the default options for the codec
    type => "collectd"
  }
}

This new configuration will use 2 threads and a queue size of 2000 by default for the UDP input plugin. With this you should easily be able to break 30,000 events per second!
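If you need to push beyond that, the UDP input also exposes workers and queue_size settings you can raise. Here is a sketch; the values are just a starting point, so tune them for your hardware:

input {
  udp {
    port        => 25826
    buffer_size => 1452
    workers     => 4          # default is 2; add threads if you have spare cores
    queue_size  => 10000      # default is 2000; a deeper queue absorbs bursts
    codec       => collectd { }
    type        => "collectd"
  }
}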

I have provided a gist with some other configuration examples. For more information, please check out the Logstash documentation for the collectd codec.

Happy Logstashing!

Curator: Managing your Logstash and other time-series indices in Elasticsearch — beyond delete and optimize

Deprecated: See https://untergeek.com/2014/06/13/curator-1-1-0-released/


 

In my last post I mentioned curator, an update to the logstash_index_cleaner.py script I’d written so very long ago (okay, is 2.5 years a long time? It sure seems like it in sysops time…). I even linked to my blog post at elasticsearch.org about it. It hasn’t been quite a month, yet, but there have been some changes since then so I thought I’d write another blog post about it.

Installation

Curator is now in PyPI! Yay! This makes it so much easier to install:

pip install elasticsearch-curator

However, if you are using a version of Elasticsearch less than 1.0.0, you will need to use a different version of curator:

pip install elasticsearch-curator==0.6.2

Why specify a specific version? We’re branching curator to accommodate changes in the Elasticsearch API for version 1.0 (which have corresponding changes in the python elasticsearch module).

Upgrading

Already using a version of curator?  Upgrading is easy!

pip install -U elasticsearch-curator

The same pattern applies if you need to upgrade to a specific version (==X.Y.Z).

Usage

If you’ve installed via pip then you’re all ready to go.  You don’t even need to append .py as you did before, and it installs to /usr/local/bin, so if that’s in your path you don’t have to change a thing to use it:

curator -h

This will show you the help output, which is rather long.  I will touch on a few of the features and configuration options.

Delete

This is by far the most common use for curator.  But did you know you can delete by space or by date?

Date

Deleting by date is simple! To delete indices older than 30 days,

curator --host my-host -d 30

You can even delete by date + hour! If you have indices defined like logstash-%{+YYYY.MM.dd.HH} you can delete indices older than 48 hours like this:

curator --host my-host --time-unit hours -d 48

Space

You can delete by space if you need to, but with some provisos and warnings.

  1. If you close indices you will not get an accurate count of bytes.  Elasticsearch cannot report on the space consumed by closed indices.
  2. If you choose this method (and keep a large number of daily indices as a result) you may eventually exhaust the portion of your Elasticsearch heap space reserved for indexing.  I’ll revisit this later (in another blog post), but the short answer is you could wind up with too many simultaneously open indices.  One way to fix that is to close indices you’re not actively using, but then you get looped back to #1.
  3. Deleting by space across a cluster is more complicated because the index size reported will be divided among all of your nodes.  Since Elasticsearch tries to balance storage equally across all nodes, you’ll need to calculate accordingly.

To delete indices beyond 1TB (1024GB) of space:

curator --host my-host --curation-style space -g 1024

Optimize (or rather forceMerge)

The term “optimize” will not die, unfortunately.  Optimizing a hard drive is something you used to have to do every so often to defragment and re-order things for the best performance.  Businesses optimize constantly to improve efficiency and save money.  But in the Elasticsearch world, optimizing is something you never have to do.  It is completely optional.  Truthfully, it does offer a measurable but nearly negligible performance benefit for searches (1% – 2%).  So why do it?  I’m glad you asked!

In technical terms, when you perform an optimize API call in Elasticsearch you’re asking it to do what’s known as a forceMerge.  This takes a bit of background, so google that if you want to know the deep down details.  The short version is that an Elasticsearch index is made up of shards, and each shard is a Lucene index in its own right.  Each shard, in turn, is composed of segments.  While indexing, Elasticsearch will merge segments to keep from having too many open simultaneously (which can have an impact on availability of file handles, etc.).  Once you are no longer indexing to a given index, it won’t need all of those segments any more.  One of the best reasons to optimize is that recovery time during rolling restarts and outages can be dramatically reduced, as far fewer segments have to be verified.  One of the worst reasons to optimize is that you do get a slight performance boost to searches — as stated, a mere 1% – 2% increase.  The cost in terms of disk I/O is tremendous.  It is ill advised to optimize indices on busy clusters, as both search and indexing can take a performance hit.  It is absolutely unnecessary to optimize an index that is currently indexing.  Don’t do it.

With these caveats out of the way, you can optimize indices older than 1 day down to 1 segment per shard (fully optimized) like this:

curator --host my-host --timeout 7200 --max_num_segments 1 -o 1

If unspecified, --max_num_segments defaults to 2 segments per shard.  Notice that the --timeout directive is set to 7200 seconds (2 hours).  Even small indices can take a long time to optimize.  My personal, 2 node cluster on spinning disks with around 2.5G of data per index takes 45 minutes to optimize a single index.  I run it in the middle of the night at 2am via cron.
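In crontab form, that nightly run looks something like this (the install path is an assumption):

# Optimize indices older than 1 day at 2am, allowing up to 2 hours
0 2 * * * /usr/local/bin/curator --host my-host --timeout 7200 --max_num_segments 1 -o 1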

Disable bloom filter cache

This is one of the best new features in curator!  It only works, however, if your Elasticsearch version is 0.90.9 or higher.  After you learn what it does you’ll hopefully find it incentive enough to upgrade if you haven’t already.

The bloom filter cache speeds the indexing process.  With it disabled, indexing can continue, but at a roughly 40% – 50% speed penalty.  But what about time-series indices, like those from Logstash?  Today is 2014.02.18 and I’m currently writing to an index called logstash-2014.02.18.  But I am not writing to logstash-2014.02.16 any more, so why should I have the bloom filter cache open there?  By disabling the bloom filter cache on “cold” indices I can reclaim valuable resources to benefit the whole cluster.

You can disable the bloom filter cache for indices older than 1 day like this:

curator --host my-host -b 1

Simple, no?  The creator of Elasticsearch, Shay Banon, was very keen to get this into curator as soon as possible as it is one of the easiest ways for Logstash users to get a lot of benefit, very quickly.

Close

One of the earliest requests for curator was for staged expiration of indices.  That is to say, close indices older than 15 days and delete them after 30 days.  This is a big deal because an open index is consuming resources, whether you’re searching through it or not.  A closed index only consumes disk space.  If you typically aren’t searching past 1 week, then having the indices closed is a fantastic way to free up valuable resources for your cluster.  Also, if you’re obliged to keep 30 days of data, but rarely—if ever—search past 2 weeks, you can also meet that requirement easily with this setting.

To close indices older than 15 days:

curator --host my-host -c 15

So simple!

Combining flags

Of course, you don’t need to run one command at a time.  If I wanted to close indices older than 15 days, delete older than 30, and disable bloom filters on indices older than 1 day:

curator --host my-host -b 1 -c 15 -d 30

One important limit is that you can’t delete by space and combine with any other operation.  That one needs to fly solo.

Order of operations

When combining flags it’s important to know that the script forces a particular order of operations to prevent unneeded API calls.  Why optimize an index that’s closed? (hint: it’ll fail if you try anyway) Why close an index that’s slated for deletion?

The order is as follows:

  1. Delete (by space or time)
  2. Close
  3. Disable bloom filters
  4. Optimize

Other flags

--prefix

With the recent release of Elasticsearch Marvel the --prefix flag will get some frequent usage!  Marvel stores its data in indices with a similar naming pattern to Logstash: .marvel-%{+YYYY.MM.dd}, so if you’re using Marvel and want to prune those older indices, curator will be happy to oblige!

To perform operations on indices with a different prefix (the default is logstash-), specify it with the --prefix flag:

curator --host my-host -d 30 --prefix .marvel-

The prefix should be everything right up to the date, including the hyphen in the example above.

--separator

If you format your date differently for some reason, e.g. %{+YYYY-MM-dd} (with hyphens instead of periods), then you can specify the separator like this:

curator --host my-host -d 30 --separator -

--ssl

If you are accessing Elasticsearch through a proxy which is protected by SSL, you can specify the --ssl flag in your command-line.

--url_prefix

If you are employing a proxy to isolate your Elasticsearch and are redirecting things through a path you might need this feature.

For example, if your Elasticsearch cluster were behind host foo.bar and had a url prefix of backend, your API call to check settings would look like this:

http://foo.bar/backend/_settings

Your curator command-line would then include these options:

curator --host foo.bar --url_prefix "backend"

Combining the --ssl and --url_prefix options would allow you to access a proxied, SSL protected Elasticsearch instance like this:

https://foo.bar/backend/_settings

with these command-line options:

curator --host foo.bar --port 443 --ssl --url_prefix "backend"

--dry-run

Adding the --dry-run flag to your curator command line will show you what actions might have been taken without actually performing them.

--debug

This should be self-explanatory:  Increased log verbosity 🙂

--logfile

If you do not specify a log file with this flag, all log messages will be directed to stdout.  If you put curator into a crontab without specifying this (or redirecting stdout and stderr) you run the risk of noisy emails every time curator runs.
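For example, a quiet cron entry (paths assumed) keeps the output in a log file rather than in your inbox:

# Nightly delete, with all log messages going to a file instead of cron email
0 1 * * * /usr/local/bin/curator --host my-host -d 30 --logfile /var/log/curator.log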

Conclusion (and future features!)

Curator has come a long way from its humble beginnings, but the best is yet to come!  We’re examining a number of feature requests right now.

Do you have an idea of something you’d like to see included in curator?  Please submit a feature request, or better yet, fork the repository and add it yourself!  We accept pull requests!

Getting Apache to output JSON (for logstash 1.2.x)

Greetings, travelers, who may have come to this page by way of my other page on this subject, which covered the same recipe for logstash version 1.1.x.

Logstash 1.2.1 is brand new as of this edition.  The changes to my Apache CustomLog JSON recipe are in! I’ve since updated this page to drop the prune filter and exclusively use the new logstash conditionals.

Apache configuration:

LogFormat "{ \
            \"@timestamp\": \"%{%Y-%m-%dT%H:%M:%S%z}t\", \
            \"@version\": \"1\", \
            \"vips\":[\"vip.example.com\"], \
            \"tags\":[\"apache\"], \
            \"message\": \"%h %l %u %t \\\"%r\\\" %>s %b\", \
            \"clientip\": \"%a\", \
            \"duration\": %D, \
            \"status\": %>s, \
            \"request\": \"%U%q\", \
            \"urlpath\": \"%U\", \
            \"urlquery\": \"%q\", \
            \"bytes\": %B, \
            \"method\": \"%m\", \
            \"referer\": \"%{Referer}i\", \
            \"useragent\": \"%{User-agent}i\" \
           }" ls_apache_json
CustomLog /var/log/apache2/logstash_test.ls_json ls_apache_json

Logstash configuration:

input {
   file {
      path => "/var/log/apache2/*.ls_json"
      tags => "apache_json"
      codec => "json"
   }
}

filter {
    geoip {
	add_tag => [ "GeoIP" ]
	database => "/opt/logstash/GeoLiteCity.dat"
	source => "clientip"
    }
    if [useragent] != "-" and [useragent] != "" {
      useragent {
        add_tag => [ "UA" ]
        source => "useragent"
      }
    }
    if [bytes] == 0 { mutate { remove => "[bytes]" } }
    if [geoip][city_name]      == "" { mutate { remove => "[geoip][city_name]" } }
    if [geoip][continent_code] == "" { mutate { remove => "[geoip][continent_code]" } }
    if [geoip][country_code2]  == "" { mutate { remove => "[geoip][country_code2]" } }
    if [geoip][country_code3]  == "" { mutate { remove => "[geoip][country_code3]" } }
    if [geoip][country_name]   == "" { mutate { remove => "[geoip][country_name]" } }
    if [geoip][latitude]       == "" { mutate { remove => "[geoip][latitude]" } }
    if [geoip][longitude]      == "" { mutate { remove => "[geoip][longitude]" } }
    if [geoip][postal_code]    == "" { mutate { remove => "[geoip][postal_code]" } }
    if [geoip][region_name]    == "" { mutate { remove => "[geoip][region_name]" } }
    if [geoip][time_zone]      == "" { mutate { remove => "[geoip][time_zone]" } }
    if [urlquery]              == "" { mutate { remove => "urlquery" } }

    if "apache_json" in [tags] {
        if [method]    =~ "(HEAD|OPTIONS)" { mutate { remove => "method" } }
        if [useragent] == "-"              { mutate { remove => "useragent" } }
        if [referer]   == "-"              { mutate { remove => "referer" } }
    }
    if "UA" in [tags] {
        if [device] == "Other" { mutate { remove => "device" } }
        if [name]   == "Other" { mutate { remove => "name" } }
        if [os]     == "Other" { mutate { remove => "os" } }
    }
}

output {
     elasticsearch {
       host => "elasticsearch.example.com"
       cluster => "elasticsearch"
     }
}

So let’s analyze these. The apache configuration now has no nesting in @fields (and there was much rejoicing), so it is considerably less cluttered. We’re writing to file here, and making the file end in ls_json (for convenience’s sake). Aside from this, there’s almost nothing different here between 1.1.x and 1.2.x configuration.

In the logstash configuration there are some big changes under the hood. Let’s look at the input first:

input {
   file {
      path => "/var/log/apache2/*.ls_json"
      tags => "apache_json"
      codec => "json"
   }
}

It’s clear we’re tailing a file here, still, so that’s the same. We’re appending the tag “apache_json” for ourselves. I opted to do this because there may be some non-json files I can’t consume this way and I want to differentiate.

The big difference here is codec. In the old example we had format => “json_event” for pre-formatted content. In Logstash 1.2.x you use a codec definition to accomplish this, but it’s not a json_event any more. The only reserved fields in logstash now are @timestamp and @version. Everything else is open.

Moving on to the filters now:

    geoip {
	add_tag => [ "GeoIP" ]
	database => "/opt/logstash/GeoLiteCity.dat"
	source => "clientip"
    }

The GeoIP filter is a wonderful addition since the early days of logstash. I won’t do more than provide a link and a basic description here. It extracts all kinds of useful data about who is visiting your web server: countries, cities, timezone, latitude and longitude, etc. Not every IP will populate every field, but we’ll get to that a bit later. Use the “source” directive to specify which field holds the IP (or host) to provide to the GeoIP filter.

    if [useragent] != "-" and [useragent] != "" {
      useragent {
        add_tag => [ "UA" ]
        source => "useragent"
      }
    }

Awesome new Logstash feature: Conditionals. Conditionals finally provide the kind of if/then/else logic that allows you to do amazing things (and probably some pretty mundane things too, ed.). Follow the link and read up on it. I’ll follow the simple flow here a bit. If the field useragent (fields are encapsulated in square brackets) is not a hyphen, and is also not empty, then perform the action, which is another filter: useragent. The useragent filter breaks down a useragent string, like “Mozilla/5.0 (iPhone; CPU iPhone OS 6_1_4 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10B350 Safari/8536.25” into useful fields, like device, os, major, minor, and name. If it can’t find the answer to some of these, it will populate them with “Other,” which I don’t want. So to save me some trouble, I will prevent this from happening by using the conditional. If it does succeed I will tag it with “UA” and it will parse the “useragent” field.

if [bytes] == 0 { mutate { remove => "[bytes]" } }

Another conditional here. Logstash will check to see if the bytes field is 0. If it is, it will remove the bytes field. This is more about clutter removal than anything else.

The remaining “remove” statements for empty GeoIP fields should be pretty simple to follow. One thing to note is that nested fields must be encapsulated as above within square brackets, e.g. [geoip][postal_code], for proper parsing.

if "apache_json" in [tags] {
        if [method]    =~ "(HEAD|OPTIONS)" { mutate { remove => "method" } }
        if [useragent] == "-"              { mutate { remove => "useragent" } }
        if [referer]   == "-"              { mutate { remove => "referer" } }
    }

Here we are checking to see if the tag “apache_json” is in the array “tags” before proceeding with other conditionals. Note that the check for “method” is using a regular expression, so it uses =~ instead of ==, and is seeing if the entry is for either “HEAD” or “OPTIONS” and will remove the “method” field in either case.

If you are especially observant you may have noticed that there is no date filter in this example, though there was in the 1.1.x example linked above. The reason here is that the timestamp is already properly formatted in ISO8601 and logstash can see that and uses it automatically. This saves a few cycles and keeps the configuration file appearing clean and orderly.
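For the curious, if your timestamp were not ISO8601 (say, the stock Apache %t format landing in a hypothetical timestamp field), the date filter would look something like this:

filter {
  date {
    # "timestamp" is a placeholder field name; the pattern matches e.g. 18/Feb/2014:09:15:32 -0700
    match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
  }
}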

output {
     elasticsearch {
       host => "elasticsearch.example.com"
       cluster => "elasticsearch"
     }
}

And we can’t let all of this expanded data go to waste, now can we? So we ship it out to elasticsearch where Kibana can get to it. Let’s take a quick look at what it looks like:
[Screenshot: a captured Apache JSON event in Kibana, showing the expanded GeoIP and useragent fields]
I’ve anonymized some of the data, but you can see that there are many more fields than just those few we capture from apache.

So, that’s my updated rendition of getting Apache to put JSON into logstash directly. What do you think?

I would love to hear your comments, corrections and ideas.

The Logstash Book

Get “The Logstash Book”

No, I am not the author. I am, however, in the credits on page 1, with links and examples drawn from some of the content on this site. The author, James Turnbull, approached me and asked if it would be okay to do so and I agreed (of course!).

If you are just learning about Logstash, Elasticsearch and Kibana, this is the book for you. I’ve read through its pages. This is the book I wish I’d had a year and a half ago when I first started with Logstash in its infancy. A lot of what you will find in this book is the result of Jordan and the Logstash community’s hard work since then to make Logstash more user-friendly and accessible. You’ll be a Logstash maven in no time if you follow the basics in this book. The book even shows the basics of extending Logstash by adding your own plugins. Who knows? Maybe you’ll become a Logstash contributor too!

This is the first book published on Logstash, and it’s worth every penny of the inexpensive $9.99 asking price. Get it now, and get started with Logstash!

Disclaimer: I am not receiving any royalties to review or recommend this book. It succeeds on its own merits and I am pleased to be associated with it in any way.

ls-zbxstatsd – Part 1: Wrangling a zabbix key from a statsd key string.

I have just forked zbx-statsd from github into ls-zbxstatsd.

The reason for this is that zbx-statsd was not compatible with the format coming from logstash’s statsd output plugin.

Statsd format is simply “key:value|[type]”.
In logstash, “key” is different, and the format becomes “namespace.sender.’whatever you named it in the statsd output plugin’:value|[type]”. Things get more complicated when you need to split an already period-delimited “key” and figure out which part is which. What if the “sender,” which is the zabbix host you want the metrics to be stored under, is a period-delimited FQDN?

This was too much to handle so I added a delimiter. Double semicolons. With this, the format sent from logstash now looks like “namespace.sender;;.’whatever you named it in the statsd output plugin’:value|[type]”. This is much easier to split.

For now, I strip the namespace altogether. I don’t need it, and while it might be useful later, I couldn’t think of a reason to keep it, so my script expects the default “logstash” and strips that out. If you’re using this script at this time, don’t change the default namespace, or expect to edit the code. Now I’m left with “sender;;.’whatever you named it in the statsd output plugin’:value|[type]”, where:

  • sender = zabbix host
  • ‘whatever you named it in the statsd output plugin’ = item key

With the double semicolons I can easily separate the zabbix host name from the zabbix key, even if there are many periods in each.

With the resolution of this, it was time for stage two: Automatic item creation.

DEPRECATED! My current template/mapping

Update 2015-08-31: My most recent template/mapping can be found here.

2013-11-07: Another year, and things in the Logstash and Elasticsearch world have grown and changed considerably.  I am now employed by Elasticsearch to work on Logstash.  This was one of the first things they wanted me to work on.  So I am announcing that a new and improved, Logstash v1.2+ compatible mapping template is coming.  It will not be on my personal site, however.  It will be on http://www.elasticsearch.org in the main documentation there.  I will paste the link here as soon as it’s available.  In the interim, you can find a Github gist version here.

Expect this post to get updated from time to time. You can come back here to check out what I’m using and why.

2012-11-05: I now map IP addresses (clientip field) as type IP to allow for range searches. I also map the fields in the geoip filter output to allow for non-analyzed terms facet output (allows full city names with spaces; proper capitalization, etc.)

DO NOT USE THIS with Logstash v1.2+.  This is deprecated and remains here as an archived example!

curl -XPUT http://localhost:9200/_template/logstash_per_index -d '
{
    "template" : "logstash*",
    "settings" : {
        "number_of_shards" : 4,
        "index.cache.field.type" : "soft",
        "index.refresh_interval" : "5s",
        "index.store.compress.stored" : true,
        "index.query.default_field" : "@message",
        "index.routing.allocation.total_shards_per_node" : 4
    },
    "mappings" : {
        "_default_" : {
            "_all" : {"enabled" : false},
            "properties" : {
               "@fields" : {
                    "type" : "object",
                    "dynamic": true,
                    "path": "full",
                    "properties" : {
                        "clientip" : { "type": "ip" },
                        "geoip" : {
                            "type" : "object",
                            "dynamic": true,
                            "path": "full",
                            "properties" : {
                                    "area_code" : { "type": "string", "index": "not_analyzed" },
                                    "city_name" : { "type": "string", "index": "not_analyzed" },
                                    "continent_code" : { "type": "string", "index": "not_analyzed" },
                                    "country_code2" : { "type": "string", "index": "not_analyzed" },
                                    "country_code3" : { "type": "string", "index": "not_analyzed" },
                                    "country_name" : { "type": "string", "index": "not_analyzed" },
                                    "dma_code" : { "type": "string", "index": "not_analyzed" },
                                    "ip" : { "type": "string", "index": "not_analyzed" },
                                    "latitude" : { "type": "float", "index": "not_analyzed" },
                                    "longitude" : { "type": "float", "index": "not_analyzed" },
                                    "metro_code" : { "type": "float", "index": "not_analyzed" },
                                    "postal_code" : { "type": "string", "index": "not_analyzed" },
                                    "region" : { "type": "string", "index": "not_analyzed" },
                                    "region_name" : { "type": "string", "index": "not_analyzed" },
                                    "timezone" : { "type": "string", "index": "not_analyzed" }
                            }
                        }
                    }
               },
               "@message": { "type": "string", "index": "analyzed" },
               "@source": { "type": "string", "index": "not_analyzed" },
               "@source_host": { "type": "string", "index": "not_analyzed" },
               "@source_path": { "type": "string", "index": "not_analyzed" },
               "@tags": { "type": "string", "index": "not_analyzed" },
               "@timestamp": { "type": "date", "index": "not_analyzed" },
               "@type": { "type": "string", "index": "not_analyzed" }
            }
        }
    }

}
'

Using elasticsearch mappings appropriately to map as type IP, int, float, etc.

Update 2015-08-31: My most recent template/mapping can be found here.
Update 2012-11-05: My most recent template/mapping can be found here.

I am updating previous templates in blogs accordingly, just FYI.

Logstash allows you to tag certain fields as types within elasticsearch. This is useful for performing statistical analysis on numbers, such as the byte fields or the duration of a transaction in milli- or microseconds. In grok, this would be as simple as adding :int or :float to the end of an expression, e.g. %{POSINT:bytes:int}. This makes the correct mapping output when the event is sent to elasticsearch. However, since we’re trying to avoid using grok and are sending values as pre-formatted JSON, this sometimes results in values not being properly tagged.

Jordan instructed me to not encapsulate values within double-quotes if the value is a number. In doing so, the value is auto-sent as type long (for long integer). However, elasticsearch allows us to store IP addresses as type IP. This is crucial to using the range-based queries across IP blocks/subnets, e.g. clientip:[172.16.0.0 TO 172.23.255.255].
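To illustrate (the values here are invented), an event shipped like this will have bytes mapped as a long and duration as a double without any grok at all, while clientip is still just a string until a template like the one below tells Elasticsearch otherwise:

{ "bytes": 3049, "duration": 0.123, "clientip": "172.24.10.15" }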

In the past, I tried putting in :ip, just like with :int or :float. I thought it was working, because I was able to do a range query. But then it became clear that it was limited to a single dotted-quad difference, such as 192.168.0.1 TO 192.168.0.255. It would not work with a larger subnet. The way to discover if this is correctly configured or not is to pull the _mapping from your index:

curl -XGET 'http://localhost:9200/logstash-2012.10.12/_mapping?pretty=true'
(truncated)
           "clientip" : {
              "type" : "string"
            },

In this case, type “string” is not desired. We want to see type: “ip”. It turns out my mapping was misconfigured. The correct way to do this is as follows (see the mappings section in particular):

curl -XPUT http://localhost:9200/_template/logstash_per_index -d '{
    "template" : "logstash*",
    "settings" : {
        "number_of_shards" : 4,
        "index.cache.field.type" : "soft",
        "index.refresh_interval" : "5s",
        "index.store.compress.stored" : true,
        "index.query.default_field" : "@message",
        "index.routing.allocation.total_shards_per_node" : 2
    },
    "mappings" : {
        "_default_" : {
           "_all" : {"enabled" : false},
           "properties" : {
              "@fields" : {
                   "type" : "object",
                   "dynamic": true,
                   "path": "full",
                   "properties" : {
                       "clientip" : { "type": "ip"}
                   }
              },
              "@message": { "type": "string", "index": "analyzed" },
              "@source": { "type": "string", "index": "not_analyzed" },
              "@source_host": { "type": "string", "index": "not_analyzed" },
              "@source_path": { "type": "string", "index": "not_analyzed" },
              "@tags": { "type": "string", "index": "not_analyzed" },
              "@timestamp": { "type": "date", "index": "not_analyzed" },
               "@type": { "type": "string", "index": "not_analyzed" }    
           }   
        }
   }
}
'

After applying this template, now I have type “ip” showing up:

curl -XGET 'http://localhost:9200/logstash-2012.10.12/_mapping?pretty=true'
(truncated)
           "clientip" : {
              "type" : "ip"
            },

The same logic is applicable to all other fields in the object @fields (logstash’s default object for everything not prepended with an @ sign). Try it out! Enjoy! Keep in mind that this will not change existing data, but will work on new indexes created after replacing your template.
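As a quick sanity check (my own example, not part of the original mapping discussion), a subnet-wide query against the ip-typed field would look something like this. Depending on your mapping you may need to use the full @fields.clientip path in the query string:

curl -XGET 'http://localhost:9200/logstash-2012.10.12/_search?pretty=true' -d '{
  "query" : {
    "query_string" : { "query" : "clientip:[172.16.0.0 TO 172.23.255.255]" }
  }
}'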