Getting Apache to output JSON (for logstash)

Update: This page is for the now deprecated Logstash 1.1.x and older. Look for the updated version of this here: https://untergeek.com/2013/09/11/getting-apache-to-output-json-for-logstash-1-2-x/

Last time we looked at ways to improve logstash/elasticsearch with elasticsearch templates. Today we’ll save ourselves a lot of grok parsing pain with apache’s custom log feature.

Disclaimer: This only works with versions of logstash supporting the UDP input. You can adapt this to send or log in another way if you like, e.g. send the JSON to a file and have logstash tail it.

Let’s look first, then explain later. If you are using an Include line in your apache config (e.g. Include conf.d/*.conf), all you need to do is put this in a standalone or vhost conf file. On a single-host apache, I create a logstash.conf and put this in it:

LogFormat "{ 
            "@vips":["vip.example.com","customer.example.net"], 
            "@source":"file://host.example.com//usr/local/apache2/logs/access_log", 
            "@source_host": "host.example.com", 
            "@source_path": "/usr/local/apache2/logs/access_log", 
            "@tags":["Application","Customer"], 
            "@message": "%h %l %u %t \"%r\" %>s %b", 
            "@fields": { 
                "timestamp": "%{%Y-%m-%dT%H:%M:%S%z}t", 
                "clientip": "%a", 
                "duration": %D, 
                "status": %>s, 
                "request": "%U%q", 
                "urlpath": "%U", 
                "urlquery": "%q", 
                "method": "%m", 
                "bytes": %B 
                }  
           }" ls_apache_json

CustomLog "|/usr/local/bin/udpclient.pl 127.0.0.1 57080" ls_apache_json

Some of this should look straightforward, but let me point out some pitfalls I had to dig myself out of.

  • bytes: in the @message, I use %b, but in @fields, I use %B. The reason is summed up nicely on http://httpd.apache.org/docs/2.2/mod/mod_log_config.html :

    %B Size of response in bytes, excluding HTTP headers.
    %b Size of response in bytes, excluding HTTP headers. In CLF format, i.e. a '-' rather than a 0 when no bytes are sent.

    In other words, since I’m trying to send an integer value, if I don’t choose %B, I may send a - (dash/hyphen) when there is no value to send, causing the field to be categorized as a string (see the sample output just after this list). Jordan says that sending the value as a bare integer in the JSON (i.e. no quotes) should make it into ES as an integer. This may yet require a mapping/template.

  • @message: This is the apache common log format. You could easily substitute the fields that make up the apache combined log format. Or, you could leave the common format as @message and simply add fields for the user-agent and referer if you want to collect those:
                    "referer": \"%{Referer}i\", 
                    "useragent": \"%{User-agent}i\" 
  • timestamp: Jordan has had no problems with passing @timestamp directly, but I have had nothing but problems. Perhaps I can get a solution linked here some time, but in the meantime, I simply spit out the timestamp here in ISO8601, and then use date and mutate in logstash.conf:
    input {
       udp {
          port => 57080
          type => "apache"
          buffer_size => 8192
          format => "json_event"
       }
    }
    
    filter {
       # parse the "timestamp" field as ISO8601 and use it to set @timestamp
       date {
           type => "apache"
           timestamp => "ISO8601"
       }
       # the original timestamp field is now redundant, so drop it
       mutate {
            type   => "apache"
            remove => [ "timestamp" ]
       }
    }
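To illustrate the %b vs. %B pitfall from the first bullet, here is roughly what the bytes field would look like in the emitted JSON for a response that sent no body (sample output, not captured from a real server):

"bytes": -      (from %b: the CLF dash, which is not valid JSON)
"bytes": 0      (from %B: a proper integer)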

What comes out is ready (except for the date munging) for feeding into elasticsearch, and even has an @message field for searching. This method also makes it trivial to add extra fields (get them from http://httpd.apache.org/docs/2.2/mod/mod_log_config.html) without doing anything extra or having to re-work your grok patterns. As I mentioned previously, I keep the common log format for @message, then add the other fields (like duration, user-agent and referer) as needed. An apachectl restart is all it takes to get the new values into elasticsearch.
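For example, to also record which virtual host served each request, you could add one more line inside @fields. The field name is my own choice here; %v is apache's canonical ServerName format string:

                \"vhost\": \"%v\", \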

And for the sake of a complete solution, the udpclient.pl script:

#!/usr/bin/perl
# udpclient.pl -- read log lines from STDIN and send each one as a UDP datagram

use strict;
use warnings;
use IO::Socket::INET;

my $host = $ARGV[0];
my $port = $ARGV[1];

# flush after every write
$| = 1;

my ($socket, $logdata);

# We call IO::Socket::INET->new() to create the UDP socket
# and bind with the PeerAddr.
$socket = IO::Socket::INET->new(
   PeerAddr => "$host:$port",
   Proto    => 'udp'
) or die "ERROR in Socket Creation : $!\n";

while ($logdata = <STDIN>) {
    $socket->send($logdata);
}

$socket->close();
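To sanity-check the pipeline without involving apache at all, you can pipe a hand-built event straight through the script (the event below is a made-up minimal example):

echo '{"@message": "test event", "@source_host": "host.example.com", "@tags": ["test"]}' | /usr/local/bin/udpclient.pl 127.0.0.1 57080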

I also tend to think that one of the best things about this solution is that it does not interfere with your current logging in any way. It simply captures more data and sends it, pre-formatted, over local UDP to logstash, and then on to whatever output(s) you have defined.

Using templates to improve elasticsearch caching (with logstash)

This page is deprecated!

Use the default Logstash template instead.

Update 2014-05-16: Logstash now comes with a default template (which descended from these efforts); it’s available here.

Update 2012-11-05: My most recent template/mapping can be found here.

I find that logstash does a great job with the default index mapping behavior in elasticsearch if you are not sending a ton of log events.  Once that amount begins to grow, however, there is a need to manage how it is indexed.  One great way to maximize your elasticsearch performance is to use an index template.

curl -XPUT http://localhost:9200/_template/logstash_per_index -d '
{
    "template" : "logstash*",
    "settings" : {
        "number_of_shards" : 4,
        "index.cache.filter.expire" : "5m",
        "index.cache.field.expire" : "5m",
        "index.refresh_interval" : "5s",
        "index.store.compress.stored" : true,
        "index.query.default_field" : "@message",
        "index.routing.allocation.total_shards_per_node" : 2
    },
    "mappings" : {
        "_default_" : {
             "_all" : {"enabled" : false}
        }
    }
}
'

This template was awesome! I let my field cache expire after 5 minutes to prevent it from overfilling. I have compression turned on to save space. I have 4 nodes and 4 shards (plus the default of 1 replica per shard). With a fixed maximum of 2 total shards per node (4 primaries plus 4 replicas spread across 4 nodes), I typically would have one primary shard and one replica of each day's index on each node. I was using @message as my default search field and dropping the _all field for space considerations. And then I learned about the soft cache type.

The field cache is awesome in elasticsearch, and you want that data to be persistent for as long as possible. It makes searches lightning fast! A 5-minute expiration time doesn’t help with that at all, and the tendency to re-cache the same data was painful. I was hoping for another solution, but could never find one. Then, on the elasticsearch LinkedIn group, I found this article: http://blog.sematext.com/2012/05/17/elasticsearch-cache-usage/. After learning about this change and how it had benefited others, I had to make the change myself. Now my template looks like this:

curl -XPUT http://localhost:9200/_template/logstash_per_index -d '{
    "template" : "logstash*",
    "settings" : {
        "number_of_shards" : 4,
        "index.cache.field.type" : "soft",
        "index.refresh_interval" : "5s",
        "index.store.compress.stored" : true,
        "index.query.default_field" : "@message",
        "index.routing.allocation.total_shards_per_node" : 2
    },
    "mappings" : {
        "_default_" : {
            "_all" : {"enabled" : false},
            "properties" : {
                "@fields" : {
                    "type" : "object",
                    "dynamic" : true,
                    "path" : "full",
                    "properties" : {
                        "clientip" : { "type" : "ip" }
                    }
                },
                "@message" : { "type" : "string", "index" : "analyzed" },
                "@source" : { "type" : "string", "index" : "not_analyzed" },
                "@source_host" : { "type" : "string", "index" : "not_analyzed" },
                "@source_path" : { "type" : "string", "index" : "not_analyzed" },
                "@tags" : { "type" : "string", "index" : "not_analyzed" },
                "@timestamp" : { "type" : "date", "index" : "not_analyzed" },
                "@type" : { "type" : "string", "index" : "not_analyzed" }
            }
        }
    }
}
'

The change means that I don’t need cache expiry any more. The built-in GC engine will cover that for me! The downside is that it will take a full 30 days for this solution to get fully caught up, as the index.cache.field.type setting can’t be applied to existing indexes; only the new daily indexes pick it up as they are created.
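If you want to confirm that the template is in place and that new indexes are picking up the settings, you can read the template back and inspect a freshly created index (the index name below is a hypothetical daily logstash index):

curl -XGET http://localhost:9200/_template/logstash_per_index?pretty
curl -XGET http://localhost:9200/logstash-2012.11.05/_settings?pretty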

Time to geek blog about Logstash and Zabbix again

It’s been so long since I did any kind of geek blogging I figure it’s time I lived up to my name again.

I’ve taken to running Logstash and Elasticsearch as a centralized logging engine.  I’ve been doing so for over a year now.  The cluster I created and managed for my company was indexing over 16,000 records per second during the busiest time of day.  Hopefully I’ll be able to put some useful data here for my fellow Logstash users.

I also did a lot of work with Zabbix for my company, plus consulted for another company on the side.  I’ll have to put some of those details here for anyone interested.

Reawakening

It’s been such a long time since I wrote in earnest, anywhere.  So much has changed for me.  So many things are different in my life.

I live in Texas now, having spent all but 2 years of my life in Utah.  I’ve been here for almost 4 years now and I love it!  Texas is home, now.  I can hardly imagine living somewhere else.

I work in an extremely challenging and engaging business.  Having been a Unix Systems Admin for around 8 years in Utah, I took a position as a Java Application Admin at my current company.  The similarities and the differences are striking.  I use the standard set of Unix admin tools still on a daily basis.  At the same time, I spend a lot more time poring over log files from Weblogic and JBoss JVMs, implementing and creating new ways to monitor and visualize data, and making sure our customers’ apps are running optimally.

I am older and more mature (hopefully).  I have discovered that I spent most of my life looking at the world around me with the myopic view of one with symptoms of Asperger’s Syndrome.  There has been a shift in me since then.  At first, this discovery was like putting on corrective lenses which suddenly brought my past into focus.  This was incredibly uncomfortable.  They say that hindsight is 20/20, but this experience was more like 20/10 or better.  All of my interpersonal mishaps and miscommunications were suddenly plain to me, and I still have painful recollections of something I did or said long ago.  The pain is diminished, significantly, from what it initially was.  Embarrassment can be a difficult thing for a perfectionist.  I’m learning and growing still, however.  That’s what really matters.

With that realization, however, come many new opportunities to see the world as it actually is or as it can be.  This reawakening is ongoing for me.  I have not forgotten my past, nor do I seek to hide from it.  The old blog is still around at http://oldblog.untergeek.com with a disclaimer about its meaning as a part of my history.  I do hope to convey, however, that my life has new meaning.  This is my reawakening.