Greetings, travelers, who may have come to this page by way of my other page on this subject, which covers the same recipe for logstash version 1.1.x.

Logstash 1.2.1 is brand new as of this edition.  The changes to my Apache CustomLog JSON recipe are in! I’ve since updated this page to drop the prune filter entirely in favor of the new logstash conditionals.

Apache configuration:

LogFormat "{ \
            \"@timestamp\": \"%{%Y-%m-%dT%H:%M:%S%z}t\", \
            \"@version\": \"1\", \
            \"vips\":[\"vip.example.com\"], \
            \"tags\":[\"apache\"], \
            \"message\": \"%h %l %u %t \\\"%r\\\" %>s %b\", \
            \"clientip\": \"%a\", \
            \"duration\": %D, \
            \"status\": %>s, \
            \"request\": \"%U%q\", \
            \"urlpath\": \"%U\", \
            \"urlquery\": \"%q\", \
            \"bytes\": %B, \
            \"method\": \"%m\", \
            \"referer\": \"%{Referer}i\", \
            \"useragent\": \"%{User-agent}i\" \
           }" ls_apache_json
CustomLog /var/log/apache2/logstash_test.ls_json ls_apache_json

Logstash configuration:

input {
   file {
      path => "/var/log/apache2/*.ls_json"
      tags => "apache_json"
      codec => "json"
   }
}

filter {
    geoip {
        add_tag  => [ "GeoIP" ]
        database => "/opt/logstash/GeoLiteCity.dat"
        source   => "clientip"
    }
    if [useragent] != "-" and [useragent] != "" {
      useragent {
        add_tag => [ "UA" ]
        source => "useragent"
      }
    }
    if [bytes] == 0 { mutate { remove => "[bytes]" } }
    if [geoip][city_name]      == "" { mutate { remove => "[geoip][city_name]" } }
    if [geoip][continent_code] == "" { mutate { remove => "[geoip][continent_code]" } }
    if [geoip][country_code2]  == "" { mutate { remove => "[geoip][country_code2]" } }
    if [geoip][country_code3]  == "" { mutate { remove => "[geoip][country_code3]" } }
    if [geoip][country_name]   == "" { mutate { remove => "[geoip][country_name]" } }
    if [geoip][latitude]       == "" { mutate { remove => "[geoip][latitude]" } }
    if [geoip][longitude]      == "" { mutate { remove => "[geoip][longitude]" } }
    if [geoip][postal_code]    == "" { mutate { remove => "[geoip][postal_code]" } }
    if [geoip][region_name]    == "" { mutate { remove => "[geoip][region_name]" } }
    if [geoip][time_zone]      == "" { mutate { remove => "[geoip][time_zone]" } }
    if [urlquery]              == "" { mutate { remove => "urlquery" } }

    if "apache_json" in [tags] {
        if [method]    =~ "(HEAD|OPTIONS)" { mutate { remove => "method" } }
        if [useragent] == "-"              { mutate { remove => "useragent" } }
        if [referer]   == "-"              { mutate { remove => "referer" } }
    }
    if "UA" in [tags] {
        if [device] == "Other" { mutate { remove => "device" } }
        if [name]   == "Other" { mutate { remove => "name" } }
        if [os]     == "Other" { mutate { remove => "os" } }
    }
}

output {
     elasticsearch {
       host => "elasticsearch.example.com"
       cluster => "elasticsearch"
     }
}

So let’s analyze these. The Apache configuration now has no nesting in @fields (and there was much rejoicing), so it is considerably less cluttered. We’re writing to a file here, and making the file name end in ls_json for convenience’s sake. Aside from this, there’s almost nothing different between the 1.1.x and 1.2.x configurations.

In the logstash configuration there are some big changes under the hood. Let’s look at the input first:

input {
   file {
      path => "/var/log/apache2/*.ls_json"
      tags => "apache_json"
      codec => "json"
   }
}

It’s clear we’re still tailing a file here, so that much is the same. We’re appending the tag “apache_json” ourselves. I opted to do this because there may be some non-JSON files I can’t consume this way, and I want to be able to differentiate.

The big difference here is the codec. In the old example we had format => "json_event" for pre-formatted content. In Logstash 1.2.x you use a codec definition to accomplish this, but it’s not a json_event any more. The only reserved fields in logstash now are @timestamp and @version; everything else is open.
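For comparison, here’s roughly how the two generations of input configuration line up. The 1.1.x snippet below is reconstructed from memory, so treat it as a sketch rather than gospel:

input {
   file {
      # Logstash 1.1.x: the input itself declared the format
      path   => "/var/log/apache2/*.ls_json"
      format => "json_event"
   }
}

input {
   file {
      # Logstash 1.2.x: a codec decodes each line as JSON instead
      path  => "/var/log/apache2/*.ls_json"
      codec => "json"
   }
}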

Moving on to the filters now:

    geoip {
        add_tag  => [ "GeoIP" ]
        database => "/opt/logstash/GeoLiteCity.dat"
        source   => "clientip"
    }

The GeoIP filter is a wonderful addition since the early days of logstash. I won’t do more than provide a link and a basic description here. It extracts all kinds of useful data about who is visiting your web server: country, city, time zone, latitude and longitude, etc. Not every IP will populate every field, but we’ll get to that a bit later. Use the “source” directive to specify which field holds the IP (or hostname) to hand to the GeoIP filter.
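If you’d rather stop empty GeoIP fields from being added in the first place, the filter also accepts (if I’m remembering the option name right) a “fields” directive that whitelists what it adds; here’s a sketch:

    geoip {
        add_tag  => [ "GeoIP" ]
        database => "/opt/logstash/GeoLiteCity.dat"
        source   => "clientip"
        # Only add the fields we actually use; anything not listed never appears.
        fields   => [ "country_name", "city_name", "latitude", "longitude" ]
    }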

    if [useragent] != "-" and [useragent] != "" {
      useragent {
        add_tag => [ "UA" ]
        source => "useragent"
      }
    }

Awesome new Logstash feature: conditionals. Conditionals finally provide the kind of if/then/else logic that allows you to do amazing things (and probably some pretty mundane things too, ed.). Follow the link and read up on them; I’ll just walk through the simple flow here. If the field useragent (fields are referenced in square brackets) is not a hyphen, and is also not empty, then perform the action, which is another filter: useragent. The useragent filter breaks down a useragent string, like "Mozilla/5.0 (iPhone; CPU iPhone OS 6_1_4 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10B350 Safari/8536.25", into useful fields like device, os, major, minor, and name. If it can’t determine some of these, it populates them with “Other,” which I don’t want, so the conditional saves me some trouble by skipping useragent values that would yield nothing useful. When the filter does run, it parses the “useragent” field and tags the event with “UA”.
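As an aside, if you’d rather those parsed fields not land at the top level of the event, the useragent filter also supports (as I recall) a “target” directive that nests them, much like the GeoIP data. A quick sketch:

    useragent {
        add_tag => [ "UA" ]
        source  => "useragent"
        # Nest the parsed fields under [ua], giving [ua][name], [ua][os],
        # [ua][device], and so on, instead of top-level fields.
        target  => "ua"
    }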

if [bytes] == 0 { mutate { remove => "[bytes]" } }

Another conditional here. Logstash will check to see if the bytes field is 0. If it is, it will remove the bytes field. This is more about clutter removal than anything else.
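This numeric comparison works because %B is written without surrounding quotes in the LogFormat above, so it arrives as a JSON number rather than a string. A zero-byte response would produce a line along these lines (abbreviated, with made-up values):

{ "@timestamp": "2013-09-11T12:00:00-0600", "status": 304, "bytes": 0, "method": "GET" }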

The remaining “remove” statements for empty GeoIP fields should be pretty simple to follow. One thing to note is that nested fields must be encapsulated in square brackets as above, e.g. [geoip][postal_code], for proper parsing.

if "apache_json" in [tags] {
        if [method]    =~ "(HEAD|OPTIONS)" { mutate { remove => "method" } }
        if [useragent] == "-"              { mutate { remove => "useragent" } }
        if [referer]   == "-"              { mutate { remove => "referer" } }
    }

Here we check whether the tag “apache_json” is in the “tags” array before proceeding with the other conditionals. Note that the check on “method” uses a regular expression, so it uses =~ instead of ==; it matches when the method is either “HEAD” or “OPTIONS” and removes the “method” field in either case.
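One caveat worth hedging on: =~ performs a regular-expression match, not an exact comparison, so "(HEAD|OPTIONS)" will match anywhere in the string. If you want to be strict about it, anchoring the pattern is a small, safe tweak (my suggestion, not part of the original recipe):

    if [method] =~ "^(HEAD|OPTIONS)$" { mutate { remove => "method" } }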

If you are especially observant, you may have noticed that there is no date filter in this example, though there was one in the 1.1.x example linked above. The reason is that the timestamp is already formatted as ISO8601, so logstash recognizes it and uses it automatically. This saves a few cycles and keeps the configuration file clean and orderly.
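For the curious, if your timestamp were not already ISO8601, you would still need an explicit date filter. Assuming you had captured the raw Apache %t value into a field named "timestamp" (a hypothetical name for this sketch), it would look something like:

    date {
        match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
    }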

output {
     elasticsearch {
       host => "elasticsearch.example.com"
       cluster => "elasticsearch"
     }
}

And we can’t let all of this expanded data go to waste, now can we? So we ship it out to elasticsearch where Kibana can get to it. Let’s take a quick look:
[Screenshot: logstash_json_capture — a captured event in Kibana, showing the expanded field set]
I’ve anonymized some of the data, but you can see that there are many more fields than just those few we capture from apache.

So, that’s my updated rendition of getting Apache to put JSON into logstash directly. What do you think?

I would love to hear your comments, corrections and ideas.


13 Responses to Getting Apache to output JSON (for logstash 1.2.x)

  1. […] Update: This page is for the now deprecated Logstash 1.1.x and older. Look for the updated version of this here: http://untergeek.com/2013/09/11/getting-apache-to-output-json-for-logstash-1-2-x/ […]

  2. Lucas says:

    Hi Untergeek, Could you please advise? I use your apache log format above to pipe the log over UDP to a logstash listener on a remote server. This is working but the log sent from my linux server is incorrectly parsed:
    Example JSON received from the Windows server:
    [code]
    {
      "_index": "logstash-2013.10.08",
      "_type": "logs",
      "_id": "TXLC6ZTwTL-CDRvY5iB46w",
      "_score": null,
      "_source": {
        "@timestamp": "2013-10-08T08:23:35.000Z",
        "@version": "1",
        "tags": [ "apache", "apache_json", "GeoIP", "UA" ],
        "message": "212.nn.nn.nn - - [08/Oct/2013:10:23:35 +0200] \"GET /icons/unknown.gif HTTP/1.1\" 304 -",
        "clientip": "212.nn.nn.nn",
        "duration": 2000,
        "status": 304,
        "request": "/icons/unknown.gif",
        "urlpath": "/icons/unknown.gif",
        "method": "GET",
        "referer": "https://nn.nn.nn/testserver/uploaded/",
        "useragent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.69 Safari/537.36",
        "host": "172.nn.nn.nn",
        "geoip": {
          "ip": "212.nn.nn.nn",
          "country_code": 161,
          "country_code2": "NL",
          "country_code3": "NLD",
          "country_name": "Netherlands",
          "continent_code": "EU"
        },
        "name": "Chrome",
        "os": "Windows 7",
        "major": "30",
        "minor": "0",
        "patch": "1599"
      },
      "sort": [ 1381220615000 ]
    }
    [/code]
    Example JSON received from the Linux server:
    [code]
    {
      "_index": "logstash-2013.10.08",
      "_type": "logs",
      "_id": "SOzpr-AKRWOTdiYz8DEwFg",
      "_score": null,
      "_source": {
        "message": "{ \"@timestamp\": \"2013-10-08T10:25:49W. Europe Daylight Time\", \"@version\": \"1\", \"tags\":[\"apache\"], \"message\": \"62.nn.nn.nn - \"\" [08/Oct/2013:10:25:49 +0200] \\\"GET /testserver/n HTTP/1.1\\\" 500 540\", \"clientip\": \"62.nn.nn.nn\", \"duration\": 1000, \"status\": 500, \"request\": \"/testserver/n\", \"urlpath\": \"/testserver/n", \"urlquery\": \"?hw_id=ipc\", \"bytes\": 540, \"method\": \"GET\", \"referer\": \"-\", \"useragent\": \"Java/1.7.0_05\" }\n",
        "@timestamp": "2013-10-08T08:25:48.012Z",
        "@version": "1",
        "tags": [ "apache_json", "GeoIP", "UA" ],
        "host": "172.nn.nn.nn",
        "geoip": {
          "ip": null,
          "country_code": 0,
          "country_code2": "--",
          "country_code3": "--",
          "country_name": "N/A",
          "continent_code": "--"
        }
      },
      "sort": [ 1381220748012 ]
    }
    [/code]

    • Aaron says:

      Lucas,

      The linux JSON… Is an error generated? I don’t strictly see anything wrong with what you have there, other than that the GeoIP output has two hyphens instead of a single hyphen. What error is generated when it tries to index that output?

      • Lucas says:

        No error message, perhaps I can best explain it by showing you some screenshots from Kibana. A bit more about the setup. I have a reverse apache proxy (windows) and an apache test server (linux).

        The screenshots show the kibana logging of both servers:
        REVERSE PROXY:
        - http://i.imgur.com/4zzyAHb.png
        - http://i.imgur.com/Dmwki3g.png

        TEST SERVER
        - http://img191.imageshack.us/img191/5124/vyaz.png
        - http://i.imgur.com/SjPskpT.png

        As you can see, it looks like for the test server the name-value pairs are not recognised and are instead bundled into the ‘message’ value

        • Aaron says:

          Hmmm. Clearly it is not recognizing the message as valid JSON and breaking it up appropriately. But if so, how is the GeoIP filter getting the ip field? Is the email you provided valid? If so, please respond and I will contact you directly. I would like to see your logstash configuration to help troubleshoot this.

          • Lucas says:

            My email address is valid :) Thank you very much for taking the time to assist me! About the GeoIP config, I used your guide as input and the mapping is done on my Logstash server. I am happy to forward the configuration when I receive your email. Thanks again!

  3. Philipp says:

    Maybe it makes sense to put the user agent information below ua so you get ua.name, ua.device, … like the geoip data which is already nested.

    What do you think of nested fields in general? The docs have a nice example here
    http://logstash.net/docs/1.2.2/configuration#conditionals

    How can I store nested fields with the grok filter?

  4. […] Untergeek’s blog posts: Old & New, this is how you would have done it […]

  5. saha says:

    I have a JSON file. I need an OR operator in the code below, where "text": "$76" would work LIKE "text": "$10 OR $11 OR $9". Is there any way to do this in JSON?
    {
      "type": "verifyText",
      "locator": {
        "type": "id",
        "value": "line1"
      },
      "text": "$10"
    }

    • Aaron says:

      I don’t understand. Are you trying to run a query in Elasticsearch for documents matching those fields? JSON is an object notation, not a query language in itself.

      • saha says:

        I am using selenium builder for web testing and saving file in JSON.

        • Aaron says:

          Um… I’m still confused.

          This post is specifically about configuring Apache to output a custom log format in JSON for Logstash consumption. What does that have to do with Selenium Builder?

          Perhaps your question would be better suited for a venue such as Stack Overflow?
