Greetings, travelers, who may have come to this page by way of my other page on this subject, which covers the same recipe for logstash version 1.1.x.

Logstash 1.2.1 is brand new as of this edition.  The changes to my Apache CustomLog JSON recipe are in! I’ve since updated this page to drop the prune filter entirely in favor of the new logstash conditionals.

Apache configuration:

LogFormat "{ \
            \"@timestamp\": \"%{%Y-%m-%dT%H:%M:%S%z}t\", \
            \"@version\": \"1\", \
            \"vips\":[\"vip.example.com\"], \
            \"tags\":[\"apache\"], \
            \"message\": \"%h %l %u %t \\\"%r\\\" %>s %b\", \
            \"clientip\": \"%a\", \
            \"duration\": %D, \
            \"status\": %>s, \
            \"request\": \"%U%q\", \
            \"urlpath\": \"%U\", \
            \"urlquery\": \"%q\", \
            \"bytes\": %B, \
            \"method\": \"%m\", \
            \"referer\": \"%{Referer}i\", \
            \"useragent\": \"%{User-agent}i\" \
           }" ls_apache_json
CustomLog /var/log/apache2/logstash_test.ls_json ls_apache_json

Logstash configuration:

input {
   file {
      path => "/var/log/apache2/*.ls_json"
      tags => "apache_json"
      codec => "json"
   }
}

filter {
    geoip {
        add_tag  => [ "GeoIP" ]
        database => "/opt/logstash/GeoLiteCity.dat"
        source   => "clientip"
    }
    if [useragent] != "-" and [useragent] != "" {
      useragent {
        add_tag => [ "UA" ]
        source => "useragent"
      }
    }
    if [bytes] == 0 { mutate { remove => "[bytes]" } }
    if [geoip][city_name]      == "" { mutate { remove => "[geoip][city_name]" } }
    if [geoip][continent_code] == "" { mutate { remove => "[geoip][continent_code]" } }
    if [geoip][country_code2]  == "" { mutate { remove => "[geoip][country_code2]" } }
    if [geoip][country_code3]  == "" { mutate { remove => "[geoip][country_code3]" } }
    if [geoip][country_name]   == "" { mutate { remove => "[geoip][country_name]" } }
    if [geoip][latitude]       == "" { mutate { remove => "[geoip][latitude]" } }
    if [geoip][longitude]      == "" { mutate { remove => "[geoip][longitude]" } }
    if [geoip][postal_code]    == "" { mutate { remove => "[geoip][postal_code]" } }
    if [geoip][region_name]    == "" { mutate { remove => "[geoip][region_name]" } }
    if [geoip][time_zone]      == "" { mutate { remove => "[geoip][time_zone]" } }
    if [urlquery]              == "" { mutate { remove => "urlquery" } }

    if "apache_json" in [tags] {
        if [method]    =~ "(HEAD|OPTIONS)" { mutate { remove => "method" } }
        if [useragent] == "-"              { mutate { remove => "useragent" } }
        if [referer]   == "-"              { mutate { remove => "referer" } }
    }
    if "UA" in [tags] {
        if [device] == "Other" { mutate { remove => "device" } }
        if [name]   == "Other" { mutate { remove => "name" } }
        if [os]     == "Other" { mutate { remove => "os" } }
    }
}

output {
     elasticsearch {
       host => "elasticsearch.example.com"
       cluster => "elasticsearch"
     }
}

So let’s analyze these. The Apache configuration now has no nesting in @fields (and there was much rejoicing), so it is considerably less cluttered. We’re writing to a file here, and making the file name end in ls_json for convenience’s sake. Aside from this, there’s almost nothing different between the 1.1.x and 1.2.x configurations.

In the logstash configuration there are some big changes under the hood. Let’s look at the input first:

input {
   file {
      path => "/var/log/apache2/*.ls_json"
      tags => "apache_json"
      codec => "json"
   }
}

It’s clear we’re still tailing a file here, so that much is the same. We’re appending the tag “apache_json” ourselves. I opted to do this because there may be some non-JSON files I can’t consume this way, and I want to be able to differentiate.

The big difference here is the codec. In the old example we had format => "json_event" for pre-formatted content. In Logstash 1.2.x you use a codec definition to accomplish this, but it’s not a json_event any more. The only reserved fields in logstash now are @timestamp and @version; everything else is open.
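For comparison, here’s roughly how the two generations of input configuration line up. The 1.1.x snippet below is reconstructed from memory, so treat it as a sketch rather than gospel:

input {
   file {
      # Logstash 1.1.x: the input itself declared the format
      path   => "/var/log/apache2/*.ls_json"
      format => "json_event"
   }
}

input {
   file {
      # Logstash 1.2.x: a codec decodes each line as JSON instead
      path  => "/var/log/apache2/*.ls_json"
      codec => "json"
   }
}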

Moving on to the filters now:

    geoip {
        add_tag  => [ "GeoIP" ]
        database => "/opt/logstash/GeoLiteCity.dat"
        source   => "clientip"
    }

The GeoIP filter is a wonderful addition since the early days of logstash. I won’t do more than provide a link and a basic description here. It extracts all kinds of useful data about who is visiting your web server: country, city, time zone, latitude and longitude, etc. Not every IP will populate every field, but we’ll get to that a bit later. Use the “source” directive to specify which field holds the IP (or hostname) to hand to the GeoIP filter.
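If you’d rather stop empty GeoIP fields from being added in the first place, the filter also accepts (if I’m remembering the option name right) a “fields” directive that whitelists what it adds; here’s a sketch:

    geoip {
        add_tag  => [ "GeoIP" ]
        database => "/opt/logstash/GeoLiteCity.dat"
        source   => "clientip"
        # Only add the fields we actually use; anything not listed never appears.
        fields   => [ "country_name", "city_name", "latitude", "longitude" ]
    }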

    if [useragent] != "-" and [useragent] != "" {
      useragent {
        add_tag => [ "UA" ]
        source => "useragent"
      }
    }

Awesome new Logstash feature: conditionals. Conditionals finally provide the kind of if/then/else logic that allows you to do amazing things (and probably some pretty mundane things too, ed.). Follow the link and read up on them; I’ll just walk through the simple flow here. If the field useragent (fields are referenced in square brackets) is not a hyphen, and is also not empty, then perform the action, which is another filter: useragent. The useragent filter breaks down a useragent string, like "Mozilla/5.0 (iPhone; CPU iPhone OS 6_1_4 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10B350 Safari/8536.25", into useful fields like device, os, major, minor, and name. If it can’t determine some of these, it populates them with “Other,” which I don’t want, so the conditional saves me some trouble by skipping useragent values that would yield nothing useful. When the filter does run, it parses the “useragent” field and tags the event with “UA”.
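As an aside, if you’d rather those parsed fields not land at the top level of the event, the useragent filter also supports (as I recall) a “target” directive that nests them, much like the GeoIP data. A quick sketch:

    useragent {
        add_tag => [ "UA" ]
        source  => "useragent"
        # Nest the parsed fields under [ua], giving [ua][name], [ua][os],
        # [ua][device], and so on, instead of top-level fields.
        target  => "ua"
    }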

if [bytes] == 0 { mutate { remove => "[bytes]" } }

Another conditional here. Logstash will check to see if the bytes field is 0. If it is, it will remove the bytes field. This is more about clutter removal than anything else.
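This numeric comparison works because %B is written without surrounding quotes in the LogFormat above, so it arrives as a JSON number rather than a string. A zero-byte response would produce a line along these lines (abbreviated, with made-up values):

{ "@timestamp": "2013-09-11T12:00:00-0600", "status": 304, "bytes": 0, "method": "GET" }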

The remaining “remove” statements for empty GeoIP fields should be pretty simple to follow. One thing to note is that nested fields must be encapsulated in square brackets as above, e.g. [geoip][postal_code], for proper parsing.

if "apache_json" in [tags] {
        if [method]    =~ "(HEAD|OPTIONS)" { mutate { remove => "method" } }
        if [useragent] == "-"              { mutate { remove => "useragent" } }
        if [referer]   == "-"              { mutate { remove => "referer" } }
    }

Here we check whether the tag “apache_json” is in the “tags” array before proceeding with the other conditionals. Note that the check on “method” uses a regular expression, so it uses =~ instead of ==; it matches when the method is either “HEAD” or “OPTIONS” and removes the “method” field in either case.
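One caveat worth hedging on: =~ performs a regular-expression match, not an exact comparison, so "(HEAD|OPTIONS)" will match anywhere in the string. If you want to be strict about it, anchoring the pattern is a small, safe tweak (my suggestion, not part of the original recipe):

    if [method] =~ "^(HEAD|OPTIONS)$" { mutate { remove => "method" } }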

If you are especially observant, you may have noticed that there is no date filter in this example, though there was one in the 1.1.x example linked above. The reason is that the timestamp is already formatted as ISO8601, so logstash recognizes it and uses it automatically. This saves a few cycles and keeps the configuration file clean and orderly.
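For the curious, if your timestamp were not already ISO8601, you would still need an explicit date filter. Assuming you had captured the raw Apache %t value into a field named "timestamp" (a hypothetical name for this sketch), it would look something like:

    date {
        match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
    }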

output {
     elasticsearch {
       host => "elasticsearch.example.com"
       cluster => "elasticsearch"
     }
}

And we can’t let all of this expanded data go to waste, now can we? So we ship it out to elasticsearch where Kibana can get to it. Let’s take a quick look:
[Screenshot: logstash_json_capture — a captured event in Kibana, showing the expanded field set]
I’ve anonymized some of the data, but you can see that there are many more fields than just those few we capture from apache.

So, that’s my updated rendition of getting Apache to put JSON into logstash directly. What do you think?

I would love to hear your comments, corrections and ideas.


13 Responses to Getting Apache to output JSON (for logstash 1.2.x)

  1. […] Update: This page is for the now deprecated Logstash 1.1.x and older. Look for the updated version of this here: http://untergeek.com/2013/09/11/getting-apache-to-output-json-for-logstash-1-2-x/ […]

  2. Lucas says:

    Hi Untergeek, Could you please advise? I use your apache log format above to pipe the log over UDP to a logstash listener on a remote server. This is working but the log sent from my linux server is incorrectly parsed:
    Example JSON received from the Windows server:
    [code]
    {
      "_index": "logstash-2013.10.08",
      "_type": "logs",
      "_id": "TXLC6ZTwTL-CDRvY5iB46w",
      "_score": null,
      "_source": {
        "@timestamp": "2013-10-08T08:23:35.000Z",
        "@version": "1",
        "tags": [ "apache", "apache_json", "GeoIP", "UA" ],
        "message": "212.nn.nn.nn - - [08/Oct/2013:10:23:35 +0200] \"GET /icons/unknown.gif HTTP/1.1\" 304 -",
        "clientip": "212.nn.nn.nn",
        "duration": 2000,
        "status": 304,
        "request": "/icons/unknown.gif",
        "urlpath": "/icons/unknown.gif",
        "method": "GET",
        "referer": "https://nn.nn.nn/testserver/uploaded/",
        "useragent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.69 Safari/537.36",
        "host": "172.nn.nn.nn",
        "geoip": {
          "ip": "212.nn.nn.nn",
          "country_code": 161,
          "country_code2": "NL",
          "country_code3": "NLD",
          "country_name": "Netherlands",
          "continent_code": "EU"
        },
        "name": "Chrome",
        "os": "Windows 7",
        "major": "30",
        "minor": "0",
        "patch": "1599"
      },
      "sort": [ 1381220615000 ]
    }
    [/code]
    Example JSON received from the Linux server:
    [code]
    {
      "_index": "logstash-2013.10.08",
      "_type": "logs",
      "_id": "SOzpr-AKRWOTdiYz8DEwFg",
      "_score": null,
      "_source": {
        "message": "{ \"@timestamp\": \"2013-10-08T10:25:49W. Europe Daylight Time\", \"@version\": \"1\", \"tags\":[\"apache\"], \"message\": \"62.nn.nn.nn - \"\" [08/Oct/2013:10:25:49 +0200] \\\"GET /testserver/n HTTP/1.1\\\" 500 540\", \"clientip\": \"62.nn.nn.nn\", \"duration\": 1000, \"status\": 500, \"request\": \"/testserver/n\", \"urlpath\": \"/testserver/n", \"urlquery\": \"?hw_id=ipc\", \"bytes\": 540, \"method\": \"GET\", \"referer\": \"-\", \"useragent\": \"Java/1.7.0_05\" }\n",
        "@timestamp": "2013-10-08T08:25:48.012Z",
        "@version": "1",
        "tags": [ "apache_json", "GeoIP", "UA" ],
        "host": "172.nn.nn.nn",
        "geoip": {
          "ip": null,
          "country_code": 0,
          "country_code2": "--",
          "country_code3": "--",
          "country_name": "N/A",
          "continent_code": "--"
        }
      },
      "sort": [ 1381220748012 ]
    }
    [/code]

    • Aaron says:

      Lucas,

      The linux JSON… Is an error generated? I don’t strictly see anything wrong with what you have there, other than that the GeoIP output has two hyphens instead of a single hyphen. What error is generated when it tries to index that output?

      • Lucas says:

        No error message, perhaps I can best explain it by showing you some screenshots from Kibana. A bit more about the setup. I have a reverse apache proxy (windows) and an apache test server (linux).

        The screenshots show the kibana logging of both servers:
        REVERSE PROXY:
        - http://i.imgur.com/4zzyAHb.png
        - http://i.imgur.com/Dmwki3g.png

        TEST SERVER
        - http://img191.imageshack.us/img191/5124/vyaz.png
        - http://i.imgur.com/SjPskpT.png

        As you can see, it looks like for the test server the name-value pairs are not recognised and are instead bundled into the ‘message’ value

        • Aaron says:

          Hmmm. Clearly it is not recognizing the message as valid JSON and breaking it up appropriately. But if so, how is the GeoIP filter getting the ip field? Is the email you provided valid? If so, please respond and I will contact you directly. I would like to see your logstash configuration to help troubleshoot this.

          • Lucas says:

            My email address is valid :) Thank you very much for taking the time to assist me! About the GeoIP config, I used your guide as input and the mapping is done on my Logstash server. I am happy to forward the configuration when I receive your email. Thanks again!

  3. Philipp says:

    Maybe it makes sense to put the user agent information below ua so you get ua.name, ua.device, … like the geoip data which is already nested.

    What do you think of nested fields in general? The docs have a nice example here
    http://logstash.net/docs/1.2.2/configuration#conditionals

    How can I store nested fields with the grok filter?

  4. […] Untergeek’s blog posts: Old & New, this is how you would have done it […]

  5. saha says:

    I have a JSON file. I need an OR operator in the code below, where "text": "$76" would work LIKE "text": "$10 OR $11 OR $9". Is there any way to do this in JSON?
    {
      "type": "verifyText",
      "locator": {
        "type": "id",
        "value": "line1"
      },
      "text": "$10"
    }

    • Aaron says:

      I don’t understand. Are you trying to run a query in Elasticsearch for documents matching those fields? JSON is an object notation, not a query language in itself.

      • saha says:

        I am using selenium builder for web testing and saving file in JSON.

        • Aaron says:

          Um… I’m still confused.

          This post is specifically about configuring Apache to output a custom log format in JSON for Logstash consumption. What does that have to do with Selenium Builder?

          Perhaps your question would be better suited for a venue such as Stack Overflow?
