Parsing and Analyzing Apache Log Files Using Linux Commands

An Apache access log is typically written in the combined log format:

%h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-agent}i"

Each variable is defined as follows:

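%h   =  IP address of the client (remote host) that made the request
%l   =  RFC 1413 identity of the client (usually "-", since it is rarely available)
%u   =  User ID of the person requesting the document, as determined by HTTP authentication
%t   =  Time that the server received the request
%r   =  Request line from the client, in double quotes
%>s  =  Status code that the server sends back to the client
%b   =  Size of the object returned to the client in bytes, not including the response headers
%{Referer}i     =  Referer request header (the page the client reports having come from)
%{User-agent}i  =  User-Agent request header (the browser or bot that made the request)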
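
To make the mapping concrete, here is a small sketch that labels the fields of a single log line. It assumes the line uses plain ASCII double quotes and hyphens, as Apache writes them. The variable names line, by_space and by_quote are only for illustration.

```shell
# One line in the format above (assumed: plain ASCII quotes and hyphens).
line='141.8.184.13 - - [26/Mar/2016:06:50:53 +0000] "GET /robots.txt HTTP/1.1" 404 510 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"'

# Fields that contain no spaces can be picked by whitespace position.
# %t spans two fields ($4 and $5) because of the space before the time zone.
by_space=$(printf '%s\n' "$line" | awk '{
    print "host=" $1
    print "ident=" $2
    print "user=" $3
    print "time=" $4 " " $5
    print "status=" $9
    print "bytes=" $10
}')

# Quoted fields may contain spaces, so split on the double quote instead.
by_quote=$(printf '%s\n' "$line" | awk -F'"' '{
    print "request=" $2
    print "referer=" $4
    print "agent=" $6
}')

printf '%s\n' "$by_space" "$by_quote"
```

The whitespace positions only hold for the fields before the user agent. Anything inside quotes is safer to pick with the quote separator.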

Sample Log file data

106.51.136.220 - - [26/Mar/2016:06:48:21 +0000] "GET /index.php HTTP/1.1" 200 6933 "http://www.tuskerdatalab.com/about.php" "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0"
141.8.184.13 - - [26/Mar/2016:06:50:53 +0000] "GET /robots.txt HTTP/1.1" 404 510 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"
69.162.124.228 - - [26/Mar/2016:06:52:48 +0000] "HEAD / HTTP/1.1" 200 205 "-" "Mozilla/5.0+(compatible; UptimeRobot/2.0; http://www.uptimerobot.com/)"
61.135.189.111 - - [26/Mar/2016:06:57:33 +0000] "GET / HTTP/1.1" 200 6952 "-" "Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)"
69.162.124.228 - - [26/Mar/2016:06:57:48 +0000] "HEAD / HTTP/1.1" 200 205 "-" "Mozilla/5.0+(compatible; UptimeRobot/2.0; http://www.uptimerobot.com/)"
66.249.66.42 - - [26/Mar/2016:07:00:44 +0000] "GET /analytics-technology-solution.php HTTP/1.1" 200 6174 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
182.64.232.101 - - [26/Mar/2016:07:01:20 +0000] "GET /images/tech/java_logo.png HTTP/1.1" 200 20524 "https://www.google.co.in/" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36"
69.162.124.228 - - [26/Mar/2016:07:02:48 +0000] "HEAD / HTTP/1.1" 200 205 "-" "Mozilla/5.0+(compatible; UptimeRobot/2.0; http://www.uptimerobot.com/)"
69.162.124.228 - - [26/Mar/2016:07:07:48 +0000] "HEAD / HTTP/1.1" 200 205 "-" "Mozilla/5.0+(compatible; UptimeRobot/2.0; http://www.uptimerobot.com/)"

Token Separation using awk command

cat access.log | awk '{print $1}' # IP address (%h)

cat access.log | awk '{print $4,$5}' # date/time (%t)

cat access.log | awk '{print $9}' # status code (%>s)

cat access.log | awk '{print $10}' # size (%b)

cat access.log | awk -F\" '{print $2}' # request line (%r): method, URI and protocol

cat access.log | awk -F\" '{print $4}' # Referer URL

cat access.log | awk -F\" '{print $6}' # user agent
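
The request line and the timestamp can be broken down one step further. The sketch below is only an illustration: it writes two of the sample lines to a temporary file (so it does not touch a real access.log), splits the request line into method, path and protocol with awk's split(), and trims the square brackets from the timestamp with substr(). The variable names log, requests and times are made up for this example.

```shell
# Write two sample lines to a temporary file (a stand-in for access.log).
log=$(mktemp)
cat > "$log" <<'EOF'
106.51.136.220 - - [26/Mar/2016:06:48:21 +0000] "GET /index.php HTTP/1.1" 200 6933 "http://www.tuskerdatalab.com/about.php" "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0"
69.162.124.228 - - [26/Mar/2016:06:52:48 +0000] "HEAD / HTTP/1.1" 200 205 "-" "Mozilla/5.0+(compatible; UptimeRobot/2.0; http://www.uptimerobot.com/)"
EOF

# $2 (split on the double quote) is the request line; split it again on
# spaces to get the method, the path and the protocol.
requests=$(awk -F'"' '{ split($2, req, " "); print req[1], req[2], req[3] }' "$log")

# $4 starts with "[" and $5 ends with "]"; drop those two characters.
times=$(awk '{ print substr($4, 2), substr($5, 1, length($5) - 1) }' "$log")

printf '%s\n' "$requests" "$times"
rm -f "$log"
```

Printing only req[2] instead of all three parts gives just the requested path, which is usually what you want to count.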

Aggregation commands

sort – sorts all the lines, so that identical lines end up next to each other
uniq -c – collapses adjacent identical lines into one line and prefixes it with the count
uniq -ci – same as uniq -c, but ignores case when comparing lines
sort -rg – sorts numerically (g) in descending order (r)
head -n N – prints the first N lines
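
Since uniq only merges lines that sit next to each other, the sort in front of it matters. Here is a tiny sketch with made-up input (five status codes in arrival order) that shows the difference. The variable names codes, unsorted, grouped and top are just for this example.

```shell
# Five status codes in the order they arrived (hypothetical input).
codes=$(printf '%s\n' 200 404 200 200 404)

# Without sort: 200 and 404 each show up as more than one group, because
# uniq -c only counts runs of adjacent identical lines.
unsorted=$(printf '%s\n' "$codes" | uniq -c)

# With sort first: one group per distinct value, then highest count first.
grouped=$(printf '%s\n' "$codes" | sort | uniq -c | sort -rg)

# head -n 1 keeps only the most frequent value.
top=$(printf '%s\n' "$grouped" | head -n 1)

printf 'unsorted:\n%s\ngrouped:\n%s\ntop:\n%s\n' "$unsorted" "$grouped" "$top"
```

The same sort | uniq -c | sort -rg | head chain works on any single column that awk prints.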

Examples

# Get the top 20 user agents
cat access.log | awk -F\" '{print $6}' | sort | uniq -c | sort -rg | head -n 20

# Get the top 20 IP addresses that requests came from
cat access.log | awk '{print $1}' | sort | uniq -c | sort -rg | head -n 20

# For a certain IP, find all the user agents
cat access.log | grep -F "185.106.121.128" | awk -F\" '{print $6}' | sort | uniq -c | sort -rg

# Find the referers, other than your own site, from which your image files are requested
awk -F\" '($2 ~ /\.(jpg|gif)/ && $4 !~ /^http:\/\/www\.tuskerdatalab\.com/){print $4}' access.log | sort | uniq -c | sort -rg
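
The same pattern works for status codes. As one more sketch (not taken from a real server), the block below writes four of the sample lines to a temporary file, counts the responses per status code, and lists the paths that returned 404. The variable names log, status_counts and not_found are made up for this example.

```shell
# Write four sample lines to a temporary file (a stand-in for access.log).
log=$(mktemp)
cat > "$log" <<'EOF'
106.51.136.220 - - [26/Mar/2016:06:48:21 +0000] "GET /index.php HTTP/1.1" 200 6933 "http://www.tuskerdatalab.com/about.php" "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0"
141.8.184.13 - - [26/Mar/2016:06:50:53 +0000] "GET /robots.txt HTTP/1.1" 404 510 "-" "Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"
69.162.124.228 - - [26/Mar/2016:06:52:48 +0000] "HEAD / HTTP/1.1" 200 205 "-" "Mozilla/5.0+(compatible; UptimeRobot/2.0; http://www.uptimerobot.com/)"
61.135.189.111 - - [26/Mar/2016:06:57:33 +0000] "GET / HTTP/1.1" 200 6952 "-" "Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)"
EOF

# Count the responses per status code, most frequent first.
status_counts=$(awk '{print $9}' "$log" | sort | uniq -c | sort -rg)

# List the requested paths ($7) that returned 404, with their counts.
not_found=$(awk '$9 == 404 {print $7}' "$log" | sort | uniq -c | sort -rg)

printf 'status codes:\n%s\n404 paths:\n%s\n' "$status_counts" "$not_found"
rm -f "$log"
```

On a real log, replace "$log" with the path to your access.log and add head -n 20 at the end to keep the list short.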
