# Aggregate 404 errors from log file on the Linux command line
My web server is logging in combined log format and my host gives me SSH access to my server where the logs are stored. I see entries in the log file for 404 errors like:
```
10.10.10.10 - - [10/Jun/2023:10:00:00 +0000] "GET http://example.com/some-page.html HTTP/1.1" 404 - "https://referringsite.example/linking-page.html" "-"
```
Is there a way to use Linux command line tools to list the 404 URLs on my site having the most hits with referrers?
1 answer
I'm sure there's a program somewhere that parses common log entries, but I don't know what it is. However, the task as stated is pretty simple, so I'd try to hack something together myself. You need to:
1. Pull out the error code and referrer with a regex
2. Filter for 404 codes
3. Filter out empty referrers
4. Print requests
5. Count them
Step 1 can be done with a regex and `sed`, but you need a lot of `[]{}()+` in the pattern, and `sed` makes these annoying to type (you have to escape them all). So instead I would reach for a Python script. If you do use `sed` for step 1, you would use `grep` for steps 2 and 3, also with regexes.
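For reference, here's a sketch of what the `sed`/`grep` route could look like. GNU `sed`'s `-E` flag switches to extended regexes, which cuts the escaping down considerably; the file name `web.log` is an assumption.

```shell
# Sketch of steps 1-3 with sed -E and grep, assuming the combined log
# format shown in the question and a log file named web.log.
# Prints "request -> referer" for 404 lines, then drops empty ("-") referers.
sed -nE 's/^[^ ]+ [^ ]+ [^ ]+ \[[^]]+\] "([^"]+)" 404 [^ ]+ "([^"]+)".*/\1 -> \2/p' web.log |
  grep -v ' -> -$'
```

The `-n` plus the `p` flag on the substitution means only matching (i.e. 404) lines are printed, so the status-code filter comes for free.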
If you want to stick with the shell for the filtering, you can use Python only to dump JSON, CSV, or whatever else you want. You can then filter JSON with `jq` and CSV with `csvkit` or `csvq`.
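As an illustration of that route: if the Python script printed one JSON object per log line to a hypothetical `entries.jsonl` (the field names below are my invention, not a standard), the filtering could move entirely into `jq`:

```shell
# Hypothetical: one JSON object per log line, filtered entirely in jq.
# Field names (request, error_code, referer) and entries.jsonl are assumptions.
jq -r 'select(.error_code == "404" and .referer != "-") | .request' entries.jsonl
```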
Since I used Python, it's easier to use Python's syntax for filtering as well, which is what I did:
```python
import re
import sys
from typing import NamedTuple, Optional

RE_SYSLOG = re.compile(r'(?:\S+ ){3}\[([^[\]]+)\] "([^"]+)" (\d+) \S+ "([^"]+)"')


class LogEntry(NamedTuple):
    timestamp: str
    request: str
    error_code: str
    referer: str


def main():
    raw = sys.stdin.readlines()
    parsed = [parse_syslog_message(s) for s in raw]
    # Drop lines that didn't parse; keep 404s that have a real referer
    # (a "-" referer means the client sent none)
    filtered = [
        i for i in parsed
        if i is not None and i.error_code == "404" and len(i.referer) > 3
    ]
    # Print requests
    for i in filtered:
        print(i.request)


def parse_syslog_message(msg: str) -> Optional[LogEntry]:
    m = RE_SYSLOG.search(msg)
    # Return None for lines that don't match the expected format
    return LogEntry(*m.groups()) if m else None


if __name__ == "__main__":
    main()
```
You then run `cat web.log | python parse.py` and you'll get a list of the requests you want (step 4). I assumed you do care about the request method (because it's easier that way), but I'm sure you can see how to get only the URL from `i.request`.
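In case it helps: the request field has the shape `METHOD URL PROTOCOL`, so the URL is just the middle whitespace-separated token (the sample string below is the one from the question):

```python
# A combined-log request field looks like "METHOD URL PROTOCOL",
# so the URL is the middle whitespace-separated token.
request = "GET http://example.com/some-page.html HTTP/1.1"
url = request.split()[1]
print(url)  # http://example.com/some-page.html
```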
What remains is to count (step 5). You can do this with `uniq -c` (which requires pre-sorted input):
```shell
cat web.log | python parse.py | sort | uniq -c
```

which prints something like:

```
      1 GET http://example.com/some-page.html HTTP/1.1
```
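Since the question asks for the URLs with the *most* hits, one more pipe is worth mentioning: a second `sort`, this time numeric and descending, ranks the counts (`parse.py` being the Python script above, and `web.log` the assumed log file name):

```shell
# Rank by hit count, highest first; head keeps the top 10.
cat web.log | python parse.py | sort | uniq -c | sort -rn | head
```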