
Post History

50% (+0 −0)
Incubator Q&A: Aggregate 404 errors from log file on the Linux command line

I'm sure there's a program somewhere that parses common log entries, but I don't know what it is. However, the task as stated is pretty simple, so I'd try to hack something together myself. You need to...

posted 1y ago by matthewsnyder · edited 1y ago by Stephen Ostermiller

Answer
#2: Post edited by Stephen Ostermiller · 2023-06-16T20:10:22Z (over 1 year ago)
This has nothing to do with syslog. This is the common log format for access logs: https://en.wikipedia.org/wiki/Common_Log_Format
I'm sure there's a program somewhere that parses common log entries, but I don't know what it is. However, the task as stated is pretty simple, so I'd try to hack something together myself. You need to:

1. Pull out the error code and referrer with a regex
2. Filter for 404 codes
3. Filter out empty referrers
4. Print requests
5. Count them

Step 1 can be done with a regex and `sed`, but you need a lot of `[]{}()+` in the pattern, and sed makes these annoying to type (you have to escape them all). So instead I would do it in a Python script.

If you use sed for step 1, you would use grep for steps 2 and 3, also with regex.

If you want to stick with the shell, you can use Python only to dump JSON, CSV, or whatever else you want. You can filter JSON with `jq` and CSV with `csvkit` or `csvq`.

Since I used Python, it's easier to use Python's syntax for the filtering as well, which is what I did:
```python
import re
import sys
from collections import Counter
from typing import NamedTuple

# Skip the first three fields (host, ident, user), then capture the
# [timestamp], the quoted request, the status code, and the quoted
# referrer; the byte count between status and referrer is skipped.
RE_SYSLOG = re.compile(r'(?:\S+ ){3}\[([^[\]]+)\] "([^"]+)" (\d+) \S+ "([^"]+)"')


class LogEntry(NamedTuple):
    timestamp: str
    request: str
    error_code: str
    referer: str


def main():
    raw = sys.stdin.readlines()

    parsed = [parse_syslog_message(s) for s in raw]
    filtered = [i for i in parsed if i.error_code == "404" and len(i.referer) > 3]

    # Print requests
    for i in filtered:
        print(i.request)


def parse_syslog_message(msg: str) -> LogEntry:
    m = RE_SYSLOG.search(msg)
    return LogEntry(*m.groups())


if __name__ == "__main__":
    main()
```

You then do `cat web.log | python parse.py` and you'll get a list of the requests you want (step 4). I assumed you do care about the request type (because it's easier that way), but I'm sure you can see how to get only the URL from `i.request`.

What remains is to count. You can do this with `uniq` (which requires pre-sorted input):
```
cat web.log | python parse.py | sort | uniq -c
      1 GET http://example.com/some-page.html HTTP/1.1
```
#1: Initial revision by matthewsnyder · 2023-06-16T19:31:34Z (over 1 year ago)
I'm sure there's a program somewhere that parses syslog entries, but I don't know what it is (well, there's journald, but I find it very arcane). However, the task as stated is pretty simple, so I'd try to hack something together myself. You need to:

1. Pull out the error code and referrer with a regex
2. Filter for 404 codes
3. Filter out empty referrers
4. Print requests
5. Count them

Step 1 can be done with a regex and `sed`, but you need a lot of `[]{}()+` in the pattern, and sed makes these annoying to type (you have to escape them all). So instead I would do it in a Python script.

If you use sed for step 1, you would use grep for steps 2 and 3, also with regex.
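For illustration, an all-shell version of steps 1-3 might look roughly like this (an untested sketch; it assumes a missing referrer is logged as `"-"`, and the `' 404 '` match is deliberately crude):
```
# Untested sketch: assumes combined log format, where an absent
# referrer is logged as "-". The status match is crude and could
# also hit a byte count of 404.
grep ' 404 ' web.log \
  | grep -v '"-"' \
  | sed -E 's/^[^"]*"([^"]+)".*/\1/'   # keep only the quoted request
```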

If you want to stick with the shell, you can use Python only to dump JSON, CSV, or whatever else you want. You can filter JSON with `jq` and CSV with `csvkit` or `csvq`.
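For example, if the script emitted one JSON object per line (say, via `print(json.dumps(i._asdict()))` instead of doing the filtering itself), `jq` could take over steps 2 and 3. This is hypothetical; the field names just mirror `LogEntry` below:
```
# Hypothetical: assumes parse.py emits one JSON object per line
# with the LogEntry field names, instead of filtering itself.
cat web.log | python parse.py \
  | jq -r 'select(.error_code == "404" and (.referer | length) > 3) | .request'
```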

Since I used Python, it's easier to use Python's syntax for filtering as well, which is what I did:
```python
import re
import sys
from collections import Counter
from typing import NamedTuple

# Skip the first three fields (host, ident, user), then capture the
# [timestamp], the quoted request, the status code, and the quoted
# referrer; the byte count between status and referrer is skipped.
RE_SYSLOG = re.compile(r'(?:\S+ ){3}\[([^[\]]+)\] "([^"]+)" (\d+) \S+ "([^"]+)"')


class LogEntry(NamedTuple):
    timestamp: str
    request: str
    error_code: str
    referer: str


def main():
    raw = sys.stdin.readlines()
   
    parsed = [parse_syslog_message(s) for s in raw]
    filtered = [i for i in parsed if i.error_code == "404" and len(i.referer) > 3]

    # Print requests
    for i in filtered:
        print(i.request)


def parse_syslog_message(msg: str) -> LogEntry:
    m = RE_SYSLOG.search(msg)
    return LogEntry(*m.groups())


if __name__ == "__main__":
    main()
```

You then do `cat web.log | python parse.py` and you'll get a list of the requests you want (step 4). I assumed you do care about the request type (because it's easier that way), but I'm sure you can see how to get only the URL from `i.request`.
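For example, since each request has the form `GET /path HTTP/1.1`, the URL is just the second whitespace-separated field, which you can peel off without touching the Python:
```
# The request method and protocol are fields 1 and 3; $2 is the URL.
cat web.log | python parse.py | awk '{print $2}'
```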

What remains is to count. You can do this with `uniq` (which requires pre-sorted input): 
```
cat web.log | python parse.py | sort | uniq -c
      1 GET http://example.com/some-page.html HTTP/1.1
```
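And if you want the most frequent offenders first (an extra step, not strictly asked for), add a numeric sort on the counts:
```
# sort -rn orders by the leading count, largest first
cat web.log | python parse.py | sort | uniq -c | sort -rn
```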