
I need to select specific data from log files. I need two scripts:

  1. I need to select all IP addresses that only visited /page1
  2. I need to select all IP addresses that visited /page1 but never visited /page2

I have my desired logs in a .tar file. I want them extracted into a folder; then I will use the scripts to parse them and remove ALL duplicated IP addresses.
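The extract-then-parse workflow might look like the sketch below. The archive name `access-logs.tar` and the log names are assumptions, and the first step only builds a tiny sample archive so the sketch is self-contained; replace it with your real .tar file.

```shell
# Illustration only: build a tiny sample archive (replace with your real .tar)
mkdir -p demo
printf '10.0.0.1 - - [22/May/2016:06:31:18 -0400] "GET /page1 HTTP/1.1" 200 1 "-" "UA"\n' > demo/access.log
tar -cf access-logs.tar -C demo .

# The actual workflow: extract the archive into a folder, then parse
mkdir -p logs
tar -xf access-logs.tar -C logs
grep -h "/page1" logs/*.log > /tmp/res.txt
```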

This is what I have so far:

# filter /page1 visitors
grep "/page1" access.log > /tmp/res.txt
# take the IP portion of each record
grep -o '^[[:digit:]]*\.[[:digit:]]*\.[[:digit:]]*\.[[:digit:]]*' /tmp/res.txt > result.txt

Typical access log looks like

162.158.86.83 - - [22/May/2016:06:31:18 -0400] "GET /page1?vtid=nb3 HTTP/1.1" 301 128 "-" "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0"
Rui F Ribeiro
Delirium
  • Can you explain what you have tried, please? – Kevdog777 May 26 '16 at 13:10
  • I did this filter /page visitors `cat access.log | grep /page1 > /tmp/res.txt` And `# take the IP portion of record cat res.txt | grep '^[[:alnum:]]*\.[[:alnum:]]*\.[[:alnum:]]*\.[[:alnum:]]*' -o > result.txt` – Delirium May 26 '16 at 13:11
  • Please [edit] your question and show us an example of your input and your desired output. Also, include what you tried in the question itself, not in the comments. Comments are hard to read and easy to miss and can be deleted without warning. – terdon May 26 '16 at 13:20
  • Thanks for the edit but *we need to see an example of your input and your desired output*. We can't help you parse something unless you tell us what we're parsing. – terdon May 26 '16 at 13:37
  • I tried to edit but since your question is not clear, I'm not sure I did a good job. What does "ALL duplicated IP addresses."? Do you want to print all duplicates? Do you want to remove all duplicates? – terdon May 26 '16 at 13:43
  • I edited it once again and I hope in right way, sorry for mistakes, I feel pretty stupid right now. – Delirium May 26 '16 at 13:49

2 Answers

2
awk '/\/page1/ {print $1}' /path/to/access.log | sort -u > result.txt

If you want a count of each unique IP, change sort -u to sort | uniq -c
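For example, the count variant looks like this (the mini log below is made-up data, just to show the output shape):

```shell
# Made-up mini log: two /page1 hits from one IP, one from another
cat > sample.log <<'EOF'
10.0.0.1 - - [22/May/2016:06:31:18 -0400] "GET /page1?vtid=nb3 HTTP/1.1" 301 128 "-" "UA"
10.0.0.1 - - [22/May/2016:06:32:00 -0400] "GET /page1 HTTP/1.1" 200 99 "-" "UA"
10.0.0.2 - - [22/May/2016:06:33:00 -0400] "GET /page1 HTTP/1.1" 200 99 "-" "UA"
EOF

# each unique IP, prefixed with its number of /page1 requests
awk '/\/page1/ {print $1}' sample.log | sort | uniq -c
# prints counts like "2 10.0.0.1" and "1 10.0.0.2"
```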

If you want to match only the request-path field of the log (rather than the entire line) against /page1:

awk '$7 ~ /^\/page1/ {print $1}' /path/to/access.log | sort -u > result.txt

Note: I think nginx access logs use the same format as apache access logs. If not, count the fields (count every space, including the one between the Date:Time and the TimeZone) in the nginx log, and use the correct field number instead of $7
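A quick way to check the field number is to print $7 for a single line and see whether the request path comes out. This uses the sample line from the question:

```shell
# awk splits on runs of whitespace, so in this format the path is field 7
line='162.158.86.83 - - [22/May/2016:06:31:18 -0400] "GET /page1?vtid=nb3 HTTP/1.1" 301 128 "-" "UA"'
printf '%s\n' "$line" | awk '{print $7}'
# prints: /page1?vtid=nb3
```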

Finally, if you want to print both the IP address (or hostname if they've already been resolved) and the request path:

awk -v OFS='\t' '$7 ~ /^\/page1/ {print $1, $7}' /path/to/access.log |
    sort -u > result.txt

To see IP addresses that have visited /page1 but have never visited /page2:

awk '$7 ~ /^\/page1/ {print $1}' /path/to/access.log | sort -u > result1.txt
awk '$7 ~ /^\/page2/ {print $1}' /path/to/access.log | sort -u > result2.txt
comm -2 -3 result1.txt result2.txt

comm's -2 option suppresses lines that appear only in result2.txt, and -3 suppresses lines that appear in both files. The output is thus lines that appear only in result1.txt.

See man comm for more details.
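A tiny worked example of the comm step (the IPs are made up stand-ins for real result1.txt / result2.txt contents):

```shell
# comm needs sorted input; both lists are already sorted here
printf '1.1.1.1\n2.2.2.2\n3.3.3.3\n' > result1.txt
printf '2.2.2.2\n4.4.4.4\n' > result2.txt

# -2 hides lines unique to result2.txt, -3 hides lines common to both
comm -2 -3 result1.txt result2.txt
# prints: 1.1.1.1 and 3.3.3.3 (visited /page1, never /page2)
```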

cas
  • And how I will know this IPs was not on `/page2` and how I will delete duplicates? Is that possible in same time when is proceded awk command? – Delirium May 26 '16 at 13:30
  • 1. the `awk` script ignores everything but matching lines. `/page2` doesn't match `/page1` so it will be ignored. 2. i didn't notice the no-dupes part of your question, i'll update my answer. – cas May 26 '16 at 13:37
  • the script WILL, however, match `/page10`, `/page11`, `/page100` and so on (exactly as your `grep` command does). I'd have to know more about your URL paths before I could come up with a reliable regexp to exclude them too. i.e. what normally comes immediately after `/page1` in your logs. is `/page` the beginning of the request path? please add some example log lines to your question. – cas May 26 '16 at 13:43
  • Access log for /page1 looks like `162.158.86.83 - - [22/May/2016:06:31:18 -0400] "GET /page1?vtid=nb3 HTTP/1.1" 301 128 "-" "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0"` I don't know if "awk" command will work even there are some arguments in that URL. – Delirium May 26 '16 at 13:45
  • awk has no problem with URL args. btw, don't post long examples or code in comments. edit your question and add them there. – cas May 26 '16 at 13:53
  • sorry for maybe being stupid @cas but solve that problem with this condition? `IP addresses that visited /page1 but never visited /page2` Sorry cas, I already edited my question before few mins. – Delirium May 26 '16 at 13:54
  • No, that's much more complicated to do in one pass. it's midnight here and i don't have time to write anything but a simple script. I'll add a simple method to my answer. – cas May 26 '16 at 13:57
0
  • Create a sorted list of IPs that visited Page1
  • Create a sorted list of IPs that visited Page2
  • Use "diff" on the two lists to find those that visited one page without visiting the other (the '>' or '<' symbol at the beginning of each line distinguishes those from Page1 from those of Page2)
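A sketch of those three steps, using made-up IP lists in place of the real awk output (the file names are placeholders):

```shell
# Steps 1 & 2: sorted IP lists for each page (stand-ins for real output)
printf '1.1.1.1\n2.2.2.2\n3.3.3.3\n' > page1-ips.txt
printf '2.2.2.2\n4.4.4.4\n' > page2-ips.txt

# Step 3: '<' lines are IPs only in page1-ips.txt, '>' only in page2-ips.txt
# (diff exits nonzero when the files differ, hence the || true)
diff page1-ips.txt page2-ips.txt || true

# keep just the Page1-only IPs by filtering the '<' lines
diff page1-ips.txt page2-ips.txt | sed -n 's/^< //p'
# prints: 1.1.1.1 and 3.3.3.3
```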
xenoid