3

I want to sort directories in an s3 storage by the date embedded in their name.

When I run

s3cmd ls s3://xyz/private/backups/mails/daily/ | awk '{print $2}'

it lists the directories like

s3://xyz/private/backups/mails/daily/01_Apr_2020/
s3://xyz/private/backups/mails/daily/02_Apr_2020/
s3://xyz/private/backups/mails/daily/03_Apr_2020/
s3://xyz/private/backups/mails/daily/04_Apr_2020/
s3://xyz/private/backups/mails/daily/05_Apr_2020/
s3://xyz/private/backups/mails/daily/06_Apr_2020/
s3://xyz/private/backups/mails/daily/07_Apr_2020/
s3://xyz/private/backups/mails/daily/08_Apr_2020/
s3://xyz/private/backups/mails/daily/09_Apr_2020/
s3://xyz/private/backups/mails/daily/10_Apr_2020/
s3://xyz/private/backups/mails/daily/11_Apr_2020/
s3://xyz/private/backups/mails/daily/12_Apr_2020/
s3://xyz/private/backups/mails/daily/13_Apr_2020/
s3://xyz/private/backups/mails/daily/14_Apr_2020/
s3://xyz/private/backups/mails/daily/15_Apr_2020/
s3://xyz/private/backups/mails/daily/30_Mar_2020/
s3://xyz/private/backups/mails/daily/31_Mar_2020/

I want these displayed in date order, so that the output looks something like this

s3://xyz/private/backups/mails/daily/30_Mar_2020/
s3://xyz/private/backups/mails/daily/31_Mar_2020/
s3://xyz/private/backups/mails/daily/01_Apr_2020/
s3://xyz/private/backups/mails/daily/02_Apr_2020/
s3://xyz/private/backups/mails/daily/03_Apr_2020/
....
....

I tried sorting with column and -M (for month) flag, but it isn't working.

My goal is deleting directories older than n days, but since s3cmd ls doesn't return the creation/modified date of directories, I have to do it the hard way.

How may I make this work?

K7AAY
Pankaj Jha

2 Answers

4
... |
awk -F'[/_]' '{printf "%04d%02d%02d %s\n", $(NF-1), index("  JanFebMarAprMayJunJulAugSepOctNovDec",$(NF-2))/3, $(NF-3), $0}' |
sort |
sed 's/[0-9]* //'

Notice the 2 leading spaces in "  Jan..."; that's not a bug. Indexes in awk start from 1, not from 0 as in many other languages, so the padding puts "Jan" at position 3, and dividing by 3 yields month number 1.
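To make the mechanics concrete, here is a minimal sketch of the same decorate, sort, undecorate pipeline wrapped in a shell function. The function name `sort_by_embedded_date` is made up for illustration (it is not part of s3cmd), and the sketch assumes every path ends in `DD_Mon_YYYY/` with English three-letter month names.

```shell
# Hypothetical helper: reads paths ending in DD_Mon_YYYY/ on stdin
# and prints them in chronological order.
sort_by_embedded_date() {
    awk -F'[/_]' '{
        # The trailing "/" leaves an empty last field, so:
        #   $(NF-1) = year, $(NF-2) = month name, $(NF-3) = day
        # index() is 1-based: "Jan" is found at 3, "Feb" at 6, ...
        # so dividing by 3 gives the month number 1..12.
        m = index("  JanFebMarAprMayJunJulAugSepOctNovDec", $(NF-2)) / 3
        printf "%04d%02d%02d %s\n", $(NF-1), m, $(NF-3), $0
    }' | sort | sed 's/[0-9]* //'
}

# Example run on three of the paths from the question
printf '%s\n' \
    's3://xyz/private/backups/mails/daily/01_Apr_2020/' \
    's3://xyz/private/backups/mails/daily/31_Mar_2020/' \
    's3://xyz/private/backups/mails/daily/30_Mar_2020/' |
    sort_by_embedded_date
```

Because the fields are addressed relative to `NF`, extra `/` or `_` characters earlier in the path should not disturb the result.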

  • This is simply amazing. Any idea how I can exclude the bottom `n` directories, where n is the number I want to retain? For example, to keep the last 7 days, it would drop the bottom 7 and give me the rest. – Pankaj Jha Apr 10 '20 at 17:05
  • Use `.. | tail -n7` to keep just the last 7, or `.. | head -n -7` to keep all but the last 7. –  Apr 10 '20 at 17:08
  • That was very helpful. I am marking this as the accepted answer. Just one request: can you please explain your answer a little, so that novices like us looking at the same issue can understand it better? – Pankaj Jha Apr 10 '20 at 17:35
  • @PankajJha where should I start from? What is giving you trouble? The awk thing just prepends a `YYYYMMDD` (numeric year, month and day) to each line (which sorts the same numerically and lexicographically). –  Apr 10 '20 at 17:47
  • Got it now, Thanks – Pankaj Jha Apr 10 '20 at 18:28
  • `+1` but `index("  JanFebMarAprMayJunJulAugSepOctNovDec",$(NF-2))/3` is more commonly written `(index("JanFebMarAprMayJunJulAugSepOctNovDec",$(NF-2))+2)/3`, and `cut` is more commonly used than `sed` to remove the prepended string after sorting, and `\t` is then more commonly used than `" "` as the separator added by `awk` since that's `cut`'s default separator. So it'd be: `awk -F'[/_]' '{printf "%04d%02d%02d\t%s\n", $(NF-1), (index("JanFebMarAprMayJunJulAugSepOctNovDec",$(NF-2))+2)/3, $(NF-3), $0}' | sort | cut -f2-`. – Ed Morton Apr 11 '20 at 12:59
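The suggestions in the comments above (a tab-separated sort key, `cut -f2-`, and dropping the newest entries) could be combined roughly as follows. `all_but_newest` is a hypothetical name, and the final step uses awk rather than `head -n -N` so the sketch does not depend on GNU head. It only prints candidates for deletion; it does not delete anything.

```shell
# Hypothetical helper: print every path except the N newest ones,
# oldest first. Input on stdin, paths ending in DD_Mon_YYYY/.
all_but_newest() {
    awk -F'[/_]' '{
        m = (index("JanFebMarAprMayJunJulAugSepOctNovDec", $(NF-2)) + 2) / 3
        printf "%04d%02d%02d\t%s\n", $(NF-1), m, $(NF-3), $0
    }' | sort | cut -f2- |
    awk -v n="$1" '
        { line[NR] = $0 }                       # buffer the sorted list
        END { for (i = 1; i <= NR - n; i++) print line[i] }
    '
}

# Example: keep the 2 newest of 4 backups and list the rest
printf '%s\n' \
    's3://xyz/private/backups/mails/daily/01_Apr_2020/' \
    's3://xyz/private/backups/mails/daily/02_Apr_2020/' \
    's3://xyz/private/backups/mails/daily/30_Mar_2020/' \
    's3://xyz/private/backups/mails/daily/31_Mar_2020/' |
    all_but_newest 2
```

Note that this counts entries, not calendar days, so it matches "older than n days" only if there is exactly one directory per day.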
2

A GNU awk alternative with use of date

awk -F'[/_]' '{
    D=$(NF-3)"-"$(NF-2)"-"$(NF-1);
    "date +%Y-%m-%d -d "D|getline nd;
    print nd, $0
}' file1 | sort | cut -d" " -f 2

Walkthrough

Split $0 into fields on / or _

awk -F'[/_]' '{

Recompose them as a valid date

    D=$(NF-3)"-"$(NF-2)"-"$(NF-1);

Call the external date command (GNU date, for -d) to convert the month from text to a number, and read its output back into a new variable through awk's getline

    "date +%Y-%m-%d -d "D|getline nd;

Print the converted date in front of the original line, then sort on it and cut it off again

    print nd, $0
}' file1 | sort | cut -d" " -f 2

Output

s3://xyz/private/backups/mails/daily/30_Mar_2020/
s3://xyz/private/backups/mails/daily/31_Mar_2020/
s3://xyz/private/backups/mails/daily/01_Apr_2020/
s3://xyz/private/backups/mails/daily/02_Apr_2020/
s3://xyz/private/backups/mails/daily/03_Apr_2020/
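A more defensive variant of the same idea checks getline's return value and closes the command after every line, so a failed `date` call cannot silently reuse the previous line's value. This is only a sketch: the function name `sort_by_date_cmd` is made up, it assumes GNU date (for `-d`), and it assumes the input is trusted, since each date string is passed to a shell.

```shell
# Hypothetical helper; assumes GNU date and paths ending in DD_Mon_YYYY/
sort_by_date_cmd() {
    awk -F'[/_]' '{
        d = $(NF-3) "-" $(NF-2) "-" $(NF-1)      # e.g. 30-Mar-2020
        # \047 is a single-quote character, used to quote the date argument
        cmd = "date +%Y%m%d -d \047" d "\047"
        nd = "N/A"                                # fallback if date fails
        if ((cmd | getline line) > 0)
            nd = line
        close(cmd)                                # one process per line, then clean up
        print nd "\t" $0
    }' | sort | cut -f2-
}

printf '%s\n' \
    's3://xyz/private/backups/mails/daily/01_Apr_2020/' \
    's3://xyz/private/backups/mails/daily/31_Mar_2020/' |
    sort_by_date_cmd
```

It still spawns one process per input line, so it will be noticeably slower than pure text manipulation on long listings.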

A slightly leaner alternative using gensub (GNU awk only)

awk -F'/' '{
    "date +%Y%m%d -d "gensub("_","-","g",$(NF-1))|getline nd;
    print nd, $0
}' file1 | sort | cut -d" " -f2
bu5hman
  • That would be orders of magnitude slower than just manipulating the text as @mosvy is doing and if you're using GNU tools then GNU awk has built in time functions so you don't need to spawn a shell and call `date` for every input line. Also your `cut` should be `-f2-`, not just `-f2`, in case any of the input file names contain blanks. It also introduces the possibility of a getline failure going undetected and silently messing up the output by duplicating the value from the previous successful call, see http://awk.freeshell.org/AllAboutGetline for how to call getline to guard against that. – Ed Morton Apr 11 '20 at 13:02
  • I couldn't spot the `awk` time function that would do the conversion but am no guru there. Assistance? – bu5hman Apr 11 '20 at 14:48
  • It looks kinda silly since you don't need to call a date function to do this at all, but `awk 'BEGIN{split("01_Apr_2020",t,/_/); print strftime("%Y%m%d",mktime(t[3]" "(index("JanFebMarAprMayJunJulAugSepOctNovDec",t[2])+2)/3" "t[1]" 0 0 0"))}'` would output `20200401`. There's also a `strptime()` library that recently became available for gawk (see https://groups.google.com/forum/#!msg/comp.lang.awk/Ft6_h7NEIaE/tmyxd94hEAAJ) but you've got to download and compile stuff to use it so I doubt it'll ever actually be used. – Ed Morton Apr 11 '20 at 16:27
  • TBH I was just looking for any way not to type in "janfeb..." and then `tolower()` if you want to avoid case issues etc. in one-liners. Seems there is nothing to be had off the shelf ... not that it's hard to roll up a function to throw in the library for own use. – bu5hman Apr 11 '20 at 19:16
  • If you're going to use `date | getline` though then you really need to write `D=$(NF-3)"-"$(NF-2)"-"$(NF-1); cmd="date +%Y-%m-%d -d "D; nd="N/A"; if ( (cmd |getline line) > 0 ) nd=line; close(cmd)` so it's not briefer than writing `nd=sprintf("%04d%02d%02d",$(NF-1), (index("JanFebMarAprMayJunJulAugSepOctNovDec",$(NF-2))+2)/3, $(NF-3))` and it's vastly slower than it and it requires GNU date. – Ed Morton Apr 11 '20 at 20:12
  • That did get me wondering though if there's a way to generate a list of month numbers so you don't have to type them and best I could come up with was `locale mon` but then you still need to trim each month name to 3 chars so at the end of the day it's easier just to type them. – Ed Morton Apr 11 '20 at 20:59
  • I wasted my time the same way ;) ... Anyway, I did a time comparison over 1,000 iterations and the `date` call is about one order of magnitude (10x) slower than @mosvy's. So then I just rolled my own function and threw it in a library script. No appreciable difference between doing this and inline code. At least I won't need to type this again for my own use. – bu5hman Apr 12 '20 at 16:36