Here are two mutually exclusive sed loops:
sed -ne'p;/ 12 * 31 /!d;:n' -e'n;//!bn' <<""
YEAR MONTH DAY RES
1971 1 1 245
1971 1 2 587
...
1971 12 31 685
1971 1 1 245
1971 1 2 587
...
1971 12 31 685
1972 1 1 549
1972 1 2 746
...
1972 12 31 999
1972 1 1 933
1972 1 2 837
...
1972 12 31 343

YEAR MONTH DAY RES
1971 1 1 245
1971 1 2 587
...
1971 12 31 685
1972 1 1 549
1972 1 2 746
...
1972 12 31 999
Basically sed has two states here - print and eat. In the first state - the print state - sed prints every input line with p and then checks it against the / 12 * 31 / pattern. If the current pattern space does not match (that is what the ! says), it is deleted with d, and sed pulls in the next input line and starts the script again from the top - at the print command - without attempting to run anything that follows the delete command at all.
When an input line does match / 12 * 31 /, however, sed falls through to the second half of the script - the eat loop. First it defines a branch label named n with :n; then it overwrites the current pattern space with the next input line using n, and then it compares the current pattern space to the // last matched pattern. Because the line that matched before has just been overwritten with the next one, the first iteration of this eat loop doesn't match, and every time it does not (the ! again), sed branches back to the :n label to get the next input line and once again compare it to the // last matched pattern.
When another match is finally made - some 365 lines later - sed does not automatically print it when it completes its script, because of the -n switch. It just pulls in the next input line and starts again from the top, at the print command in its first state. So each loop state falls through to the other on the same key, and does as little as possible in the meantime to find the next key.
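To make the two states easier to follow, here is the same script spread over several lines with comments, run on a shortened stand-in for the input. The sample data (two lines per year instead of 365) and the variable names input and out are only for illustration:

```shell
# Tiny stand-in for the real input: each year is cut down to two lines
# and appears twice, mirroring the duplicated blocks in the example.
input='YEAR MONTH DAY RES
1971 1 1 245
1971 12 31 685
1971 1 1 245
1971 12 31 685
1972 1 1 549
1972 12 31 999
1972 1 1 933
1972 12 31 343'

# Same commands as the one-liner, one per line.
out=$(printf '%s\n' "$input" | sed -n '
    # print state: print every line that reaches the top of the script
    p
    # not a Dec 31 line: end this cycle and restart with the next line
    / 12 * 31 /!d
    # eat state: a Dec 31 line has just been printed
    :n
    # overwrite the pattern space with the next line (no print under -n)
    n
    # the empty regexp reuses / 12 * 31 /; loop until the next Dec 31 line
    //!bn
')
printf '%s\n' "$out"
```

Only the header and the first copy of each year should come out; the second copy is swallowed by the eat loop.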
Note that the entire script completes without invoking a single editing routine, and that it needs to compile only a single regexp. The automaton that results is very simple - it understands only [123 ] and [^123 ]. What's more, at least half of the comparisons will very likely be made without any compilation at all, because the only address referenced in the eat loop is the // empty one. sed can therefore complete that loop with a single regexec() call per input line. sed may do something similar for the print loop as well.
timed
I was curious about how the various answers here might perform, and so I came up with my own table:
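dash <<""
# $1=year, $2=month, d=day of month, D=days in the current month,
# t=lines written so far for the current year (both copies)
d=0 D=31 IFS=: set 1970 1
# "$*" is "year:month"; ${d#$D} is empty only when d equals D (month finished)
while case "$*:${d#$D}" in (*[!:]) ;;
# end of January, year divisible by 4: February gets 29 days
# (the d=0 side effect resets the day counter at every month end)
($(($1^($1%4)|(d=0))):1:)
D=29 set $1 2;;
# end of January otherwise: February gets 28 days
(*:1:) D=28 set $1 2;;
# end of month 3, 5, 8 or 10: the next month has 30 days
(*[3580]:)
D=30 set $1 $(($2+1));;
# any other month end: the next month has 31 days, and the year advances
# only once t has reached 730, so every year is written twice
(*:) D=31 set $(($1+!(t<730||(t=0)))) $(($2%12+1))
esac
do printf '%-6d%-4d%-4d%d\n' "$@" $((d+=1)) $((t+=1))
done| head -n1000054 >/tmp/dates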
dash <<<'' 6.62s user 6.95s system 166% cpu 8.156 total
That puts just over a million lines in /tmp/dates, with the output for each of the years 1970 - 3338 written twice. The file looks like:
tail -n1465 </tmp/dates | head; echo; tail </tmp/dates
3336 12 27 728
3336 12 28 729
3336 12 29 730
3336 12 30 731
3336 12 31 732
3337 1 1 1
3337 1 2 2
3337 1 3 3
3337 1 4 4
3337 1 5 5

3338 12 22 721
3338 12 23 722
3338 12 24 723
3338 12 25 724
3338 12 26 725
3338 12 27 726
3338 12 28 727
3338 12 29 728
3338 12 30 729
3338 12 31 730
...some of it anyway.
And then I tried the different commands on it:
for cmd in "sort -uVk1,3" \
"sed -ne'p;/ 12 * 31 /!d;:n' -e'n;//!bn'" \
"awk '"'{u=$1 $2 $3 $4;if (!a[u]++) print;}'\'
do eval "time ($cmd|wc -l)" </tmp/dates
done
500027
( sort -uVk1,3 | wc -l; ) \
1.85s user 0.11s system 280% cpu 0.698 total
500027
( sed -ne'p;/ 12 * 31 /!d;:n' -e'n;//!bn' | wc -l; ) \
0.64s user 0.09s system 110% cpu 0.659 total
500027
( awk '{u=$1 $2 $3 $4;if (!a[u]++) print;}' | wc -l; ) \
1.46s user 0.15s system 104% cpu 1.536 total
The sort and sed commands both completed in less than half the time awk did (about 0.70s and 0.66s against 1.54s) - and these results were typical. I ran them several times. All of the commands also appear to write out the correct number of lines - and so they probably all work.
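A matching line count is only weak evidence, though. As a minimal sketch of a stricter check, the function below prints any YEAR MONTH DAY triple that appears more than once on its input. Note that dup_dates is a made-up helper name and the four-line sample is only for illustration; neither comes from the answers being timed:

```shell
# dup_dates (hypothetical helper): print each YEAR MONTH DAY triple the
# second time it is seen on stdin; print nothing if every triple is unique.
dup_dates() {
    awk '{ key = $1 FS $2 FS $3 } seen[key]++ == 1 { print key }'
}

# One shortened year, written twice.
sample='1971 1 1 245
1971 12 31 685
1971 1 1 245
1971 12 31 685'

# Raw sample: both triples repeat, so both are reported.
before=$(printf '%s\n' "$sample" | dup_dates)

# After the sed filter: nothing should be left to report.
after=$(printf '%s\n' "$sample" |
    sed -ne'p;/ 12 * 31 /!d;:n' -e'n;//!bn' | dup_dates)

printf 'before:\n%s\nafter:\n%s\n' "$before" "$after"
```

Assuming the file was generated as above, running the sed command on /tmp/dates and piping the result through dup_dates should likewise print nothing if the filter did its job.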
sort and sed were fairly neck and neck on completion time for every run - with sed generally a hair ahead - but sort does more actual work to achieve its results than either of the other two commands. It runs parallel jobs to complete its task and benefits a great deal from my multi-core cpu. awk and sed both peg the single core assigned to them for the entire time they process.
The results here are from a standard, up-to-date GNU sed, but I did try another. In fact, I tried all three commands with other binaries, but only the sed command actually worked with my heirloom tools. The others, I guess because of non-standard syntax, simply quit with an error before getting off the ground.
It is good to use standard syntax when possible - that way you can freely use simpler, more honed, and more efficient implementations in many cases:
PATH=/usr/heirloom/bin/posix2001:$PATH; time ...
500027
( sed -ne'p;/ 12 * 31 /!d;:n' -e'n;//!bn' | wc -l; ) \
0.31s user 0.12s system 136% cpu 0.318 total