
I have a file containing population information for a bunch of towns. I have another file that is a list of the names of a subset of those towns. I want to select the population information from the first file using the second file. How would I do this?

Examples:

File 1: ma-towns.txt

Acton   Town    Middlesex   Open town meeting   21,924  1735  
Acushnet    Town    Bristol Open town meeting   10,303  1860  
Adams   Town    Berkshire   Representative town meeting 8,485   1778  
Agawam  City[4] Hampden Mayor-council   28,438  1855  
Alford  Town    Berkshire   Open town meeting   494 1773  
Amesbury    City    Essex   Mayor-council   16,283  1668  
Amherst Town    Hampshire   Representative town meeting 37,819  1775  

File 2: town-list.txt

Acton  
Adams  
Agawam 

Desired output would be

Acton   Town    Middlesex   Open town meeting   21,924  1735  
Adams   Town    Berkshire   Representative town meeting 8,485   1778  
Agawam  City[4] Hampden Mayor-council   28,438  1855   

More generally, I want to extract every line of file 1 that contains a string listed on one of the lines of file 2.

clk
abnry

2 Answers

grep -f <(sed 's/.*/\^&\\>/' town-list.txt) ma-towns.txt

Explanation:

grep -f file reads file for a list of patterns to match against. We are searching in the ma-towns.txt list, using patterns from town-list.txt. Each separate line is treated as a new pattern, i.e. a new search term.

However, that's not quite enough, so I've included a sed to format the search terms like this:

^Acton\>
^Adams\>
^Agawam\>

The ^ makes grep only match that pattern at the start of a line, and the \> makes grep only match if the word ends at that point.

Together this ensures that the search term only looks at the beginning of the line (where the town names are), and that the search term must end where the town name ends.


The sed command itself runs a s (substitute) command, of the form s/search/replace/.

The search term .* matches a whole line. The replacement, \^&\\>, replaces it with a literal ^ character, followed by the original line, followed by the text \>.
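
The two steps can be tried separately. The following is only a sketch: it rebuilds a few rows of the question's sample data in a scratch directory, prints the patterns that sed generates, and then hands them to grep. It pipes the patterns in with `-f -` (read the pattern list from standard input) instead of using `<(...)`, and it writes the replacement as `^&\\>` without the backslash before `^`; both are choices made for this sketch, and `\>` is assumed to be supported by the local grep, as it is in GNU grep.

```shell
# Scratch directory holding a few rows of the sample data from the question.
tmp=$(mktemp -d)
cd "$tmp" || exit 1

printf '%s\n' \
  'Acton   Town    Middlesex   Open town meeting   21,924  1735' \
  'Acushnet    Town    Bristol Open town meeting   10,303  1860' \
  'Adams   Town    Berkshire   Representative town meeting 8,485   1778' \
  'Agawam  City[4] Hampden Mayor-council   28,438  1855' \
  'Amherst Town    Hampshire   Representative town meeting 37,819  1775' \
  > ma-towns.txt
printf '%s\n' Acton Adams Agawam > town-list.txt

# Step 1: wrap every town name as ^NAME\> (& stands for the matched line).
patterns=$(sed 's/.*/^&\\>/' town-list.txt)
printf '%s\n' "$patterns"

# Step 2: use the generated lines as grep patterns; "-f -" reads them from stdin.
result=$(printf '%s\n' "$patterns" | grep -f - ma-towns.txt)
printf '%s\n' "$result"
```

Printing the intermediate patterns makes it easy to spot stray whitespace in town-list.txt, which would end up inside the pattern and prevent a match.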


What this answer does that the other does not:

  • Handles town names beginning with a dash or containing backslashes (which is unlikely, but if the input is taken from a user you don't want them to be able to break your scripts in unpredictable ways). Note that both answers treat town names as a regex rather than a literal search term.
  • Outputs the towns in the original order as specified in ma-towns.txt
  • Performs better, because grep reads ma-towns.txt once instead of once per town name
  • Searches the beginning of the line for the town name, not just anywhere in the line
  • Does not match a town if only a substring matches (e.g. Waterloo will not match Waterlooville)
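
The last two points can be checked with a small experiment. This is a sketch using made-up rows (the file name `towns.txt` and its contents are hypothetical, not taken from the question), and it assumes a grep that understands `\>`, such as GNU grep.

```shell
# Hypothetical data: one exact town, one longer name, one name containing it.
tmp=$(mktemp -d)
printf '%s\n' \
  'Waterlooville  Town  1' \
  'New Waterloo   Town  2' \
  'Waterloo       City  3' \
  > "$tmp/towns.txt"

# Unanchored search: matches anywhere in the line, including substrings.
loose=$(grep 'Waterloo' "$tmp/towns.txt")

# Anchored search: must start the line and end at a word boundary.
strict=$(grep '^Waterloo\>' "$tmp/towns.txt")

printf 'loose:\n%s\n\nstrict:\n%s\n' "$loose" "$strict"
```

The unanchored pattern returns all three rows, while the anchored one returns only the row whose first word is exactly `Waterloo`.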
Score_Under
  • So I didn't know about "grep -f file1 file2". That's helpful. Can you explain your perl regex in the sed command? You are doing a substitution, twice, but it's a bit hazy for me. I didn't know you can do two substitutions in one sed command. I may accept your answer instead. – abnry Jul 26 '16 at 15:04
  • I've had a crack at explaining the `sed` invocation now – Score_Under Jul 26 '16 at 15:17
  • So your second regex in your sed command clips off the 's' in town names that end in 's'. I can't figure out why. – abnry Jul 26 '16 at 15:17
  • It's clipping off `\s`, which is the whitespace character class in regex. – Score_Under Jul 26 '16 at 15:18
  • My problem is with sed 's/\s*$/\\>/' <(echo 'Saugus'), which outputs Saugu/>. – abnry Jul 26 '16 at 15:20
  • The trailing spaces were because of formatting the question. I am not familiar with how to format blocks of text here. – abnry Jul 26 '16 at 15:20
  • Okay, so I think it's because of the regex variant differences on my OS X machine. I don't know what the right flag is to make it work. However, removing "\s*" from the regex makes everything work, as the original formatting of the file is clean. I'm accepting this answer now. – abnry Jul 26 '16 at 15:24
  • D'oh, I always forget about OS X's `sed`. It doesn't share the same backslash escapes as GNU `sed`, so I think removing it is the right thing to do there. Another option, if it does become necessary in future, is to use something like `s/ *$/\\>/`, which matches literal spaces instead of `\s`. – Score_Under Jul 26 '16 at 15:32
  • @nayrb just use `join`, it's the simplest solution. See the answers in the duplicate(s). For the example you show, all you need is `join town-list.txt ma-towns.txt `. If your files aren't sorted, you can do `join <(sort town-list.txt) <(sort ma-towns.txt )` – terdon Jul 26 '16 at 15:40
  • For some reason join doesn't seem to work, even with sorting. – abnry Jul 26 '16 at 15:54
  • `Waterloo` will not match `Waterlooville` - true, but `Sunny` will match `Sunny Valley` and it shouldn't... I'd still use `join` here, taking advantage of the fact that every name is followed by either `Town` or `City`, so something like `sed -E 's/[[:blank:]]*(Town|City)(.*)/%\1\2/' ma-towns.txt | join -t% - town-list.txt | tr % ' '`. This should never fail (it does assume there's no `%` in the input, but that can be replaced with another character that is guaranteed never to occur in a text file, like `\x02`) – don_crissti Jul 26 '16 at 17:18

This reads file2 line by line and runs grep on file1 with each line as the pattern:

while read line; do
  grep "${line}" file1
done < file2
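
One consequence of looping over file2 is that matches come out in the order of file2, not of file1. A sketch with hypothetical scratch files (three shortened rows and a deliberately reversed list) shows the difference from a single `grep -f` call:

```shell
# Hypothetical scratch data: file1 is alphabetical, file2 is not.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf '%s\n' 'Acton   Town' 'Adams   Town' 'Agawam  City' > file1
printf '%s\n' Agawam Acton > file2

# The loop from this answer: one grep run per line of file2.
loop_out=$(while read line; do
  grep "${line}" file1
done < file2)

# A single grep over file1 with every line of file2 as a pattern.
single_out=$(grep -f file2 file1)

printf 'loop:\n%s\n\nsingle grep:\n%s\n' "$loop_out" "$single_out"
```

The loop prints the Agawam row first because that name comes first in file2; the single grep prints the Acton row first because it follows file1.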
magor
  • If `file2` has 10000 lines you're going to run `grep` 10000 times reading `file1` 10000 times - all this via the slow and error prone `while..read`... – don_crissti Jul 26 '16 at 14:45
  • agreed, this could be refined, it's just a quick answer which seems to be solving the problem in this case...10000 lines is not a lot though nowadays...tested on a 100k line logfile, grepping a string took me 0m0.009s – magor Jul 26 '16 at 14:53
  • Sorry, but the other answer is cleaner and better in my opinion (if I want to use loops I'll crack open python) so I'm accepting it instead. Thanks still! – abnry Jul 26 '16 at 15:25
  • No probs, yes the other solution is better and I also voted for it. Mine is just a quick solution which will work for the exact scenario you gave as an example, small files up to 1 GB or so. If you don't have terabytes/petabytes of data, performance will not be an issue on a modern computer or server and you get the data you want. – magor Jul 26 '16 at 19:28