I have a file, addresses.csv, containing addresses in different international formats:

Example Street 1
Teststraße 2
Teststr. 1-5
Baker Street 221b
221B Baker Street
19th Ave 3B
3B 2nd Ave
1-3 2nd Mount x Ave
105 Lock St # 219
Test Street, 1
BookAve, 54, Extra Text 123#

For example, in Germany we write "Teststraße 2", while in the USA it would be "2 Test Street".

Is there a way to separate/extract all street names and street numbers?

output-names.csv

Example Street
Teststraße
Teststr.
Baker Street
Baker Street
19th Ave
2nd Ave
2nd Mount x Ave
Lock St # 219
Test Street
BookAve

output-numbers.csv

1
2
1-5
221b
221B
3B
3B
1-3
105
1
54

output-extra_text.csv











Extra Text 123#

I am using macOS 13. The shell is zsh 5.8.1 or bash 3.2.


My thoughts so far: you could classify the addresses first, like this:

# Pseudocode: each quoted condition is a placeholder for a real test
x='The-address-line'
if [ "$x" = "begins with a letter" ]; then
    if [ "$x" = "begins with a letter + number + SPACE" ]; then
        echo 'something like "1A Street"'
        # NUMBER = '1A' / NAME = 'Street'
    else
        echo 'It begins with the STREET-NAME'
    fi
elif [ "$x" = "begins with a number" ]; then
    echo 'maybe STREET-NAME like "19th Ave 19B" or STREET-NUMBER like "19B Street"'
    # NUMBER = '19B' / NAME = '19th Ave' or 'Street'
    if [ "$x" = "begins with a number + SPACE" ]; then
        echo 'It begins with the STREET-NUMBER like "1 Street"'
        # NUMBER = '1' / NAME = 'Street'
    elif [ "$x" = "is (number)(text)(space)(text)(number(maybe-text))" ]; then
        echo 'For example 19th Street 19B -> The last number+text is the number (19B)'
        # NUMBER = '19B' / NAME = '19th Street'
    elif [ "$x" = "is (number(maybe-text))(space)(number)(text)(space)(text)" ]; then
        echo 'For example 19B 19th Street -> The first number+text is the number (19B)'
        # NUMBER = '19B' / NAME = '19th Street'
    else
        echo 'INVALID'
    fi
else
    echo 'INVALID'
fi
R 9000
  • What about "42nd street"? I mean, pretty much anything, including numbers, can be street names. – terdon Mar 02 '23 at 16:51
  • Exactly.. "42nd street 3" (DE) or "3 42nd street" (US) means -> number="3" and name="42nd street" – R 9000 Mar 02 '23 at 17:01
  • Which is why I don't think it is possible to automate this short of using an actual AI trained on real street names :/ – terdon Mar 02 '23 at 17:08
  • I think it is possible.. for your example see "my thoughts" what I just added – R 9000 Mar 02 '23 at 17:22
  • What if the address is "Flat B, 72 street"? Or "The Brown Cottage, Hanwell"? Or "Number 12, Foo street"? – terdon Mar 02 '23 at 18:20
  • 1) "Flat B, 72 street" -> "TEXT A, 00 TEXT".. so the only number is 72.. the question is.. is "Flat B" or "Street" the street name.. good question... 2) "The Brown Cottage, Hanwell" -> No number -> invalid 3) "Number 12, Foo street" -> "TEXT 00, TEXT TEXT".. so the number is 12.. but like in (1).. what is the street and what is "extra text"... good question.. maybe someone knows a solution – R 9000 Mar 02 '23 at 18:43
  • So the problem is.. is it "Street 72 EXTRA-TEXT" or "EXTRA-TEXT 72 Street".. It is OK for me if an address that begins with "EXTRA-TEXT" is treated as invalid – R 9000 Mar 02 '23 at 18:55
  • That pseudo-code is shell-like. You would not do something like this in shell as it'd be hard to get the syntax right and take forever to run. See [why-is-using-a-shell-loop-to-process-text-considered-bad-practice](https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice). You should use awk or some other general-purpose text-processing tool instead. – Ed Morton Mar 02 '23 at 19:33
  • How would you handle some rural addresses like _4593 NC 39_? – doneal24 Mar 03 '23 at 00:07
  • @doneal24 address is invalid :p – R 9000 Mar 03 '23 at 01:45
  • How about something like "1e Korte Dwarsstraat 525 BG", pretty common with your neighbors in NL or Belgium? A tool that doesn't accept valid addresses, as you've shown above, is hardly an "ultimate tool". I agree with terdon, not doable w/o a reasonably well-working AI. – Peregrino69 Mar 03 '23 at 06:37
  • @R9000 My mother would be very distressed to hear that. :) – doneal24 Mar 03 '23 at 19:59

1 Answer


IMHO all you can do is make a best effort, using a series of regexps to match the address formats you know about. For example, using GNU awk for the 3rd arg to match() and the \s shorthand for [[:space:]], with 3 of the possible regexps defined:

$ cat tst.awk
BEGIN { OFS="\",\"" }
{
    name = number = type = ""
    gsub(/"/,"\"\"")
}
match($0,/^([^0-9]+)([0-9]+(-[0-9]+)?[[:alpha:]]?)$/,a) {
    # Example Street 1
    # Teststraße 2
    # Teststr. 1-5
    # Baker Street 221b
    # Test Street, 1
    type   = 1
    name   = a[1]
    number = a[2]
}
!type && match($0,/^([0-9]+[[:alpha:]])\s+([^0-9]+)$/,a) {
    # 221B Baker Street
    type   = 2
    name   = a[2]
    number = a[1]
}
!type && match($0,/^([0-9]+[[:alpha:]]{2}.*)\s+([0-9]+[[:alpha:]]?)$/,a) {
    # 19th Ave 3B
    type   = 3
    name   = a[1]
    number = a[2]
}
{
    gsub(/^\s+|\s+$/,"",name)
    gsub(/^\s+|\s+$/,"",number)
    if ( !doneHdr++ ) {
        print "\"" "type", "name", "number", "$0" "\""
    }
    print "\"" type, name, number, $0 "\""
}

$ awk -f tst.awk file
"type","name","number","$0"
"1","Example Street","1","Example Street 1"
"1","Teststraße","2","Teststraße 2"
"1","Teststr.","1-5","Teststr. 1-5"
"1","Baker Street","221b","Baker Street 221b"
"2","Baker Street","221B","221B Baker Street"
"3","19th Ave","3B","19th Ave 3B"
"","","","3B 2nd Ave"
"","","","1-3 2nd Mount x Ave"
"","","","105 Lock St # 219"
"1","Test Street,","1","Test Street, 1"
"","","","BookAve, 54, Extra Text 123#"

You'd add the other regexps to match the address formats you know about, in an order such that if an address might match 2 or more regexps, the more restrictive regexp(s) come first. You may actually want to modify the above to print a warning if an address matches 2 or more regexps, as you may then want to tweak, re-order, or consolidate them.
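
A rough sketch of such an overlap check, written for any POSIX awk since it only needs plain `~` matching. The function name `check_overlap` is made up for illustration, patterns 1-3 mirror the three regexps in tst.awk, and pattern 4 is a hypothetical, deliberately loose pattern added only so that an overlap has something to trigger on:

```shell
# Sketch: report how many of the known patterns each address matches.
check_overlap() {
    awk '
    BEGIN {
        # Patterns 1-3 mirror the regexps in tst.awk; pattern 4 is a
        # hypothetical, deliberately loose "number first" pattern.
        re[1] = "^[^0-9]+[0-9]+(-[0-9]+)?[[:alpha:]]?$"
        re[2] = "^[0-9]+[[:alpha:]][[:space:]]+[^0-9]+$"
        re[3] = "^[0-9]+[[:alpha:]][[:alpha:]].*[[:space:]]+[0-9]+[[:alpha:]]?$"
        re[4] = "^[0-9]+[[:alpha:]]?[[:space:]]+.+$"
        n = 4
    }
    {
        hits = ""; cnt = 0
        for (i = 1; i <= n; i++) {
            # collect the index of every pattern that matches this line
            if ($0 ~ re[i]) { hits = hits (cnt++ ? "," : "") i }
        }
        if (cnt > 1)       print "AMBIGUOUS(" hits "): " $0
        else if (cnt == 0) print "UNMATCHED: " $0
        else               print "OK(" hits "): " $0
    }'
}

# usage: check_overlap < addresses.csv
```

Lines reported as AMBIGUOUS are the ones where the order of the regexps decides the result, so those are the patterns worth tightening or merging first.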

If you reach the print line with type still empty, that's the "invalid" case and then you could write/add a new regexp to match those if appropriate.
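
For systems where GNU awk isn't available (the stock macOS awk, for example), the same flow can probably be approximated with POSIX match() plus RSTART/RLENGTH. This is only a sketch under assumptions: `split_address` and `trim` are made-up names, the three tests mirror the regexps in tst.awk, the output omits the original line, and the double-quote escaping from tst.awk is left out for brevity. type stays empty for lines that no regexp recognises, i.e. the "invalid" case:

```shell
# Sketch: split name/number with POSIX awk only (no 3-arg match(), no \s).
split_address() {
    awk '
    function trim(s) { gsub(/^[[:space:]]+|[[:space:]]+$/, "", s); return s }
    {
        type = name = number = ""
        if ($0 ~ /^[^0-9]+[0-9]+(-[0-9]+)?[[:alpha:]]?$/) {
            # name first: the number starts at the first digit
            match($0, /[0-9]+(-[0-9]+)?[[:alpha:]]?$/)
            type = 1; name = substr($0, 1, RSTART - 1); number = substr($0, RSTART)
        } else if ($0 ~ /^[0-9]+[[:alpha:]][[:space:]]+[^0-9]+$/) {
            # number first: leading digits plus exactly one letter
            match($0, /^[0-9]+[[:alpha:]]/)
            type = 2; number = substr($0, 1, RLENGTH); name = substr($0, RLENGTH + 1)
        } else if ($0 ~ /^[0-9]+[[:alpha:]][[:alpha:]].*[[:space:]]+[0-9]+[[:alpha:]]?$/) {
            # ordinal street name, then number: split at the final blank run
            match($0, /[[:space:]]+[0-9]+[[:alpha:]]?$/)
            type = 3; name = substr($0, 1, RSTART - 1); number = substr($0, RSTART)
        }
        printf "\"%s\",\"%s\",\"%s\"\n", type, trim(name), trim(number)
    }'
}

# usage: split_address < addresses.csv
```

Each branch first checks the whole line against the full pattern and only then calls match() to find where the number begins or ends, which stands in for the capture groups that the 3-argument match() provides.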

I do expect you'll come across cases where you simply can't write code to distinguish one address format from another, but hopefully this best-effort approach will be adequate for your needs. If you have city/state/county you could always curl an address using Google Maps to see if it's real or not as a last-ditch effort for addresses you can't identify (but that'd take forever if you tried to do ONLY that for all your addresses).

Produce output however you like, wherever you like, once the address recognition algorithm is working; I'm just dumping CSV above for ease of developing/testing.
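
As one possible output stage, here is a minimal sketch that writes the three files named in the question, one line per address so the files stay aligned. It assumes a hypothetical intermediate format of tab-separated name, number, and extra text (not something tst.awk emits as written), and `write_columns` is a made-up name:

```shell
# Sketch: fan tab-separated "name<TAB>number<TAB>extra" lines out into
# the three files from the question. $1 is an optional output directory.
write_columns() {
    awk -F '\t' -v dir="${1:-.}" '
    {
        # an empty field still produces an (empty) line, keeping rows aligned
        print $1 > (dir "/output-names.csv")
        print $2 > (dir "/output-numbers.csv")
        print $3 > (dir "/output-extra_text.csv")
    }'
}

# usage: some_classifier < addresses.csv | write_columns /path/to/outdir
```

Because every input line writes exactly one line to each file, line N of each output file refers to the same address.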

Ed Morton
  • You're welcome. Given the above it's just a case of iterating: writing additional regexps and/or refining existing regexps as you come across additional address formats. I do expect you'll come across cases where you simply can't write code to distinguish one address format from another but hopefully this best-effort approach will be adequate for your needs. If you have city/state/county you could always curl an address using google maps to see if it's real or not as a last-ditch effort for addresses you can't identify (but that'd take forever if you tried to do ONLY that for all your addresses) – Ed Morton Mar 02 '23 at 19:54
  • I think you’ll eventually hit an address that matches the regex but is interpreted wrong. The address _4593 NC 39_ has the house number first but your 3rd regex would make that the street number. – doneal24 Mar 03 '23 at 00:13
  • @doneal24 right, I said as much in [my comment directly above yours](https://unix.stackexchange.com/questions/738429/the-ultimate-tool-to-split-extract-street-name-from-street-number/738460?noredirect=1#comment1401851_738460) - `I do expect you'll come across cases where you simply can't write code to distinguish one address format from another`. I copied that statement into my answer now so it's harder to miss. – Ed Morton Mar 03 '23 at 00:15
  • `syntax error at source line 6 source file tst.awk context is >>> match($0,/^([^0-9]+)([0-9]+(-[0-9]+)?)[[:alpha:]]?$/, <<< awk: bailing out at source line 6 source file tst.awk` So I can't know if this is working – R 9000 Mar 03 '23 at 00:41
  • As I said in my answer, it requires GNU awk for the 3rd arg to match() and \s shorthand for [[:space:]]. You aren't using GNU awk, you should try really, really hard to get it as it has a ton of useful extensions. If you can't get it for some reason it's possible to do something similar with a POSIX awk but it requires a bit more code and a bit more complicated code (which is why GNU awk added that parameter to match()). – Ed Morton Mar 03 '23 at 00:59
  • @EdMorton Thank you for telling me. I will give GNU awk a try.. And yes, with GNU awk your script is working. But here https://unix.stackexchange.com/questions/738479/split-extract-street-name-from-street-number your script doesn't really show the results that I need.. so I prefer my way. – R 9000 Mar 03 '23 at 01:33
  • Well, good luck with that. Were you expecting the 3 regexps I provided to show you how to solve your problem to be able to parse all possible address formats for you? I specifically told you `You'd add the other regexps to match the formats of address you know about`. – Ed Morton Mar 03 '23 at 01:46
  • I did vote it down. That was when I thought your script was not working (I wasn't using GNU awk at that moment).. And I saw that you downvoted my (for me) working script. Now I see that your script is working. And I see that I can/have to implement my part (of my script) in the 'match' from your script. I am sorry and I really want to change the vote, but for that you have to edit your answer(?) I am very grateful to you for your hard work! – R 9000 Mar 03 '23 at 01:59
  • @EdMorton I think our comments came within seconds of each other so it was easy to miss yours. – doneal24 Mar 03 '23 at 20:01