How do I find the overlap of two strings in bash?

Question

I have two strings. For the sake of the example they are set like this:

string1="test toast"
string2="test test"

What I want is to find the overlap starting at the beginning of the strings. With overlap I mean the string "test t" in my above example.

# I look for the command 
command "$string1" "$string2"
# that outputs:
"test t"

If the strings were string1="atest toast"; string2="test test" they would have no overlap since the check starts form the beginning and the "a" at the start of string1 .

Please [do not cross post](http://unix.stackexchange.com/faq/#cross-posting)! [How do I find the overlap of two strings in bash?](http://stackoverflow.com/questions/6973088/6973268#6973268) — Caleb, Aug 07 '11 at 17:16
This is exactly the reason people aren't supposed to cross-post; now it has multiple answers on each site that are different, and it's on-topic for both sites. I think I'm just going to leave it here anyway — Michael Mrozek, Aug 08 '11 at 14:15

score 12 · Accepted Answer · answered Aug 07 '11 at 14:13

12

You can think of a function like this, with some error check to add

common_prefix() {
  local n=0
  while [[ "${1:n:1}" == "${2:n:1}" ]]; do
    ((n++))
  done
  echo "${1:0:n}"
}

answered Aug 07 '11 at 14:13

enzotib

50,671
14
120
105

I just noticed that when run with two empty/null args it enters an ∞ loop. `[[ -z "$1$2" ]] && return` fixes it. – Peter.O Feb 14 '12 at 14:28
This method is exponentially slower (rather than linearly). As the string doubles in length, the time increases by a factor of 4 (approx). Here are some string-length/time comparisons to Gilles' *binary-split*: .. `64` *0m0.005s* vs *0m0.003s* - `128` *0m0.013s* vs *0m0.003s* - `256` *0m0.041s* vs *0m0.003s* - `512` *0m0.143s* vs *0m0.005s* - `1024` *0m0.421s* vs *0m0.009s* - `2048` *0m1.575s* vs *0m0.012s* - `4096` *0m5.967s* vs *0m0.022s* - `8192` *0m24.693s* vs *0m0.049s* - `16384` *1m34.004s* vs *0m0.085s* - `32768` *6m34.721s* vs *0m0.168s* - `65536` *27m34.012s* vs *0m0.370s* – Peter.O Aug 29 '12 at 06:15
2

@Peter.O Quadratically, not exponentially. – Gilles 'SO- stop being evil' Aug 29 '12 at 11:27
I guess bash stores strings internally with implicit length, so getting the `n`th character requires scanning `n` characters to check that they're not the string-terminating zero-byte. This is consistent with bash being unable to store a zero-byte in a variable. – Peter Cordes Sep 05 '15 at 05:14

Gilles 'SO- stop being evil' · Answer 2 · 2012-08-29T12:36:55.660

This can be done entirely inside bash. Although doing string manipulation in a loop in bash is slow, there is a simple algorithm that is logarithmic in the number of shell operations, so pure bash is a viable option even for long strings.

longest_common_prefix () {
  local prefix= n
  ## Truncate the two strings to the minimum of their lengths
  if [[ ${#1} -gt ${#2} ]]; then
    set -- "${1:0:${#2}}" "$2"
  else
    set -- "$1" "${2:0:${#1}}"
  fi
  ## Binary search for the first differing character, accumulating the common prefix
  while [[ ${#1} -gt 1 ]]; do
    n=$(((${#1}+1)/2))
    if [[ ${1:0:$n} == ${2:0:$n} ]]; then
      prefix=$prefix${1:0:$n}
      set -- "${1:$n}" "${2:$n}"
    else
      set -- "${1:0:$n}" "${2:0:$n}"
    fi
  done
  ## Add the one remaining character, if common
  if [[ $1 = $2 ]]; then prefix=$prefix$1; fi
  printf %s "$prefix"
}

The standard toolbox includes cmp to compare binary files. By default, it indicates the byte offset of the first differing bytes. There is a special case when one string is a prefix of the other: cmp produces a different message on STDERR; an easy way to deal with this is to take whichever string is the shortest.

longest_common_prefix () {
  local LC_ALL=C offset prefix
  offset=$(export LC_ALL; cmp <(printf %s "$1") <(printf %s "$2") 2>/dev/null)
  if [[ -n $offset ]]; then
    offset=${offset%,*}; offset=${offset##* }
    prefix=${1:0:$((offset-1))}
  else
    if [[ ${#1} -lt ${#2} ]]; then
      prefix=$1
    else
      prefix=$2
    fi
  fi
  printf %s "$prefix"
}

Note that cmp operates on bytes, but bash's string manipulation operates on characters. This makes a difference in multibyte locales, for examples locales using the UTF-8 character set. The function above prints the longest prefix of a byte string. To handle character strings with this method, we can first convert the strings to a fixed-width encoding. Assuming the locale's character set is a subset of Unicode, UTF-32 fits the bill.

longest_common_prefix () {
  local offset prefix LC_CTYPE="${LC_ALL:=$LC_CTYPE}"
  offset=$(unset LC_ALL; LC_MESSAGES=C cmp <(printf %s "$1" | iconv -t UTF-32) \
                                           <(printf %s "$2" | iconv -t UTF-32) 2>/dev/null)
  if [[ -n $offset ]]; then
    offset=${offset%,*}; offset=${offset##* }
    prefix=${1:0:$((offset/4-1))}
  else
    if [[ ${#1} -lt ${#2} ]]; then
      prefix=$1
    else
      prefix=$2
    fi
  fi
  printf %s "$prefix"
}

Revisiting this question (1 year on), I've have re-assessed the **best** answer. It's all quite simple: rock breaks scissors, scissors cut paper, paper wraps rock. and binary eats sequential!.. even for quite short strings.. and as for a moderate 10000 char string being processed sequentially via `while char-by-char`, I'm still waiting for it as I write this.. time passes.. still waiting (maybe there's something wrong with my system).. time passes.. there must be something wrong; it is only 10,000 itterations! Ah! patience is a virtue (perhaps a curse in this case).. 13m53.755s.. vs, 0m0.322s — Peter.O, Aug 29 '12 at 02:49
The 3 methods given here are the outright fastest of all the presented answers.. Basically, `cmp` is the fastest (but is not char based). The next is `iconv` and then the *very* respectibly fast `binary-split` answer. Thanks Gilles. It took me a year to get to this point, but better late than never. (PS. 2 typo mods in `iconv` code: `$` in `=$LC_CTYPE}` and `\ ` in `UTF-32) \ `) ... PPS. actually the string I mentioned above was longer than 10,000 characters. It was the result of {1..10000} which is, 48,894, but that doesnt' change the differential — Peter.O, Aug 29 '12 at 12:11

score 6 · Answer 3 · edited Mar 07 '13 at 10:42

6

In sed, assuming the strings don't contain any newline characters:

string1="test toast"
string2="test test"
printf "%s\n" "$string1" "$string2" | sed -e 'N;s/^\(.*\).*\n\1.*$/\1/'

edited Mar 07 '13 at 10:42

Stéphane Chazelas

522,931
91
1,010
1,501

answered Aug 07 '11 at 15:24

jfg956

5,988
3
22
24

But duplicate with [this](http://stackoverflow.com/questions/6973088/how-do-i-find-the-overlap-of-two-strings-in-bash/6973268#6973268). – jfg956 Aug 07 '11 at 15:25
Brilliant! goes directly to my tips&tricks library :-) – hmontoliu Aug 08 '11 at 11:27
Or, for a *bash* string, which can't contain `\0`. Using `tr` and `\0`, the method can handle newlines in the string, .... `{ printf "%s" "$string1" |tr \\n \\0; echo; printf "%s" "$string2" |tr \\n \\0; echo; } | sed -e 'N;s/^$.*$.*\n\1.*$/\1/' |tr \\0 \\n` – Peter.O Aug 29 '12 at 01:25
I've just tested this `sed` method a bit further,and it seems that using back-references this way (in the search pattern) is hugely expensive. It still outperforms the sequential byte-by-byte looping (by approx a factor of 3), but here is an example: for two 32kb strings (with the last byte being different), it takes `2m4.880s`, as compared to Gilles' binary-split method `0m0.168s` – Peter.O Aug 29 '12 at 07:54

score 2 · Answer 4 · answered Aug 07 '11 at 14:21

2

This seems crude to me, but you can do it via brute force:

#!/bin/bash

string1="test toast"
string2="test test"

L=1  # Prefix length

while [[ ${string1:0:$L} == ${string2:0:$L} ]]
do
    ((L = L + 1))
done

echo Overlap: ${string1:0:$((L - 1))}

I want some clever algorithm to exist, but I can't find any with a short search.

answered Aug 07 '11 at 14:21

2

[compare half and repeat](http://unix.stackexchange.com/questions/18236/how-do-i-find-the-overlap-of-two-strings-in-bash/18695#18695) is n*log(n) rather than n^2. – Gilles 'SO- stop being evil' Aug 13 '11 at 21:57
2

For general reference, it is a bit on the slow side. Two 32768 character strings (the last char being different) took 6m27.689s. – Peter.O Aug 29 '12 at 08:34

How do I find the overlap of two strings in bash?

4 Answers4

Linked