
Is there a convenient way to identify duplicate or near duplicate blocks of text within a file?

I want to use this for identifying code duplication. It looks like there are specialty programs with this capability but I'm not looking to get that involved.

I'm hoping there's a tool similar to diff that can do a sort of "within a file" diff. Even better would be a vimdiff that works within a single file.

Praxeolitic

1 Answer


If doing the comparison line by line is acceptable, then the following will tell you which lines are duplicated in the file text and how many times each one appears:

sort text | uniq -c | grep -vE '^\s*1 '

As an example,

$ cat text
alpha
beta
alpha
gamma
alpha
beta
$ sort text | uniq -c | grep -vE '^\s*1 '
      3 alpha
      2 beta
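
If it also matters where the duplicates are, a variant along the following lines may help. It is only a sketch: the function name dup_lines and the tab-separated output layout are assumptions made here for illustration, not part of the pipeline above. It reads the file without modifying it and prints, for each repeated line, the count, the line numbers where it occurs, and the text.

```shell
# dup_lines FILE (hypothetical helper name): list each line that occurs more
# than once, with its count and the line numbers where it appears.
dup_lines() {
    awk '
        {
            count[$0]++
            # Keep a comma-separated list of line numbers for this text.
            where[$0] = (where[$0] == "") ? NR : (where[$0] "," NR)
        }
        END {
            for (line in count)
                if (count[line] > 1)
                    printf "%d\t%s\t%s\n", count[line], where[line], line
        }
    ' "$1" | sort -rn    # most frequent duplicates first
}
```

For the sample file above, dup_lines text should report alpha with count 3 at lines 1,3,5 and beta with count 2 at lines 2,6.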

Using the usual Unix tools, this could be extended to paragraph-by-paragraph or sentence-by-sentence comparisons, assuming the input text format is not too complex.
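
As a rough illustration of the sentence-by-sentence case, here is one possible sketch. The function name dup_sentences is made up for this example, and the splitting rule is a deliberately naive assumption: a sentence ends at ".", "!" or "?" followed by one or more spaces, so abbreviations such as "e.g." would be split incorrectly.

```shell
# dup_sentences FILE (hypothetical helper name): count sentences that occur
# more than once, using a naive "punctuation followed by spaces" rule.
dup_sentences() {
    tr '\n' ' ' < "$1" |                            # join all lines into one
        awk '{ gsub(/[.!?] +/, "&\n"); print }' |   # start a new line after each sentence end
        sed 's/^ *//; s/ *$//' |                    # trim leading and trailing spaces
        grep -v '^$' |                              # drop empty lines
        sort | uniq -c |
        awk '$1 > 1'                                # keep only sentences seen more than once
}
```

Because the whole file is joined into a single line before splitting, this is probably best suited to small and medium-sized files.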

Finding Repeated Paragraphs

Suppose that our file text contains:

This is a paragraph.

This is another
paragraph

This is
a paragraph.

Last sentence.

The following command shows which paragraphs appear more than once:

$ awk -v RS=""  '{gsub(/\n/," "); print}' text | sort | uniq -c | grep -vE '^\s*1 '
      2 This is a paragraph.

This uses awk to break the text up into paragraphs (delineated by blank lines), converts the newlines to spaces, and then passes the output, one line per paragraph, to sort and uniq for counting duplicated paragraphs.

The above was tested with GNU awk. For other awks, the method for defining blank lines as paragraph (record) boundaries may differ.
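
For code, where duplicated blocks are usually not separated by blank lines, a sliding window over consecutive lines may be closer to what is wanted. The following is a minimal sketch under stated assumptions: the function name dup_blocks, the default window of 3 lines, and the output wording are all invented for this example. It only finds exact repeats, not near duplicates.

```shell
# dup_blocks FILE [N] (hypothetical helper name): report every run of N
# consecutive lines (default 3) that occurs more than once in FILE.
dup_blocks() {
    awk -v n="${2:-3}" '
        { text[NR] = $0 }                 # remember every line by number
        END {
            for (i = 1; i + n - 1 <= NR; i++) {
                # Build a key from lines i .. i+n-1, joined with newlines.
                key = text[i]
                for (j = 1; j < n; j++) key = key "\n" text[i + j]
                keyat[i] = key
                count[key]++
                starts[key] = (starts[key] == "") ? i : (starts[key] "," i)
            }
            # Report each repeated block once, in order of first appearance.
            for (i = 1; i + n - 1 <= NR; i++) {
                key = keyat[i]
                if (count[key] > 1 && !seen[key]++)
                    print n "-line block at lines " starts[key] " (" count[key] " copies)"
            }
        }
    ' "$1"
}
```

Windows overlap, so a duplicated block longer than N lines is reported as several adjacent N-line blocks. Blank lines and indentation count as part of the text, so blocks that differ only in whitespace are not matched.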

John1024
  • I'd upvote for multiple lines at a time. – Praxeolitic Oct 01 '14 at 03:09
  • @Praxeolitic Updated for paragraphs. – John1024 Oct 01 '14 at 07:18
  • In 95% of cases this works great, but the problem is it destroys the input. I have to edit a transcript that I frantically saved from a Zoom meeting and this is a bit involved. If only there was something like `git diff` which has red marks for trailing whitespace. – Sridhar Sarnobat Oct 25 '20 at 20:17