Questions tagged [deduplication]

For questions where a task is to be applied to only one instance among multiple copies of data (files or blocks of data on a filesystem, or strings in a text), or where duplicates after the first such instance are to be ignored to save space or time.

Deduplication is the elimination or disregarding of identical copies of data during processing. It mainly occurs in two contexts:

  • Space saving/speedup on file storage and transfer systems. This can mean scanning a file system for multiple copies of the same file and removing all but one of those found. On a lower level, the same can apply to blocks of data on the filesystem. Alternatively, it can mean identifying files or data blocks already encountered when transferring/backing up data, and skipping any duplicates to reduce backup size/transfer volume.
  • Eliminating/preventing repeated copies of a (sub)string in a larger string or text file. In this case, the task may be to scan a given string for multiple instances of a given substring, or when appending text to a file or string, identifying which parts of the text to be added are already present on the destination and skipping them upon output.
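The first context above, duplicate detection on a file system, can be sketched with standard tools. A minimal detection-only example, assuming GNU coreutils (`md5sum`, `uniq -w`):

```shell
# Group files by content hash; any hash appearing more than once marks duplicates.
# Detection only -- nothing is deleted here.
find . -type f -exec md5sum {} + | sort | uniq -w32 -D
```

`uniq -w32` compares only the 32-character MD5 prefix of each line, and `-D` prints every member of each duplicate group so you can review them before acting.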
75 questions
201
votes
20 answers

Is there an easy way to replace duplicate files with hardlinks?

I'm looking for an easy way (a command or series of commands, probably involving find) to find duplicate files in two directories, and replace the files in one directory with hardlinks of the files in the other directory. Here's the situation: This…
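A minimal sketch of the hardlink-replacement idea the question asks about, assuming the two trees (here the placeholder names `a/` and `b/`) contain files at matching relative paths and that no filenames contain newlines:

```shell
# For each file under b/, if the same-named file under a/ has identical content,
# replace the copy in b/ with a hardlink to the one in a/. Contents are verified
# with cmp first, so only true duplicates are linked.
cd b
find . -type f | while read -r f; do
    if cmp -s "../a/$f" "$f"; then
        ln -f "../a/$f" "$f"
    fi
done
```

Dedicated tools such as jdupes or rdfind handle differing paths and edge cases far more robustly; this only illustrates the mechanism.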
Josh
17
votes
1 answer

How to duplicate a file without copying its data with btrfs?

I have no experience with btrfs, but it's advertised to be able to de-duplicate files. In my application, I'd need to duplicate whole directory trees. From what I learned, btrfs only de-duplicates in some post scan, not immediately. Even just using…
Udo G
13
votes
5 answers

How to find duplicate lines in many large files?

I have ~30k files. Each file contains ~100k lines. A line contains no spaces. The lines within an individual file are sorted and duplicate free. My goal: I want to find all duplicate lines across two or more files and also the names of the files…
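Because each file is already sorted and internally duplicate-free, one sketch of the cross-file part is a merge: any line that survives `uniq -d` must occur in at least two files (filenames below are placeholders):

```shell
# Merge pre-sorted files without re-sorting, then keep lines seen more than once.
# Valid only because no single file contains a duplicate, so any repeat in the
# merged stream must come from different files.
sort -m file1 file2 file3 | uniq -d
```

Recovering *which* files contain each duplicate line is a second pass, e.g. `grep -Flx` per line, and is where the real cost lies at this scale.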
12
votes
2 answers

Are there any deduplication scripts that use btrfs CoW as dedup?

Looking for deduplication tools on Linux, there are plenty; see e.g. this wiki page. Almost all scripts do either only detection, printing the duplicate file names, or removing duplicate files by hardlinking them to a single copy. With the rise of…
Peter Smit
10
votes
2 answers

How to find data copies of a given file in Btrfs filesystem?

I have deduplicated my Btrfs filesystem with bedup, so now all duplicate files (above a certain size) are "reflink" copies. Is there any way to see, given a filename, what other files reflink the same data?
Peter Smit
9
votes
1 answer

Is there a way to enable reflink on an existing XFS filesystem?

I currently have a 4TB RAID 1 setup on a small, personal Linux server, which is formatted as XFS in LVM. I am interested in enabling the reflink feature of XFS, but I did not do so when I first created the filesystem (I used the defaults). Is there…
TheSola10
9
votes
2 answers

Deduplication on partition level

What are available solutions for block-level or finer-grained deduplication? There are file-based ones with a "copy-on-write" approach. I'm looking for block level "copy-on-write", so I could periodically look for common blocks, or - preferably -…
Grzegorz Wierzowiecki
9
votes
1 answer

Make tar (or other) archive, with data block-aligned like in original files for better block-level deduplication?

How can one generate a tar file, so the contents of tarred files are block-aligned like in the original files, so one could benefit from block-level deduplication ( https://unix.stackexchange.com/a/208847/9689 )? (Am I correct that there is nothing…
Grzegorz Wierzowiecki
9
votes
5 answers

Remove duplicate lines from a file that contains a timestamp

This question/answer has some good solutions for deleting identical lines in a file, but won't work in my case since the otherwise duplicate lines have a timestamp. Is it possible to tell awk to ignore the first 26 characters of a line in…
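`awk` can deduplicate while skipping a fixed-width prefix. Assuming the timestamp occupies exactly the first 26 characters (the filename `logfile` is a placeholder), a sketch:

```shell
# Keep the first occurrence of each line, comparing only from character 27 on.
# seen[] counts each suffix; the pattern is true (prints) only on its first hit.
awk '!seen[substr($0, 27)]++' logfile
```

If the timestamp width varies, keying on everything after the first space (`substr($0, index($0, " ") + 1)`) is a more forgiving variant.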
a coder
7
votes
2 answers

Is there a block-level storage file system?

I'm looking for a file system that stores files by block content, so that similar files share blocks. This is for backup purposes. It is similar to what block-level backup storage such as zbackup proposes, but I'd like a Linux file…
MappaM
7
votes
1 answer

What does a rmlint's "clone" for btrfs do?

I was reading the rmlint manual, and two of the duplicate handlers are clone and reflink: · clone: btrfs only. Try to clone both files with the BTRFS_IOC_FILE_EXTENT_SAME ioctl(3p). This will physically delete duplicate extents. Needs at least…
Dan
6
votes
3 answers

Finding duplicate files with same filename AND exact same size

I have a huge songs folder with a messy structure and files duplicated in multiple folders. I need a recommendation for a tool or a script that can find and remove duplicates with simple two matches: Exact same file name Exact same file size In…
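A hedged detection-only sketch for the name-and-size match, assuming GNU `find` and no newlines in filenames:

```shell
# List (filename, size) pairs that occur more than once anywhere under the tree.
# %f is the basename, %s the size in bytes; detection only -- review before deleting.
find . -type f -printf '%f\t%s\n' | sort | uniq -d
```

Note that matching only on name and size can flag false positives (different songs trimmed to the same length); adding a checksum pass over the candidates is cheap insurance.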
6
votes
2 answers

How does tar deal with hardlinked files?

I have 2.5 TB of data that I want to put in a 2TB hard drive to mail somewhere. It's not hopeless, as a very large fraction of the data consists of duplicate files. I am considering using jdupes with the -H option, which will replace duplicate…
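For context: tar stores a hardlink as a small link entry rather than a second copy of the data, so hardlinking duplicates before archiving does shrink the archive. A quick check with GNU tar:

```shell
cd "$(mktemp -d)"
printf 'payload\n' > orig
ln orig copy                  # hardlink: both names share one inode and one data copy
tar -cf out.tar orig copy
tar -tvf out.tar              # GNU tar lists the second name as "copy link to orig"
```

The second entry in the listing has size 0 and the `h` type flag, confirming the data is stored only once in the archive.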
Dan
5
votes
1 answer

Are tars deduplicatable at the block level?

Quite simply, when a tar file is made on disk, would the extents be deduplicatable with extents inside and/or outside of the tar? I am asking in the theoretical sense, so if the extents of data are identical inside the tar (no shifting, or splitting…
flungo
5
votes
1 answer

How to get tar to hardlink identical content on-the-fly during archive creation?

How can I get tar (or any other program that is commonly available on Linux, e.g. pax) to hardlink duplicate content on-the-fly during archive creation? That is, I'd like to avoid hardlinking upfront and instead would like tar to take care of it. Is…
0xC0000022L