Questions tagged [deduplication]

For questions where a task is to be applied to only one instance among multiple copies of data (files or blocks of data on a filesystem, or strings in a text), or where duplicates after the first such instance are to be ignored to save space or time.

Deduplication is the elimination or disregarding of identical copies of data during processing. It mainly occurs in two contexts:

  • Space saving/speedup on file storage and transfer systems. This can mean scanning a file system for multiple copies of the same file and removing all but one of those found. On a lower level, the same can apply to blocks of data on the filesystem. Alternatively, it can mean identifying files or data blocks already encountered when transferring/backing up data, and skipping any duplicates to reduce backup size/transfer volume.
  • Eliminating/preventing repeated copies of a (sub)string in a larger string or text file. In this case, the task may be to scan a given string for multiple instances of a given substring, or when appending text to a file or string, identifying which parts of the text to be added are already present on the destination and skipping them upon output.
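The first context above, duplicate detection on a file system, can be sketched with standard tools. A minimal detection-only example, assuming GNU coreutils (`md5sum`, `uniq -w`):

```shell
# Group files by content hash; any hash appearing more than once marks duplicates.
# Detection only -- nothing is deleted here.
find . -type f -exec md5sum {} + | sort | uniq -w32 -D
```

`uniq -w32` compares only the 32-character MD5 prefix of each line, and `-D` prints every member of each duplicate group so you can review them before acting.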
75 questions
201
votes
20 answers

Is there an easy way to replace duplicate files with hardlinks?

I'm looking for an easy way (a command or series of commands, probably involving find) to find duplicate files in two directories, and replace the files in one directory with hardlinks of the files in the other directory. Here's the situation: This…
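A minimal sketch of the hardlink-replacement idea the question asks about, assuming the two trees (here the placeholder names `a/` and `b/`) contain files at matching relative paths and that no filenames contain newlines:

```shell
# For each file under b/, if the same-named file under a/ has identical content,
# replace the copy in b/ with a hardlink to the one in a/. Contents are verified
# with cmp first, so only true duplicates are linked.
cd b
find . -type f | while read -r f; do
    if cmp -s "../a/$f" "$f"; then
        ln -f "../a/$f" "$f"
    fi
done
```

Dedicated tools such as jdupes or rdfind handle differing paths and edge cases far more robustly; this only illustrates the mechanism.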
Josh
17
votes
1 answer

How to duplicate a file without copying its data with btrfs?

I have no experience with btrfs, but it's advertised to be able to de-duplicate files. In my application, I'd need to duplicate whole directory trees. From what I learned, btrfs only de-duplicates in some post scan, not immediately. Even just using…
Udo G
13
votes
5 answers

How to find duplicate lines in many large files?

I have ~30k files. Each file contains ~100k lines. A line contains no spaces. The lines within an individual file are sorted and duplicate free. My goal: I want to find all duplicate lines across two or more files and also the names of the files…
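Because each file is already sorted and internally duplicate-free, one sketch of the cross-file part is a merge: any line that survives `uniq -d` must occur in at least two files (filenames below are placeholders):

```shell
# Merge pre-sorted files without re-sorting, then keep lines seen more than once.
# Valid only because no single file contains a duplicate, so any repeat in the
# merged stream must come from different files.
sort -m file1 file2 file3 | uniq -d
```

Recovering *which* files contain each duplicate line is a second pass, e.g. `grep -Flx` per line, and is where the real cost lies at this scale.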
12
votes
2 answers

Are there any deduplication scripts that use btrfs CoW as dedup?

Looking for deduplication tools on Linux, there are plenty; see e.g. this wiki page. Almost all scripts do either only detection, printing the duplicate file names, or removing duplicate files by hardlinking them to a single copy. With the rise of…
Peter Smit
10
votes
2 answers

How to find data copies of a given file in Btrfs filesystem?

I have deduplicated my Btrfs filesystem with bedup, so now all duplicate files (above a certain size) are "reflink" copies. Is there any way to see, given a filename, what other files reflink the same data?
Peter Smit
9
votes
1 answer

Is there a way to enable reflink on an existing XFS filesystem?

I currently have a 4TB RAID 1 setup on a small, personal Linux server, which is formatted as XFS in LVM. I am interested in enabling the reflink feature of XFS, but I did not do so when I first created the filesystem (I used the defaults). Is there…
TheSola10
9
votes
2 answers

Deduplication on partition level

What are available solutions for block-level or finer-grained deduplication? There are file-based ones with a "copy-on-write" approach. I'm looking for block level "copy-on-write", so I could periodically look for common blocks, or - preferably -…
Grzegorz Wierzowiecki
9
votes
1 answer

Make tar (or other) archive, with data block-aligned like in original files for better block-level deduplication?

How can one generate a tar file, so the contents of tarred files are block-aligned like in the original files, so one could benefit from block-level deduplication ( https://unix.stackexchange.com/a/208847/9689 )? (Am I correct that there is nothing…
Grzegorz Wierzowiecki
9
votes
5 answers

Remove duplicate lines from a file that contains a timestamp

This question/answer has some good solutions for deleting identical lines in a file, but won't work in my case since the otherwise duplicate lines have a timestamp. Is it possible to tell awk to ignore the first 26 characters of a line in…
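`awk` can deduplicate while skipping a fixed-width prefix. Assuming the timestamp occupies exactly the first 26 characters (the filename `logfile` is a placeholder), a sketch:

```shell
# Keep the first occurrence of each line, comparing only from character 27 on.
# seen[] counts each suffix; the pattern is true (prints) only on its first hit.
awk '!seen[substr($0, 27)]++' logfile
```

If the timestamp width varies, keying on everything after the first space (`substr($0, index($0, " ") + 1)`) is a more forgiving variant.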
a coder
7
votes
2 answers

Is there a block-level storage file system?

I'm looking for a file system that stores files by block content, so that similar files share blocks. This is for backup purposes. It is similar to what block-level backup storage such as zbackup proposes, but I'd like a Linux file…
MappaM
7
votes
1 answer

What does a rmlint's "clone" for btrfs do?

I was reading the rmlint manual, and two of the duplicate handlers are clone and reflink: · clone: btrfs only. Try to clone both files with the BTRFS_IOC_FILE_EXTENT_SAME ioctl(3p). This will physically delete duplicate extents. Needs at least…
Dan
6
votes
3 answers

Finding duplicate files with same filename AND exact same size

I have a huge songs folder with a messy structure and files duplicated in multiple folders. I need a recommendation for a tool or a script that can find and remove duplicates with simple two matches: Exact same file name Exact same file size In…
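A hedged detection-only sketch for the name-and-size match, assuming GNU `find` and no newlines in filenames:

```shell
# List (filename, size) pairs that occur more than once anywhere under the tree.
# %f is the basename, %s the size in bytes; detection only -- review before deleting.
find . -type f -printf '%f\t%s\n' | sort | uniq -d
```

Note that matching only on name and size can flag false positives (different songs trimmed to the same length); adding a checksum pass over the candidates is cheap insurance.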
6
votes
2 answers

How does tar deal with hardlinked files?

I have 2.5 TB of data that I want to put in a 2TB hard drive to mail somewhere. It's not hopeless, as a very large fraction of the data consists of duplicate files. I am considering using jdupes with the -H option, which will replace duplicate…
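For context: tar stores a hardlink as a small link entry rather than a second copy of the data, so hardlinking duplicates before archiving does shrink the archive. A quick check with GNU tar:

```shell
cd "$(mktemp -d)"
printf 'payload\n' > orig
ln orig copy                  # hardlink: both names share one inode and one data copy
tar -cf out.tar orig copy
tar -tvf out.tar              # GNU tar lists the second name as "copy link to orig"
```

The second entry in the listing has size 0 and the `h` type flag, confirming the data is stored only once in the archive.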
Dan
5
votes
1 answer

Are tars deduplicatable at the block level?

Quite simply, when a tar file is made on disk, would the extents be deduplicatable with extents inside and/or outside of the tar? I am asking in the theoretical sense, so if the extents of data are identical inside the tar (no shifting, or splitting…
flungo
5
votes
1 answer

How to get tar to hardlink identical content on-the-fly during archive creation?

How can I get tar (or any other program that is commonly available on Linux, e.g. pax) to hardlink duplicate content on-the-fly during archive creation? That is, I'd like to avoid hardlinking upfront and instead would like tar to take care of it. Is…
0xC0000022L