
I am currently copying a large number of directories and files recursively on the same disk using `cp -r`.

Is there a way to do this more quickly? Would compressing the files first be better, or maybe using rsync?

Jeff Schaller
CJ7
  • If this is on zfs, you can make a snapshot, which is practically instantaneous. The cost of the copy (both in time and in disk space) is then only paid when one of the sides is modified. I don't know what commands to use for this, I encourage someone who does to post an answer explaining how to do it. – Gilles 'SO- stop being evil' Aug 25 '16 at 22:08
  • If you could post the output of `iostat` while this copy operation is running, you might get more help from readers. Assuming you're running on Solaris from the `/solaris` tag, post several lines from `iostat -sndzx 2`. That will emit an output line every 2 seconds, with the first line being not very useful. Again, that needs to be run *while your `cp -r ...` command is running*. – Andrew Henle Aug 27 '16 at 11:30

4 Answers


I was recently puzzled by the sometimes slow speed of cp. Specifically, how could `df = pandas.read_hdf('file1', 'df')` (700ms for a 1.2GB file) followed by `df.to_hdf('file2')` (530ms) be so much faster than `cp file1 file2` (8s)?

Digging into this:

  • `cat file1 > file2` isn't any better (8.1s).
  • `dd bs=1500000000 if=file1 of=file2` isn't either (8.3s).
  • `rsync file1 file2` is worse (11.4s), because file2 already existed, so rsync tries its rolling-checksum and block-update magic.

Oh, wait a second! How about unlinking (deleting) file2 first if it exists?

Now we are talking:

  • `rm -f file2`: 0.2s (to add to any figure below).
  • `cp file1 file2`: 1.0s.
  • `cat file1 > file2`: 1.0s.
  • `dd bs=1500000000 if=file1 of=file2`: 1.2s.
  • `rsync file1 file2`: 4s.

So there you have it. Make sure the target files don't exist (or truncate them, which is presumably what `pandas.to_hdf()` does).

Edit: these timings were taken without dropping the page cache before each command; as noted in the comments, doing so just consistently adds ~3.8s to all the numbers above.

Also noteworthy: this was tried on various Linux versions (CentOS with a 2.6.18-408.el5 kernel, Ubuntu with a 3.13.0-77-generic kernel), on ext4 as well as ext3. Interestingly, on a MacBook running macOS 10.12.6 (Darwin), there is no difference: both cases (with or without an existing file at the destination) are fast.
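The fast path boils down to unlinking (or truncating) the destination before the copy. A minimal sketch, using the answer's file names but an illustrative (much smaller) size:

```shell
set -e
# Create a test source file (the answer used a 1.2GB file; 4MB here for illustration)
dd if=/dev/zero of=file1 bs=1M count=4 2>/dev/null
rm -f file2                 # unlink any pre-existing destination first
cp file1 file2              # now takes the fast path
cmp file1 file2 && echo "copies match"
```

Per the comments, `truncate -s 0 file2` before the copy gives the same fast timing as removing the file.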

Pierre D
  • Did you account for the source file contents potentially being held in cache? – Andrew Henle Jul 08 '18 at 18:17
  • @AndrewHenle: good point, but same conclusions when clearing the cache (using `sudo sh -c 'sync && echo 3 > /proc/sys/vm/drop_caches'`) before every command. Just adding ~3.8s to all the numbers above. The delta between `cp` to an existing file vs to an non-existent destination is as above: ~7s. – Pierre D Jul 09 '18 at 04:02
  • So, *why* was it faster to `rm` first, then `cp/cat` rather than just plain `cp/cat`? Fragmentation? Something else? – jrw32982 Dec 09 '20 at 22:55
  • I am not sure, but it may be related to the way overwriting a large file is implemented on certain filesystems, e.g. `ext3` and `ext4`. BTW, I also just tested with `truncate -s 0 file2 && cp file1 file2`: same timing as `rm -f file2 && cp file1 file2` (fast). – Pierre D Dec 09 '20 at 23:30
  • "_it tries to do its rolling checksum and block update magic_" - not for local to local copies – roaima Jan 16 '23 at 09:06

On the same partition (and filesystem) you can use `-l` to create hard links instead of copies. Creating hard links is much faster than copying data (but, of course, does not work across different disk partitions).

As a small example:

$ time cp -r mydir mydira

real    0m1.999s
user    0m0.000s
sys     0m0.490s

$ time cp -rl mydir mydirb

real    0m0.072s
user    0m0.000s
sys     0m0.007s

That's roughly a 28-fold improvement. But that test used only ~300 (rather small) files. With a few larger files the relative speedup would be even greater; with many more small files it would shrink.
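To make the semantic caveat from the comments below concrete, here is a small sketch showing that a hard link shares data with the original rather than copying it:

```shell
echo original > file1
ln file1 file2               # hard link: a second name for the same inode, not a copy
echo appended >> file2
cat file1                    # file1 shows the appended line too; there is no independent copy
```

This is exactly why `-l` does not satisfy a requirement for independent, editable copies.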

grochmal
  • It's on the same partition – CJ7 Aug 25 '16 at 02:23
  • What are hard-links? I need actual copies of the files to play around with. – CJ7 Aug 25 '16 at 02:24
  • Hard links make each filename map to the same file; they're not copies. If you modify the new name you modify the original. – Stephen Harris Aug 25 '16 at 02:27
  • @CJ7 - Hard links are just extra inodes pointing to the same data. If you change the copy the original file is changed too. – grochmal Aug 25 '16 at 02:27
  • Also I'm on solaris and my cp does not have a -l option. – CJ7 Aug 25 '16 at 02:28
  • @CJ7 - Yeah, posix cp will not have `-l`. You will need to go with `find -exec ln -P ... \;`. But then again, you're after copies not hard links. – grochmal Aug 25 '16 at 02:34
  • This doesn't meet the requirements; hard links to existing files are not copies; modify the "copy" and you modify the original. – Stephen Harris Aug 25 '16 at 02:38
  • @StephenHarris - Well, it may meet the requirements of someone who get here from a google search. In U&L theory it answers the question at hand. The comment discussion with OP is an extra that shows that the answer does not meet OPs requirements. (it's the difference between *answer to question* and *most useful answer to OP*). – grochmal Aug 25 '16 at 02:47
  • @grochmal The semantics of "cp -r" and creating hard links are totally different. You might think the OP had a "XY problem", but the question wasn't phrased that way. A big failure mode is in second guessing the question; in this scenario it's better to ask via comments what the question really means. – Stephen Harris Aug 25 '16 at 03:13
  • @StephenHarris - Semantics are heavily subjective, and you're too heated today/yesterday. I use hard links a lot in my home directory for an example. Had I searched for different ways of doing `cp` I'd like to see an alternative with hard links. If I remember correctly (that was several years ago), had I learned about hard links sooner, i would have used them sooner. – grochmal Aug 25 '16 at 12:10
  • A hard link isn't a copy. It's another pointer to the exact same data, the exact same inode (which is why you can't do a hard link across file-systems). If you *hard link* file1 to file2, then editing either link will result in both pointing to the same modified data (quite similar to a symbolic link)...whereas if you *copy* file1 to file2, then they are two completely separate files which can be viewed/edited completely independently of each other. – cas Jan 16 '23 at 06:50

Copying a file on the local disk spends 99% of its time reading from and writing to the disk. If you compress the data first, you add CPU load but don't reduce the amount of data read and written... it will actually slow your copy down.

rsync will help if you already have an older copy of the data and want to bring it up to date.

But if you want to create a brand-new copy of a tree, you can't really do much better than your `cp` command.
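One traditional alternative worth sketching: on systems whose `cp` lacks extras (like the Solaris box in the question), a tar pipe copies a whole tree in a single streaming pass. This is a sketch, not something benchmarked here; as the comments note, whether it (or `cpio`) actually beats `cp -r` would need measuring:

```shell
mkdir -p srcdir/sub destdir
echo hello > srcdir/sub/f.txt
# Stream the tree through a pipe: one tar archives, the other extracts
(cd srcdir && tar cf - .) | (cd destdir && tar xf -)
```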

Stephen Harris
  • They could use CoW snapshots. That's essentially creating a new copy of the files and you only have the initial snapshot operation and subsequent increased latency for writes to the new "copies" – Bratchley Aug 25 '16 at 02:32
  • @Bratchley did you _read_ the question? CoW does not meet the requirements (and it's Solaris, anyway). – Stephen Harris Aug 25 '16 at 02:37
  • Well, you might do some form of tmpfs, mount it somewhere, CoW onto it, and then edit the "copies". But (1) that's horribly far fetched, and (2) not on solaris. – grochmal Aug 25 '16 at 02:39
  • It does meet the requirements. It creates an additional set of copies. Also Solaris [quite famously](https://docs.oracle.com/cd/E19253-01/819-5461/6n7ht6r4f/) supports snapshots. – Bratchley Aug 25 '16 at 02:39
  • @grochmal If it's reasonably current version of Solaris, it's almost certainly going to be using ZFS which supports snapshotting. There are then a variety of ways to get those "copies" to show up in a desired part of the filesystem. – Bratchley Aug 25 '16 at 02:42
  • @Bratchley - I forgot that Solaris has ZFS, yeah, that's actually a good bet. I know little about ZFS though. – grochmal Aug 25 '16 at 02:44
  • ha yeah. Solaris is the birthplace for ZFS ;-) – Bratchley Aug 25 '16 at 02:45
  • I wonder if `dd` would be faster with the right block size. I no longer have handy access to a Solaris machine to test that. – Eric Aug 25 '16 at 03:05
  • @Eric `dd` can only copy a single file or of a whole filesystem image, but not a directory tree, which is what was asked here. – Gilles 'SO- stop being evil' Aug 25 '16 at 22:03
  • @gilles I suppose you could use `find` to run `dd` for a quicker copy, but you're right that I missed the recursive part of the question. I wonder if `cpio` or `tar` would be quicker than `cp`. – Eric Aug 27 '16 at 11:36
  • Rsync won't do delta magic on local copies. At best it becomes like cp; in practice it's a little slower. – roaima Jul 09 '18 at 07:06
  • zfs snapshots **don't** make another copy of the files. It copies the list of blocks currently in use in a particular dataset (and zfs won't re-use those blocks as long as they're referenced by a snapshot, dataset, or clone), that's why snapshots are so fast - the only data copied is the list of block-pointers. Also, snapshots cover an entire filesystem (dataset), not individual files or sub-directories. You can fish out old files by looking in the dataset's `.zfs/` directory or by mounting the snapshot. But `zfs snapshot` not a substitute for copy or rsync. – cas Jan 16 '23 at 06:34
  • `zfs send`-ing a snapshot can be a substitute for rsync , but that's more useful for backups or copying entire datasets to another pool than for simple file copying. `zfs send` is **much** faster than `rsync` because it already knows which blocks belong to a snapshot so it can just send those blocks with no need for any comparisons - e.g. when i switched from `rsync` to `zfs send` for my backups, the backup time was reduced from **hours** every night to just minutes. – cas Jan 16 '23 at 06:40

For copying a large number of directories, you can actually do better than `cp` by parallelizing the copies and using copy acceleration.

Parallelizing the copies will ensure you saturate your drive. Modern SSDs (and to some extent HDDs) perform better when they receive many I/O requests since those requests can be re-ordered/batched/cached for optimal performance. Single-threaded copy stands no chance of saturating an SSD unless the files being copied are massive and the OS performs pre-fetching. On the other hand, multi-threaded copy makes sure many file reads and writes are occurring at the same time.
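A rough way to get that parallelism with stock tools (assuming GNU findutils for `xargs -P` and `{}` substitution; the `src`/`dst` names are placeholders):

```shell
mkdir -p src/sub
echo a > src/a.txt
echo b > src/sub/b.txt
# 1) recreate the directory skeleton, 2) copy files with up to 4 concurrent cp processes
(cd src && find . -type d -exec mkdir -p "../dst/{}" \;)
(cd src && find . -type f -print0 | xargs -0 -P4 -I{} cp "{}" "../dst/{}")
```

A dedicated multi-threaded copier avoids the per-file `cp` process overhead this approach pays.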

Copy acceleration is only available on some file systems, but trumps all else because it doesn't actually perform the copy. Instead, it marks the original file as having been "COWed" and when either file is written to later, the actual copy will be performed. You might say that's just delaying the work, but the "later" part gives us extra information. For example, if only some disk blocks were changed, the file system could copy/create just those new blocks and keep pointers to the other blocks from the original file. Or maybe the file system doesn't support block-level copy granularity, but your changes completely overwrote some disk blocks... those don't need to be copied anymore. My point is that copy acceleration is more than just "defer the work," it lets us see into the future.
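On Linux, GNU `cp` already exposes this where the filesystem supports it (e.g. btrfs, or XFS with reflinks) via `--reflink`; a sketch:

```shell
echo data > file1
# Ask for a CoW clone; with --reflink=auto, cp silently falls back to a
# normal copy on filesystems without reflink support (e.g. ext4)
cp --reflink=auto file1 file2
cmp file1 file2 && echo "clone matches"
```

With `--reflink=always`, `cp` instead fails outright when the filesystem cannot clone.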

io_uring doesn't yet support copy acceleration, but once it does, using it will enable further efficiency gains through parallelizing the various operations required to perform a copy with minimal overhead.


I've created a multi-threaded replacement for cp with the sole purpose of being the fastest way to copy files, period. It currently doesn't come out on top when copying a single directory with no nesting, but I expect that to change once Linux supports copy acceleration in io_uring.

The tool: https://github.com/SUPERCILEX/fuc/tree/master/cpz
Benchmarks: https://github.com/SUPERCILEX/fuc/tree/master/comparisons#copy

SUPERCILEX