I am currently copying a large number of directories and files recursively on the same disk using cp -r.
Is there a way to do this more quickly? Would compressing the files first be better, or maybe using rsync?
I was recently puzzled by the sometimes slow speed of cp. Specifically, how come df = pandas.read_hdf('file1', 'df') (700ms for a 1.2GB file) followed by df.to_hdf('file2') (530ms) could be so much faster than cp file1 file2 (8s)?
Digging into this:
- `cat file1 > file2` isn't any better (8.1s).
- `dd bs=1500000000 if=file1 of=file2` isn't either (8.3s).
- `rsync file1 file2` is worse (11.4s), because file2 already existed, so rsync tried its rolling-checksum and block-update magic.

Oh, wait a second! How about unlinking (deleting) file2 first, if it exists?
Now we are talking:
- `rm -f file2`: 0.2s (add this to any figure below).
- `cp file1 file2`: 1.0s.
- `cat file1 > file2`: 1.0s.
- `dd bs=1500000000 if=file1 of=file2`: 1.2s.
- `rsync file1 file2`: 4s.

So there you have it: make sure the target files don't exist (or truncate them, which is presumably what pandas.to_hdf() does).
Edit: this was without emptying the cache before any of the commands, but as noted in the comments, doing so just consistently adds ~3.8s to all numbers above.
Also noteworthy: this was tried on various Linux versions (Centos w. 2.6.18-408.el5 kernel, and Ubuntu w. 3.13.0-77-generic kernel), and ext4 as well as ext3. Interestingly, on a MacBook with Darwin 10.12.6, there is no difference and both versions (with or without existing file at the destination) are fast.
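The experiment above is easy to reproduce; here is a sketch with a small placeholder file (file1/file2 stand in for the question's 1.2GB file, so absolute timings will obviously differ, but the existing-target vs. fresh-target difference is what to look for):

```shell
# Create a small sample source file (placeholder for the 1.2GB original).
dd if=/dev/zero of=file1 bs=1M count=4 2>/dev/null

cp file1 file2        # create the target once
time cp file1 file2   # slow path: copying over an existing target

rm -f file2           # unlink the target first...
time cp file1 file2   # ...fast path: copying to a brand-new target
cmp file1 file2 && echo "copies match"
```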
On the same partition (and filesystem) you can use -l to achieve hard links instead of copies. Hard link creation is much faster than copying things (but, of course, does not work across different disk partitions).
As a small example:
$ time cp -r mydir mydira
real 0m1.999s
user 0m0.000s
sys 0m0.490s
$ time cp -rl mydir mydirb
real 0m0.072s
user 0m0.000s
sys 0m0.007s
That's roughly a 28× improvement. But that test used only ~300 (rather small) files. With a few big files the speedup should be even larger (no data is copied at all), while with a great many tiny files it should shrink, since the per-file cost of creating each link remains.
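You can confirm that `-l` really created hard links rather than copies by comparing inode numbers; hard-linked paths share one inode. A miniature version of the `mydir`/`mydirb` example above (directory and file names are just for illustration):

```shell
# Build a tiny tree, "copy" it with hard links, then inspect the inodes.
mkdir -p mydir && echo data > mydir/f
cp -rl mydir mydirb
ls -i mydir/f mydirb/f    # both paths print the same inode number
stat -c %h mydirb/f       # link count is now 2 (GNU stat)
```

Any write through one path is visible through the other, which is exactly why hard-link "copies" are only suitable when you don't intend to modify either tree independently.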
Copying a file on the local disk spends ~99% of its time reading from and writing to the disk. Compressing the data first adds CPU load without reducing the amount of data that has to be read and written, so it will actually slow your copy down.
rsync helps if you already have an older copy of the data and just need to bring it up to date.

But if you want to create a brand-new copy of a tree, you can't really do much better than your cp command.
For copying a large number of directories, you can actually do better than cp by parallelizing the copies and using copy acceleration.
Parallelizing the copies will ensure you saturate your drive. Modern SSDs (and to some extent HDDs) perform better when they receive many I/O requests since those requests can be re-ordered/batched/cached for optimal performance. Single-threaded copy stands no chance of saturating an SSD unless the files being copied are massive and the OS performs pre-fetching. On the other hand, multi-threaded copy makes sure many file reads and writes are occurring at the same time.
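As a rough sketch of the parallelization idea using only standard GNU tools (`src_tree`/`dst_tree` are made-up names; a dedicated tool handles symlinks, permissions, and error cases far better than this does):

```shell
# Hypothetical source tree, just so the commands below have something to copy.
src=src_tree; dst=dst_tree
mkdir -p "$src/a" "$src/b"
echo one > "$src/a/f1"; echo two > "$src/b/f2"

# 1) Recreate the directory structure serially (cheap).
(cd "$src" && find . -type d -print0) | xargs -0 -I{} mkdir -p "$dst/{}"

# 2) Copy the files with up to 8 cp processes in flight, keeping the
#    drive's queue full with concurrent read/write requests.
(cd "$src" && find . -type f -print0) | xargs -0 -P8 -I{} cp "$src/{}" "$dst/{}"
```

Spawning one `cp` per file is wasteful for huge numbers of tiny files (process startup dominates), which is part of why a purpose-built multi-threaded copier wins.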
Copy acceleration is only available on some file systems, but trumps all else because it doesn't actually perform the copy. Instead, it marks the original file as having been "COWed" and when either file is written to later, the actual copy will be performed. You might say that's just delaying the work, but the "later" part gives us extra information. For example, if only some disk blocks were changed, the file system could copy/create just those new blocks and keep pointers to the other blocks from the original file. Or maybe the file system doesn't support block-level copy granularity, but your changes completely overwrote some disk blocks... those don't need to be copied anymore. My point is that copy acceleration is more than just "defer the work," it lets us see into the future.
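On file systems that support copy-on-write cloning (e.g. Btrfs, or XFS formatted with reflink support), GNU cp exposes copy acceleration through its `--reflink` option. A small sketch (`big.img`/`clone.img` are placeholder names):

```shell
# Placeholder source file for the clone.
dd if=/dev/zero of=big.img bs=1M count=8 2>/dev/null

# Clone if the file system supports it, otherwise fall back to a normal copy.
cp --reflink=auto big.img clone.img
# cp --reflink=always big.img clone.img   # would fail instead of falling back
cmp big.img clone.img && echo "identical"
```

With a successful reflink, the "copy" completes almost instantly regardless of file size, because only metadata is written until one of the files is modified.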
io_uring doesn't yet support copy acceleration, but once it does, using it will enable further efficiency gains through parallelizing the various operations required to perform a copy with minimal overhead.
I've created a multi-threaded replacement for cp with the sole purpose of being the fastest way to copy files, period. It currently doesn't come out on top when copying a single directory with no nesting, but I expect that to change once Linux supports copy acceleration in io_uring.
The tool: https://github.com/SUPERCILEX/fuc/tree/master/cpz
Benchmarks: https://github.com/SUPERCILEX/fuc/tree/master/comparisons#copy