
I have a large folder (2 TB, about 1,000,000 files) on a Linux machine. I want to build a package with tar. I do not care about the size of the tar file, so I do not need to compress the data. How can I speed tar up? It takes me an hour to build a package with `tar -cf xxx.tar xxx/`. I have a powerful CPU with 28 cores and 500 GB of memory. Is there a way to make tar run multithreaded?

Or, alternatively, is there any good way to transfer a large number of small files between different folders and between different servers? My filesystem is ext4.

terdon
Guo Yong
  • Tar does not compress. – ctrl-alt-delor Apr 17 '19 at 16:30
  • What will you do with the tar file when you have it? This may affect the answer, as a speed-up in one area may slow down another, or vice versa. – ctrl-alt-delor Apr 17 '19 at 16:31
  • The CPU and the number of cores don't make much difference, as the operation of creating a `tar` archive is disk-bound, not CPU-bound. You could use several `tar` processes running in parallel, each handling their own subset of the files and creating separate archives, but they would still need to fetch all the data from the single disk. – Kusalananda Apr 17 '19 at 16:32
  • @ctrl-alt-delor After creating the tar file, I will transfer it over the network or just `mv` it to another folder. – Guo Yong Apr 17 '19 at 16:46
  • @Kusalananda Thanks for your suggestion. Do you have any idea how to automatically launch the tar tasks and automatically split the folders? – Guo Yong Apr 17 '19 at 16:48
  • @GuoYong The ideal split would be a combination of number of files and aggregate disk usage. If you're looking to copy the files elsewhere to another server, why not just use `scp` and skip the `tar` phase entirely? – roaima Apr 17 '19 at 17:11
  • reiserfs usually deals better with a lot of smaller files in a directory (though it is no longer being developed, it is already starting to show some bugs). You might prefer to make a loop image and mount that over the directory, to avoid messy changes to your installation. – Radovan Garabík Apr 21 '19 at 15:52
  • @Kusalananda In theory, tar is disk-bound, but GNU tar has been implemented in an inefficient way, so you cannot reach the expected disk speed. I recommend trying `star`, which uses only approx. 1/3 of the CPU time GNU tar needs for the same task. – schily Apr 23 '19 at 15:36

2 Answers


As @Kusalananda says in the comments, tar is disk-bound. One of the best things you can do is put the output on a separate disk, so that writing the archive doesn't slow down reading the source files.

If your next step is to move the file across the network, I'd suggest that you create the tar file over the network in the first place:

$ tar -cf - xxx/ | ssh otherhost 'cat > xxx.tar'

This way the local host only has to read the files, and doesn't also have to accommodate the write bandwidth consumed by tar. The disk output from tar is absorbed by the network connection and the disk system on otherhost.
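As @muru notes below, if serverB is the final destination, you can go one step further and stream the archive straight into `tar -x` on the receiving side, so no `.tar` file is ever written on either host. A minimal local sketch of that pipeline, using hypothetical `/tmp` paths (over a network, the extracting `tar` would run behind `ssh otherhost '…'`):

```shell
# Hypothetical demo paths; over the network, replace the extracting side
# with: ssh otherhost 'tar -xf - -C /destination'
mkdir -p /tmp/tar_src /tmp/tar_dst
echo "hello" > /tmp/tar_src/file.txt

# Stream the archive straight into extraction: the archive exists only
# in the pipe, so neither side ever writes a .tar file to disk.
tar -cf - -C /tmp tar_src | tar -xf - -C /tmp/tar_dst

# The source tree now exists under the destination directory.
ls /tmp/tar_dst/tar_src
```

The `-C` flag makes each `tar` change directory first, so the archive contains `tar_src/…` rather than absolute paths.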

Jim L.
  • Better yet, you might just untar at the other end, and save half the writes there. OP seems to just want to transfer some files. – muru Apr 17 '19 at 17:33
  • @muru Thanks for the suggestion. I want to: 1. tar the files on serverA; 2. nc the tar file to serverB; 3. untar the tar file on serverB. Do you have any way to improve this process? – Guo Yong Apr 18 '19 at 08:45
  • @Jim L. Thanks for your help. My plan is the same: tar on serverA, nc to serverB, untar on serverB. Do you have any way to improve this process? – Guo Yong Apr 18 '19 at 08:46
  • @GuoYong Sounds like you want `rsync`, not `tar`. – Kusalananda Apr 23 '19 at 15:44
  • Is this approach appropriate and reliable for an automatic backup cron job between different hosts? @Jim L. – Renato Junior Jan 29 '22 at 16:28

Or, alternatively, is there any good way to transfer a large number of small files between different folders and between different servers? My filesystem is ext4.

Rsync over ssh is something I use on a regular basis. It preserves file permissions, symlinks, etc., when used with the `--archive` (`-a`) option:

rsync -av /mnt/data <server>:/mnt

This example copies the local directory /mnt/data and its contents to a remote server inside /mnt. It invokes ssh to set up the connection. No rsync daemon is needed on either side of the wire.

This operation can also be performed between 2 local directories, or from remote to local.

Jeff Schaller
Tim