
In a common Linux distribution, do utilities like rm, mv, ls, grep, wc, etc. run in parallel on their arguments?

In other words, if I grep a huge file on a 32-threaded CPU, will it go faster than on a dual-core CPU?

fpmurphy
homocomputeris
  • Note that starting a thread is relatively expensive, compared to many of the simple system calls; and then, multiple threads accessing the same resource generally results in the kernel using locks to process them in sequence. So many of those basic utilities have nothing to gain and a lot to lose from parallelization. – Guntram Blohm Jun 25 '20 at 08:20
  • @GuntramBlohmsupportsMonica this statement applies only to Linux where threads are emulated via processes. – schily Jun 25 '20 at 08:36
  • Is your kernel multi-threaded? Because if not, there is no point in parallelizing system calls such as unlink() (used by `rm`). – Jens Jun 25 '20 at 10:38
  • @Jens Most if not all modern unix kernels (Solaris, Linux, *BSD) are multithreaded. – Gilles 'SO- stop being evil' Jun 25 '20 at 11:06
  • GNU `sort` is one of very few that explicitly and intentionally parallelizes (when feasible; input needs to be a seekable file rather than a FIFO, f/e). That said, POSIX doesn't _require_ any of these tools to parallelize, so it's not really fair to say that the baseline "POSIX utilities" are or aren't; it's individual vendors' implementations that matter. – Charles Duffy Jun 25 '20 at 14:50
  • Disk I/O is fundamentally serialized anyway, so there's not much point to making parallel filesystem tools (rm, mv, etc). All you do is generate contention, which helps nothing. – J... Jun 25 '20 at 16:16
  • @J... Assuming everything is on a single filesystem/disk or that the filesystem/disk does not do any optimization (NCQ/TCQ). – pipe Jun 25 '20 at 19:13
  • @pipe Fair point, but in either case I'd say it's the responsibility of the OS and the driver layer to manage those optimizations. The best case in that situation is that trying to parallelize I/O using threads results in no time penalty other than the parallel overhead. The worst case is contention and degradation of performance. In either case, I think trying to outsmart the filesystem seems generally like a bad idea. If it could store things any faster it would. – J... Jun 26 '20 at 01:08
  • `grep` doesn't run in parallel but if you want to grep in parallel use `ag`: https://github.com/ggreer/the_silver_searcher. It's available in most distros and even on other unixen like Mac OSX as either `ag` or `silver-searcher` or `the-silver-searcher` – slebetman Jun 26 '20 at 04:46
  • @J: On modern SSD, disk I/O can have up to 4 "thread"-like access per disk and if you use something like a recent EPYC you can have up to 128 parallel access total. These are called PCI lanes and they have been around for a long time but individual disks have only started to make use of them with the advent of PCI-based SSD. Older disks kind of have thread-like things as well: DMA channels. But generally DMA still use a single transfer channel but can interlace multiple concurrent requests. (note: the current M.2 spec has a maximum of 4 PCI lanes - some disks use less) – slebetman Jun 26 '20 at 04:53
  • @J With regards to responsibility. For a program like `grep` there is no API that says "read files". All file I/O API reads a single file. So it is still the responsibility of your program to spawn threads or launch multiple file reads in parallel. Programs like `ag` and `git` were written this way but `grep` was not – slebetman Jun 26 '20 at 05:00
  • @slebetman You're not following - I wasn't talking about grep, I was talking about file operations (I even listed them - rm, mv, etc) - answering the question of why commands like `rm` and `mv` don't support "parallel" file operations. I know how PCI lanes work - the point is that it's not up to `rm` and `mv` to know or care about the hardware implementation of the device the filesystem is sitting on top of. If the filesystem can make parallel writes, fine, but it doesn't make sense for that hardware capability to sit at the mercy of the user writing a parallel `mv`. The kernel/driver do this. – J... Jun 26 '20 at 11:26
  • @J Network file systems can have large latencies (on the order of seconds), so it can make sense to issue multiple file operations at once rather than wait for the result of each operation before starting the next. Unfortunately, threadding doesn't help much in this case, because all the threads will quickly block on system calls. You need asynchronous file operation system calls to handle this case. – Vaelus Jun 26 '20 at 12:42
  • @Vaelus Agreed, I think that's what I was saying. – J... Jun 26 '20 at 15:16

4 Answers


You can get a first impression by checking whether the utility is linked with the pthread library. Any dynamically linked program that uses OS threads should use the pthread library.

ldd /bin/grep | grep -F libpthread.so

So for example on Ubuntu:

for x in $(dpkg -L coreutils grep findutils util-linux | grep /bin/); do
  if ldd "$x" | grep -q -F libpthread.so; then echo "$x"; fi
done

However, this produces a lot of false positives due to programs that are linked with a library that itself is linked with pthread. For example, /bin/mkdir on my system is linked with PCRE (I don't know why…) which itself is linked with pthread. But mkdir is not parallelized in any way.

In practice, checking whether the executable contains the string libpthread gives more reliable results. It could miss executables whose parallel behavior is entirely contained in a library, but basic utilities typically aren't designed that way.

dpkg -L coreutils grep findutils util-linux | grep /bin/ | xargs grep pthread               
Binary file /usr/bin/timeout matches
Binary file /usr/bin/sort matches

So the only tool that actually has a chance of being parallelized is sort. (timeout only links to libpthread because it links to librt.) GNU sort does work in parallel: the number of threads can be configured with the --parallel option, and by default it uses one thread per processor up to 8. (Using more processors gives less and less benefit as the number of processors increases, tapering off at a rate that depends on how parallelizable the task is.)
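
As a quick sketch of that option (the input file name here is a placeholder, not from the answer), you can pin the thread count explicitly and compare runs:

```shell
# GNU sort: cap the worker threads explicitly (default: one per CPU, max 8).
# bigfile.txt is a hypothetical large input file.
sort --parallel=1 -o sorted-1thread.txt bigfile.txt
sort --parallel=4 -o sorted-4threads.txt bigfile.txt
```

Timing the two commands (e.g. with `time`) shows whether the extra threads help for your data.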

grep isn't parallelized at all. The PCRE library actually links to the pthread library only because it provides thread-safe functions that use locks and the lock manipulation functions are in the pthread library.

The typical simple approach to benefit from parallelization when processing a large amount of data is to split this data into pieces, and process the pieces in parallel. In the case of grep, keep file sizes manageable (for example, if they're log files, rotate them often enough) and call separate instances of grep on each file (for example with GNU Parallel). Note that grepping is usually IO-bound (it's only CPU-bound if you have a very complicated regex, or if you hit some Unicode corner cases of GNU grep where it has bad performance), so you're unlikely to get much benefit from having many threads.
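
A hedged sketch of that split-and-parallelize approach (the directory and pattern are placeholders), using `xargs -P` so that no extra tool needs to be installed:

```shell
# One grep per file, up to one job per CPU core; -H prefixes each
# match with its file name so the merged output stays attributable.
find /var/log -name '*.log' -print0 |
  xargs -0 -P "$(nproc)" -n 1 grep -H 'pattern'
```

GNU Parallel accepts essentially the same pipeline and additionally keeps each job's output unmixed.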

Gilles 'SO- stop being evil'
  • libpthread is what has been in use in the 1990s. Modern UNIXes moved the related code into libc long time ago. – schily Jun 24 '20 at 22:28
  • @schily All modern unixes include pthread alongside libc, but you still need `-lpthread` when linking with POSIX thread-related functions such as `pthread_create` on modern unices such as [Solaris](https://docs.oracle.com/cd/E53394_01/pdf/E54803.pdf), [Linux](https://www.man7.org/linux/man-pages/man3/pthread_create.3.html), [FreeBSD](https://www.freebsd.org/cgi/man.cgi?query=pthread_create&sektion=3&manpath=FreeBSD+12.1-RELEASE+and+Ports), etc. – Gilles 'SO- stop being evil' Jun 24 '20 at 22:36
  • A stupid question: does linking with `pthread` mean that a program *really* does something useful in parallel? Say, `grep` might count lines in parallel but parse them sequentially. – homocomputeris Jun 24 '20 at 23:23
  • @Gilles'SO-stopbeingevil' You are mistaken. libpthread is nowadays an empty ELF filter library that just exists in order to serve old binaries that expect that library. Recently linked binaries do not need libpthread and in fact do not link against that library. The conclusion is that your method does not work to identify programs that may run in parallel. – schily Jun 25 '20 at 02:55
  • @homocomputeris No, linking with pthread is not sufficient to indicate that the program does something useful. I give the example of GNU `timeout`. It's the other way round: if a program _doesn't_ link with `pthread`, it's unlikely to ever start more than one thread. – Gilles 'SO- stop being evil' Jun 25 '20 at 08:16
  • @schily This may be true on _some_ Unix systems, but the question specifically mentioned “common Linux distributive (sic.)”, where this is not the case. – Gilles 'SO- stop being evil' Jun 25 '20 at 08:18
  • @Gilles'SO-stopbeingevil' The question says `POSIX` and Linux is not `POSIX` compliant, so we primary need to look at certified platforms that do not include Linux. – schily Jun 25 '20 at 08:29
  • @AdminBee The question is about “POSIX utilities”, not about POSIX _platforms_. `grep` is a POSIX utility, whether or not the platform as a whole has been tested for POSIX compliance. (And by the way, “Linux is not POSIX” is wrong, since there exists a platform with a Linux kernel (and a GNU userland) that has been certified for POSIX compliance.) – Gilles 'SO- stop being evil' Jun 25 '20 at 11:09
  • @Gilles'SO-stopbeingevil' you are right, that was my incorrect reading of the question title. – AdminBee Jun 25 '20 at 11:11
  • This answer makes the right conclusion about which POSIX utilities are threaded, but future readers should beware... this answer may be misleading if applied to other utilities. Reasons: Some tools parallelise using [fork()](https://man7.org/linux/man-pages/man2/fork.2.html) which is not part of pthreads. Some tools load in libraries as plugins dynamically which do NOT show up in `ldd`. You cannot rule-out a plugin using threads even if the core program didn't. – Philip Couling Jun 25 '20 at 11:20
  • The manpage of `ldd` mentions you can use `objdump -p /path/to/program | grep NEEDED` to get a list of direct dependencies. – G. Sliepen Jun 25 '20 at 15:08

Another way to find an answer is to use something like sysdig to examine the system calls executed by a process. For example, if you want to see if rm creates any threads (via the clone system call), you could do:

# sysdig proc.name=rm and evt.type=clone and evt.dir='<'

With that running, I did:

$ mkdir foo
$ cd foo
$ touch {1..9999}
$ rm *

And saw no clones -- no threading there. You could repeat this experiment for other tools, but I don't think you'll find that they're threaded.

Note that clone() is the underpinnings of fork() as well, so if a tool starts some other process (e.g., find ... -exec), you'd see that output. The flags will differ from the "create a new thread" use case:

# sysdig proc.name=find and evt.type=clone and evt.dir='<'
...
1068339 18:55:59.702318832 2 find (2960545) < clone res=0 exe=find args=/tmp/foo.-type.f.-exec.rm.{}.;. tid=2960545(find) pid=2960545(find) ptid=2960332(find) cwd= fdlimit=1024 pgft_maj=0 pgft_min=1 vm_size=9100 vm_rss=436 vm_swap=0 comm=find cgroups=cpuset=/.cpu=/user.slice.cpuacct=/user.slice.io=/user.slice.memory=/user.slic... flags=25165824(CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID) uid=1026 gid=1026 vtid=2960545(find) vpid=2960545(find)
Andy Dalton
  • The question says `POSIX` and `clone` is not seen on any `POSIX` platform. This is rather a Linux trick to hide that Linux emulates threads via processes. – schily Jun 25 '20 at 08:34
  • @schily Be careful: “emulating” threads via processes came from Solaris, so you're criticizing Solaris here. Now, can you please stop your persistent trolling about Linux? It's just boring everyone who knows you, and puzzling people who don't know you. – Gilles 'SO- stop being evil' Jun 25 '20 at 08:42
  • You can also test a specific execution with e.g. `strace -e clone rm *` which has the advantage of not requiring any intervention as root. This won't detect a program that has an option for parallelization that defaults to off, but no method's perfectly accurate here anyway. – Gilles 'SO- stop being evil' Jun 25 '20 at 08:44
  • @Gilles'SO-stopbeingevil' you do not seem to know Solaris, so please be careful... – schily Jun 25 '20 at 08:44
  • @schily: The advice is to check system calls for thread creation. Obviously the exact system call depends on the exact OS, and `clone` is literally given as an **example**. – MSalters Jun 25 '20 at 11:22
  • @MSalters on most platforms, this syscall is called `thread_create` or similar. The problem here is that many people only know Linux and incorrectly assume it is the same on other platforms. – schily Jun 25 '20 at 11:34
  • _The question says `POSIX`..._. The question also says "In a common Linux distributive [sic]..." – Andy Dalton Jun 25 '20 at 16:35

See `xargs` or GNU Parallel for how to run them in parallel.

However, the parallelisable part tends toward zero time as more processes are added, leaving the non-parallelisable part, which does not get faster. So there is a limit to how fast a task can get by adding more processes; very quickly you reach a point where adding processes makes very little difference.

Then there is communication overhead: each added process costs something, and once the benefit of adding a process is lower than its cost, adding more actually makes the task slower.
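
This diminishing return is Amdahl's law: with a parallelisable fraction p and n processes, the best-case speedup is 1 / ((1 − p) + p/n). A small sketch (p = 0.9 is an assumed figure, not a measurement):

```shell
# Best-case speedup under Amdahl's law for an assumed 90%-parallel task:
awk 'BEGIN {
  p = 0.9
  for (n = 1; n <= 32; n *= 2)
    printf "n=%2d  speedup=%.2f\n", n, 1 / ((1 - p) + p / n)
}'
# prints:
# n= 1  speedup=1.00
# n= 2  speedup=1.82
# n= 4  speedup=3.08
# n= 8  speedup=4.71
# n=16  speedup=6.40
# n=32  speedup=7.80
```

Even with 32 processes the speedup never approaches 1/(1 − p) = 10, and each doubling of n buys less than the one before.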

ctrl-alt-delor

If you are mainly interested in the utilities you named, it is very unlikely that threaded versions of these commands exist.

Even worse, if such a variant did exist, it would most likely be slower than its single-threaded counterpart.

This is because the utilities you named all perform heavy filesystem interaction, which (if done multi-threaded) would defeat kernel optimizations like read-ahead.

A well-implemented kernel, for example, detects a linear read of a file (such as the one grep performs) and fetches the file content in advance of the program's requests.

A mv operation is a rename inside one or two directories, and that requires a directory lock in the kernel. Another rename operation on the same directories cannot happen at the same time, unless rename were implemented in a non-atomic way.

The oldest free tar implementation (star), on the other hand, has been parallelized for 30 years with respect to its two basic tasks: two processes share a piece of memory, which allows one process to do the archive read/write while the other does the filesystem I/O simultaneously.

Your specific question about grep could be answered with "basically yes", since the filesystem prefetch in the kernel is faster with more than one CPU than with only one. If the file you operate on is not huge, or if it is already in the kernel cache, there is no prefetch advantage...

BTW: modern shells have a builtin time feature that not only shows the times but also computes a percentage from the ratio of the sum of USER and SYS CPU time to wall-clock time. If that percentage is more than 100%, the run took advantage of more than one CPU; for mostly non-threaded utilities, however, it is typically just something like 105%.
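
In bash, for instance, the builtin `time` can be asked for exactly this percentage via the TIMEFORMAT variable (the sort invocation and file name below are placeholders):

```shell
# %P = (user + sys) / real, as a percentage; more than 100%
# means more than one CPU was in use during the run.
TIMEFORMAT='%P%% CPU'
time sort --parallel=8 bigfile.txt > /dev/null
```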

Finally: parallelization also takes place at the process level, and a parallelized make could easily run 3x faster than a non-parallelized version.
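
With GNU make this is a single flag (the job count shown simply matches the core count):

```shell
# Run up to one recipe per CPU core in parallel:
make -j"$(nproc)"
```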

If your platform allows you to switch off CPUs at runtime, I encourage you to switch off all but one CPU and compare the results with a multi-CPU run on the otherwise identical machine.
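
On Linux, CPUs can be offlined via sysfs (root required); a lighter-weight alternative, sketched here with a placeholder file name, is pinning a single run to one CPU with taskset from util-linux:

```shell
# Offline CPU 1 (needs root; CPU 0 usually cannot be offlined):
#   echo 0 > /sys/devices/system/cpu/cpu1/online
# No-root alternative: pin one run to CPU 0 and compare the timings.
time taskset -c 0 sort --parallel=8 bigfile.txt > /dev/null
time sort --parallel=8 bigfile.txt > /dev/null
```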

schily
  • I'd be much more cautious asserting threading would be slower. Naive application of threading would be as you say. But let's not assume the authors of these utilities are naive. A specific example: `rm -r ...` large directories across different file systems would in many cases be faster if threaded one thread per FS. – Philip Couling Jun 25 '20 at 11:10
  • Is it worth adding complex code for something that is highly improbable? The speed of `rm` highly depends on the underlying filesystem. Some filesystems implement a background rm that is more than 1000x faster than what a multi-threaded rm could achieve. On the other side, `star` shows that parallelization at the right place causes benefit. `star` is still the fastest known `tar` implementation. – schily Jun 25 '20 at 11:24
  • your point is my point. Used correctly threading *is* much faster, we agree on that. I'm just picking up on the wording in this answer which kind of suggests otherwise. "too difficult to implement a quicker threaded version" is a much better explanation than "it would most likely be slower". I certainly wouldn't regard one command being executed on multiple file systems as "highly improbable". – Philip Couling Jun 25 '20 at 11:37
  • @schily with open source software, things don't necessarily get done for rational reasons, "somebody enjoyed doing it and it is not positively harmful" is a good enough justification. – alephzero Jun 26 '20 at 11:41
  • ...though decreased ease-of-maintenance is a harm, so things that can't be rationalized won't necessarily make it upstream. – Charles Duffy Jun 26 '20 at 20:23