
A directory of mine is filled with a huge number of files. I want to find out what kinds of files they are and which kinds are the most numerous.

Here is what happens when I try some commands:

ls -l | wc -l
1514340

ls | head -n 4
2004112700001.htm
2004112700002.htm
2004112700003.htm
2004112700004.htm

ls *.xml | head -n 4
20041127.xml
20041225.xml
20050101.xml
20050108.xml

ls -l *.htm | wc -l
bash: /bin/ls: Argument list too long
0

# Any other ls command with *.htm or *.* fails too.

My understanding is that `wc -l` has to wait until the output of `ls -l *.htm` is entirely produced before it can start analyzing it, and that because this output is too big, the command fails.

Is that truly what is happening?

What is the right way to make the `ls` command work in conjunction with `wc -l` in this case? Is there a way to ask the `wc` command to start asynchronously, before the output is entirely complete?
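As an aside, a pipe does not wait for the producer to finish; both sides run concurrently, as a one-line sketch shows:

# yes prints "y" forever; if head had to wait for the complete output,
# this would never return. Instead it prints one line and exits at once:
yes | head -n 1

That is also why `ls -l | wc -l` above succeeded despite 1.5 million lines of output.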

Marc Le Bihan
  • It's not `wc` failing because the output is too big, nor the pipe that's overflowing. `ls` is not even starting, because `*.htm` expands into too many arguments for it. – muru Jun 04 '20 at 06:29
  • @muru: how can that be? There are no other file extensions starting with `htm` than `htm` itself; no `html` file, for example. – Marc Le Bihan Jun 04 '20 at 07:06
  • So what? `*.htm` expands to `2004112700001.htm 2004112700002.htm 2004112700003.htm 2004112700004.htm ...` then `ls` is run with all those filenames as arguments, which exceeds the argument length limit. Whether or not you have a `.html` file makes no difference. Please see the dupe. – muru Jun 04 '20 at 07:08
  • @muru Isn't `*.htm` the `arg[0]` that a C program like `ls` takes and resolves as a file filter with the classical `findFirst`, `findNext` functions? How would `ls` succeed in expanding `*.htm` to a list of files? By doing an `ls` itself? – Marc Le Bihan Jun 04 '20 at 07:17
  • Never heard of these classical functions. `ls` doesn't expand anything. The shell does. See, e.g., https://unix.stackexchange.com/q/17938/70524 – muru Jun 04 '20 at 07:28
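To see that concretely, here is a minimal sketch in a hypothetical scratch directory, showing that the shell, not `ls`, expands the glob, and the kernel limit involved:

# Set up a few sample files (hypothetical directory):
mkdir /tmp/globdemo && cd /tmp/globdemo
touch 2004112700001.htm 2004112700002.htm 2004112700003.htm

# echo receives the already-expanded names; the shell did the work
# before the command ever started:
echo ls *.htm
ls 2004112700001.htm 2004112700002.htm 2004112700003.htm

# The kernel caps the combined byte size of the argument list:
getconf ARG_MAX

With 1.5 million names the expanded command line blows far past that limit, so `execve()` fails with E2BIG and `ls` never runs a single instruction; that is the "Argument list too long" error above.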

1 Answer


You hit the same problem when you try removing millions of files with `rm *` in a directory: the shell "extends" your command with all the filenames it finds before running it, and the system can't afford an argument list that long.
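For the `rm *` case, the usual workaround is to let `find` hand the names over in batches; a sketch, assuming GNU or BSD `find`/`xargs` for `-print0`/`-0`:

# xargs builds command lines that stay under the argument-size limit:
find . -maxdepth 1 -name '*.htm' -print0 | xargs -0 rm --

# With GNU find, -delete avoids spawning rm at all:
find . -maxdepth 1 -name '*.htm' -delete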

I would suggest using `find` instead, like:

find . -mindepth 1 -maxdepth 1 -name "*.html" | wc -l
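Going one step further, since the original goal was to find out which kinds of files are the most numerous, the same `find` output can be tallied per extension. A minimal sketch, assuming filenames without newlines (see the caveats in the comments below):

# Strip everything up to the last dot, then count each extension:
find . -mindepth 1 -maxdepth 1 -type f -name '*.*' | sed 's/.*\.//' | sort | uniq -c | sort -rn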
darxmurf
  • Note that it also counts hidden ones, doesn't work properly if filenames contain newline characters, and with many `find` implementations would skip filenames that contain sequences of bytes that don't form valid characters in the locale. – Stéphane Chazelas Jun 04 '20 at 06:36
  • Well, yes, but if you rack up 1 billion files with spaces and exotic characters in their names, things get a bit complicated :-) – darxmurf Jun 04 '20 at 06:47
  • Here, you could do `count() { echo "$#"; }; count *.html`, which wouldn't have either of those problems (but would give you 1 instead of 0/error when there's no matching file, unless you turn on `nullglob`/`failglob`); see the expanded sketch after these comments. With `find`, that could be addressed with `LC_ALL=C find . ! -name . -prune -name '*.html' ! -name '.*' -print | LC_ALL=C grep -c /` (here also avoiding the `-m??depth` GNU extensions). – Stéphane Chazelas Jun 04 '20 at 06:51
  • Your command works if I search for `htm` files: it returns 1513532 files. Considering the total of 1514340 files I had, and the 807 of the `xml` kind, there's only one remaining file whose extension I don't yet know. So I can't really understand why `ls` refused my command if it isn't some kind of buffer overflow, because it can't be the number of arguments: only three types of files are in my directory, `.xml`, `.htm`, and one last type I don't know, but it's a single file. – Marc Le Bihan Jun 04 '20 at 07:13
  • Have a look here: https://unix.stackexchange.com/questions/38955/argument-list-too-long-for-ls – darxmurf Jun 04 '20 at 07:14
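Stéphane Chazelas's `count` trick above, expanded into a sketch: the glob is expanded entirely inside the shell and no external command is executed with the long list, so the `ARG_MAX` limit never applies.

# bash: make a non-matching glob expand to nothing rather than to itself:
shopt -s nullglob
# $# is the number of positional parameters, i.e. the number of matches:
count() { echo "$#"; }
count *.htm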