
I would like to find all the .html files in a folder and append [file](./file.html) to another file called index.md. I tried the following command:

ls | awk "/\.html$/" | xargs -0 -I @@ -L 1 sh -c 'echo "[${@@%.*}](./@@)" >> index.md'

But it doesn't substitute @@ inside the command. What am I doing wrong?

Note: file names can contain characters such as spaces.


Clarification:

index.md would contain one line per file, of the form [file](./file.html), where file is the actual file name in the folder

Rui F Ribeiro
Porcupine
  • `xargs -0` implies null-terminated strings on the `xargs` stdin, but `awk` does not print them. `${}` needs a variable name. Both points are addressed in @RoVo's answer – weirdan Sep 03 '18 at 11:24
  • Would you please clarify how the content of "index.md" will look like? –  Sep 03 '18 at 11:32
  • @Goro I had appended the clarification at the end of question, but unfortunately, it has been edited out! – Porcupine Sep 03 '18 at 11:39
  • @Nikhil. Would you please include it again. Thanks! –  Sep 03 '18 at 11:40

3 Answers


Just do:

for f in *.html; do printf '%s\n' "[${f%.*}](./$f)"; done > index.md

Use set -o nullglob (zsh, yash) or shopt -s nullglob (bash) for *.html to expand to nothing instead of *.html (or report an error in zsh) when there's no html file. With zsh, you can also use *.html(N) or in ksh93 ~(N)*.html.
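A self-contained way to try this (a sketch assuming bash; the temporary directory and sample file names are just for illustration):

```shell
# Demo (bash): with nullglob set, the loop produces an empty index.md
# when there are no .html files, instead of a bogus "[*](./*.html)" line.
dir=$(mktemp -d) && cd "$dir" || exit 1
shopt -s nullglob
touch 'a.html' 'b c.html'      # sample files, one with a space
for f in *.html; do printf '%s\n' "[${f%.*}](./$f)"; done > index.md
cat index.md
# → [a](./a.html)
# → [b c](./b c.html)
```

Glob expansion is sorted, so the output order is deterministic.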

Or with one printf call with zsh:

files=(*.html)
rootnames=(${files:r})
printf '[%s](./%s)\n' ${rootnames:^files} > index.md

Note that, depending on which markdown syntax you're using, you may have to HTML-encode the title part and URI-encode the URI part if the file names contain some problematic characters. Not doing so could even end up introducing a form of XSS vulnerability depending on context. With ksh93, you can do it with:

for file in *.html; do
  title=${ printf %H "${file%.*}"; }
  title=${title//$'\n'/"<br/>"}
  uri=${ printf '%#H' "$file"; }
  uri=${uri//$'\n'/%0A}
  printf '%s\n' "[$title]($uri)"
done > index.md

Where %H¹ does the HTML encoding and %#H the URI encoding, but we still need to address newline characters separately.

Or with perl:

perl -MURI::Encode=uri_encode -MHTML::Entities -CLSA -le '
  for (<*.html>) {
     $uri = uri_encode("./$_");
     s/\.html\z//;
     $_ = encode_entities $_;
     s:\n:<br/>:g;
     print "[$_]($uri)"
  }'

Using <br/> for newline characters. You may want to use ␤ instead or more generally decide on some form of alternative representation for non-printable characters.

There are a few things wrong in your code:

  • parsing the output of ls
  • using a $ meant to be literal inside double quotes
  • using awk for something that grep can do (not wrong per se, but overkill)
  • using xargs -0 when the input is not NUL-delimited
  • -I conflicts with -L 1: -L 1 runs one command per line of input with each word in the line passed as a separate argument, while -I @@ runs one command per line of input with the full line (minus trailing blanks, and with quoting still processed) used to replace @@
  • embedding the -I replacement string (@@) inside the code argument of sh (a command injection vulnerability)
  • in sh, the var in ${var%.*} must be a variable name; it won't work with arbitrary text
  • using echo for arbitrary data
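To see why splicing the -I replacement token into the sh -c code is dangerous, compare these two runs (a sketch; the file name is deliberately hostile):

```shell
# Sketch: a file name containing shell syntax becomes code when spliced
# into the script text, but stays inert when passed as an argument.
dir=$(mktemp -d) && cd "$dir" || exit 1
touch '$(uname).html'                    # valid, if unfriendly, file name
printf '%s\n' *.html |
    xargs -I @@ sh -c 'echo "@@"'        # injected: $(uname) gets executed
printf '%s\n' *.html |
    xargs -I @@ sh -c 'echo "$1"' sh @@  # safe: prints $(uname).html literally
```

The first pipeline prints the output of uname followed by ".html"; only the second prints the actual file name.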

If you wanted to use xargs -0, you'd need something like:

printf '%s\0' * | grep -z '\.html$' | xargs -r0 sh -c '
  for file do
    printf "%s\n" "[${file%.*}](./$file)"
  done' sh > file.md
  • replacing ls with printf '%s\0' * to get NUL-delimited output
  • replacing awk with grep -z (a GNU extension) to process that NUL-delimited output
  • using xargs -r0 (GNU extensions) without any -n/-L/-I: since we're spawning a sh anyway, we might as well have it process as many files as possible
  • having xargs pass the words as extra arguments to sh (where they become the positional parameters inside the inline code), not inside the code argument
  • which means we can easily store them in variables (here with for file do, which loops over the positional parameters by default) and use the ${param%pattern} parameter expansion operator
  • using printf instead of echo

It goes without saying that it makes little sense to use that instead of doing that for loop directly over the *.html files like in the top example.


¹ It doesn't seem to work properly for multibyte characters in my version of ksh93 though (ksh93u+ on a GNU system)

Stéphane Chazelas
  • That overwrites `index.md` though, which OP's code did not. – weirdan Sep 03 '18 at 11:27
  • 2
    I *think* this is still what OP wants. OP uses `>>` because he uses it inside the loop, while this answer after the loop and a second run of the same script doesn't make too much sense to me. – pLumo Sep 03 '18 at 11:28
  • @StéphaneChazelas Thanks for the answer. But `for f in *.html; do printf '%s\n' "[${f%.*}](./$f)"; done >> index.md` appends `[*](./*.html)` when no html file exists. – Porcupine Sep 03 '18 at 12:48
  • 1
    @Nikhil, see edit. – Stéphane Chazelas Sep 03 '18 at 13:02

Do not parse ls.
You don't need xargs for this, you can use find -exec.

Try this:

find . -maxdepth 1 -type f -name "*.html" -exec \
    sh -c 'f=$(basename "$1"); echo "[${f%.*}]($1)" >> index.md' sh {} \;
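The trailing `sh` in that command is not a stray argument: with `sh -c`, the first word after the script fills `$0`, and the following words become `$1`, `$2`, … A quick way to see this:

```shell
# With sh -c, the word after the script is assigned to $0;
# any remaining words become the positional parameters $1, $2, ...
sh -c 'echo "0=$0 1=$1"' sh hello
# → 0=sh 1=hello
```

That is why `{}` lands in `$1` inside the inline script rather than being spliced into the code.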

If you want to use xargs, use this very similar version:

find . -maxdepth 1 -type f -name "*.html" -print0 | \
    xargs -0 -I{} sh -c 'f=$(basename "$1"); echo "[${f%.*}]($1)" >> index.md' sh {}

Another way without running xargs or -exec:

find . -maxdepth 1 -type f -name "*.html" -printf '[%f](./%f)\n' \
    | sed 's/\.html\]/]/' \
    > index.md
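A quick way to check that last version (it relies on GNU find's -printf; the sample names are illustrative, and sort with a fixed locale only makes the order deterministic):

```shell
# Demo (GNU find): %f expands to the basename; sed then strips ".html"
# from the bracketed title while leaving the link target intact.
dir=$(mktemp -d) && cd "$dir" || exit 1
touch 'page one.html' 'page2.html'
find . -maxdepth 1 -type f -name "*.html" -printf '[%f](./%f)\n' |
    sed 's/\.html\]/]/' | LC_ALL=C sort
# → [page one](./page one.html)
# → [page2](./page2.html)
```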
pLumo
  • Is that an extra `sh` argument in the first command, or is that intentional? – Toby Speight Sep 03 '18 at 15:26
  • 2
    This is taken from [this answer](https://unix.stackexchange.com/a/156010/236063). See comments there and `man sh` -> `-c` for a documentation why this is needed. – pLumo Sep 03 '18 at 15:27
  • 1
    Ah, thanks - I had missed that **If there are arguments after the *command_string*, the first argument is assigned to `$0` and any remaining arguments are assigned to the positional parameters.** – Toby Speight Sep 03 '18 at 15:40
  • 1
    Add '-type f' to avoid strangeness with directories matching "*.html" – abligh Sep 03 '18 at 17:29

Do you really need xargs?

ls *.html | perl -pe 's/\.html\n//;$_="[$_](./$_.html)\n"'

(If you have more than 100000 files, where expanding *.html for an external command could exceed the argument-list limit):

printf "%s\n" *.html | perl -pe 's/\.html\n//;$_="[$_](./$_.html)\n"'

or (slower, but shorter):

for f in *.html; do echo "[${f%.*}](./$f)"; done
Ole Tange
  • Note that with `ls *.html`, if any of those `html` files are of type _directory_, `ls` will list their content. More generally, when you use `ls` with a shell wildcard, you want to use `ls -d -- *.html` (which also addresses the issues with file names starting with `-`). – Stéphane Chazelas Sep 04 '18 at 07:18
  • The first two approaches assume file names don't contain newline characters (anyway, I suppose those would have to be encoded somehow in the markdown syntax). The third one assumes file names don't contain backslash characters. More generally, [`echo` can't be used for arbitrary data](/q/65803). – Stéphane Chazelas Sep 04 '18 at 07:20