This is what I want to do:

Convert a folder of HTML files into Markdown, carrying over the XML metadata of each HTML file by converting it into YAML.

I did some research and came across the following commands:

  1. find . -name \*.md -type f -exec pandoc -o {}.txt {} \;

    • This was found here. It works and uses pandoc; however, the resulting file extensions are ".html.md", not ".md".
  2. find / -name "*.md" -type f -exec sh -c 'markdown "${0}" > "${0%.md}.html"' {} \;

    • This was found here. It apparently replaces the ".html.md" extension with ".md", but it does not use pandoc.
  3. pandoc -f html -t markdown -s input.html -o output.md

    • This was found here. This is the pandoc command that apparently copies over the metadata and turns it into YAML; however, it does not work on a folder of files, only on one file at a time.

What I need is a single command that uses pandoc, gives the converted files the ".md" extension rather than ".html.md", and converts the XML metadata into YAML. All of this can be achieved with these three commands; they just need to be merged into one.

Kurt Pfeifle
  • Break this down for us a little, please. (1) What are your input filenames like: `a.html`, `b.html.md`, `c.md`, or a mixture? (2) For each individual input file, what command(s) do you want/need to run, and what do you want the output files to be called? (If you don't know the answer to (2), focus on researching that before you muddy the issue by trying to determine how to process multiple files.) – Scott - Слава Україні Mar 14 '15 at 04:08
  • (1) They are all `a.html`. (2) Convert `a.html` into `a.md`, which includes converting the XML metadata in the header of `a.html` into YAML to be used as `a.md`'s front matter. – st john smith Mar 14 '15 at 06:37
  • (1) I trust you can see that «the file extensions are ".html.md" not ".md"» is confusing, if, in fact, the file extensions are all ".html". (2) I said, "***what command(s)*** do you want/need to run". Upon rereading your question, I guess you're implying that you want to use `pandoc`. I've never heard of `pandoc`, so I didn't know that it does both of the functions that you want (convert HTML and copy/convert XML metadata), and your references to "three commands" confused me. (3) Comments are a bad place for clarifications. Improve your question by editing it. – Scott - Слава Україні Mar 14 '15 at 06:53

2 Answers

1

What you need is xargs. I am not familiar with pandoc, but something like this should work:

$ find . -name \*.html -type f | sed 's/\.html$//' | xargs -I {} pandoc -f html -t markdown -s -o "{}.md" "{}.html"

This uses 'find' to list all the .html files in your chosen directory (and any sub-directories). The names are piped to sed, which strips the '.html' extension, and then to xargs, which feeds them one by one to pandoc. If I have the syntax right, pandoc then takes each name (substituted for {}), reads the .html file as its source, and writes a new .md file in the same directory as the source file.

You should end up with your original html files and an equal number of matching md files in the same directory.
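
If some file names contain spaces, quotes, or other unusual characters, a variant that lets find hand the names directly to a small shell loop may be more robust than the sed/xargs pipeline. This is only a sketch: it assumes pandoc is installed and on the PATH, the function name `html_to_md` is made up for illustration, and the pandoc flags are the ones quoted in the question.

```shell
# Hypothetical helper: convert every *.html under a directory (default: .)
# into a sibling *.md file. Assumes pandoc is installed and on PATH.
html_to_md() {
    find "${1:-.}" -name '*.html' -type f -exec sh -c '
        for f do
            # "${f%.html}.md" turns a.html into a.md (not a.html.md)
            pandoc -f html -t markdown -s -o "${f%.html}.md" "$f"
        done
    ' sh {} +
}
```

Calling `html_to_md ./site` would process everything under `./site`; the original .html files are left in place.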

gogoud
  • This seems to have worked! Thank you so, so much! Honestly, you don't know how much this has helped me; I cannot thank you enough. (Sorry for the late reply.) – st john smith Mar 20 '15 at 18:23
0

Many people find piping find output into xargs hard to follow. Maybe looping through all the files with a while read loop is easier to understand?

find . -name "*.html" -type f | while IFS= read -r line ; do
    # IFS= and -r keep leading spaces and backslashes in file names intact
    pandoc "${line}"   \
           -f html     \
           -t markdown \
           -s          \
           -o "${line%%.html}.md"
done

The quotes make this work for file names that contain spaces, just in case. The construct ${line%%.html} removes a trailing .html from the file name. It is standard POSIX parameter expansion rather than a "Bashism", so it should also work in other POSIX shells such as dash or ksh.
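
To see what the suffix removal does without running pandoc, here is a small sketch. The file name is made up for illustration; `%` removes the shortest matching suffix and `%%` the longest, which only differ when the pattern contains a wildcard.

```shell
name='notes.v2.html'          # hypothetical file name

out1="${name%.html}.md"       # notes.v2.md
out2="${name%%.html}.md"      # notes.v2.md (same: the pattern has no wildcard)
short="${name%.*}"            # notes.v2   (shortest suffix matching ".*")
long="${name%%.*}"            # notes      (longest suffix matching ".*")

printf '%s\n' "$out1" "$out2" "$short" "$long"
```

For a fixed suffix like .html, either operator gives the a.html to a.md renaming asked for in the question.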

Kurt Pfeifle