12

I have HTML numeric entities like &#x119; (ę) and want to convert them to the real characters. I have emails, mostly from LinkedIn, that look like this:

chciałabym zapytać, czy rozważa Pan takze udział w nowych projektach w Warszawie ? Obecnie poszukujemy specjalisty javascript/architekta z bardzo dobrą znajomością Angular.js do projektu, który dotyczy systemu, służącego do monitorowania i zarządzania flotą pojazdów. Zespół, do którego poszukujemy

I'm using Claws Mail; switching to the HTML view doesn't convert the entities to text. I've tried copying the message and running

xclip -o -sel clip | html2text | less

but it didn't convert the entities. Is there a way to get the decoded text using command-line tools?

The only way I can think of is to use data:text/html,<PASTE THE EMAIL> and open it in a browser, but I would prefer the command line.

jcubic

4 Answers

25

With Free recode (formerly known as GNU recode):

recode html < file

If you don't have recode or HTML::Entities and only need to decode &#x<hex>; entities, you could do it by hand with:

perl -Mopen=locale -pe 's/&#x([\da-f]+);/chr hex $1/gie'
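
For comparison, the same hand-rolled substitution can be sketched in Python. This is only an illustration under stated assumptions: the function name `decode_hex_entities` is made up for this example, and like the perl one-liner it handles only the `&#x<hex>;` form, leaving named entities such as `&amp;` and decimal references untouched.

```python
import re

# Matches hexadecimal numeric character references such as &#x119;
# (case-insensitive, like the /i flag in the perl version).
HEX_ENTITY = re.compile(r"&#x([0-9a-f]+);", re.IGNORECASE)


def decode_hex_entities(text):
    """Replace each &#x<hex>; reference with the character it names.

    Hypothetical helper for illustration; a code point above 0x10FFFF
    would make chr() raise ValueError.
    """
    return HEX_ENTITY.sub(lambda m: chr(int(m.group(1), 16)), text)


if __name__ == "__main__":
    print(decode_hex_entities("zapyta&#x107;, czy rozwa&#x17c;a"))
```

For anything beyond this narrow case, a full decoder (recode, HTML::Entities, or Python's `html.unescape`) is probably the safer choice.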
Stéphane Chazelas
  • This works perfectly: `c-v | html2text | recode html` – jcubic Aug 08 '14 at 15:02
  • Didn't have `html2text`; not sure it matters. This example fails with `recode: Request 'html' is erroneous`. Seems it needs to be run this way now with a range instead of a single identifier: `recode html..utf-8`. A bit strange, but I guess it's all similar translating codes at some levels. – Pysis Feb 06 '20 at 16:25
  • @Pysis, you'll notice the first version of this answer had `html..` later changed to `html` in 2014. `html` alone definitely works with the latest version (git head from December 2019) or from 3.6 from 2008. Is it possible you have a very old version? – Stéphane Chazelas Feb 06 '20 at 17:34
  • Just installed to use in cygwin, I think it was from Choco? recode 3.7-beta2 – Pysis Feb 06 '20 at 17:44
  • With recode 3.7-beta2 the command that currently works is `recode HTML..utf-8`. – Diomidis Spinellis Mar 22 '20 at 21:13
6

From "How can I decode HTML entities?" on Stack Overflow, you may be able to implement a simple Perl solution such as

perl -Mopen=locale -MHTML::Entities -pe '$_ = decode_entities($_)' email.txt

e.g. using your example text

$ perl -Mopen=locale -MHTML::Entities -pe '$_ = decode_entities($_)' email.txt
chciałabym zapytać, czy rozważa Pan takze udział w nowych projektach w Warszawie ? Obecnie poszukujemy specjalisty javascript/architekta z bardzo dobrą znajomością Angular.js do projektu, który dotyczy systemu, służącego do monitorowania i zarządzania flotą pojazdów. Zespół, do którego poszukujemy

With -Mopen=locale, I/O is done in the locale's character set. That includes input from email.txt. It looks like email.txt contains only ASCII characters (the whole point of encoding those characters using the &#x<hex>; notation I suppose), but if not you may need to adapt the above to also decode that file using the right charset (if it's not the same as the locale's one) instead of using open=locale.
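
To illustrate the charset point, here is a minimal Python sketch (not the Perl adaptation itself) that decodes the input with an explicitly named charset instead of relying on the locale. The function name `decode_file` and the `iso-8859-2` value in the usage note are assumptions made for this example.

```python
import html
import sys


def decode_file(path, encoding="utf-8"):
    """Read `path` using the given charset and decode its HTML entities.

    Hypothetical helper: `encoding` must name the charset the file was
    actually saved in, which may differ from the locale's charset.
    """
    with open(path, encoding=encoding) as f:
        return html.unescape(f.read())


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: script.py FILE [CHARSET]
    charset = sys.argv[2] if len(sys.argv) > 2 else "utf-8"
    sys.stdout.write(decode_file(sys.argv[1], charset))
```

For a file saved in ISO-8859-2, for instance, the call would be `decode_file("email.txt", "iso-8859-2")`.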

steeldriver
5

A Python 3.4+ version (html.unescape was added in 3.4), which can be used in a pipe:

python3 -c 'import html, sys; [print(html.unescape(l), end="") for l in sys.stdin]' < file
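
The same line-by-line filter could be written as a small script, which may be easier to read than a list comprehension used only for its side effects. This is a sketch; the function name `unescape_lines` is made up for this example. Because each line is decoded on its own, an entity split across a line break would not be decoded.

```python
import html
import sys


def unescape_lines(src, dst):
    """Copy `src` to `dst` one line at a time, decoding HTML entities.

    Each line is written as soon as it is read, so the filter does not
    wait for the whole input before producing output.
    """
    for line in src:
        dst.write(html.unescape(line))


if __name__ == "__main__":
    unescape_lines(sys.stdin, sys.stdout)
```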
Aissen
  • Cleaner: ``python3 -c'import html,sys;print(html.unescape(sys.stdin.read()), end="")'`` – ariddell Mar 27 '18 at 12:55
  • @ariddell : your version isn't line-by-line, and I wanted to preserve line boundaries; otherwise it blocks a pipe until everything is read on stdin (pipe is exhausted). – Aissen Mar 27 '18 at 15:09
-1

echo -e "\x01\x19" should do the trick.

doneal24
  • To get upvotes you should probably write shell code that converts `&#x119;` to `echo -e "\x01\x19"`; that should be possible with sed. – jcubic Aug 08 '14 at 14:43
  • Also, this doesn't work: ę is a single character, and I don't get it when I run your command. – jcubic Aug 08 '14 at 14:45
  • `\u119` works, but I haven't been able to make it work with sed. So far I have `c-v | sed -e 's/\([^;]*\);/\\u\1/g' -e 's/.*/echo -e "&"/' | bash` – jcubic Aug 08 '14 at 14:53