
I'm building a script to extract some data from a website with broken character encoding:

  • The HTML header claims it's ISO-8859-1, but it's not
  • wgetting the file shows me it's actually UTF-8, but with wrong characters
  • Reverse engineering shows me that someone managed to use Windows codepage 1252 values as Unicode code points!

So, for example, the left single quote is 0x91 in codepage 1252, and it shows up as U+0091 in this page. Weird. Amazingly, web browsers seem to be able to repair this automagically.

My question: Which tool can help me clean this mess? (Not by hand! This is a dynamic website with hundreds of pages and I saw at least six different false encodings.)

Philippos

2 Answers


Depending on what you mean by "Not by hand", iconv could be useful for your task.

iconv - convert text from one character encoding to another

OPTIONS

   -f from-encoding, --from-code=from-encoding
          Use from-encoding for input characters.

   -t to-encoding, --to-code=to-encoding
          Use to-encoding for output characters.

In my experience, iconv works even if you have to deal with wrong encodings. For example, you can tell iconv that the input data is UTF-8 encoded, even though it's ISO-8859, so that iconv acts as if the input were UTF-8. This way you can repair incorrectly encoded data.
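For the mixed-up page from the question, that trick takes two passes: one to collapse the bogus code point U+0091 back down to the raw byte 0x91, and one to decode that byte as CP1252. A minimal sketch — the printf escapes stand in for the page's actual bytes, and encoding names may vary slightly between iconv implementations:

```shell
# The page contains U+0091 (UTF-8 bytes C2 91) where it means CP1252 byte 0x91.
# Pass 1: UTF-8 -> ISO-8859-1 turns the code point U+0091 into the single byte 0x91.
# Pass 2: CP1252 -> UTF-8 decodes byte 0x91 as the left single quote (U+2018).
printf '\302\221' | iconv -f UTF-8 -t ISO-8859-1 | iconv -f CP1252 -t UTF-8
```

The same two-pass pipeline fixes the other C1 code points (U+0080–U+009F) that were really CP1252 punctuation.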

Since iconv can work as a filter, you can chain it with something like curl. Chaining with wget should work as well, when you use --output-document -.
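A sketch of such a pipeline — example.com stands in for the real site, and CP1252 for whichever wrong encoding the page actually uses; the filter behaves the same whether curl or wget feeds it:

```shell
# Hypothetical URL; adjust -f to the page's actual (wrong) encoding.
curl -s https://example.com/page.html | iconv -f CP1252 -t UTF-8 > page.html

# Equivalent with wget:
wget -q --output-document - https://example.com/page.html | iconv -f CP1252 -t UTF-8 > page.html
```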

As far as I know, iconv isn't able to detect/guess the correct input encoding. But depending on how messed up your input data is, that might be "impossible" anyway, if the website has (too) many different kinds of wrong/mixed encodings. If the whole website is messed up in the same way, you should be able to fix it.

Sheldon
  • Piping the input through `iconv -f CP1252 -t UTF8` did indeed do the trick, even though the input wasn't real CP1252. I don't understand it, but it worked. Thank you. – Philippos Apr 04 '22 at 05:41

First, you want your locale in UTF-8.
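You can check what your locale currently is with:

```shell
# LANG / LC_CTYPE should name a UTF-8 locale, e.g. en_US.UTF-8
locale
```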

To detect

  • chardetect (from python3-chardet package; AKA chardet)
  • enca, focused on eastern and central European languages
  • uchardet
  • file --brief --mime-encoding FILE (--brief already strips the "FILE:" prefix, so no awk post-processing is needed)

Usual suspects are: CP850, CP437, latin1 (AKA ISO-8859-1), CP1252 (AKA windows-1252).
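A quick way to try the detectors is on a small sample file — sample.txt here is illustrative, and the byte 0xE9 is é in latin1/CP1252; the reported name can differ between versions of these tools:

```shell
# Create a one-line latin1/CP1252 sample (0xE9 = é).
printf 'Liberaci\351n\n' > sample.txt

# file's guess; typically something in the iso-8859 family for this input.
file --brief --mime-encoding sample.txt
```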

In my experience, pretty often these tools don't do the job. Sometimes a file can even contain a mix of encodings.

Somewhere I found this handy little brute-force script:

#!/bin/bash
# Brute force: decode the first line of the file with every encoding iconv
# knows, so you can grep the output for a correctly decoded sample word.
# Usage: script.sh fileWithLiberaci°n.txt | grep Liberación
iconv --list | sed -e 's/\/\///g' | while read -r encoding
do
  # -c silently drops characters that can't be converted instead of failing
  transcoded=$(head -n1 "$1" | iconv -c -f "$encoding" -t UTF-8)
  echo "$encoding $transcoded"
done

To convert

iconv, as used in the script above and described in the other answer.


Pablo A