
I'm building a script to extract some data from a website with broken character encoding:

  • The HTML header claims it's ISO-8859-1, but it's not
  • wgetting the file shows me it's actually UTF-8, but with wrong characters
  • Reverse engineering shows me that someone managed to use Windows codepage 1252 values as Unicode code points!

So, for example, the left single quote is 0x91 in codepage 1252, and it shows up as U+0091 in this page. Weird. Amazingly, web browsers seem to be able to repair this automagically.

My question: Which tool can help me clean this mess? (Not by hand! This is a dynamic website with hundreds of pages and I saw at least six different false encodings.)

Philippos

2 Answers


Depending on what you mean by "Not by hand", iconv could be useful for your task.

iconv - convert text from one character encoding to another

OPTIONS

   -f from-encoding, --from-code=from-encoding
          Use from-encoding for input characters.

   -t to-encoding, --to-code=to-encoding
          Use to-encoding for output characters.

In my experience, iconv works even if you have to deal with wrong encodings. For example, you can tell iconv that the input data is UTF-8 encoded, even though it's ISO-8859, so that iconv acts as if the input were UTF-8. This way you can repair incorrectly encoded data.
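For the mixed-up page from the question, that trick takes two passes: one to collapse the bogus code point U+0091 back down to the raw byte 0x91, and one to decode that byte as CP1252. A minimal sketch — the printf escapes stand in for the page's actual bytes, and encoding names may vary slightly between iconv implementations:

```shell
# The page contains U+0091 (UTF-8 bytes C2 91) where it means CP1252 byte 0x91.
# Pass 1: UTF-8 -> ISO-8859-1 turns the code point U+0091 into the single byte 0x91.
# Pass 2: CP1252 -> UTF-8 decodes byte 0x91 as the left single quote (U+2018).
printf '\302\221' | iconv -f UTF-8 -t ISO-8859-1 | iconv -f CP1252 -t UTF-8
```

The same two-pass pipeline fixes the other C1 code points (U+0080–U+009F) that were really CP1252 punctuation.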

Since iconv can work as a filter, you can chain it with something like curl. Chaining with wget should work as well, when you use --output-document -.
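A sketch of such a pipeline — example.com stands in for the real site, and CP1252 for whichever wrong encoding the page actually uses; the filter behaves the same whether curl or wget feeds it:

```shell
# Hypothetical URL; adjust -f to the page's actual (wrong) encoding.
curl -s https://example.com/page.html | iconv -f CP1252 -t UTF-8 > page.html

# Equivalent with wget:
wget -q --output-document - https://example.com/page.html | iconv -f CP1252 -t UTF-8 > page.html
```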

As far as I know, iconv isn't able to detect/guess the correct input encoding. But depending on how messed up your input data is, that might be "impossible" anyway, if the website has (too) many different kinds of wrong/mixed encodings. If the whole website is messed up in the same way, you should be able to fix it.

Sheldon
  • Piping the input through `iconv -f CP1252 -t UTF8` did indeed do the trick, even though the input wasn't real CP1252. I don't understand it, but it worked. Thank you. – Philippos Apr 04 '22 at 05:41

First, you want your locale in UTF-8.
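You can check what your locale currently is with:

```shell
# LANG / LC_CTYPE should name a UTF-8 locale, e.g. en_US.UTF-8
locale
```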

To detect

  • chardetect (from python3-chardet package; AKA chardet)
  • enca, focused on eastern and central European languages
  • uchardet
  • file --brief --mime-encoding FILE (--brief already strips the "FILE:" prefix, so no awk post-processing is needed)

Usual suspects are: CP850, CP437, latin1 (AKA ISO-8859-1), CP1252 (AKA windows-1252).
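A quick way to try the detectors is on a small sample file — sample.txt here is illustrative, and the byte 0xE9 is é in latin1/CP1252; the reported name can differ between versions of these tools:

```shell
# Create a one-line latin1/CP1252 sample (0xE9 = é).
printf 'Liberaci\351n\n' > sample.txt

# file's guess; typically something in the iso-8859 family for this input.
file --brief --mime-encoding sample.txt
```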

In my experience, pretty often these tools don't do the job. Sometimes a file can even contain a mix of encodings.

Somewhere I found this handy little brute-force script:

#!/bin/bash
# Brute force: decode the first line of the file with every encoding iconv
# knows, so you can grep the output for a correctly decoded sample word.
# Usage: script.sh fileWithLiberaci°n.txt | grep Liberación
iconv --list | sed -e 's/\/\///g' | while read -r encoding
do
  # -c silently drops characters that can't be converted instead of failing
  transcoded=$(head -n1 "$1" | iconv -c -f "$encoding" -t UTF-8)
  echo "$encoding $transcoded"
done

To convert

iconv, as used in the script above and described in the other answer.


Pablo A