Questions tagged [character-encoding]

Questions that deal with the various representations of characters and character sets, such as ASCII, UTF-8, and EBCDIC. These issues are often encountered when moving files between operating systems that mark line endings with carriage returns, newline characters, or both.

Use this tag when you know that you are dealing with characters or character sets that are represented differently.
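
As a minimal illustration of what "represented differently" can mean, the Python sketch below encodes the same text under three encodings and compares the raw bytes. The sample string and the choice of `cp037` as the EBCDIC variant are assumptions made for the example; the tag description does not name a specific code page.

```python
# Encode the same text under three encodings and compare the raw bytes.
# The sample text and the cp037 code page (one EBCDIC variant) are
# illustrative choices, not taken from the tag description.
text = "Aé"

encoded = {name: text.encode(name) for name in ("utf-8", "latin-1", "cp037")}

for name, raw in encoded.items():
    print(f"{name:8} {raw.hex(' ')}")

# Decoding with the wrong encoding either garbles the text or fails outright.
garbled = encoded["utf-8"].decode("latin-1")
try:
    encoded["latin-1"].decode("utf-8")
    strict_decode_failed = False
except UnicodeDecodeError:
    strict_decode_failed = True
```

Here `garbled` holds the typical mojibake produced when UTF-8 bytes are read as Latin-1, while the reverse direction raises an error because the Latin-1 byte for "é" is not a complete UTF-8 sequence.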

A frequent issue is when a file (particularly one meant to be executed as a script) is saved on a Microsoft Windows platform, then transferred to a Unix platform:
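
The Windows-to-Unix problem described above can be sketched in a few lines of Python. The script content and the helper names `has_crlf` and `to_unix_newlines` are hypothetical, chosen only for this illustration; dedicated tools exist for the same job, and this is not meant as the recommended fix.

```python
# A script saved with Windows (CRLF) line endings, shown as raw bytes.
# The script content is a made-up example.
windows_script = b"#!/bin/sh\r\necho hello\r\n"


def has_crlf(data: bytes) -> bool:
    """Return True if the data contains any CRLF line terminator."""
    return b"\r\n" in data


def to_unix_newlines(data: bytes) -> bytes:
    """Replace every CRLF pair with a bare LF, leaving lone CRs untouched."""
    return data.replace(b"\r\n", b"\n")


# Splitting on LF leaves a trailing carriage return on the first line, so
# the interpreter path on the "#!" line typically ends up as "/bin/sh\r".
first_line = windows_script.split(b"\n", 1)[0]
print(first_line)

unix_script = to_unix_newlines(windows_script)
print(unix_script.split(b"\n", 1)[0])
```

The conversion works on bytes rather than decoded text so that it does not depend on knowing the file's character encoding, which is often the unknown in these questions.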

Other useful questions on the site are:

For further background on character encodings, see the Wikipedia entry.

398 questions
77
votes
4 answers

echo bytes to a file

I'm trying to connect my Raspberry Pi to some display using the I2C bus. To get started I wanted to manually write stuff, bytes in particular, to a file. How do you write specific bytes to a file? I already read that one and I figured my problem…
Mark
  • 1,149
  • 3
  • 9
  • 18
77
votes
4 answers

How can I test the encoding of a text file... Is it valid, and what is it?

I have several .htm files which open in Gedit without any warning/error, but when I open these same files in Jedit, it warns me of invalid UTF-8 encoding... The HTML meta tag states "charset=ISO-8859-1". Jedit allows a List of fallback encodings and…
Peter.O
  • 32,426
  • 28
  • 115
  • 163
71
votes
2 answers

How can I set Vim's default encoding to UTF-8?

I'd like to contribute to an open source project by providing translated strings. One of their requirements is that contributors must use UTF-8 as the encoding for the PO files. I'm using Vim 7.3 on Linux. How can I be sure that Vim's encoding is…
Paolo
  • 16,955
  • 11
  • 31
  • 40
66
votes
4 answers

What is the ^M character called?

TexPad is creating it. I know that it is under some deadkey. I just cannot remember its name. The blue character: I just want to mass remove them from my document. How can you type it?
Léo Léopold Hertz 준영
  • 6,788
  • 29
  • 91
  • 193
59
votes
6 answers

Filtering invalid utf8

I have a text file in an unknown or mixed encoding. I want to see the lines that contain a byte sequence that is not valid UTF-8 (by piping the text file into some program). Equivalently, I want to filter out the lines that are valid UTF-8. In other…
Gilles 'SO- stop being evil'
  • 807,993
  • 194
  • 1,674
  • 2,175
56
votes
3 answers

What charset encoding is used for filenames and paths on Linux?

Does it depend on what file system I use? For example, ext2/ext3/ext4 but also what happens when I insert one of those "joliet" CD-ROMs with ISO 9660? I've heard that POSIX contains some sort of spec for the charset encoding of…
martin
  • 561
  • 1
  • 4
  • 3
42
votes
2 answers

tr complains of “Illegal byte sequence”

I'm brand new to UNIX and I am using Kirk McElhearn's "The Mac OS X Command Line" to teach myself some commands. I am attempting to use tr and grep so that I can search for text strings in a regular MS-Office Word Document. $ tr '\r' '\n' <…
user74886
  • 421
  • 1
  • 4
  • 4
41
votes
5 answers

Converting a UTF-8 file to ASCII (best-effort)

I have a file in UTF-8 that contains text in multiple languages. Much of it is people's names. I need to convert it to ASCII and I need the result to look as decent as possible. There are many ways to approach converting from a wider encoding…
user7610
  • 1,878
  • 2
  • 18
  • 22
40
votes
4 answers

How to specify characters using hexadecimal codes in `grep`?

I am using the following command to grep a character set range from hexadecimal code 0900 (instead of अ) to 097F (instead of व). How can I use hexadecimal codes in place of अ and व? bzcat archive.bz2 | grep -v '<[अ-व]*\s' | tr '[:punct:][:blank:][:digit:]'…
40
votes
4 answers

How to change encoding from Non-ISO extended-ASCII text, with CRLF line terminators to UTF-8?

I have a txt file: $ file -i x.txt x.txt: text/plain; charset=unknown-8bit $ file x.txt x.txt: Non-ISO extended-ASCII text, with CRLF line terminators And there are some characters that are incorrectly encoded: trwa³y, sta³y, usuwaæ How can…
Patryk
  • 13,556
  • 22
  • 53
  • 61
39
votes
4 answers

identify files with non-ASCII or non-printable characters in file name

In an 80 GB directory with approximately 700,000 files, there are some file names with non-English characters in the file name. Other than trawling through the file list laboriously is there: An easy way to list or otherwise identify these file…
suspectus
  • 5,890
  • 4
  • 20
  • 26
37
votes
8 answers

How can I correctly decompress a ZIP archive of files with Hebrew names?

Someone sent me a ZIP file containing files with Hebrew names (and created on Windows, not sure with which tool). I use LXDE on Debian Stretch. The Gnome archive manager manages to unzip the file, but the Hebrew characters are garbled. I think I'm…
einpoklum
  • 8,772
  • 19
  • 65
  • 129
36
votes
7 answers

Why do some characters show as squares in Chrome?

For example in the dev tools I get something like: Some of these squares are at the end of lines, initially I thought they were carriage returns but it turns out they aren't. Also, squares appear after = or > in many places where there is no…
Mat
  • 705
  • 2
  • 6
  • 9
32
votes
2 answers

find(1): how is the star wildcard implemented for it to fail on some filenames?

In a file system where filenames are in UTF-8, I have a file with a faulty name; it is displayed as: D�sinstaller, actual name according to zsh: D$'\351'sinstaller, Latin1 for Désinstaller, itself a French barbarism for "uninstall." Zsh would not…
Michaël
  • 774
  • 5
  • 18
32
votes
10 answers

How to print all printable ASCII chars in CLI?

How can I list all the printable ASCII characters in the terminal?
LanceBaynes
  • 39,295
  • 97
  • 250
  • 349