Questions tagged [unicode]

Unicode is a computing industry standard for the consistent encoding, representation and handling of text expressed in most of the world's writing systems.

476 questions
149 votes, 11 answers

How can I remove the BOM from a UTF-8 file?

I have a file in UTF-8 encoding with BOM and want to remove the BOM. Are there any Linux command-line tools to remove the BOM from the file?

    $ file test.xml
    test.xml: XML 1.0 document, UTF-8 Unicode (with BOM) text, with very long lines
m13r
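A minimal sketch of one way to approach the BOM question above. It is not taken from the thread's answers; `strip_bom` is a hypothetical helper name, and the sketch assumes `head -c` and `tail -c` are available (they are on GNU and BSD systems). It checks whether the first three bytes are the UTF-8 BOM (EF BB BF) and, if so, prints the file from byte 4 onward.

```shell
# strip_bom FILE: print FILE to stdout without a leading UTF-8 BOM.
# Hypothetical helper for illustration only.
strip_bom() {
  # Read the first three bytes as hex, e.g. "efbbbf" for a BOM.
  first=$(head -c 3 < "$1" | od -An -tx1 | tr -d ' \n')
  if [ "$first" = "efbbbf" ]; then
    # Skip the three BOM bytes; output starts at byte 4.
    tail -c +4 < "$1"
  else
    # No BOM: pass the file through unchanged.
    cat < "$1"
  fi
}
```

Usage would look like `strip_bom test.xml > test.nobom.xml`; writing to a new file avoids truncating the input while it is still being read.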
94 votes, 3 answers

Awesome symbols and characters in a bash prompt

I just ran across a screenshot of someone's terminal. Is there a list of all of the characters that can be used in a Bash prompt, or can someone get me the characters for the star and the right arrow?
Naftuli Kay
71 votes, 2 answers

How can I set Vim's default encoding to UTF-8?

I'd like to contribute to an open source project by providing translated strings. One of their requirements is that contributors must use UTF-8 as the encoding for the PO files. I'm using Vim 7.3 on Linux. How can I be sure that Vim's encoding is…
Paolo
61 votes, 3 answers

Why is printf "shrinking" umlaut?

If I execute the following simple script:

    #!/bin/bash
    printf "%-20s %s\n" "Früchte und Gemüse" "foo"
    printf "%-20s %s\n" "Milchprodukte" "bar"
    printf "%-20s %s\n" "12345678901234567890" "baz"

It prints:

    Früchte und Gemüse foo
    Milchprodukte…
René Nyffenegger
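A small sketch that may help explain the output in the question above, assuming the script is stored as UTF-8 and that bash's builtin `printf` applies the field width in bytes rather than characters (an assumption here, not something the excerpt states). The string has 18 characters, but each umlaut takes two bytes in UTF-8, so a byte-counting `%-20s` adds no padding at all.

```shell
s='Früchte und Gemüse'

# Count raw bytes; this does not depend on the locale.
bytes=$(printf '%s' "$s" | wc -c | tr -d ' ')
echo "bytes: $bytes"        # prints "bytes: 20"

# Padding a byte-counting printf would add for a 20-wide field.
pad=$((20 - bytes))
echo "padding: $pad"        # prints "padding: 0"
```

Counting characters instead (for example with `wc -m` in a UTF-8 locale) would give 18 and therefore two columns of padding.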
59 votes, 6 answers

Filtering invalid utf8

I have a text file in an unknown or mixed encoding. I want to see the lines that contain a byte sequence that is not valid UTF-8 (by piping the text file into some program). Equivalently, I want to filter out the lines that are valid UTF-8. In other…
Gilles 'SO- stop being evil'
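One possible way to sketch the filter described in the question above, assuming `iconv` is installed and exits with a non-zero status on invalid input (as the glibc and BSD implementations do). `invalid_utf8_lines` is a hypothetical helper name, not something from the thread. It reads lines from standard input and prints only those that fail a UTF-8 to UTF-8 round trip.

```shell
# invalid_utf8_lines: read stdin, print only lines that are not valid UTF-8.
# Hypothetical helper; spawns one iconv process per line, so it is slow on large files.
invalid_utf8_lines() {
  while IFS= read -r line || [ -n "$line" ]; do
    # iconv fails when the bytes are not a valid UTF-8 sequence.
    if ! printf '%s' "$line" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
      printf '%s\n' "$line"
    fi
  done
}
```

Usage would be `invalid_utf8_lines < somefile`. Lines containing NUL bytes are not handled, since the shell cannot store them in a variable.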
46 votes, 5 answers

Updated my arch linux server and now I get tmux: need UTF-8 locale (LC_CTYPE) but have ANSI_X3.4-1968

I recently updated my Arch Linux server and during that process tmux got updated. I was using tmux while the upgrade was going on and used it afterwards, but all during the same SSH session. Now, however, whenever I try to issue any tmux command I…
RPiAwesomeness
46 votes, 2 answers

What fonts are good for unicode glyphs

So I was looking at this answer on Stack Overflow and realized that my fonts aren't covering a whole lot of the Unicode spectrum (I get lots of squares). Does anyone know a font that will cover all of that post?
xenoterracide
43 votes, 7 answers

Is there an alternative to sed that supports unicode?

For example:

    sed 's/\u0091//g' file1

Right now, I have to use hexdump to get the hex number and put it into sed as follows:

    $ echo -ne '\u9991' | hexdump -C
    00000000  e9 a6 91  |...|
    00000003

And then:

    $ sed…
A-letubby
42 votes, 2 answers

How to make tr aware of non-ascii(unicode) characters?

I'm trying to remove some characters from a file (UTF-8). I'm using tr for this purpose:

    tr -cs '[[:alpha:][:space:]]' ' '
MatthewRock
40 votes, 4 answers

How to specify characters using hexadecimal codes in `grep`?

I am using the following command to grep a character range from hexadecimal code 0900 (instead of अ) to 097F (instead of व). How can I use hexadecimal codes in place of अ and व?

    bzcat archive.bz2 | grep -v '<[अ-व]*\s' | tr '[:punct:][:blank:][:digit:]'…
39 votes, 4 answers

gitk crashes when viewing commit containing emoji: X Error of failed request: BadLength (poly request too large or internal Xlib length error)

I'm able to open gitk, but it crashes as soon as I open a commit whose changes contain an emoji (not the commit message).

Error

    ❯ gitk --all
    X Error of failed request: BadLength (poly request too large or internal Xlib length error)
    Major opcode…
Édouard Lopez
38 votes, 7 answers

Convert between Unicode Normalization Forms on the unix command-line

In Unicode, some character combinations have more than one representation. For example, the character ä can be represented as "ä", that is the codepoint U+00E4 (two bytes c3 a4 in UTF-8 encoding), or as "ä", that is the two codepoints U+0061…
glts
37 votes, 1 answer

Should we use UTF-8 characters like ⏰ in bash/shell script?

The simple code here works as expected on my machine if launched with bash:

    function ⏰(){
      date
    }
    ⏰

Could there be a problem for other people using this, or is it universal? I'm wondering because I've never seen anything like this in other…
bob dylan
37 votes, 8 answers

How can I correctly decompress a ZIP archive of files with Hebrew names?

Someone sent me a ZIP file containing files with Hebrew names (and created on Windows, not sure with which tool). I use LXDE on Debian Stretch. The Gnome archive manager manages to unzip the file, but the Hebrew characters are garbled. I think I'm…
einpoklum
33 votes, 4 answers

Find the best font for rendering a codepoint

How can I find the appropriate font for rendering Unicode codepoints? gnome-terminal finds that characters like «⼼» can be rendered with fonts like Symbola rather than my terminal font or the codepoint-in-square fallback (). How?
Nope