
In a shell where \u is valid (bash 4.3 or later, ksh93, or zsh) we can print Unicode characters:

$ printf 'a b c \ua0 \ua1 \ua2 \ua3 \n'
a b c   ¡ ¢ £

These are some characters from the Latin-1 Supplement range.

However, as soon as a Unicode 9f character is added, printing stops until a Unicode 9c gets printed.

Both \u9f and \u9c (APC and ST) are C1 control characters.
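(For reference: in UTF-8, each C1 control character encodes as a two-byte sequence starting with 0xC2, which is what the c2 9f and c2 9c pairs in the hex dump further down are. Octal escapes work in any POSIX printf, so this check needs no \u support:)

```shell
# U+009F (APC) is bytes 0xC2 0x9F in UTF-8; U+009C (ST) is 0xC2 0x9C.
# Octal escapes (\302\237, \302\234) work in any POSIX printf,
# unlike \u, which needs a newer shell.
printf '\302\237' | od -An -tx1    # c2 9f
printf '\302\234' | od -An -tx1    # c2 9c
```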

$ printf 'a b c \u9f d e f \u9c \ua0 \ua1 \ua2 \ua3 \n'
a b c    ¡ ¢ £ 

The characters d e f disappear.

printf is certainly generating all the characters: redirecting the output to some other program (not the terminal) shows everything that was generated:

$ printf 'a b c \u9f d e f \u9c \ua0 \ua1 \ua2 \ua3 \n' | od -A n -tx1
 61 20 62 20 63 20 c2 9f 20 64 20 65 20 66 20 c2
 9c 20 c2 a0 20 c2 a1 20 c2 a2 20 c2 a3 20 0a

That demonstrates that the characters are being generated. Why, then, are they not being printed (shown with some visible glyph)?

The questions I have are:

  1. Is APC actually connected to ST? Where is it defined?
  2. Are the characters between those two characters sent to some application?
  3. If so, to which application?
  4. Who is responsible for such redirection? The shell, the terminal, or something else?

EDIT

Neither the xterm nor konsole terminals remove the d e f characters.

That confirms that it is an internal issue of the terminal application, not the shell. I have not found where that behaviour is defined yet.

QuartzCristal

3 Answers


Is APC actually connected to ST? Where is it defined?

These control characters are not actually original to Unicode, but inherited from older character set specifications, such as ECMA-48, ISO/IEC 6429 and the ISO/IEC 8859 family of character encodings. Broadly speaking, these standards agree with each other on the C1 control characters (because they maintain backwards compatibility with each other and with some even older specifications).

Since copies of ISO/IEC 6429 are being sold, I don't expect to find a legitimate copy of it freely available on the internet, but ECMA-48 says:

8.3.2 APC - APPLICATION PROGRAM COMMAND

Notation: (C1)

Representation: 09/15 or ESC 05/15

APC is used as the opening delimiter of a control string for application program use. The command string following may consist of bit combinations in the range 00/08 to 00/13 and 02/00 to 07/14. The control string is closed by the terminating delimiter STRING TERMINATOR (ST). The interpretation of the command string depends on the relevant application program.

and:

8.3.143 ST - STRING TERMINATOR

Notation: (C1)

Representation: 09/12 or ESC 05/12

ST is used as the closing delimiter of a control string opened by APPLICATION PROGRAM COMMAND (APC), DEVICE CONTROL STRING (DCS), OPERATING SYSTEM COMMAND (OSC), PRIVACY MESSAGE (PM), or START OF STRING (SOS).
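A note on the notation (my reading of ECMA-48, worth double-checking against the standard itself): the xx/yy figures are column/row byte values, i.e. xx × 16 + yy. So APC is byte 0x9F, with a 7-bit alternative of ESC _ (0x1B 0x5F), and ST is 0x9C or ESC \ (0x1B 0x5C). A quick shell check of the arithmetic:

```shell
# ECMA-48 writes byte values as column/row pairs: cc/rr means cc*16 + rr.
printf 'APC 09/15 -> %02x\n' $((9 * 16 + 15))   # 9f
printf 'ST  09/12 -> %02x\n' $((9 * 16 + 12))   # 9c
# The 7-bit alternatives land in ASCII:
printf 'APC ESC 05/15 -> ESC %02x\n' $((5 * 16 + 15))   # 5f = '_'
printf 'ST  ESC 05/12 -> ESC %02x\n' $((5 * 16 + 12))   # 5c = '\'
```

On a terminal that handles APC strings this way, the 7-bit form `printf 'a \033_ d e f \033\\ b\n'` should hide the d e f just like the \u9f ... \u9c example in the question.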

Unicode defines only one control character within the C1 control character range: U+0085 Next Line (NEL). For any other characters within the C1 range, this part of the specification applies:

The semantics of the control codes are generally determined by the application with which they are used. However, in the absence of specific application uses, they may be interpreted according to the control function semantics specified in ISO/IEC 6429:1992.

I can't verify it here, but I'd expect ISO/IEC 6429 to pretty closely conform to what ECMA-48 said, as above. Also, the author of the terminal might have considered "being backwards compatible with pre-Unicode 7-bit and 8-bit character encodings, like ECMA-48" to be a specific application use.

So the terminal might legitimately interpret the characters between APC and ST as "I don't know what these are for, but I sure know these are not intended to be displayed as regular output."

The terminal might or might not be programmed to react in some fashion to some specific strings encapsulated between APC and ST, and ignore any non-matching strings. Since the terminal window is the "last step before the human", it could certainly assume that any application program command strings arriving at it are meant for the terminal to interpret and act on if applicable, and that any such strings it does not recognize must be errors.

Displaying an "invalid encoding" character or other error message would not be appropriate, as the string is validly encoded as "application-specific control string, not for displaying". So the answer to the titular question "where are the characters going?" is most likely: they are being discarded as parts of an invalid control string.

But note that the Unicode specification said "...may be interpreted...", not "...must be interpreted...". Therefore, the other terminal implementations' choice of just ignoring the APC and ST characters as non-printable control characters with no applicable meaning is not necessarily invalid either.
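To make the "discarded as part of a control string" interpretation concrete, here is a rough sketch (my own illustration, not how any particular terminal is implemented) of the filtering such a terminal effectively performs, done with sed on the raw UTF-8 bytes:

```shell
# APC is 0xC2 0x9F in UTF-8, ST is 0xC2 0x9C. A terminal that consumes
# APC strings effectively deletes everything from APC through ST.
# LC_ALL=C makes sed operate on raw bytes; the delimiters are injected
# via command substitution for portability.
apc=$(printf '\302\237'); st=$(printf '\302\234')
printf 'a b c \302\237 d e f \302\234 g h i\n' | LC_ALL=C sed "s/$apc.*$st//"
# -> a b c  g h i
```

The d e f between the delimiters vanishes, exactly as in the question's terminal, while everything outside the APC...ST pair is displayed normally.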

This question on Stack Overflow also discusses the control sequences involving the APC and ST control characters.

The accepted answer there says:

The reality is, APC is very rarely implemented – most systems never generate APC sequences and silently ignore any received. No application should send or interpret APC sequences unless it knows the other end of the connection is using them in a particular way – such as through a configuration option to enable their use, or if it (somehow) knows which terminal emulator is being used and knows that terminal emulator assigns them a particular meaning [...]

telcoM

The characters are not being sent anywhere, they are simply not being displayed by your terminal despite being there in the output:

$ printf 'a b c \u9f d e f \u9c \ua0 \ua1 \ua2 \ua3 \n' | od -c
0000000   a       b       c     302 237       d       e       f     302
0000020 234     302 240     302 241     302 242     302 243      \n
0000037

You can also confirm they are in the output by redirecting to a file and then investigating the file:

$ printf 'a b c \u9f d e f \u9c \ua0 \ua1 \ua2 \ua3 \n' > file
$ od -c file
0000000   a       b       c     302 237       d       e       f     302
0000020 234     302 240     302 241     302 242     302 243      \n
0000037

It appears that what a terminal does with the combination of \u9f and \u9c is implementation-dependent. It simply happens that the way your terminal handles it is by moving back a few characters and continuing printing from there, which results in overwriting other characters. This is why you see:

$ printf 'a b c \u9f d e f \u9c \ua0 \ua1 \ua2 \ua3 \n'
a b c    ¡ ¢ £ 

I can reproduce that on gnome-terminal, but xterm just prints a space:

$ printf 'a b c \u9f d e f \u9c \ua0 \ua1 \ua2 \ua3 \n'
a b c  d e f    ¡ ¢ £ 

Here's the same thing in screenshots:

screenshot showing different output in xterm and gnome-terminal

This is similar to what happens in a more clear cut case, that of using a carriage return (\r) whose job is precisely to move back to the beginning of a line. This is why you get:

$ printf '12345\r67890\n'
67890

The terminal started printing 12345, then the \r sent it back to the beginning of the line, where it overwrote the 12345 with the 67890, so what you end up seeing is only 67890. But the 12345 wasn't sent to any other program; it is still there, simply not visible because the other characters have overwritten it:

$ printf '12345\r67890\n' | od -c
0000000   1   2   3   4   5  \r   6   7   8   9   0  \n
0000014
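A variant (my addition) that makes the overwriting visible: rewrite only part of the line after the \r, and the tail of the original text survives on screen:

```shell
# On a typical terminal this displays 67345: the cursor returns to
# column 0 and only '67' is rewritten, leaving '345' untouched.
printf '12345\r67\n'
# When the output is not a terminal, every byte is still present:
printf '12345\r67\n' | wc -c    # 9 bytes: 12345 + CR + 67 + LF
```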
terdon

You're describing commands that output UTF-8 character sequences and the results that you see displayed in the window of your terminal emulator (often referred to as "my terminal window").

Then you describe character sequences that don't seem to cause visible results to be displayed in the window of your terminal emulator. And you ask, "are the characters being sent to some application?"

Yes, they're being delivered to your terminal emulator, which interprets the character sequences it receives and decides what glyphs it will display in its window for you to view.

Sotto Voce
  • 3,664
  • 1
  • 8
  • 21
  • printf will generate UTF-8 characters only if the locale is set to UTF-8. Use a locale with ISO-8859-1, for example, and the generated characters will be of a different encoding. Try `LC_ALL=en_DK.iso88591 printf 'a b c \u9f d e f \u9c g h i \ua0 \ua1 \ua2 \ua3 \n' | od -tx1`. – QuartzCristal Jul 09 '22 at 20:02
  • You say *Yes, they're being delivered to your terminal emulator*, and yes the characters are being generated by `printf`, no problem. But then, again, why is the terminal emulator deciding to **not** assign visible glyphs to the given byte sequences after `\u9f`? – QuartzCristal Jul 09 '22 at 20:06
  • @QuartzCristal _why is the terminal emulator deciding to not assign visible glyphs to the given byte sequences after `\u9f`?_ I haven't written terminal emulators that handle UTF8 sequences, so I don't know the guidelines/rules about rendering them into glyphs. In my mind it's possible that an emulator author could think that a character sequence that begins well and then contains an unrecognized/invalid character should be abandoned without rendering anything. – Sotto Voce Jul 10 '22 at 00:49