Convert to uppercase, except for escaped characters

Question

The methods I found break things further down the line by also affecting linebreaks.
For example...

$ message="First Line\nSecond Line"; 
$ echo "${message^^}"
FIRST LINE\NSECOND LINE

Is there an elegant way to convert a string to uppercase while leaving escaped characters alone, to get the following output instead?

FIRST LINE\nSECOND LINE

I could just do something convoluted like changing "\n" to 0001 or something along those lines, apply the conversion and then return 0001 to "\n". But maybe there is a better way.

Is this for later inclusion as part of some other data, possibly in XML or JSON format? If so, a parser of that format may possibly have routines for turning strings into uppercase in the way you describe, as, for example, `ascii_upcase` in the JSON parser `jq`, or the XPath function `upper-case()` for XML. — Kusalananda, Jul 24 '22 at 11:50
@Kusalananda For me this is only about text processing, but someone else stumbling across this question might have such a use case. — Ocean, Jul 25 '22 at 09:52

Stéphane Chazelas · Answer 1 · 2022-04-26T07:48:31.807

With zsh instead of bash:

$ message="First Line\nSecond Line"
$ set -o extendedglob
$ print -r -- ${message//(#b)((\\?)|(?))/$match[2]$match[3]:u}
FIRST LINE\nSECOND LINE

In bash (or any shell) and with the GNU implementation of sed, you can do the same with:

$ printf '%s\n' "$message" | sed -E 's/(\\.)|(.)/\1\u\2/g'
FIRST LINE\nSECOND LINE

Some variants that are potentially more efficient, as they minimise the number of substitutions:

zsh

print -r -- ${message//(#b)((\\?)|([^\\]##))/$match[2]$match[3]:u}

or

print -r -- ${message//(#b)((\\?)#)([^\\]##)/$match[1]$match[3]:u}

their GNU sed translations:

printf '%s\n' "$message" | sed -E 's/(\\.)|([^\\]+)/\1\U\2/g'

or

printf '%s\n' "$message" | sed -E 's/((\\.)*)([^\\]+)/\1\U\3/g'

Beware they convert \Mx (Meta-x, an escape sequence supported by zsh's print for instance and that expands to the 0xf8 byte ('x' + 0x80)) to \MX (0xd8). They also convert \x7a to \x7A or \u007a to \u007A or \Cx to \CX but that shouldn't be a problem as those expand to the same.
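
Where neither zsh nor GNU sed is available, the same rule (copy a backslash together with the character after it, uppercase everything else) can be sketched with POSIX awk. This is an illustration rather than part of the answer above; `upcase_unescaped` is a hypothetical helper name, and `toupper` follows the awk implementation's locale handling, so results for non-ASCII letters may vary.

```shell
# upcase_unescaped: hypothetical helper, not taken from the answer above.
# Uppercases its first argument, leaving each backslash and the character
# right after it untouched.
upcase_unescaped() {
  printf '%s\n' "$1" | awk '{
    out = ""
    n = length($0)
    for (i = 1; i <= n; i++) {
      c = substr($0, i, 1)
      if (c == "\\" && i < n) {
        # Copy the backslash and the escaped character verbatim, then skip it.
        out = out substr($0, i, 2)
        i++
      } else {
        out = out toupper(c)
      }
    }
    print out
  }'
}

upcase_unescaped 'First Line\nSecond Line'
```

Because a backslash always consumes the following character, an escaped backslash (`\\`) is treated as one unit and the letter after it is still uppercased.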

score 3 · Answer 2 · edited Apr 25 '22 at 19:05

I'd be tempted to interpret the escape sequences into literal characters:

message="First Line\nSecond Line"
declare -u Message                       # uppercase on assignment
printf -v Message -- "${message//%/%%}"  # assign
declare -p Message                       # inspect

result

declare -u Message="FIRST LINE
SECOND LINE"
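
A related sketch, assuming only POSIX `printf` and `tr` rather than bash's `declare -u`: passing the text as an argument to a `%b` format decodes the backslash escapes while treating `%` as plain data, so no doubling is needed. The sample text is made up for illustration.

```shell
message='50% done\nnext line'

# %b expands backslash escapes found in the argument; a % in the argument is not special.
decoded=$(printf '%b' "$message")

# tr stands in for bash's uppercase-on-assignment here.
upper=$(printf '%s' "$decoded" | tr '[:lower:]' '[:upper:]')

printf '%s\n' "$upper"
```

As with the approach above, the escapes become real characters, so the result contains an actual newline rather than the two characters `\n`.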

edited Apr 25 '22 at 19:05

Stéphane Chazelas

answered Apr 25 '22 at 19:03

glenn jackman

Beware that with `message='\141'` for instance, you'd get `declare -u Message="A"` instead of `declare -u Message="a"` – Stéphane Chazelas Apr 25 '22 at 19:07
Note that any ```\``` will be doubled to ```\\```. – Apr 25 '22 at 22:45
Using the message itself as the `printf` format is what gives `%` a special meaning, which is why every `%` had to be doubled. However, `printf -v Message '%b' "${message}"` will interpret backslash escapes exactly as `echo -e` does, without changing the `%`s. – Apr 25 '22 at 22:57
Please read: https://unix.stackexchange.com/q/700508/232326 – Apr 27 '22 at 19:26

score 1 · Answer 3 · answered Jul 24 '22 at 11:42

I'd consider evaluating the \n and other escape sequences at the point that the variable was defined. Here $message actually contains a newline.

message=$(printf '%b' 'First Line\nSecond Line')
echo "${message^^}"

Output

FIRST LINE
SECOND LINE

answered Jul 24 '22 at 11:42

roaima

Kadir · Answer 4 · 2022-04-26T08:31:51.590

The variable can be iterated line by line. Then concatenate the output again.

bash:

$ message="First Line\nSecond Line";
$ message=$(echo -e ${message} |while read -r line; do echo -n "${line^^}\n" ; done) && message=${message%??}
$ echo ${message} 
FIRST LINE\nSECOND LINE
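
The final `${message%??}` is a suffix-removal expansion: each `?` matches one character, so it strips the last two characters, which here are the literal backslash and `n` that the loop appends after the last line. A minimal illustration with a made-up value:

```shell
s='FIRST LINE\nSECOND LINE\n'   # single quotes: \n stays as two literal characters

# % removes the shortest suffix matching the pattern; ?? matches any two characters.
trimmed=${s%??}

printf '%s\n' "$trimmed"
```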

edited Apr 26 '22 at 08:31

answered Apr 26 '22 at 07:09

Kadir

See [Understanding "IFS= read -r line"](//unix.stackexchange.com/q/209123), [When is double-quoting necessary?](//unix.stackexchange.com/q/68694) and [Why is printf better than echo?](//unix.stackexchange.com/q/65803) – Stéphane Chazelas Apr 26 '22 at 07:34
That will likely leave linefeeds alone, but the OP asked for all escaped characters to be left alone. – Henrik supports the community Apr 26 '22 at 08:04
Backslash processing should be removed from the while read loop for sure. Just edited the answer. – Kadir Apr 26 '22 at 08:35
(1) For starters, `${message}` should be `"$message"`. See [`${variable_name}` doesn’t mean what you think it does …](https://unix.stackexchange.com/q/32210/80216#286525). (2) You should explain your answer better — in particular (IMO) the `%??` part. (You don’t need to explain it *to me;* I figured it out.) … … … … … … … … … … … … … … … Please do not respond in comments; [edit] your answer to make it clearer and more complete. … (Cont’d) – G-Man Says 'Reinstate Monica' May 07 '22 at 19:04
(Cont’d) … (3) This is a classic example of providing a solution for the example while ignoring the larger question. `foo\012bar` will turn into `FOO\nBAR`, `\g\h\i\j\k\l\m\n\o\p\q` will turn into `\G\H\I\J\K\L\M\n\O\P\Q`, and any of `\a`, `\b`, `\c`, `\e`, `\f`, `\r`, `\t`, `\v`, and ``\\`` will cause problems. Also, leading and trailing spaces, and multiple spaces. (4) Strictly speaking, the question didn’t say that you should clobber the original variable. If you need a multi-step process, you should assign the intermediate value to a `temp` variable. – G-Man Says 'Reinstate Monica' May 07 '22 at 19:04

G-Man Says 'Reinstate Monica' · Answer 5 · 2022-05-09T23:15:35.573

echo "$message"  |  sed -e 's/^[[:lower:]]/\u&/' -e 's/\([^\]\)\([[:lower:]]\)/\1\u\2/g' \
                                                 -e 's/\([^\]\)\([[:lower:]]\)/\1\u\2/g'

-e 's/^[[:lower:]]/\u&/' If the first character in the string (or, more generally, the first character on a line) is a lower-case letter, capitalize it. Because the first character on a line can’t be escaped. Duh. That’s a no-brainer.
-e 's/\([^\]\)\([[:lower:]]\)/\1\u\2/g' Look at the line two characters at a time. If a lower-case letter is preceded by something other than a backslash, leave the preceding character alone, and capitalize the lower-case letter.

You might think that this would be enough to process the entire line. Unfortunately, since it processes the line two characters at a time, it gets only every other letter:
```
$ echo "first line\nsecond line" | sed -e 's/\([^\]\)\([[:lower:]]\)/\1\u\2/g'
fIrSt LiNe\nSeCoNd LiNe
```
so,
-e 's/\([^\]\)\([[:lower:]]\)/\1\u\2/g' Do the exact same thing a second time. This will pick up the letters that were skipped on the first pass.

Alternative version:

echo "$message" | sed -e 's/^[[:lower:]]/\u&/' \
                                  -e ': loop; s/\([^\]\)\([[:lower:]]\)/\1\u\2/g; t loop'

Basically the same as the first version, but, instead of repeating the second s command, it iterates it with a loop.

Unfortunately, this will not work correctly for double backslashes:  foo\\bar will become FOO\\bAR, even though the b should be capitalized, since the \\ is an escaped backslash, and so should not cause the b to be escaped.
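
A quick way to compare candidates on such edge cases is a small check table. The sketch below assumes GNU sed (for `\U`) and uses one of the commands from the first answer as the converter; `convert` and `check` are hypothetical helper names.

```shell
# convert: the candidate under test. Assumes GNU sed, since \U is a GNU extension.
convert() {
  printf '%s\n' "$1" | sed -E 's/(\\.)|([^\\]+)/\1\U\2/g'
}

# check: hypothetical helper that compares convert's output with the expected text.
check() {
  actual=$(convert "$1")
  if [ "$actual" = "$2" ]; then
    printf 'ok   %s -> %s\n' "$1" "$actual"
  else
    printf 'FAIL %s -> %s (expected %s)\n' "$1" "$actual" "$2"
  fi
}

check 'first line\nsecond line' 'FIRST LINE\nSECOND LINE'
check '\tindented' '\tINDENTED'   # escape at the very start of the line
check 'foo\\bar' 'FOO\\BAR'       # escaped backslash followed by a letter
```

Swapping the body of `convert` for another command shows which of these inputs it mishandles.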

No, the first character could be escaped, like when you want to insert a tab at the beginning, which would be "\t". — Ocean, May 09 '22 at 15:19
One of us is not understanding the other. If the line begins with `\t`, then *the first character* is ``\``. `t` is *the **second** character.* If I’m misunderstanding you, please explain more clearly. — G-Man Says 'Reinstate Monica', May 09 '22 at 22:36
Semantics. If a line begins with "\t", then the first character is an escaped "t". But one can also say that "\" is the first character. Depends on how you look at it, I guess. It could also be an escaped "\" by having "\\t", so one gets "\t" instead of the tab character. Since these constructs are supposed to represent a single character (\t is tab), I treat them as single entities, which was the origin of the misunderstanding. — Ocean, May 10 '22 at 12:08

jubilatious1 · Answer 6 · 2022-07-24T11:33:13.433

Using Raku (formerly known as Perl_6)

~$ echo 'a\nb'
a\nb
~$ echo 'a\nb' | raku -pe 's:g/ <!after "\\"> (.) /{$0.uc}/;'
A\nB
~$ echo "a\\nb"
a\nb
~$ echo "a\\nb" | raku -pe 's:g/ <!after "\\"> (.) /{$0.uc}/;'
A\nB

Above uses a negative look-behind assertion, <!after "\\">, to select out all characters except those immediately after a \ backslash. Selected characters are then uppercased with Raku's .uc routine.

Certainly it's safer to provide the regex with a custom <-[ … ]> negative character class, sparing backslashed characters like \n and \t from being uppercased. (FYI, custom positive character classes are written <+[ … ]> or more simply <[ … ]> in Raku).

Below, using Raku's "Q-lang" (quoting language) to feed the substitution operator a string. In all four examples below \n is returned (not uppercase \N). Note in the third example how \n is operationally-interpreted as a newline character, and this remains unchanged in the fourth example, telling us that \n still exists in that string (i.e. it has NOT been uppercased to \N):

~$ raku -e 'put Q<a\nb>'
a\nb
~$ raku -e 'put Q<a\nb>' | raku -pe 's:g/ <!after "\\"> (<-[nt]>) /{$0.uc}/;'
A\nB
~$ raku -e 'put Q:b<a\nb>'
a
b
~$ raku -e 'put Q:b<a\nb>' | raku -pe 's:g/ <!after "\\"> (<-[nt]>) /{$0.uc}/;'
A
B

NOTE, see: "Place an escape sign before every non-alphanumeric characters" for Raku answers to a related question on StackOverflow.

References:
https://docs.raku.org/language/quoting
https://docs.raku.org/language/regexes#Literals_and_metacharacters
https://raku.org
