The methods I found also uppercase the letters inside escape sequences such as \n, which breaks things further down the line.
For example:

$ message="First Line\nSecond Line"; 
$ echo "${message^^}"
FIRST LINE\NSECOND LINE

Is there an elegant way to convert a string to uppercase while leaving escaped characters alone, to get the following output instead?

FIRST LINE\nSECOND LINE

I could just do something convoluted like changing "\n" to 0001 or something along those lines, apply the conversion and then return 0001 to "\n". But maybe there is a better way.
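
For reference, that placeholder idea might look roughly like the sketch below. It assumes bash, uses the control character \001 as a placeholder that is assumed never to occur in the input, and protects only the single sequence \n (so an escaped backslash followed by n, as in \\n, would be mishandled). The variable names are made up for illustration.

```bash
message='First Line\nSecond Line'
esc='\n'                 # the one escape sequence to protect (assumption)
placeholder=$'\001'      # assumed never to occur in the input

tmp=${message//"$esc"/"$placeholder"}    # hide \n behind the placeholder
tmp=${tmp^^}                             # uppercase everything else
result=${tmp//"$placeholder"/"$esc"}     # put \n back
printf '%s\n' "$result"
```

Every further escape sequence would need its own placeholder, which is what makes this approach convoluted.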

Ocean
  • Is this for later inclusion as part of some other data, possibly in XML or JSON format? If so, a parser of that format may possibly have routines for turning strings into uppercase in the way you describe, as, for example, `ascii_upcase` in the JSON parser `jq`, or the XPath function `upper-case()` for XML. – Kusalananda Jul 24 '22 at 11:50
  • @Kusalananda For me this is only about text processing, but someone else stumbling across this question might have such a use case. – Ocean Jul 25 '22 at 09:52

6 Answers

With zsh instead of bash:

$ message="First Line\nSecond Line"
$ set -o extendedglob
$ print -r -- ${message//(#b)((\\?)|(?))/$match[2]$match[3]:u}
FIRST LINE\nSECOND LINE

In bash (or any shell) and with the GNU implementation of sed, you can do the same with:

$ printf '%s\n' "$message" | sed -E 's/(\\.)|(.)/\1\u\2/g'
FIRST LINE\nSECOND LINE

Some variants that are potentially more efficient, as they minimise the number of substitutions:

  • zsh

    print -r -- ${message//(#b)((\\?)|([^\\]##))/$match[2]$match[3]:u}
    

    or

    print -r -- ${message//(#b)((\\?)#)([^\\]##)/$match[1]$match[3]:u}
    
  • their GNU sed translations:

    printf '%s\n' "$message" | sed -E 's/(\\.)|([^\\]+)/\1\U\2/g'
    

    or

    printf '%s\n' "$message" | sed -E 's/((\\.)*)([^\\]+)/\1\U\3/g'
    

Beware that they convert \Mx (Meta-x, an escape sequence supported by zsh's print for instance, which expands to the 0xf8 byte ('x' + 0x80)) to \MX (0xd8). They also convert \x7a to \x7A, \u007a to \u007A and \Cx to \CX, but that shouldn't be a problem as those expand to the same thing.
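
Where GNU sed is not available, the same idea (copy a backslash and the character after it verbatim, uppercase everything else) can be sketched in pure bash. This is only an illustration: the function name upper_keep_escapes is hypothetical, it relies on bash 4+ for ${c^^}, and it shares the \Mx and \x7a caveats above.

```bash
# Hypothetical helper: uppercase a string, but copy each backslash
# together with the character that follows it unchanged.
upper_keep_escapes() {
  local in=$1 out= c i
  for (( i = 0; i < ${#in}; i++ )); do
    c=${in:i:1}
    if [[ $c == "\\" ]]; then
      out+=${in:i:2}        # backslash plus the next character, if any
      i=$(( i + 1 ))        # skip the character that was just copied
    else
      out+=${c^^}           # uppercase any other character
    fi
  done
  printf '%s\n' "$out"
}

upper_keep_escapes 'First Line\nSecond Line'
```

A character-by-character loop like this is likely to be much slower than sed or the zsh expansion on long strings.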

Stéphane Chazelas

I'd be tempted to interpret the escape sequences into literal characters:

message="First Line\nSecond Line"
declare -u Message                       # uppercase on assignment
printf -v Message -- "${message//%/%%}"  # assign
declare -p Message                       # inspect

result

declare -u Message="FIRST LINE
SECOND LINE"
glenn jackman
  • Beware that with `message='\141'` for instance, you'd get `declare -u Message="A"` instead of `declare -u Message="a"` – Stéphane Chazelas Apr 25 '22 at 19:07
  • Note that any ```\``` will be doubled ```\\```. –  Apr 25 '22 at 22:45
  • Using the data as the `printf` format is what makes `%` special, and that is what you work around by doubling every `%`. However, a `printf -v Message -- '%b' "${message}"` will interpret backslash escapes much as `echo -e` does, without changing the `%`s. –  Apr 25 '22 at 22:57
  • Please read: https://unix.stackexchange.com/q/700508/232326 –  Apr 27 '22 at 19:26

I'd consider evaluating the \n and other escape sequences at the point that the variable was defined. Here $message actually contains a newline.

message=$(printf '%b' 'First Line\nSecond Line')
echo "${message^^}"

Output

FIRST LINE
SECOND LINE
roaima

The variable can be processed line by line: uppercase each line, then join the output again with a literal \n.

bash:

$ message="First Line\nSecond Line";
$ message=$(echo -e ${message} |while read -r line; do echo -n "${line^^}\n" ; done) && message=${message%??}
$ echo ${message} 
FIRST LINE\nSECOND LINE
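
The `%??` part deserves a note: `${message%??}` removes the shortest suffix matching two arbitrary characters, which here is the extra literal \n that the loop appends after the last line. A minimal illustration, with variable names chosen only for this example:

```bash
s='FIRST LINE\nSECOND LINE\n'   # what the loop produces: a trailing literal \n
trimmed=${s%??}                 # drop the last two characters (backslash and n)
printf '%s\n' "$trimmed"
```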
Kadir
  • See [Understanding "IFS= read -r line"](//unix.stackexchange.com/q/209123), [When is double-quoting necessary?](//unix.stackexchange.com/q/68694) and [Why is printf better than echo?](//unix.stackexchange.com/q/65803) – Stéphane Chazelas Apr 26 '22 at 07:34
  • That will likely leave linefeeds alone, but the OP asked for all escaped characters to be left alone. – Henrik supports the community Apr 26 '22 at 08:04
  • Backslash processing should be removed from the while read loop for sure. Just edited the answer. – Kadir Apr 26 '22 at 08:35
  • (1) For starters, `${message}` should be `"$message"`.  See [`${variable_name}` doesn’t mean what you think it does …](https://unix.stackexchange.com/q/32210/80216#286525).  (2) You should explain your answer better — in particular (IMO) the `%??` part. (You don’t need to explain it *to me;* I figured it out.) … … … … … … … … … … … … … … … Please do not respond in comments; [edit] your answer to make it clearer and more complete. … (Cont’d) – G-Man Says 'Reinstate Monica' May 07 '22 at 19:04
  • (Cont’d) …  (3) This is a classic example of providing a solution for the example while ignoring the larger question.  `foo\012bar` will turn into `FOO\nBAR`, `\g\h\i\j\k\l\m\n\o\p\q` will turn into `\G\H\I\J\K\L\M\n\O\P\Q`, and any of `\a`, `\b`, `\c`, `\e`, `\f`, `\r`, `\t`, `\v`, and ``\\`` will cause problems.  Also, leading and trailing spaces, and multiple spaces. (4) Strictly speaking, the question didn’t say that you should clobber the original variable.  If you need a multi-step process, you should assign the intermediate value to a `temp` variable. – G-Man Says 'Reinstate Monica' May 07 '22 at 19:04
echo "$message" | sed -e 's/^[[:lower:]]/\u&/' \
                      -e 's/\([^\]\)\([[:lower:]]\)/\1\u\2/g' \
                      -e 's/\([^\]\)\([[:lower:]]\)/\1\u\2/g'
  • -e 's/^[[:lower:]]/\u&/'  If the first character in the string (or, more generally, the first character on a line) is a lower-case letter, capitalize it.  Because the first character on a line can’t be escaped.  Duh.  That’s a no-brainer.

  • -e 's/\([^\]\)\([[:lower:]]\)/\1\u\2/g'  Look at the line two characters at a time.  If a lower-case letter is preceded by something other than a backslash, leave the preceding character alone, and capitalize the lower-case letter.

    You might think that this would be enough to process the entire line.  Unfortunately, since it processes the line two characters at a time, it gets only every other letter:

    $ echo "first line\nsecond line" | sed -e 's/\([^\]\)\([[:lower:]]\)/\1\u\2/g'
    fIrSt LiNe\nSeCoNd LiNe
    

    so,

  • -e 's/\([^\]\)\([[:lower:]]\)/\1\u\2/g'  Do the exact same thing a second time.  This will pick up the letters that were skipped on the first pass.


Alternative version:

echo "$message" | sed -e 's/^[[:lower:]]/\u&/' \
                      -e ': loop; s/\([^\]\)\([[:lower:]]\)/\1\u\2/g; t loop'

Basically the same as the first version, but, instead of repeating the second s command, it iterates it with a loop.


Unfortunately, this will not work correctly for double backslashes:  foo\\bar will become FOO\\bAR, even though the b should be capitalized, since the \\ is an escaped backslash, and so should not cause the b to be escaped.
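
For comparison, a pattern that consumes each backslash together with the character after it does not have this limitation. The sketch below reuses the `(\\.)|(.)` alternation from another answer on this page and assumes GNU sed, since `\u` is a GNU extension:

```bash
# GNU sed assumed. \\ is matched as one escaped pair, so the b after it
# is seen as an ordinary character and gets capitalized.
result=$(printf '%s\n' 'foo\\bar' | sed -E 's/(\\.)|(.)/\1\u\2/g')
printf '%s\n' "$result"
```

This prints FOO\\BAR.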

  • No, the first character could be escaped, like when you want to insert a tab at the beginning, which would be "\t". – Ocean May 09 '22 at 15:19
  • One of us is not understanding the other.  If the line begins with `\t`, then *the first character* is ``\``.  `t` is *the **second** character.*  If I’m misunderstanding you, please explain more clearly. – G-Man Says 'Reinstate Monica' May 09 '22 at 22:36
  • Semantics. If a line begins with "\t", then the first character is an escaped "t". But one can also say that "\" is the first character. Depends on how you look at it, I guess. It could also be an escaped "\" by having "\\t", so one gets "\t" instead of the tab character. Since these constructs are supposed to represent a single character (\t is tab), I treat them as single entities, which was the origin of the misunderstanding. – Ocean May 10 '22 at 12:08

Using Raku (formerly known as Perl_6)

~$ echo 'a\nb'
a\nb
~$ echo 'a\nb' | raku -pe 's:g/ <!after "\\"> (.) /{$0.uc}/;'
A\nB
~$ echo "a\\nb"
a\nb
~$ echo "a\\nb" | raku -pe 's:g/ <!after "\\"> (.) /{$0.uc}/;'
A\nB

Above uses a negative look-behind assertion, <!after "\\">, to select out all characters except those immediately after a \ backslash. Selected characters are then uppercased with Raku's .uc routine.

Certainly it's safer to provide the regex with a custom <-[ … ]> negative character class, sparing backslashed characters like \n and \t from being uppercased. (FYI, custom positive character classes are written <+[ … ]> or more simply <[ … ]> in Raku).

Below, Raku's "Q-lang" (quoting language) is used to feed the substitution operator a string. In all four examples \n is returned (not uppercase \N). Note how, in the third example, \n is interpreted as a newline character; this remains unchanged in the fourth example, which tells us that \n still exists in that string (i.e. it has NOT been uppercased to \N):

~$ raku -e 'put Q<a\nb>'
a\nb
~$ raku -e 'put Q<a\nb>' | raku -pe 's:g/ <!after "\\"> (<-[nt]>) /{$0.uc}/;'
A\nB
~$ raku -e 'put Q:b<a\nb>'
a
b
~$ raku -e 'put Q:b<a\nb>' | raku -pe 's:g/ <!after "\\"> (<-[nt]>) /{$0.uc}/;'
A
B

NOTE, see: "Place an escape sign before every non-alphanumeric characters" for Raku answers to a related question on StackOverflow.

References:
https://docs.raku.org/language/quoting
https://docs.raku.org/language/regexes#Literals_and_metacharacters
https://raku.org

jubilatious1