I want to filter a file character by character, in order to remove invalid XML characters whose generation I cannot control, but I cannot even copy individual characters from one file to another. I have used printf before to copy literal sections, including carriage returns, but now a carriage return is not copied as such; it comes out as an empty string. My code:
infile=$1
outfile=$2
touch "$outfile"
while IFS= read -r -n1 char
do
# append one character at a time to the output file
printf "%s" "$char" >> "$outfile"
done < "$infile"
diff "$infile" "$outfile"
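A likely cause, assuming the script runs under bash: `read` stops at its delimiter (newline by default) and then stores an empty string, so the line break never reaches `printf`. The sketch below is one possible workaround rather than a definitive fix. It assumes bash 4.1 or newer for `read -N`, which does not treat the delimiter specially, and the function name `copy_chars` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: copy a file one character at a time, keeping line breaks.
# Assumes bash >= 4.1 (for read -N); copy_chars is a hypothetical name.
copy_chars() {
  local infile=$1 outfile=$2 char
  : > "$outfile"                       # start from an empty output file
  # -N1 reads exactly one character and does not stop at newlines.
  while IFS= read -r -N1 char; do
    printf '%s' "$char" >> "$outfile"
  done < "$infile"
}
```

Calling `copy_chars "$infile" "$outfile"` and then running `diff` on the two files should report no differences for ordinary text input. NUL bytes are not preserved, because bash variables cannot hold them.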
I don't mind using sed or awk, but then I would have to encode the set of allowed characters, which the XML specification defines as:
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
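To make the ranges concrete, the production above can be written as a numeric test on a code point. This is only a sketch in bash: the function name `is_xml_char` is hypothetical, and it expects the code point as a number (decimal or `0x` hex), not the character itself.

```shell
#!/usr/bin/env bash
# Sketch: succeed (exit status 0) when code point $1 matches the Char production.
# is_xml_char is a hypothetical helper name, not part of any tool.
is_xml_char() {
  local cp=$1
  (( cp == 0x9 || cp == 0xA || cp == 0xD )) && return 0   # tab, LF, CR
  (( cp >= 0x20 && cp <= 0xD7FF )) && return 0            # up to the surrogate block
  (( cp >= 0xE000 && cp <= 0xFFFD )) && return 0          # excludes FFFE and FFFF
  (( cp >= 0x10000 && cp <= 0x10FFFF )) && return 0       # supplementary planes
  return 1
}
```

For example, `is_xml_char 0x41` succeeds while `is_xml_char 0xB` fails. In bash, `printf '%d' "'$char"` is one way to obtain a character's numeric value, though the result depends on the locale.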