I was running a little bash script on some text files:

find . -name "*.dat" -exec iconv --from-code='UTF-8' --to-code='ASCII//TRANSLIT' {} --output={} \;

My machine runs Ubuntu 14.04 LTS.

After a while, I found that about half of the data in the files had disappeared; the content was simply cut off in the middle of a line or word. From what I have heard, this is the mysterious signal 7 or a core dump. The problem seems to occur when the files are too large: some of my files were larger than 60 kB, but iconv cut them down to around 30 kB.

What can I do about this? Is this a bug? Is there a workaround? Is there any other convenient way to transliterate diacritics?

MERose
  • The man page doesn't say what happens when the input file is the same as the output file, and it looks like nothing special is done. So when `iconv` truncates the file in preparation for writing the first block, any reads after that may be zero length. – Mark Plotnick Dec 26 '14 at 15:05
  • Yes, it seems that you are right. However, the problem only occurs for files larger than 30 kB. Smaller files are handled correctly. Also, the text that remains in the truncated files gets converted correctly. – MERose Dec 26 '14 at 15:19
  • I ran `strace` on it. It looks like `iconv` starts to write to the output file when it has amassed 32768 bytes. – Mark Plotnick Dec 26 '14 at 15:30
  • If I use `iconv` individually on a file larger than 32 K (and overwrite the original file), I get the "Bus error" message, with an exit code of 135. – Guillaume Husta Feb 04 '22 at 10:58

1 Answer

As pointed out in the comments to my question, the problem occurs when two conditions are met:

  1. Source and target file are the same.
  2. File is larger than 32768 bytes.

There are two solutions: either write the output to a temporary file that then replaces the source file, or use recode.

For the first solution, see for example https://unix.stackexchange.com/a/10243/94483. For sponge, there is a very good question on SO (https://stackoverflow.com/q/64860/362146) and also an answer here: https://unix.stackexchange.com/a/19980/94483

I will stick with iconv, since recode supports fewer character sets (and I also failed to get it to run):
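For illustration, the temporary-file variant might look roughly like the sketch below. The function name `translit_in_place` is hypothetical, and the sketch assumes `mktemp` is available and that the temporary file can be created next to the source file. It is not taken from the linked answers.

```shell
# Hypothetical helper: convert one file via a temporary file,
# then replace the original only if iconv succeeded.
translit_in_place() {
  src=$1
  # Create the temporary file next to the source so that mv stays on one filesystem.
  tmp=$(mktemp "${src}.XXXXXX") || return 1
  if iconv --from-code='UTF-8' --to-code='ASCII//TRANSLIT' "$src" > "$tmp"; then
    mv -- "$tmp" "$src"    # conversion succeeded: replace the original
  else
    rm -f -- "$tmp"        # conversion failed: leave the original untouched
    return 1
  fi
}
```

It could then be called once per file, for example as `translit_in_place "$file"` inside a loop over the `.dat` files. Note that the replaced file takes on the permissions of the temporary file, which may differ from those of the original.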

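# Read NUL-delimited names from find so that paths containing spaces survive
find . -type f -name "*.dat" -print0 |
while IFS= read -r -d '' file
do
  # sponge collects all of iconv's output before it rewrites "$file"
  iconv --from-code='UTF-8' --to-code='ASCII//TRANSLIT' "$file" | sponge "$file"
done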

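sponge does the replacing job: it soaks up all of its input before writing to the output file, so the source file is read completely before it is overwritten. It is part of moreutils.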

MERose