
Is it safe to use tar when file names contain characters other than printable ASCII characters?

For example, Japanese characters, Chinese characters, newline characters, etc.

Are there any known problems that might make tarball extraction fail when such special characters are used?

iruvar
Kevin Dong
  • Have a read of http://superuser.com/questions/60379/how-can-i-create-a-zip-tgz-in-linux-such-that-windows-has-proper-filenames/60591#60591 - it may help. – garethTheRed Dec 01 '14 at 14:18
  • @garethTheRed - that answer might apply to GNU `tar` - which does not encode anything but sparse files - but a POSIX `tar/pax` will encode files as UTF-8 *or* not at all and record its type as *BINARY*. And a POSIX `tar/pax` *does* allow for changing a charset, anyway. – mikeserv Dec 01 '14 at 16:05
  • @mikeserv - I thought the link gave a good overview of issues with `tar` that's all - @Sys' answer covered the POSIX option. The OP has tagged this as `linux` and the CentOS 7 and Ubuntu 14.04 boxes I run both show `tar --show-defaults` still as `--format=gnu`, therefore there is a very good chance that the OP will be using `tar` in GNU mode by default. Thanks for mentioning `pax` by the way - never really looked at it before :-) Time for some reading... – garethTheRed Dec 01 '14 at 16:32
  • @garethTheRed - all *very* good points. I mean to ask a question about `pax` soon - the features listed [here](http://pubs.opengroup.org/onlinepubs/9699919799/utilities/pax.html) are really friggin cool - but I've yet to actually find a `pax` that supports them - or all of them - especially the `-o listop=...`. The closest I've come, actually, is GNU `tar` and modified headers with `--format=posix --pax-option=...` - but still not `listop`... *sigh*. The POSIX `pax` description is among the best - reading it was enough for me to do [this](http://unix.stackexchange.com/a/151057/52934). – mikeserv Dec 01 '14 at 16:38
  • Related: http://unix.stackexchange.com/questions/228234/untar-filenames-in-a-character-encoding-different-from-encoding-used-in-the-file – Incnis Mrsi Sep 17 '15 at 10:24

1 Answer

You can of course read the source of tar to check for yourself.

Simply put, tar does no interpretation of the byte sequence that makes up a filename. Just like the kernel, it treats the name as an abstract sequence of bytes. So it is 'safe', in the sense that usable files will be extracted.
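As a rough illustration of that byte-for-byte round trip, here is a minimal sketch using Python's standard `tarfile` module rather than GNU `tar` itself, so it only approximates what the `tar` binary does. The helper name `roundtrip` and the sample names are made up for the example; GNU format is chosen only because the comments above suggest it is a common default, and `surrogateescape` is used so that names that are not valid UTF-8 survive unchanged.

```python
import io
import tarfile


def roundtrip(raw_names):
    """Archive one small member per raw byte-string name; return the names read back.

    Hypothetical helper for illustration only; everything happens in memory.
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT,
                      encoding="utf-8", errors="surrogateescape") as tar:
        for raw in raw_names:
            # tarfile wants str names; surrogateescape maps arbitrary bytes losslessly.
            info = tarfile.TarInfo(name=raw.decode("utf-8", "surrogateescape"))
            payload = b"x"
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))

    buf.seek(0)
    with tarfile.open(fileobj=buf, mode="r",
                      encoding="utf-8", errors="surrogateescape") as tar:
        # Convert the names back to the raw bytes stored in the archive headers.
        return [m.name.encode("utf-8", "surrogateescape") for m in tar.getmembers()]


names = [
    "日本語.txt".encode("utf-8"),  # Japanese, UTF-8 bytes
    "中文.txt".encode("utf-8"),    # Chinese, UTF-8 bytes
    b"line\nbreak.txt",            # embedded newline
    b"caf\xe9.txt",                # Latin-1 bytes, not valid UTF-8
]
print(roundtrip(names) == names)
```

The comparison is done on bytes on purpose: whether those bytes display as the intended characters is a separate question that depends on the locale of whoever lists the files.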

In the environment where the files are unpacked, user tools may interpret the filenames as different characters; that is always an issue when locales change, and it is not specific to the transport (tar, NFS, FTP, ...).
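To make the locale point concrete, here is a small sketch in Python. It assumes a name that was created on a Shift_JIS system and later displayed on a UTF-8 one; the function name `render` and the choice of encodings are illustrative only. The stored bytes never change, only their interpretation does.

```python
def render(raw, encoding):
    """Decode raw filename bytes the way a tool in the given locale might display them."""
    return raw.decode(encoding, errors="replace")


# Bytes as they might have been written on a Shift_JIS system (an assumption).
stored = "日本語.txt".encode("shift_jis")

# Read back under the original encoding, the name is intact.
print(render(stored, "shift_jis") == "日本語.txt")

# Read back under UTF-8, the very same bytes no longer decode to the same text.
print(render(stored, "utf-8") == "日本語.txt")
```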

Toby Speight