Commands for converting binary data to octal/hex formats and back?

Question

I need to convert binary data to hex, octal, or any other suitable text format and back to binary. The task is to split one big 1 GB file into files containing 777 events each. The events are not all the same size, and each event is separated by the string fafafafa as it appears in hexdump format (note that this separator may not exist in the telnet examples below, so you can choose any string there for practice). I am trying to understand which of these commands are suitable for this, motivated by this answer here.

The data source below, the telnet binary, is just an example. I deliberately comment on the outputs at a pseudo level so as not to confuse you with details; I have full documentation of the headers and their parts, but understanding them is not necessary for this task.

od -v

od -v /usr/bin/telnet | head
0000000 042577 043114 000402 000001 000000 000000 000000 000000
0000020 000003 000076 000001 000000 054700 000000 000000 000000
0000040 000100 000000 000000 000000 073210 000001 000000 000000
0000060 000000 000000 000100 000070 000010 000100 000034 000033
0000100 000006 000000 000005 000000 000100 000000 000000 000000
0000120 000100 000000 000000 000000 000100 000000 000000 000000
0000140 000700 000000 000000 000000 000700 000000 000000 000000
0000160 000010 000000 000000 000000 000003 000000 000004 000000
0000200 001000 000000 000000 000000 001000 000000 000000 000000
0000220 001000 000000 000000 000000 000034 000000 000000 000000

Comments

The first string on each line should be some header, but it is odd that the values go 2, 4, 6, 10, ..., so I think this may be a limitation later.

hexdump -v

hexdump -v /usr/bin/telnet | head
0000000 457f 464c 0102 0001 0000 0000 0000 0000
0000010 0003 003e 0001 0000 59c0 0000 0000 0000
0000020 0040 0000 0000 0000 7688 0001 0000 0000
0000030 0000 0000 0040 0038 0008 0040 001c 001b
0000040 0006 0000 0005 0000 0040 0000 0000 0000
0000050 0040 0000 0000 0000 0040 0000 0000 0000
0000060 01c0 0000 0000 0000 01c0 0000 0000 0000
0000070 0008 0000 0000 0000 0003 0000 0004 0000
0000080 0200 0000 0000 0000 0200 0000 0000 0000
0000090 0200 0000 0000 0000 001c 0000 0000 0000

Comments

numbering of the first string is OK
some letters appear in between, so readability may be a problem later
the first string has a different size than the four-character groups that follow

hexdump -vb

hexdump -vb /usr/bin/telnet | head
0000000 177 105 114 106 002 001 001 000 000 000 000 000 000 000 000 000
0000010 003 000 076 000 001 000 000 000 300 131 000 000 000 000 000 000
0000020 100 000 000 000 000 000 000 000 210 166 001 000 000 000 000 000
0000030 000 000 000 000 100 000 070 000 010 000 100 000 034 000 033 000
0000040 006 000 000 000 005 000 000 000 100 000 000 000 000 000 000 000
0000050 100 000 000 000 000 000 000 000 100 000 000 000 000 000 000 000
0000060 300 001 000 000 000 000 000 000 300 001 000 000 000 000 000 000
0000070 010 000 000 000 000 000 000 000 003 000 000 000 004 000 000 000
0000080 000 002 000 000 000 000 000 000 000 002 000 000 000 000 000 000
0000090 000 002 000 000 000 000 000 000 034 000 000 000 000 000 000 000

The command od -v gives me six-character strings like 000000 042577, which I think is an octal format. Another command, hexdump -v, gives me four-character strings like 457f 464c, while with the octal option hexdump -vb it gives three-character words like 177 105 ....

Which of these commands are suitable for manipulating binary data so that the splitting becomes easy?

Are you splitting at fixed intervals? There's `split` for that... — Stephen Kitt, Jun 26 '15 at 05:57
@StephenKitt No. I am splitting by a specific header `fafafafa`. — Léo Léopold Hertz 준영, Jun 26 '15 at 06:01
Reading the Q implies to me that you don't know that the _first string_ (as you call it) on each line is the starting offset in the file for the following data. — FloHimself, Jun 26 '15 at 06:56
@FloHimself I took a more abstract approach to this. I do not want to get stuck on details while I am finding the right tool for the task. I deliberately call it *first string* to stay at a pseudo level. Yes, I have specific documentation for the headers and each of their parts, but explaining every detail of the task is not necessary when thinking about tools. I added a note about your comment to the body so readers understand that this is an attempt to understand the benefits of these tools for the task, not to dwell on details. — Léo Léopold Hertz 준영, Jun 26 '15 at 07:10
You should stick to details to find the right tool. The right tool for what you want to do is `xxd`, which is basically `xxd -ps file | process | xxd -ps -r`. — FloHimself, Jun 26 '15 at 07:16
@FloHimself you should submit that as an answer, I also think that's the right approach! — Stephen Kitt, Jun 26 '15 at 07:25
@Masi specifying `-v` still requires you to specify the output format if the default isn't suitable (so `-t x1` for `od` or something like that). — Stephen Kitt, Jun 26 '15 at 07:26
I'm puzzled by what you're trying to achieve here. Make a file pass through an ASCII channel? That would be something like this: `base64 file | process | base64 -d`. Why do you need the various hex dumpers anyway? What's wrong with Base64 / UUencode / XXencode / whatever? There are zillions of other formats. Which one is better depends on what you're trying to do. — lcd047, Jun 26 '15 at 07:53
I made a community wiki about your comments here. Feel free to post better answers. I think @FloHimself's comment is the best answer here and fits well to my challenge. It would be nice to get some comparison between `xxd -ps` and `od -v -t x1`. OD seems to be easier to read, but may have some additional benefits in the datatype of output. — Léo Léopold Hertz 준영, Jun 26 '15 at 09:36
@lcd047 - i'm w/ you. I would think csplit could be useful here - barring NULs or at least given some possibility for a proxy byte - because i think just 777 fafafa is what is actually wanted? — mikeserv, Jun 26 '15 at 10:58

score 2 · Accepted Answer · edited Apr 13 '17 at 12:45

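The 1 GB of binary test data, discussed here, is created by

dd if=/dev/urandom of=sample.bin bs=64M count=16 iflag=fullblock

(iflag=fullblock is a GNU dd flag; without it, a short read from /dev/urandom can leave the file smaller than 1 GB.)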

Split by byte position

Please see the thread about this here. I think this is the most appropriate way to do the split if every event has the same size in bytes. Determine the locations of the first two event headers and compute the event size from the difference between them. Also consider the tail after the last event header so you know where to stop splitting.
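
A minimal sketch of the fixed-size case, assuming every event starts with the 4-byte separator fa fa fa fa and all events have the same length. The file names events.bin and chunk_, the helper find_offsets, and the setting of 2 events per file are made up for illustration; only od, tr, awk, and split are used.

```shell
# Tiny sample: three 8-byte events, each starting with fa fa fa fa.
printf '\372\372\372\372AAAA\372\372\372\372BBBB\372\372\372\372CCCC' > events.bin

# Print the decimal byte offset of every fa fa fa fa in the file.
find_offsets() {
    od -An -v -t x1 "$1" | tr -s ' ' '\n' | awk '
        NF == 0 { next }                      # skip blank lines left by tr
        {
            n++                               # bytes read so far
            run = ($1 == "fa") ? run + 1 : 0  # length of the current run of fa
            if (run == 4) { print n - 4; run = 0 }
        }'
}

# Event size = distance between the first two separators.
set -- $(find_offsets events.bin)
event_size=$(( $2 - $1 ))

# Write 2 events per output file (the real task would use 777).
split -b "$(( event_size * 2 ))" events.bin chunk_
ls chunk_*
```

On the real file this byte-per-line scan is slow, but only the first two offsets are needed, so scanning the beginning of the file should be enough before handing the computed size to split.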

xxd

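The answer is in FloHimself's comment: convert the file to a plain hex dump, process the text, and convert the result back to binary. Here process is a placeholder for whatever text filter does the splitting.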

xxd -ps sample.bin | process | xxd -ps -r

od -v

With od -v, you still need to specify the output format if the default is not suitable, for example one hex byte per column (based on Stephen Kitt's comment):

od -v -t x1 sample.bin

giving

0334260    00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00
0334300    00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00
0334320    00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00
0334340    00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00
0334360    00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00
0334400    00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00
0334420    00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00
0334440    00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00
0334460    00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00
0334500    00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00
0334520    00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00
0334540

which is easier to handle.
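
To compare the formats on a known input, here is a small sketch using the first six bytes seen in the telnet dumps in the question (7f 45 4c 46 02 01); header.bin is just an illustrative file name. The first column of od output is the byte offset of the line (octal by default, hence 0000020 after 16 bytes), and -A selects its radix or drops it.

```shell
# The first six bytes of the telnet binary shown in the question.
printf '\177ELF\002\001' > header.bin

# One hex byte per column; the first column is the offset in octal.
od -v -t x1 header.bin

# Same data with decimal offsets (-A d) instead of octal.
od -v -A d -t x1 header.bin

# No offsets (-A n) and no whitespace: a bare hex stream similar to xxd -p.
od -v -A n -t x1 header.bin | tr -d ' \n'; echo
# prints 7f454c460201
```

The default od -v output groups bytes into 2-byte words in host byte order, which is why the bytes 7f 45 show up as 042577 (0x457f) on a little-endian machine. With -t x1 the bytes appear in file order, which makes searching for a byte sequence such as fa fa fa fa more straightforward.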

Comment about passing through an ASCII channel instead of hex/octal

I think converting from binary to hex and back to binary with xxd -ps is sufficient. I tried base64 sample.bin | less -s -M +Gg, but processing was significantly slower and the output looked like this:

CGgUAMA0GCSqGSIb3DQEBAQUABIIBABR2EFj76BigPN+jzlGvk9g3rYrHiPKNIjKGprJMaB91ATT6gc0Rs3xlEr6Ybzm8NVcxMnR+2chto/oSh85ExuH4Lk8mELHOIZLeAUUr8eFAXKnZ4SBZ6a8Ewr0x/zX09Bp6IMk18bdVUCT15PT2fbluvJfj7htWCDy0ewm+eU2LIJgkriK8AA0oarqjjK/CIhfglQutfN6QDEp4zqc6tJVqUO7XrEsFlGDOgcPTzeWJuWx31/8MrvEn5HcPzhq+nMI1D6NYjzGhHN08//ObF3z3zthlCDVmowbV161i2LhQ0jy9a/TNyAM0juCR0IF9j7zSyFW0/vvMZYdt5kg1J1EAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
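
One more reason hex suits this task better than Base64, shown as a small sketch: Base64 packs 3 bytes into 4 characters, so the same separator bytes encode differently depending on where they fall in the file, and searching the text for one fixed string is unreliable. The commands assume the GNU coreutils base64 (some systems spell the decode flag differently), and b64_sample.bin is an illustrative file name.

```shell
# Separator at offset 0 versus offset 1: the Base64 text differs.
printf '\372\372\372\372' | base64      # +vr6+g==
printf 'A\372\372\372\372' | base64     # Qfr6+vo=

# In a byte-per-column hex dump the separator looks the same at any offset.
printf 'A\372\372\372\372' | od -An -v -t x1    # bytes: 41 fa fa fa fa

# The round trip itself is lossless.
printf 'A\372\372\372\372' > b64_sample.bin
base64 b64_sample.bin | base64 -d | cmp - b64_sample.bin && echo "round trip OK"
```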