[DSP] File formats are hard

by Paweł Świątkowski
31 Mar 2017

Last time I wrote that I would probably need to design a custom file format to hold all the information about an encrypted file. This seemed unlikely at the beginning of the project, but then the requirements changed. I “invented” something that is widely known in the cryptography world: hybrid encryption, where a random key for symmetric encryption is generated, the file is encrypted with it, and then that key is encrypted with the recipient’s public key. So now I have to send two chunks of data, and a custom file format became a necessity. But it is not as easy as it sounds…

At first, I was aiming for the simplest thing. Put the number of bytes the encrypted key takes, then the key itself, and then follow it with the encrypted message. While decrypting, I would have to read a number (easy), read the given number of bytes (easy) and read the rest of the file (also easy). But is it enough?

Probably not. What if I want to pass something extra with the message? For example, the date when it was encrypted. Or the original file name, if I want to hide this information too. Soon I realized that I need something more flexible. I would also need to version the file format somehow, should it change in the future.
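For reference, the first, length-prefixed idea could be sketched roughly like this. It is only an illustration, not the code from Bletchley: the 4-byte big-endian prefix and the names `pack_simple` and `unpack_simple` are my assumptions. It also shows the limitation: there is no room for a version number or any extra metadata.

```python
import io
import struct
from typing import Tuple

# Assumed layout: 4-byte big-endian key length, encrypted key, encrypted message.
LENGTH_PREFIX = struct.Struct(">I")


def pack_simple(encrypted_key: bytes, encrypted_message: bytes) -> bytes:
    """Concatenate the length prefix, the encrypted key and the encrypted message."""
    return LENGTH_PREFIX.pack(len(encrypted_key)) + encrypted_key + encrypted_message


def unpack_simple(blob: bytes) -> Tuple[bytes, bytes]:
    """Read a number, read that many bytes, then read the rest."""
    stream = io.BytesIO(blob)
    prefix = stream.read(LENGTH_PREFIX.size)
    if len(prefix) != LENGTH_PREFIX.size:
        raise ValueError("truncated length prefix")
    (key_length,) = LENGTH_PREFIX.unpack(prefix)
    encrypted_key = stream.read(key_length)
    if len(encrypted_key) != key_length:
        raise ValueError("truncated key")
    encrypted_message = stream.read()  # everything that is left
    return encrypted_key, encrypted_message
```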

JSON to the rescue!

Looking at some other file formats, you might notice that they use an existing format or language to hold their data. For example, modern Office files are zipped XML under the hood. So why not go this way? Of course, I won’t use XML, I have my dignity, but JSON sounds fair enough. With that thought I quickly sketched an outline of my .ble file:

{
	"version": 1,
	"key": "...",
	"content": "..."
}

Easy, right?

Yes, but it’s not enough. Encrypted data is binary, so it’s just a sequence of seemingly random bytes. It does not print as text (try cat some_file.pdf), and it does not play along with text formats like JSON, XML or YAML, which expect valid text rather than arbitrary bytes.
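A quick way to see the problem, as a small sketch (the four sample bytes are just a stand-in for real encrypted output, and the helper names are made up for this example): Python's `json` module refuses raw bytes, and the bytes cannot be turned into a string first because they are not valid UTF-8.

```python
import json

# Stand-in for encrypted output: 0xFF can never appear in valid UTF-8.
binary = bytes([0x00, 0xFF, 0xFE, 0x80])


def json_error(payload: bytes) -> str:
    """Name of the exception raised when raw bytes are put into JSON as-is."""
    try:
        json.dumps({"content": payload})
    except TypeError as error:
        return type(error).__name__
    return "no error"


def utf8_error(payload: bytes) -> str:
    """Name of the exception raised when the bytes are decoded as UTF-8 text."""
    try:
        payload.decode("utf-8")
    except UnicodeDecodeError as error:
        return type(error).__name__
    return "no error"


print(json_error(binary), utf8_error(binary))
```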

Now, I don’t know many binary formats and I didn’t really want to go into researching yet another thing. Instead, I decided to use the industry standard for such situations.

Base64 not to the rescue!

Yes, you probably used base64 more than once to transmit binary data over text protocols. You disagree? Ask your web browser ;) Anyway, base64 is fast and it’s widely supported.

So I used it. And it worked! And I was pretty happy, until I decided to encrypt some 20MB PDF file I randomly chose from my oversized Downloads directory. It encrypted in a few seconds, but… it was 27MB afterwards. After another round of investigation I found that this is no mistake. Base64 output is always about 33% larger than the input, because every 3 bytes of input become 4 characters of output.
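The arithmetic is easy to check. Here is a small sketch using the standard `base64` module; the helper name `base64_size` is made up for this example, and I assume that "20MB" means 20,000,000 bytes (the ratio is the same for any size).

```python
import base64
import os


def base64_size(input_size: int) -> int:
    """Length of padded Base64 output: 4 characters per started group of 3 bytes."""
    return 4 * ((input_size + 2) // 3)


# Compare the formula with what the standard library actually produces.
sample = os.urandom(30_000)
encoded = base64.b64encode(sample)
print(len(sample), len(encoded), base64_size(len(sample)))

# The same formula applied to a 20 MB file.
twenty_mb = 20_000_000
print(base64_size(twenty_mb) / twenty_mb)  # ratio of output size to input size
```

For a 20 MB input this gives roughly 26.7 MB of Base64 text, which matches the 27MB file I ended up with.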

I don’t want that. What are my options? Well, there is Base-122 which produces only 14% larger output. Better, but still not quite there…

Hybrid approach once again (to the rescue!)

Since I turned out to be pretty good at designing hybrid approaches to solve the problems I encounter while developing Bletchley, I came up with yet another one. This is the result:

{"version":1,"key":"...","original_filename":"myfile.tiff"}
[hell of binary bytes here]

The first line of my .ble file is a simple JSON document. It contains the version and all the other things I want. It also includes the encrypted key in base64. That makes the key 33% larger than it could be, but the key is short, so it’s no loss. From the second line on, the binary blob floats around, devouring incautious wanderers.

It is an easy task to read the first line of a file. It is easy to read the rest of the file. Everyone is happy. Well, maybe except for me, because I worked until the small hours twice to get it done, instead of casually coding it in two hours, like I planned. But it works, and you can see the code in the repo.
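The real implementation lives in the repo. As an independent illustration of the layout only, here is a minimal Python sketch; the function names `write_ble` and `read_ble` are hypothetical, and the header fields are the ones from the example above.

```python
import base64
import io
import json
from typing import BinaryIO, Tuple


def write_ble(stream: BinaryIO, encrypted_key: bytes, ciphertext: bytes,
              original_filename: str) -> None:
    """Write a one-line JSON header followed by the raw binary blob."""
    header = {
        "version": 1,
        "key": base64.b64encode(encrypted_key).decode("ascii"),
        "original_filename": original_filename,
    }
    # json.dumps escapes newlines inside strings, so the header stays on one line.
    stream.write(json.dumps(header, separators=(",", ":")).encode("utf-8") + b"\n")
    stream.write(ciphertext)


def read_ble(stream: BinaryIO) -> Tuple[dict, bytes, bytes]:
    """Read the first line as JSON and treat everything after it as the blob."""
    header = json.loads(stream.readline())
    if header.get("version") != 1:
        raise ValueError("unsupported file format version")
    encrypted_key = base64.b64decode(header["key"])
    ciphertext = stream.read()  # may itself contain newline bytes, which is fine
    return header, encrypted_key, ciphertext


buffer = io.BytesIO()
write_ble(buffer, b"\x01\x02\xff", b"\x00\n\xff blob", "myfile.tiff")
buffer.seek(0)
print(read_ble(buffer))
```

Only the header is split on a newline. The blob is read as "the rest of the file", so newline bytes inside the encrypted data do no harm.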

State of Bletchley

Right now I have two subcommands to encrypt and decrypt files from the command line. And they work!

./bletchley encrypt -k id_rsa.pub my_file.tiff
./bletchley decrypt -k id_rsa output.ble

There is also one subcommand to just generate the key pair:

./bletchley gen --name id_rsa

Right now I think it’s time to turn my attention to “front end” and start coding the GUI for the project.

Tags: dsp general thoughts
