I love doing research. My first reaction to a gnarly problem is to try to discover everything I can about it and, crucially, to see how others have tackled it.
It’s not just that other people’s work often means I have to do less of my own; experiencing the plurality of expression that’s available out there is a thing of joy.
For example, a problem that regularly comes up when you’re making software that lets users upload their own files is ensuring the consistency and integrity of said files.
“How do you do that?” is a solid question to research.
I don’t need to discover the nuances and full back story of the field, just enough to not fall into the Chesterton’s fence trap – know the why of the status quo before criticising it.
Through my research, I discovered that there are, effectively, three ways to do this:
- Simple checksums such as Cyclic Redundancy Checks or Adler-32. Possibly the simplest way to check that a file hasn’t changed.
- Non-cryptographic hashes, like xxHash or MurmurHash. These have a variety of uses. They can be used to check if a file has changed.
- Cryptographic hashes. These can be used to assert a unique identity for a file.
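To make the first category concrete, here is a minimal TypeScript sketch of Adler-32, following the algorithm’s published definition (two running sums modulo 65521). The function name `adler32` and the sample strings are illustrative choices, not taken from any particular library.

```typescript
// Adler-32: two running sums, each kept modulo 65521
// (the largest prime below 2^16).
const MOD_ADLER = 65521;

function adler32(bytes: Uint8Array): number {
  let a = 1; // running sum of all bytes, starting at one
  let b = 0; // running sum of every intermediate value of `a`
  for (let i = 0; i < bytes.length; i++) {
    a = (a + bytes[i]) % MOD_ADLER;
    b = (b + a) % MOD_ADLER;
  }
  // Pack the two 16-bit sums into one unsigned 32-bit integer.
  return ((b << 16) | a) >>> 0;
}

const encoder = new TextEncoder();
const original = encoder.encode("Wikipedia");
const corrupted = encoder.encode("Wikipedla"); // hypothetical one-byte corruption

console.log(adler32(original).toString(16)); // 11e60398
console.log(adler32(original) === adler32(corrupted)); // false
```

A single changed byte alters the checksum, which is all a simple checksum promises. It says nothing about deliberate tampering.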
You generally need both a hash and a checksum, and if you are potentially in an adversarial environment – like, y’know, the internet – the hash probably needs to be a secure cryptographic one.
- Checksums and CRCs are incredibly fast – optimised versions can use as little as a single CPU instruction for each pass – and are sensitive to the specific kinds of changes that are associated with transmission errors, network errors or file corruption.
- Hashes assert a unique identity for a file but are generally slower and usually checked at a later stage where it might be more tricky to retry the request.
- You could use only a fast non-cryptographic hash such as xxHash (an excellent non-cryptographic hash, if you’re looking for one), but the ones with a shorter bit length don’t actually assert uniqueness: collisions in a 64-bit hash can happen by accident and lead to bugs in production that are hard to fix, especially if the system is specifically built around the hash fitting in a 64-bit BigInt. The longer ones can still be fooled by an adversary, which might let them substitute a user’s file without your systems discovering it. So, they’re about as useful as the even shorter checksums but usually a tad slower.
- Cryptographic hashes can be quite fast these days. Not as fast as the non-cryptographic ones, but plenty fast for this purpose. Even supposedly unsafe ones, like SHA-1, can be relatively safe if the rest of the system is sensibly designed, which is why the world hasn’t fallen over yet even though git still defaults to SHA-1.
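To put rough numbers on the accidental-collision point: assuming an idealised hash whose outputs are uniformly distributed (an assumption, since real hashes only approximate this), the standard birthday approximation p ≈ 1 − exp(−n(n−1)/2^(bits+1)) estimates the chance that at least two of n files share a hash value. The function name `collisionProbability` and the file counts are made up for this sketch.

```typescript
// Birthday-bound estimate of an accidental collision among `n` items
// hashed into a space of 2^bits values. Assumes a uniformly
// distributed hash, so treat the result as a ballpark figure.
function collisionProbability(n: number, bits: number): number {
  const pairs = (n * (n - 1)) / 2; // number of distinct pairs of items
  const space = Math.pow(2, bits); // number of possible hash values
  // expm1 keeps precision when the probability is very small.
  return -Math.expm1(-pairs / space);
}

// Hypothetical workloads: [number of files, hash width in bits].
const scenarios: Array<[number, number]> = [
  [77163, 32],
  [1e6, 64],
  [1e9, 64],
];

for (const [n, bits] of scenarios) {
  const p = collisionProbability(n, bits);
  console.log(`${n} files, ${bits}-bit hash: p = ${p.toExponential(2)}`);
}
```

Under that assumption, a 32-bit value hits a coin-flip chance of a collision at about 77 thousand files, while a 64-bit hash only reaches a few percent at around a billion files. That is rare, but not impossible for a large system, and none of it accounts for an adversary.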
And if you’re going to use a cryptographic hash, you should generally use whatever comes with your platform because that’s the likeliest to be hardware accelerated. For the web, that means SHA-256, SHA-384, or SHA-512. Fast algorithms like Blake3 are sometimes fast enough to make up for the lack of hardware acceleration, especially if implemented using native code or WASM – in my own benchmarks WASM implementations of Blake3 even outperformed the browser’s built-in cryptographic hashing for some use cases, but not by a large enough margin for it to matter in production.
So, to condense days, if not weeks, of research and experimentation – a process that was a lot of fun: store files with a 32 bit CRC integer (first line of defence) and a cryptographic hash (primary defence) and check one or the other at various points in your systems. You’re going to want both. I’m glossing over a few details here and there and criminally oversimplifying a bunch of things, but in general you’re going to want both.
(By the way, my favourite implementation of CRC has to be this one in JS, which takes the indecipherable CRC implementation from the generally excellent but opaquely implemented fflate and rewrites it into much clearer code.)
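One possible shape for the “store both” advice, as a hedged sketch rather than a definitive implementation: a deliberately simple bitwise CRC-32 (the common reflected polynomial 0xEDB88320, not the clearer implementation mentioned above) plus SHA-256 from the platform’s Web Crypto API. The names `FileChecks`, `computeChecks`, `quickCheck` and `fullCheck` are hypothetical, and the code assumes a runtime with a global `crypto.subtle` (browsers, or a recent Node.js).

```typescript
// Bitwise CRC-32. Simple rather than fast: table-driven or
// hardware-assisted versions are much quicker.
function crc32(bytes: Uint8Array): number {
  let crc = 0xffffffff;
  for (let i = 0; i < bytes.length; i++) {
    crc ^= bytes[i];
    for (let bit = 0; bit < 8; bit++) {
      crc = crc & 1 ? (crc >>> 1) ^ 0xedb88320 : crc >>> 1;
    }
  }
  return (crc ^ 0xffffffff) >>> 0; // unsigned 32-bit result
}

// SHA-256 via the built-in Web Crypto API, hex-encoded.
async function sha256Hex(bytes: Uint8Array): Promise<string> {
  const digest = await crypto.subtle.digest("SHA-256", bytes);
  return Array.from(new Uint8Array(digest), (b) =>
    b.toString(16).padStart(2, "0")
  ).join("");
}

// What gets stored alongside each file (hypothetical record shape).
interface FileChecks {
  crc32: number; // first line of defence: cheap corruption check
  sha256: string; // primary defence: cryptographic identity
}

async function computeChecks(bytes: Uint8Array): Promise<FileChecks> {
  return { crc32: crc32(bytes), sha256: await sha256Hex(bytes) };
}

// Cheap check, e.g. right after a transfer while a retry is still easy.
function quickCheck(bytes: Uint8Array, expected: FileChecks): boolean {
  return crc32(bytes) === expected.crc32;
}

// Slower check, e.g. before trusting or serving the file.
async function fullCheck(bytes: Uint8Array, expected: FileChecks): Promise<boolean> {
  return (await sha256Hex(bytes)) === expected.sha256;
}

const upload = new TextEncoder().encode("123456789");
computeChecks(upload).then((checks) => {
  console.log(checks.crc32.toString(16)); // cbf43926
  console.log(quickCheck(upload, checks)); // true
});
```

Which check runs where is a design decision this sketch doesn’t make for you; the split into a quick and a full check only mirrors the idea of checking one or the other at different points in the system.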
The reason I wrote the above, beyond the fact that I’m boring enough to find this stuff interesting, is that after 28? 29? years of doing this (that can’t be right) my head is full of shit like this, and my first reaction to getting asked on social media about something tangential to a topic I’ve researched used to be to go out and explain it in an extended info-dump. Like I did above.
The problem is, and I say this with love, that all too many of you are gaping assholes.
This problem, though not unique to it, is especially acute on Mastodon.
The issue is threefold:
- Too many of the responses are bad faith arguments from people intentionally or unintentionally being assholes. They engage in conversation specifically to be syphilitic dicks in your face.
- It’s often hard to tell the difference between somebody starting a bad faith argument and somebody just being, y’know, German or Nordic, where a more brusque manner of engagement is more normal. I can handle that. I get it in spades here in Iceland, so I have quite a bit of practice in spotting it and I think I generally can.
- Even when I know for a fact, or am reasonably certain, that the interaction is being made in good faith, Mastodon and other social networks surface replies to followers, which means that any reply whatsoever has a good chance of attracting assholes looking for an opportunity to posture themselves into the conversation.
It’s reached a point where my first reaction to any reply is a mild anxiety attack followed by me closing the tab (or app) and ignoring it for the rest of the day. Simple replies and casual conversation are fine – they’re the foundation that weak social ties are built on – but anything that resembles the start of a “debate” makes my face twitch.
So, apologies in advance if I don’t reply to your reply. It’s not you, it’s one of the many symptoms of my general social media burnout. I’m not asking you to stop sending whatever good faith questions you might have my way. I just probably won’t answer.
And extra apologies if I do actually reply, because then there’s a good chance you got a serialised info-dump like the above piled into the replies.