How is data encrypted in Oblivious?
Oblivious uses libsodium's secure key generation function to create a new encryption key for every file (data blob) uploaded through it. This Data Encryption Key is then encrypted with a Key Encryption Key derived from the user's Passphrase using a Password-Based Key Derivation Function. This is sometimes called "envelope encryption" because the key needed to decrypt each file is itself encrypted by a secondary key. We use ChaCha20+Poly1305 for data encryption because of its high re-key length (2^64 bytes, about 18.4 exabytes), and AES-256-CTR + SHA256 HMAC for smaller data elements. Once encrypted, the data is uploaded to the cloud storage provider (S3, Azure, etc.) and the encrypted Data Encryption Key is sent to Oblivious. The Key Encryption Key never leaves the client and is never stored outside of system memory.
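The envelope flow described above can be sketched with the Python standard library alone. This is a minimal illustration, not Oblivious's implementation: PBKDF2 is an assumed choice of key derivation function, `toy_stream_xor` is a stand-in for a real cipher (production code would use libsodium), and every function name here is hypothetical.

```python
import hashlib
import hmac
import secrets


def derive_kek(passphrase: str, salt: bytes) -> bytes:
    # PBKDF2-HMAC-SHA256 is an assumed choice of password-based KDF.
    # 64 bytes out: the first half encrypts, the second half authenticates.
    return hashlib.pbkdf2_hmac("sha256", passphrase.encode(), salt, 100_000, dklen=64)


def toy_stream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    # Stand-in stream cipher (HMAC-SHA256 in counter mode). Illustration only.
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hmac.new(key, nonce + counter.to_bytes(8, "big"), hashlib.sha256).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


def wrap_dek(kek: bytes, dek: bytes) -> bytes:
    # Encrypt the Data Encryption Key under the KEK, then append a MAC tag.
    enc_key, mac_key = kek[:32], kek[32:]
    nonce = secrets.token_bytes(16)
    body = nonce + toy_stream_xor(enc_key, nonce, dek)
    return body + hmac.new(mac_key, body, hashlib.sha256).digest()


def unwrap_dek(kek: bytes, wrapped: bytes) -> bytes:
    # Verify the tag first, then decrypt the Data Encryption Key.
    enc_key, mac_key = kek[:32], kek[32:]
    body, tag = wrapped[:-32], wrapped[-32:]
    expected = hmac.new(mac_key, body, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("wrapped key failed authentication")
    return toy_stream_xor(enc_key, body[:16], body[16:])


salt = secrets.token_bytes(16)
kek = derive_kek("correct horse battery staple", salt)  # stays on the client
dek = secrets.token_bytes(32)                            # one fresh key per file
wrapped = wrap_dek(kek, dek)                             # only this is sent out
```

Only `wrapped` (and the salt) would ever leave the client in this sketch; the KEK is re-derived from the passphrase whenever a file needs to be decrypted.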
How is metadata (filename, size, etc.) kept secure?
We subscribe to the philosophy that "metadata is data" and should be secured with the same scrutiny as any plaintext. The file names ('Keys' in S3) are replaced with random UUIDv4 strings so the cloud provider can't read them. All metadata collected (name, file size, feature tags, UUIDs) is encrypted with the same protocols as encryption keys (AES-256-CTR + SHA256 HMAC). Additionally, the file is compressed so the exact size of the plaintext isn't obvious to the cloud provider.
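As a rough sketch of the client-side preparation described above, the snippet below swaps the file name for a random UUIDv4 object key, compresses the contents, and bundles the metadata for encryption. The function name, the metadata field names, and the use of zlib and JSON are all assumptions made for illustration; the encryption step itself is omitted.

```python
import json
import uuid
import zlib


def prepare_upload(filename: str, data: bytes, tags: list[str]):
    # The cloud provider only ever sees this random object key.
    object_key = str(uuid.uuid4())
    # Compress before encrypting so the stored size differs from the plaintext size.
    body = zlib.compress(data)
    # Metadata stays on the client until it has been encrypted (not shown here).
    metadata = json.dumps(
        {"name": filename, "size": len(data), "tags": tags, "object_key": object_key}
    ).encode()
    return object_key, body, metadata


object_key, body, metadata = prepare_upload("q3-report.pdf", b"A" * 4096, ["finance"])
```

Both `body` and `metadata` would still need to be encrypted before leaving the client; compression has to happen first because well-encrypted data does not compress.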
How can you look up a file name if it's encrypted?
When a file is uploaded, the name (key) is encrypted using a shared IV and then SHA256 hashed. Using a predictable IV ensures we have a deterministic output. Deterministic encryption is bad for security, so this ciphertext is discarded and only the hash is saved. We call this a searchKey.
When it comes time to retrieve the file, the client encrypts and hashes the requested filename again, resulting in the same searchKey. The client can then request information about the file from Oblivious without revealing any information about the plaintext.
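The upload and lookup steps above might look like the following. This is a sketch under stated assumptions: `toy_encrypt` is a stand-in for the real cipher, `SHARED_IV`, `search_key`, and the in-memory `index` dictionary are hypothetical names, and the real service presumably stores the index server-side.

```python
import hashlib
import hmac
import uuid

SHARED_IV = bytes(16)  # hypothetical fixed IV, reused on purpose


def toy_encrypt(key: bytes, iv: bytes, plaintext: bytes) -> bytes:
    # Stand-in stream cipher (HMAC-SHA256 in counter mode). Illustration only.
    stream = bytearray()
    counter = 0
    while len(stream) < len(plaintext):
        stream += hmac.new(key, iv + counter.to_bytes(8, "big"), hashlib.sha256).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(plaintext, stream))


def search_key(key: bytes, filename: str) -> str:
    # Deterministic ciphertext is hashed and then discarded; only the hash is kept.
    ciphertext = toy_encrypt(key, SHARED_IV, filename.encode())
    return hashlib.sha256(ciphertext).hexdigest()


user_key = hashlib.sha256(b"demo key material").digest()
index = {}  # stands in for the lookup table held by the service

# Upload: store the random object key under the searchKey.
object_key = str(uuid.uuid4())
index[search_key(user_key, "q3-report.pdf")] = object_key

# Retrieval: recompute the same searchKey from the requested file name.
found = index.get(search_key(user_key, "q3-report.pdf"))
```

The index never sees the file name or the ciphertext, only a 64-character hex digest that the client can recompute on demand.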
How can Oblivious make separate encrypted solution spaces within the same data bucket?
The answer is in how Pseudo-Random Functions operate. Given a fixed starting value, called an initialization vector or IV, a Pseudo-Random Function will always output the same number. For example:
PRF(1) => 27
PRF(2) => 39
PRF(3) => 12
PRF(1) => 27
…
PRF(i) => N
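To make the determinism concrete, here is a small sketch that uses HMAC-SHA256 as an assumed PRF and reduces its output to one byte so the results are small numbers like those above. The key and the function name `prf` are hypothetical, and the actual values will differ from the illustrative ones in the text.

```python
import hashlib
import hmac


def prf(key: bytes, i: int) -> int:
    # HMAC-SHA256 as the PRF; keep one byte so the output is a small number.
    digest = hmac.new(key, i.to_bytes(8, "big"), hashlib.sha256).digest()
    return digest[0]


key = b"demo-prf-key"
outputs = [prf(key, i) for i in (1, 2, 3, 1)]
# The first and last entries are equal because both come from the input 1.
```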
We would call this PRF a ‘deterministic’ function: given the same input, the function will always produce the same output. In simple terms, stream ciphers used to encrypt data apply this same concept in conjunction with an encryption key. Most importantly, the output is similarly deterministic. For those functions we’d write the output not as a number, but as a set containing the initialization vector, the key, and the ciphertext.
E: [IV, key, plaintext] => [IV, key, ciphertext]
E: [1, ‘FF’, ‘hello world’] => [1, ‘FF’, ‘ae9rbar5b2a=’]
E: [1, ‘AA’, ‘hello world’] => [1, ‘AA’, ‘l3kjg3k3810=’] ← Changed key
E: [2, ‘FF’, ‘hello world’] => [2, ‘FF’, ‘sg8sgnsf8gn=’] ← Changed IV
E: [1, ‘FF’, ‘hello world’] => [1, ‘FF’, ‘ae9rbar5b2a=’] ← Same IV/key, matches original output
Reusing an IV like this makes for breakable encryption. The resulting ciphertext, if exposed, would be susceptible to chosen-plaintext attacks (CPA) and key-reuse attacks. But what about a non-reversible hash of those values? Let’s use SHA256 as an example:
SHA( E: [1, ‘FF’, ‘hello world’] ) => ‘22596363b3de40b06f981fb85d82312e8c0ed511’
SHA( E: [1, ‘AA’, ‘hello world’] ) => ‘7b387928db2e4081cbfa92fdc11ea5216b10cdf4’
SHA( E: [2, ‘FF’, ‘hello world’] ) => ‘8a3c1cf222d07862d42e1340599c98ad4ae960b8’
SHA( E: [1, ‘FF’, ‘hello world’] ) => ‘22596363b3de40b06f981fb85d82312e8c0ed511’
(The digests shown are illustrative; a full SHA256 digest is 64 hex characters long.)
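The same four cases can be reproduced in a few lines. This sketch assumes a stand-in stream cipher built from HMAC-SHA256 rather than the cipher Oblivious actually uses, and the names `encrypt` and `hashed` are hypothetical; the point is only which digests match and which differ.

```python
import hashlib
import hmac


def encrypt(iv: int, key: bytes, plaintext: bytes) -> bytes:
    # Stand-in deterministic stream cipher: same IV and key give the same output.
    stream = bytearray()
    counter = 0
    while len(stream) < len(plaintext):
        block_input = iv.to_bytes(8, "big") + counter.to_bytes(8, "big")
        stream += hmac.new(key, block_input, hashlib.sha256).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(plaintext, stream))


def hashed(iv: int, key: bytes, plaintext: bytes) -> str:
    # Only this digest is kept; the ciphertext itself is thrown away.
    return hashlib.sha256(encrypt(iv, key, plaintext)).hexdigest()


base = hashed(1, b"\xff", b"hello world")
changed_key = hashed(1, b"\xaa", b"hello world")
changed_iv = hashed(2, b"\xff", b"hello world")
repeat = hashed(1, b"\xff", b"hello world")  # same IV and key as base
```

Here `base` and `repeat` are identical, while changing either the key or the IV yields an unrelated digest.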
We now have a unique key that represents the plaintext, but can’t be reversed and leaks no information (other than that it exists). We can think of this as a set of initial values from the plaintext set that has a discrete, non-reversible, idempotent projection onto only one encrypted solution space.
In this diagram we can say that the encryption of P under IV1 and k1 always yields a value in the set [IV1, k1, •].
Any encryptions with these same starting conditions will end up in the same solution space. Even if two users encrypt the same data, they’ll be using a different IV and key so the resulting encrypted hash will be unique. This creates a natural isolation of these values. By building an index on these resulting hashes, additional information about file size, location, common words, etc. can be encrypted and given a unique address without revealing information about that plaintext. Oblivious supports multiple feature tags on these output values to support further granularity. Check out our API docs if you’re interested in taking advantage of these features.