Thursday, July 22, 2010

Storing data on the filesystem

Relational databases are passé. NoSQL is where it's at.


Sometimes both are overkill, and we're not doing anyone a favor by over-architecting a solution. Providing a bigger solution than is needed wastes time and money, increases complexity and maintenance costs, and more importantly, doesn't provide any extra value to our users.

The de facto approach for application data storage is to use a dedicated database product, for (mostly) good reasons. However, since we're all well aware of the benefits of using a database, let's take some time to explore the filesystem as a candidate for storing your data.

File writes are atomic
To be more precise, file rename operations are atomic on POSIX systems, according to the Python documentation. Sorry, Windows users, you're out of luck.
os.rename(src, dst)

Rename the file or directory src to dst... If successful, the renaming will be an atomic operation (this is a POSIX requirement). On Windows, if dst already exists, OSError will be raised even if it is a file; there may be no way to implement an atomic rename when dst names an existing file.

To perform atomic file writes, you must first write your changes to a temporary file, then rename the temporary file to its final destination. It sounds harder than it really is. The code would look something like this:

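import os

# Write to a temporary file first; leaving the with block closes and flushes it.
with open('temp.txt', 'w') as f:
    f.write('do the monkey')
# Atomically move the finished file into place (POSIX).
os.rename('temp.txt', 'final.txt')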
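
If you want something a little sturdier, here is a minimal sketch of the same write-then-rename idea wrapped in a helper. The name atomic_write is my own invention, and it assumes Python 3.3 or later for os.replace, which also overwrites an existing destination on Windows. It additionally calls os.fsync before renaming, an extra step beyond the snippet above. Treat it as one possible shape, not the definitive way to do it.

```python
import os
import tempfile


def atomic_write(path, text):
    """Write text to path so readers see either the old or the new content.

    Hypothetical helper for illustration; not part of any library.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the destination directory so the final
    # rename never has to cross a filesystem boundary.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(text)
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before renaming
        os.replace(tmp_path, path)  # replaces an existing destination file
    except BaseException:
        # Remove the temporary file if anything went wrong before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


if __name__ == "__main__":
    atomic_write("final.txt", "do the monkey")
```

The temporary file lives next to the destination on purpose: a rename across two different filesystems is not a simple rename and can fail outright.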

The filesystem is reliable
You'd better hope so anyway: everything ultimately lives on the filesystem, including, yes, that fancy and expensive relational database.

Storing data in files allows you to use standard filesystem-based backup solutions. In addition, many filesystems have snapshot features built in.

Instant API using a web server, or even WebDAV
I suggest storing your data in a document-oriented fashion. That is, store your data using a single file per entity. If you're storing data about 6 different users, then that should be 6 different files. This will greatly simplify things, and allow you to expose this data via an HTTP API.
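
To make the one-file-per-entity idea concrete, here is a rough sketch. The JSON format, the save_user and load_user names, and the DATA_DIR location are all assumptions of mine for illustration; any serialization format and directory layout would do.

```python
import json
import os
import tempfile

# Hypothetical location for the data directory.
DATA_DIR = os.path.join(tempfile.gettempdir(), "users")


def save_user(user_id, record, data_dir=DATA_DIR):
    """Store one user as <data_dir>/<user_id>.json and return the path."""
    os.makedirs(data_dir, exist_ok=True)
    path = os.path.join(data_dir, f"{user_id}.json")
    tmp_path = path + ".tmp"
    # Write to a temporary file, then rename it into place.
    with open(tmp_path, "w") as f:
        json.dump(record, f)
    os.replace(tmp_path, path)
    return path


def load_user(user_id, data_dir=DATA_DIR):
    """Read a single user back from its own file."""
    with open(os.path.join(data_dir, f"{user_id}.json")) as f:
        return json.load(f)


if __name__ == "__main__":
    save_user("alice", {"name": "Alice", "email": "alice@example.com"})
    print(load_user("alice"))
```

Six users means six files in the directory, each of which can be read, replaced, backed up, or served on its own.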

If you follow this advice, you can simply point your favorite web server at the directory holding your data and you immediately have an API. Requesting data from this API couldn't be simpler: a plain HTTP GET for an entity's file returns its contents.
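
As a rough, self-contained illustration, the sketch below serves a directory with Python's built-in http.server and fetches one entity with urllib. The alice.json file name, its contents, and the throwaway temporary directory are made up for the example, and a real deployment would more likely sit behind a production web server. It assumes Python 3.7 or later.

```python
import functools
import http.server
import json
import os
import tempfile
import threading
import urllib.request

# Hypothetical data directory holding one JSON file per user.
data_dir = tempfile.mkdtemp()
with open(os.path.join(data_dir, "alice.json"), "w") as f:
    json.dump({"name": "Alice"}, f)

# Serve the directory read-only; port 0 asks the OS for any free port.
handler = functools.partial(
    http.server.SimpleHTTPRequestHandler, directory=data_dir
)
server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Fetching an entity is a plain GET for its file. The empty ProxyHandler
# keeps this local request from being routed through a configured proxy.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
url = f"http://127.0.0.1:{server.server_port}/alice.json"
with opener.open(url) as response:
    user = json.load(response)

server.shutdown()
server.server_close()
print(user)
```

The server here knows nothing about users; it only maps URL paths to files, which is the whole point.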

Enable more features on your web server, such as PUT or even WebDAV, and you now have a read+write API.

The filesystem scales (probably)
I currently have 454,823 files on my computer consuming ~140GB. I don't know if there is a practical limit to filesystem storage, but I'm willing to bet that you and I aren't going to reach it.

Files work with the network
See above note about APIs. Or, see: NFS

Everybody's doing it
Subversion does it. Oracle does it.


The relational databases and NoSQL data stores will still be there, waiting for you if you need them in the future. My advice? Ignore your DBA. Drop acid and think about data.
