Wednesday, January 16, 2008

Getting Along without a Database?

A thread on news.YC has got me thinking about some non-traditional web architectures. Iamelgringo asked pg if he was still handling persistence without a database and how it worked. The conversation then turned to transactional memory and cooperative multi-tasking (particularly Communicating Sequential Processes). This takes us toward an event-driven model and away from the traditional one-process-or-one-thread-per-request model.

How would this work?

I admit I'm still a novice on web architectures, having spent most of my time on mobile and now Windows clients, but I have a few ideas.

The paper Ralph linked to was my first exposure to CSP, and it's rather interesting. Basically, you organize your program into a pipeline, with each stage running in a separate thread so that the stages can operate independently. This is the same technique a processor core uses to exploit parallelism without actually executing anything in parallel: while one stage works on the current item, the stage behind it is already working on the next. As long as you can keep all of the stages busy, you can increase your throughput without having to find a way to speed up the task itself.
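
To make that concrete, here's a minimal sketch of the idea in Go, whose goroutines and channels are modeled directly on CSP (the two stages and the work they do are invented just for illustration):

    package main

    import "fmt"

    func main() {
        // Each stage runs in its own goroutine (a CSP process) and
        // talks to its neighbor only through a channel.
        in := make(chan int)
        doubled := make(chan int)
        done := make(chan struct{})

        // Stage 1: double each value.
        go func() {
            for v := range in {
                doubled <- v * 2
            }
            close(doubled)
        }()

        // Stage 2: print each result. While this stage prints value
        // n, stage 1 is already doubling value n+1 -- the stages
        // overlap just as instructions do in a processor pipeline.
        go func() {
            for v := range doubled {
                fmt.Println(v)
            }
            close(done)
        }()

        for i := 1; i <= 5; i++ {
            in <- i
        }
        close(in)
        <-done
    }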

What does this do for the web? Well, consider a request broken down into stages such as:
  1. Receiving the request
  2. Authorizing the user
  3. Looking up the data
  4. Rendering the template
  5. Returning the response
The request can then be modeled as a five-stage pipeline, with each stage running as its own CSP process. You could of course just have the same number of threads processing whole requests in parallel, and this would work fine if you're using a traditional database, because the database handles concurrent access to the data for you. However, if you're using something other than a traditional database (which was the whole point of the thread), you'll have race conditions that can corrupt your data. The CSP model solves this problem because the data is only ever accessed in stage 3, that is, by a single process.
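
Here's a rough sketch of that pipeline in Go; the request fields, stage bodies, and store contents are all invented for illustration. The important property is that the store is a plain, unsynchronized map touched only by the goroutine running stage 3:

    package main

    import "fmt"

    // A request flows through the pipeline, accumulating results.
    type request struct {
        user, key  string
        authorized bool
        data       string
        response   string
    }

    func main() {
        received := make(chan *request)
        authorized := make(chan *request)
        loaded := make(chan *request)
        rendered := make(chan *request)

        // Stage 2: authorize the user.
        go func() {
            for r := range received {
                r.authorized = r.user != ""
                authorized <- r
            }
        }()

        // Stage 3: look up the data. Because this goroutine is the
        // only one that ever touches the store, there are no race
        // conditions and no locks.
        go func() {
            store := map[string]string{"greeting": "hello"}
            for r := range authorized {
                if r.authorized {
                    r.data = store[r.key]
                }
                loaded <- r
            }
        }()

        // Stage 4: render the template.
        go func() {
            for r := range loaded {
                r.response = fmt.Sprintf("<p>%s, %s</p>", r.data, r.user)
                rendered <- r
            }
        }()

        // Stages 1 and 5 (receive and respond) are collapsed into
        // main here to keep the sketch short.
        received <- &request{user: "alice", key: "greeting"}
        fmt.Println((<-rendered).response)
    }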

The question I'm still grappling with is: what if you want to scale beyond five concurrent requests? Assuming you can't break your pipeline into more stages, I can see two options: partitioning the application so that different services are handled by different pipelines, or replicating the pipeline. The second option eliminates the benefit of CSP by creating multiple processes that access the data in stage 3 concurrently. The first option, on the other hand, can only be applied to completely independent services, so it doesn't really help us scale the original service.
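
For the record, the partitioning option would look something like this sketch in Go (the service names and request type are hypothetical): a front-end router hands each request to the pipeline that owns its service, but a hot single service still funnels into one stage-3 data owner.

    package main

    // Hypothetical request type shared by all pipelines.
    type request struct{ service, payload string }

    func main() {
        // Each independent service gets its own pipeline; these
        // channels are the pipelines' stage-1 inputs (buffered so
        // the sketch doesn't block without consumers running).
        accounts := make(chan *request, 8)
        search := make(chan *request, 8)

        router := map[string]chan<- *request{
            "accounts": accounts,
            "search":   search,
        }

        // Route a request to the pipeline that owns its service.
        r := &request{service: "search", payload: "csp"}
        router[r.service] <- r
    }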

So while CSP looks like it will help spread an application across multiple cores on a single machine, it seems we still need concurrent access to data if we're going to scale a web application horizontally.

We know that simply storing objects in a flat file can work for a web application, and we know that a versioning mechanism can allow concurrent access to data. Perhaps we could serialize each object to disk on a file system that can be safely shared across machines? We would, of course, have to organize the files in a way that doesn't overwhelm the file system with too many files in one directory, or with files that are too large.
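
As a sketch of what that might look like (the layout and naming scheme here are my own assumptions, loosely modeled on how Git stores objects): serialize each object to JSON, hash its id, and fan the files out into subdirectories by hash prefix so that no single directory fills up. Writing to a temporary file and renaming keeps readers on the shared file system from ever seeing a half-written object.

    package main

    import (
        "crypto/sha1"
        "encoding/json"
        "fmt"
        "os"
        "path/filepath"
    )

    // save serializes obj to JSON and writes it beneath root, using
    // the first two hex characters of a SHA-1 of the id as a
    // subdirectory so no directory accumulates too many files.
    func save(root, id string, obj interface{}) error {
        data, err := json.Marshal(obj)
        if err != nil {
            return err
        }
        sum := fmt.Sprintf("%x", sha1.Sum([]byte(id)))
        dir := filepath.Join(root, sum[:2])
        if err := os.MkdirAll(dir, 0o755); err != nil {
            return err
        }
        // Write-then-rename so a reader never sees a partial file.
        tmp := filepath.Join(dir, sum+".tmp")
        if err := os.WriteFile(tmp, data, 0o644); err != nil {
            return err
        }
        return os.Rename(tmp, filepath.Join(dir, sum+".json"))
    }

    func main() {
        type user struct{ Name string }
        if err := save("objects", "user:42", user{Name: "alice"}); err != nil {
            fmt.Println(err)
        }
    }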