edoneel daily - SBCL stuff


distributed file systems stuff

This is the page for my distributed file systems stuff.


I find the problem of distributed file systems interesting. What do I mean by that? A file system that does not have any central storage, say, a central metadata server, and that expands in capacity and speed as one adds new systems with new disks.

This would be very useful in the cluster world, where you have a lot of identical systems. That said, most offices start to look like this too, because you have a bunch of systems with, in most cases, a lot of spare disk space. For this you might want a separate disk partition on each system to keep the space reserved.

I'd like it to be simple to add systems to, and fairly robust.

So, that being the goal, how might we implement this?

Each file is broken up into segments, for now 64k each. The file name and the segment number are hashed together (say, with SHA-1), and the hash is looked up in a table whose entries are machine numbers or machine/disk tuples.
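As a rough illustration of the lookup just described, here is a minimal Python sketch. The function names, the NUL separator between name and segment number, and the idea of reducing the hash modulo the table length are my assumptions for the example, not a settled design.

```python
import hashlib

SEGMENT_SIZE = 64 * 1024  # 64k segments, as described above


def segment_number_for_offset(offset: int) -> int:
    """Which segment a byte offset falls into."""
    return offset // SEGMENT_SIZE


def segment_key(file_name: str, segment_number: int) -> bytes:
    """SHA-1 over the file name and segment number together.

    The NUL separator is an assumption; it just keeps ("a1", 2)
    and ("a", 12) from hashing the same bytes.
    """
    data = file_name.encode("utf-8") + b"\x00" + str(segment_number).encode("ascii")
    return hashlib.sha1(data).digest()


def lookup_machine(file_name: str, segment_number: int, machine_table: list):
    """Map the hash onto an entry of the machine table."""
    key = segment_key(file_name, segment_number)
    index = int.from_bytes(key, "big") % len(machine_table)
    return machine_table[index]
```

With a table such as `["m0", "m1", "m2", "m3"]`, calling `lookup_machine("/home/x/notes.txt", 0, table)` always returns the same entry for the same name and segment, which is all the scheme needs.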

Then you send a request to this system with the file name and segment number, and the machine returns that 64k segment to you. Or, if you're writing, you make the same request but also send the segment to the remote system.

Therefore you have a table of systems. Each segment is stored on some number of systems, called repcnt; let's say 3. Each segment therefore lives on disks n, n+1, ..., n+repcnt-1. This gives you some redundancy.
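A small sketch of how the replica slots might be picked: repcnt consecutive table entries starting at n. Wrapping around at the end of the table is my assumption; the text above does not say what happens when n is near the last entry.

```python
def replica_slots(primary_index: int, repcnt: int, table_size: int) -> list:
    """Return repcnt consecutive table indexes starting at primary_index.

    Wrap-around at the end of the table is an assumption of this sketch.
    """
    if repcnt > table_size:
        raise ValueError("repcnt cannot exceed the number of table entries")
    return [(primary_index + i) % table_size for i in range(repcnt)]
```

For example, with 5 entries and repcnt of 3, a segment whose primary slot is 4 would land on slots 4, 0 and 1.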

This table must live on each node and must be kept in sync. I haven't looked, but I suspect that the BGP folks have basically solved this; otherwise, updates to the table will have to be dealt with using a two-phase commit.

Machines actually just store each 64k segment under the name of its hash. These need to be stored in some nice directory tree so that no single directory gets too big. You also store a CRC so you know whether the segment was written correctly, so what ends up on disk is the 64k of data plus the CRC.
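One way the on-disk layout could look, as a sketch only. The two-level fan-out on the first hex characters of the hash, the 4-byte CRC-32 trailer, and the function names are all assumptions picked for the example.

```python
import zlib
from pathlib import Path

CRC_BYTES = 4  # CRC-32 stored as a 4-byte trailer (assumed layout)


def segment_path(root: str, key_hex: str) -> Path:
    """Spread segments over root/ab/cd/abcd... so no directory gets huge."""
    return Path(root) / key_hex[:2] / key_hex[2:4] / key_hex


def pack_segment(data: bytes) -> bytes:
    """Append a CRC-32 of the data so corruption can be detected later."""
    return data + zlib.crc32(data).to_bytes(CRC_BYTES, "big")


def unpack_segment(blob: bytes) -> bytes:
    """Check the trailing CRC-32 and return the segment data."""
    if len(blob) < CRC_BYTES:
        raise ValueError("stored segment is too short to hold a CRC")
    data, stored = blob[:-CRC_BYTES], int.from_bytes(blob[-CRC_BYTES:], "big")
    if zlib.crc32(data) != stored:
        raise ValueError("CRC mismatch: segment is corrupt")
    return data
```

With two hex characters per level, each level has at most 256 subdirectories, which keeps any one directory small even with millions of segments on a node.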

So, to recap: to read a file you take the name and offset (as a segment number), hash them together, and look up the hash, mod the table size, in the machine table. Then you make a request to the machine that you looked up and you get your 64k chunk.

To write you do the same, but you send the 64k chunk to the remote system. I suspect that the writes have to be done with a two-phase commit, or that directories have to be dealt with specially, to make sure that you don't get two files with the same name. I'm guessing that the maximum directory size should be the segment size; otherwise we might end up with two files of the same name in two different directory segments.
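The read and write paths recapped above can be tried out with a toy in-memory model. Everything here is hypothetical: the Node class stands in for a remote machine, REPCNT is fixed at 3, replicas wrap around the table, and writes simply go to every replica in turn with no two-phase commit at all.

```python
import hashlib

REPCNT = 3  # copies of each segment (assumed value)


class Node:
    """Hypothetical in-memory stand-in for a remote machine."""

    def __init__(self):
        self.segments = {}
        self.alive = True

    def put(self, key, data):
        if not self.alive:
            raise ConnectionError("node is down")
        self.segments[key] = data

    def get(self, key):
        if not self.alive:
            raise ConnectionError("node is down")
        return self.segments[key]


def segment_key(name: str, segno: int) -> str:
    raw = name.encode("utf-8") + b"\x00" + str(segno).encode("ascii")
    return hashlib.sha1(raw).hexdigest()


def replicas(key: str, table: list) -> list:
    """The REPCNT nodes holding this key, starting at hash mod table size."""
    start = int(key, 16) % len(table)
    return [table[(start + i) % len(table)] for i in range(REPCNT)]


def write_segment(table, name, segno, data):
    key = segment_key(name, segno)
    for node in replicas(key, table):
        node.put(key, data)  # no commit protocol in this sketch


def read_segment(table, name, segno):
    key = segment_key(name, segno)
    for node in replicas(key, table):
        try:
            return node.get(key)
        except (ConnectionError, KeyError):
            continue  # try the next replica
    raise OSError("no replica could supply the segment")
```

Marking one of the three replica nodes as down (`node.alive = False`) still lets `read_segment` succeed from the next one, which is the whole point of repcnt being greater than 1.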

Remember that directories are nothing more than lists of files.

What you get then is that your reads and writes are spread over a list of systems. This distributes the load, and, if repcnt is greater than 1 (it really, really should be), you have more resistance to losing data when a disk breaks, etc.

So, what could possibly go wrong?

Background stuff

A number of background tasks need to run to keep the file system in good shape.

Bruce O'Neel

Last modified: Tue Dec 01 00:00:00 MET 2008