distributed file systems stuff
This is the page for my distributed file systems stuff.
I find the problem of distributed file systems interesting. What would this be? This would be a file system that does not have any central storage, say, a central meta data server, and would expand in capacity and speed as one adds new systems with new disks.
This would be very useful in the cluster world where you have a lot of identical systems. That said, most offices start to look like this because you do have a bunch of systems, with, in most cases a lot of spare disk space. For this you might want to have a separate disk partition to keep space reserved.
I'd like it to be simple to add systems to, and fairly robust.
So, that being the goal, how might we implement this?
Each file is broken up into segments, for now 64k. The file name and the segment number are hashed together (say, with sha1), and this is looked up in a table who's index is a machine number or machine/disk tuple.
Then you send a request to this system with this file name and segment number and the machine returns that 64k segment to you. Or, if you're writing then you do the reverse but also send the segment to the remote system.
Therefore you have a table of systems. Each segment is stored on some number of systems, let's say 3, called repcnt. Therefore each segment lives on disk n, n+1 ... n+repcnt. This gives you some sort of redundancy.
This table must live on each node, and, must be kept in sync. I haven't looked, but, suspect that the BGP folks have basically solved this, else, the table will have to dealt with a 2 phase commit
Machines actually just store each 64k segment at the name of its hash. This needs to be stored in some nice directory tree so that each directory doesn't get too big. You also store a crc so you know if you wrote it correctly, so you store the 64k+crc size data.
So, to recap, to read a file you take the name and offset, hash it into togther and look up mod table size in the machine table. Then you make a request from the machine that you looked up and you get your 64k chunk.
To write you do the same, but you send the 64k chunk to the remote system. I suspect that the writes have to be done with a 2 phase commit, or, directories have to be dealt with specially to make sure that you don't get two files with the same name. I'm guessing that the maximum directory size should be the segment size, otherwise we might end up with two files of the same name in two different directory segments.
Remember that directories are nothing more than lists of files
What you get then is that your reads and writes are spred over a list of systems. This distributes the load, and, if repcnt is greater than 1 (it really really should be) then you have more resistance to losing data when a disk breaks, etc.
So, what could possible go wrong?
- You lose file space over the total sum of disks that you have all networked together. Basically the total disk space is the smallest disk, times the total number of disks, divided by repcount. Therefore adding one small disk to a cluster with lots of big disks makes the whole thing smaller. Don't do that.
- Disks get read errors. In this case the system you requested the data from doesn't have the data. Behind your back that system will try to refresh the data from the next system in the table up to repcnt to get the data. If repcnt systems can't return the data than it's lost. It's possible that all the systems could be queried to find the data which would cover the case where new systems are added.
- Disks can die. This is basically the read error problem, but with the problem that you don't have any data at the start. I talk about garbage collection and scrubbing below. Note that if you replace disk N, then you tell machine N-1 and machine N+1 to start a scrub. Then, as they scrub, machine N will get refreshed.
- Machines can die. Same as disks die, but, no response. After a short time out you go to the next machine.
- Reads should be as fast as NFS, without the one machine overload problem. Writes, otoh, should be repcnt slower than nfs, but also without the nfs server overload problem.
- Because of the scrubbing and gc (see below) there is a period of time after a disk is replaced before it's back in sync and you have repcnt copies again
- Adding a new system is potentially slow. When you add a new system a bunch of blocks are moved around, over time.
- Removing a system is a bad idea. It should work, but, will take a long time to move all the blocks to their final places. Don't do this.
Background stuff
There need to be a number of background things to keep the file system in good shape.
- Each system should walk over the segments that they have stored, copy the hash, and see if they should be stored. If they should be stored then read do a stat query to the other systems that should also have this data stored to make sure they are refreshed as well. If they should not be on this system they are moved to the correct system and then deleted. This could happen when a system is down for a while and comes back up or when new systems are added. Note that one should just check for the existance of the file, not actually open it so that the atime is not updated. Or, if we want to check the CRC, a good idea, we should restore the atime and only let atime be updated by normal access and gc runs.
- A set of systems should walk the whole file system calling stat calls on each file segment. This should force all the systems, over time, to refresh. In addition this can do garbage collection by remembering the start time, and, when done telling the systems to remove all files with an access time older than the start time minus some delta (say a day or so) just to be sure. Before the files are actually removed the systems should check to see that the belong on it (ie, hash shows that this is the correct system), otherwise the segment should just be moved to the correct system. It's possible that we should store the GC time rather than the atime since the atime will be modified with the scrubbing above.
- Each system keeps a list of recently deleted segments. This allows the whole system to recover from one system/disk being down for a while and not have blocks live forever.
Last modified: Tue Dec 01 00:00:00 MET 2008