From: Phillip S. <ps...@cf...> - 2010-04-27 15:44:14
On 4/26/2010 5:18 PM, Kenneth Porter wrote:
>> The archives show almost no activity on this list for some time, so I
>> hope this finds its way to someone.
>
> That's likely an indication that those of us who use dump find it very
> stable and have nothing to complain about. (At least, that we're willing to
> put effort into fixing. ;))
>
> I'll leave it to Stelian to explain the architecture.

Heheh. I find it quite nice as well, but have been investigating how it
works to see if I can make it faster on modern hardware, since the code is
SO old.

After parsing the code a lot more and walking through it a bit with gdb
(NOT easy when it uses multiple processes... gdb tends to get confused), it
seems that it has a master process and 3 slave processes. The master walks
the inodes to be dumped, adds a 1kb inode record to the queue, then adds
each of the 4kb disk blocks to the queue one at a time until the queue is
full. More specifically, it adds the block number to the queue rather than
actually reading the block. The size of the queue depends on the -b
parameter, which defaults to only 10. Special blocks, like the inode
header, are stored in another buffer, and the data blocks have their
numbers stored in the queue. When the queue is full it is sent to a slave
via a socketpair(), along with any queued special blocks. The slave walks
the request list, either reading the 1kb special block from the socket or
reading blocks from disk, until it has assembled the entire -b size record.
Then it optionally compresses the record and writes it to tape.

At first I was using -b 64 -z9, and when I tried -b 1024 -z9, both of my
cpu cores went to 100%, the dump took a long time to complete, and
temperatures got rather high. Now that I understand what is going on, this
seems to simply be a result of gzip actually being able to do some real
compression. See, with -z9, gzip tries to use a 900kb tree to hold recently
repeated phrases. That doesn't do much good when you are only compressing
64kb at a time. It ran rather fast since, with only 64k of input, the tree
never gets very full. Once I switched to -b 1024, gzip had some real data
to work with, filled the tree, and slowed down, though it also got better
compression in the process. I may just try using -z3 in the future, which
seems to use much less cpu and compress almost as well with -b 1024.

I noticed that the slave was only reading one 4kb block at a time, so I
experimented a bit last night with having dumpblock() try to combine the
new block with the last one in the request queue (if they are contiguous)
to try to generate larger block read requests; a rough sketch of the idea
is below. This seemed to give me about 5-10% better throughput when dumping
my root on an ssd to /dev/null.

I also realized that the slave was leaving the special records sitting in
the socket while reading disk blocks, so I changed that code to make two
passes over the request queue and only read from the socket in the first
pass, then go back and read the disk blocks. This seems to have gained
another 5-10%. Unfortunately, it does not combine nearly as many large
requests as it could, because if two different inodes happen to follow one
another on the disk, they still have their inode special record inserted
between them in the dump record, so the blocks cannot be read into the dump
record all at once.

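For reference, the coalescing change amounts to roughly this. It's only a
sketch: the real queue is measured in device blocks and 1kb tape records
with unit conversions I'm glossing over, and the names here are
approximations rather than the actual code.

/*
 * Sketch of the request-coalescing idea, not the actual dump code.  To
 * keep it simple, dblk and count are both in 1kb record units here, so
 * contiguity is just dblk + count; the real code does its own unit
 * conversions between disk and tape record sizes.
 */
#define QUEUE_MAX 1024          /* hypothetical queue depth */

struct req {
        long long dblk;         /* disk address (0 means a special record) */
        int count;              /* 1kb records covered by this request */
};

static struct req queue[QUEUE_MAX];
static int nreq;                /* entries currently in the queue */

/*
 * Queue a data block.  If the previous entry is a disk read that ends
 * exactly where this block starts, grow that entry instead of appending
 * a new one, so the slave can issue one larger read for the whole run.
 */
static void
queue_block(long long dblk, int count)
{
        if (nreq > 0) {
                struct req *last = &queue[nreq - 1];

                if (last->dblk != 0 && last->dblk + last->count == dblk) {
                        last->count += count;
                        return;
                }
        }
        if (nreq < QUEUE_MAX) {
                queue[nreq].dblk = dblk;
                queue[nreq].count = count;
                nreq++;
        }
}

The real change of course also has to keep dump's record accounting and
the special record buffer in step with the queue, which this sketch
ignores.
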
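The two-pass change on the slave side is conceptually just the following.
Again, only a sketch: read_special() and read_disk() are stand-ins for
whatever the real socket and disk read helpers are, and as above a
request's count is in 1kb tape record units.

#define TP_BSIZE 1024           /* dump's tape record size */

struct req {                    /* same layout as in the sketch above */
        long long dblk;         /* disk address (0 means a special record) */
        int count;              /* 1kb records covered by this request */
};

void read_special(char *buf, int len);                  /* from the socket */
void read_disk(long long dblk, char *buf, int len);     /* from the disk */

static void
fill_record(struct req *q, int nreq, char *record)
{
        int i, slot;

        /* Pass 1: drain every queued special record from the socket up
         * front instead of leaving them sitting there between disk reads. */
        for (i = 0, slot = 0; i < nreq; slot += q[i].count, i++)
                if (q[i].dblk == 0)
                        read_special(record + slot * TP_BSIZE,
                            q[i].count * TP_BSIZE);

        /* Pass 2: go back and do the disk reads, which can now also be
         * the larger coalesced requests from the previous change. */
        for (i = 0, slot = 0; i < nreq; slot += q[i].count, i++)
                if (q[i].dblk != 0)
                        read_disk(q[i].dblk, record + slot * TP_BSIZE,
                            q[i].count * TP_BSIZE);
}
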
I'm mulling over a few more radical changes to speed things up, as I feel
it could go quite a bit faster than it does now:

First, why are the slaves completely separate processes that have their
commands and any special blocks passed to them via socket? It would be
better if the slaves were just threads that could directly access the
request and special block buffers without the bother of copying them to
kernel socket buffers and back.

Second, why write the dump output as inode, data, inode, data, etc.
instead of inode, inode, inode, data, data, data? It could be quite a bit
faster to dump all of the inodes first, building block maps of them on the
way, THEN go back and dump the data blocks. This would make parsing the
dump file to find what files are in it much faster, since it would only
have to read the first part with all of the inodes in it and not bother
with the rest of the data, and dumping should be faster since it can walk
the entire inode table in one go, then all of the data blocks (combining
sequential ones into larger reads) in one go.

Finally, this could be combined with O_DIRECT and aio to get rid of the
time spent in the kernel copying data between the buffer cache and the
dump record buffer, and to stop the kernel readahead from picking up data
from the disk that isn't needed and creating memory pressure in the buffer
cache.

I have just now also been looking at the TS_ADDR record. It looks like it
has one byte per block in a file, set to 0 or 1 to indicate whether the
block actually exists and is present, or whether it was a hole. Wouldn't
it be better if this were instead an array of { blockno, count, present }?
In fact, present could probably be encoded in the high order bit of count,
so count would be 100 for 100 actual blocks, or -100 for a hole of 100
missing blocks. That way you would not need an additional TS_ADDR record
for every 2mb of a file that is 2mb or larger, where half of each record
is the redundant information every special record contains about the
entire volume, and the other half is just filled with 0x01. A sketch of
the encoding I have in mind follows below.
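
To make that last idea concrete, something like this is what I mean. It is
purely a sketch of the proposed extent list, not anything the current
format does, and the field widths are just a guess.

#include <stdint.h>
#include <stdio.h>

/*
 * Proposed extent-style replacement for the per-block flag bytes in
 * TS_ADDR: a positive count means that many blocks are present in the
 * dump, a negative count means a hole of that many missing blocks (the
 * "present" flag lives in the sign/high bit of count).
 */
struct extent {
        uint32_t blockno;       /* logical block within the file */
        int32_t count;          /* >0: blocks present, <0: hole of -count blocks */
};

/* Example: 100 real blocks, a 100 block hole, then 50 more real blocks
 * needs only three entries instead of 250 one-byte flags. */
static const struct extent example[] = {
        {   0,  100 },
        { 100, -100 },
        { 200,   50 },
};

int
main(void)
{
        size_t i;

        for (i = 0; i < sizeof(example) / sizeof(example[0]); i++) {
                const struct extent *e = &example[i];

                if (e->count > 0)
                        printf("blocks %u..%u present in the dump\n",
                            e->blockno, e->blockno + e->count - 1);
                else
                        printf("blocks %u..%u are a hole\n",
                            e->blockno, e->blockno + (uint32_t)(-e->count) - 1);
        }
        return 0;
}

Restore would then walk a short extent list instead of 0x01/0x00 bytes, and
a sparse or mostly-contiguous file would need only a handful of entries no
matter how large it is.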