Thread: [Openpacket-devel] Organize traces on file system
From: Mark M. <mas...@gm...> - 2006-08-03 13:51:01
How will you organize the traces on the file system? If you're getting thousands of traces uploaded, will you need a file structure to organize the traces?
From: Jacob H. <ha...@gm...> - 2006-08-03 14:30:55
Well, let's start out with some things we would like to have in a storage structure:

- easily scalable
- fast access
- compression?
- Do we want an API to directly access the files?
- Do we want the structure to be humanly accessible?

The first two are easy. We can let the file system and hardware handle all that. We can develop OpenPacket so it doesn't care whether we use NTFS or ZFS or whatever.

We could let the file system compress the data, or we could compress the data ourselves (gzip etc.). It would probably be best to have the file system handle the compression. If we compress the data ourselves, the CPU cost may be too great for the number of traces we will be storing. Remember, storage is cheaper than CPU time! :-)

We could easily have an API to access the files. No matter what framework we use, it could easily interface with the file system for retrieval.

Do we want the structure to be humanly accessible? I would say no. That would force everyone to use the API or have direct access to the DB.

The structuring could go as follows. Once a capture is uploaded, a hash/checksum is taken of the file and stored in the database. From there the capture is renamed to its checksum and stored on our servers. All metadata and information about the capture goes into the DB on upload. To find a capture, use the hash from the DB to access the file system and grab the file.

This ensures that:
1. capture data can be checked against its original checksum to determine whether anything has changed or been damaged;
2. access can be fast;
3. if two identical captures are uploaded, they will reference the same file (unless we find some collision!! haha) and won't take up extra space.

This is just a quick summary of ideas that popped into my head.

Jake

On 8/3/06, Mark Mason <mas...@gm...> wrote:
> How will you organize the traces on the file system?
> [snip]
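For illustration only, here is a minimal sketch of that upload flow in Python. SHA-1 stands in for the checksum, and the storage root and table layout are made-up placeholders, not a settled design:

    import hashlib
    import os
    import shutil

    STORE_ROOT = "/var/openpacket/traces"  # hypothetical storage root

    def ingest_capture(upload_path, db):
        # Hash the uploaded capture; the hex digest becomes its canonical name.
        sha1 = hashlib.sha1()
        with open(upload_path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                sha1.update(chunk)
        digest = sha1.hexdigest()

        dest = os.path.join(STORE_ROOT, digest)
        if os.path.exists(dest):
            # Identical capture already stored: drop the duplicate bytes;
            # the new upload simply references the existing file.
            os.remove(upload_path)
        else:
            shutil.move(upload_path, dest)

        # Metadata is keyed by the checksum (illustrative SQLite schema).
        db.execute("INSERT OR IGNORE INTO captures (sha1) VALUES (?)", (digest,))
        return digest

Retrieval is then the reverse lookup: fetch the hash from the DB, open the file named after it, and optionally re-hash on the way out to verify integrity.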
From: Tim F. <fu...@cc...> - 2006-08-03 18:24:53
Hi Jake,

I was wondering if you could clarify what you mean about the file system handling the compression. Did you have something specific in mind? I'm thinking that unless the fs is remote, any compression it does would incur about the same number of CPU cycles as doing it inline. It might work better on a multiprocessor system, but it wouldn't be too hard to have openpacket.org handle compression in a separate process.

One of the other ideas we've been kicking around a bit is having the system based around BitTorrent, in which case openpacket.org wouldn't have to directly store many traces, probably just the ones awaiting moderator approval. It would mainly have to store the torrent files, so volume of data wouldn't be as much of an issue as number of files. Honestly, we probably wouldn't have to store them as files; we could just store the contents of the torrent file in a DB and only dump it to a file long enough to send it to a user, and maybe keep a cache of frequently requested torrents. Anyway, those are optimizations that can be done transparently when needed. The other thing is that BitTorrent uses hash values as an integral part of identifying a particular file, so integrity checking would be built in.

-Tim

On 8/3/06, Jacob Ham <ha...@gm...> wrote:
> [snip]

--
Tim Furlong
tim...@gm...
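As a rough illustration of the torrents-in-the-DB idea, here is a sketch of a WSGI handler in Python with sqlite3. The URL scheme, database path, and table layout are all invented for the example:

    import sqlite3

    DB_PATH = "torrents.db"  # hypothetical; contents stored as a BLOB

    def torrent_app(environ, start_response):
        # e.g. GET /torrent/<info_hash> -- serve the stored .torrent bytes
        # straight from the database, no file ever written to disk.
        info_hash = environ["PATH_INFO"].rsplit("/", 1)[-1]
        conn = sqlite3.connect(DB_PATH)
        row = conn.execute("SELECT contents FROM torrents WHERE info_hash = ?",
                           (info_hash,)).fetchone()
        conn.close()
        if row is None:
            start_response("404 Not Found", [("Content-Type", "text/plain")])
            return [b"unknown torrent\n"]
        body = row[0]
        start_response("200 OK",
                       [("Content-Type", "application/x-bittorrent"),
                        ("Content-Length", str(len(body)))])
        return [body]

Any WSGI server (e.g. wsgiref.simple_server) could host this; a cache of popular torrents would just be another layer in front of the same query.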
From: Richard B. <tao...@gm...> - 2006-08-03 18:29:07
On 8/3/06, Tim Furlong <fu...@cc...> wrote:
> [snip]

Hi all,

I think we should consider BitTorrent for distributing the entire trace collection, or perhaps large portions of it. OpenPacket.org should always be the "seed of last resort" for these files -- we shouldn't hope that others can seed it.

I expect the vast majority of traces to be small, so we should serve those without requiring BitTorrent. I personally wouldn't want to launch a BT client every time I want a small exploit trace or what have you.

Thank you,

Richard
From: Jacob H. <ha...@gm...> - 2006-08-03 19:08:45
Hi All,

On 8/3/06, Tim Furlong <fu...@cc...> wrote:
> I was wondering if you could clarify what you mean about the file system
> handling the compression. Did you have something specific in mind?
> [snip]

Indeed, I had ZFS in mind for a file system. It is extremely expandable, fast, provides data integrity, and has low-CPU-cost compression. You can read more about it here if interested: http://www.opensolaris.org/os/community/zfs/ . I think it really doesn't matter now, but if we grow to hundreds of gigs of data, it will definitely be something to think about.

Another option would be to gather metadata once uploaded, gzip the file once, and always serve it compressed. The only problem with this is if we ever decide to reference captures inline (on the site, instead of having to download and open the capture in Wireshark). Say someone wants to describe a capture in detail: he could reference lines 10-29, describe them, then move to 30-45 (assuming we had a system like this in place).

I don't know what kind of systems we have here for use, revenue model (advertising, donations, etc.?), or hosting issues. I assume Richard is working on this. If we need to save bandwidth and space, we could do so in the design.

> It would mainly have to store the torrent files, so volume of data
> wouldn't be as much of an issue as number of files. Honestly, we
> probably wouldn't have to store them as files, we could probably just
> store the contents of the torrent file in a DB and only dump it to file
> long enough to send it to a user, and maybe have a cache of
> frequently-requested torrents.

If we cache the most requested ones, it will be faster, but then we are back where we started... How would we store the cached files?! What if there are thousands of popular files we cache?

Jake
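The gzip-once step itself is tiny; a sketch in Python, with placeholder paths:

    import gzip
    import shutil

    def compress_capture(raw_path, stored_path):
        # Compress exactly once at upload time; afterwards the .gz file is
        # what gets served (with Content-Encoding: gzip), so the server
        # never pays the compression cost again.
        with open(raw_path, "rb") as src:
            with gzip.open(stored_path + ".gz", "wb") as dst:
                shutil.copyfileobj(src, dst)

The inline-reference feature would then be the one place the server has to decompress, which is an argument for extracting any per-packet summaries into the DB at upload time.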
From: Tim F. <fu...@cc...> - 2006-08-03 20:54:03
On 8/3/06, Jacob Ham <ha...@gm...> wrote:
> Indeed, I had ZFS in mind for a file system. It is extremely expandable,
> fast, provides data integrity, and has low-CPU-cost compression. [snip]

It looks good, I just have two questions.

First, has it already been ported to FreeBSD, or would we have to run Solaris 10? My impression was that Richard is fairly keen on using FreeBSD as a platform, and the official ZFS FAQ says there are no official plans to port it to anything other than Solaris 10.

Second, is anyone here knowledgeable about the CDDL (the license for OpenSolaris)? I took a quick look, but I'm not familiar with it, and I'd feel more comfortable if I could have someone (preferably a lawyer, and preferably not one retained by Sun) tell us what we should be looking out for. In fact, that's probably a good idea regardless of what we use; I'm not familiar enough even with the GPL and LGPL to know what the gotchas are as far as designing a publicly accessible system like this. For instance, the CDDL FAQ suggests that there may be issues with statically linking source files that are under different licenses.

> Another option would be to gather metadata once uploaded, gzip the file
> once, and always serve it compressed. The only problem with this is if
> we ever decide to reference captures inline [snip]

I think there are ways around that; for instance, the reviewer could just upload Wireshark screenshots, or the analysis submission could allow the user to specify packet numbers, then fill in the blanks by decompressing the file, extracting the info for the desired packets into the DB, and recompressing the file. It'd be best to do that offline, though, which would just mean that the interface presenting analyses would have to recognize a not-yet-complete operation and display <packet info pending> or something.

> If we cache the most requested ones, it will be faster, but then we are
> back where we started... How would we store the cached files?! What if
> there are thousands of popular files we cache?

I haven't looked, but I suspect that it's possible, with PHP or Ruby or directly through an Apache module or such, to bypass the filesystem entirely and just have the web interface fetch the data straight from the DB. That would still involve the fs, of course, since the DB would be housed there, but it would be optimized by the DB software. It's easy enough to do in Perl at least: you just output the appropriate HTTP header and dump the data down the pipe, regardless of where the data comes from. I don't expect that it would be much harder in the other frameworks.

If you're worried about the sheer number of files on the filesystem, with ext2/ext3 at least you can set the number of inodes created when you set up the filesystem; if you're going to have lots of small files, you create more than the default number of inodes (4% of the filesystem or something like that, I think).

If you're more worried about access times (finding a file gets bloody slow with thousands of files in one directory), a standard trick is to radix sort into subdirectories. In this case, we could do that using the hash; i.e., use the first two or three hexadecimal characters of the hash as the name of a subdirectory in the base dir, the next two or three as the next subdirectory, etc. So if you had files with the following five hashes (I'll use the full hash as the filename for the example):

25A1078996BE4F57DD89ABD8692538A0FB64428D
25C69487E704607EC72D19D9E6E0552A47004F64
E4EABBA07718253835B74ADB8B276B2A45EC3F93
E4EBED96FB5CFF73922D15AA533032EB35A673E7
FCC4DF6660CB0E7C2ABFE439A7C423690B4CD7A6

you could create a tree like:

./25/A1/25A1078996BE4F57DD89ABD8692538A0FB64428D
./25/C6/25C69487E704607EC72D19D9E6E0552A47004F64
./E4/EA/E4EABBA07718253835B74ADB8B276B2A45EC3F93
./E4/EB/E4EBED96FB5CFF73922D15AA533032EB35A673E7
./FC/C4/FCC4DF6660CB0E7C2ABFE439A7C423690B4CD7A6

I suggest two or three because ext2, at least, can't handle more than 32767 subdirectories (including ./ and ../), so four would potentially cause problems. If we can go with ZFS, though, such kludges might not be necessary (*knock wood*). We'd probably have to do some testing to see for sure. (A sketch of this layout follows the message.)

So perhaps we should try to identify all of the issues we're worried about in the context of storage, and possible solutions?

1) Sheer number of bytes
   1a) background built-in compression by ZFS
   1b) automatic compression by openpacket.org on receipt of a trace (after summarization)
   1c) "offshoring" large traces via BitTorrent
   1d) background compression on an ext3 filesystem (or whatever FS FreeBSD prefers)

2) Number of files on the filesystem
   2a) ZFS (need to confirm that it handles large numbers of files gracefully)
   2b) FS tuning
   2c) some sort of automated archival of less-used files
   2d) files stored in the DB instead of on the fs

3) Number of files in a given directory
   3a) ZFS (need to confirm that directory seek time scales well for large directories)
   3b) radix sorting
   3c) pure DB handling of files served

Have I missed anything, either concerns or possible solutions? I think the major questions are whether ZFS will solve all the issues, whether it will solve them better than any of the other possible solutions, and whether it's worth either changing over to OpenSolaris or porting ZFS to FreeBSD ourselves (I suspect the latter will be a rather large job).

-Tim
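For what it's worth, the radix layout described above is only a few lines to implement; a sketch in Python, using two hex characters per level and two levels to match the example tree:

    import os

    def radixed_path(base_dir, hexdigest, chars=2, levels=2):
        # e.g. 25A1078996BE4F57... -> base_dir/25/A1/25A1078996BE4F57...
        parts = [hexdigest[i * chars:(i + 1) * chars] for i in range(levels)]
        directory = os.path.join(base_dir, *parts)
        os.makedirs(directory, exist_ok=True)
        return os.path.join(directory, hexdigest)

With two hex characters per level, each directory tops out at 256 subdirectories, comfortably under the limits mentioned above.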
From: Anthony J. <an...@pf...> - 2006-08-04 00:54:36
Tim Furlong wrote:
> On 8/3/06, Jacob Ham <ha...@gm...> wrote:
>>> snip snip snip
>
> I think the major questions are whether ZFS will solve all the issues,
> whether it will solve them better than any of the other possible
> solutions, and whether it's worth either changing over to OpenSolaris or
> porting ZFS to FreeBSD ourselves (I suspect the latter will be a rather
> large job).
>
> -Tim

There is this thread on freebsd-hackers dealing with ZFS:

http://docs.freebsd.org/cgi/getmsg.cgi?fetch=81033+0+archive/2006/freebsd-hackers/20060528.freebsd-hackers

It looks like ZFS on FreeBSD is not to be counted on anytime soon.

ant