From: <gr...@re...> - 2002-10-08 20:54:18
hi,

along our list of things to do is nailing down the file format of oprofile's devices and database files, and making sure they are platform portable and self-describing. this is basically just to make sure that debugging someone else's oprofile setup is plausible, even from another machine.

to this end, will suggested I write up an "abi document", which we could put into doc/ or something, which just specifies file formats. so then we would have something in english + diagrams, and explicit at the byte-level, to compare structs, testsuites, and i/o functions against as we carry on development.

So I have written a quick draft of such a document, which follows. Any comments or corrections? If it looks OK I'm going to start emitting patches which tighten up at least the database i/o routines to conform.

-graydon

---

oprofile ABI
~~~~~~~~~~~~

This document is a normative reference to the oprofile ABI, by which we mean the binary layout of two interfaces:

 - the structure of the mmap()'ed device files used to move data between the kernel and the daemon

 - the structure of the mmap()'ed database file used to store samples permanently on disk

The ABI must satisfy the following requirements:

 - it must be reasonably simple, and represent roughly what the tools already do at the time of writing.

 - it must have an unambiguous encoding at the byte stream level.

 - it must support 31, 32, 64 and possibly larger word-size architectures.

 - it must support classical "big", "little", and possibly other endiannesses (word byte orders).

 - the ABI must be reasonably self-identifying and self-describing.

This document places no requirements on the reading end; it is acceptable for a reading application to simply give up if some combination of input fields is confusing or unacceptable.
The purpose of this document is to lay out what must be *written*, so that a reading application has a chance to decide whether it can sensibly handle a file or not, and so that all information needed to decode the file is available even if the file has been moved to a different host environment.

Headers
~~~~~~~

The first 8 bytes of any "oprofile ABI" file are called the header, and establish a magic number, version codes (major versions being backwards-incompatible, minor being compatible), a width byte, and a reserved byte.

note-device header:

   0       1       2       3       4       5       6       7       8
   +-------+-------+-------+-------+-------+-------+-------+-------+
   |  'o'  |  'p'  |  'n'  |  'd'  | major | minor | width | rsvd  |

sample-device header:

   0       1       2       3       4       5       6       7       8
   +-------+-------+-------+-------+-------+-------+-------+-------+
   |  'o'  |  'p'  |  's'  |  'd'  | major | minor | width | rsvd  |

hash-map device header:

   0       1       2       3       4       5       6       7       8
   +-------+-------+-------+-------+-------+-------+-------+-------+
   |  'o'  |  'p'  |  'h'  |  'd'  | major | minor | width | rsvd  |

database file header:

   0       1       2       3       4       5       6       7       8
   +-------+-------+-------+-------+-------+-------+-------+-------+
   |  'o'  |  'p'  |  'd'  |  'b'  | major | minor | width | rsvd  |

The width byte specifies the width of subsequent words in the file, after the reserved byte. the width byte is an unsigned value, referred to hereafter as "wb", and is a count of bytes; for example if wb=8, the remaining words in the file are 8 bytes (64 bits) each. the remainder of the file is a sequence of words of this size, for efficiency's sake.

The next wb bytes after the header are called the endianness word, referred to hereafter as "eb"; they specify a byte-permutation which maps bytes in subsequent words of the file to sub-word bytes in memory, counting from the low-order byte. In other words, the "identity" interpretation of subsequent words in the file is as classical little-endian words, but any byte order may be specified using appropriate eb bytes.
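As a rough illustration of the header layout above, a reader could check the first 8 bytes along these lines. This is only a sketch: the struct and function names (op_header, op_parse_header) are made up here and are not taken from the oprofile sources, and it does no version checking at all.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory view of the 8-byte header (names are illustrative). */
struct op_header {
    char magic[4];          /* "opnd", "opsd", "ophd" or "opdb" */
    unsigned char major;    /* backwards-incompatible version */
    unsigned char minor;    /* compatible version */
    unsigned char width;    /* wb: bytes per word in the rest of the file */
    unsigned char rsvd;     /* reserved byte */
};

/* Returns 0 on success, -1 if buf is too short or the magic is unknown. */
int op_parse_header(const unsigned char *buf, size_t len, struct op_header *out)
{
    static const char *const known[] = { "opnd", "opsd", "ophd", "opdb" };
    size_t i;

    if (len < 8)
        return -1;
    for (i = 0; i < sizeof known / sizeof known[0]; ++i) {
        if (memcmp(buf, known[i], 4) == 0) {
            memcpy(out->magic, buf, 4);
            out->major = buf[4];
            out->minor = buf[5];
            out->width = buf[6];
            out->rsvd  = buf[7];
            return 0;
        }
    }
    return -1;
}
```

Since the header is made of single bytes only, it can be read before anything is known about the word size or byte order of the rest of the file.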
Specifically, let the byte eb[0] be the first byte in the endianness word, and eb[wb-1] be the last byte in the endianness word, then a word in memory can be faithfully constructed from any on-disk format, in any type of host memory, using this procedure:

   /* word: an unsigned integer type at least wb bytes wide.
      wb and next_byte() are supplied by the reader's context. */
   word read_word(int const * eb)
   {
     word w = 0x0;
     for(int i = 0; i < wb; ++i)
     {
       /* widen before shifting so the high-order bytes are not lost */
       w |= ((word) next_byte() << (eb[i] * 8));
     }
     return w;
   }

Obviously, it is illegal for any byte in the endian word to be greater than wb-1. Implementations may choose to use more efficient algorithms than the above (such as the identity function on IA32) if they recognize a compatible endianness in the endian word. All words in the remainder of the file must follow the encoding scheme of the endian word.

examples:

   platform                |  wb   |  eb[0]..eb[wb-1]
   ------------------------+-------+----------------------
   IA32 little endian      |   4   |  0 1 2 3
   IA64 little endian      |   8   |  0 1 2 3 4 5 6 7
   PPC32 big endian        |   4   |  3 2 1 0
   PPC64 big endian        |   8   |  7 6 5 4 3 2 1 0
   kooky word and endian   |   5   |  3 4 2 0 1

All remaining structures in any of the files, with the exception of the string pool in the hash map file, are sequences of words (of length wb bytes) following the endianness mapping of eb.

Note device file
~~~~~~~~~~~~~~~~

after the endianness word, the note device file is a dynamically produced stream of notes. there is no limit to the number of notes which occur. each note is 6 words, of the following form:

   [ address      ]
   [ length       ]
   [ offset       ]
   [ hash of path ]
   [ pid          ]
   [ note type    ]

Sample device file
~~~~~~~~~~~~~~~~~~

after the endianness word, the sample device is a dynamically produced stream of sample buffers. there is no limit to the number of sample buffers which occur.
each sample buffer begins with a 3-word buffer head:

   [ cpu number     ]
   [ sample count   ] - called sc hereafter
   [ profiler state ]

after the buffer head, there is a sequence of 4*sc words, divided into sc 4-word samples:

                    _
   [ sample 0      ] \
   [               ]  \__ one sample
   [               ]  /
   [               ]_/
   [ sample 1      ] \
   [               ]  \__ one sample
   [               ]  /
   [               ]_/
   ...
   ...
   ...
   ...
                    _
   [ sample sc-1   ] \
   [               ]  \__ one sample
   [               ]  /
   [               ]_/

as depicted above, each sample is 4 words long, and contains the following fields:

   [ eip (instruction pointer) ]
   [ sample count              ]
   [ hardware counter number   ]
   [ pid (process ID)          ]

Hash map file
~~~~~~~~~~~~~

after the endianness word, the hash map file has 2 words containing the sizes of the hashtable and string pool:

   [ size of hashtable  ] - called hsz hereafter
   [ size of stringpool ] - called spsz hereafter

followed by (2 * hsz) words containing hash entries

                    _
   [ entry 0       ] \_ one hash entry
   [               ]_/
   [ entry 1       ] \_ one hash entry
   [               ]_/
   ...
   ...
                    _
   [ entry hsz-1   ] \_ one hash entry
   [               ]_/

followed by spsz *bytes*, which is the string pool. strings are packed one after another, are not aligned, and are null-terminated.

as depicted above, each hash entry is 2 words long, and contains the following fields:

   [ name (index in string pool) ]
   [ parent (index in hashtable) ]

Database files
~~~~~~~~~~~~~~

after the endianness word, a DB file has the following 8 words, representing a database-specific header:

   [ total size (allocated) ] - called tsz hereafter
   [ current size (in use)  ]
   [ root page index        ]
   [ page table size        ] - called ptsz hereafter
   [ padding / reserved 0   ]
   [ padding / reserved 1   ]
   [ padding / reserved 2   ]
   [ padding / reserved 3   ]

the database header is then followed by a sequence of tsz page table records. each page table record is a sequence of (3*ptsz)+2 words, divided into a 2-word header and ptsz 3-word page table entries (ptes):

   [ occupation count        ]
   [ left page index         ]_
   [ page table entry 0      ] \
   [ ...                     ]  >-- one pte
   [ ...                     ]_/
   [ page table entry 1      ] \
   [ ...                     ]  >-- one pte
   [ ...                     ]_/
   ...
   ...
   ...
                              _
   [ page table entry ptsz-1 ] \
   [ ...                     ]  >-- one pte
   [ ...                     ]_/

as depicted above, each page table entry is 3 words long, and contains the following fields:

   [ right page index            ]
   [ database val (sample count) ]
   [ database key (eip)          ]
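
Putting the pieces together, here is a minimal sketch of how a reader might check the endianness word, decode words with it, and pull out the 8-word database header. It assumes wb <= 8 so that a word fits in a uint64_t, and every name in it (op_eb_valid, op_decode_word, op_db_header, op_read_db_header, op_record_words) is invented for illustration rather than taken from the oprofile sources.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if eb[0..wb-1] is a permutation of 0..wb-1, else 0.
   Assumption of this sketch: only wb in 1..8 is handled. */
int op_eb_valid(const unsigned char *eb, unsigned wb)
{
    unsigned seen = 0, i;
    if (wb == 0 || wb > 8)
        return 0;
    for (i = 0; i < wb; ++i) {
        if (eb[i] >= wb || (seen & (1u << eb[i])))
            return 0;               /* out of range, or used twice */
        seen |= 1u << eb[i];
    }
    return 1;
}

/* Decodes one wb-byte word stored at p, using the byte permutation eb. */
uint64_t op_decode_word(const unsigned char *p, const unsigned char *eb,
                        unsigned wb)
{
    uint64_t w = 0;
    unsigned i;
    for (i = 0; i < wb; ++i)
        w |= (uint64_t)p[i] << (eb[i] * 8);
    return w;
}

/* Hypothetical view of the 8-word database header. */
struct op_db_header {
    uint64_t total_size;        /* tsz */
    uint64_t current_size;
    uint64_t root_page;
    uint64_t page_table_size;   /* ptsz */
    uint64_t reserved[4];
};

/* Words in one page table record: 2-word header plus ptsz 3-word ptes. */
uint64_t op_record_words(uint64_t ptsz)
{
    return 3 * ptsz + 2;
}

/* file points at byte 0 of a database file image of len bytes.
   Returns 0 on success, -1 if the reader should give up. */
int op_read_db_header(const unsigned char *file, size_t len,
                      struct op_db_header *out)
{
    const unsigned char *eb, *p;
    uint64_t w[8];
    unsigned wb, i;

    if (len < 8 || memcmp(file, "opdb", 4) != 0)
        return -1;
    wb = file[6];
    /* need the 8-byte header, the endianness word and 8 header words */
    if (wb == 0 || wb > 8 || len < 8 + (size_t)wb * 9)
        return -1;
    eb = file + 8;
    if (!op_eb_valid(eb, wb))
        return -1;
    p = eb + wb;
    for (i = 0; i < 8; ++i)
        w[i] = op_decode_word(p + (size_t)i * wb, eb, wb);
    out->total_size = w[0];
    out->current_size = w[1];
    out->root_page = w[2];
    out->page_table_size = w[3];
    memcpy(out->reserved, &w[4], sizeof out->reserved);
    return 0;
}
```

A reader that gets -1 back can simply give up, which this document explicitly allows on the reading end.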