abi

hi,

on our list of things to do is nailing down the file format of
oprofile's devices and database files, and making sure they are
platform portable and self-describing. this is basically just to make
sure that debugging someone else's oprofile setup is plausible, even
from another machine.

to this end, will suggested I write up an "abi document", which we
could put into doc/ or something, which just specifies file
formats. so then we would have something in english + diagrams, and
explicit at the byte-level, to compare structs, testsuites, and i/o
functions against as we carry on development. So I have written a
quick draft of such a document, which follows. Any comments or
corrections? If it looks OK I'm going to start emitting patches which
tighten up at least the database i/o routines to conform.

-graydon

---

oprofile ABI
~~~~~~~~~~~~

This document is a normative reference to the oprofile ABI, by which
we mean the binary layout of two interfaces:

	- the structure of the mmap()'ed device files used to move
	  data between the kernel and the daemon

	- the structure of the mmap()'ed database file used to store
	  samples permanently on disk

The ABI must satisfy the following requirements:

	- it must be reasonably simple, and represent roughly what the
	  tools already do at the time of writing.

	- it must have an unambiguous encoding at the byte stream
	  level.

	- it must support 31, 32, 64 and possibly larger word-size
	  architectures.

	- it must support classical "big", "little", and possibly
	  other endiannesses (word byte orders).

	- the ABI must be reasonably self-identifying and
	  self-describing.

This document places no requirements on the reading end; it is
acceptable for a reading application to simply give up if some
combination of input fields is confusing or unacceptable. The purpose
of this document is to lay out what must be *written*, so that a
reading application has a chance to decide whether it can sensibly
handle a file or not, and so that all information needed to decode the
file is available even if the file has been moved to a different host
environment.

Headers
~~~~~~~

The first 8 bytes of any "oprofile ABI" file are called the header,
and establish a magic number, version codes (major versions being
backwards-incompatible, minor being compatible), a width byte, and a
reserved byte.

note-device header:

    0       1       2       3       4       5       6       7      8
    +-------+-------+-------+-------+-------+-------+-------+------+
    |  'o'  |  'p'  |  'n'  |  'd'  | major | minor | width | rsvd |

sample-device header:

    0       1       2       3       4       5       6       7      8
    +-------+-------+-------+-------+-------+-------+-------+------+
    |  'o'  |  'p'  |  's'  |  'd'  | major | minor | width | rsvd |

hash-map device header:

    0       1       2       3       4       5       6       7      8
    +-------+-------+-------+-------+-------+-------+-------+------+
    |  'o'  |  'p'  |  'h'  |  'd'  | major | minor | width | rsvd |

database file header:

    0       1       2       3       4       5       6       7      8
    +-------+-------+-------+-------+-------+-------+-------+------+
    |  'o'  |  'p'  |  'd'  |  'b'  | major | minor | width | rsvd |

The width byte specifies the width of subsequent words in the file,
after the reserved byte. The width byte is an unsigned value, referred
to hereafter as "wb", and is a count of bytes; for example, if wb=8,
the remaining words in the file are 8 bytes (64 bits) each. The
remainder of the file is a sequence of words of this size, for
efficiency's sake.
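
As a rough illustration of the header layout above, here is a minimal C
sketch of a reader for the 8 header bytes. The names op_abi_header and
op_parse_header are hypothetical and not part of oprofile; rejecting a
width of zero is also my own assumption, since the draft places no
requirements on readers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory form of the 8-byte header (illustrative only). */
struct op_abi_header {
	char    magic[4];   /* "opnd", "opsd", "ophd" or "opdb" */
	uint8_t major;      /* bumped on incompatible changes */
	uint8_t minor;      /* bumped on compatible changes */
	uint8_t width;      /* "wb": bytes per subsequent word */
	uint8_t reserved;
};

/* Parse the 8 header bytes at buf. Returns 0 on success, -1 if the
 * buffer is too short, the magic differs from expect_magic, or the
 * width byte is zero. */
int op_parse_header(unsigned char const * buf, size_t len,
		    char const * expect_magic, struct op_abi_header * out)
{
	assert(buf != NULL && expect_magic != NULL && out != NULL);
	if (len < 8)
		return -1;
	if (memcmp(buf, expect_magic, 4) != 0)
		return -1;
	memcpy(out->magic, buf, 4);
	out->major = buf[4];
	out->minor = buf[5];
	out->width = buf[6];
	out->reserved = buf[7];
	return out->width == 0 ? -1 : 0;
}
```

A reader for database files would call op_parse_header(buf, len, "opdb",
&h) and then decide for itself whether it can handle h.major and h.width.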

The next wb bytes after the header are called the endianness word,
referred to hereafter as "eb"; they specify a byte-permutation which
maps bytes in subsequent words of the file to sub-word bytes in
memory, counting from the low-order byte. In other words, the
"identity" interpretation of subsequent words in the file is as
classical little-endian words, but any byte order may be specified
using appropriate eb bytes.

Specifically, let the byte eb[0] be the first byte in the endianness
word, and eb[wb-1] be the last byte in the endianness word, then a
word in memory can be faithfully constructed from any on-disk format,
in any type of host memory, using this procedure:

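	/* "word" is an unsigned integer type at least wb bytes wide;
	 * wb is the width byte from the header. */
	word read_word(unsigned char const * eb) {
		word w = 0;
		for (int i = 0; i < wb; ++i) {
			/* widen before shifting so high bytes are not lost */
			w |= ((word) next_byte() << (eb[i] * 8));
		}
		return w;
	}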

Since eb is a permutation, it is illegal for any byte in the
endianness word to be greater than wb-1, or to appear more than once.
Implementations may choose to use more efficient algorithms than the
above (such as the identity function on IA32) if they recognize a
compatible endianness in the endianness word.

All words in the remainder of the file must follow the encoding scheme
of the endianness word.
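
For the writing side, the inverse of read_word can be sketched as
follows, together with a check that an endianness word really is a
permutation. This is only a sketch: the function names are hypothetical,
and it assumes wb <= 8 so that a word fits in a uint64_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if eb[0..wb-1] is a permutation of 0..wb-1, else 0.
 * Assumes wb <= 8 so that a word fits in uint64_t. */
int op_eb_valid(unsigned char const * eb, unsigned wb)
{
	unsigned seen = 0;   /* bit k is set once value k has appeared */
	assert(eb != NULL);
	if (wb == 0 || wb > 8)
		return 0;
	for (unsigned i = 0; i < wb; ++i) {
		if (eb[i] >= wb || (seen & (1u << eb[i])))
			return 0;
		seen |= 1u << eb[i];
	}
	return 1;
}

/* Inverse of read_word: store the wb bytes of w into out in file
 * order. The caller is expected to have validated eb first. */
void op_write_word(uint64_t w, unsigned char const * eb, unsigned wb,
		   unsigned char * out)
{
	assert(eb != NULL && out != NULL);
	for (unsigned i = 0; i < wb; ++i)
		out[i] = (unsigned char) (w >> (eb[i] * 8));
}
```

Validating first matters because an out-of-range eb byte would otherwise
produce a shift wider than the word.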

examples:

	platform		  wb	   eb[0]..eb[wb-1]
	------------------------+-------+----------------------
	IA32 little endian	| 4	| 0 1 2 3
	IA64 little endian	| 8	| 0 1 2 3 4 5 6 7
	PPC32 big endian	| 4	| 3 2 1 0
	PPC64 big endian	| 8	| 7 6 5 4 3 2 1 0
	kooky word and endian	| 5	| 3 4 2 0 1
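
A writer that dumps raw words from memory needs to know which eb to
emit for its own host. One way to derive it, sketched here for 4-byte
words only, is to store a probe word whose k-th byte (counting from the
low-order end) has the value k and then look at the bytes in memory
order. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Fill eb[0..3] with the endianness word that a writer dumping raw
 * 32-bit words from host memory would have to emit. */
void op_host_eb32(unsigned char eb[4])
{
	/* Logical byte k (counting from the low-order end) holds value k. */
	uint32_t probe = 0x03020100u;
	assert(eb != NULL);
	/* Memory byte i now names the logical byte stored at position i,
	 * which is exactly the definition of eb[i]. */
	memcpy(eb, &probe, 4);
}
```

On a little-endian host this yields 0 1 2 3, and on a big-endian host
3 2 1 0, matching the IA32 and PPC32 rows in the table above.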

All remaining structures in either file, with the exception of the
string pool in the hash map file, are sequences of words (of length wb
bytes) following the endianness mapping of eb. 

Note device file
~~~~~~~~~~~~~~~~

after the endianness word, the note device file is a dynamically
produced stream of notes. there is no limit to the number of notes
which occur. each note is 6 words, of the following form:

	[ address                     ]
	[ length                      ]
	[ offset                      ]
	[ hash of path                ]
	[ pid                         ]
	[ note type                   ]
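
To make the note layout concrete, here is a sketch of decoding a single
note from a byte buffer. The struct and function names are hypothetical,
the word reader is a buffer-based variant of read_word, and wb <= 8 is
assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory form of one note; field order follows the
 * layout above. */
struct op_note_rec {
	uint64_t address, length, offset, path_hash, pid, type;
};

/* Decode one wb-byte word at p using the permutation eb. */
static uint64_t rd_word(unsigned char const * p,
			unsigned char const * eb, unsigned wb)
{
	uint64_t w = 0;
	for (unsigned i = 0; i < wb; ++i)
		w |= (uint64_t) p[i] << (eb[i] * 8);
	return w;
}

/* Decode one 6-word note at p. Returns the number of bytes consumed,
 * or 0 if fewer than 6*wb bytes are available. */
size_t op_read_note(unsigned char const * p, size_t avail,
		    unsigned char const * eb, unsigned wb,
		    struct op_note_rec * out)
{
	uint64_t f[6];
	assert(p != NULL && eb != NULL && out != NULL);
	if (wb == 0 || wb > 8 || avail < (size_t) 6 * wb)
		return 0;
	for (unsigned i = 0; i < 6; ++i)
		f[i] = rd_word(p + (size_t) i * wb, eb, wb);
	out->address   = f[0];
	out->length    = f[1];
	out->offset    = f[2];
	out->path_hash = f[3];
	out->pid       = f[4];
	out->type      = f[5];
	return (size_t) 6 * wb;
}
```

Since the stream has no note count, a reader would call this in a loop,
advancing by the returned byte count until it returns 0.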

Sample device file
~~~~~~~~~~~~~~~~~~

after the endianness word, the sample device is a dynamically produced
stream of sample buffers. there is no limit to the number of sample
buffers which occur. each sample buffer begins with a 3-word buffer
head:

	[ cpu number                  ]
	[ sample count                ] - called sc hereafter
	[ profiler state              ]

after the buffer head, there is a sequence of 4*sc words, divided into
sc 4-word samples:
                                       _
	[ sample 0                    ] \
	[                             ]  \__ one sample
	[                             ]  /
	[                             ]_/
	[ sample 1                    ] \
	[                             ]  \__ one sample
	[                             ]  /
	[                             ]_/
	  ...
	  ...
	  ...
	  ...                          _
	[ sample sc-1                 ] \
	[                             ]  \__ one sample
	[                             ]  /
	[                             ]_/

as depicted above, each sample is 4 words long, and contains the
following fields:

	[ eip (instruction pointer)   ]
	[ sample count                ]
	[ hardware counter number     ]
	[ pid (process ID)            ]
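
The sample buffer arithmetic (a 3-word head followed by sc 4-word
samples) can be sketched like this. Names are hypothetical, wb <= 8 is
assumed, and the overflow checks a real reader would want for huge sc
values are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory form of one sample. */
struct op_sample_rec {
	uint64_t eip, count, counter, pid;
};

/* Decode one wb-byte word at p using the permutation eb. */
static uint64_t rd_word(unsigned char const * p,
			unsigned char const * eb, unsigned wb)
{
	uint64_t w = 0;
	for (unsigned i = 0; i < wb; ++i)
		w |= (uint64_t) p[i] << (eb[i] * 8);
	return w;
}

/* Byte length of a sample buffer holding sc samples: 3 head words
 * plus 4 words per sample. Also the byte offset of sample number sc. */
size_t op_sample_buffer_bytes(uint64_t sc, unsigned wb)
{
	return (size_t) (3 + 4 * sc) * wb;
}

/* Decode sample idx (0-based) from the buffer starting at buf.
 * Returns 0 on success, -1 if idx >= sc or the data is too short. */
int op_read_sample(unsigned char const * buf, size_t avail,
		   unsigned char const * eb, unsigned wb,
		   uint64_t idx, struct op_sample_rec * out)
{
	unsigned char const * p;
	assert(buf != NULL && eb != NULL && out != NULL);
	if (wb == 0 || wb > 8 || avail < (size_t) 3 * wb)
		return -1;
	/* sc is the second word of the buffer head */
	if (idx >= rd_word(buf + wb, eb, wb))
		return -1;
	if (avail < op_sample_buffer_bytes(idx + 1, wb))
		return -1;
	p = buf + op_sample_buffer_bytes(idx, wb);
	out->eip     = rd_word(p, eb, wb);
	out->count   = rd_word(p + wb, eb, wb);
	out->counter = rd_word(p + 2 * (size_t) wb, eb, wb);
	out->pid     = rd_word(p + 3 * (size_t) wb, eb, wb);
	return 0;
}
```

The next buffer in the stream starts op_sample_buffer_bytes(sc, wb)
bytes after the start of the current one.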

Hash map file
~~~~~~~~~~~~~

after the endianness word, the hash map file has 2 words containing
the sizes of the hashtable and string pool:

	[  size of hashtable          ] - called hsz hereafter
	[  size of stringpool         ] - called spsz hereafter

followed by (2 * hsz) words containing hash entries
                                       _
	[ entry 0                     ] \_ one hash entry
	[                             ]_/
	[ entry 1                     ] \_ one hash entry
	[                             ]_/
	  ...
	  ...                          _
	[ entry hsz-1                 ] \_ one hash entry
	[                             ]_/

followed by spsz *bytes*, which is the string pool. strings are packed
one after another, are not aligned, and are null-terminated.

as depicted above, each hash entry is 2 words long, and contains the
following fields:

	[ name (index in string pool) ]
	[ parent (index in hashtable) ]
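
The hash map layout implies the following byte offsets, sketched here in
C. The function names are hypothetical. Treating the "name" field as a
byte offset into the string pool is my assumption; the draft only calls
it an index.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte offset of hash entry idx from the start of the file: 8 header
 * bytes, then the endianness word, hsz and spsz (3 words), then
 * 2 words per preceding entry. */
size_t op_hash_entry_offset(unsigned wb, uint64_t idx)
{
	return 8 + (size_t) wb * (3 + 2 * (size_t) idx);
}

/* Byte offset of the string pool, which follows all hsz entries. */
size_t op_string_pool_offset(unsigned wb, uint64_t hsz)
{
	return op_hash_entry_offset(wb, hsz);
}

/* Return the null-terminated string at byte offset name in the pool,
 * or NULL if the offset is out of range or no terminator is found
 * within the spsz bytes of the pool. */
char const * op_pool_string(char const * pool, size_t spsz, uint64_t name)
{
	assert(pool != NULL);
	if (name >= spsz)
		return NULL;
	if (memchr(pool + name, '\0', spsz - (size_t) name) == NULL)
		return NULL;
	return pool + name;
}
```

The terminator check keeps a reader from running off the end of the
pool when handed a truncated or corrupt file.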

Database files
~~~~~~~~~~~~~~

after the endianness word, a DB file has the following 8 words,
representing a database-specific header:

	[ total size (allocated)      ]  - called tsz hereafter
	[ current size (in use)       ]
	[ root page index             ]
	[ page table size             ]  - called ptsz hereafter
	[ padding / reserved 0        ]
	[ padding / reserved 1        ]
	[ padding / reserved 2        ]
	[ padding / reserved 3        ]

the database header is then followed by a sequence of tsz page table
records. each page table record is a sequence of (3*ptsz)+2 words,
divided into a 2-word header and ptsz 3-word page table entries
(ptes):

	[ occupation count            ]
	[ left page index             ]_
	[ page table entry 0          ] \ 
	[ ...                         ]  >-- one pte
	[ ...                         ]_/ 
	[ page table entry 1          ] \
	[ ...                         ]  >-- one pte
	[ ...                         ]_/
	  ...
	  ...
	  ...                          _
	[ page table entry ptsz-1     ] \
	[ ...                         ]  >-- one pte
	[ ...                         ]_/

as depicted above, each page table entry is 3 words long, and
contains the following fields:

	[ right page index            ]
	[ database val (sample count) ]
	[ database key (eip)          ]
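
Putting the database layout together, the byte offset of any page table
record or pte follows from wb and ptsz alone. Below is a sketch with
hypothetical function names; it assumes records are packed back to back
immediately after the 8-word database header, as described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Words in one page table record: a 2-word header plus ptsz 3-word
 * page table entries. */
size_t op_db_record_words(uint64_t ptsz)
{
	return 2 + 3 * (size_t) ptsz;
}

/* Byte offset of page table record n from the start of the file:
 * 8 header bytes, then the endianness word and the 8 database header
 * words (9 words), then n whole records. */
size_t op_db_record_offset(unsigned wb, uint64_t ptsz, uint64_t n)
{
	return 8 + (size_t) wb * (9 + (size_t) n * op_db_record_words(ptsz));
}

/* Byte offset of pte k within record n. The pte's three words are
 * right page index, database val and database key, in that order. */
size_t op_db_pte_offset(unsigned wb, uint64_t ptsz, uint64_t n, uint64_t k)
{
	assert(k < ptsz);
	return op_db_record_offset(wb, ptsz, n)
		+ (size_t) wb * (2 + 3 * (size_t) k);
}
```

With tsz records in the file, op_db_record_offset(wb, ptsz, tsz) gives
the total file size implied by the header, which a reader can compare
against the actual size as a sanity check.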