#49 Extended Label support for review & comment

new Feature

this is still beta quality (actually has one outstanding crash bug), but here for review & comment. I tried to send an email to the dev list but it bounced. I guess I'll just copy & paste it here for now (if it'll let me):

I'm working on a project (reverse engineering some old c64 stuff) and I'm not very up on current-day macro assemblers. In fact, I've never used them. When I wrote assembly as a child, I did it all on paper. So I've been trying to catch up on this (not having played with a c64 since the early 90s) and I haven't found any disassemblers that offer the support I need (including Vice's). As such, I've had to add it somewhere and I ended up doing it with Vice (actually, I added some of this to dxa, but I like working with Vice for this better).

So, any interest in my extended labels patches? Here's the synopsis:

Backwards compatible with existing syntax.
add_label addr name [type [size]] [comment]
Changed management of names & addresses:
exactly one label allowed for a given address
exactly one label allowed for a given name
you still get a warning when replacing one
Current types are
code (the default)
addr, word (treated the same)
asciz (null-terminated ascii string or array of strings)
asci7 (bit-7-terminated ascii string or array of strings)
Rewrote address & name lookup mechanism:
Added red-black tree implementation from Linux kernel to project as well as a library I wrote last year that extends it.
Removed hash table mechanism
Added rbtree mechanism for lookups of addresses & symbol names.
Note: Most implementations of the STL container std::map uses rbtrees, so converting some of this to C++ is one alternative to adding the Linux rbtree code (which appears to work quite well, although it's easy to misuse and employs the gcc-specific function __builtin_constant_p).

The asci7 is a strange one and I'm sure there are gazillions of string termination mechanisms, but I need this one personally in my stuff (feel free to remove it if you want my patches).

Some outstanding issues (some serious, some my preferences):

built-in .PC label is broken (will mostly likely crash)
help text needs to be updated for new syntax
make warning message for replacing labels friendlier when only changing type, size and/or comments (this warning is supposed to help you not blow away other labels by accident)
make sure I haven't broken the lexer in some way that hasn't been discovered yet (I'm completely new to yacc, bison, flex, etc.)
fix lexer to have labels not dot prefix (which is pretty non-standard and conflicts with psudo-op syntax IMO)
show_labels and save_labels should ideally return them sorted by address.
comments can be formatted more prettily
perhaps a mechanism to annotate disassembled code with comments outside of labels?

The last issue leads to a big question, "Should labels / symbols be integrated with memory content definitions (i.e., code, data, etc.)?" (or for that matter, comments) I'm thinking that they should be separate but inter-operable mechanisms personally, but I'm personally in no rush to "fix" this, as what I have now is enough for me to do my work on my project.

So here's a little sample (using some Ultima 1 data). This is some string data (an array of bit-7-terminated strings) displayed in various formats:

(C:$7930) al 7911 .u1TextClasses asci7 1f; this is how it's supposed to look
(C:$7930) d 7911 792e
.u1TextClasses: ; this is how it's supposed to look
.C:7911 .asci7 "peasant", "fighter", "cleric", "wizard", "thief"
(C:$7930) al 7911 .u1TextClasses ascii 1f; this wont look right
Warning: Replacing label with the same address [.u1TextClasses=$7911 asci7 size=31] with [.u1TextClasses=$7911 ascii size=31].
(C:$7930) d 7911 792e
.u1TextClasses: ; this wont look right
.C:7911 .ascii "peasan\xF4fighte\xF2cleri\xE3wizar\xE4thie\xE6"
(C:$7930) dl .u1TextClasses
(C:$7930) al 7911 .u1TextClasses byte 1f
(C:$7930) d 7911 792e
.C:7911 .byte 70, 65, 61, 73, 61, 6e, f4, 66, 69, 67, 68, 74
.C:791d .byte 65, f2, 63, 6c, 65, 72, 69, e3, 77, 69, 7a, 61
.C:7929 .byte 72, e4, 74, 68, 69, 65, e6
(C:$7930) dl .u1TextClasses
(C:$7930) al 7911 .u1TextClasses word 1f
(C:$7930) d 7911 792e
.C:7911 .word 6570, 7361, 6e61, 66f4, 6769, 7468, f265, 6c63
.C:7921 .word 7265, e369, 6977, 617a, e472, 6874, 6569, 66e6

Here's some code & data that's intermixed, first without defining the data sections and then with:

(C:$8614) al 848e .u1PrintStringOnStack; Print a null-terminated string at stack return address, return to instruction after string
(C:$8614) d 8607 8612
.C:8607 20 8E 84 JSR .u1PrintStringOnStack
.C:860a 48 PHA
.C:860b 75 68 ADC .zBITS,X
.C:860d 3F 00 20 RLA $2000,X
.C:8610 72 JAM
.C:8611 87 38 SAX $38
.C:8613 60 RTS
(C:$8614) al 860a .u1TextHuhQ asciz 05; much easier to read
(C:$8614) d 8607 8612
.C:8607 20 8E 84 JSR .u1PrintStringOnStack
.u1TextHuhQ: ; much easier to read
.C:860a .asciz "Huh?"
.C:860f 20 72 87 JSR $8772
.C:8612 38 SEC
.C:8613 60 RTS

I've attached the current patch if you guys have any interest. I guess it's most helpful when trying to do more in-depth reverse engineering. It lacks a lot of what dxa does, as well as a mechanism to attach symbols to code/data loaded from specific files (which would make it even more useful).



  • Soci/Singular

    This sounds interesting. Not exactly for reverse engineering, but for debugging. An assembler could output these data type hints (plus labels) and disassembling in monitor would be much more readable ;)

    It would be useful to have data type definition without any label for fields of non-uniform data tables, so I would vote to separate data definition from labels.

    I think gcc specific extensions are not that popular here. The problem is it has to compile on visualc with it's c89 limitations and idiotic warnings as well ;(

    Uniq name and address definition could be a bit limiting but ok at first. What about having more than one named trees (switchable somehow), so that different banks could have different labels for the same addresses? (e.g. on c128 or cartridges)

    Had a very quick look, here's two comments on code:

    + return 1;
    + default: {
    + BYTE *p;
    + BYTE *end = &mon_memmap[loc + 256];

    Use mon_memmap_size as modulo, this will crash. What's the reason guess_label_size has to depend on FEATURE_CPUMEMHISTORY? Why not use the mon_get_mem_val instead?

    +static const struct rbtree_relationship rel = { \
    + .root_offset = offsetof(ctype, croot), \
    + .node_offset = offsetof(otype, onode), \
    + .key_offset = offsetof(otype, okey), \
    + .compare = (int(*)(const void*, const void*))compare_fn; \
    +}; \

    Probably will not compile on VC ;(

  • daniel_santos

    Thanks so much! I even got that piece of code to work properly!

    WORD i;
    const WORD end = loc < 0xff00u ? loc + 0x100u : 0xffffu; /* max 256 bytes, but no wrapping */

    /* sanity check type */
    assert(type == e_label_ascii || type == e_label_asciz || type == e_label_asci7);

    /* not messing with drives for now */
    if(mem != e_comp_space)
    return 0;

    for(i = loc; i != end; ++i) {
    BYTE b = mon_get_mem_val(e_comp_space, i);

    switch(type) {
    case e_label_ascii: if (isprint(b)) continue; else break;
    case e_label_asciz: if (b) continue; else break;
    case e_label_asci7: if (! (b & 0x80)) continue; else break;
    default: break; /* suppress enum warning */
    return i - loc + (type == e_label_ascii ? 0 : 1);

    As far as the struct initialization, it's a new syntax to me too, I just learned it last year working on the kernel. I don't know what C standard it's supported in, but I don't mind changing that. However, I don't think I ever should have written those macros in rbtree_ext.h, I think it's better to just let the implementer do that code themselves. Plus, it makes debugging more of a pain because you can't step through it. But I can change the struct initialization. The one that I'm pretty certain will break elsewhere is __builtin_constant_p(). I think I'll encapsulate it somehow in a pre-processor macro so we can still catch uses of those functions with non-compile-time data when compiling with gcc.

    As far as marking data type separately, I think I'm starting to get sold on that. However, I came up with a separate requirement -- allowing the ability to comment code without having an explicit label. Again, should I remove comments from the label functionality and institute a separate comment functionality? In this case, I'm thinking no just because you may have a comment on a label but also a comment on the piece of code or data that it points to. I'm still not sure how to do this here, but I'm thinking about just implementing them as labels with no name, type or size... I dunno still.

    What I do know is that I would like to be able to have a separate file with my annotations (labels, data type definitions, comments, etc) that the monitor pulls in to help me debug, reverse-engineer, etc.

    Also, I think I need a mechanism to define formatting & alignment. Currently, with data, I just say "try to break at column 70, force break at 127." So if you have a long string, it will get to 127, close the quote, line break and start again. (this code also contains a flaw that doesn't put a comma after closing quotes because it's going to print the next string in an array on the next line). Well generally want to see labels & the instr/data address at col 0, the raw bytes at 2 spaces after the address (col 9 in Vice), then then instructions all lined up, followed by comments, but I think I'm going to define parameters in the config (or perhaps based upon the terminal size?) to give full control over this. not sure how to go about it yet.

  • daniel_santos

    hmm, looks like another portability issue may be the use of pointer arthritic on types of void*. gcc allows this (as an extension to the C language), but other compilers may not like it (this is in my rbtree_ext.h). Also, use of __builtin_constant_p (on gcc 4.5.3 anyway) appears to always return zero if the supplied type is either an instance of a struct or a pointer to it, even when the data is fully compile-time constant and definitely optimized out.

    This really worried me at first because I thought that gcc wasn't treating it as a compile-time constant in __rbtree_find and __rbtree_insert, thus destroying the entire reason that I wrote the extension library in the 1st place. However, I picked through a few various optimized builds and gcc is indeed optimizing it out, although not quite to the degree I had at first hoped. This doesn't matter in Vice, but for some of my kernel code, it would be an issue.

    So I'm going to remove the use of the gcc builtin entirely.

  • Olaf Seibert
    Olaf Seibert

    It is only a gcc extension if you consider MS Visual Studio to be a C compiler. However, it isn't. It doesn't compile the language defined by the current C standard, not for the last 13 years. Designated initializers are fine in C99 (even the referenced gcc page says so).

  • daniel_santos

    Nice response. :) I've re-added them and I do hope that m$Visual Crap wont choke and die on them (if so, I can remove them, i don't care too much really -- but the thought of anything m$ choking and dying always sounds good).

    I almost forgot to post my new patch, but I think it's hacky. I've added support for arbitrary comments without a label, but I smushed it in together with the label code -- I think that comments, labels/symbols and memory data type definitions should be 3 separate mechanisms.

  • daniel_santos

    OK, I guess this patch deserves a listing of changes & new features.

    * fixed .PC built-in label (which I had broken)
    * added Linux Kernel-style slist_t and list_t (single- and double-linked lists, respectively) with supporting functions {,s}list_init_head, {,s}list_insert, {,s}list_remove and {,s}list_find_and_remove -- most are documented. I also added some iteration macros list_for_each() and list_for_each_obj(). You can use these instead of just putting a "mystruct *next" member in your struct and then you can re-use generic (pre-existing & bug-free) code and not bloat the sources unnecessarily. Modern compilers like the inline keyword and such generic code wont bloat your object files & executables. Just note that these are circular lists and your head must be initialized prior to use.
    * Added support for comments w/o labels. More specifically, if you create a label with the name ".comment" it will store NULL in place of the name and print your comments to the right of the disassembled code.
    * Fixed some bugs (thanks soci)
    * Ate a carrot (because they're healthy)
    * Added a sort function to sort labels by address when you do show_label or save_label

    The last thing I really want to do is to improve layout when disassembling, add some configuration options to control it and have long comments wrap instead of droning on.

    It really needs to have comments, memory data types & labels separated out, but I'm kinda not in the mood for that atm. :)

    Eventually, I would like this to work so that you can have all of your data/code annotations (symbols, comments & data types) saved in a pretty file that you can edit, reload, and go round-trip on. I guess there is also a need for namespaces and perhaps local symbols (i.e., symbols like "loop" that only exist within a certain context or offset range +-127 bytes).

  • Soci/Singular

    I've updated it to current trunk and dumbed it down here and there in hope it compiles now even with VC. But I'm not sure about that as I don't use it ;-) Please check.

  • Soci/Singular

    Add long/dword, "." for comment and data type definition only

  • Soci/Singular

    Implemented interval handling, so that it's possible to disassemble in the middle of a data area. And symbol references in areas can print an offset more than +1.