Menu

#660 Stack overflow in network code

Always occurs!
open
None
5
2020-11-29
2020-05-03
No

Network code results in an immediate stack overflow due to an out of bounds write. It crashes very often on big endian, but is caught far less on x86. The issue is an array is being written to (net_netnodes) in d_net.c on line 1730 with an offset of doomcom->remotenode (which seems to be -1 by this point). The check this code is written in only changes that remotenode < MAXNETNODES. I think either the defaults for this need to change (to 0), or the node creating the game server probably isn't supposed to populate this (meaning the check is invalid). With address sanitizers, here'st the output:

==14441==ERROR: AddressSanitizer: global-buffer-overflow on address 0x00001067b66c at pc 0x000010436800 bp 0x3ffff542a220 sp 0x3ffff542a290
WRITE of size 4 at 0x00001067b66c thread T0
    #0 0x104367fc in HGetPacket /home/adam/doomlegacy-svn-latest/src/d_net.c:1730
    #1 0x1042df48 in Net_Packet_Handler /home/adam/doomlegacy-svn-latest/src/d_clisrv.c:5206
    #2 0x1042fff8 in NetUpdate /home/adam/doomlegacy-svn-latest/src/d_clisrv.c:5803
    #3 0x1042ee18 in TryRunTics /home/adam/doomlegacy-svn-latest/src/d_clisrv.c:5596
    #4 0x10447f38 in D_DoomLoop /home/adam/doomlegacy-svn-latest/src/d_main.c:966
    #5 0x10006d70 in main sdl/i_main.c:86
    #6 0x3fffaf8eb400  (/lib64/libc.so.6+0x48400)
    #7 0x3fffaf8eb64c in __libc_start_main (/lib64/libc.so.6+0x4864c)

0x00001067b66c is located 12 bytes to the right of global variable 'ackpak' defined in 'd_net.c:267:17' (0x10664960) of size 93440
0x00001067b66c is located 20 bytes to the left of global variable 'net_nodes' defined in 'd_net.c:300:18' (0x1067b680) of size 3072
SUMMARY: AddressSanitizer: global-buffer-overflow /home/adam/doomlegacy-svn-latest/src/d_net.c:1730 in HGetPacket
Shadow bytes around the buggy address:
  0x0200020cf670: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0200020cf680: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0200020cf690: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0200020cf6a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0200020cf6b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x0200020cf6c0: 00 00 00 00 00 00 00 00 00 00 00 00 f9[f9]f9 f9
  0x0200020cf6d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0200020cf6e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0200020cf6f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0200020cf700: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0200020cf710: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==14441==ABORTING

To reproduce, simply start a network coop game (though any mode might work.
Testing was done in PPC64 Linux, with svn revision 1518.

Discussion

  • Wesley Johnson

    Wesley Johnson - 2020-05-05

    When there is no packet, I_NetGet sets doomcom->remotenode == -1 .
    The functions that do this also return an NE_xx error message.
    The dummy function Internal_Get(), is used when there is no network game, and it returns NE_not_netgame. This error indication got moved around and ended up right after NE_empty,
    as it also indicates that there is no packet. The tests were inadequate to detect this, as both error tests missed it.
    Fixed SVN 1522.

    I have tested network games extensively and without netgame too, and I never got this to happen.
    So, good catch, in that you caught something that I missed, and I probably would not have caught it.
    But, what are you doing different, and what did the endian have to do with it?
    I suspect this bug report is not done yet. The real problem is just going to move the error somewhere else.

    As we are about 1 or 2 days from release of the next version, I have to leave this to you to investigate.

     
    • Adam Stylinski

      Adam Stylinski - 2020-05-05

      I wasn't doing anything differently, I think memory layout was just
      different enough that the stack boundary caused a segfault. I managed
      to make the issue happen a couple of times on x86, but I could
      reliably get the crash to happen on big endian PowerPC.

      Now I get further, but when I try to connect to an x86 machine from a
      powerpc client, it just hangs, with the dialog that says "waiting for
      game to start". The UI is then unresponsive.

       

      Last edit: Wesley Johnson 2020-05-05
      • Adam Stylinski

        Adam Stylinski - 2020-05-05

        So part of the issue seems to be that it's "failing" to find
        legacy.wad, despite them being identical on both systems (I did an md5
        on both ends, they are identical). I however modified the code that
        searches for the files so that the status was set to FS_OPEN (as it
        already was). The check for the file failing is obviously another
        bug, probably a simple endianness bug with md5 checking, but I'm more
        concerned with the fact that it doesn't make it into the game if we
        get past this.

         

        Last edit: Wesley Johnson 2020-05-05
  • Adam Stylinski

    Adam Stylinski - 2020-05-05

    Ok, so I managed to make this work, but only by changing "waitplayers" to 2. It seems that setting "waitplayers" to 0 does not work. That's perhaps a different bug. I'll look into your endianness bug (I suspect your memcmp here is wrong, I think one of those needs to be flipped to match endianness, as the md5 checksum is a 128 bit integer).

     
  • Wesley Johnson

    Wesley Johnson - 2020-05-05

    Trying to connect a big-endian and x86 machine in netplay is new untested capability, a work-in-progress. I worked 3 or 4 weeks on the wait-for-game-start just to get it to work with two x86 machines. The main problems is that savegames are endian sensitive, so a normal join does not work.

    The UI hang is due to a synchronization point waiting for a particular message from the client, and it is not getting it. If the client would just die the server would notice and drop it. As long as the client keeps responding to tic messages, it is seen as still alive. In spite of timers, and a timeout mechanism on this synchronization mechanism, it still manages to hang. You will need a fully instrumented message reporting (debugging printf) to diagnose that hang.

    This is already something that I will probably investigate sooner or later, and devise a more reliable method of synchronization. Perhaps after a few months of not having to look at that same code every day. Anything you can discover will be helpful, but you should realize that any code that you submit as a fix, will be a temporary patch.

    As that feature is currently not documented, you may be suffering from operator error also.
    The wait-for-game-start player is at the mercy of the other players (especially the server) starting the game, and exiting a level. During the intermission of an exit level the waiting client is sent the critical join messages (when a savegame is not needed). It is expected to update itself and eventually satisfy the conditions that allow the server to continue. I think the real problem is that it does not actually use a particular explicit message to do that (too much shoe-horning this on to existing messaging and too expecting things to fall into place). Something unexpected is happening, and it is very likely related to receiving packets unexpectedly, too soon, or too late, in relation to some other event, and then not doing what was expected. Trouble is, all these timing issues are completely silent, so detection and debugging is a real pain.

    Changes to network packets would require a new network version, and would have to wait
    for the next major release of DoomLegacy (1.49), which is why I worked so hard to get all the needed packet structures into release 1.48. For now, fixes cannot touch the packet structures.

     
  • Wesley Johnson

    Wesley Johnson - 2020-05-05

    Your postings are auto-repeating the entire bug report into the bug reporting system (3 or 4 pages worth).
    I deleted those portions of your postings (see EDIT).
    Please disable auto-reply.

     
    • Adam Stylinski

      Adam Stylinski - 2020-05-05

      Sorry about that, using gmail's reply appends the message and hides it, so it's easy to forget. Replying here with sourceforge, directly. For the MD5 comparison to work in an endian independent manner, either:

      1.) The packet needs to be modified to send the string representation of the hash
      or
      2.) There needs to be an endianness flag tacked onto the packet with a conditional byte swap for what I'm guessing are 4 integers internally
      or
      3.) The worst of both worlds, assume an endianness for the sum and on the non-native endianness convert on both send and receive.

      As far as getting things working between clients, I'm perfectly fine with helping you debug and come up with a patch post release and have the fixes sitting in trunk (I follow trunk, anyway, rather than release).

       
  • Wesley Johnson

    Wesley Johnson - 2020-05-06

    There already is an endian flag in the JOIN packet, NF_big_endian.

    I had looked at one of those calculated hash previously. It is calculated byte by byte and stored in an array.
    There are no endian sensitive integers in it. It does not matter about intermediate results. If all the values to be transported are bytes, then those bytes can be transported without endian issues.
    That array should be transportable without needed conversion. The order of an array in memory is not endian-sensitive either.

    If there was a conversion needed, then the format of the packet should be fixed, and everybody convert their md5 to meet it.
    The order of the bytes in fileneed_t, and md5sum, is fixed by the packet format.

    Does the big-endian machine actually calculate a differnt md5 for the legacy.wad ??
    Could that be due to not taking into account that legacy.wad is little-endian ??

     
  • Wesley Johnson

    Wesley Johnson - 2020-11-20

    What is current status of this issue on your machines ?
    It is two releases old.

     
  • Adam Stylinski

    Adam Stylinski - 2020-11-29

    I had a water pump fail on the G5. I've since converted it to air cooling but haven't had much time to try this out again. I'll let you know sometime next week.

     

Log in to post a comment.

Want the latest updates on software, tech news, and AI?
Get latest updates about software, tech news, and AI from SourceForge directly in your inbox once a month.