Opportunity for optimization, Z80

2014-03-20
2014-03-22
  • Robert Baruch
    2014-03-20

    I used --opt-code-speed --max-allocs-per-node 100000 to see what SDCC would do with this:

    volatile __at (0x3C00) char videoRam[0x300];
    
    // Video is 64x16 characters.
    
    static short x = 0;
    static short y = 0;
    
    void putchar(char c) {
        videoRam[y * 64 + x] = c;
    }
    

    Here's the result:

       0000                      58 _putchar:
       0000 DD E5         [15]   59         push    ix
       0002 DD 21 00 00   [14]   60         ld      ix,#0
       0006 DD 39         [15]   61         add     ix,sp
       0008 ED 5Br02r00   [20]   63         ld      de,(_y)
       000C CB 23         [ 8]   64         sla     e
       000E CB 12         [ 8]   65         rl      d
       0010 CB 23         [ 8]   66         sla     e
       0012 CB 12         [ 8]   67         rl      d
       0014 CB 23         [ 8]   68         sla     e
       0016 CB 12         [ 8]   69         rl      d
       0018 CB 23         [ 8]   70         sla     e
       001A CB 12         [ 8]   71         rl      d
       001C CB 23         [ 8]   72         sla     e
       001E CB 12         [ 8]   73         rl      d
       0020 CB 23         [ 8]   74         sla     e
       0022 CB 12         [ 8]   75         rl      d
       0024 FD 2Ar00r00   [20]   76         ld      iy,(_x)
       0028 FD 19         [15]   77         add     iy, de
       002A 11 00 3C      [10]   78         ld      de,#_videoRam
       002D FD 19         [15]   79         add     iy, de
       002F DD 7E 04      [19]   80         ld      a,4 (ix)
       0032 FD 77 00      [19]   81         ld      0 (iy), a
       0035 DD E1         [14]   82         pop     ix
       0037 C9            [10]   83         ret
       0038                      84 _putchar_end::
    

    I thought I could do better, knowing that indirect hl access can be faster than indirect ix/iy access, plus the add hl,hl trick:

       0038                      90 _putchar2:
       0038 DD E5         [15]   92         push ix
       003A DD 21 00 00   [14]   93         ld ix,#0
       003E DD 39         [15]   94         add ix,sp
       0040 2Ar02r00      [16]   95         ld hl,(_y)
       0043 29            [11]   96         add hl,hl
       0044 29            [11]   97         add hl,hl
       0045 29            [11]   98         add hl,hl
       0046 29            [11]   99         add hl,hl
       0047 29            [11]  100         add hl,hl
       0048 29            [11]  101         add hl,hl
       0049 ED 5Br00r00   [20]  102         ld de,(_x)
       004D 19            [11]  103         add hl,de
       004E 11 00 3C      [10]  104         ld de,#_videoRam
       0051 19            [11]  105         add hl,de
       0052 DD 7E 04      [19]  106         ld a,4 (ix)
       0055 77            [ 7]  107         ld (hl),a
       0056 DD E1         [14]  108         pop ix
       0058 C9            [10]  109         ret
       0059                     110 _putchar2_end::
    

    putchar takes 282 cycles and 56 bytes, while putchar2 takes only 228 cycles (19% fewer cycles) and 33 bytes (41% fewer bytes).

    Does anyone see anything obviously wrong (for example, hl is reserved and I should have saved it, which would cost 21 more cycles for a push hl / pop hl pair)? What would it take to put this sort of optimization into SDCC?

     
    Last edit: Robert Baruch 2014-03-20
    • On 2014-03-20 06:16, Robert Baruch wrote:

      I used --opt-code-speed --max-allocs-per-node 100000 to see what SDCC
      would do with this:

      volatile __at (0x3C00) char videoRam[0x300];

      // Video is 64x16 characters.

      static short x = 0;
      static short y = 0;
      static char processingCommand = 0;

      void putchar(char c) {
          /*if (c == 0x27 && !processingCommand) {
              processingCommand = !0;
              return;
          }
          */
          videoRam[y * 64 + x] = c;
      }

      Here's the result:

      0000                      58 _putchar:
      0000 DD E5         [15]   59         push    ix
      0002 DD 21 00 00   [14]   60         ld      ix,#0
      0006 DD 39         [15]   61         add     ix,sp
      0008 ED 5Br02r00   [20]   63         ld      de,(_y)
      000C CB 23         [ 8]   64         sla     e
      000E CB 12         [ 8]   65         rl      d
      0010 CB 23         [ 8]   66         sla     e
      0012 CB 12         [ 8]   67         rl      d
      0014 CB 23         [ 8]   68         sla     e
      0016 CB 12         [ 8]   69         rl      d
      0018 CB 23         [ 8]   70         sla     e
      001A CB 12         [ 8]   71         rl      d
      001C CB 23         [ 8]   72         sla     e
      001E CB 12         [ 8]   73         rl      d
      0020 CB 23         [ 8]   74         sla     e
      0022 CB 12         [ 8]   75         rl      d
      0024 FD 2Ar00r00   [20]   76         ld      iy,(_x)
      0028 FD 19         [15]   77         add     iy, de

      This addition is what made sdcc not use hl: it has a global variable
      operand and is 16 bits or wider. In some such cases, hl is needed to
      hold the address of the global operand. These checks are found in
      HLinst_ok() in ralloc2.cc. I had another look at this situation and
      now allow hl to be used in some more cases, including your example.
      Since we are just before a release, I made the change in the sdcc-stm8
      branch, which will be merged after the 3.4.0 release.

      002A 11 00 3C      [10]   78         ld      de,#_videoRam
      002D FD 19         [15]   79         add     iy, de
      002F DD 7E 04      [19]   80         ld      a,4 (ix)
      0032 FD 77 00      [19]   81         ld      0 (iy), a
      0035 DD E1         [14]   82         pop     ix
      0037 C9            [10]   83         ret
      0038                      84 _putchar_end::

      I thought I could do better, knowing that indirect hl access can be
      faster than indirect ix/iy access, plus the add hl,hl trick:

      0038                      90 _putchar2:
      0038 DD E5         [15]   92         push ix
      003A DD 21 00 00   [14]   93         ld ix,#0
      003E DD 39         [15]   94         add ix,sp
      0040 2Ar02r00      [16]   95         ld hl,(_y)
      0043 29            [11]   96         add hl,hl
      0044 29            [11]   97         add hl,hl
      0045 29            [11]   98         add hl,hl
      0046 29            [11]   99         add hl,hl
      0047 29            [11]  100         add hl,hl
      0048 29            [11]  101         add hl,hl
      0049 ED 5Br00r00   [20]  102         ld de,(_x)
      004D 19            [11]  103         add hl,de
      004E 11 00 3C      [10]  104         ld de,#_videoRam
      0051 19            [11]  105         add hl,de
      0052 DD 7E 04      [19]  106         ld a,4 (ix)
      0055 77            [ 7]  107         ld (hl),a
      0056 DD E1         [14]  108         pop ix
      0058 C9            [10]  109         ret
      0059                     110 _putchar2_end::

      putchar takes 282 cycles and 56 bytes, while putchar2 takes only 228
      cycles (19% fewer cycles) and 33 bytes (41% fewer bytes).

      And with the small change in ralloc2.cc this is exactly what sdcc
      generates. If you want to try the version with the change:

      svn co https://svn.code.sf.net/p/sdcc/code/branches/sdcc-stm8/sdcc sdcc-stm8

      Philipp

       
  • Robert Baruch
    2014-03-22

    I'm impressed -- it worked and came up with the exact same code as I did.