GCBASIC / Discussion / Open Discussion: Loop Performance

Anobium - 2022-02-19

There are several ways to write a loop in Great Cow BASIC:

for next

do loop

repeat end repeat

These loops all do similar things, but the question was: how does each loop perform?

It is a really complex answer. Each type of loop has many parameters that affect performance, and the compiler optimises the assembly based on the specific conditions in the definition of each loop.

A simple example showing all three types of loops follows. This uses the LGT chipset because I have the greatest control over the frequency but any chip will work.

Code

#option Explicit
#chip LGT8F328P, 1
#include <millis.h>       ' Include the Library

'USART settings for USART1
#define USART_BAUD_RATE 9600
#define USART_TX_BLOCKING
#define USART_DELAY OFF

Dim count as byte
Dim CurMs, LstMs as word  ' declare working variables

' Main
' This loop runs over and over forever.
LstMs = 0
CurMs = 0
Wait 2 s
HSerPrintCRLF

LstMs = millis()
For count = 0 to 254
    'wait 1 ms
Next
CurMs = millis()
HSerPrint CurMs - LstMs
HSerPrintCrlf

LstMs = millis()
count=0
do until count=254
    count++
    'wait 1 ms
loop
CurMs = millis()
HSerPrint CurMs - LstMs
HSerPrintCrlf

LstMs = millis()
count=255
Repeat count
    'wait 1 ms
End Repeat
CurMs = millis()
HSerPrint CurMs - LstMs
HSerPrintCrlf

Yields

8:17:22.869> 2
8:17:22.869> 3
8:17:22.869> 2

Each line shows the timestamp and the number of milliseconds for one loop type: 2 ms for the for-next and the repeat loops, and 3 ms for the do-loop.

Note the frequency (the 1 in the #chip line, i.e. 1 MHz): it is very low in order to get a meaningful result.
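
Because millis() counts whole milliseconds, a reading of 2 ms versus 3 ms carries a large rounding error. A minimal sketch of one way to reduce it, assuming the same #chip, #include <millis.h> and USART lines as the program above: run the loop under test many times and report the total. The constant Passes and the variable passNo are made-up names for illustration, and the outer loop adds its own overhead to every reading.

```
' Sketch only: assumes the #chip, #include <millis.h> and USART lines from the program above
#define Passes 100            ' hypothetical: how many times the loop under test is run

Dim count, passNo as Byte
Dim CurMs, LstMs as Word

LstMs = millis()
For passNo = 1 to Passes
    For count = 0 to 254      ' the loop under test
    Next
Next
CurMs = millis()
HSerPrint CurMs - LstMs       ' total ms for all passes, outer loop overhead included
HSerPrintCRLF
```

Dividing the printed total by Passes gives an average per pass. The same wrapper could be put around the do-loop and the repeat loop so that all three carry the same extra overhead.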

Consider the code.

Remember: each loop type is handled by a different section of the compiler. The methods used are specific to PIC or AVR/LGT, and the generated assembly is heavily optimised per chip family. The optimisation depends on the variables and constants used, on how well the chip handles specific assembly instructions (the performance of the LGT is very different from an AVR), and on the memory access methods of each chip family.

Hugh wrote the compiler sections many years ago, and, I have maintained the level of optimisation as I have revised the compiler. Hugh's work is stunningly good.

Oh boy .. this is complex.

for-next loop. This example uses For count = 0 to 254, where the start value and end value are constants. The compiler handles this by using the constants directly in the assembly. If the start value and end value were variables (byte, word, long) rather than constants, the time to complete the loop would increase, and more so for each larger variable type. If STEP is included, the loop timing increases even more.

The fastest for-next loop? Use constants
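
A minimal sketch of how this could be checked, assuming the same #chip, #include <millis.h> and USART lines as the test program above. The variables startVal and endVal are made-up names for illustration. No result is claimed here, as the timing depends on the chip and compiler version.

```
' Sketch only: assumes the #chip, #include <millis.h> and USART lines from the program above
Dim count, startVal, endVal as Byte   ' startVal and endVal are hypothetical names
Dim CurMs, LstMs as Word

' Constant bounds, as in the original test
LstMs = millis()
For count = 0 to 254
Next
CurMs = millis()
HSerPrint CurMs - LstMs
HSerPrintCRLF

' Same range, but the bounds are held in byte variables
startVal = 0
endVal = 254
LstMs = millis()
For count = startVal to endVal
Next
CurMs = millis()
HSerPrint CurMs - LstMs
HSerPrintCRLF
```

Changing the Dim for startVal and endVal to Word or Long would extend the comparison to the larger variable types.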

do-loop. The user program looks simple, but on test this is the slowest. Why? I am very surprised too. The assembly the compiler generates for this chip uses different branches, and this must account for the performance decrease. I would be interested for others to test this on other chip families, as these branches will perform differently there.

repeat-end repeat. The tests show this performs as well as the for-next loop. The assembly is very different from the other two loops. However, this is the least optimised loop: it has little optimisation for the variable types or constants used. So you can rely on this loop in terms of performance: it will use the same approach every time.

So, now consider a loop over the range 0 to 1024. Which is faster?

1.
for loopvar = 0 to 1024

or

2.
for loopvar1 = 0 to 255
for loopvar2 = 0 to 3

or

3.
do ... loop while loopvar1 < 1025

Well, it is 2, the nested for-next loops: the loop variables are bytes and the bounds are constants, so the code is highly optimised. Hence, the GLCD code uses nested for-next loops to increase performance.
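
Written out in full, option 2 might look like the sketch below. This is an illustration rather than the actual GLCD code, and it assumes a compiler version in which a byte loop to 255 terminates correctly. Note that the nested pair runs the body 1024 times, whereas for loopvar = 0 to 1024 runs it 1025 times.

```
' Sketch only: both counters are bytes and all bounds are constants
Dim loopvar1, loopvar2 as Byte

For loopvar2 = 0 to 3          ' 4 outer passes
    For loopvar1 = 0 to 255    ' 256 inner passes
        ' loop body goes here: runs 4 x 256 = 1024 times
    Next
Next
```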

Hope this helps. An interesting but complex subject.

And, I hope the analysts out there wade into the specifics of each chip to dig even deeper into the performance of each instruction.

Last edit: Anobium 2022-02-19

Anobium - 2022-02-19

Same program, same frequency, different chip: an 18F27Q84.

6ms
7ms
3ms

Interesting.


Anobium - 2022-02-19

Same program, same frequency, different chip: an 18F25K50.

4ms
4ms
2ms

Even more interesting

Same program, same frequency, different chip: an 18F24K42.

6ms
8ms
3ms

But I cannot try this using an Arduino: it is stuck at 16 MHz!

Last edit: Anobium 2022-02-19

- Anobium - 2022-02-19
  
  Same program, same frequency, different chip: a 16F1778. Go figure this one!!
  
  3ms
  4ms
  1ms
  
  Last edit: Anobium 2022-02-19
  

I tried this and the for-next took 2233 ms;
the do-loop took 2075 ms.
The do-loop was always faster.
I'll try repeat.

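#chip mega328p, 16
#include <millis.h>   ' needed for millis(); not shown in the original post
' GLCD setup (include and display defines) is also not shown in the original post
dim count1,count2 as Word
Dim CurMs, LstMs,totalms as word
' Test 1: nested for-next loops with word counters
' note: 0 to 1000 is 1001 passes per loop; the do-loops below run 1000 passes
CurMs = millis()
for count1=0 to 1000
  for count2=0 to 1000
  next count2
next count1
LstMs=millis()
totalms=LstMs-CurMs
GLCDPrint 0,0,str(totalms)
;
' Test 2: nested do-loops with word counters
count1=0:count2=0
CurMs = millis()
do until count1=1000
  count1++
  count2=0
  do until count2=1000
    count2++
  loop
loop
LstMs=millis()
totalms=LstMs-CurMs
GLCDPrint 0,24,str(totalms)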

stan cartwright - 2022-02-19

I tried repeat and it took 566 ms ... 4 times faster?!
Can repeat be nested?
Does the compiler check whether the repeat count is <= 255 and use a byte, and otherwise use a word?

CurMs = millis()
Repeat 1000
  Repeat 1000
  end repeat
end repeat
LstMs=millis()
totalms=LstMs-CurMs
GLCDPrint 0,48,str(totalms)

Last edit: stan cartwright 2022-02-19

stan cartwright - 2022-02-19

I tried this and it took 2228 ms.

count1=0
CurMs = millis()
loop1:
count2=0
loop2:
count2++
if count2<1000 then goto loop2
count1++
if count1<1000 then goto loop1
LstMs=millis()
totalms=LstMs-CurMs
GLCDPrint 0,64,str(totalms)

stan cartwright - 2022-02-19

I should try byte values for count1 and count2.
Using Repeat seems a clear winner for speed from what I found.
I've never used Repeat but think I will as sometimes I need code to run as fast as possible.


stan cartwright - 2022-02-19

Thinking about it though, only for-next gives you a counter value you can use to index arrays.
do-loop and repeat would need an extra variable as an index, and it would need incrementing, so in practical terms it depends.
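
A minimal sketch of the difference, using made-up names (samples, idx) and an arbitrary array size of 16:

```
' Sketch only: samples and idx are hypothetical names, 16 is an arbitrary size
Dim samples(16) as Byte
Dim idx as Byte

' for-next: the loop counter doubles as the array index
For idx = 1 to 16
    samples(idx) = 0
Next

' repeat: the index has to be set up and incremented by hand
idx = 1
Repeat 16
    samples(idx) = 0
    idx++
End Repeat
```

Whether repeat still comes out ahead once the extra increment is inside the loop would need to be timed on the target chip.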

- Anobium - 2022-02-19
  
  Told you it was complex!
  

stan cartwright - 2022-02-19

Nothing's complex in GCB ;) That's why we use it!
No-one knows what the compiler does with our code.
I stick with the Arduino 328 for consistency, so I have a working (I hope) target model.
It's fast enough for most stuff, but sometimes I try to get it to be as efficient/fast as possible.
And oh dear, I did use repeat, I just found out. I had set out to optimise the ILI9341 code because it has lots of for-next loops. This was ripped from the GLCD code.
Make sprite_size a constant, not a variable.

sub erase_sprite (sprite_x,sprite_y,sprite_width,sprite_height,sprite_size) ;fills window background colour
  SetAddressWindow_ILI9341 ( sprite_x,sprite_y,sprite_x +sprite_width-1,sprite_y +sprite_height-1 )
  repeat sprite_size
    SendWord_ILI9341 GLCDBackground
  end repeat
end sub

Last edit: stan cartwright 2022-02-19

Jerry Messina - 2022-02-21

Same program ... same frequency.. different chip ,, an 18F27Q84
Same program ... same frequency.. different chip ,, an 18F25K50
Same program ... same frequency.. different chip ,, an 18F24K42

That is interesting. I would expect all of them to produce pretty much the same timing.

The K50 is a "classic 18F", and seems to be on par with the 16F1778 timing, give or take.
For simple loop code that's probably to be expected.

The K42 and Q84 are the new "xv18" core devices, and appear to be almost twice as slow!
Is the asm radically different than the K50? In theory it shouldn't be...

- Anobium - 2022-02-21
  
  Jerry it would take a much larger analysis to compare - I was testing the results.
  
  Many things are different, interrupt cache being one of the major differences. The newer chips have smart interrupt caching, but, this may come at the cost of performance?
  

Jerry Messina - 2022-02-21

The newer chips have smart interrupt caching, but, this may come at the cost of performance?

If anything I would think that should make them faster since all of the cpu regs are saved in a single cycle. Plus, doesn't the millis() tick operate at 1ms? If so, that would only be a handful of interrupts during the test.

Do the new chips use a lot of MOVFFL instructions? That might account for the difference, but I wouldn't think that code would need to use it since everything should be within the MOVFF range.

- Anobium - 2022-02-21
  
  Re MOVFFL nope... some new chips need this always. Found this last week.
  
  It would be a good project for an Intern!
  

Jerry Messina - 2022-02-21

some new chips need this always.

Only if the src/dest is outside the 4K range of MOVFF. MOVFFL is larger and slower, so I prefer to only use it as req'd.

The initial xv18 chips (ie K42, K83) put the SFR registers way at the top of memory, so MOVFFL was needed. The newer ones moved them down to bank 0 where MOVFF can get to them. Much better IMHO.

Any interns hanging around??

- Anobium - 2022-02-21
  
  Q43 needs it regardless.
  

stan cartwright - 2022-02-21

I have a PIC 18F25K22 to test, as it runs at 64 MHz, and don't PICs execute an instruction every 4 clocks?
That makes it the nearest to a mega328p at 16 MHz, which executes an instruction every clock.
Or is there no comparison?
Just wondered how much the 328p is optimised. GCB was always PIC-oriented, IMHO.
Was GCB ported from PIC to the 328p, or was it written from the ground up?

- Anobium - 2022-02-21
  
  AVR is optimised, and so is the PIC.
  
  I do not know which came first. Back in 2009 the code supported AVR and PIC.... so, my guess from day one.
  

stan cartwright - 2022-02-21

Good. Arduino came out in 2005, so I'm hoping the 328p was taken seriously. By the time GCB came out Arduino was established, so hopefully someone thought "let's write for that" rather than "oh well, we'll have to support it" without putting in the same effort as for PICs.
I like the 328p and GCB. No probs, well happy, not moaning.

- Anobium - 2022-02-21
  
  No moaning, no probs.
  
  Uber fast UNO coding..... the new IDE
  
  YouTube for you Stan
  
  https://youtu.be/095AIvr7b_A
  
  - stan cartwright - 2022-02-21
    
    Cool, looks interesting. thanks. When I installed it I backed up gcb.
    I will now need latest version. What version of GCB am I getting please.
    I don't have a clue about this. I thought it was about using alternate ide like Geany... wrong!
    
  - stan cartwright - 2022-02-21
    
    In the 1st video the 328p needed no ,16 (i.e. not 328p,16).
    In the 2nd video it's 328p,16.
    

Jerry Messina - 2022-02-21

Q43 needs it regardless.

Evan, is there some unpublished errata/special conditions for this? I've used the Q43 w/MOVFF and it seems to work fine.

- Anobium - 2022-02-21
  
  Really sorry, I was incorrect: it is the Q40 and Q41.
  
