GCBASIC / Discussion / Open Discussion: Produce "DMA optimized" with new 18FXXQ43 PICs

ikonsgr74 - 2020-12-27

I was taking a look at the features of the new 18FXXQ43 family, and one that looks very promising for boosting performance is the existence of 6 DMA controllers that can be used for data transfers to SFR/GPR spaces from either Program Flash Memory, Data EEPROM or SFR/GPR spaces.
I suppose that Cow Basic functions like readtable or hsersend/hserreceive could benefit a lot from utilizing DMA.
So I was wondering, is this new feature utilized by the compiler, and if not, is there any schedule for adding it in the future, so that Cow Basic produces "DMA optimized" code whenever a PIC equipped with it is used?

Last edit: ikonsgr74 2020-12-27

Anobium - 2020-12-27

As the new release candidates support the Q43s you can investigate the methods that need to be adapted to support DMA.

Bill Roth is working on the DMA on a project and he will be sharing his insights in the coming days. But the essentials are already in the compiler; we may need to tweak it to enable DMA.

ikonsgr74 - 2020-12-27

Great! If you remember, i'm the guy who had some issues with large tables, on a rather big project (currently using a 18F47Q10):
https://sourceforge.net/p/gcbasic/discussion/596084/thread/5398dd8bc1/?limit=25&page=0

Do you think that "DMA supported" methods will be available with the coming release of Cow Basic then? I'm heavily using the readtable method in my project (btw I suppose that access to large table variables implemented in RAM could also benefit from DMA, right?), and if DMA can offer a significant speed gain, it will surely boost the performance a lot!
Take a look here if you want, to see a small presentation i made.

Last edit: ikonsgr74 2020-12-27

- mkstevo - 2020-12-28
  
  That disk interface you have made is incredible. Congratulations, I'm very, very impressed.
  
  - ikonsgr74 - 2020-12-29
    
    Thanks my friend! True, especially the 765 Floppy Disk Controller low level emulation was a really tough job to do, but thanks to covid19 quarantines and plenty of free time, I've managed to make it work! :-)
    I may also upload the CB code for this project in a new topic! ;-)
    

@Anobium, I took a more thorough look at the DMA details in the 18F47Q43 datasheet. I don't know if I get this right, but it seems that the utilization of DMA controllers can have a MAJOR impact on performance!
For example, I took a look at the asm code of my project generated by Cow Basic, regarding the interrupt triggered on receiving a byte from the hardware UART module (On Interrupt UsartRX2Ready Call readUSART):

Interrupt
;Save Context
    movff   WREG,SysW
    movff   STATUS,SysSTATUS
    movff   BSR,SysBSR
;Store system variables
    movff   FSR0L,SaveFSR0L
    movff   FSR0H,SaveFSR0H
    movff   SysWordTempB,SaveSysWordTempB
    movff   SysWordTempB_H,SaveSysWordTempB_H
    movff   SysWordTempA,SaveSysWordTempA
    movff   SysWordTempA_H,SaveSysWordTempA_H
    movff   SysByteTempX,SaveSysByteTempX
;On Interrupt handlers
    banksel PIE3
    btfss   PIE3,RC2IE,BANKED
    bra NotRC2IF
    btfss   PIR3,RC2IF,BANKED
    bra NotRC2IF
    banksel 0
    call    READUSART
    banksel PIR3
    bcf PIR3,RC2IF,BANKED
    bra INTERRUPTDONE
NotRC2IF
;User Interrupt routine
INTERRUPTDONE
;Restore Context
;Restore system variables
    movff   SaveFSR0L,FSR0L
    movff   SaveFSR0H,FSR0H
    movff   SaveSysWordTempB,SysWordTempB
    movff   SaveSysWordTempB_H,SysWordTempB_H
    movff   SaveSysWordTempA,SysWordTempA
    movff   SaveSysWordTempA_H,SysWordTempA_H
    movff   SaveSysByteTempX,SysByteTempX
    movff   SysW,WREG
    movff   SysSTATUS,STATUS
    movff   SysBSR,BSR
    retfie  0
    banksel 0

As you can see, dozens of instructions are needed to save the CPU state before executing the interrupt routine (which moves a byte from the UART input buffer to a large buffer table variable in RAM) and to restore it after finishing.
Is it correct to assume that, if a DMA controller is used to service the interrupt, there is no need to save/restore the CPU state?
Moreover, since the actual code of the interrupt routine:

  READUSART
;buffer(next_in) = HSerReceive2
    call    FN_HSERRECEIVE2
    lfsr    0,BUFFER
    movf    NEXT_IN,W,ACCESS
    addwf   AFSR0,F,ACCESS
    movf    NEXT_IN_H,W,ACCESS
    addwfc  AFSR0_H,F,ACCESS
    movff   HSERRECEIVE2,INDF0
;next_in = ( next_in + 1 )
    incf    NEXT_IN,F,ACCESS
    btfsc   STATUS,Z,ACCESS
    incf    NEXT_IN_H,F,ACCESS
;IF (NEXT_IN>BUFFER_SIZE) Then
    movff   NEXT_IN,SysWORDTempB
    movff   NEXT_IN_H,SysWORDTempB_H
    movlw   28
    movwf   SysWORDTempA,ACCESS
    movlw   12
    movwf   SysWORDTempA_H,ACCESS
    call    SysCompLessThan16
    btfss   SysByteTempX,0,ACCESS
    bra ENDIF234
;NEXT_IN=1
    movlw   1
    movwf   NEXT_IN,ACCESS
    clrf    NEXT_IN_H,ACCESS
;END IF
ENDIF234
;next_in = ( next_in + 1 ) % BUFFER_SIZE
;DIR PORTA.5 IN
    return

will be modified for DMA utilization, maybe it will be faster to execute too?
(from what I read in the datasheet, you only need to set a bunch of DMA registers and then the actual DMA transfer of 1 byte takes only 2 instructions!)
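
For reference, the receive routine in the listing above boils down to the logic below. This is a minimal host-side sketch in C, not GCBASIC output. The names `rx_buffer`, `next_in` and `store_rx_byte` are illustrative, the starting index of 1 is an assumption, and BUFFER_SIZE = 3100 is inferred from the two `movlw` constants in the listing (28 + 12 × 256).

```c
#include <stdint.h>

#define BUFFER_SIZE 3100u  /* inferred from movlw 28 / movlw 12: 28 + 12*256 */

/* Index 0 is unused so the 1-based indexing of the BASIC code is kept. */
static uint8_t rx_buffer[BUFFER_SIZE + 1u];
static uint16_t next_in = 1u;  /* assumed starting position */

/* Mirrors READUSART: store the byte, advance the index, wrap past the end. */
void store_rx_byte(uint8_t byte)
{
    rx_buffer[next_in] = byte;    /* buffer(next_in) = HSerReceive2 */
    next_in = next_in + 1u;       /* next_in = next_in + 1 */
    if (next_in > BUFFER_SIZE) {  /* IF next_in > BUFFER_SIZE THEN */
        next_in = 1u;             /*   next_in = 1 */
    }
}
```

Everything in this function is work the CPU currently does once per received byte, on top of the context save and restore. Which parts a DMA channel could take over is something to confirm against the datasheet.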

Btw, i just ordered a PICKIT4 and a couple of 18F47Q43 from Microchip direct, so when i get them, i might be able to give you extra feedback on DMA testing! ;-)

Last edit: ikonsgr74 2020-12-29

And here is some example code I found in the datasheet:
This code example illustrates using DMA1 to transfer 10 bytes of data from 0x1000 in Flash memory to the UART transmit buffer.

void initializeDMA(){
//Select DMA1 by setting DMASELECT register to 0x00
 DMASELECT = 0x00;
//DMAnCON1 - DPTR remains, Source Memory Region PFM, SPTR increments, SSTP
 DMAnCON1 = 0x0B;
//Source registers
//Source size
 DMAnSSZH = 0x00;
 DMAnSSZL = 0x0A;
//Source start address, 0x1000
 DMAnSSAU = 0x00;
 DMAnSSAH = 0x10;
 DMAnSSAL = 0x00;
//Destination registers
//Destination size
 DMAnDSZH = 0x00;
 DMAnDSZL = 0x01;
//Destination start address,
 DMAnDSA = &U1TXB;
//Start trigger source U1TX. Refer the datasheet for the correct code
 DMAnSIRQ = 0xnn;
//Change arbiter priority if needed and perform lock operation
 DMA1PR = 0x01; // Change the priority only if needed
 PRLOCK = 0x55; // This sequence
 PRLOCK = 0xAA; // is mandatory
 PRLOCKbits.PRLOCKED = 1; // for DMA operation
//Enable the DMA & the trigger to start DMA transfer
 DMAnCON0 = 0xC0;
}

So, it seems that any routine implementation using DMA is practically only a bunch of DMA register settings! ;-)
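
To see what the two control values in the datasheet example encode, here is a small host-side C sketch that splits them into fields. The bit positions are assumed from the DMAnCON1 and DMAnCON0 register definitions (DMODE in bits 7:6, DSTP in bit 5, SMR in bits 4:3, SMODE in bits 2:1, SSTP in bit 0; EN in bit 7 and SIRQEN in bit 6) and should be checked against the datasheet. The struct and function names are made up for illustration.

```c
#include <stdint.h>

/* Fields of DMAnCON1, using the assumed bit layout described above. */
typedef struct {
    uint8_t dmode;  /* destination pointer mode */
    uint8_t dstp;   /* stop when destination counter reloads */
    uint8_t smr;    /* source memory region */
    uint8_t smode;  /* source pointer mode */
    uint8_t sstp;   /* stop when source counter reloads */
} dma_con1_fields;

dma_con1_fields decode_dma_con1(uint8_t value)
{
    dma_con1_fields f;
    f.dmode = (value >> 6) & 0x03u;
    f.dstp  = (value >> 5) & 0x01u;
    f.smr   = (value >> 3) & 0x03u;
    f.smode = (value >> 1) & 0x03u;
    f.sstp  = value & 0x01u;
    return f;
}

/* The two DMAnCON0 bits used by the example: enable and start-trigger enable. */
uint8_t dma_con0_en(uint8_t value)     { return (value >> 7) & 0x01u; }
uint8_t dma_con0_sirqen(uint8_t value) { return (value >> 6) & 0x01u; }
```

Under that layout, `decode_dma_con1(0x0B)` gives DMODE = 0, SMR = 1, SMODE = 1 and SSTP = 1, which lines up with the comment "DPTR remains, Source Memory Region PFM, SPTR increments, SSTP", and 0xC0 sets both EN and SIRQEN.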

This is pretty simple to write in Great Cow BASIC. A few changes are needed to support the word pointer addresses (by using the alias).

#chip 18f26Q43


initializeDMA
'do stuff
end

    sub initializeDMA

    'create a word alias to support the word pointer address
     dim DMAnDSAWord as word alias DMAnDSAH, DMAnDSAL

    //Select DMA1 by setting DMASELECT register to 0x00
     DMASELECT = 0x00;
    //DMAnCON1 - DPTR remains, Source Memory Region PFM, SPTR increments, SSTP
     DMAnCON1 = 0x0B;
    //Source registers
    //Source size
     DMAnSSZH = 0x00;
     DMAnSSZL = 0x0A;
    //Source start address, 0x1000
     DMAnSSAU = 0x00;
     DMAnSSAH = 0x10;
     DMAnSSAL = 0x00;
    //Destination registers
    //Destination size
     DMAnDSZH = 0x00;
     DMAnDSZL = 0x01;
    //Destination start address,
'change to word pointer - as this would have only pointed to the lower address byte of U1TXB
     DMAnDSAWord = @U1TXB;
    //Start trigger source U1TX. Refer the datasheet for the correct code
     DMAnSIRQ = 0xnn;
    //Change arbiter priority if needed and perform lock operation
     DMA1PR = 0x01; // Change the priority only if needed
     PRLOCK = 0x55; // This sequence
     PRLOCK = 0xAA; // is mandatory
'remove  PRLOCKbits.
     PRLOCKED = 1; // for DMA operation
    //Enable the DMA & the trigger to start DMA transfer
     DMAnCON0 = 0xC0;
    end sub

I used the latest release candidate (RC34) and PICInfo to figure out that I needed to create the alias. The pointer assignment was:

//Destination start address,
 DMAnDSA = &U1TXB;

This would fail as & is invalid, and the assignment would only move the low byte of the address of U1TXB (in Great Cow BASIC) as DMAnDSA is a byte (address 240/0x00F0).

With the change of & to @, this yields in the assembly:

;DMAnDSA = @U1TXB;
    movlw   low(U1TXB)
    movwf   DMANDSA,BANKED

So, the changes:

Create a word alias and then use a similar assignment.

'create a word alias to support the word pointer address
 dim DMAnDSAWord as word alias DMAnDSAH, DMAnDSAL

creates the word at the correct address, as follows:

;Alias variables
DMANDSAWORD EQU 240
DMANDSAWORD_H   EQU 241

And, the assignment.

;DMAnDSAWord = @U1TXB;

This yields the following assembly, showing the low and high address bytes being loaded into the correct DMA registers:

;Destination start address,
;change to word pointer - as this would have only pointed to the lower address byte of U1TXB
;DMAnDSAWord = @U1TXB;
    movlw   low(U1TXB)
    movwf   DMANDSAWORD,BANKED
    movlw   high(U1TXB)
    movwf   DMANDSAWORD_H,BANKED
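
The difference between the byte assignment and the word assignment can be illustrated on the host with a short C sketch. The function names are hypothetical, and the address used in the note below is a made-up placeholder, not the real location of U1TXB.

```c
#include <stdint.h>

/* What a byte-sized destination keeps: only the low 8 bits of the address. */
uint8_t low_byte(uint16_t address)  { return (uint8_t)(address & 0xFFu); }

/* The part that is lost when only the low byte is written. */
uint8_t high_byte(uint16_t address) { return (uint8_t)(address >> 8); }

/* Writes the address into a low/high register pair, as the word alias does. */
void load_address_pair(uint16_t address, uint8_t *reg_low, uint8_t *reg_high)
{
    *reg_low  = low_byte(address);
    *reg_high = high_byte(address);
}
```

For a placeholder address of 0x02A1, the byte assignment would store only 0xA1, while the word alias stores 0xA1 in the low register and 0x02 in the high register, which is what the two `movlw`/`movwf` pairs above do.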

Enjoy. Hope this makes sense.

Evan

Anobium - 2020-12-29

PICInfo shows the addresses to cross-reference to the alias addresses.

Last edit: Anobium 2020-12-29

Screenshot 2020-12-29 203423.png

ikonsgr74 - 2020-12-30

Thanks for the "insight" Evan!
So it seems that modifying the various Cow Basic routines to utilize DMA (whenever supported by the selected PIC) would be rather easy and simple after all!
Can you make a rough estimate of the performance increase when using DMA?
For example, how much faster will an "on interrupt" HW UART byte read or a readtable byte read be on, say, an 18FXXQ43, compared to the current routines used for the 18FXXQ10 (e.g. like the ones I posted earlier)?

Last edit: ikonsgr74 2020-12-30

- Anobium - 2020-12-30
  
  I would have to test.
  
  Do you have any MPLAB-X code as a baseline?
  
- Chris Roper - 2020-12-30
  
  In my experience using DMA, on a PIC32 device not the 18f26Q43, trying to estimate a performance improvement is a moot point, as DMA is effectively hardware multitasking.
  On the PIC32 at least, when you executed the DMA transfer it was fire and forget, working in the background whilst the user program continued at full speed in the foreground. It was several years ago and my memory is not what it was, so I don't recall any of the C++ code that I used, but it was fast.
  
  - ikonsgr74 - 2020-12-30
    
    Depending on system arbitration used, this is true for 18FXXQ43 family too. Quoted from datasheet:
    Depending on the priority of the DMA with respect to CPU execution (Refer to section “Memory Access Scheme” in the “PIC18 CPU” chapter for more information), the DMA Controller can move data through two methods:
    • Stalling the CPU execution until it has completed its transfers (DMA has higher priority over the CPU in this mode of operation).
    • Utilizing unused CPU cycles for DMA transfers (CPU has higher priority over the DMA in this mode of operation). Unused CPU cycles are referred to as bubbles, which are instruction cycles available for use by the DMA to perform read and write operations. In this way, the effective bandwidth for handling data is increased; at the same time, DMA operations can proceed without causing a processor stall.
    
    If you use the 2nd method, the DMA transfer is executed with practically no speed penalty for the CPU.
    

ikonsgr74 - 2020-12-30

I never developed any code using MPLAB, only Cow Basic :-)
I do have MPLAB X IDE 5.20 installed and mainly use it for the MCC code configurator (in order to configure the various CLCs needed for my project), but I see that the new 18FXXQ43 parts are not supported; maybe I need to install a newer version.

Last edit: ikonsgr74 2020-12-30

ikonsgr74 - 2021-05-10

Any news about DMA support?
I received a couple of 18F47Q43s and have a PICkit 4 programmer too, so I'm really looking forward to testing... "DMA optimized" code! :-)

Anobium - 2021-05-10

The post https://sourceforge.net/p/gcbasic/discussion/579125/thread/b0baec8294/#6acb shows the method. Unless someone writes a DMA editor (like PPSTool or PICINFO tool) then you will have to hack through the datasheet to set up the registers.

and, Q43 is supported by PICKit2 & 3 .... using PICKitPlus. :-)

ikonsgr74 - 2021-05-10

Ok then, maybe you can write a "how to" code guide (based on the 18F4XQ43, as this is the 1st PIC family supporting DMA, and most probably all that follow, like Q83/Q84 and future PICs, will use the same methods too), with specific DMA examples like:
- Read from HWuart and place the byte in a single variable/array variable
- Read a byte from a single variable/array variable/table and write it to HWuart/PIC port
Then I will try to incorporate this code into my GCB code, and run tests to see if it works right and what impact it has on performance.

Last edit: ikonsgr74 2021-05-10

ikonsgr74 - 2021-05-10

I was wondering, is there a way to access a table element directly, without using the "readtable" command? Reading bytes from byte tables and placing them on a PIC PORT or in a variable is done all the time in my project, but in order to use DMA for that, I need a way to read a specific element without calling readtable, as this command does the transfer directly, but without using DMA....

- William Roth - 2021-05-11
  
  To answer your question, Yes and No
  
  When a table is defined with TABLE the compiler looks for a related Readtable command.
  If there is none, the table is never written to memory. So this is a "NO".
  
  However when a readtable is executed, even if only to initialize the table in memory then the table will then be written to program memory.
  
  But where in memory is the question.
  
  There is no way to tell the compiler where in memory to put the table;
  
  The compiler decides based upon how much memory the rest of the code uses. I cannot tell by looking at the ASM where in memory the table begins. Someone else might.
  
  However, if you know the data you are looking for, you can look at the hex and see where the first byte of the table is located. But if your code changes, this memory address will change as well.
  
  But for the sake of argument, Let's say the code never changes and the table never changes. You could then read the data directly via the TBLRD* command as described in the Chip's Datasheet. See the section on the Nonvolatile Memory (NVM) module
  
  Not worth the trouble IMO
  
  Last edit: William Roth 2021-05-11
  

ikonsgr74 - 2021-05-10

Last edit: ikonsgr74 2021-05-10

William Roth - 2021-05-11

My name was mentioned somewhere in regards to adding DMA support to GCB for chips that support it. To be clear, I have no plans now or in the future to do so.

It would be a rather huge, time consuming effort that in the end would likely only be utilized by a handful of advanced users.

I am not saying that it will not be done eventually, just that I will not be the one doing it.

As far as an estimated time for adding DMA support, Anobium or Hugh can answer that better than I can. However, I would not think it would be any sooner than 6 months if not a year or more.

Bill

Last edit: William Roth 2021-05-11

- Anobium - 2021-05-11
  
  My error in attributing some DMA work to you. I don't know what I was thinking.
  

mkstevo - 2021-05-11

Not fully understanding the concept, but...

If the "Table" was written to storage area flash, could the location in the PIC be specified and therefore be a known value? The storage area flash looks to be limited to 128 words, which might restrain the size of any table.

Anobium - 2021-05-11

We can look into this soon. It looks rather simple to use, but it would require a fundamental change to the serial write (in this example).

But, there is nothing to stop you from using the code shown in the DMA posts (above) in the latest release candidate.

Anobium - 2021-05-11

I read AppNote TB3164 today. This AppNote lays out the basics in isolation, without the other practices that would be used in an overall solution.

To use DMA requires a total architectural approach/impact analysis. For example, moving data from a table to serial looks easy. But what is the data to be moved to the serial, and in what format (byte or word data)? If byte data, then it may work; if word data, then the table data in Progmem would need to be formatted (laid out) so that the DMA is usable.

Then, assuming the data is byte data, moving the data out of the serial port would still be one byte at a time. So, what is the time advantage of a RAM buffer read (loaded by the DMA activity) versus the existing table read? Is it really a huge benefit?
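
One rough way to frame that question is to compare the time a serial byte occupies on the wire with the CPU cycles spent per byte. The sketch below is a back-of-the-envelope calculation in C. The 64 MHz clock, 115200 baud, 10 bits per byte (8N1 framing) and the 100-cycle per-byte cost used in the note are all assumed figures for illustration, not measurements, and the function names are hypothetical.

```c
#include <stdint.h>

/* Instruction cycles available while one UART byte is on the wire.
   A PIC18 executes one instruction cycle per 4 oscillator clocks. */
uint32_t cycles_per_uart_byte(uint32_t fosc_hz, uint32_t baud, uint32_t bits_per_byte)
{
    uint32_t instr_hz = fosc_hz / 4u;
    return (uint32_t)(((uint64_t)instr_hz * bits_per_byte) / baud);
}

/* CPU load of a fixed per-byte cost, in tenths of a percent (per mille). */
uint32_t per_byte_load_permille(uint32_t cost_cycles, uint32_t cycles_per_byte)
{
    return (cost_cycles * 1000u) / cycles_per_byte;
}
```

With these assumed numbers, `cycles_per_uart_byte(64000000, 115200, 10)` is about 1388 instruction cycles per byte, so a 100-cycle per-byte cost would be roughly 7% of the CPU during a continuous full-rate stream. The share shrinks at lower baud rates or when the line is idle, so the benefit depends heavily on the application.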
