From: Bernard L. <le...@bo...> - 2003-03-04 08:03:38
|
[ CC'ed back to the list :) ]

Hi Chandan,

Those numbers are interesting; it looks like the audio driver is _really_ wasting that extra cpu!

One experiment we could try is to just get the main cpu to do all the work. An extension of that would be to then put the decoding routines in fast ram (since we wouldn't be using it as the buffer). I'm not sure how big a win this would be, but it is how the real firmware works... The "fast ram" is onboard ram, so I guess it's similar in speed to the cache??

cheers, bern.

On Tue, 2003-03-04 at 06:02, Chandan Kudige [home] wrote:
> > Be careful about changing the optimisation level when compiling with
> > gcc. -O3 should be faster than -O2, but in a lot of cases it isn't
> > (in fact -O1 or -Os are sometimes faster than either of them !!). I
> > don't think it's going to make it 25% faster, but you should give
> > each one a try if possible.
>
> I will give this a try tomorrow.
>
> > Did you try madplay with '-o/dev/null' ??
> > That would give a clue about how much time is spent in madplay and
> > how much in the other stuff (e.g. audio output driver etc).
>
> Here are the numbers with /dev/null
> For the 128 second clip:
> Without patch: 163.00 sec
> With patch: 158.47 sec
|
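As a rough reading of the /dev/null numbers quoted above, the arithmetic can be written out as a small C sketch. The function names are made up for illustration; only the three timings come from the thread.

```c
#include <assert.h>

/* Ratio of decode time to clip length. Values above 1.0 mean the
   decoder alone takes longer than real-time playback. */
double realtime_factor(double decode_sec, double clip_sec)
{
    assert(clip_sec > 0.0);
    return decode_sec / clip_sec;
}

/* Fraction of the decode time saved by a change (0.0 = no saving). */
double fraction_saved(double before_sec, double after_sec)
{
    assert(before_sec > 0.0);
    return (before_sec - after_sec) / before_sec;
}
```

With the quoted figures, `realtime_factor(163.00, 128.0)` is about 1.27 and `fraction_saved(163.00, 158.47)` is just under 0.03, so by this measure the patch saves a few percent while the decoder on its own is still slower than real time.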
From: <arm...@ya...> - 2003-03-04 08:44:13
|
Bernard Leach <le...@bo...> wrote:
>
> An extension of that would be to then put the decoding routines
> in fast ram (since we wouldnt be using it as the buffer).
> I'm not sure how big a win this would be but it is how the real
> firmware works...
>

My guess is that this will be a _big_ win. The internal SRAM is likely to be 32bit wide, with low (probably zero) wait-states when accessed from the ARM.

The system SDRAM on the other hand (Samsung part No. K4S561632 ??) is 16bit wide, has the overhead of refreshes etc and is probably clocked slower as well (anyone with an open ipod and a 'scope handy, please put a probe on pin 38 of the SDRAM chip and let us know for sure... :-).

The most important code in madplay to speed up is the imdct36 assembler function and the contents of synth.c. The latter may take a bit more work, but the former is fully PIC, so a quick and dirty test would be to hack madplay to memcpy the whole of III_imdct_l into fast SRAM at program startup and then call it from there via a function pointer (it's only called from 2 places in layer3.c).

How big is the internal SRAM by the way ???

Andre
|
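The "call it via a function pointer" idea above could look roughly like the following host-side C sketch. Everything named here is an assumption for illustration: `imdct_fn`, `imdct_stub`, `relocate_to_fast_ram`, `decoder_init` and `decode_granule` are hypothetical, the signature is not that of the real III_imdct_l, and the relocation hook does not copy any code (on the target it would memcpy the position-independent routine into SRAM and return the new address).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical signature; the real III_imdct_l has its own arguments. */
typedef int (*imdct_fn)(int);

/* Stand-in for the hot routine that would be moved. */
static int imdct_stub(int x)
{
    return x * 36;
}

/* Hypothetical hook. On the target this would copy the routine's
   machine code into fast SRAM and return the address of the copy.
   In this sketch it returns the original address unchanged. */
static imdct_fn relocate_to_fast_ram(imdct_fn original)
{
    assert(original != NULL);
    return original;
}

/* All call sites go through this pointer instead of the symbol. */
static imdct_fn imdct_ptr = NULL;

/* Run once at program startup, before any decoding. */
void decoder_init(void)
{
    imdct_ptr = relocate_to_fast_ram(imdct_stub);
}

/* Example call site: one indirect call per use. */
int decode_granule(int x)
{
    assert(imdct_ptr != NULL);
    return imdct_ptr(x);
}
```

Because only two places in layer3.c call the routine, switching them to an indirect call like `decode_granule` does here should be a small change; whether the copied code runs correctly from SRAM still depends on it being position independent, as the message says.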
From: Chandan K. [home] <ch...@to...> - 2003-03-04 15:05:15
|
In the current implementation of the cop code, every time an application tries to write more than one frame of data to /dev/dsp, it is put to sleep for as long as the kernel keeps feeding this data to the cop (I think the shared SRAM is only 92K).

One way to handle this would be to implement a kernel thread which would take the user data and keep feeding it to the cop. This thread would be asleep most of the time, and the user program could continue its execution in parallel. I still haven't tried this out.

Intel's mp3player writes one frame at a time and hence does not incur this waiting. But I can verify by playing it to /dev/null just to be sure.

So if we can do away with the cop and write to the audio in the main processor, we can still follow this approach so that we do not tie up the user process. We can still move the decoding routine into the SRAM (I would assume that 92K should be more than enough!).

I can give this a try over the weekend.

-Chandan

On Tue, 2003-03-04 at 03:44, Andre wrote:
> How big is the internal SRAM by the way ???
>
> Andre
>
> _______________________________________________
> iPodlinux-devel mailing list
> iPo...@li...
> https://lists.sourceforge.net/lists/listinfo/ipodlinux-devel
|
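The kernel-thread idea above amounts to putting a queue between the write path and whatever feeds the audio hardware. A minimal sketch of such a queue in plain C follows, assuming a fixed number of fixed-size frames; `RING_SLOTS`, `FRAME_BYTES` and the `ring_*` names are hypothetical, no real kernel API is used, and there is no locking, which a real driver would need.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define RING_SLOTS  8   /* hypothetical queue depth */
#define FRAME_BYTES 16  /* hypothetical frame size, kept tiny here */

struct frame_ring {
    unsigned char slot[RING_SLOTS][FRAME_BYTES];
    size_t head;   /* next slot to fill */
    size_t tail;   /* next slot to drain */
    size_t count;  /* frames currently queued */
};

void ring_init(struct frame_ring *r)
{
    assert(r != NULL);
    memset(r, 0, sizeof *r);
}

/* Write path: queue one frame. Returns 1 on success, 0 if the ring is
   full (only then would the writer have to sleep). */
int ring_put(struct frame_ring *r, const unsigned char *frame)
{
    assert(r != NULL && frame != NULL);
    if (r->count == RING_SLOTS)
        return 0;
    memcpy(r->slot[r->head], frame, FRAME_BYTES);
    r->head = (r->head + 1) % RING_SLOTS;
    r->count++;
    return 1;
}

/* Feeder side: take the oldest frame. Returns 1 on success, 0 if the
   ring is empty (the feeder would go back to sleep). */
int ring_get(struct frame_ring *r, unsigned char *out)
{
    assert(r != NULL && out != NULL);
    if (r->count == 0)
        return 0;
    memcpy(out, r->slot[r->tail], FRAME_BYTES);
    r->tail = (r->tail + 1) % RING_SLOTS;
    r->count--;
    return 1;
}
```

With this shape the application only blocks when `ring_put` reports a full ring, rather than for the whole time the data is being fed to the output, which is the behaviour the message is trying to avoid.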
From: Bernard L. <le...@bo...> - 2003-03-17 15:28:24
|
Hi Chandan,

Not sure if you got to do any experimentation, but I've just tried what you were suggesting.

First up, I reverted to just using a single cpu to drive the audio. The result (using the mp3example) was a noticeable slowdown.

The second experiment was to put some of the ipp code into the fast ram and try that. This turned out to be a little harder than first hoped (since I don't have the source code to that). Anyhow, with those changes the performance was back up to near real-time. It is definitely lower than the original cpu-cop version, but this version is only using one processor!

My current feeling is that using this cpu-only audio driver is the right way to go (and then to use the cop as a decoding unit..?). Or at least, using the internal ram for buffering is a big waste. I suppose a cpu-cop driver using sdram and cache flushing would be one more experiment there...

I'll see about getting the audio driver into CVS so it can be compiled as either.

The mp3example program, on the other hand, is a bit of a nightmare! To get this to work I compiled up a dummy program that called the functions to relocate and compiled it to run at 0x40000000. This program is then copied to the right location, and stubs are used to call to the right spot. Does anyone have any suggestions as to how this could be simplified?

cheers, bern.

On Tue, 2003-03-04 at 16:04, Chandan Kudige [home] wrote:
> So if we can do away with the cop and write to the audio in the main
> processor we can still follow this approach so that we do not tie up the
> user process. We can still move the decoding routine into the SRAM (I
> would assume that 92K should be more than enough!).
>
> I can give this a try over the weekend.
|
From: <arm...@ya...> - 2003-03-18 06:17:19
|
Hi Bernard,

If the ipp code is structured in a way that would allow it, the best split between cpus for mp3 decoding might be to have the co-processor do subband synthesis and audio output and have the main cpu do everything else.

(Subband synthesis is the final step in mp3 decoding; it transforms uncompressed frequency domain data into uncompressed time domain data. It should be fairly well de-coupled from the rest of the decoder.)

This is going back towards a system where the cop runs only the audio output driver, except that now the audio output driver contains the final stage of the mp3 decoding (which just happens to require approx 50% of the cpu bandwidth).

If fast SRAM is big enough to hold the synthesis code + synthesis working buffer + some buffered audio data on its way to the DAC, then you should get almost double the throughput of a single cpu.

Andre
|
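The "almost double" estimate above follows from treating the two cpus as a two-stage pipeline whose speed is set by the slower stage. A small C sketch of that reasoning, under the simplifying assumption that handing data between the cpus costs nothing (`pipeline_speedup` is a made-up name):

```c
#include <assert.h>

/* Ideal speedup over a single cpu when one cpu runs subband synthesis
   (a fraction synth_fraction of the total work) and the other runs
   everything else. The slower stage limits the pipeline. Hand-off and
   memory contention are ignored, so real gains would be lower. */
double pipeline_speedup(double synth_fraction)
{
    assert(synth_fraction >= 0.0 && synth_fraction <= 1.0);
    double other = 1.0 - synth_fraction;
    double slowest = synth_fraction > other ? synth_fraction : other;
    return 1.0 / slowest;
}
```

With the roughly 50% figure from the message, `pipeline_speedup(0.5)` gives 2.0; if synthesis were only 40% of the work, `pipeline_speedup(0.4)` would give about 1.67, which is why an even split matters for this scheme.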