From: Stuart M. <stu...@st...> - 2002-11-18 11:53:11
Attachments:
2.4.20-tlb
|
Folks

Attached is my second attempt at an improved TLB miss handler.

Unfortunately the previous patch was corrupted, so I've double checked this one, and it applies cleanly to 2.4 branch HEAD (2.4.20-pre10) and almost cleanly to 2.4.18.

This should build for both SH3 and SH4, although it has only been tested on SH4.

As always, all comments welcome

Stuart |
From: Paul M. <pau...@ti...> - 2002-11-18 18:05:46
|
Stuart,

Looks good. Do you have any performance numbers showing how this compares to the old handler?

This looks like a good candidate to merge into the restructure branch as well. Can you roll a patch against that? (There's also some ST40 stuff in cache-sh4.c that wants your attention).

On Mon, 2002-11-18 at 06:51, Stuart Menefy wrote:
> Attached is my second attempt at an improved TLB miss handler.
>
> Unfortunately the previous patch was corrupted, so I've double checked this
> one, and it applies cleanly to 2.4 branch HEAD (2.4.20-pre10) and almost
> cleanly to 2.4.18.
>
> This should build for both SH3 and SH4, although it has only been tested
> on SH4.
>

Regards,

--
Paul Mundt
pau...@ti...
TimeSys Corporation |
From: Stuart M. <stu...@st...> - 2002-11-19 21:01:57
|
Hi Paul

On Mon, 18 Nov 2002 13:09:06 -0500 pau...@ti... wrote:
> Stuart,
>
> Looks good. Do you have any performance numbers showing how this
> compares to the old handler?

I did some performance testing while writing the code, but thought I'd better go back and get some figures from unmodified code for comparison. This is all on a 200MHz ST40GX1 + 112MB RAM.

1. Build util-linux from clean (all files accessed over NFS):

     2.4.17       14m37
     2.4.18       14m08
     2.4.18+tlb    8m55

2. Quake (timedemo demo1)

     2.4.18       17.1 FPS
     2.4.18+tlb   19.1 FPS

3. LMBench

The LMBench figures are less impressive. Most are unchanged except those relating to local communication, context switching (especially for processes with a big footprint) and memory latency (the second line in each table is with the TLB patch):

Context switching - times in microseconds - smaller is better
-------------------------------------------------------------
Host      OS            2p/0K  2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
                        ctxsw  ctxsw  ctxsw  ctxsw  ctxsw   ctxsw   ctxsw
--------- ------------- ----- ------ ------ ------ ------ ------- -------
iptest102 Linux 2.4.18_  18.0  108.6  324.9  152.2  494.8   182.8   486.3
iptest102 Linux 2.4.18+  16.0  106.9  314.0  109.4  354.4   112.0   348.9

*Local* Communication latencies in microseconds - smaller is better
-------------------------------------------------------------------
Host      OS            2p/0K  Pipe  AF    UDP   RPC/  TCP   RPC/  TCP
                        ctxsw        UNIX        UDP         TCP   conn
--------- ------------- ----- ----- ---- ----- ----- ----- ----- ----
iptest102 Linux 2.4.18_  18.0  60.5 128.  241.7 539.4 371.0 722.8 1116
iptest102 Linux 2.4.18+  16.0  54.2 126.  261.1 548.9 390.6 753.4 1144

*Local* Communication bandwidths in MB/s - bigger is better
-----------------------------------------------------------
Host      OS            Pipe  AF   TCP  File   Mmap   Bcopy  Bcopy  Mem   Mem
                              UNIX      reread reread (libc) (hand) read  write
--------- ------------- ---- ---- ---- ------ ------ ------ ------ ---- -----
iptest102 Linux 2.4.18_ 21.0 23.7 39.2   28.8  120.1   48.1   45.5 115.  93.9
iptest102 Linux 2.4.18+ 23.5 24.0 39.9   28.1  136.0   53.0   49.6 136. 100.4

Memory latencies in nanoseconds - smaller is better
    (WARNING - may not be correct, check graphs)
---------------------------------------------------
Host      OS             Mhz  L1 $  L2 $   Main mem  Guesses
--------- -------------  ---  ----  -----  --------  -------
iptest102 Linux 2.4.18_  200  10.0  182.9    295.7
iptest102 Linux 2.4.18+  200  10.0  183.0    189.4    No L2 cache?

The most interesting result is actually the graph which LMBench produces showing memory latency. Previously this had a massive shoulder as soon as the data set became larger than 256K, ie as soon as the data set became larger than the memory covered by the TLB. Worst case we saw 200nS access times degrade to 1000uS; that's now down to a more reasonable 235nS.

> This looks like a good candidate to merge into the restructure branch as
> well. Can you roll a patch against that? (There's also some ST40 stuff
> in cache-sh4.c that wants your attention).

I'll give it a try, however there are a raft of other patches I'll need to roll forwards to 2.5 first for basic board support. Most should apply pretty cleanly, so it shouldn't take too long, and it's something I've been putting off doing, so this will be a good opportunity.

Stuart

> On Mon, 2002-11-18 at 06:51, Stuart Menefy wrote:
> > Attached is my second attempt at an improved TLB miss handler.
> >
> > Unfortunately the previous patch was corrupted, so I've double checked this
> > one, and it applies cleanly to 2.4 branch HEAD (2.4.20-pre10) and almost
> > cleanly to 2.4.18.
> >
> > This should build for both SH3 and SH4, although it has only been tested
> > on SH4.
> >
> Regards,
>
> --
> Paul Mundt
> pau...@ti...
> TimeSys Corporation
> |
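As a sanity check on the 256K shoulder: assuming the SH-4's 64-entry unified TLB and 4 KB pages (neither figure is stated in the thread), the TLB reach works out to exactly 256 KB, which is where the latency shoulder appears. A minimal sketch of the arithmetic:

/* Rough TLB-reach estimate.  Assumes an SH-4-style 64-entry unified TLB
 * and 4 KB pages; both figures are assumptions, not taken from the patch. */
#include <stdio.h>

int main(void)
{
	unsigned long entries   = 64;	/* UTLB entries (assumed) */
	unsigned long page_size = 4096;	/* bytes per page (assumed) */

	/* 64 * 4096 = 262144 bytes = 256 KB, matching the shoulder in the
	 * LMBench memory-latency graph described above. */
	printf("TLB reach: %lu KB\n", entries * page_size / 1024);
	return 0;
}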
From: Paul M. <pau...@ti...> - 2002-11-18 18:39:33
|
Stuart,

On Mon, 2002-11-18 at 06:51, Stuart Menefy wrote:
> This should build for both SH3 and SH4, although it has only been tested
> on SH4.
>

This will only build for the ST40 stuff at the moment, due to PCC_MASK being set on non-ST40 and then #error'ed further down.

Since it's something that needs testing, let's just leave it unset for now and then turn it back on later once things work and the #error gets yanked.

Thoughts?

Regards,

--
Paul Mundt
pau...@ti...
TimeSys Corporation |
From: Stuart M. <stu...@st...> - 2002-11-18 19:28:12
|
Leaving it unset for now should be fine. It's only needed if you have a system with PCMCIA implemented using the HD64465 support chip, because of the gross way the PCMCIA attributes are handled.

If we think this is important, then a better solution would be to shift the entire pte up or down a few bits, allowing the PTEA bits to be stored contiguously.

Stuart

On Mon, 18 Nov 2002 13:42:54 -0500 pau...@ti... wrote:
> Stuart,
>
> On Mon, 2002-11-18 at 06:51, Stuart Menefy wrote:
> > This should build for both SH3 and SH4, although it has only been tested
> > on SH4.
> >
> This will only build for the ST40 stuff at the moment, due to PCC_MASK
> being set on non-ST40 and then #error'ed further down.
>
> Since it's something that needs testing, let's just leave it unset for now
> and then turn it back on later once things work and the #error gets
> yanked.
>
> Thoughts?
>
> Regards,
>
> --
> Paul Mundt
> pau...@ti...
> TimeSys Corporation
> |
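To make the "shift the entire pte" idea concrete, here is a minimal illustrative sketch, not taken from any SH tree and with made-up names and field widths: keep the software pte shifted up by a few bits so the bits destined for PTEA sit contiguously at the bottom, letting the miss handler recover each half with a single mask or shift.

/* Hypothetical software pte layout -- illustration only.  The hardware
 * PTEL value is stored shifted up by PTE_SOFT_SHIFT, and the bits bound
 * for PTEA (the PCC/space-attribute bits) are packed into the low bits. */
#define PTE_SOFT_SHIFT	4				/* room for 4 PTEA bits */
#define PTE_PTEA_MASK	((1UL << PTE_SOFT_SHIFT) - 1)

#define mk_soft_pte(ptel, ptea) \
	(((ptel) << PTE_SOFT_SHIFT) | ((ptea) & PTE_PTEA_MASK))

#define soft_pte_ptel(pte)	((pte) >> PTE_SOFT_SHIFT)	/* value for PTEL */
#define soft_pte_ptea(pte)	((pte) & PTE_PTEA_MASK)		/* value for PTEA */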
From: Masahiro A. <m-...@aa...> - 2003-01-27 11:24:36
|
Menefy-san and all, sorry for the relatively long post.

On Mon, 18 Nov 2002 11:51:32 +0000 Stuart Menefy <stu...@st...> wrote:
> Folks
>
> Attached is my second attempt at an improved TLB miss handler.
>
> Unfortunately the previous patch was corrupted, so I've double checked this
> one, and it applies cleanly to 2.4 branch HEAD (2.4.20-pre10) and almost
> cleanly to 2.4.18.
>
> This should build for both SH3 and SH4, although it has only been tested
> on SH4.
>
> As always, all comments welcome
>
> Stuart

I've tried your patch on the "linux-2_4_18" tagged CVS source and ran it on a SolutionEngine 7750S, but the kernel oopses at startup. I really want to see the speed-up in TLB miss handling, since it may well be the source of the large latencies we see with the RTLinux patch. I would be more than happy if you could spend a little time reading my report below and give me any thoughts you might have. Only if you have the time, of course.

1. I took the stock 2.4.18 kernel and dropped the "linux-2_4_18" tagged CVS source into it, then simply applied the "2.4.20-tlb" file as a patch. As Mundt-san pointed out, it hits the #error in entry.S, so I simply put a "!" in front of the #error.

2. I did make xconfig dep zImage and put the zImage onto CompactFlash. The kernel starts booting, but Oopses in mem_init. I added some debug printks to see some of the values. The output is like this:

---
Memory: 63288k/65536k available (1044k kernel code, 2248k reserved, 32k data, 40k init)
remap_area_pages begin
-->remap_area_pages(address = 0xc0000000, phys_addr = 0x00000000, size = 0x00004000, flags = 0x00000008)
-->remap_area_pmd(pmd = 0x8c11dc28, address = 0xc0000000, size = 0x00004000, phys_addr = 0x00000000, flags = 0x00000008)
in pte_alloc()
pmd_none(*pmd) is true
new = 0x8c13f000
pmd_val(*pmd) = 0x8c13f000
pmd_page(*(pmd)) = 0x0c13f000
__pte_offset(address) = 0x00000000
-->remap_area_pte(pte = 0x0c13f000, address = 0x00000000, size = 0x00004000, phys_addr = 0x00000000, flags = 0x00000008)
Unable to handle kernel paging request at virtual address 0c13f000
pc = 8c00ddde
*pde = 8c001000
Oops: 0000
PC  : 8c00ddde SP  : 8c111f94 SR  : 40008100 TEA : 0c13f000    Not tainted
R0  : 00000026 R1  : 00000000 R2  : 40008100 R3  : 00400000
R4  : 8c1076c4 R5  : 00000001 R6  : 00000001 R7  : 8c011920
R8  : 00001000 R9  : 00004000 R10 : 00000000 R11 : 8c011b40
R12 : 00000000 R13 : 0c13f000 R14 : 00000000
MACH: 0004deb8 MACL: 00000000 GBR : b7e3fb6c PR  : 8c00ddbe
Kernel panic: Attempted to kill the idle task!
In idle task - not syncing
---

   The "pte" passed to remap_area_pte looks wrong.

3. I put the same printks into the fresh 2.4.18 CVS source and the output was:

---
in pte_alloc()
pmd_none(*pmd) is true
new = 0x8c13f000
pmd_val(*pmd) = 0x0c13f564
pmd_page(*(pmd)) = 0x8c13f000
__pte_offset(address) = 0x00000000
-->remap_area_pte(pte = 0x8c13f000,
---

   The "pte" looks fine here.

4. I thought pmd_populate was generating an invalid value, so I changed it from:

extern inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd, pte_t *pte)
{
	set_pmd(pmd, __pmd((unsigned long)pte));
}

   to:

#define pmd_populate(mm, pmd, pte) \
	set_pmd(pmd, __pmd(__pa(pte)))

   Then:

---
in pte_alloc()
pmd_none(*pmd) is true
new = 0x8c13f000
pmd_val(*pmd) = 0x0c13f000
pmd_page(*(pmd)) = 0x8c13f000
__pte_offset(address) = 0x00000000
-->remap_area_pte(pte = 0x8c13f000,
---

   The kernel gets past mem_init, but just around the time init starts, the system resets.

5. Then I reversed my thinking: maybe pmd_populate is fine, but pmd_page is to blame.
   So I changed it from:

#define pmd_page(pmd) ((unsigned long) __va(pmd_val(pmd)))

   to:

#define pmd_page(pmd) ((unsigned long) pmd_val(pmd))

   The kernel gets past mem_init again, but just around the time init starts, the system freezes.

Thank you for reading this far. I suspect that some changes are missing from the patch, or that the patch only works with 2.4.20-pre10. I would appreciate any input on this matter.

=================================
Masahiro ABE, A&D Co., Ltd. Japan |
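For what it's worth, the numbers above are consistent with pmd_populate() and pmd_page() simply disagreeing about what the pmd holds: with the stock sh PAGE_OFFSET of 0x80000000 (assumed here), applying __va() to the already-virtual 0x8c13f000 wraps around to exactly the bogus 0x0c13f000 seen in step 2, while the unpatched code stores a physical address and __va() recovers 0x8c13f000. A minimal sketch of the invariant follows; it is an illustration only, not a patch, and the _phys/_virt names are made up:

/* Illustration only: pmd_populate() and pmd_page() must be inverses of
 * each other.  Two self-consistent pairings are possible. */

/* (a) pmd holds the physical address of the pte page: */
#define pmd_populate_phys(mm, pmd, pte)	set_pmd(pmd, __pmd(__pa(pte)))
#define pmd_page_phys(pmd)		((unsigned long) __va(pmd_val(pmd)))

/* (b) pmd holds the virtual (P1) address of the pte page: */
#define pmd_populate_virt(mm, pmd, pte)	set_pmd(pmd, __pmd((unsigned long)(pte)))
#define pmd_page_virt(pmd)		((unsigned long) pmd_val(pmd))

/* Mixing (b)'s pmd_populate with (a)'s pmd_page gives
 *   0x8c13f000 + 0x80000000 = 0x0c13f000 (mod 2^32),
 * i.e. exactly the bad "pte" reported in step 2.  Whichever pairing is
 * chosen presumably also has to match what the new TLB miss handler
 * expects to find in the pmd when it walks the page tables. */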
From: Masahiro A. <m-...@aa...> - 2003-01-27 12:29:19
|
On Mon, 27 Jan 2003 20:24:28 +0900 Masahiro Abe <m-...@aa...> wrote:
> Thank you for reading this far. I suspect that some changes are missing
> from the patch, or that the patch only works with 2.4.20-pre10.

I tried the patch with 2.4.20-pre11 and got the same result.

---
LILO boot: first-image
Loading linux............done.
Uncompressing Linux... Ok, booting the kernel.
Linux version 2.4.20-pre11-sh (ro...@ad...) (gcc version 3.0.3) #2 Mon Jan 27 21:20:31 JST 2003
On node 0 totalpages: 16384
zone(0): 16384 pages.
zone(1): 0 pages.
zone(2): 0 pages.
Kernel command line: ro root=301 BOOT_FILE=/boot/zImage mem=64M sh_mv=SolutionEngine ide1=noprobe console=ttySC1,115200
ide_setup: ide1=noprobe
Setting GDB trap vector to 0x80000100
SH RTC: invalid value, resetting to 1 Jan 2000
CPU clock: 200.01MHz
Bus clock: 66.67MHz
Module clock: 33.33MHz
Interval = 83340
Calibrating delay loop... 199.47 BogoMIPS
Memory: 63516k/65536k available (1069k kernel code, 2020k reserved, 33k data, 40k init)
Unable to handle kernel paging request at virtual address 0c146000
pc = 8c00fd98
*pde = 8c001000
Oops: 0000
PC  : 8c00fd98 SP  : 8c117f9c SR  : 40008100 TEA : 0c146000    Not tainted
R0  : 0c146000 R1  : 00400000 R2  : 003fffff R3  : 00001000
R4  : 8c146000 R5  : 8c146000 R6  : ba2e8ba3 R7  : 8c013be0
R8  : 00000000 R9  : 00000000 R10 : 00004000 R11 : 00000000
R12 : 0c146000 R13 : 00004000 R14 : c0000000
MACH: 0004deb8 MACL: 00000146 GBR : ffffffff PR  : 8c00fd5a
Kernel panic: Attempted to kill the idle task!
In idle task - not syncing
---

The TEA is a little different from the one on 2.4.18 (which was 0c13f000).

=================================
Masahiro ABE, A&D Co., Ltd. Japan |
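If the stock sh PAGE_OFFSET of 0x80000000 is assumed, both faulting addresses sit exactly that far below a valid-looking P1 pointer visible in the same dump: 0x8c146000 - 0x80000000 = 0x0c146000 here (compare R4/R5 with TEA), just as 0x8c13f000 corresponds to 0x0c13f000 on 2.4.18. That is consistent with the same pte-page address being used with the physical/virtual conversion applied one time too many or too few, as in the pmd_populate()/pmd_page() mismatch described in the previous message.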