From: Jerome G. <gl...@fr...> - 2009-04-07 20:34:48
|
Hi Thomas, I think we should not ttm_bo_unreserve the bo in ttm_bo_move_accel_cleanup i am seeing double unreserve which likely lead to kernel list corruption and i think it's due to that one (i am checking through printk but the log is enormous and my script is not yet done with parsing it) I checked code path in via using ttm_bo_move_accel_cleanup and none seems to reserve the buffer before calling ttm_bo_move_accel_cleanup. (Btw my tree is in my fdo drm-next repo drm-next-radeon branch i think they are few more fixes you might have miss but i will do a diff latter and push to mesa/drm repo) Cheers, Jerome |
From: Thomas H. <the...@vm...> - 2009-04-08 14:48:01
|
Jerome Glisse wrote: > Hi Thomas, > > I think we should not ttm_bo_unreserve the bo in > ttm_bo_move_accel_cleanup i am seeing double unreserve > which likely lead to kernel list corruption and i > think it's due to that one (i am checking through > printk but the log is enormous and my script is not > yet done with parsing it) > > I checked code path in via using ttm_bo_move_accel_cleanup > and none seems to reserve the buffer before calling > ttm_bo_move_accel_cleanup. > > Jerome, All buffers that are touched by the move code need to be reserved. What happens in the above case is that the buffer is copied in its reserved state, and thus there will be an unreserve for each copy. Please make sure, however, that you got all of the buffer_object_transfer fixes from the newttm branch, in particular the one where we clear the fbo->swap list head. /Thomas > (Btw my tree is in my fdo drm-next repo drm-next-radeon > branch i think they are few more fixes you might have > miss but i will do a diff latter and push to mesa/drm > repo) > > Cheers, > Jerome > > |
From: Jerome G. <gl...@fr...> - 2009-04-08 17:53:22
|
On Wed, 2009-04-08 at 16:47 +0200, Thomas Hellstrom wrote: > Jerome Glisse wrote: > > Hi Thomas, > > > > I think we should not ttm_bo_unreserve the bo in > > ttm_bo_move_accel_cleanup i am seeing double unreserve > > which likely lead to kernel list corruption and i > > think it's due to that one (i am checking through > > printk but the log is enormous and my script is not > > yet done with parsing it) > > > > I checked code path in via using ttm_bo_move_accel_cleanup > > and none seems to reserve the buffer before calling > > ttm_bo_move_accel_cleanup. > > > > > Jerome, > > All buffers that are touched by the move code need to be reserved. > What happens in the above case is that the buffer is copied in its > reserved state, > and thus there will be an unreserve for each copy. > > Please make sure, however, that you got all of the > buffer_object_transfer fixes from the newttm branch, > in particular the one where we clear the fbo->swap list head. > > /Thomas There is a bug in cleanup: if (!(man->flags & TTM_MEMTYPE_FLAG_FIXED)) ghost_obj->ttm = NULL; else ghost_obj = NULL; Used to be if (!(man->flags & TTM_MEMTYPE_FLAG_FIXED)) ghost_obj->ttm = NULL; else bo->ttm = NULL; And i think it's the correct one. Note that current one lead to oups because then you unreserve a NULL ptr. Cheers, Jerome |
From: Thomas H. <th...@sh...> - 2009-04-10 08:14:19
|
Jerome Glisse wrote: > On Wed, 2009-04-08 at 16:47 +0200, Thomas Hellstrom wrote: > >> Jerome Glisse wrote: >> >>> Hi Thomas, >>> >>> I think we should not ttm_bo_unreserve the bo in >>> ttm_bo_move_accel_cleanup i am seeing double unreserve >>> which likely lead to kernel list corruption and i >>> think it's due to that one (i am checking through >>> printk but the log is enormous and my script is not >>> yet done with parsing it) >>> >>> I checked code path in via using ttm_bo_move_accel_cleanup >>> and none seems to reserve the buffer before calling >>> ttm_bo_move_accel_cleanup. >>> >>> >>> >> Jerome, >> >> All buffers that are touched by the move code need to be reserved. >> What happens in the above case is that the buffer is copied in its >> reserved state, >> and thus there will be an unreserve for each copy. >> >> Please make sure, however, that you got all of the >> buffer_object_transfer fixes from the newttm branch, >> in particular the one where we clear the fbo->swap list head. >> >> /Thomas >> > > There is a bug in cleanup: > if (!(man->flags & TTM_MEMTYPE_FLAG_FIXED)) > ghost_obj->ttm = NULL; > else > ghost_obj = NULL; > Used to be > if (!(man->flags & TTM_MEMTYPE_FLAG_FIXED)) > ghost_obj->ttm = NULL; > else > bo->ttm = NULL; > > And i think it's the correct one. Note that current one lead > to oups because then you unreserve a NULL ptr. > > Jerome, You're right. Does that fix your list corruption? /Thomas > Cheers, > Jerome > > > ------------------------------------------------------------------------------ > This SF.net email is sponsored by: > High Quality Requirements in a Collaborative Environment. > Download a free trial of Rational Requirements Composer Now! > http://p.sf.net/sfu/www-ibm-com > -- > _______________________________________________ > Dri-devel mailing list > Dri...@li... > https://lists.sourceforge.net/lists/listinfo/dri-devel > |
From: Jerome G. <gl...@fr...> - 2009-04-10 14:40:20
Attachments:
ttm.patch
|
On Fri, 2009-04-10 at 10:14 +0200, Thomas Hellström wrote: > Jerome Glisse wrote: > > On Wed, 2009-04-08 at 16:47 +0200, Thomas Hellstrom wrote: > > > >> Jerome Glisse wrote: > >> > >>> Hi Thomas, > >>> > >>> I think we should not ttm_bo_unreserve the bo in > >>> ttm_bo_move_accel_cleanup i am seeing double unreserve > >>> which likely lead to kernel list corruption and i > >>> think it's due to that one (i am checking through > >>> printk but the log is enormous and my script is not > >>> yet done with parsing it) > >>> > >>> I checked code path in via using ttm_bo_move_accel_cleanup > >>> and none seems to reserve the buffer before calling > >>> ttm_bo_move_accel_cleanup. > >>> > >>> > >>> > >> Jerome, > >> > >> All buffers that are touched by the move code need to be reserved. > >> What happens in the above case is that the buffer is copied in its > >> reserved state, > >> and thus there will be an unreserve for each copy. > >> > >> Please make sure, however, that you got all of the > >> buffer_object_transfer fixes from the newttm branch, > >> in particular the one where we clear the fbo->swap list head. > >> > >> /Thomas > >> > > > > There is a bug in cleanup: > > if (!(man->flags & TTM_MEMTYPE_FLAG_FIXED)) > > ghost_obj->ttm = NULL; > > else > > ghost_obj = NULL; > > Used to be > > if (!(man->flags & TTM_MEMTYPE_FLAG_FIXED)) > > ghost_obj->ttm = NULL; > > else > > bo->ttm = NULL; > > > > And i think it's the correct one. Note that current one lead > > to oups because then you unreserve a NULL ptr. > > > > > Jerome, > You're right. Does that fix your list corruption? > > /Thomas I have no more list corruption but my tree have few other fixes it would be nice if you could check if my not missunderstanding vm code (patch attached). Also i was reviewing the bo move code and i think we should only call ttm_tt_bind if mem_type is TTM_TT, right now you call bind if mem_type != SYSTEM, so far i never saw somethings else than TTM_TT being bind but better be safe than sorry. A related question is what does ttm do when moving a bo from vram to system ? My understanding is that it creates a tt object and bind it but can the driver move callback ever get old_memtype = VRAM & new_memtype=SYSTEM ? Btw i try to isolate all my ttm stuff in my drm-next-ttm branch on fdo: http://cgit.freedesktop.org/~glisse/drm-next/ Cheers, Jerome |
From: Thomas H. <the...@vm...> - 2009-04-12 09:05:18
|
Jerome Glisse wrote: > On Fri, 2009-04-10 at 10:14 +0200, Thomas Hellström wrote: > >> Jerome Glisse wrote: >> >>> On Wed, 2009-04-08 at 16:47 +0200, Thomas Hellstrom wrote: >>> >>> >>>> Jerome Glisse wrote: >>>> >>>> >>>>> Hi Thomas, >>>>> >>>>> I think we should not ttm_bo_unreserve the bo in >>>>> ttm_bo_move_accel_cleanup i am seeing double unreserve >>>>> which likely lead to kernel list corruption and i >>>>> think it's due to that one (i am checking through >>>>> printk but the log is enormous and my script is not >>>>> yet done with parsing it) >>>>> >>>>> I checked code path in via using ttm_bo_move_accel_cleanup >>>>> and none seems to reserve the buffer before calling >>>>> ttm_bo_move_accel_cleanup. >>>>> >>>>> >>>>> >>>>> >>>> Jerome, >>>> >>>> All buffers that are touched by the move code need to be reserved. >>>> What happens in the above case is that the buffer is copied in its >>>> reserved state, >>>> and thus there will be an unreserve for each copy. >>>> >>>> Please make sure, however, that you got all of the >>>> buffer_object_transfer fixes from the newttm branch, >>>> in particular the one where we clear the fbo->swap list head. >>>> >>>> /Thomas >>>> >>>> >>> There is a bug in cleanup: >>> if (!(man->flags & TTM_MEMTYPE_FLAG_FIXED)) >>> ghost_obj->ttm = NULL; >>> else >>> ghost_obj = NULL; >>> Used to be >>> if (!(man->flags & TTM_MEMTYPE_FLAG_FIXED)) >>> ghost_obj->ttm = NULL; >>> else >>> bo->ttm = NULL; >>> >>> And i think it's the correct one. Note that current one lead >>> to oups because then you unreserve a NULL ptr. >>> >>> >>> >> Jerome, >> You're right. Does that fix your list corruption? >> >> /Thomas >> > > Hi, Jerome. > I have no more list corruption but my tree have few other fixes > it would be nice if you could check if my not missunderstanding > vm code (patch attached). > > I'll try to have a lock early next week. > Also i was reviewing the bo move code and i think we should only > call ttm_tt_bind if mem_type is TTM_TT, right now you call bind > if mem_type != SYSTEM, so far i never saw somethings else than > TTM_TT being bind but better be safe than sorry. > Strictly, all mem types except those with TTM_MEMTYPE_FLAG_FIXED are assumed to be dynamic, not only TTM_PL_TT, so the code you're referring to is only called if it's a dynamic memory type, and then we need to bind if it's not SYSTEM placement, so it should be correct. A usage example would be a system with a distinct memory type for a particular aperture range. (Poulsbo for example). > A related question is what does ttm do when moving a bo from > vram to system ? My understanding is that it creates a tt object > and bind it but can the driver move callback ever get > old_memtype = VRAM & new_memtype=SYSTEM ? > Yes. TTM should hit the driver move hook, and if there is non fall back to move_memcpy. Openchrome uses its PCI DMA engines to move from VRAM in this way (via_move_dmablit()). However, Openchrome is probably different than the Radeon's in this respect since the PCI DMA engines are not programmed using the 3D fifo, so DRM has to synchronize the engines using fences. We're currently not using these engines for moves from SYSTEM to VRAM since it's limited to about 133MB/s and also completely saturates the PCI bus, (even if the processor is off-loaded). > Btw i try to isolate all my ttm stuff in my drm-next-ttm branch > on fdo: > http://cgit.freedesktop.org/~glisse/drm-next/ > > > Cheers, > Jerome > Thanks, Thomas |
From: Jerome G. <gl...@fr...> - 2009-04-18 16:54:02
|
Hi Thomas I am getting massive error on x86_64, things like : BUG: Bad page map in process gnome-session pte:1f1d1d1d01000000 pmd:321a6067 keep filling the log until very bad things happen. Do you have any idea what might cause that in ttm ? My assumption is that ttm vm code is guilty their. Note that on x86 exact same code seem to run fine. All this with 2.6.29 final. Cheers, Jerome Glisse |
From: Thomas H. <th...@sh...> - 2009-04-18 19:06:25
|
Jerome Glisse wrote: > Hi Thomas > > I am getting massive error on x86_64, things like : > BUG: Bad page map in process gnome-session pte:1f1d1d1d01000000 > pmd:321a6067 > keep filling the log until very bad things happen. > Do you have any idea what might cause that in ttm ? > My assumption is that ttm vm code is guilty their. > Note that on x86 exact same code seem to run fine. > All this with 2.6.29 final. > > Cheers, > Jerome Glisse > > Hi, Jerome! The TTM code may well be guilty here. I haven't tested x86-64 for a while, but I can probably give it a try on openChrome next week. /Thomas |
From: Jerome G. <gl...@fr...> - 2009-04-19 12:32:43
|
On Sat, 2009-04-18 at 21:06 +0200, Thomas Hellström wrote: > Jerome Glisse wrote: > > Hi Thomas > > > > I am getting massive error on x86_64, things like : > > BUG: Bad page map in process gnome-session pte:1f1d1d1d01000000 > > pmd:321a6067 > > keep filling the log until very bad things happen. > > Do you have any idea what might cause that in ttm ? > > My assumption is that ttm vm code is guilty their. > > Note that on x86 exact same code seem to run fine. > > All this with 2.6.29 final. > > > > Cheers, > > Jerome Glisse > > > > > Hi, Jerome! > > The TTM code may well be guilty here. I haven't tested x86-64 for a > while, but I can probably give it a try on openChrome next week. > > /Thomas The issue seems to be related with vmap (called through ttm_bo_kmap). on big area (1M). I am trying to find out why (btw i tried with 2.6.30-rc2 Linus' tree as of today). Cheers, Jerome |
From: Jerome G. <gl...@fr...> - 2009-04-19 13:14:47
|
On Sat, 2009-04-18 at 21:06 +0200, Thomas Hellström wrote: > Jerome Glisse wrote: > > Hi Thomas > > > > I am getting massive error on x86_64, things like : > > BUG: Bad page map in process gnome-session pte:1f1d1d1d01000000 > > pmd:321a6067 > > keep filling the log until very bad things happen. > > Do you have any idea what might cause that in ttm ? > > My assumption is that ttm vm code is guilty their. > > Note that on x86 exact same code seem to run fine. > > All this with 2.6.29 final. > > > > Cheers, > > Jerome Glisse > > > > > Hi, Jerome! > > The TTM code may well be guilty here. I haven't tested x86-64 for a > while, but I can probably give it a try on openChrome next week. > > /Thomas > Definitely seems to be vmap code, i striped down my driver to do device init, vram init, tt init, than i allocate 1M TT object pin it, kmap it, memset it, then if i start doing anythings which will involve memory allocation/deallocation in the system it will trigger corruption into process pte. If i have time today i will try to do a fake drm driver to exhibit this and commit it to my drm tree so you can test yourself. Cheers, Jerome |
From: Jerome G. <gl...@fr...> - 2009-04-19 15:57:37
|
On Sat, 2009-04-18 at 21:06 +0200, Thomas Hellström wrote: > Jerome Glisse wrote: > > Hi Thomas > > > > I am getting massive error on x86_64, things like : > > BUG: Bad page map in process gnome-session pte:1f1d1d1d01000000 > > pmd:321a6067 > > keep filling the log until very bad things happen. > > Do you have any idea what might cause that in ttm ? > > My assumption is that ttm vm code is guilty their. > > Note that on x86 exact same code seem to run fine. > > All this with 2.6.29 final. > > > > Cheers, > > Jerome Glisse > > > > > Hi, Jerome! > > The TTM code may well be guilty here. I haven't tested x86-64 for a > while, but I can probably give it a try on openChrome next week. > > /Thomas Okay so i really narrowed it down to asking for WC memory, so my guess is that either my CPU (AMD Athlon(tm) Dual Core Processor 4450e) have PAT issue either TTM PAT/WC code is wrong somehow. I haven't time yet to go deeper with this but i think it worked on an Intel Core2 CPU with x86-64. Also it seems other people doesn't have the issue with WC on x86-64. PS: Sorry for all the noise, the bug didn't always showed up quickly so i had false feeling. I am yet unsure it's fully fixed but so far all my test case which triggered it seems to work fine. Cheers, Jerome |
From: Thomas H. <th...@sh...> - 2009-04-19 16:21:29
|
Jerome Glisse wrote: > On Sat, 2009-04-18 at 21:06 +0200, Thomas Hellström wrote: > >> Jerome Glisse wrote: >> >>> Hi Thomas >>> >>> I am getting massive error on x86_64, things like : >>> BUG: Bad page map in process gnome-session pte:1f1d1d1d01000000 >>> pmd:321a6067 >>> keep filling the log until very bad things happen. >>> Do you have any idea what might cause that in ttm ? >>> My assumption is that ttm vm code is guilty their. >>> Note that on x86 exact same code seem to run fine. >>> All this with 2.6.29 final. >>> >>> Cheers, >>> Jerome Glisse >>> >>> >>> >> Hi, Jerome! >> >> The TTM code may well be guilty here. I haven't tested x86-64 for a >> while, but I can probably give it a try on openChrome next week. >> >> /Thomas >> > > Okay so i really narrowed it down to asking for WC memory, so my guess > is that either my CPU (AMD Athlon(tm) Dual Core Processor 4450e) have > PAT issue either TTM PAT/WC code is wrong somehow. I haven't time yet > to go deeper with this but i think it worked on an Intel Core2 CPU > with x86-64. Also it seems other people doesn't have the issue with > WC on x86-64. > > PS: Sorry for all the noise, the bug didn't always showed up quickly so > i had false feeling. I am yet unsure it's fully fixed but so far all my > test case which triggered it seems to work fine. > > Cheers, > Jerome > > Jerome, I think I missed what the fix was? Anyway, for PAT-aware kernels the drm git TTM code is always assuming PAT is enabled and working. For production kernels, the function pgprot_ttm_x86_wc should be replaced by an exported version of x86 pgprot_writecombine. FWIW x86-64 seems to work fine here on a 2.6.27 kernel Athlon-64 single-core on openChrome. /Thomas |
From: Jerome G. <gl...@fr...> - 2009-04-19 20:49:13
|
On Sun, 2009-04-19 at 18:21 +0200, Thomas Hellström wrote: > Jerome Glisse wrote: > > On Sat, 2009-04-18 at 21:06 +0200, Thomas Hellström wrote: > > > >> Jerome Glisse wrote: > >> > >>> Hi Thomas > >>> > >>> I am getting massive error on x86_64, things like : > >>> BUG: Bad page map in process gnome-session pte:1f1d1d1d01000000 > >>> pmd:321a6067 > >>> keep filling the log until very bad things happen. > >>> Do you have any idea what might cause that in ttm ? > >>> My assumption is that ttm vm code is guilty their. > >>> Note that on x86 exact same code seem to run fine. > >>> All this with 2.6.29 final. > >>> > >>> Cheers, > >>> Jerome Glisse > >>> > >>> > >>> > >> Hi, Jerome! > >> > >> The TTM code may well be guilty here. I haven't tested x86-64 for a > >> while, but I can probably give it a try on openChrome next week. > >> > >> /Thomas > >> > > > > Okay so i really narrowed it down to asking for WC memory, so my guess > > is that either my CPU (AMD Athlon(tm) Dual Core Processor 4450e) have > > PAT issue either TTM PAT/WC code is wrong somehow. I haven't time yet > > to go deeper with this but i think it worked on an Intel Core2 CPU > > with x86-64. Also it seems other people doesn't have the issue with > > WC on x86-64. > > > > PS: Sorry for all the noise, the bug didn't always showed up quickly so > > i had false feeling. I am yet unsure it's fully fixed but so far all my > > test case which triggered it seems to work fine. > > > > Cheers, > > Jerome > > > > > > Jerome, > I think I missed what the fix was? > > Anyway, for PAT-aware kernels the drm git TTM code is always assuming > PAT is enabled and working. > For production kernels, the function > > pgprot_ttm_x86_wc > > should be replaced by an exported version of x86 pgprot_writecombine. > > FWIW x86-64 seems to work fine here on a 2.6.27 kernel Athlon-64 > single-core on openChrome. > > /Thomas Fix is to not ask for WC memory, i will look at this after i cleanup all my hack to track down this. Cheers, Jerome |
From: Thomas H. <the...@vm...> - 2009-04-20 10:14:28
|
Jerome Glisse wrote: > On Sun, 2009-04-19 at 18:21 +0200, Thomas Hellström wrote: > >> Jerome Glisse wrote: >> >>> On Sat, 2009-04-18 at 21:06 +0200, Thomas Hellström wrote: >>> >>> >>>> Jerome Glisse wrote: >>>> >>>> >>>>> Hi Thomas >>>>> >>>>> I am getting massive error on x86_64, things like : >>>>> BUG: Bad page map in process gnome-session pte:1f1d1d1d01000000 >>>>> pmd:321a6067 >>>>> keep filling the log until very bad things happen. >>>>> Do you have any idea what might cause that in ttm ? >>>>> My assumption is that ttm vm code is guilty their. >>>>> Note that on x86 exact same code seem to run fine. >>>>> All this with 2.6.29 final. >>>>> >>>>> Cheers, >>>>> Jerome Glisse >>>>> >>>>> >>>>> >>>>> >>>> Hi, Jerome! >>>> >>>> The TTM code may well be guilty here. I haven't tested x86-64 for a >>>> while, but I can probably give it a try on openChrome next week. >>>> >>>> /Thomas >>>> >>>> >>> Okay so i really narrowed it down to asking for WC memory, so my guess >>> is that either my CPU (AMD Athlon(tm) Dual Core Processor 4450e) have >>> PAT issue either TTM PAT/WC code is wrong somehow. I haven't time yet >>> to go deeper with this but i think it worked on an Intel Core2 CPU >>> with x86-64. Also it seems other people doesn't have the issue with >>> WC on x86-64. >>> >>> PS: Sorry for all the noise, the bug didn't always showed up quickly so >>> i had false feeling. I am yet unsure it's fully fixed but so far all my >>> test case which triggered it seems to work fine. >>> >>> Cheers, >>> Jerome >>> >>> >>> >> Jerome, >> I think I missed what the fix was? >> >> Anyway, for PAT-aware kernels the drm git TTM code is always assuming >> PAT is enabled and working. >> For production kernels, the function >> >> pgprot_ttm_x86_wc >> >> should be replaced by an exported version of x86 pgprot_writecombine. >> >> FWIW x86-64 seems to work fine here on a 2.6.27 kernel Athlon-64 >> single-core on openChrome. >> >> /Thomas >> > > > Fix is to not ask for WC memory, i will look at this after i cleanup > all my hack to track down this. > > Cheers, > Jerome > > OK. If there was an erratum causing PAT not to be enabled on your processor, then definitely that may have cause strange inconsistencies. Thanks, Thomas. |
From: Jerome G. <gl...@fr...> - 2009-04-20 12:03:09
|
On Mon, 2009-04-20 at 12:13 +0200, Thomas Hellstrom wrote: > Jerome Glisse wrote: > > On Sun, 2009-04-19 at 18:21 +0200, Thomas Hellström wrote: > > > >> Jerome Glisse wrote: > >> > >>> On Sat, 2009-04-18 at 21:06 +0200, Thomas Hellström wrote: > >>> > >>> > >>>> Jerome Glisse wrote: > >>>> > >>>> > >>>>> Hi Thomas > >>>>> > >>>>> I am getting massive error on x86_64, things like : > >>>>> BUG: Bad page map in process gnome-session pte:1f1d1d1d01000000 > >>>>> pmd:321a6067 > >>>>> keep filling the log until very bad things happen. > >>>>> Do you have any idea what might cause that in ttm ? > >>>>> My assumption is that ttm vm code is guilty their. > >>>>> Note that on x86 exact same code seem to run fine. > >>>>> All this with 2.6.29 final. > >>>>> > >>>>> Cheers, > >>>>> Jerome Glisse > >>>>> > >>>>> > >>>>> > >>>>> > >>>> Hi, Jerome! > >>>> > >>>> The TTM code may well be guilty here. I haven't tested x86-64 for a > >>>> while, but I can probably give it a try on openChrome next week. > >>>> > >>>> /Thomas > >>>> > >>>> > >>> Okay so i really narrowed it down to asking for WC memory, so my guess > >>> is that either my CPU (AMD Athlon(tm) Dual Core Processor 4450e) have > >>> PAT issue either TTM PAT/WC code is wrong somehow. I haven't time yet > >>> to go deeper with this but i think it worked on an Intel Core2 CPU > >>> with x86-64. Also it seems other people doesn't have the issue with > >>> WC on x86-64. > >>> > >>> PS: Sorry for all the noise, the bug didn't always showed up quickly so > >>> i had false feeling. I am yet unsure it's fully fixed but so far all my > >>> test case which triggered it seems to work fine. > >>> > >>> Cheers, > >>> Jerome > >>> > >>> > >>> > >> Jerome, > >> I think I missed what the fix was? > >> > >> Anyway, for PAT-aware kernels the drm git TTM code is always assuming > >> PAT is enabled and working. > >> For production kernels, the function > >> > >> pgprot_ttm_x86_wc > >> > >> should be replaced by an exported version of x86 pgprot_writecombine. > >> > >> FWIW x86-64 seems to work fine here on a 2.6.27 kernel Athlon-64 > >> single-core on openChrome. > >> > >> /Thomas > >> > > > > > > Fix is to not ask for WC memory, i will look at this after i cleanup > > all my hack to track down this. > > > > Cheers, > > Jerome > > > > > OK. > > If there was an erratum causing PAT not to be enabled on your processor, > then definitely that > may have cause strange inconsistencies. > > Thanks, > Thomas. I think ttm_tt caching stuff does follow kernel policies outlined in Documentation/x86/pat.txt well at least from understanding of code i have right now through (call chains being sometimes hard to fully follow). As also have another issue it seems that calling set_memory_(uc|wc) while suspending lockup the cpu or at least doesn't return, is this somethings i should expect ? Cheers, Jerome |
From: Jerome G. <gl...@fr...> - 2009-04-20 12:18:18
|
On Mon, 2009-04-20 at 14:02 +0200, Jerome Glisse wrote: > On Mon, 2009-04-20 at 12:13 +0200, Thomas Hellstrom wrote: > > > > If there was an erratum causing PAT not to be enabled on your processor, > > then definitely that > > may have cause strange inconsistencies. > > > > Thanks, > > Thomas. > > I think ttm_tt caching stuff does follow kernel policies outlined > in Documentation/x86/pat.txt well at least from understanding of > code i have right now through (call chains being sometimes hard > to fully follow). As also have another issue it seems that calling > set_memory_(uc|wc) while suspending lockup the cpu or at least doesn't > return, is this somethings i should expect ? > > Cheers, > Jerome > Speaking about caching i think we should only call ttm_tt_set_placement_caching in mmap function as otherwise it's useless to waste time changing cache policy, even for bo_move_memcpy it should be fine as it's mostly about moving from wc (vram io mapped) to system or system to (wc io mapped) region, maybe move from system to tt might be problematic in terms of cache. Cheers, Jerome |
From: Thomas H. <th...@sh...> - 2009-04-20 12:28:07
|
Jerome Glisse wrote: > On Mon, 2009-04-20 at 12:13 +0200, Thomas Hellstrom wrote: > >> Jerome Glisse wrote: >> >>> On Sun, 2009-04-19 at 18:21 +0200, Thomas Hellström wrote: >>> >>> >>>> Jerome Glisse wrote: >>>> >>>> >>>>> On Sat, 2009-04-18 at 21:06 +0200, Thomas Hellström wrote: >>>>> >>>>> >>>>> >>>>>> Jerome Glisse wrote: >>>>>> >>>>>> >>>>>> >>>>>>> Hi Thomas >>>>>>> >>>>>>> I am getting massive error on x86_64, things like : >>>>>>> BUG: Bad page map in process gnome-session pte:1f1d1d1d01000000 >>>>>>> pmd:321a6067 >>>>>>> keep filling the log until very bad things happen. >>>>>>> Do you have any idea what might cause that in ttm ? >>>>>>> My assumption is that ttm vm code is guilty their. >>>>>>> Note that on x86 exact same code seem to run fine. >>>>>>> All this with 2.6.29 final. >>>>>>> >>>>>>> Cheers, >>>>>>> Jerome Glisse >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>> Hi, Jerome! >>>>>> >>>>>> The TTM code may well be guilty here. I haven't tested x86-64 for a >>>>>> while, but I can probably give it a try on openChrome next week. >>>>>> >>>>>> /Thomas >>>>>> >>>>>> >>>>>> >>>>> Okay so i really narrowed it down to asking for WC memory, so my guess >>>>> is that either my CPU (AMD Athlon(tm) Dual Core Processor 4450e) have >>>>> PAT issue either TTM PAT/WC code is wrong somehow. I haven't time yet >>>>> to go deeper with this but i think it worked on an Intel Core2 CPU >>>>> with x86-64. Also it seems other people doesn't have the issue with >>>>> WC on x86-64. >>>>> >>>>> PS: Sorry for all the noise, the bug didn't always showed up quickly so >>>>> i had false feeling. I am yet unsure it's fully fixed but so far all my >>>>> test case which triggered it seems to work fine. >>>>> >>>>> Cheers, >>>>> Jerome >>>>> >>>>> >>>>> >>>>> >>>> Jerome, >>>> I think I missed what the fix was? >>>> >>>> Anyway, for PAT-aware kernels the drm git TTM code is always assuming >>>> PAT is enabled and working. >>>> For production kernels, the function >>>> >>>> pgprot_ttm_x86_wc >>>> >>>> should be replaced by an exported version of x86 pgprot_writecombine. >>>> >>>> FWIW x86-64 seems to work fine here on a 2.6.27 kernel Athlon-64 >>>> single-core on openChrome. >>>> >>>> /Thomas >>>> >>>> >>> Fix is to not ask for WC memory, i will look at this after i cleanup >>> all my hack to track down this. >>> >>> Cheers, >>> Jerome >>> >>> >>> >> OK. >> >> If there was an erratum causing PAT not to be enabled on your processor, >> then definitely that >> may have cause strange inconsistencies. >> >> Thanks, >> Thomas. >> > > I think ttm_tt caching stuff does follow kernel policies outlined > in Documentation/x86/pat.txt well at least from understanding of > code i have right now through (call chains being sometimes hard > to fully follow). As also have another issue it seems that calling > set_memory_(uc|wc) while suspending lockup the cpu or at least doesn't > return, is this somethings i should expect ? > > I'm not 100% sure about whether it's save to call set_memory_xx while suspending. Certainly the Moorestown and openChrome drivers swap out all buffers before suspending, which will involve calls to set_memory_wb(), and that seems to work, but I'm not 100% sure it's legal. /Thomas > Cheers, > Jerome > > |
From: Thomas H. <the...@vm...> - 2009-04-14 08:13:29
|
Jerome, looks good overall. Some comments inline. Jerome Glisse wrote: > > I have no more list corruption but my tree have few other fixes > it would be nice if you could check if my not missunderstanding > vm code (patch attached). > > > Cheers, > Jerome > > ------------------------------------------------------------------------ > > +static void ttm_bo_device_print_free_tt_size(struct ttm_bo_device *bdev) > +{ > + unsigned max_size = 0; > + unsigned free_size = 0; > + struct drm_mm_node *entry; > + struct list_head *list; > + struct list_head *free_stack; > + > + if (!bdev->man[TTM_PL_TT].has_type || !bdev->man[TTM_PL_TT].use_type) { > + return; > + } > + > + spin_lock(&bdev->lru_lock); > + free_stack = &bdev->man[TTM_PL_TT].manager.fl_entry; > + list_for_each(list, free_stack) { > + entry = list_entry(list, struct drm_mm_node, fl_entry); > + if (entry->size > max_size) { > + max_size = entry->size; > + } > + free_size += entry->size; > + } > + spin_unlock(&bdev->lru_lock); > + max_size = max_size << PAGE_SHIFT; > + free_size = free_size << PAGE_SHIFT; > + printk(KERN_ERR "Aperture biggest free block %ub (%ukb, %uMb)\n", > + max_size, max_size >> 10, max_size >> 20); > + printk(KERN_ERR "Aperture total free size %u bytes (%ukb, %uMb)\n", > + free_size, free_size >> 10, free_size >> 20); > +} > + > Could we put the core of this function in drm_mm.c, so we can use it for any memory type? > int ttm_buffer_object_validate(struct ttm_buffer_object *bo, > bool interruptible, bool no_wait) > { > @@ -937,12 +973,18 @@ > interruptible, no_wait); > if (ret) { > if (ret != -ERESTART) > - printk(KERN_ERR "Failed moving buffer. " > - "Proposed placement 0x%08x\n", > + printk(KERN_ERR "Failed moving buffer of %lub" > + "(%lukb, %luMb). Proposed placement " > + "0x%08x\n", > + bo->num_pages << PAGE_SHIFT, > + (bo->num_pages << PAGE_SHIFT) >> 10, > + (bo->num_pages << PAGE_SHIFT) >> 20, > bo->mem.proposed_flags); > - if (ret == -ENOMEM) > + if (ret == -ENOMEM) { > printk(KERN_ERR "Out of aperture space or " > "DRM memory quota.\n"); > + ttm_bo_device_print_free_tt_size(bo->bdev); > + } > return ret; > Here, I think we should move the error printout to the function caller. In the case we error here because of fragmentation, the correct action of the execbuf code would be to restart buffer validation with the ttm lock in write mode and evict all buffers in the relevant memory types before trying to validate, and then we don't really want any error message until that validation also fails. > > int ttm_fbdev_mmap(struct vm_area_struct *vma, struct ttm_buffer_object *bo) > Looks like the patch generator got a bit confused. This should be ttm_bo_io(), right? Anyway, that function has never been tested, since it appears it confuses the X server polling from the drm device node. I was thinking that we perhaps need a separate TTM device node to implement pread / pwrite correctly. Also one should probably take a look at Intel's implementation if performance with this trivial implementation is an issue. Thanks, Thomas |
From: Thomas H. <th...@sh...> - 2009-04-20 12:50:22
|
Jerome Glisse wrote: > On Mon, 2009-04-20 at 14:02 +0200, Jerome Glisse wrote: > >> On Mon, 2009-04-20 at 12:13 +0200, Thomas Hellstrom wrote: >> >>> If there was an erratum causing PAT not to be enabled on your processor, >>> then definitely that >>> may have cause strange inconsistencies. >>> >>> Thanks, >>> Thomas. >>> >> I think ttm_tt caching stuff does follow kernel policies outlined >> in Documentation/x86/pat.txt well at least from understanding of >> code i have right now through (call chains being sometimes hard >> to fully follow). As also have another issue it seems that calling >> set_memory_(uc|wc) while suspending lockup the cpu or at least doesn't >> return, is this somethings i should expect ? >> >> Cheers, >> Jerome >> >> > > Speaking about caching i think we should only call > ttm_tt_set_placement_caching in mmap function as otherwise it's > useless to waste time changing cache policy, > Jerome, The current code should never change caching policy when it's not strictly needed. If you hit such a case, it's probably a bug. Could you illustrate a case where you're seeing this? /Thomas > Cheers, > Jerome > > > ------------------------------------------------------------------------------ > Stay on top of everything new and different, both inside and > around Java (TM) technology - register by April 22, and save > $200 on the JavaOne (SM) conference, June 2-5, 2009, San Francisco. > 300 plus technical and hands-on sessions. Register today. > Use priority code J9JMT32. http://p.sf.net/sfu/p > -- > _______________________________________________ > Dri-devel mailing list > Dri...@li... > https://lists.sourceforge.net/lists/listinfo/dri-devel > |
From: Jerome G. <gl...@fr...> - 2009-04-20 12:58:27
|
On Mon, 2009-04-20 at 14:50 +0200, Thomas Hellström wrote: > Jerome Glisse wrote: > > On Mon, 2009-04-20 at 14:02 +0200, Jerome Glisse wrote: > > > >> On Mon, 2009-04-20 at 12:13 +0200, Thomas Hellstrom wrote: > >> > >>> If there was an erratum causing PAT not to be enabled on your processor, > >>> then definitely that > >>> may have cause strange inconsistencies. > >>> > >>> Thanks, > >>> Thomas. > >>> > >> I think ttm_tt caching stuff does follow kernel policies outlined > >> in Documentation/x86/pat.txt well at least from understanding of > >> code i have right now through (call chains being sometimes hard > >> to fully follow). As also have another issue it seems that calling > >> set_memory_(uc|wc) while suspending lockup the cpu or at least doesn't > >> return, is this somethings i should expect ? > >> > >> Cheers, > >> Jerome > >> > >> > > > > Speaking about caching i think we should only call > > ttm_tt_set_placement_caching in mmap function as otherwise it's > > useless to waste time changing cache policy, > > > Jerome, > The current code should never change caching policy when it's not > strictly needed. > If you hit such a case, it's probably a bug. > > Could you illustrate a case where you're seeing this? > > /Thomas > My problem is on vram eviction on suspend, vram memory is flag as WC so it allocate a ttm and try to set it to wc which fails badly. This why i was thinking of setting cache properties in mmap and not in validate or a path hit by bo_move. Btw i think set_memory_wb is fine only wc|uc isn't i will try to take a look at kernel code see if i see anythings obvious or then ask on lkml what to know if it should work or not. Cheers, Jerome |
From: Thomas H. <th...@sh...> - 2009-04-20 13:09:35
|
Jerome Glisse wrote: > On Mon, 2009-04-20 at 14:50 +0200, Thomas Hellström wrote: > >> Jerome Glisse wrote: >> >>> On Mon, 2009-04-20 at 14:02 +0200, Jerome Glisse wrote: >>> >>> >>>> On Mon, 2009-04-20 at 12:13 +0200, Thomas Hellstrom wrote: >>>> >>>> >>>>> If there was an erratum causing PAT not to be enabled on your processor, >>>>> then definitely that >>>>> may have cause strange inconsistencies. >>>>> >>>>> Thanks, >>>>> Thomas. >>>>> >>>>> >>>> I think ttm_tt caching stuff does follow kernel policies outlined >>>> in Documentation/x86/pat.txt well at least from understanding of >>>> code i have right now through (call chains being sometimes hard >>>> to fully follow). As also have another issue it seems that calling >>>> set_memory_(uc|wc) while suspending lockup the cpu or at least doesn't >>>> return, is this somethings i should expect ? >>>> >>>> Cheers, >>>> Jerome >>>> >>>> >>>> >>> Speaking about caching i think we should only call >>> ttm_tt_set_placement_caching in mmap function as otherwise it's >>> useless to waste time changing cache policy, >>> >>> >> Jerome, >> The current code should never change caching policy when it's not >> strictly needed. >> If you hit such a case, it's probably a bug. >> >> Could you illustrate a case where you're seeing this? >> >> /Thomas >> >> > > My problem is on vram eviction on suspend, vram memory is flag as WC > so it allocate a ttm and try to set it to wc which fails badly. This > why i was thinking of setting cache properties in mmap and not in > validate or a path hit by bo_move. > > Ah, OK. You could adjust this behavior in bo_driver::evict_flags. For VRAM eviction, you can do: return (cur_placement &~TTM_PL_MASK_CACHING) | TTM_PL_FLAG_SYSTEM | TTM_PL_FLAG_CACHED; Looks like openChrome may also benefit from this... /Thomas |