Hi.
When I run CUDA-Z, it is able to detect all 4 GPUs in my system, but only able to run tests on the first GPU. Since it seems to get information on the other GPUs, and since I can run bandwidthTest on all the GPUs, I suspect that it should work. Any ideas?
I'm running the latest Linux x64 nvidia driver, 310.32, CentOS 6.3.
I'm running CUDA 5.0.35.
Any assistance would be great..
Can you see the other GPUs in the list box at the bottom of the CUDA-Z window?
If you select another (non-first) GPU in this list, how does it react? Does it show info about the selected GPU? Does it run tests? What does the console logging say? CUDA-Z prints a lot of logging there.
Which version of CUDA-Z are you running? Check the latest beta from today.
Yep.. in the listbox, I see all the GPUs.
CUDA-Z prints the details for the other cards for Core and Memory, but when you go to the performance tab for them, you just see "--" for all the numbers for anything but the first card.
I'm using the CUDA-Z build from today, and I tried the earlier one too with the same result.
I'm sure it's a simple bug. Obviously CUDA-Z can see the cards -- just not sure why it's not getting the performance numbers.
Earlier I checked the console log and did see some [Hardware Error] messages: corrected APEI PCIe errors reported by the root ports at 0000:00:02.0, 0000:00:03.0 and 0000:80:03.0 (vendor_id 0x8086, aer_layer=Transaction Layer). But it happened once, and I don't see it happening again. I then did a "dmesg -c" as root to clear the console log, ran CUDA-Z again, and now in dmesg I only see:
NVRM: GPU at 0000:02:00: GPU-0e71757b-9243-6f48-2742-8179496beb14
NVRM: GPU at 0000:03:00: GPU-57f66cc2-a378-f0b9-88a8-e67c2d4251f3
NVRM: GPU at 0000:83:00: GPU-e7104aa1-bb8b-90ae-eb7c-2f19139812c8
NVRM: GPU at 0000:84:00: GPU-feeeedcb-5f3b-0fa7-96db-d1b2e33799f7
... but same result.
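For cross-checking which devices the kernel driver actually registered, those NVRM lines can be parsed mechanically. A throwaway sketch (the regex is based only on the line format shown above; function and variable names are mine):

```python
import re

# Matches lines like:
#   NVRM: GPU at 0000:02:00: GPU-0e71757b-9243-6f48-2742-8179496beb14
NVRM_RE = re.compile(
    r"NVRM: GPU at ([0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}): (GPU-[0-9a-f-]+)"
)

def registered_gpus(dmesg_text):
    """Return (pci_bus, uuid) pairs for every GPU the driver logged."""
    return NVRM_RE.findall(dmesg_text)

log = """\
NVRM: GPU at 0000:02:00: GPU-0e71757b-9243-6f48-2742-8179496beb14
NVRM: GPU at 0000:83:00: GPU-e7104aa1-bb8b-90ae-eb7c-2f19139812c8
"""
print(registered_gpus(log))
```

Counting the pairs is a quick sanity check that all four cards made it past driver initialization.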
Actually, I have 4 x GTX 670 in the system, but earlier today, I tried one GTX 580 along with 3 GTX 670, and the results were the same. What I can do is give you access to the system if you're interested. Email me at jas at cse.yorku.ca, and we will set something up.
Yo! Really heavy config! ;) I can only guess what you need this monster for.
So, I've made a small test at home with
0: GTX 285 (yes, I still sit on this one)
1: 8400GS (some backup card)
on Windows, Linux and MacOSX.
The second card shows performance values just like the first one, and it shows realistic values, not the same numbers twice.
From attached file:
Version: 0.7.175 SVN Built Feb 25 2013 10:04:59 http://cuda-z.sf.net/
OS Version: Linux 3.5.0-25-generic #38-Ubuntu SMP Mon Feb 18 23:27:42 UTC 2013 x86_64
Driver Version: 310.14
Driver Dll Version: 5.0 (310.14)
Runtime Dll Version: 4.20 (4)
I have Ubuntu 12.10 x64 desktop OS.
What I can conclude:
1. a double-card configuration works well: there is no failure in the algorithm here.
2. my configuration is not symmetric: the values are different, which proves the algorithm is OK.
3. my configuration is not as heavy as yours: the problem may be here.
3.a. we have a bug in CUDA-Z which shows up only on heavy configurations.
3.b. there are problems in the NVIDIA driver that prevent it from working well.
3.c. there are problems in the Linux kernel that prevent it from working well.
3.d. etc.
What I can suggest now:
1. strip your monster down to a one-card configuration and test that CUDA-Z works well.
2. build your monster up to a two-card configuration and test that CUDA-Z works well.
3. ... three ...
4. ... four ...
Try to find out when it stops working. I have tested CUDA-Z myself only with a three-card config - I simply have only three PCIe slots in my PC. It might be that a four-card config is affected by the black magic of today's full moon, but I don't really believe in that. It most probably works for four cards too.
If you can capture CUDA-Z logging from the console, drop it here along with the export information of your four-card configuration.
WBR,
AG
PS: Your logging from before with [Hardware Error] looks unrelated at first glance, but 0x8086:0x3c04 is the Intel Xeon E5 / Core i7 IIO PCI Express Root Port 2a, so it can be related somehow. Bad contacts, dust in the case, full moon, etc.
Actually, the machine is a GPU compute machine for a faculty member at our university. We're having some problems with the performance of the cards in the system, which is why I'm trying to use CUDA-Z to help diagnose. In particular..
GTX 580 on old core i7 system from 2009: Host Pageable to Device: 3993.62 MiB/s
GTX670 on new system: Host Pageable to Device: 2615.68 MiB/s
GTX670 on old core i7 system from 2009: Host Pageable to Device: 4253.14 MiB/s
The new system has 64 GB of DDR3-1600 MHz memory with dual Xeon E5 processors, so it should be fast! The CPUs aren't quite the same, but I don't think this result is due to the CPU.
It's a lot of work to remove the cards from the system because it is rack mounted, and I have to take it down to get into it. If I gave you access to the system, I wonder if you might be able to insert debugging code into CUDA-Z to help with the diagnostics...
Jason.
Interesting to note that GPUs 0, 1, and 3 get no performance results; only GPU 2 does.
Here's what I see on the console for the ones not working:
Waiting for new loop...
Timer shot -> update performance for device 3 in mode 0
Rising update action for device 3
Thread loop started
Alloc local buffers for GeForce GTX 670.
Alloc host pageable for GeForce GTX 670.
Host pageable is at 0xF28FF008.
Alloc host pinned for GeForce GTX 670.
Waiting for new loop...
Timer shot -> update performance for device 3 in mode 0
Rising update action for device 3
Thread loop started
Alloc local buffers for GeForce GTX 670.
Alloc host pageable for GeForce GTX 670.
Host pageable is at 0xF28FF008.
Alloc host pinned for GeForce GTX 670.
Waiting for new loop...
From the logging you posted here, it looks like only one card can get a proper host pinned buffer. Not enough kernel RAM???
This is a normal start up initialization:
Thread started
Selecting GeForce GTX 285.
Alloc local buffers for GeForce GTX 285.
Alloc host pageable for GeForce GTX 285.
Host pageable is at 0xF4AFF008.
Alloc host pinned for GeForce GTX 285. <<-- after this your logging breaks.
Rising update action for device -1
Host pinned is at 0xF25B5000.
Alloc device buffer 1 for GeForce GTX 285.
Device buffer 1 is at 0x00210000.
Alloc device buffer 2 for GeForce GTX 285.
Device buffer 2 is at 0x01210000.
Waiting for new loop...
This points somehow to problems 3.b. and 3.c.
Helping you with debugging would be a good opportunity for me, but I can't take it up right now because it's midnight at my location (Europe)... :(
I do really believe it's a HW issue. You'll have to open this PC and try the cards one by one, I believe... At least that's what I would do in such a case.
See you!
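For what it's worth, the comparison between a healthy log and a broken one can be automated: walk the expected allocation steps and report the first attempt whose confirmation line never appears. A throwaway sketch (the step names are taken from the log lines quoted in this thread; function names are mine):

```python
# Pairs of (attempt prefix, confirmation prefix) from the CUDA-Z console
# log, in the order they appear for a healthy device.
EXPECTED = [
    ("Alloc host pageable", "Host pageable is at"),
    ("Alloc host pinned", "Host pinned is at"),
    ("Alloc device buffer 1", "Device buffer 1 is at"),
    ("Alloc device buffer 2", "Device buffer 2 is at"),
]

def first_failed_step(log_lines):
    """Return the first allocation step that was attempted but never
    confirmed, or None if every attempted step got its confirmation."""
    for attempt, confirm in EXPECTED:
        attempted = any(l.startswith(attempt) for l in log_lines)
        confirmed = any(l.startswith(confirm) for l in log_lines)
        if attempted and not confirmed:
            return attempt
    return None

# The broken sequence from this thread: pinned alloc never completes.
broken = [
    "Alloc local buffers for GeForce GTX 670.",
    "Alloc host pageable for GeForce GTX 670.",
    "Host pageable is at 0xF28FF008.",
    "Alloc host pinned for GeForce GTX 670.",
    "Waiting for new loop...",
]
print(first_failed_step(broken))
```

Run against the logs above, this flags the pinned-buffer allocation as the step where the non-working cards stop.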
Thread loop started
Alloc local buffers for GeForce GTX 670.
Alloc host pageable for GeForce GTX 670.
Host pageable is at 0xF2DFD008.
Alloc host pinned for GeForce GTX 670.
Waiting for new loop...
Timer shot -> update performance for device 1 in mode 0
Rising update action for device 1
Thread loop started
Alloc local buffers for GeForce GTX 670.
Alloc host pageable for GeForce GTX 670.
Host pageable is at 0xF2DFD008.
Alloc host pinned for GeForce GTX 670.
Waiting for new loop...
So...
With 2 cards in the system, same problem as before.
It's not the specific card, because I had the same problem when I had one GTX 580 and one GTX 670 in the system as well.
It's not the kernel or the CUDA version, because I have the same kernel running on another board (a different motherboard with a Core i7); it has a 580 and a Tesla C2050 and it recognizes both. In fact, I swapped the 580 from there with a 670, and it works there as well.
The issue is probably with the motherboard, but the motherboard passes Intel's full test suite successfully.
I'll bet it's a hardware bug that needs to be fixed, but I have no idea how to present this problem in a way that they can identify it and correct it.
sigh.
Did you try to play with BIOS settings?
If my assumption is correct, there is not enough paged memory available even for 2 cards (and these are "heavy" cards), and you may try to tweak this in the BIOS settings.
The kernel may have different parameters/characteristics on another PC. If your HW is too new for your current kernel, the kernel may use some kind of "safe" settings which are not really performance-optimized.
Do you have problems with other CUDA programs or only with CUDA-Z? And if yes, what problems do you have with other CUDA SW?
The MB is a server motherboard (W2600CR) that supports 4 x16 slots, so I suspect it shouldn't be a problem..
I couldn't find anything in the BIOS to adjust though...
We're finding that the cards all work from CUDA okay - there is an issue with SPEED, but not an issue with the ability to access the devices.
Yet, bandwidthTest from CUDA works on all 4 cards... here are the results for the 580/670 from both systems..
GTX580

Host to Device Bandwidth (PINNED memory transfers)
Transfer (Bytes)   OLD system (MB/s)   NEW system (MB/s)
1024               234.2               186.5
67186688           5716.4              5746.8

Device to Host Bandwidth
Transfer (Bytes)   OLD system (MB/s)   NEW system (MB/s)
1024               303.1               289.8
67186688           6097.6              6323.6

Device to Device Bandwidth
Transfer (Bytes)   OLD system (MB/s)   NEW system (MB/s)
1024               402.9               269.8
67186688           169949.5            169761.5

GTX670

Host to Device Bandwidth (PINNED memory transfers)
Transfer (Bytes)   OLD system (MB/s)   NEW system (MB/s)
1024               134.7               118.6
67186688           5902.0              5981.4

Device to Host Bandwidth
Transfer (Bytes)   OLD system (MB/s)   NEW system (MB/s)
1024               190.9               193.0
67186688           6095.2              6380.1

Device to Device Bandwidth
Transfer (Bytes)   OLD system (MB/s)   NEW system (MB/s)
1024               622.8               599.6
67186688           151662.7            151589.2
If I went purely on the data above, everything "sort of" seems okay, given that you get those numbers on all the GTX 670s. But CUDA-Z gave these results for host pageable to device (note: the pinned tests above work, and pinned is what CUDA-Z is failing on!):
GTX580 on old system: Host Pageable to Device: 3993.62 MiB/s
GTX580 on new system: Host Pageable to Device: 2500 MiB/s
GTX670 on old system: Host Pageable to Device: 4253.14 MiB/s
GTX670 on new system: Host Pageable to Device: 2615.68 MiB/s
The CUDA-Z result for this test was no different with only 1 card installed either.
Now, the old system has a Core i7-950 at 3.06 GHz and the new one an E5-2620 at 2.00 GHz... but I'm not sure how much of this would depend on the slower clock of the processor...
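One unit pitfall when comparing the two tools: bandwidthTest reports MB/s (10^6 bytes) while CUDA-Z reports MiB/s (2^20 bytes), so the numbers differ by about 4.9% before any real difference shows up. A small helper to put them on the same scale (function names are mine):

```python
def mib_per_s(transfer_bytes, elapsed_ms):
    """Bandwidth in MiB/s from a transfer size and a CUDA-Z elapsed time."""
    return transfer_bytes / (1 << 20) / (elapsed_ms / 1000.0)

def mb_to_mib(mb_per_s):
    """Convert bandwidthTest's MB/s (10^6 bytes) to CUDA-Z's MiB/s (2^20)."""
    return mb_per_s * 1_000_000 / (1 << 20)

# 64 MiB moved in 1000 ms is 64 MiB/s by definition.
print(mib_per_s(64 * (1 << 20), 1000.0))
# A bandwidthTest figure of 4253 MB/s in CUDA-Z units:
print(round(mb_to_mib(4253.0)))
```

Even after this correction, the roughly 1.6x gap between the old and new system's pageable results is far larger than a unit mismatch can explain.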
It seems like there is a problem in CUDA-Z too that shows up in a corner case such as yours.
I'll try to analyze the code and maybe activate some extra logging too. Let's see how I can help here.
I ran it with 2 cards in the system (back to 2 at the moment)..
card 0 works of course...
then when switching to card 1, I only see:
Waiting for new loop...
Switch device -> update performance for device 1
Rising update action for device 1
Thread loop started
Alloc local buffers for GeForce GTX 670.
Alloc host pageable for GeForce GTX 670.
Host pageable is at 0xF0CFF008.
Alloc host pinned for GeForce GTX 670.
CUDA Error: all CUDA-capable devices are busy or unavailable
Waiting for new loop...
Timer shot -> update performance for device 1 in mode 0
Rising update action for device 1
Thread loop started
Alloc local buffers for GeForce GTX 670.
Alloc host pageable for GeForce GTX 670.
Host pageable is at 0xF0CFF008.
Alloc host pinned for GeForce GTX 670.
CUDA Error: all CUDA-capable devices are busy or unavailable
By the way, it was just identified that while the W2600CR motherboard has 4 x16 slots, only 1 of them is x16 electrically; the others are x8. I figured that was the reason for the performance issue with paged memory, but when I put only one 670 in the system, in the x16 slot, the numbers are the same.. argh.
jas.
Yeah.. Disappointing. Never believe in PCIe x16 for the second, third or fourth slot.
I've got a similar issue on my GA EX58-UD5: the second slot works at x16 only if the third is not populated, otherwise both run at x8. Sad but true. Funniest is the fact that if the second slot is not populated, the third is dead.
A similar joke is implemented on the GA 870A-UD3: if the second x16 (electrically x4) slot is populated with an x4 card, two extra x1 ports are disabled. All three of them can work together only when the second x4 slot is populated with an x1 card.
I'm really surprised to see server motherboards have the same issue.
Regarding the CUDA-Z output, I'll try to find out what the reason could be. I presume that in the quad-card configuration three cards give the same error message.
Thanks!
WBR,
AG
Well, with this board you do get 1 x16 and 3 x8 according to the technical spec.. It's not clear how much x8 affects the overall CUDA performance of a card. But with only 1 card in the system at the moment, in the x16 slot (slot 1), I still don't understand the performance differences in paged memory transfers... I don't think I ever will. argh.
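For a rough sense of what link width alone could cost: the standard per-lane, per-direction figures are about 500 MB/s for PCIe 2.0 and about 984.6 MB/s for PCIe 3.0. A tiny calculator (a sketch; the function name is mine):

```python
def pcie_peak_mb_s(gen, lanes):
    """Theoretical per-direction PCIe peak in MB/s, ignoring protocol
    overhead beyond line coding (500 MB/s/lane for gen 2 after 8b/10b,
    ~984.6 MB/s/lane for gen 3 after 128b/130b)."""
    per_lane = {2: 500.0, 3: 984.6}
    return per_lane[gen] * lanes

# x16 vs x8 on PCIe 2.0:
print(pcie_peak_mb_s(2, 16), pcie_peak_mb_s(2, 8))
```

Since the pageable rates reported in this thread (roughly 2.5 to 4.3 GB/s) sit below even the x8 caps, slot width by itself doesn't explain the pageable gap, which matches the observation above.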
It turns out the problem with reduced paged-memory performance had to do with having a second CPU in the system. I removed the second CPU, and performance is back to normal! Imagine that!
I had actually removed all but 1 GTX 670, ensured it was in SLOT 1 (which is managed by CPU 1), and used taskset to ensure that bandwidthTest was running on CPU 1... that didn't work... only removing the second CPU did. I wish someone could explain why it works that way.
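One guess at an explanation (purely an assumption, not verified on this machine): taskset pins only the threads, while pageable buffers can still be allocated from the other socket's memory, so every transfer crosses the inter-CPU link. numactl can bind both CPU and memory to one node; a sketch that merely builds such a command line (numactl's --cpunodebind and --membind are real flags; the GPU-to-node mapping has to come from elsewhere, e.g. /sys/bus/pci/devices/<addr>/numa_node):

```python
def numa_bound_cmd(node, argv):
    """Prefix a benchmark command so that both its CPU scheduling and its
    memory allocation are bound to a single NUMA node."""
    return ["numactl", f"--cpunodebind={node}", f"--membind={node}"] + list(argv)

print(numa_bound_cmd(0, ["./bandwidthTest", "--device=0"]))
```

If the node binding restored the pageable numbers with both CPUs installed, that would confirm the cross-socket-memory theory without pulling a processor.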
Thread loop started
Alloc local buffers for GeForce GTX 670.
Alloc host pageable for GeForce GTX 670.
Host pageable is at 0xF2DFD008.
Alloc host pinned for GeForce GTX 670.
Waiting for new loop...
Timer shot -> update performance for device 1 in mode 0
Rising update action for device 1
Thread loop started
Alloc local buffers for GeForce GTX 670.
Alloc host pageable for GeForce GTX 670.
Host pageable is at 0xF2DFD008.
Alloc host pinned for GeForce GTX 670.
Waiting for new loop...
So...
With 2 cards in the system, same problem as before.
It's not the specific card because I had the same problem when I had one GTX580 and one GTX670 in the system as well.
It's not the kernel or the CUDA version, because I have the same kernel running on another board (different motherboard with a Core i7) that has a 580 and a Tesla C2050, and it recognizes both. In fact, I swapped the 580 from there with a 670, and it works there as well.
That leaves the motherboard as the likely culprit, although the motherboard passes Intel's full test suite successfully.
I'll bet it's a hardware bug that needs to be fixed, but I have no idea how to present this problem to them in a way that lets them identify it and correct it.
sigh.
Did you try to play with BIOS settings?
If my assumption is correct, there is not enough paged memory available even for 2 cards (and these are "heavy" cards), and you may be able to tweak this in the BIOS settings.
The kernel may also have different parameters/characteristics on another PC. If your hardware is too new for your current kernel, the kernel may fall back to some kind of "safe" settings which are not really performance-optimized.
Do you have problems with other CUDA programs or only with CUDA-Z? And if yes, what problems do you have with other CUDA SW?
The MB is a server motherboard (Intel W2600CR) that supports 4 x16 slots, so I suspect that shouldn't be a problem..
I couldn't find anything in the BIOS to adjust though...
We're finding that the cards all work from CUDA okay - there is an issue with SPEED, but not an issue with the ability to access the devices.
Yet, bandwidthTest from CUDA works on all 4 cards... here are the results for the 580/670 from both systems..
GTX580

Host to Device Bandwidth (PINNED Memory Transfers)
  Transfer (Bytes)   OLD system (MB/s)   NEW system (MB/s)
  1024               234.2               186.5
  67186688           5716.4              5746.8

Device to Host Bandwidth
  Transfer (Bytes)   OLD system (MB/s)   NEW system (MB/s)
  1024               303.1               289.8
  67186688           6097.6              6323.6

Device to Device Bandwidth
  Transfer (Bytes)   OLD system (MB/s)   NEW system (MB/s)
  1024               402.9               269.8
  67186688           169949.5            169761.5

GTX670

Host to Device Bandwidth (PINNED Memory Transfers)
  Transfer (Bytes)   OLD system (MB/s)   NEW system (MB/s)
  1024               134.7               118.6
  67186688           5902.0              5981.4

Device to Host Bandwidth
  Transfer (Bytes)   OLD system (MB/s)   NEW system (MB/s)
  1024               190.9               193.0
  67186688           6095.2              6380.1

Device to Device Bandwidth
  Transfer (Bytes)   OLD system (MB/s)   NEW system (MB/s)
  1024               622.8               599.6
  67186688           151662.7            151589.2
If I went purely on the data above, everything "sort of" seems okay, given that the GTX670 numbers are comparable on both systems, but CUDA-Z gave these results for host pageable to device.. (note: the pinned tests above work, which is exactly what CUDA-Z is failing on!)
GTX580 on old system: Host Pageable to Device: 3993.62 MiB/s
GTX580 on new system: Host Pageable to Device: 2500 MiB/s
GTX670 on old system: Host Pageable to Device: 4253.14 MiB/s
GTX670 on new system: Host Pageable to Device: 2615.68 MiB/s
CUDA-Z result was no different for this test with only 1 card installed either.
Now, the old system has a Core i7-950 at 3.06 GHz, and the new one has an E5-2620 at 2.00 GHz... but I'm not sure how much of this would depend on the slower clock of the processor...
By the way, for the record, bandwidthTest --memory=pageable (which I didn't realize existed), works for all the cards as well.
It seems like there is a problem in CUDA-Z too that shows up in such a corner case as yours.
I'll try to analyze the code and maybe activate some extra logging too. Let's see how I can help here.
You can try: bandwidthTest --memory=pinned -wc
Running on...
Device 0: GeForce GTX 670
Quick Mode
Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Write-Combined Memory Writes are Enabled
  Transfer Size (Bytes)   Bandwidth (MB/s)
  33554432                5971.0
Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Write-Combined Memory Writes are Enabled
  Transfer Size (Bytes)   Bandwidth (MB/s)
  33554432                6372.7
Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Write-Combined Memory Writes are Enabled
  Transfer Size (Bytes)   Bandwidth (MB/s)
  33554432                151089.3
... works for --device 0 through --device 3.
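For completeness, those per-device runs can be scripted in one pass. A small sketch (assuming the bandwidthTest binary from the CUDA samples is in the current directory):

```shell
# Run the pinned + write-combined test on each of the 4 cards in turn
for d in 0 1 2 3; do
    ./bandwidthTest --memory=pinned -wc --device=$d
done
```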
jas.
I have prepared a special version of CUDA-Z with a lot more logging than usual. At least this version will show all errors.
http://sourceforge.net/projects/cuda-z/files/cuda-z/0.7-Beta/CUDA-Z-0.7.176-SVN-logging.run/download
Please send me the logging while selecting all cards one after another.
sh ..../CUDA-Z-0.7.176-SVN-logging.run 2> logfile.txt
Thanks!
Someone is running a job on the server... when it's done, I'll give it a try...
Thanks!
I ran it with 2 cards in the system (back to 2 at the moment)..
card 0 works of course...
then when switching to card 1, I only see:
Waiting for new loop...
Switch device -> update performance for device 1
Rising update action for device 1
Thread loop started
Alloc local buffers for GeForce GTX 670.
Alloc host pageable for GeForce GTX 670.
Host pageable is at 0xF0CFF008.
Alloc host pinned for GeForce GTX 670.
CUDA Error: all CUDA-capable devices are busy or unavailable
Waiting for new loop...
Timer shot -> update performance for device 1 in mode 0
Rising update action for device 1
Thread loop started
Alloc local buffers for GeForce GTX 670.
Alloc host pageable for GeForce GTX 670.
Host pageable is at 0xF0CFF008.
Alloc host pinned for GeForce GTX 670.
CUDA Error: all CUDA-capable devices are busy or unavailable
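For what it's worth, the failing step in that log (the pinned allocation) can be probed outside CUDA-Z with a minimal sketch like the one below. This is my own illustration, not CUDA-Z source; the 16 MiB buffer size is an arbitrary assumption:

```c
/* Minimal sketch (not CUDA-Z source): attempt a pinned host allocation
 * for every CUDA device and report any error, mimicking the step where
 * CUDA-Z fails on devices 1..3.  Build with: nvcc pinned_probe.cu */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int dev = 0; dev < count; ++dev) {
        void *pinned = NULL;
        cudaSetDevice(dev);
        /* 16 MiB is an arbitrary test size */
        err = cudaHostAlloc(&pinned, 16u << 20, cudaHostAllocDefault);
        if (err != cudaSuccess) {
            /* On the affected system, this is presumably where devices
             * 1..3 would report "all CUDA-capable devices are busy or
             * unavailable". */
            fprintf(stderr, "device %d: cudaHostAlloc failed: %s\n",
                    dev, cudaGetErrorString(err));
        } else {
            printf("device %d: pinned alloc OK\n", dev);
            cudaFreeHost(pinned);
        }
    }
    return 0;
}
```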
By the way, it was just identified that while the W2600CR motherboard has 4 x16 slots, only 1 of them is x16 electrical.. the others are x8. I figured that was the reason for the performance issue with paged memory, but when I put only one 670 in the system, and put it into the x16 slot, the numbers are the same.. argh.
jas.
Yeah.. disappointing. Never believe in PCIe x16 for the second, third, or fourth slot.
I've got a similar issue on my GA EX58-UD5: the second slot works at x16 only if the third is not populated, otherwise both run at x8. Sad but true. The funniest part is that if the second slot is not populated, the third is dead.
A similar joke is implemented on the GA 870A-UD3: if the second x16 (electrically x4) slot is populated with an x4 card, two extra x1 ports are disabled. All three of them can only work when the second x4 slot is populated with an x1 card.
I'm really surprised to see server motherboards have the same issue.
Regarding the CUDA-Z output, I'll try to find out what the reason could be. I presume that in the quad-card configuration, three cards give the same error message.
Thanks!
WBR,
AG
Well, with this board, you do get 1 x16 and 3 x8 according to the technical spec.. It's not clear how much x8 affects the card's overall CUDA performance. But with only 1 card in the system at the moment, in the x16 slot (slot 1), I still don't understand the performance differences in paged memory transfers... I don't think I ever will. argh.
It turns out the problem re: reduced paged memory performance had to do with having a second CPU in the system. I removed the second CPU, and performance is back to normal! Imagine that!
I had actually removed all but 1 GTX 670, ensured it was in SLOT 1 (which is managed by CPU 1), and used taskset to ensure that bandwidthTest was running on CPU 1... that didn't work... only removing the second CPU did. I wish someone could explain why it works that way.
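A possible explanation (my speculation, not confirmed): on a dual-socket board, pageable staging buffers can land on the far NUMA node, and taskset pins only the CPU, not the memory. For anyone wanting to test that theory without pulling the second CPU, numactl binds both. A sketch, assuming the GPU in slot 1 hangs off NUMA node 0:

```shell
# Bind both execution and memory allocation to NUMA node 0, then
# compare the pageable numbers against an unbound run of the same test.
numactl --cpunodebind=0 --membind=0 ./bandwidthTest --memory=pageable --device=0
```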