From: <gs...@us...> - 2009-06-18 15:07:04
Revision: 7
          http://cudagpumemtest.svn.sourceforge.net/cudagpumemtest/?rev=7&view=rev
Author:   gshi
Date:     2009-06-18 15:05:43 +0000 (Thu, 18 Jun 2009)

Log Message:
-----------
Update the README, added the descriptions for all test
The test desctiptions in the code is updated as well

Modified Paths:
--------------
    README
    tests.cu

Modified: README
===================================================================
--- README 2009-06-16 21:46:22 UTC (rev 6)
+++ README 2009-06-18 15:05:43 UTC (rev 7)
@@ -62,6 +62,100 @@
 %cuda_memtest --help
+========================================Test descriptions ================================================
+(I) list of all tests
+Running
+%cuda_memtest --list_tests
+will print out all tests and their short descriptions, as of 6/18/2009, we implemented 11 tests
+Test0 [Walking 1 bit]
+Test1 [Own address test]
+Test2 [Moving inversions, ones&zeros]
+Test3 [Moving inversions, 8 bit pat]
+Test4 [Moving inversions, random pattern]
+Test5 [Block move, 64 moves]
+Test6 [Moving inversions, 32 bit pat]
+Test7 [Random number sequence]
+Test8 [Modulo 20, random pattern]
+Test9 [Bit fade test] ==disabled by default==
+Test10 [Memory stress test]
+
+
+(II) detailed description
+
+Test0 [Walking 1 bit]
+ This test changes one bit a time in memory address to see it
+ goes to a different memory location. It is designed to test
+ the address wires.
+
+Test1 [Own address test]
+ Each Memory location is filled with its own address. The next kernel checks if the
+ value in each memory location still agrees with the address.
+
+Test 2 [Moving inversions, ones&zeros]
+ This test uses the moving inversions algorithm with patterns of all
+ ones and zeros.
+
+
+Test 3 [Moving inversions, 8 bit pat]
+ This is the same as test 1 but uses a 8 bit wide pattern of
+ "walking" ones and zeros. This test will better detect subtle errors
+ in "wide" memory chips.
+
+Test 4 [Moving inversions, random pattern]
+ Test 4 uses the same algorithm as test 1 but the data pattern is a
+ random number and it's complement. This test is particularly effective
+ in finding difficult to detect data sensitive errors. The random number
+ sequence is different with each pass so multiple passes increase effectiveness.
+
+Test 5 [Block move, 64 moves]
+ This test stresses memory by moving block memories. Memory is initialized
+ with shifting patterns that are inverted every 8 bytes. Then blocks
+ of memory are moved around. After the moves
+ are completed the data patterns are checked. Because the data is checked
+ only after the memory moves are completed it is not possible to know
+ where the error occurred. The addresses reported are only for where the
+ bad pattern was found.
+
+Test 6 [Moving inversions, 32 bit pat]
+ This is a variation of the moving inversions algorithm that shifts the data
+ pattern left one bit for each successive address. The starting bit position
+ is shifted left for each pass. To use all possible data patterns 32 passes
+ are required. This test is quite effective at detecting data sensitive
+ errors but the execution time is long.
+
+Test 7 [Random number sequence]
+ This test writes a series of random numbers into memory. A block (1 MB) of memory
+ is initialized with random patterns. These patterns and their complements are
+ used in moving inversions test with rest of memory.
+
+Test 8 [Modulo 20, random pattern]
+ A random pattern is generated. This pattern is used to set every 20th memory location
+ in memory. The rest of the memory location is set to the complimemnt of the pattern.
+ Repeat this for 20 times and each time the memory location to set the pattern is shifted right.
+
+Test 9 [Bit fade test, 90 min, 2 patterns]
+ The bit fade test initializes all of memory with a pattern and then
+ sleeps for 90 minutes. Then memory is examined to see if any memory bits
+ have changed. All ones and all zero patterns are used. This test takes
+ 3 hours to complete. The Bit Fade test is disabled by default
+
+Test10 [memory stress test]
+ Stress memory as much as we can. A random pattern is generated and a kernel of large grid size
+ and block size is launched to set all memory to the pattern. A new read and write kernel is launched
+ immediately after the previous write kernel to check if there is any errors in memory and set the
+ memory to the compliment. This process is repeated for 1000 times for one pattern. The kernel is
+ written as to achieve the maximum bandwidth between the global memory and GPU.
+ This will increase the chance of catching software error. In practice, we found this test quite useful
+ to flush hardware errors as well.
+
+
+
+
+
+
+
+
+

Modified: tests.cu
===================================================================
--- tests.cu 2009-06-16 21:46:22 UTC (rev 6)
+++ tests.cu 2009-06-18 15:05:43 UTC (rev 7)
@@ -329,13 +329,15 @@
 }
+/*
+ * Test0 [Walking 1 bit]
+ * This test changes one bit a time in memory address to see it
+ * goes to a different memory location. It is designed to test
+ * the address wires.
+ */
-/****************************************************************************
- * test0
- * walking 1 bit
- *
- ****************************************************************************/
+
 /*
 __global__ void kernel_test0_write(char* _ptr, char* end_ptr, unsigned int pattern,
@@ -604,7 +606,8 @@
 /*********************************************************************************
  * test1
- * Memory address test, own address
+ * Each Memory location is filled with its own address. The next kernel checks if the
+ * value in each memory location still agrees with the address.
  *
  ********************************************************************************/
@@ -683,10 +686,7 @@
 /******************************************************************************
  * Test 2 [Moving inversions, ones&zeros]
  * This test uses the moving inversions algorithm with patterns of all
- * ones and zeros. Cache is enabled even though it interferes to some
- * degree with the test algorithm. With cache enabled this test does not
- * take long and should quickly find all "hard" errors and some more
- * subtle errors. This section is only a quick check.
+ * ones and zeros.
  *
  ****************************************************************************/
@@ -710,7 +710,7 @@
  * Test 3 [Moving inversions, 8 bit pat]
  * This is the same as test 1 but uses a 8 bit wide pattern of
  * "walking" ones and zeros. This test will better detect subtle errors
- * in "wide" memory chips. A total of 20 data patterns are used.
+ * in "wide" memory chips.
  *
  **************************************************************************/
@@ -777,17 +777,13 @@
 /************************************************************************************
  * Test 5 [Block move, 64 moves]
- * This test stresses memory by using block move (movsl) instructions
- * and is based on Robert Redelmeier's burnBX test. Memory is initialized
- * with shifting patterns that are inverted every 8 bytes. Then 4MB blocks
- * of memory are moved around using the movsl instruction. After the moves
+ * This test stresses memory by moving block memories. Memory is initialized
+ * with shifting patterns that are inverted every 8 bytes. Then blocks
+ * of memory are moved around. After the moves
  * are completed the data patterns are checked. Because the data is checked
  * only after the memory moves are completed it is not possible to know
  * where the error occurred. The addresses reported are only for where the
- * bad pattern was found. Since the moves are constrained to a 8MB segment
- * of memory the failing address will always be lest than 8MB away from the
- * reported address. Errors from this test are not used to calculate
- * BadRAM patterns.
+ * bad pattern was found.
  *
  *
  *************************************************************************************/
@@ -1094,13 +1090,12 @@
 /******************************************************************************
  * Test 7 [Random number sequence]
- * This test writes a series of random numbers into memory. By resetting the
- * seed for the random number the same sequence of number can be created for
- * a reference. The initial pattern is checked and then complemented and
- * checked again on the next pass. However, unlike the moving inversions test
- * writing and checking can only be done in the forward direction.
- *
+ * This test writes a series of random numbers into memory. A block (1 MB) of memory
+ * is initialized with random patterns. These patterns and their complements are
+ * used in moving inversions test with rest of memory.
+ *
+ *
  *******************************************************************************/
@@ -1234,11 +1229,13 @@
 /***********************************************************************************
- * Test 8 [Modulo 20, ones&zeros]
- * Using the Modulo-X algorithm should uncover errors that are not
- * detected by moving inversions due to cache and buffering interference
- * with the the algorithm. All ones and zeros are used for data patterns.
+ * Test 8 [Modulo 20, random pattern]
  *
+ * A random pattern is generated. This pattern is used to set every 20th memory location
+ * in memory. The rest of the memory location is set to the complimemnt of the pattern.
+ * Repeat this for 20 times and each time the memory location to set the pattern is shifted right.
+ *
+ *
  **********************************************************************************/
@@ -1351,12 +1348,11 @@
 /************************************************************************************
  *
- *Test 9 [Bit fade test, 90 min, 2 patterns]
+ * Test 9 [Bit fade test, 90 min, 2 patterns]
  * The bit fade test initializes all of memory with a pattern and then
  * sleeps for 90 minutes. Then memory is examined to see if any memory bits
  * have changed. All ones and all zero patterns are used. This test takes
- * 3 hours to complete. The Bit Fade test is not included in the normal test
- * sequence and must be run manually via the runtime configuration menu.
+ * 3 hours to complete. The Bit Fade test is disabled by default
  *
  **********************************************************************************/
@@ -1404,7 +1400,14 @@
 /**************************************************************************************
  * Test10 [memory stress test]
- * Stress memory as much as we can. This will increase the chance of catching software error
+ *
+ * Stress memory as much as we can. A random pattern is generated and a kernel of large grid size
+ * and block size is launched to set all memory to the pattern. A new read and write kernel is launched
+ * immediately after the previous write kernel to check if there is any errors in memory and set the
+ * memory to the compliment. This process is repeated for 1000 times for one pattern. The kernel is
+ * written as to achieve the maximum bandwidth between the global memory and GPU.
+ * This will increase the chance of catching software error. In practice, we found this test quite useful
+ * to flush hardware errors as well.
  *
  */
@@ -1530,7 +1533,7 @@
  {test5, "Test5 [Block move, 64 moves]", 1},
  {test6, "Test6 [Moving inversions, 32 bit pat]", 1},
  {test7, "Test7 [Random number sequence]", 1},
- {test8, "Test8 [Modulo 20, ones&zeros]", 1},
+ {test8, "Test8 [Modulo 20, random pattern]", 1},
  {test9, "Test9 [Bit fade test]", 0},
  {test10, "Test10 [Memory stress test]", 1},

This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.
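
For readers of this commit, the Test 8 description (set every 20th location to a random pattern, set the rest to its complement, repeat 20 times with the pattern position shifted) can be illustrated with a small host-side sketch. This is not the code in tests.cu: the function names `modulo20_fill`, `modulo20_check`, and `modulo20_test`, the 32-bit word size, and the use of a `std::vector` in place of GPU global memory are all assumptions made for illustration only.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helpers, not taken from tests.cu. They mimic the modulo-20
// idea on an ordinary host buffer instead of GPU global memory.

// Assumed stride of 20, taken from the test name "Modulo 20".
static const std::size_t kStride = 20;

// Write `pattern` to every 20th word starting at `offset`,
// and the bitwise complement of `pattern` to every other word.
void modulo20_fill(std::vector<std::uint32_t>& mem,
                   std::uint32_t pattern, std::size_t offset) {
    for (std::size_t i = 0; i < mem.size(); ++i) {
        mem[i] = (i % kStride == offset) ? pattern : ~pattern;
    }
}

// Re-read the buffer and count words that do not hold the expected value.
std::size_t modulo20_check(const std::vector<std::uint32_t>& mem,
                           std::uint32_t pattern, std::size_t offset) {
    std::size_t errors = 0;
    for (std::size_t i = 0; i < mem.size(); ++i) {
        const std::uint32_t expected =
            (i % kStride == offset) ? pattern : ~pattern;
        if (mem[i] != expected) {
            ++errors;
        }
    }
    return errors;
}

// Run all 20 offsets for one pattern, shifting the pattern position by one
// word each time, and return the total number of mismatching words.
std::size_t modulo20_test(std::vector<std::uint32_t>& mem,
                          std::uint32_t pattern) {
    std::size_t errors = 0;
    for (std::size_t offset = 0; offset < kStride; ++offset) {
        modulo20_fill(mem, pattern, offset);
        errors += modulo20_check(mem, pattern, offset);
    }
    return errors;
}
```

On healthy memory `modulo20_test` returns zero. Splitting the fill and check steps makes it easy to flip a bit between them and confirm that the check reports it. A GPU version would presumably run each step as a kernel over global memory, but that part is not shown in this commit.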