|
From: Wuweijia <wuw...@hu...> - 2018-01-23 03:03:09
Attachments:
log.rar
|
Hi
I ran my program under Memcheck. 99% of the time it is fine and does not take long, but sometimes it takes very long, at least one hour in one function. So I added --trace-signals=yes to find out what is happening, and Valgrind showed me that some signals occurred. Could they be related to the runs where Memcheck takes too long? Or can you show me some way to analyze why Memcheck sometimes runs so long?
The log is as below:
[2018-01-19 14:10:38] [HDRP_AP][I] getSkinMask : 2660 [MedianFilter3x3_v2] BaseSize=770048, FilterSize=761856
[2018-01-19 14:10:38] [HDRP_AP][I] getSkinMask : 2661 [MedianFilter3x3_v2] BaseStride=1024,BaseHeight_16=752
[2018-01-19 14:10:38] [HDRP_AP][I] getSkinMask : 2662 [MedianFilter3x3_v2] width=992, height=744, stride=1024
--4195-- sync signal handler: signal=11, si_code=1, EIP=0x41753d4, eip=0x8046b2504, from kernel
--4195-- SIGSEGV: si_code=1 faultaddr=0xffefedf30 tid=1 ESP=0xffefedef0 seg=0xffe801000-0xffefedfff
--4195-- -> extended stack base to 0xffefed000
--4195-- sys_sigaltstack: tid 10, ss 0x3191C430{0x4B0A000,sz=16384,flags=0x0}, oss 0x0 (current SP 0x3191C410)
--4195-- sys_sigaltstack: tid 10, ss 0x3191C400{0x0,sz=0,flags=0x2}, oss 0x0 (current SP 0x3191C3B0)
--4195-- sys_sigaltstack: tid 10, ss 0x3191C430{0x4B0A000,sz=16384,flags=0x0}, oss 0x0 (current SP 0x3191C410)
--4195-- sys_sigaltstack: tid 10, ss 0x3191C400{0x0,sz=0,flags=0x2}, oss 0x0 (current SP 0x3191C3B0)
--4195-- sys_sigaltstack: tid 10, ss 0x3191C430{0x4B0A000,sz=16384,flags=0x0}, oss 0x0 (current SP 0x3191C410)
--4195-- sys_sigaltstack: tid 10, ss 0x3191C400{0x0,sz=0,flags=0x2}, oss 0x0 (current SP 0x3191C3B0)
--4195-- sys_sigaltstack: tid 10, ss 0x3191C430{0x4B0A000,sz=16384,flags=0x0}, oss 0x0 (current SP 0x3191C410)
--4195-- sys_sigaltstack: tid 10, ss 0x3191C400{0x0,sz=0,flags=0x2}, oss 0x0 (current SP 0x3191C3B0)
[2018-01-19 15:29:39] [HDRP_AP][I] rawnr_guidedfilter_self_u16 : 1310 enter.
--4195-- sys_sigaltstack: tid 10, ss 0x3191C430{0x4B0F000,sz=16384,flags=0x0}, oss 0x0 (current SP 0x3191C410)
--4195-- sys_sigaltstack: tid 11, ss 0x31A19430{0x4B19000,sz=16384,flags=0x0}, oss 0x0 (current SP 0x31A19410)
Env: Android arm64
Valgrind 3.12
BR
owen
|
|
From: Ivo R. <iv...@iv...> - 2018-01-23 09:42:54
|
2018-01-23 4:02 GMT+01:00 Wuweijia <wuw...@hu...>:
> Hi
>
> I ran the program with mem-check, 99% is okay, it will not
> last long. But sometimes it last very long, at least one hour in one
> function. And then I add the --trace-signals=yes to find what it happen.
> Valgrind show me some signal happened. Is there something related to the
> time that the mem-check last too long. Or can you show me some ways to
> analyze why the mem-check run too long sometimes.
Perhaps you will find useful a simple progress reporting facility recently integrated into the Valgrind repo by Julian.
https://bugs.kde.org/show_bug.cgi?id=384633
Remark: you need to build Valgrind from the latest source as per http://valgrind.org/downloads/repository.html
Excerpt from the documentation:
---------------------------------------------
A new command line flag, --progress-interval=number, causes Valgrind to print a 1-line summary of progress every |number| seconds.
For example, when starting Firefox with --progress-interval=10, I get lines like this:
--32411-- PROGRESS: U 110s, W 113s, 97.3% CPU, EvC 414.79M, TIn 616.7k, TOut 0.5k, #thr 67
--32411-- PROGRESS: U 120s, W 124s, 96.8% CPU, EvC 505.27M, TIn 636.6k, TOut 3.0k, #thr 64
--32411-- PROGRESS: U 130s, W 134s, 97.0% CPU, EvC 574.90M, TIn 657.5k, TOut 3.0k, #thr 63
--32411-- PROGRESS: U 140s, W 144s, 97.2% CPU, EvC 636.34M, TIn 659.9k, TOut 3.0k, #thr 62
--32411-- PROGRESS: U 150s, W 155s, 96.8% CPU, EvC 710.21M, TIn 664.0k, TOut 17.7k, #thr 61
--32411-- PROGRESS: U 160s, W 201s, 79.6% CPU, EvC 822.38M, TIn 669.9k, TOut 75.8k, #thr 60
Each line shows:
U: total user time
W: total wallclock time
CPU: overall average cpu use
EvC: number of event checks. An event check is a backwards branch in the simulated program, so this is a measure of forward progress of the program
TIn: number of code blocks instrumented by the JIT
TOut: number of instrumented code blocks that have been thrown away
#thr: number of threads in the program
From the progress of these, it is possible to observe:
* when the program is compute bound (TIn rises slowly, EvC rises rapidly)
* when the program is in a spinloop (TIn/TOut fixed, EvC rises rapidly)
* when the program is JIT-bound (TIn rises rapidly)
* when the program is rapidly discarding code (TOut rises rapidly)
* when the program is about to achieve some expected state (EvC arrives at some value you expect)
* when the program is idling (U rises more slowly than W)
I.
|
|
From: Wuweijia <wuw...@hu...> - 2018-01-23 11:27:49
|
Hi
Thanks for your advice. Do you mean I need to download Valgrind 3.13, merge your commit, and compile it?
Also, is there any API I can use to dump the call stacks of the threads?
I want to add it to the code to give me more information for debugging the program.
BR
Owen
-----Original Message-----
From: Ivo Raisr [mailto:iv...@iv...]
Sent: 2018-01-23 17:43
To: Wuweijia <wuw...@hu...>
Cc: val...@li...; Fanbohao <fan...@hu...>
Subject: Re: [Valgrind-users] [Help] Valgrind sometime run the program very slowly sometimes , it last at least one hour. can you show me why or some way to analyze it?
2018-01-23 4:02 GMT+01:00 Wuweijia <wuw...@hu...>:
> Hi
>
> I ran the program with mem-check, 99% is okay, it will
> not last long. But sometimes it last very long, at least one hour in
> one function. And then I add the –trace-signals=yes to find what it happen.
> Valgrind show me some signal happened. Is there something related to
> the time that the mem-check last too long. Or can you show me some
> ways to analyze why the mem-check run too long sometimes.
Perhaps you will find useful a simple progress reporting facility recently integrated into Valgrind repo by Julian.
https://bugs.kde.org/show_bug.cgi?id=384633
Remark: you need to build Valgrind from the latest source as per http://valgrind.org/downloads/repository.html
Excerpt from the documentation:
---------------------------------------------
A new command line flag, --progress-interval=number, causes Valgrind to print a 1-line summary of progress every |number| seconds.
For example, when starting Firefox with --progress-interval=10, I get lines like this:
--32411-- PROGRESS: U 110s, W 113s, 97.3% CPU, EvC 414.79M, TIn 616.7k, TOut 0.5k, #thr 67
--32411-- PROGRESS: U 120s, W 124s, 96.8% CPU, EvC 505.27M, TIn 636.6k, TOut 3.0k, #thr 64
--32411-- PROGRESS: U 130s, W 134s, 97.0% CPU, EvC 574.90M, TIn 657.5k, TOut 3.0k, #thr 63
--32411-- PROGRESS: U 140s, W 144s, 97.2% CPU, EvC 636.34M, TIn 659.9k, TOut 3.0k, #thr 62
--32411-- PROGRESS: U 150s, W 155s, 96.8% CPU, EvC 710.21M, TIn 664.0k, TOut 17.7k, #thr 61
--32411-- PROGRESS: U 160s, W 201s, 79.6% CPU, EvC 822.38M, TIn 669.9k, TOut 75.8k, #thr 60
Each line shows:
U: total user time
W: total wallclock time
CPU: overall average cpu use
EvC: number of event checks. An event check is a backwards branch
in the simulated program, so this is a measure of forward progress
of the program
TIn: number of code blocks instrumented by the JIT
TOut: number of instrumented code blocks that have been thrown away
#thr: number of threads in the program
From the progress of these, it is possible to observe:
* when the program is compute bound (TIn rises slowly, EvC rises rapidly)
* when the program is in a spinloop (TIn/TOut fixed, EvC rises rapidly)
* when the program is JIT-bound (TIn rises rapidly)
* when the program is rapidly discarding code (TOut rises rapidly)
* when the program is about to achieve some expected state (EvC arrives
at some value you expect)
* when the program is idling (U rises more slowly than W)
I.
|
|
From: Ivo R. <iv...@iv...> - 2018-01-23 13:54:21
|
2018-01-23 12:27 GMT+01:00 Wuweijia <wuw...@hu...>:
> Hi
> Thanks for your advise. You mean I need to download valgrind 3.13. and merge your commit, compile it.
Actually not. I meant you should build Valgrind from source code as per instructions at:
http://valgrind.org/downloads/repository.html
I.
|
|
From: Wuweijia <wuw...@hu...> - 2018-01-24 00:36:32
|
Hi:
But I cannot access the Valgrind git repository. Is there any way to get the newest source code? Maybe the network configuration does not allow it.
BR
Owen
-----Original Message-----
From: Ivo Raisr [mailto:iv...@iv...]
Sent: 2018-01-23 21:54
To: Wuweijia <wuw...@hu...>
Cc: val...@li...; Fanbohao <fan...@hu...>
Subject: Re: Re: [Valgrind-users] [Help] Valgrind sometime run the program very slowly sometimes , it last at least one hour. can you show me why or some way to analyze it?
2018-01-23 12:27 GMT+01:00 Wuweijia <wuw...@hu...>:
> Hi
> Thanks for your advise. You mean I need to download valgrind 3.13. and merge your commit, compile it.
Actually not. I meant you should build Valgrind from source code as per instructions at: http://valgrind.org/downloads/repository.html
I.
|
|
From: Ivo R. <iv...@iv...> - 2018-01-24 04:36:00
|
2018-01-24 1:36 GMT+01:00 Wuweijia <wuw...@hu...>:
> Hi:
> But I can not access the git of valgrind. Is there any way to get the newest source code; Maybe the network configuration do not allow it;
It is hard to help you without any specific error shown.
If this is indeed a network configuration, there is a git mirror of Valgrind accessible over http or https.
Have a look at http://repo.or.cz/w/valgrind.git
I.
|
|
From: Wuweijia <wuw...@hu...> - 2018-01-24 07:15:50
Attachments:
config.h
|
Hi
I downloaded the source code and built the arm32 version. It is somewhat different from 3.12.
There is one question:
When I build the android-arm32 version, there is a compile error. I need to confirm whether this file is really not needed.
The error is as below:
external/valgrind-3.14-GIT/coregrind/m_syswrap/syscall-arm-linux.S:35:34: fatal error: libvex_guest_offsets.h: No such file or directory
#include "libvex_guest_offsets.h"
The compile command:
out/debug/target/product/kirin970/obj_arm/STATIC_LIBRARIES/libcoregrind-arm-linux_intermediates/coregrind/m_syswrap/syscall-arm-linux.o
/bin/bash -c "PWD=/proc/self/cwd prebuilts/gcc/linux-x86/arm/arm-linux-androideabi-4.9/bin/arm-linux-androideabi-gcc -I external/valgrind-3.14-GIT -I external/valgrind-3.14-GIT/include -I external/valgrind-3.14-GIT/VEX/pub -I external/valgrind-3.14-GIT/coregrind -I external/valgrind-3.14-GIT -I out/debug/target/product/kirin970/obj_arm/STATIC_LIBRARIES/libcoregrind-arm-linux_intermediates -I out/debug/target/product/kirin970/gen/STATIC_LIBRARIES/libcoregrind-arm-linux_intermediates -I libnativehelper/include/nativehelper \$(cat out/debug/target/product/kirin970/obj_arm/STATIC_LIBRARIES/libcoregrind-arm-linux_intermediates/import_includes) -I system/core/include -I system/media/audio/include -I hardware/libhardware/include -I hardware/libhardware_legacy/include -I hardware/ril/include -I libnativehelper/include -I frameworks/native/include -I frameworks/native/opengl/include -isystem frameworks/av/include -isystem out/debug/target/product/kirin970/obj/include -isystem bionic/libc/arch-arm/include -isystem bionic/libc/include -isystem bionic/libc/kernel/uapi -isystem bionic/libc/kernel/uapi/asm-arm -isystem bionic/libc/kernel/android/uapi -c -fno-exceptions -Wno-multichar -ffunction-sections -fdata-sections -funwind-tables -fstack-protector-strong -Wa,--noexecstack -Werror=format-security -D_FORTIFY_SOURCE=2 -fno-short-enums -no-canonical-prefixes -fno-canonical-system-headers -fno-builtin-sin -fno-strict-volatile-bitfields -DNDEBUG -g -Wstrict-aliasing=2 -fgcse-after-reload -frerun-cse-after-loop -frename-registers -DANDROID -DOEMINFO_VERSION6 -fmessage-length=0 -W -Wall -Wno-unused -Winit-self -Wpointer-arith -DNDEBUG -UDEBUG -DBOARD_VENDORIMAGE_FILE_SYSTEM_TYPE -Wformat -fdebug-prefix-map=/proc/self/cwd= -fdiagnostics-color -Werror=return-type -Werror=non-virtual-dtor -Werror=address -Werror=sequence-point -Werror=date-time -mthumb-interwork -msoft-float -mfloat-abi=softfp -mfpu=neon -mcpu=cortex-a15 -D__ARM_FEATURE_LPAE=1 -std=gnu99 -O2 -fomit-frame-pointer 
-fstrict-aliasing -funswitch-loops -Wall -Wmissing-prototypes -Wshadow -Wpointer-arith -Wmissing-declarations -Wno-pointer-sign -Wno-sign-compare -Wno-unused-parameter -Wno-shadow -fno-strict-aliasing -fno-stack-protector -DVGO_linux=1 -v -DANDROID_SYMBOLS_DIR=\\\"/data/local/symbols\\\" -std=gnu99 -DANDROID_HARDWARE_generic -DVGA_arm=1 -DVGP_arm_linux=1 -DVGPV_arm_linux_android=1 -DVG_LIBDIR=\\\"/system/lib64/valgrind\\\" -DVG_PLATFORM=\\\"arm-linux\\\" -D__ASSEMBLY__ -MD -MF out/debug/target/product/kirin970/obj_arm/STATIC_LIBRARIES/libcoregrind-arm-linux_intermediates/coregrind/m_syswrap/syscall-arm-linux.d -o out/debug/target/product/kirin970/obj_arm/STATIC_LIBRARIES/libcoregrind-arm-linux_intermediates/coregrind/m_syswrap/syscall-arm-linux.o external/valgrind-3.14-GIT/coregrind/m_syswrap/syscall-arm-linux.S"
BR
Owen
-----Original Message-----
From: Ivo Raisr [mailto:iv...@iv...]
Sent: 2018-01-24 12:36
To: Wuweijia <wuw...@hu...>
Cc: val...@li...; Fanbohao <fan...@hu...>
Subject: Re: Re: Re: [Valgrind-users] [Help] Valgrind sometime run the program very slowly sometimes , it last at least one hour. can you show me why or some way to analyze it?
2018-01-24 1:36 GMT+01:00 Wuweijia <wuw...@hu...>:
> Hi:
> But I can not access the git of valgrind. Is there any way to
> get the newest source code; Maybe the network configuration do not
> allow it;
It is hard to help you without any specific error shown.
If this is indeed a network configuration, there is a git mirror of Valgrind accessible over http or https.
Have a look at http://repo.or.cz/w/valgrind.git
I.
|
|
From: Wuweijia <wuw...@hu...> - 2018-01-24 10:07:56
Attachments:
compilation.txt
|
Hi
I compiled the source in arm32 mode and some errors occurred: some functions lack a parameter.
BR
Owen
-----Original Message-----
From: Ivo Raisr [mailto:iv...@iv...]
Sent: 2018-01-24 12:36
To: Wuweijia <wuw...@hu...>
Cc: val...@li...; Fanbohao <fan...@hu...>
Subject: Re: Re: Re: [Valgrind-users] [Help] Valgrind sometime run the program very slowly sometimes , it last at least one hour. can you show me why or some way to analyze it?
2018-01-24 1:36 GMT+01:00 Wuweijia <wuw...@hu...>:
> Hi:
> But I can not access the git of valgrind. Is there any way to
> get the newest source code; Maybe the network configuration do not
> allow it;
It is hard to help you without any specific error shown.
If this is indeed a network configuration, there is a git mirror of Valgrind accessible over http or https.
Have a look at http://repo.or.cz/w/valgrind.git
I.
|
|
From: Wuweijia <wuw...@hu...> - 2018-01-26 03:38:12
|
Hi
About the question of why Valgrind is slow: I analyzed the source code and found it uses the atomic function __sync_fetch_and_add to do the job.
Four threads use __sync_fetch_and_add to synchronize the compute status.
The source is as below:
Function1:
bool CDynamicScheduling::GetProcLoop(
int& nBegin,
int& nEndPlusOne)
{
int curr = __sync_fetch_and_add(&m_nCurrent, m_nStep);
if (curr > m_nEnd)
{
return false;
}
nBegin = curr;
int limit = m_nEnd + 1;
nEndPlusOne = curr + m_nStep;
return true;
}
Function2:
....
int beginY, endY;
while (pDS->GetProcLoop(beginY, endY)){
for (y = beginY; y < endY; y++){
for(x = 0; x < dstWDiv2-7; x+=8){
vtmp0 = vld2q_u16(&pSrc[(y<<1)*srcStride+(x<<1)]);
vtmp1 = vld2q_u16(&pSrc[((y<<1)+1)*srcStride+(x<<1)]);
vst1q_u16(&pDst[y*dstStride+x], (vtmp0.val[0] + vtmp0.val[1] + vtmp1.val[0] + vtmp1.val[1] + vdupq_n_u16(2)) >> vdupq_n_u16(2));
}
for(; x < dstWDiv2; x++){
pDst[y*dstStride+x] = (pSrc[(y<<1)*srcStride+(x<<1)] + pSrc[(y<<1)*srcStride+(x<<1)+1] + pSrc[((y<<1)+1)*srcStride+(x<<1)] + pSrc[((y<<1)+1)*srcStride+((x<<1)+1)] + 2) >> 2;
}
}
}
return;
}
Function 2 calls Function 1 (GetProcLoop) to get the start and end indices. Could this be related to the problem of Valgrind taking too long?
Earlier I heard from someone who maintains Valgrind that 3.13 modified the lock algorithm, so the lock algorithm of Valgrind 3.13 is different from Valgrind 3.12's.
Can you show me some details about why the lock algorithm was modified and what it improves? I need to know whether I need to upgrade Valgrind; upgrading is hard work and requires a lot of testing.
Background: I have downloaded the source code, but I still fail to compile it for arm32. I need both aarch64 and arm32; they run together.
So I want to try a new approach: build Valgrind 3.13 and try it.
BR
Owen
-----Original Message-----
From: Ivo Raisr [mailto:iv...@iv...]
Sent: 2018-01-24 12:36
To: Wuweijia <wuw...@hu...>
Cc: val...@li...; Fanbohao <fan...@hu...>
Subject: Re: Re: Re: [Valgrind-users] [Help] Valgrind sometime run the program very slowly sometimes , it last at least one hour. can you show me why or some way to analyze it?
2018-01-24 1:36 GMT+01:00 Wuweijia <wuw...@hu...>:
> Hi:
> But I can not access the git of valgrind. Is there any way to
> get the newest source code; Maybe the network configuration do not
> allow it;
It is hard to help you without any specific error shown.
If this is indeed a network configuration, there is a git mirror of Valgrind accessible over http or https.
Have a look at http://repo.or.cz/w/valgrind.git
I.
|
|
From: John R. <jr...@bi...> - 2018-01-26 04:44:19
|
On 01/25/2018 15:37 UTC, Wuweijia wrote:
> Function1:
> bool CDynamicScheduling::GetProcLoop(
> int& nBegin,
> int& nEndPlusOne)
> {
> int curr = __sync_fetch_and_add(&m_nCurrent, m_nStep);
How large is 'm_nStep'? [Are you sure?]
The overhead expense of switching threads in valgrind would be reduced
by making m_nStep as large as possible. It looks like the code
in Function2 would produce the same values regardless.
> if (curr > m_nEnd)
> {
> return false;
> }
>
> nBegin = curr;
> int limit = m_nEnd + 1;
Local variable 'limit' is unused. By itself this is unimportant,
but it might be a clue to something that is not shown here.
> nEndPlusOne = curr + m_nStep;
> return true;
> }
>
>
> Function2:
> ....
> int beginY, endY;
> while (pDS->GetProcLoop(beginY, endY)){
> for (y = beginY; y < endY; y++){
> for(x = 0; x < dstWDiv2-7; x+=8){
> vtmp0 = vld2q_u16(&pSrc[(y<<1)*srcStride+(x<<1)]);
> vtmp1 = vld2q_u16(&pSrc[((y<<1)+1)*srcStride+(x<<1)]);
I hope the actual source contains a comment such as:
Compute pDst[] as the rounded average of non-overlapping 2x2 blocks of pixels in pSrc[].
> vst1q_u16(&pDst[y*dstStride+x], (vtmp0.val[0] + vtmp0.val[1] + vtmp1.val[0] + vtmp1.val[1] + vdupq_n_u16(2)) >> vdupq_n_u16(2));
> }
> for(; x < dstWDiv2; x++){
> pDst[y*dstStride+x] = (pSrc[(y<<1)*srcStride+(x<<1)] + pSrc[(y<<1)*srcStride+(x<<1)+1] + pSrc[((y<<1)+1)*srcStride+(x<<1)] + pSrc[((y<<1)+1)*srcStride+((x<<1)+1)] + 2) >> 2;
> }
> }
> }
>
> return;
> }
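The suggestion above (make m_nStep as large as possible) could be sketched as follows. This is a hypothetical illustration, not code from the project: the class name, the sizing heuristic (about four chunks per thread), and the clamping of the final chunk are all invented here.

```cpp
#include <algorithm>
#include <atomic>

// Hypothetical coarser-grained variant of GetProcLoop. The heuristic
// "about 4 chunks per thread" is an assumption for illustration only.
class ChunkedScheduler {
public:
    ChunkedScheduler(int lastRow, int numThreads)
        : m_nEnd(lastRow),
          m_nStep(std::max(1, (lastRow + 1) / (numThreads * 4))),
          m_nCurrent(0) {}

    // Hands out the half-open range [nBegin, nEndPlusOne).
    // Returns false once all rows have been claimed.
    bool GetProcLoop(int& nBegin, int& nEndPlusOne) {
        int curr = m_nCurrent.fetch_add(m_nStep);
        if (curr > m_nEnd)
            return false;
        nBegin = curr;
        // Clamp the final chunk so it never runs past the last row.
        nEndPlusOne = std::min(curr + m_nStep, m_nEnd + 1);
        return true;
    }

    int step() const { return m_nStep; }

private:
    const int m_nEnd;            // index of the last row
    const int m_nStep;           // rows handed out per call
    std::atomic<int> m_nCurrent; // next unclaimed row
};
```

Fewer, larger chunks mean fewer atomic round trips on the shared cursor, which is where Valgrind's serialized thread scheduling makes fine-grained work claiming expensive.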
|
|
From: Wuweijia <wuw...@hu...> - 2018-01-26 06:58:10
|
Hi:
How large is 'm_nStep'? [Are you sure?]
The source is as below; they are all plain integers. Which value do you care about?
class CDynamicScheduling
{
public:
static const int m_nDefaultStepUnit;
static const int m_nDefaultStepFactor;
private:
int m_nBegin;
int m_nEnd;
int m_nStep;
#if defined(_MSC_VER)
std::atomic<int> m_nCurrent;
#else
int m_nCurrent;
#endif
I hope the actual source contains a comment such as:
Compute pDst[] as the rounded average of non-overlapping 2x2 blocks of pixels in pSrc[].
Yes, you are right. It just computes the average of 2x2 blocks.
What I showed you before was the aarch64 NEON code; below is the same function, but the scalar (x86) implementation:
UINT16 *pDstL;
UINT16 *pSrcL;
INT32 dstWDiv2 = srcW >> 1;
// INT32 dstHDiv2 = srcH >> 1;
INT32 x, y;
INT32 posDst,posSrc;
pSrcL = pSrc;
pDstL = pDst;
int beginY, endY;
while (pDS->GetProcLoop(beginY, endY))
{
// for (y = 0; y < dstHDiv2; y++)
for (y = beginY; y < endY; y++)
{
for (x = 0; x < dstWDiv2; x++)
{
posDst = y*dstStride + x;
posSrc = (y<<1)*srcStride + (x<<1);
pDstL[posDst] = (pSrcL[posSrc] + pSrcL[posSrc + 1] + pSrcL[posSrc+srcStride] + pSrcL[posSrc+srcStride + 1] + 2) >> 2;
}
}
}
pSrc is an image buffer, about 11 MB. Width: 3968, Height: 2976, srcStride: 3968.
This means four threads compute the average of the 2x2 blocks.
pSrc is divided into many small pieces, and the average of every piece is computed. The division is not by design but by the state of the running threads: threads that hold the CPU compute more pieces, and threads that do not hold the CPU compute fewer.
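A minimal single-threaded reference of that 2x2 rounded average; the function name and the tiny sizes used below are illustrative only, not from the real pipeline:

```cpp
#include <cstdint>
#include <vector>

// Reference 2x2 block average with rounding, matching the
// (a + b + c + d + 2) >> 2 expression discussed in this thread.
void halveImage(const std::vector<uint16_t>& src, int srcStride,
                std::vector<uint16_t>& dst, int dstStride,
                int dstW, int dstH)
{
    for (int y = 0; y < dstH; ++y) {
        for (int x = 0; x < dstW; ++x) {
            int s = (y << 1) * srcStride + (x << 1); // top-left of the 2x2 block
            dst[y * dstStride + x] = static_cast<uint16_t>(
                (src[s] + src[s + 1] +
                 src[s + srcStride] + src[s + srcStride + 1] + 2) >> 2);
        }
    }
}
```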
BR
Owen
-----Original Message-----
From: John Reiser [mailto:jr...@bi...]
Sent: 2018-01-26 12:44
To: val...@li...
Subject: Re: [Valgrind-users] Re: Re: Re: [Help] Valgrind sometime run the program very slowly sometimes , it last at least one hour. can you show me why or some way to analyze it?
On 01/25/2018 15:37 UTC, Wuweijia wrote:
> Function1:
> bool CDynamicScheduling::GetProcLoop(
> int& nBegin,
> int& nEndPlusOne)
> {
> int curr = __sync_fetch_and_add(&m_nCurrent, m_nStep);
How large is 'm_nStep'? [Are you sure?] The overhead expense of switching threads in valgrind would be reduced by making m_nStep as large as possible. It looks like the code in Function2 would produce the same values regardless.
> if (curr > m_nEnd)
> {
> return false;
> }
>
> nBegin = curr;
> int limit = m_nEnd + 1;
Local variable 'limit' is unused. By itself this is unimportant, but it might be a clue to something that is not shown here.
> nEndPlusOne = curr + m_nStep;
> return true;
> }
>
>
> Function2:
> ....
> int beginY, endY;
> while (pDS->GetProcLoop(beginY, endY)){
> for (y = beginY; y < endY; y++){
> for(x = 0; x < dstWDiv2-7; x+=8){
> vtmp0 = vld2q_u16(&pSrc[(y<<1)*srcStride+(x<<1)]);
> vtmp1 = vld2q_u16(&pSrc[((y<<1)+1)*srcStride+(x<<1)]);
I hope the actual source contains a comment such as:
Compute pDst[] as the rounded average of non-overlapping 2x2 blocks of pixels in pSrc[].
> vst1q_u16(&pDst[y*dstStride+x], (vtmp0.val[0] + vtmp0.val[1] + vtmp1.val[0] + vtmp1.val[1] + vdupq_n_u16(2)) >> vdupq_n_u16(2));
> }
> for(; x < dstWDiv2; x++){
> pDst[y*dstStride+x] = (pSrc[(y<<1)*srcStride+(x<<1)] + pSrc[(y<<1)*srcStride+(x<<1)+1] + pSrc[((y<<1)+1)*srcStride+(x<<1)] + pSrc[((y<<1)+1)*srcStride+((x<<1)+1)] + 2) >> 2;
> }
> }
> }
>
> return;
> }
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot _______________________________________________
Valgrind-users mailing list
Val...@li...
https://lists.sourceforge.net/lists/listinfo/valgrind-users
|
|
From: Philippe W. <phi...@sk...> - 2018-01-26 20:28:28
|
It might be worth trying with --fair-sched=yes, just in case what you see
is due to the unfairness of thread scheduling.
Philippe
On Fri, 2018-01-26 at 06:57 +0000, Wuweijia wrote:
> Hi:
>
> How large is 'm_nStep'? [Are you sure?]
>
> The source as below, all are the integer. Do you care what value ?.
> class CDynamicScheduling
> {
> public:
> static const int m_nDefaultStepUnit;
> static const int m_nDefaultStepFactor;
>
> private:
> int m_nBegin;
> int m_nEnd;
> int m_nStep;
> #if defined(_MSC_VER)
> std::atomic<int> m_nCurrent;
> #else
> int m_nCurrent;
> #endif
>
>
> I hope the actual source contains a comment such as:
> Compute pDst[] as the rounded average of non-overlapping 2x2 blocks of pixels in pSrc[].
>
> Yes, you are right. It just compute the average of 2 * 2 blocks
>
> I show you just the aarch64 neon code:
> This is same function, but implement is x86.
>
> UINT16 *pDstL;
> UINT16 *pSrcL;
> INT32 dstWDiv2 = srcW >> 1;
> // INT32 dstHDiv2 = srcH >> 1;
> INT32 x, y;
> INT32 posDst,posSrc;
>
> pSrcL = pSrc;
> pDstL = pDst;
>
> int beginY, endY;
> while (pDS->GetProcLoop(beginY, endY))
> {
> // for (y = 0; y < dstHDiv2; y++)
> for (y = beginY; y < endY; y++)
> {
> for (x = 0; x < dstWDiv2; x++)
> {
> posDst = y*dstStride + x;
> posSrc = (y<<1)*srcStride + (x<<1);
> pDstL[posDst] = (pSrcL[posSrc] + pSrcL[posSrc + 1] + pSrcL[posSrc+srcStride] + pSrcL[posSrc+srcStride + 1] + 2) >> 2;
> }
> }
> }
>
> pSrc is image buffer, about 11m. Width:3968 Height: 2976 srcStride: 3968
> It meant four thread compute the average of 2 * 2 blocks
> pSrc is divided into many small pieces , and compute the average of every piceces, not by designed, by status of the running threads, maybe some threads hold the cpu ,so they compute more pieces, Maybe some thread not hold the cpu, compute less pieces ;
>
>
> BR
> Owen
>
> -----Original Message-----
> From: John Reiser [mailto:jr...@bi...]
> Sent: 2018-01-26 12:44
> To: val...@li...
> Subject: Re: [Valgrind-users] Re: Re: Re: [Help] Valgrind sometime run the program very slowly sometimes , it last at least one hour. can you show me why or some way to analyze it?
>
> On 01/25/2018 15:37 UTC, Wuweijia wrote:
>
> > Function1:
> > bool CDynamicScheduling::GetProcLoop(
> > int& nBegin,
> > int& nEndPlusOne)
> > {
> > int curr = __sync_fetch_and_add(&m_nCurrent, m_nStep);
>
> How large is 'm_nStep'? [Are you sure?] The overhead expense of switching threads in valgrind would be reduced by making m_nStep as large as possible. It looks like the code in Function2 would produce the same values regardless.
>
>
> > if (curr > m_nEnd)
> > {
> > return false;
> > }
> >
> > nBegin = curr;
> > int limit = m_nEnd + 1;
>
> Local variable 'limit' is unused. By itself this is unimportant, but it might be a clue to something that is not shown here.
>
> > nEndPlusOne = curr + m_nStep;
> > return true;
> > }
> >
> >
> > Function2:
> > ....
> > int beginY, endY;
> > while (pDS->GetProcLoop(beginY, endY)){
> > for (y = beginY; y < endY; y++){
> > for(x = 0; x < dstWDiv2-7; x+=8){
> > vtmp0 = vld2q_u16(&pSrc[(y<<1)*srcStride+(x<<1)]);
> > vtmp1 = vld2q_u16(&pSrc[((y<<1)+1)*srcStride+(x<<1)]);
>
> I hope the actual source contains a comment such as:
> Compute pDst[] as the rounded average of non-overlapping 2x2 blocks of pixels in pSrc[].
>
> > vst1q_u16(&pDst[y*dstStride+x], (vtmp0.val[0] + vtmp0.val[1] + vtmp1.val[0] + vtmp1.val[1] + vdupq_n_u16(2)) >> vdupq_n_u16(2));
> > }
> > for(; x < dstWDiv2; x++){
> > pDst[y*dstStride+x] = (pSrc[(y<<1)*srcStride+(x<<1)] + pSrc[(y<<1)*srcStride+(x<<1)+1] + pSrc[((y<<1)+1)*srcStride+(x<<1)] + pSrc[((y<<1)+1)*srcStride+((x<<1)+1)] + 2) >> 2;
> > }
> > }
> > }
> >
> > return;
> > }
>
|
|
From: Wuweijia <wuw...@hu...> - 2018-02-07 07:52:53
|
Hi:
There is some news about this question. The new code is as below; I changed from __sync_fetch_and_add to pthread_mutex_xxx:
pthread_mutex_lock(&g_mutex);
int curr = m_nCurrent;
m_nCurrent += m_nStep;
pthread_mutex_unlock(&g_mutex);
Now there are no test cases where Valgrind runs too long and fails.
But pthread_mutex_lock is not as efficient as __sync_fetch_and_add, so the pthread_mutex_lock version is just for now, for testing.
And I still think there is something related to the scheduling module of Valgrind. Why does it take so long?
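For reference, the two claim operations being compared can be put side by side in a self-contained sketch. The variable names g_mutex, m_nCurrent, and m_nStep mirror the snippets in this thread; the wrapper function names are invented here:

```cpp
#include <pthread.h>

// Mutex-protected claim, as in the replacement code above.
static pthread_mutex_t g_mutex = PTHREAD_MUTEX_INITIALIZER;

static int claimWithMutex(int* m_nCurrent, int m_nStep) {
    pthread_mutex_lock(&g_mutex);
    int curr = *m_nCurrent;   // read the shared cursor
    *m_nCurrent += m_nStep;   // and advance it under the lock
    pthread_mutex_unlock(&g_mutex);
    return curr;
}

// Lock-free claim, as in the original code.
static int claimWithAtomic(int* m_nCurrent, int m_nStep) {
    return __sync_fetch_and_add(m_nCurrent, m_nStep);
}
```

Both hand out the same sequence of chunk starts; they differ only in how contention is resolved, which is what interacts with Valgrind's thread scheduling.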
BR
Owen
-----Original Message-----
From: John Reiser [mailto:jr...@bi...]
Sent: 2018-01-26 12:44
To: val...@li...
Subject: Re: [Valgrind-users] Re: Re: Re: [Help] Valgrind sometime run the program very slowly sometimes , it last at least one hour. can you show me why or some way to analyze it?
On 01/25/2018 15:37 UTC, Wuweijia wrote:
> Function1:
> bool CDynamicScheduling::GetProcLoop(
> int& nBegin,
> int& nEndPlusOne)
> {
> int curr = __sync_fetch_and_add(&m_nCurrent, m_nStep);
How large is 'm_nStep'? [Are you sure?] The overhead expense of switching threads in valgrind would be reduced by making m_nStep as large as possible. It looks like the code in Function2 would produce the same values regardless.
> if (curr > m_nEnd)
> {
> return false;
> }
>
> nBegin = curr;
> int limit = m_nEnd + 1;
Local variable 'limit' is unused. By itself this is unimportant, but it might be a clue to something that is not shown here.
> nEndPlusOne = curr + m_nStep;
> return true;
> }
>
>
> Function2:
> ....
> int beginY, endY;
> while (pDS->GetProcLoop(beginY, endY)){
> for (y = beginY; y < endY; y++){
> for(x = 0; x < dstWDiv2-7; x+=8){
> vtmp0 = vld2q_u16(&pSrc[(y<<1)*srcStride+(x<<1)]);
> vtmp1 = vld2q_u16(&pSrc[((y<<1)+1)*srcStride+(x<<1)]);
I hope the actual source contains a comment such as:
Compute pDst[] as the rounded average of non-overlapping 2x2 blocks of pixels in pSrc[].
> vst1q_u16(&pDst[y*dstStride+x], (vtmp0.val[0] + vtmp0.val[1] + vtmp1.val[0] + vtmp1.val[1] + vdupq_n_u16(2)) >> vdupq_n_u16(2));
> }
> for(; x < dstWDiv2; x++){
> pDst[y*dstStride+x] = (pSrc[(y<<1)*srcStride+(x<<1)] + pSrc[(y<<1)*srcStride+(x<<1)+1] + pSrc[((y<<1)+1)*srcStride+(x<<1)] + pSrc[((y<<1)+1)*srcStride+((x<<1)+1)] + 2) >> 2;
> }
> }
> }
>
> return;
> }
|
|
From: Philippe W. <phi...@sk...> - 2018-02-07 20:49:31
|
Have you tried with --fair-sched=yes, as suggested in an earlier mail?
What were/are the results?
Philippe
On Wed, 2018-02-07 at 07:52 +0000, Wuweijia wrote:
> Hi:
> There are some news about this question. The new code as below, I change from __sync_fetch_and_add to pthread_mutex_xxx
>
> pthread_mutex_lock(&g_mutex);
> int curr = m_nCurrent;
> m_nCurrent += m_nStep;
> pthread_mutex_unlock(&g_mutex);
>
> Now there is no testcases with valgrind running too long, and failed.
>
> But pthread_mutex_lock is not efficient as __sync_fetch_and_add, so the pthread_mutex_lock is just for now, for testing.
>
> And I think there is something related to schedule module of valgrind . why it last too long?
>
> BR
> Owen
>
>
> -----Original Message-----
> From: John Reiser [mailto:jr...@bi...]
> Sent: 2018-01-26 12:44
> To: val...@li...
> Subject: Re: [Valgrind-users] Re: Re: Re: [Help] Valgrind sometime run the program very slowly sometimes , it last at least one hour. can you show me why or some way to analyze it?
>
> On 01/25/2018 15:37 UTC, Wuweijia wrote:
>
> > Function1:
> > bool CDynamicScheduling::GetProcLoop(
> > int& nBegin,
> > int& nEndPlusOne)
> > {
> > int curr = __sync_fetch_and_add(&m_nCurrent, m_nStep);
>
> How large is 'm_nStep'? [Are you sure?] The overhead expense of switching threads in valgrind would be reduced by making m_nStep as large as possible. It looks like the code in Function2 would produce the same values regardless.
>
>
> > if (curr > m_nEnd)
> > {
> > return false;
> > }
> >
> > nBegin = curr;
> > int limit = m_nEnd + 1;
>
> Local variable 'limit' is unused. By itself this is unimportant, but it might be a clue to something that is not shown here.
>
> > nEndPlusOne = curr + m_nStep;
> > return true;
> > }
> >
> >
> > Function2:
> > ....
> > int beginY, endY;
> > while (pDS->GetProcLoop(beginY, endY)){
> > for (y = beginY; y < endY; y++){
> > for(x = 0; x < dstWDiv2-7; x+=8){
> > vtmp0 = vld2q_u16(&pSrc[(y<<1)*srcStride+(x<<1)]);
> > vtmp1 = vld2q_u16(&pSrc[((y<<1)+1)*srcStride+(x<<1)]);
>
> I hope the actual source contains a comment such as:
> Compute pDst[] as the rounded average of non-overlapping 2x2 blocks of pixels in pSrc[].
>
> > vst1q_u16(&pDst[y*dstStride+x], (vtmp0.val[0] + vtmp0.val[1] + vtmp1.val[0] + vtmp1.val[1] + vdupq_n_u16(2)) >> vdupq_n_u16(2));
> > }
> > for(; x < dstWDiv2; x++){
> > pDst[y*dstStride+x] = (pSrc[(y<<1)*srcStride+(x<<1)] + pSrc[(y<<1)*srcStride+(x<<1)+1] + pSrc[((y<<1)+1)*srcStride+(x<<1)] + pSrc[((y<<1)+1)*srcStride+((x<<1)+1)] + 2) >> 2;
> > }
> > }
> > }
> >
> > return;
> > }
>
|
|
From: John R. <jr...@bi...> - 2018-02-07 21:21:15
|
> Have you tried with --fair-sched=yes, as suggested in an earlier mail ?
> What were/are the results ?
When I tried it, --fair-sched=yes gave even more lop-sided results than without.
I used 2 threads, with a slice size of m_nStep = 16 rasters, and kept track of how many
slices were processed by each thread. Without --fair-sched=yes, the division
was mostly 43 versus 50 or closer, with a few 93 versus 0. With --fair-sched=yes,
the division was often 93 versus 0, with fewer 40 versus 53 or closer.
(Core i5-2500K CPU @ 3.30GHz; 4 cores, otherwise "idle")
I have never encountered the extremely-slow "hang" that the original post described.
This code below takes 31 seconds elapsed bare, and 36 seconds elapsed under valgrind.
The charged CPU time is 119 seconds bare, 35 seconds under valgrind.
Actual hardware contention is brutally slow.
===== build with: g++ -g -O sync-fetch-and-add.cpp -lpthread
#include <pthread.h>
int m_nCurrent;
void *
start_th(void *argstr) // top-level function for thread
{
unsigned t = 0;
for (unsigned j = 0; j < 1000*1000; ++j) {
t += __sync_fetch_and_add(&m_nCurrent, 0x1);
}
return (void *)(long)t;
}
pthread_t thread1, thread2, thread3, thread4;
int
main(int argc, char *argv[])
{
for (unsigned j=0; j < 400; ++j) {
m_nCurrent = 0; // start over each time
int rv1 = pthread_create(&thread1, NULL, start_th, 0);
int rv2 = pthread_create(&thread2, NULL, start_th, 0);
int rv3 = pthread_create(&thread3, NULL, start_th, 0);
int rv4 = pthread_create(&thread4, NULL, start_th, 0);
void *res1, *res2, *res3, *res4;
int rvE1 = pthread_join(thread1, &res1);
int rvE2 = pthread_join(thread2, &res2);
int rvE3 = pthread_join(thread3, &res3);
int rvE4 = pthread_join(thread4, &res4);
}
return 0;
}
=====
|
|
From: John R. <jr...@bi...> - 2018-02-07 21:34:29
|
> This code below takes 31 seconds elapsed bare, and 36 seconds elapsed under valgrind.
> The charged CPU time is 119 seconds bare, 35 seconds under valgrind.
That was valgrind-3.12.0.
With valgrind-3.13.0 on Core i5-6500 CPU @ 3.20 GHz,
I see 34 seconds bare (129 CPU seconds),
and 25 seconds valgrind (25 CPU seconds).
|
|
From: John R. <jr...@bi...> - 2018-02-07 23:11:40
|
>> This code below takes 31 seconds elapsed bare, and 36 seconds elapsed under valgrind.
>> The charged CPU time is 119 seconds bare, 35 seconds under valgrind.
>
> That was valgrind-3.12.0.
> With valgrind-3.13.0 on Core i5-6500 CPU @ 3.20 GHz,
> I see 34 seconds bare (129 CPU seconds),
> and 25 seconds valgrind (25 CPU seconds).
With --fair-sched=yes:
3.12.0: 50 seconds elapsed, 50 seconds CPU [i5-2500, 3.3GHz]
3.13.0: 54 seconds elapsed, 54 seconds CPU [i5-6500, 3.2GHz]
|