|
From: Oswald, M. <mic...@si...> - 2008-03-20 12:15:40
|
Hello,
I am using valgrind (ver. 3.3.0 on SuSE Linux Enterprise Server 9, gcc 3.3.3) on a large project which uses the POST++ persistent object library. In principle, it imports some data from files and creates a lot of (modified) STL containers of objects in a shared memory segment. The binary image of this segment is then saved and, when a process needs it, loaded and mmapped to a fixed address. The objects and containers can then be accessed normally.
When I run valgrind on a process which uses POST (I added some valgrind client requests to tell valgrind about the shared memory), the program crashes when accessing a specific part of the shared memory. This does not happen when the program runs without valgrind, and most runs under valgrind are fine too (if they touch another range of the shared memory).
Valgrind reports something like this:
==10251== Invalid read of size 4
==10251== at 0x804EBCA: main (TESTmib.C:127)
==10251== Address 0x40103d48 is not stack'd, malloc'd or (recently) free'd
==10251==
==10251== Jump to the invalid address stated on the next line
==10251== at 0x40103D40: ???
==10251== by 0x694B20F: (below main) (in /lib/tls/libc.so.6)
==10251== Address 0x40103d40 is not stack'd, malloc'd or (recently) free'd
Program catch signal 6.
The reported address (0x40103d48), however, seems to be in the code segment.
After some research it turned out that I can get the same error with gdb (running the program without valgrind) when the link order of the libraries is changed. This means, for example, that I have to link a program with libPOST libA libB libC and so on in exactly this order, which has to be the same as for the process that generated the binary image. Only with the right link order do the addresses match when the code of the C++ objects in the shared memory is executed.
Now it seems that valgrind, since it provides a slightly different memory model, runs into problems: even when the link order of the libraries is the same, the addresses of some objects may differ, and the code of one library (say libB) then jumps into the void.
So a few questions:
- How does valgrind handle mmap calls with MAP_FIXED?
- Does valgrind respect the link order of the libraries when loading these (I would assume this)?
- Does anybody have an idea how to get valgrind to work with such a process?
lg,
Michael
|
|
From: Igmar P. <mai...@jd...> - 2008-03-20 12:51:25
|
> ==10251== Invalid read of size 4
> ==10251== at 0x804EBCA: main (TESTmib.C:127)
> ==10251== Address 0x40103d48 is not stack'd, malloc'd or (recently) free'd
> ==10251==
> ==10251== Jump to the invalid address stated on the next line
> ==10251== at 0x40103D40: ???
> ==10251== by 0x694B20F: (below main) (in /lib/tls/libc.so.6)
> ==10251== Address 0x40103d40 is not stack'd, malloc'd or (recently) free'd
> Program catch signal 6.
Signal 6 is SIGABRT. Are you sure you're not banging your head against an internal consistency check?
Igmar
|
From: Oswald, M. <mic...@si...> - 2008-03-20 13:41:19
|
> > ==10251== Invalid read of size 4
> > ==10251== at 0x804EBCA: main (TESTmib.C:127)
> > ==10251== Address 0x40103d48 is not stack'd, malloc'd or (recently) free'd
> > ==10251==
> > ==10251== Jump to the invalid address stated on the next line
> > ==10251== at 0x40103D40: ???
> > ==10251== by 0x694B20F: (below main) (in /lib/tls/libc.so.6)
> > ==10251== Address 0x40103d40 is not stack'd, malloc'd or (recently) free'd
> > Program catch signal 6.
>
> A signal 6 is an SIGABRT. Are you sure you're not bouncing your head
> against an internal consistency check?
Well, normally it crashes with SIGSEGV. This was really the first time it crashed with SIGABRT. In the indicated code, there is no assertion near the crash. Maybe, because of the invalid jump, the code it jumped to was interpreted as an abort? I really don't know.
lg,
Michael
|
From: Bart V. A. <bar...@gm...> - 2008-03-20 13:46:24
|
On Thu, Mar 20, 2008 at 2:40 PM, Oswald, Michael <mic...@si...> wrote:
> > ==10251== Invalid read of size 4
> > ==10251== at 0x804EBCA: main (TESTmib.C:127)
> > ==10251== Address 0x40103d48 is not stack'd, malloc'd or (recently) free'd
> > ==10251==
> > ==10251== Jump to the invalid address stated on the next line
> > ==10251== at 0x40103D40: ???
> > ==10251== by 0x694B20F: (below main) (in /lib/tls/libc.so.6)
> > ==10251== Address 0x40103d40 is not stack'd, malloc'd or (recently) free'd
> > Program catch signal 6.
>
> > A signal 6 is an SIGABRT. Are you sure you're not bouncing your head
> > against an internal consistency check?
>
> Well, normally it crashes with SIGSEGV. This was really the first time it
> crashed with SIGABRT. In the indicated code, there is no assertion near the
> crash. Maybe, because of the invalid jump, the code it jumped to was
> interpreted as an abort? I really don't know.
Did you already try adding --trace-signals=yes to Valgrind's command line options? This should tell you more about the cause of the crash.
Bart.
|
From: Oswald, M. <mic...@si...> - 2008-03-20 14:51:51
|
> Did you already try to add --trace-signals=yes to Valgrind's command
> line options ? This should tell you more about the cause of the crash.
Ok, this did put out this:
==21657== Invalid read of size 4
==21657== at 0x804EBCA: main (TESTmib.C:127)
==21657== Address 0x40103d48 is not stack'd, malloc'd or (recently) free'd
--21657-- signal 11 arrived ... si_code=1, EIP=0x804EBCA, eip=0x6883659C
--21657-- SIGSEGV: si_code=1 faultaddr=0x40103D48 tid=1 ESP=0xBEFFD270 seg=NULL
--21657-- delivering signal 11 (SIGSEGV):1 to thread 1
--21657-- push_signal_frame (thread 1): signal 11
==21657== at 0x804EBCA: main (TESTmib.C:127)
--21657-- Async handler got signal 6 for tid 2 info 0
--21657-- kill: sent signal 6 to pid 21657
--21657-- VG_(signal_return) (thread 1): isRT=0 valid magic; EIP=0x804EBCA
==21657==
==21657== Jump to the invalid address stated on the next line
==21657== at 0x40103D40: ???
==21657== by 0x694B20F: (below main) (in /lib/tls/libc.so.6)
==21657== Address 0x40103d40 is not stack'd, malloc'd or (recently) free'd
--21657-- translations not allowed here (0x40103d40) - throwing SEGV
--21657-- delivering signal 11 (SIGSEGV):1 to thread 1
--21657-- push_signal_frame (thread 1): signal 11
==21657== at 0x40103D40: ???
==21657== by 0x694B20F: (below main) (in /lib/tls/libc.so.6)
--21657-- Async handler got signal 6 for tid 3 info 0
--21657-- kill: sent signal 6 to pid 21657
--21657-- VG_(signal_return) (thread 1): isRT=0 valid magic; EIP=0x40103D40
--21657-- translations not allowed here (0x40103d40) - throwing SEGV
--21657-- delivering signal 11 (SIGSEGV):1 to thread 1
--21657-- push_signal_frame (thread 1): signal 11
==21657== at 0x40103D40: ???
==21657== by 0x694B20F: (below main) (in /lib/tls/libc.so.6)
--21657-- Async handler got signal 6 for tid 4 info 0
--21657-- kill: sent signal 6 to pid 21657
--21657-- VG_(signal_return) (thread 1): isRT=0 valid magic; EIP=0x40103D40
--21657-- translations not allowed here (0x40103d40) - throwing SEGV
--21657-- delivering signal 11 (SIGSEGV):1 to thread 1
--21657-- push_signal_frame (thread 1): signal 11
==21657== at 0x40103D40: ???
==21657== by 0x694B20F: (below main) (in /lib/tls/libc.so.6)
--21657-- Async handler got signal 6 for tid 5 info 0
--21657-- kill: sent signal 6 to pid 21657
--21657-- VG_(signal_return) (thread 1): isRT=0 valid magic; EIP=0x40103D40
--21657-- translations not allowed here (0x40103d40) - throwing SEGV
--21657-- delivering signal 11 (SIGSEGV):1 to thread 1
--21657-- push_signal_frame (thread 1): signal 11
==21657== at 0x40103D40: ???
==21657== by 0x694B20F: (below main) (in /lib/tls/libc.so.6)
--21657-- Async handler got signal 6 for tid 6 info 0
--21657-- kill: sent signal 6 to pid 21657
--21657-- VG_(signal_return) (thread 1): isRT=0 valid magic; EIP=0x40103D40
--21657-- translations not allowed here (0x40103d40) - throwing SEGV
--21657-- delivering signal 11 (SIGSEGV):1 to thread 1
--21657-- push_signal_frame (thread 1): signal 11
==21657== at 0x40103D40: ???
==21657== by 0x694B20F: (below main) (in /lib/tls/libc.so.6)
--21657-- kill: sent signal 6 to pid 21657
--21657-- poll_signals: got signal 6 for thread 1
--21657-- Polling found signal 6 for tid 1
--21657-- delivering signal 6 (SIGABRT):0 to thread 1
--21657-- push_signal_frame (thread 1): signal 6
==21657== at 0x695D8B6: kill (in /lib/tls/libc.so.6)
==21657== by 0x62EDA77: (within /lib/tls/libpthread.so.0)
==21657== by 0x694B20F: (below main) (in /lib/tls/libc.so.6)
--21657-- delivering signal 6 (SIGABRT):0 to thread 6
--21657-- push_signal_frame (thread 6): signal 6
==21657== at 0x69BEE66: (within /lib/tls/libc.so.6)
==21657== by 0x49F4FD9: APPLItaskManagerInterface::checkTaskManager() (APPLItaskManagerInterface.C:1043)
==21657== by 0x49F4419: APPLItaskManagerInterface::f_checkTaskManger(void*) (APPLItaskManagerInterface.C:942)
==21657== by 0x49F55A1: APPLItaskManagerInterface_f_checkTaskMangerWrapper (APPLItaskManagerInterface.C:1064)
==21657== by 0x62E7CF6: start_thread (in /lib/tls/libpthread.so.0)
==21657== by 0x69F02ED: clone (in /lib/tls/libc.so.6)
==21657== by 0xA85EBAF: ???
--21657-- delivering signal 6 (SIGABRT):0 to thread 2
--21657-- push_signal_frame (thread 2): signal 6
==21657== at 0x62E9F7C: pthread_cond_timedwait@@GLIBC_2.3.2 (in /lib/tls/libpthread.so.0)
--21657-- delivering signal 6 (SIGABRT):0 to thread 3
--21657-- push_signal_frame (thread 3): signal 6
==21657== at 0x62E9D06: pthread_cond_wait@@GLIBC_2.3.2 (in /lib/tls/libpthread.so.0)
==21657== by 0x65EFA1C: omniOrbORB::run() (in /opt/omniORB-4.1.0/lib/libomniORB4.so.1.0)
==21657== by 0x4889281: MISCcorba::startEventLoop(void (*)(void*), void*) (MISCcorba.C:1588)
==21657== by 0x488E63E: MISCcorbaLoopThread::threadMethod() (MISCcorba.C:94)
==21657== by 0x4892FE0: MISCthread::threadFunc(void*) (MISCthread.C:244)
==21657== by 0x48931A7: MISCthread_threadFuncWrapper (MISCthread.C:347)
==21657== by 0x62E7CF6: start_thread (in /lib/tls/libpthread.so.0)
==21657== by 0x69F02ED: clone (in /lib/tls/libc.so.6)
==21657== by 0x905BBAF: ???
--21657-- delivering signal 6 (SIGABRT):0 to thread 5
--21657-- push_signal_frame (thread 5): signal 6
==21657== at 0x62E9F7C: pthread_cond_timedwait@@GLIBC_2.3.2 (in /lib/tls/libpthread.so.0)
==21657== by 0x489125F: _CORBA_Sequence<unsigned char>::copybuffer(unsigned long) (seqTemplatedecls.h:296)
--21657-- delivering signal 6 (SIGABRT):0 to thread 4
--21657-- push_signal_frame (thread 4): signal 6
==21657== at 0x69E6E44: poll (in /lib/tls/libc.so.6)
==21657== by 0x667E5ED: omni::SocketCollection::Select() (in /opt/omniORB-4.1.0/lib/libomniORB4.so.1.0)
==21657== by 0x66A5002: omni::tcpEndpoint::AcceptAndMonitor(void (*)(void*, omni::giopConnection*), void*) (in /opt/omniORB-4.1.0/lib/libomniORB4.so.1.0)
==21657== by 0x666451F: omni::giopRendezvouser::execute() (in /opt/omniORB-4.1.0/lib/libomniORB4.so.1.0)
==21657== by 0x660AB7F: omniAsyncWorker::real_run() (in /opt/omniORB-4.1.0/lib/libomniORB4.so.1.0)
==21657== by 0x660A4CC: omniAsyncWorkerInfo::run() (in /opt/omniORB-4.1.0/lib/libomniORB4.so.1.0)
==21657== by 0x660ADE8: omniAsyncWorker::run(void*) (in /opt/omniORB-4.1.0/lib/libomniORB4.so.1.0)
==21657== by 0x6578982: omni_thread_wrapper (in /opt/omniORB-4.1.0/lib/libomnithread.so.3.3)
==21657== by 0x62E7CF6: start_thread (in /lib/tls/libpthread.so.0)
==21657== by 0x69F02ED: clone (in /lib/tls/libc.so.6)
Program catch signal 6.
Hm, so it initially throws a signal 11, but the Async handler gets a signal 6? Or does this mean, on signal 11 it sends signal 6 to abort the other threads?
Somehow confusing...
lg,
Michael
|
|
From: Bart V. A. <bar...@gm...> - 2008-03-20 13:04:58
|
On Thu, Mar 20, 2008 at 1:11 PM, Oswald, Michael <mic...@si...> wrote:
> - Does anybody have an idea how to get valgrind to work with such a process?
Hello Michael,
If I understood your e-mail correctly, the memory of the process that you are analyzing with Valgrind has been initialized by another process? In that case you will have to include the file valgrind.h in your program and declare the shared memory segment as initialized (see also http://www.valgrind.org/docs/manual/mc-manual.html#mc-manual.clientreqs). The InfiniBand people are also doing this (InfiniBand is a networking technology that allows one process to write into the memory of a process on another server).
Bart.
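The client-request approach described above could look roughly like the following. This is a minimal sketch, assuming the Memcheck header is installed as <valgrind/memcheck.h>. VALGRIND_MAKE_MEM_DEFINED and RUNNING_ON_VALGRIND are Valgrind's own macros; mark_segment_defined is a hypothetical helper name, and the fallback macros exist only so that the sketch still builds where the Valgrind headers are missing. Note that this only changes Memcheck's view of which bytes are initialized. It does not influence where anything is mapped.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__has_include)
#  if __has_include(<valgrind/memcheck.h>)
#    include <valgrind/memcheck.h>
#    define HAVE_MEMCHECK_H 1
#  endif
#endif

#ifndef HAVE_MEMCHECK_H
// Assumption: no Valgrind headers available, so use no-op stand-ins.
#  define RUNNING_ON_VALGRIND 0
#  define VALGRIND_MAKE_MEM_DEFINED(addr, len) ((void)(addr), (void)(len), 0)
#endif

// Hypothetical helper: tell Memcheck that [base, base + len) already holds
// valid data that was written by another process (e.g. a loaded image).
// Returns true when the program is currently running under Valgrind.
inline bool mark_segment_defined(void* base, std::size_t len) {
    assert(base != nullptr && len > 0);
    (void)VALGRIND_MAKE_MEM_DEFINED(base, len);  // no effect outside Valgrind
    return RUNNING_ON_VALGRIND != 0;
}
```

It would be called once, right after the segment has been mapped, with the base address and size of the mapping.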
|
From: Oswald, M. <mic...@si...> - 2008-03-20 13:33:27
|
> Hello Michael,
> If I understood your e-mail correctly, the memory of the process that
> you are analyzing with Valgrind has been initialized by another
> process?
Yes. First an importer process is run, which generates the objects in the shared memory. The content of the shared memory is then stored in a file. Later, when the system is started and one of its processes needs the object information, it loads the file, mmaps it and then uses the objects like normal C++ code.
> In that case you will have to include the file valgrind.h in your
> program and declare the shared memory segment as initialized (see also
> http://www.valgrind.org/docs/manual/mc-manual.html#mc-manual.clientreqs).
> The InfiniBand people are also doing this (InfiniBand is a networking
> technology that allows one process to write in the memory of a process
> on another server).
Yes, I did this. After the mmap call I declared the whole block with VALGRIND_MALLOCLIKE_BLOCK. And it works for some of the objects (e.g. I can access the TM objects, whereas the crash appears on accessing the TC objects). Unfortunately, the persistent object store requires fixed addresses, so if for some reason the shared libraries are loaded in a different order, they end up at different addresses and the code doesn't work anymore. This is really annoying (and really outdated behaviour for a system) but I have to live with it.
lg,
Michael
|
From: Christoph B. <bar...@or...> - 2008-03-20 14:06:01
|
On Thursday, 20 March 2008, Oswald, Michael wrote:
> Yes, I did this. After the mmap call I declared the whole block with
> VALGRIND_MALLOCLIKE_BLOCK. And it works for some of the objects (e.g. I can
> access the TM objects whereas the crash appears on accessing the TC
> objects). Unfortunately, the requirements for using the persistent object
> store are the fixed addresses, so if for some reason the loading of the
> shared libraries is in a different order, they get allocated to a different
> address and the code doesn't work anymore. This is really annoying (and a
> really outdated behaviour for a system) but I have to live with it.
Do you check that the mmap succeeds in loading the objects at the desired address? I would suspect that this is not the case.
Why isn't proper serialization of the containers used? I would bet that it is still fast enough, but safe to use.
Christoph
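Whether a mapping really landed at the requested address can be checked directly. The following is a minimal sketch, assuming a Linux/POSIX mmap; map_at_exact is a hypothetical helper and not part of POST++. It passes the address only as a hint (no MAP_FIXED), so an existing mapping is never replaced, and it treats any other resulting address as a failure.

```cpp
#include <cassert>
#include <cstddef>
#include <sys/mman.h>

// Hypothetical helper: map `len` bytes at exactly `wanted`, or fail.
// fd < 0 requests an anonymous mapping; otherwise the file is mapped privately.
// Returns the mapped address on success and nullptr otherwise.
inline void* map_at_exact(void* wanted, std::size_t len, int fd) {
    const int flags = (fd < 0) ? (MAP_PRIVATE | MAP_ANONYMOUS) : MAP_PRIVATE;
    void* got = mmap(wanted, len, PROT_READ | PROT_WRITE, flags, fd, 0);
    if (got == MAP_FAILED) return nullptr;
    if (got != wanted) {
        // The kernel chose another address, so the requested range was not usable.
        munmap(got, len);
        return nullptr;
    }
    return got;
}
```

A caller that depends on a fixed address can then report a clear error when nullptr comes back, instead of continuing with objects whose internal pointers no longer match.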
|
From: Oswald, M. <mic...@si...> - 2008-03-20 14:30:16
|
> Do you check that the mmap succeeds to load the objects at the desired
> address? I would suspect that this is not the case.
Yup, it is checked. The test program that generated the valgrind output I posted simply loops through the containers and dumps out all classes. It runs OK for the first half (the TM classes), but fails for the TC classes. I put some logging output into some of the classes to print the addresses of *this and some members. With this logging it even gets through the first few TC classes without problems and then crashes. Somehow strange.
The code that valgrind points to looks like this:
....
const CMDpktSlotDef* sd = (*(m_def->getPkt()->getPktParamSlots()))[i];
if(!sd) continue;
unsigned short length = sd->getLength(); <--- crash appears here
....
The CMDpktSlotDef lies in the shared memory. The getLength() is like this:
virtual const unsigned short getLength() const
{return (m_pktParam) ? m_pktParam->getLength() : 0;};
When I print the pointers, they are all valid (with addresses in the shared memory), yet it still crashes with the 0x40xxxxxx address, which seems to be in the code segment (the shared mem is mapped to 0x71000000).
The only hint I have about what's going on is the manual, which told me about the link order of the libraries, and the fact that I can reproduce the error without valgrind when I change the link order for the test program.
With an earlier version of the system, it was possible to use Purify on Solaris. But you had to purify the importer, import the data into the shared mem, and then purify the process to debug, which then uses the (purified) memory image. Unfortunately, this doesn't even work with Purify under Linux, so valgrind is some kind of last resort.
> Why is not proper serializing of the containers used? I would bet that it is
> still fast enough but safe to use.
I am completely with you on this and I would be the first volunteer to change it, but this is not in my hands. This system has evolved from 1996 onwards and went through numerous changes, but sadly not in that area.
lg,
Michael
|
|
From: Christoph B. <bar...@or...> - 2008-03-20 14:48:11
|
On Thursday, 20 March 2008, Oswald, Michael wrote:
> The code, where valgrind points to is like this:
>
> ....
>
> const CMDpktSlotDef* sd = (*(m_def->getPkt()->getPktParamSlots()))[i];
> if(!sd) continue;
> unsigned short length = sd->getLength(); <--- crash appears here
>
> ....
>
> The CMDpktSlotDef lies in the shared memory. The getLength() is like this:
>
> virtual const unsigned short getLength() const
> {return (m_pktParam) ? m_pktParam->getLength() : 0;};
>
> When I print the pointers, they are all valid (with addresses in the shared
> memory), still it crashes with the 0x40xxxxxx address, which seems to be in
> the code segment (the shared mem is mapped to 0x71000000).
>
Is m_pktParam a variable of a class with a virtual table? If yes, is the
virtual table loaded correctly?
Do you have a small test program to look at the problem?
Christoph
|
|
From: Oswald, M. <mic...@si...> - 2008-03-20 15:23:50
|
> Is m_pktParam a variable of a class with a virtual table? If yes, is the
> virtual table loaded correctly?
Yes it is (CMDpktParDef). It inherits from a class "object" which has only a compiler-generated default constructor/destructor. Its only virtual function is the virtual destructor. No other classes inherit from it. Funny.
I did a small test: I removed the virtual from the destructor, so the class shouldn't be virtual anymore. The result was exactly the same.
But CMDpktSlotDef has some virtual functions and getLength() is one of them, so I would say we are heading in the right direction.
I can't tell if the virtual table is loaded correctly. It should be part of the shared lib that CMDpktSlotDef is compiled into, right?
So in principle it should be possible to get the vptr from an instance and dump the vtable somehow. Hm, maybe I should check whether gcc provides something that supports this.
> Do you have a small testprogramm to look at the problem?
I use this small program, which simply loads the shared mem, loops over it and dumps the classes on cout.
lg,
Michael
|
From: Christoph B. <bar...@or...> - 2008-03-20 15:33:32
|
On Thursday, 20 March 2008, Oswald, Michael wrote:
> > Is m_pktParam a variable of a class with a virtual table? If yes, is the
> > virtual table loaded correctly?
>
> Yes it is (CMDpktParDef). It inherits from a class "object" which has only
> compiler generated default constructor/destructor. It's only virtual
> function is the virtual destructor. No other classes inherit from it.
> Funny.
>
> I did a small test: removed the virtual from the destructor, so the class
> shouldn't be virtual anymore. The result was exactly the same.
>
> But CMDpktSlotDef has some virtual functions and the getLength() is one of
> them, so I would say we are going into the right direction.
>
> I can't tell, if the virtual table is loaded correctly, it should be part
> of the shared lib where CMDpktSlotDef is compiled in, right?
>
> So in principle it should be possible to get the vptr from an instance and
> dump the vtable somehow. Hm, maybe I should look, if gcc provides something
> which supports this.
You can look at the vptr in gdb. Just create an object of type CMDpktSlotDef and one of type CMDpktParDef in your main and check that, for each class, the vptr points to the same address as in the corresponding object from the shared memory.
> > Do you have a small testprogramm to look at the problem?
>
> I use this small program, which simply loads the shared mem, loops over it
> and dumps the classes on cout.
You do not want to provide it?
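The vptr comparison suggested above can also be done from within the program. The following is a minimal sketch, assuming the Itanium C++ ABI used by GCC on Linux, where the vptr is the first pointer-sized word of a polymorphic object. Reading it this way is implementation-specific and not guaranteed by the C++ standard. SlotDemo and ParDemo are hypothetical stand-ins for the real classes, and vptr_of and vptr_matches_local are illustrative helper names.

```cpp
#include <cassert>
#include <cstring>
#include <type_traits>

// Read the vptr of a polymorphic object (assumption: Itanium C++ ABI layout,
// i.e. the vptr is stored in the first pointer-sized word of the object).
template <typename T>
const void* vptr_of(const T& obj) {
    static_assert(std::is_polymorphic<T>::value, "T must have virtual functions");
    const void* vptr = nullptr;
    std::memcpy(&vptr, static_cast<const void*>(&obj), sizeof vptr);
    return vptr;
}

// Does `stored` carry the vptr that this process uses for objects of type T?
// A mismatch means the stored object was built against a different layout.
// Only meaningful when the dynamic type of `stored` is exactly T.
template <typename T>
bool vptr_matches_local(const T& stored) {
    const T fresh{};  // freshly constructed object with the current vptr
    return vptr_of(stored) == vptr_of(fresh);
}

// Hypothetical stand-ins for the classes discussed in the thread.
struct SlotDemo {
    virtual ~SlotDemo() {}
    virtual unsigned short getLength() const { return 1; }
};
struct ParDemo {
    virtual ~ParDemo() {}
    virtual unsigned short getLength() const { return 2; }
};
```

Running such a check over each object loaded from the image, before any virtual call is made on it, would turn a jump to a stale address into an explicit diagnostic.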
|
From: Oswald, M. <mic...@si...> - 2008-03-20 16:34:24
|
> You can look at the vptr in gdb. Just create an object of type CMDpktSlotDef
> and CMDpktParDef in your main and check that the vptr point to the same
> address in both cases.
Ok, I think we have a hit:
Local object in the beginning of main:
_vptr.CMDpktSlotDef = 0x401038e0
_vptr.CMDpktParDef = 0x401038c0
Objects in shared mem:
_vptr.CMDpktSlotDef = 0x40103d40
_vptr.CMDpktParDef = 0x401038c0
So the CMDpktParDef vptrs are identical, but the CMDpktSlotDef ones are not. The strange thing, anyway, is that the program runs through normally. Maybe it doesn't access critical virtual functions...
So I'll have to investigate where this mismatch of addresses comes from...
> > I use this small program, which simply loads the shared mem, loops over it
> > and dumps the classes on cout.
> You do not want to provide it?
It's simply not possible. Though it is a small program, it pulls in a lot of libraries which it depends on. And you would need the importer program too, and its libraries, together with the data files to import and the configuration of a lot of variables which is needed by most libraries (last time it took me a week to configure it correctly for the first startup). So sadly, it is not feasible.
Thanks very much to all and a happy Easter!
lg,
Michael
|
From: Julian S. <js...@ac...> - 2008-03-20 16:41:12
|
I would say first that in my view using MAP_FIXED for anything is a bad idea. It silently replaces or truncates any existing mapping which overlaps the requested range, but there is no easy way to know beforehand whether this will happen. The only way to use it safely is to have some way to know what the process' address space layout is, like reading /proc/self/maps, or in some very specialised situations, as ld.so does.
I worked for a while on a compiler runtime (http://haskell.org/ghc) that used MAP_FIXED to place the heap at certain locations. This caused enough portability and reliability problems that we eventually stopped using it.
> So a few questions:
> - How does valgrind handle mmap calls with MAP_FIXED?
It respects MAP_FIXED if it can, but will reject calls which could overwrite Valgrind's code or data mappings. So whether such a call fails or succeeds is likely to differ between Valgrind and native runs. Note that Valgrind changes the process' address space layout a lot compared to a native run, and so assumptions about what-is-where or what areas are free that might appear to work natively may not work when running in Valgrind.
> - Does valgrind respect the link order of the libraries when loading these
> (I would assume this)?
Yes. But, uh, requiring the libraries to load in a particular order seems to me to be a sign of fragility.
> - Does anybody have an idea how to get valgrind to work with such a
> process?
Best thing is to send a small test case which shows the problem. I read through the rest of the thread but can't see enough info there to say much else.
How does POST deal with the address space randomization that modern kernels commonly do? Even when not using Valgrind, wouldn't address space randomization cause it problems?
J
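The idea of consulting /proc/self/maps before choosing a fixed address can be sketched as follows. This is a minimal illustration, assuming the Linux maps format in which every line starts with "lo-hi" in hexadecimal; range_is_free is a hypothetical helper name. It takes a stream so that it can be fed either the real file or canned input. The answer is only a snapshot, since other threads or the dynamic loader may map something into the range afterwards.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <istream>
#include <sstream>   // std::istringstream, handy for trying this with canned input
#include <string>

// Returns true if [start, start + len) overlaps none of the mappings listed
// in `maps`, which is expected to be in /proc/<pid>/maps format.
inline bool range_is_free(std::istream& maps, std::uint64_t start, std::uint64_t len) {
    assert(len > 0);
    const std::uint64_t end = start + len;
    std::string line;
    while (std::getline(maps, line)) {
        char* p = nullptr;
        const std::uint64_t lo = std::strtoull(line.c_str(), &p, 16);
        if (*p != '-') continue;                   // not a "lo-hi ..." line
        const std::uint64_t hi = std::strtoull(p + 1, &p, 16);
        if (start < hi && lo < end) return false;  // half-open ranges overlap
    }
    return true;
}
```

For the running process, the stream would be a std::ifstream opened on "/proc/self/maps".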
|
From: Oswald, M. <mic...@si...> - 2008-03-21 18:14:06
|
> I would say first that in my view using MAP_FIXED for anything is a
> bad idea. It silently replaces or truncates any existing mapping which
> overlaps the requested range, but there is no easy way to know beforehand
> if this will happen. The only way to use it safely is to have some
> way to know what the process' address space layout is, like reading
> /proc/self/maps, or in some very specialised situations, as ld.so does.
I totally agree with that. The system I am working on was developed by many companies and we proposed a few times to drop POST and use something different, more portable and safe, but the proposal was never accepted. The current approach is that there is a small test program which uses some kind of heuristic to determine an address, which is then fixed with an environment variable. Still a rather lousy approach.
> It respects MAP_FIXED if it can, but will reject calls which could
> overwrite Valgrind's code or data mappings. So it will likely fail
> vs succeed differently on Valgrind than natively. Note that Valgrind
> changes the process' address space layout a lot compared to natively, and
> so assumptions about what-is-where or what areas are free that might
> appear to work natively may not work when running in Valgrind.
That's what I was afraid of...
> Yes. But, uh, requiring the libraries to load in a particular order
> seems to me to be a sign of fragileness.
Yes, of course. I think you can imagine that you run into very funny crashes if you recompile some of the libraries, forget to re-import into POST and try to run the system afterwards... Or you add some new library and forget about the link order... Some people have already spent days debugging crashes which were caused by this...
> Best thing is to send a small test case which shows the problem. I
> read through the rest of the thread but can't see from that enough
> info to say anything much else.
I don't know if I am able to strip down the code to something like that. I will try. POST itself is free (http://www.ispras.ru/~knizhnik/post.html).
> How does POST deal with address space randomization that modern
> kernels commonly do? Even when not using Valgrind, wouldn't address
> space randomization cause it problems?
Yes. Normally this isn't a problem, since the system is only supported on older kernels. I myself did a port to SuSE Linux Enterprise Server 10, where I ran into exactly this problem. The solution was quite simple: we added the disable_rand_maps kernel parameter at startup, which disables this feature.
thanks,
Michael
|
From: Nicholas N. <nj...@cs...> - 2008-03-22 00:32:44
|
On Fri, 21 Mar 2008, Oswald, Michael wrote:
>> I would say first that in my view using MAP_FIXED for anything is a
>> bad idea. It silently replaces or truncates any existing mapping which
>> overlaps the requested range, but there is no easy way to know beforehand
>> if this will happen. The only way to use it safely is to have some
>> way to know what the process' address space layout is, like reading
>> /proc/self/maps, or in some very specialised situations, as ld.so does.
>
> I totally agree with that. The system I am working on was developed by
> many companies and we proposed a few times to drop POST and use something
> different, more portable and safe, but the proposal was never accepted.
So you run the program natively, write out a data structure from memory to a file, and then try to read it in from a program running under Valgrind? And it doesn't work because the reading-in expects the data structure to be at exactly the same address as when it was written out?
Assuming that's right, I don't see how it's ever going to work -- Valgrind provides an environment that is similar to native, but not identical, and any program that relies so much on things such as memory layout is a hopeless case for Valgrind, IMO.
Nick
|
From: Julian S. <js...@ac...> - 2008-03-22 00:54:03
|
On Saturday 22 March 2008 01:32, Nicholas Nethercote wrote:
> On Fri, 21 Mar 2008, Oswald, Michael wrote:
> >> I would say first that in my view using MAP_FIXED for anything is a
> >> bad idea. It silently replaces or truncates any existing mapping which
> >> overlaps the requested range, but there is no easy way to know
> >> beforehand if this will happen. The only way to use it safely is to
> >> have some way to know what the process' address space layout is, like
> >> reading /proc/self/maps, or in some very specialised situations, as
> >> ld.so does.
> >
> > I totally agree with that. The system I am working on was developed by
> > many companies and we proposed a few times to drop POST and use something
> > different, more portable and safe, but the proposal was never accepted.
>
> So you run the program natively, write out a data structure from memory to
> a file, and then try to read it in from a program running under Valgrind?
> And it doesn't work because the reading-in expects the data structure to be
> at exactly the same address as when it was written out?
>
> Assuming that's right, I don't see how it's ever going to work -- Valgrind
> provides an environment that is similar to native, but not identical, and
> any program that relies so much on things such as memory layout is a
> hopeless case for Valgrind, IMO.
I agree; and so (in contradiction to my previous comments) I don't think there's much point in you making a test case. (sorry)
GNU emacs has some similar kind of weirdness, resulting in it being one of the few programs that won't run on Valgrind (at least with a default build of emacs). It complains that it has run out of memory as soon as it starts up, and quits.
J
|
From: Oswald, M. <mic...@si...> - 2008-03-25 09:56:19
|
> > Assuming that's right, I don't see how it's ever going to work -- Valgrind
> > provides an environment that is similar to native, but not identical, and
> > any program that relies so much on things such as memory layout is a
> > hopeless case for Valgrind, IMO.
> I agree; and so (in contradiction to my previous comments) I don't think
> there's much point in you making a test case. (sorry)
> GNU emacs has some similar kind of weirdness, resulting in it being
> one of the few programs that won't run on Valgrind (at least with
> a default build of emacs). It complains that it has run out of
> memory as soon as it starts up, and quits.
Ok, I was afraid of that. So I'll keep banging my head against the keyboard and hope that the responsible people will allow us to replace this trash sometime in the near future.
Many thanks to all, anyway!
Michael