From: Durk T. <d.t...@xs...> - 2009-05-15 17:21:04
|
Hi All,

Thanks to a report by regular Forum poster MD-TERP (a.k.a. Rob), we found a way to trigger the infamous NaN warning in a rather reliable manner. Following up on this lead, I modified OpenSceneGraph so that it deliberately segfaults when the warning message is triggered (I know, I could have set a breakpoint there in GDB, but I was lazy).

Although I don't have a fix yet, I've collected quite a bit of information about the problem, and have a pretty good insight as to what's going on.

Rob suggested trying a flight from KMTN to KSKY, using the Citation-Bravo, cruising at FL240, enabling AI Traffic, Traffic Manager, ATC, and Multiplayer. The com1 radio was tuned to KMTN tower (121.3), but this is probably not relevant. The actual aircraft also probably doesn't matter. Using this setup, a NaN error is triggered rather reliably while crossing the Pittsburgh area.

Two stack traces were obtained, and a third one was examined in more detail using GDB, and saved as a core file.

Stack Trace #1 showed that the warning was triggered inside the AI Traffic subsystem, in particular in a call to

globals->get_scenery()->get_elevation_m()

Stack Trace #2 showed that the warning was triggered in src/Environment/ridge_lift.cxx, again in a call to get_elevation_m()

So, in conclusion, it looks like get_elevation_m() is the culprit. However, it also seemed that the NaN warning was only triggered when AI Traffic was activated, which seemed to be at odds with it being triggered by ridge_lift as well.

At this point, Mathias Froelich pointed out to me that 3D model objects included in the scene graph are involved in the ground elevation scanning process. Mathias also suggested that it should be possible to determine the offending model by printing out the this->_name variable at various levels of the stack. Traversing the stack, I found that at least one possible source of the problem can be found in

Aircraft/c172p/Models/c172p.xml,

more particularly in the included model

Aircraft/Instruments-3d/mag-compass.xml,

and there in the object with the name "Interior". I'm not enough of a 3D modeling expert to determine what could possibly be wrong in this object. Mathias suggested that a triangle with a zero-sized surface or something like that could blow the math. If anybody with more expertise than me could have a look at it, please be my guest.

The fact that an included component of the c172 causes the math of the ground elevation code to blow up perfectly explains why a host of subsystems (at least AI Traffic itself and ridge_lift) are vulnerable to this error, while disabling AI Traffic makes the error disappear from these other systems as well. By placing a c172 model in the scene, AI Traffic creates a vulnerability, not only for itself, but also for other systems. Although I haven't seen an error triggered by traffic generated by the traffic manager (which is, I stress once again, an entirely separate system), this is probably coincidental: we currently don't have much Traffic Manager related activity in the test area, so the probability of seeing an error generated by the traffic manager is, in this particular case, low. Had traffic schedules involving the c172 model existed, the traffic manager would likely have caused a similar error/vulnerability itself.

So, to conclude, the mag-compass.ac instrument should be checked for possible modeling errors. Also, I would suggest that the ground elevation code print a meaningful message in case of an error, so that future errors could be tracked down more easily. This probably can't be done in FlightGear, because it would require some modification of the OSG code.

Until a full fix is in place, a special purpose AI c172 model could be created, which doesn't contain the interior. Users not regularly flying the c172 could remove the 3d compass, so that it becomes usable again as an AI aircraft.

Obviously, I can't rule out that there are more possible causes for the NaN warning. However, given that its occurrence seems to be strongly related to activating AI Traffic, I suspect the warnings are caused by a relatively narrow set of circumstances.

Cheers,
Durk |
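To illustrate Mathias's suggestion that a zero-area triangle could blow the math, here is a minimal sketch. It is not the OSG or FlightGear intersection code; `Vec3`, `naiveUnitNormal`, and `isDegenerate` are hypothetical names made up for this example. It only shows how normalising the normal of a degenerate triangle produces 0/0 = NaN, and how a simple area check could catch such a triangle beforehand.

```cpp
#include <cmath>

struct Vec3 { double x, y, z; };

Vec3 sub(const Vec3& a, const Vec3& b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }

Vec3 cross(const Vec3& a, const Vec3& b) {
    return {a.y * b.z - a.z * b.y,
            a.z * b.x - a.x * b.z,
            a.x * b.y - a.y * b.x};
}

double length(const Vec3& v) { return std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z); }

// Naive unit normal: divides by the length of the cross product.
// For a zero-area triangle that length is 0, so each component is 0/0 = NaN.
Vec3 naiveUnitNormal(const Vec3& a, const Vec3& b, const Vec3& c) {
    Vec3 n = cross(sub(b, a), sub(c, a));
    double len = length(n);
    return {n.x / len, n.y / len, n.z / len};
}

// Hypothetical guard: |(b-a) x (c-a)| is twice the triangle area,
// so a value at or below eps marks the triangle as degenerate.
bool isDegenerate(const Vec3& a, const Vec3& b, const Vec3& c, double eps = 1e-12) {
    return length(cross(sub(b, a), sub(c, a))) <= eps;
}
```

With three collinear points such as (0,0,0), (1,0,0), (2,0,0), `isDegenerate` returns true and `naiveUnitNormal` yields NaN components. The threshold `eps` is an arbitrary choice for the sketch and would need tuning to the model's units.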
From: Curtis O. <cur...@gm...> - 2009-05-15 17:58:20
|
Here is a quick thought (not having thought this all the way through).

Originally we only queried the altitude of a single point beneath our aircraft. As we've moved forward, we have created a cache of local triangles and can query the altitude of each wheel and contact point. But we have also added Nasal and C++ interfaces to query the altitude of any arbitrary point.

I wonder though, if there is no scenery tile loaded for the requested query location, what happens? Are these tiles somehow scheduled for loading? The cache size is fixed, so if we have too many far-ranging altitude queries, could we be running into a situation where the requested tiles are flushed to make room for something else?

I did a long haul flight recently from Boston to NY to MSP in the alphajet (current CVS version) and not far out of Boston started dropping tiles right and left ... it was a mess. I got to the point where I had no visible tiles loaded, just flying over empty space as far as the eye could see.

Then later I tried another long cross country to verify the problem, and I didn't see one dropped tile.

So there is some random goofiness somewhere in our tile caching scheme, and I don't personally have a good understanding of how these arbitrary position queries play with our tile caching/scheduling/loading scheme ... I suspect there could be some contention there.

Best regards,
Curt.

-- Curtis Olson: http://baron.flightgear.org/~curt/

_______________________________________________
Flightgear-devel mailing list
Fli...@li...
https://lists.sourceforge.net/lists/listinfo/flightgear-devel |
From: Heiko S. <aei...@ya...> - 2009-05-15 18:10:05
|
Hi,

interesting to see that such a small instrument can cause such big trouble.

What I wonder: Aircraft/Instruments-3d/mag-compass.xml isn't only used by the c172p, but by other aircraft as well. I think the pa24-250 is another user of it. Do these aircraft also cause the NaN errors?

I just checked the Interior object, but at the moment I can't see where the mistake is. Any idea how to check?

Regards
HHS |
From: Durk T. <d.t...@xs...> - 2009-05-15 18:30:40
|
Hi,

On Friday 15 May 2009 20:10:01 Heiko Schulz wrote:
> Hi,
>
> interesting to see that a small instruments can make such big troubles.
>
> What me wonders: Aircraft/Instruments-3d/mag-compass.xml isn't only used by
> the c172p than by other aircrafts as well. I think the pa24-250 is another
> user of this. Does this aircrafts also cause this NaN-errors?

I'm not sure about this, but my estimate is that the trouble doesn't arise when the mag-compass is part of the user aircraft, but only when it's part of the exterior world, i.e. when it is part of an AI aircraft. Also, it's possible that the instrument by itself is okay, but triggers an error in interaction with other scene elements.

Cheers,
Durk |
From: Heiko S. <aei...@ya...> - 2009-05-15 18:54:27
|
It seems my first reply, with the c172p.ac attached, didn't get through.

I removed some duplicate vertices on the Interior object, and I hope they were the cause of the trouble.

You will find the improved version here: http://gitorious.org/c172p

Command: git clone git://gitorious.org/c172p/mainline.git

Regards
HHS |
From: Torsten D. <To...@t3...> - 2009-05-15 19:08:13
Attachments:
mag-compass.ac.bz2
|
> I'm not sure about this, but my estimate is that the trouble doesnt arise
> when the mag-compass is part of the user aircraft, but only when it's part
> of the exterior world, i.e. when part of an AI aircraft. Also, it's
> possible that the instrument by itself may be okay, but triggers an error
> in interaction with other scene elements.

The model looks good at first glance. The Interior object has several single-sided, 4-vertex, non-planar surfaces, all facing inwards. That is perfectly legal.

Just in case any of these "anomalies" causes trouble, I have attached a modified version of the mag-compass.ac with all "Interior" surfaces two-sided and triangulated. Maybe you want to give it a try.

Torsten |
From: Durk T. <d.t...@xs...> - 2009-05-16 16:42:33
|
On Friday 15 May 2009 20:31:17 Durk Talsma wrote:
> I'm not sure about this, but my estimate is that the trouble doesnt arise
> when the mag-compass is part of the user aircraft, but only when it's part
> of the exterior world, i.e. when part of an AI aircraft. Also, it's
> possible that the instrument by itself may be okay, but triggers an error
> in interaction with other scene elements.
>

I've been checking a little more today, and it looks like there's more to this than meets the eye. In retrospect, the cessna / mag-compass models themselves are probably okay. The mag compass probably only showed up because it was the first object to be tested against preconditions that were already bad.

I still don't fully understand the finer details of the ground cache, but in essence, it works by trying to shoot a line through every triangle in the scene graph. Then it returns the distance of the closest intersecting point. To make the process efficient, the scene graph is traversed node by node, and triangles that are obviously not relevant are quickly discarded, quite similar to the way the culling algorithm works.

Since the scene graph is composed of models that may each have their own local coordinate system, the line that is shot through the scene graph has to be transformed accordingly, so for each level of the scene graph, a new transformation matrix is created, using a popMatrix function.

While trying to trap bad data in this popMatrix function, I just noticed that a bad transformation matrix is already set up relatively early in the process, only a few levels deep in the stack. I haven't been able to relate this to any meaningful object yet. (All that came up was the name "Scene").

So, it looks like a transformation error early on already blows up the intersect line vector(s). As the scene graph is traversed further down, OSG keeps happily multiplying the already corrupted data with valid transformation data, resulting in an intersect line composed of NaNs. This goes unnoticed until the error is finally picked up at the first place where there's a NaN check, that is, in TriangleIntersect.

I hope to continue this investigation later, and hope to be able to trace the bad data to their true source.

In the meantime, my apologies to all the model developers for prematurely raising a red flag. I still find it rather curious that the error only seems to occur when AI traffic is activated, which still seems to indicate a critical role for the c172p model. However, until further notice, it doesn't look like the problem is as straightforward as I thought yesterday.

Cheers,
Durk |
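The propagation described in this message can be sketched in a few lines. This is not OSG code: `Mat3`, `transform`, `applyAll`, and `firstBadLevel` are hypothetical helpers, and plain 3x3 matrices stand in for the real transformation stack. The sketch shows that a single NaN entry in one matrix contaminates every component after the next multiplication (because 0 * NaN is NaN), and that checking after each level points at the level where the data first went bad, rather than at the place where it is finally noticed.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec3 = std::array<double, 3>;
using Mat3 = std::array<Vec3, 3>;

Mat3 identity() {
    Mat3 m{};                                   // zero-initialised
    for (int i = 0; i < 3; ++i) m[i][i] = 1.0;
    return m;
}

// Plain matrix-vector product, no NaN checks (like an unchecked traversal).
Vec3 transform(const Mat3& m, const Vec3& v) {
    Vec3 out{};
    for (int r = 0; r < 3; ++r)
        for (int c = 0; c < 3; ++c)
            out[r] += m[r][c] * v[c];
    return out;
}

bool allFinite(const Vec3& v) {
    return std::isfinite(v[0]) && std::isfinite(v[1]) && std::isfinite(v[2]);
}

// Applies every level without checking; returns the final vector.
Vec3 applyAll(const std::vector<Mat3>& chain, Vec3 v) {
    for (const Mat3& m : chain) v = transform(m, v);
    return v;
}

// Checks after each level; returns the index of the first level that
// produces a non-finite vector, or -1 if the whole chain is clean.
int firstBadLevel(const std::vector<Mat3>& chain, Vec3 v) {
    for (std::size_t i = 0; i < chain.size(); ++i) {
        v = transform(chain[i], v);
        if (!allFinite(v)) return static_cast<int>(i);
    }
    return -1;
}
```

For a chain of identity, a matrix with one NaN entry, and identity again, `applyAll` returns a vector whose three components are all NaN, while `firstBadLevel` reports level 1, the matrix that introduced the problem.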
From: Tim M. <ti...@re...> - 2009-05-16 17:59:26
|
Durk Talsma wrote:
> I hope to continue this investigation later, and hope to be able to
> traverse the bad data to their true source.

It may be helpful to dump the scene graph to a file (from the debug menu) once you're getting the NaN error. Hopefully the offending matrix will be printed with NaNs instead of valid coordinates.

Tim |
From: Martin S. <Mar...@mg...> - 2009-05-15 19:08:51
|
Curtis Olson wrote: > I did a long haul flight recently from Boston to NY to MSP in the alphajet > (current CVS version) and not far out of Boston started dropping tiles right > and left ... it was a mess. I got to the point where I had no visible tiles > loaded, just flying over empty space as far as the eye could see. Mmmmh, this sounds like fetching tiles via TerraSync over a slow network connection. Are you certain that you've been loading the tiles from a local filesystem ? Cheers, Martin. -- Unix _IS_ user friendly - it's just selective about who its friends are ! -------------------------------------------------------------------------- |
From: dave p. <ski...@mi...> - 2009-05-16 15:19:10
|
Heiko Schulz wrote: > > I think the pa24-250 is another user of this. Does this aircrafts > also cause this NaN-errors? > Hi Heiko, The pa24-250 uses Aircraft/Instruments-3d/comp/comp.xml. Regards, Dave P. |
From: Heiko S. <aei...@ya...> - 2009-05-16 16:00:22
|
Hi,

o.k. I couldn't remember and didn't take a look.

still in work: http://www.hoerbird.net/galerie.html
But already done: http://www.hoerbird.net/reisen.html

----- Original Message ----
From: dave perry <ski...@mi...>
To: FlightGear developers discussions <fli...@li...>
Sent: Saturday, 16 May 2009, 17:10:23
Subject: Re: [Flightgear-devel] Progress report on the infamous "error in TriangleIntersect" NAN Problem

Heiko Schulz wrote:
> > I think the pa24-250 is another user of this. Does this aircrafts
> also cause this NaN-errors?
>
Hi Heiko,

The pa24-250 uses Aircraft/Instruments-3d/comp/comp.xml.

Regards,
Dave P. |
From: Tim M. <ti...@re...> - 2009-05-20 08:06:22
|
Tim Moore wrote:
> It may be helpful to dump the scene graph to a file (from the debug menu)
> once you're getting the NaN error. Hopefully the offending matrix will
> be printed with NaNs instead of valid coordinates.
>
> Tim

I've added an --enable-fpe argument which, on Linux, will cause an abort or core dump on a division-by-zero or other invalid floating point operation, including generating NaNs and overflowing float-to-integer conversions. See if you can get to the source of the NaNs using that.

Tim |
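As background to the --enable-fpe idea, the sketch below uses only the standard `<cfenv>` status flags to show which operations raise the "invalid" and "divide by zero" floating-point exceptions. It is not FlightGear's implementation, and the function names are hypothetical. On Linux, glibc additionally offers the non-standard `feenableexcept()` to turn such flags into SIGFPE traps; whether --enable-fpe is built on that call is an assumption here, not something stated in this thread.

```cpp
#include <cfenv>
#include <cmath>

// Returns true if computing sqrt(x) raises the IEEE "invalid" exception flag.
// volatile keeps the compiler from folding the operation away at compile time.
bool sqrtRaisesInvalid(double x) {
    volatile double input = x;
    std::feclearexcept(FE_ALL_EXCEPT);
    volatile double result = std::sqrt(input);
    (void)result;
    return std::fetestexcept(FE_INVALID) != 0;
}

// Returns true if num / den raises the "divide by zero" exception flag.
bool divisionRaisesDivByZero(double num, double den) {
    volatile double n = num;
    volatile double d = den;
    std::feclearexcept(FE_ALL_EXCEPT);
    volatile double result = n / d;
    (void)result;
    return std::fetestexcept(FE_DIVBYZERO) != 0;
}
```

Polling flags like this only tells you that something went wrong since the last clear. Trapping, as Tim describes, stops the program at the offending instruction, which is what makes a backtrace from the true source of the NaN possible.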
From: George P. <geo...@gm...> - 2009-05-20 14:55:47
Attachments:
fpe_LNCM.txt
|
On Wed, May 20, 2009 at 6:06 PM, Tim Moore <ti...@re...> wrote:
> I've added an --enable-fpe argument which, on Linux, will cause an abort or
> core dump on a division-by-zero or other invalid floating point operation,
> including generating NaNs and overflowing float-to-integer conversions. See
> if you can get to the source of the NaNs using that.
>
> Tim
>

Hi Tim and All,

As per the conversation on IRC, I have been able to get a backtrace when using --enable-fpe.

FG was not paused by me; the error occurred very early on (no sound and no image showing on the splash screen). The machine has a dual-core Intel processor with an NVIDIA 8600GT video card.

I did add a debug line to the file src/Instrumentation/inst_vertical_speed_indicator.cxx on line 207.

flightgear$ grep -n DEBUG src/Instrumentation/* |grep GP
src/Instrumentation/inst_vertical_speed_indicator.cxx:207: printf("DEBUG GP: SeaIngHG: %fL InternalSeaInHG: %fL DT: %fL\n", sea_inhg, _internal_sea_inhg, dt);

Either dt is zero or I have the parameter in the printf line wrong.

Please find attached a full backtrace from GDB. Let me know if you could do with more information.

Regards
George |
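On the question of whether the printf parameters are wrong: in a format string, `%fL` is read as the conversion `%f` followed by a literal letter `L` (the long double conversion would be `%Lf`). The small sketch below, with a hypothetical helper `formatDouble` and the assumption that the variables are plain doubles, shows that the trailing `L` is harmless, and also that `%f` prints a very small but non-zero value as 0.000000, so a dt that looks like zero in the output is not necessarily zero.

```cpp
#include <cstdio>
#include <string>

// Formats a single double using a printf-style format string that
// contains exactly one floating-point conversion.
std::string formatDouble(const char* fmt, double value) {
    char buf[64];
    std::snprintf(buf, sizeof buf, fmt, value);
    return std::string(buf);
}
```

For example, `formatDouble("%fL", 0.5)` gives `0.500000L`, while `formatDouble("%f", 1e-7)` gives `0.000000` and `formatDouble("%g", 1e-7)` gives `1e-07`. Using `%g` or `%e` in a debug line avoids mistaking a tiny time step for zero.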
From: George P. <geo...@gm...> - 2009-05-20 14:59:37
|
On Thu, May 21, 2009 at 12:55 AM, George Patterson <geo...@gm...> wrote:
> FG was not paused by me with the error occuring very early on (no
> sound not image showing in the spash screen). Machine is a Dual Core
> Intel processor with a nvidia 8600GT video card.
>

Oops... The splash screen gets up to the stage of loading scenery objects. Sorry for any confusion caused.

Regards
George |
From: Durk T. <d.t...@xs...> - 2009-05-26 20:37:57
|
Hi Tim,

On Wednesday 20 May 2009 10:06:18 Tim Moore wrote:
> > It may be helpful to dump the scene graph to a file (from the debug menu)
> > once you're getting the NaN error. Hopefully the offending matrix will
> > be printed with NaNs instead of valid coordinates.
> >
> > Tim
>
> I've added an --enable-fpe argument which, on Linux, will cause an abort or
> core dump on a division-by-zero or other invalid floating point operation,
> including generating NaNs and overflowing float-to-integer conversions. See
> if you can get to the source of the NaNs using that.
>

No breakthroughs yet, but just a quick progress report to keep the thread alive. :-)

Thanks for your suggestions. I've been trying to track this down, but don't have anything firm yet. My current working hypothesis is that a stack corruption may be feeding bad data into the "prepare ground cache" function. Tracing the problem further up the stack, I got to a point that suggests this. I'll post some more specific results later, because the core dumps are on a different machine that I don't have access to right now. That being the case, there's probably no bad data in the scene graph itself. I don't fully understand the results from the stack trace yet.

As for the --enable-fpe argument, this is probably going to be a very useful debugging tool, but enabling it resulted in a segfault inside the GUI when I wanted to click the menu to enable the autopilot...

Cheers,
Durk |
From: Durk T. <d.t...@xs...> - 2009-06-14 08:48:15
|
Folks,

Here's a rather long overdue follow-up to my own previous mail.

On Tuesday 26 May 2009 22:37:45 I wrote:
> Thanks for your suggestions. I've been trying to track this down, but don't
> have anything firm yet. My current working hypothesis is that a stack
> corruption may be feeding bad data into the "prepare ground cache"
> function. As I've been tracing the problem further up the stack, I got to
> the point that suggests this. I'll post some more specific results later,
> because the core dumps are on a different machine that I don't have access
> to. That being the case, there's probably no bad date in the scene graph
> itself. I currently don't fully understand the results form the stacktrace
> yet.
>

Below is a gdb stack trace, taken after placing error trapping code further upstream. My original thought, looking at this stack trace, was that a stack corruption occurs somewhere between stack frames #8 and #7, but upon closer inspection, I'm not so sure anymore: at stack frame #8, _simTime has a normal value, whereas at stack frame #7, startSimTime is listed as 0 (when it should have been the same value as _simTime in frame #8). However, at stack frame #6, startSimTime is again the same as _simTime in frame #8, so this variable is passed correctly, but just printed incorrectly in frame #7. On previous runs, I've seen these values being printed as NaN though, so I'm not quite comfortable as to what's going on here.

I have a core file for this particular crash, so any suggestions would be welcome.

Cheers,
Durk

(gdb) bt
#0  osg::PositionAttitudeTransform::computeLocalToWorldMatrix (this=0x9d1d3b0, matrix=<value optimized out>) at /home/durk/src/OpenSceneGraph/src/osg/PositionAttitudeTransform.cpp:63
#1  0x00007fbf16244187 in osg::Transform::computeBound (this=0x9d1d3b0) at /home/durk/src/OpenSceneGraph/src/osg/Transform.cpp:164
#2  0x00007fbf162207a1 in osg::Switch::computeBound (this=0x9d1d290) at /home/durk/src/OpenSceneGraph/include/osg/Node:334
#3  0x00007fbf1618df5f in osg::Group::computeBound (this=0x832d590) at /home/durk/src/OpenSceneGraph/include/osg/Node:334
#4  0x00000000004fcb90 in FGGroundCache::CacheFill::apply (this=0x7fff21e74330, group=@0x832d590) at /usr/local/include/osg/Node:334
#5  0x00007fbf1618f393 in osg::Group::accept (this=0x832d590, nv=@0x7fff21e74330) at /home/durk/src/OpenSceneGraph/include/osg/Group:38
#6  0x00000000004f504d in FGGroundCache::prepare_ground_cache (this=0x9de29d8, startSimTime=238.20000000000687, endSimTime=238.21666666667355, pt=@0x7fff21e74730, rad=12.576125144958496) at groundcache.cxx:355
#7  0x00000000004ee91c in FGInterface::prepare_ground_cache_m (this=<value optimized out>, startSimTime=0, endSimTime=-0, pt=<value optimized out>, rad=-0) at flight.cxx:612
#8  0x0000000000675aa5 in YASim::update (this=0x9de2650, dt=0.016666666666666666) at YASim.cxx:213
#9  0x0000000000427255 in fgUpdateTimeDepCalcs () at main.cxx:157
#10 0x00000000004294af in fgMainLoop () at main.cxx:447
#11 0x00000000004714bf in fgOSMainLoop () at fg_os_osgviewer.cxx:177
#12 0x0000000000428bb5 in fgMainInit (argc=4, argv=0x7fff21e74d68) at main.cxx:1004
#13 0x0000000000426a95 in main (argc=4, argv=0x7fff21e74d68) at bootstrap.cxx:216
(gdb) frame 8
#8  0x0000000000675aa5 in YASim::update (this=0x9de2650, dt=0.016666666666666666) at YASim.cxx:213
213     prepare_ground_cache_m( _simTime, _simTime + dt, xyz, vr );
(gdb) p _simTime
$6 = 238.20000000000687
(gdb) p dt
$7 = 0.016666666666666666
(gdb) frame 7
#7  0x00000000004ee91c in FGInterface::prepare_ground_cache_m (this=<value optimized out>, startSimTime=0, endSimTime=-0, pt=<value optimized out>, rad=-0) at flight.cxx:612
612     SGVec3d(pt), rad);
(gdb) |
From: Mathias F. <Mat...@gm...> - 2009-06-18 17:21:47
|
Hi,

On Sunday 14 June 2009 10:48:03 Durk Talsma wrote:
> /home/durk/src/OpenSceneGraph/src/osg/PositionAttitudeTransform.cpp:63
> #1 0x00007fbf16244187 in osg::Transform::computeBound (this=0x9d1d3b0) at

Well, the PositionAttitudeTransform is now used for everything that has an SGModelPlacement. Maybe you need to look into the AI models' positions and orientations to find that?

I am not completely sure what you mean by those variables being different in different stack frames. But keep in mind that, due to code optimizations, gdb may print nonsense in some stack frames. That depends on many conditions.

Did you try to increase osg's verbosity level? Does it print something for the node paths so that you can see the models? Keep in mind that this only works for osg's *trunk*.

Also, you might put some code into SGModelPlacement at the point where it writes its values into the PositionAttitudeTransform.

Greetings
Mathias |
From: Torsten D. <To...@t3...> - 2009-08-21 12:23:31
|
I have _probably_ found at least one reason for this bug.

I was able to consistently trigger an FPE when running fgfs with --enable-fpe and /sim/traffic-manager/enabled=true

I was able to locate the offending code in FGAISchedule::update, where the new position of some AI aircraft is calculated by multiplying the start position with a rotation matrix. When computing the geodetic position from cartesian coordinates in

current = SGGeod::fromCart(newPos);

it happened that within SGGeodesy::SGCartToGeod() the value for 's' was _very_ close to zero and slightly negative, causing sqrt(s*(2+s)) to fail, since it is only defined for s greater than or equal to zero, or less than or equal to -2.

The workaround clamps 's' to non-negative values. This is probably mathematically incorrect, but should keep us running. Maybe someone who fully understands the math in this method can explain whether 's' can ever legally go negative or whether this is a rounding error.

Greetings,
Torsten |
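The failure mode and the workaround described in this message can be reproduced in isolation. The sketch below is not the SimGear source: `unclampedTerm` and `clampedTerm` are hypothetical names, and only the expression sqrt(s*(2+s)) is taken from the message. The product s*(2+s) is negative for -2 < s < 0, so a rounding error that leaves 's' slightly below zero yields a NaN (and an "invalid" floating-point exception when trapping is enabled).

```cpp
#include <algorithm>
#include <cmath>

// The expression as described in the message: NaN for -2 < s < 0.
double unclampedTerm(double s) {
    return std::sqrt(s * (2.0 + s));
}

// Workaround sketch: treat a slightly negative 's' as zero before
// taking the square root. Assumes negative values are rounding noise.
double clampedTerm(double s) {
    s = std::max(s, 0.0);
    return std::sqrt(s * (2.0 + s));
}
```

With s = -1e-17, `unclampedTerm` returns NaN while `clampedTerm` returns 0. The clamp hides the symptom; whether 's' can legitimately be negative is exactly the open question raised above, so this should be read as a stopgap rather than a verified fix.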