#1021 SysFileTree usage Access Violation

Milestone: 4.1.3
Status: closed
Owner: Mark Miesfeld
Labels: none
Priority: 7
Updated: 2013-07-09
Created: 2011-09-07
Creator: Jerry Senowitz
Private: No

Under Windows XP Professional SP1 (32-bit) with ooRexx 4.1.0, when I try to use the results of SysFileTree on my C: drive (64k+ entries, average entry length 102 bytes), I randomly get either an access violation or "date" function argument errors. Sometimes the .rex program even works as expected. Both the failures and the successful runs are unpredictable. Frustrating! I tried to put in stops such as Trace ?r or exit after certain points, but I have never been able to catch what is causing the problem.

The function is rc = SysFileTree("C:*","f","BLS","*")
After SysFileTree is executed, either the RC is 0 or the .rex program exits, displaying a message and the return code.
The .rex program goes on to examine the results of SysFileTree and select entries based on search criteria the user supplies on the command line. At random points in the search loop I get either of the above errors. I have included a zip file containing two Dr Watson logs and a sample of the random "date" function error.

Regressed ooRexx to 4.0.0 and still had the above problems.

Regressed ooRexx to 3.2.0 and no longer had any problems with ooRexx. Finished writing the code under 3.2.0 and it works as expected. After I finished writing the code I restored my Windows XP system to ooRexx 4.1.0, and the problems are recurring.

I need to move to a new computer and operating system (Windows 7 Home Premium 64 bit) and wanted to use the latest ooRexx 64 bit programs. I am writing .rex programs to replace 16 bit programs that no longer function on Windows 7. I've developed replacement programs on Windows XP and hope to port them to Windows 7. But, now I am concerned about ooRexx reliability on Windows 7.

Discussion

  • Mark Miesfeld
    2011-09-07

    I'm out of town with very limited Internet access. I'll take a look at this when I get back in about a week. You'll have to be patient until then. ;-)

     
  • Mark Miesfeld
    2011-09-14

    I've looked at this and don't see enough to tell what the problem is.

    Please attach a short working program that produces the problem and indicate about how often you have to run it to see the error pop up.

     
  • Jerry Senowitz
    2011-09-15

    Thanks for taking a look. Won't be able to provide a sample until after the 29th when I get back to my development computers.

    Did try the app causing the problem on my 64 bit system. Day 1 - No problems. Day 2 - No problems. But these 2 days were limited tests. Day 3 - When I really tried to use my app to do searches of the 64 bit system, I ran into both symptoms. I have to assume that it is not the hardware or the operating system. Furthermore, the symptoms suggest to me that a pointer is either uninitialized or gets clobbered during the processing of the stem variables (64k+) in the loop that examines the contents of each stem variable. I don't think that SysFileTree is the problem, unless it is using memory that it shouldn't, or it ran out of memory and didn't correctly report that back to the app but set the .0 to a count that far exceeded what had been built. But this assumption doesn't explain why it works sometimes.

     
  • Jerry Senowitz
    2011-10-03

    Created 2 .rex programs to demonstrate the symptoms I have been having. Test.rex is a set of code that represents how my real program uses SysFileTree and then tries to convert the date portion of the stem variable f.i. Both symptoms appear in the console logs and Dr Watson logs. The filenames containing the name 'test' are from test.rex.

    Once I realized that the stem variable was not being assigned a value, I wrote a much simpler version of test.rex -> test3.rex. It was able to produce the symptom of the undefined value for f.24317 and also f.34907. Did not try long enough to create the 'access violation'.

    From the information I've collected so far, it would appear that SysFileTree is the problem. It appears, at times, to have trouble building stem variables when using the "BLS" options for a full drive ("C:*").

    Now about frequency. I ran the real program multiple times over a 3 day period and saw no evidence of a problem. Ran it on 2 different computers and operating systems.

    I then set about writing a much simpler version of the program to demonstrate the problem and, lo and behold, the symptoms started to reappear. In the last 24 hours (minus sleep time) I have had 7 occurrences of the 'not been assigned a value' and 1 of the 'access violation'. I have run these 2 programs ~ 40 times. All but 8 have completed successfully. I can't predict when it will occur, but it does appear to occur only during the first iteration of the 10 iteration do loop, and usually on the first invocation in a DOS Virtual Machine. Have tried Command.com, Cmd.exe and an environment similar to 'Try Rexx' which I wrote years ago when Rexx was first introduced on PC DOS (either V5 or V7, don't remember which).

    See attached 3405740.zip for programs and results.

    Hope this helps! The symptoms are annoying due to their lack of predictability.

     
  • Mark Miesfeld
    2011-10-03

    Thanks Jerry for the files and the info. I'll investigate this further.

    Just as a FYI, sometimes the Dr Watson files are useful, usually they are not. With the original Dr Watson files you attached, there was no way to correlate the information in the file with anything in ooRexx.

     
  • Jerry Senowitz
    2011-10-03

    I agree with you Mark about Dr Watson. I usually throw them away. They are usually not worth the time to look at.

    In the investigation of these symptoms, you may want to try the scenario on a UNIX/AIX machine. One thing I learned years ago about Windows is that it doesn't do a good job of protecting one application from another, or from itself. I have developed and tested programs on Windows that ran fine. When ported to a UNIX/AIX environment, all kinds of storage violations occurred because UNIX/AIX does a much better job of storage management/protection than Windows. Not sure about LINUX.

    If my assumptions are correct, then the stem variables that are messed up/missing are someplace, just not sure where. ooRexx isn't either :). In the old PC DOS version of Rexx there was a REXXDUMP.EXE that one could use to dump the variable pool, I don't think ooRexx has one nowadays unless it is in the C++ developer section that I don't normally install. Would be nice to see what is in the variable pool.

     
  • Jerry Senowitz
    2011-10-04

    I discovered the SysDumpVariables function and inserted it into the test3.rex program. When the unassigned variable message pops up, it now dumps the variable pool and exits.

    As I suspected, all stem variables are in the pool (yesterday, I just wasn't sure where they were). What is interesting when you look at the variable_pool.txt file is that the message from the console says the unassigned variable is 'F.24609', but I see it in the pool and it appears complete. Even more interesting, F.24610 is not there. In its place is 'F.'. Curious that the unassigned variable is F.24609 and not F.24610?

    I took the test3.txt generated by the test3.rex program and chopped it down to show only the stem variables for the one folder where the problem occurred. I next sorted the variables to produce the variable_pool.txt file. I also made sure that the stem variables for the folder where 'F.' occurred matched the actual number of files in that folder.

    Everything is packaged up in 3405740-1.zip

    Hope this helps.

    BTW, after I uploaded the previous .zip file yesterday, I ran the real program and got several 'access violation' and 'unassigned variable' messages after having previously run this program 3 days without incident. Go figure!!!

     
  • Jerry Senowitz
    2011-10-17

    Disgusted with the reliability of SysFileTree (now approaching a 50% failure rate), I regressed ooRexx back to Version 3.2.0 again. Since first encountering this problem I have written a number of testn.rex apps to see if I could predict its reliability, but to no avail. Well, I have now run these little apps on version 3.2.0 and the problem DOES exist at this level as well. This is NOT a problem introduced in 4.x. It has existed for several ooRexx releases.

    I've tried to emulate (using SysStemInsert) the massive amount of variable length data dumped into stem variables that is produced by SysFileTree w/ the "S" option, but have not found a problem. The only difference between SysFileTree and my emulations is that after a stem variable is generated under this emulation, control is returned to my app for the next instruction in the DO loop. SysFileTree continues building stem variables until it completes its recursion and then returns.

    When I first submitted this problem report I thought I was dealing with an access violation problem and since it only happened on 4.x I thought I was safe at 3.2. While I have not seen the access violation at 3.2, I have now seen several iterations of missing variables, something I didn't initially check when I submitted the problem report. Because of the problem, I have added code to test each stem variable using var().

    While building variables, the SysFileTree process manages to step on pointers in the variable pool, or doesn't set them up properly, which causes missing variables and, on occasion, storage access violations. Access violations can occur during the stem variable build or somewhere after SysFileTree returns to my app. My app now checks, using var(), whether or not each stem variable has been defined. If one is undefined, it loops thru the rest of the stem variables, listing those that var() thinks are undefined. It then dumps the variable pool using SysDumpVariables so that I can inspect the data. When the problem occurs, I have seen anywhere from 1 to 16 stem variables that var() thinks are undefined, and while some are consecutive numbers, they aren't always. Also, I have seen access violations occur on the var() and SysDumpVariables functions.

    My only problem is that I can't predict when this is going to occur other than it happens in a window of 64107 to 64136 stem variables. I've seen missing variables as low as number 1301 and as high as 27000+. With access violations, I usually don't see anything that tells me which variable caused the problem (bad pointer?). If I add a line of code to the app, the dynamics are such that I probably won't get the error. That is frustrating! I did try to get a feel for the size of the variable pool once by summing the lengths of the stem variables to the point of failure, but everything ran to normal completion. Took the code out and the failure reoccurred. I can run the test app 1 time and it fails, then run it again and it works fine under the same cmd.exe invocation. Have seen it run fine using cmd.exe but fail when invoking the test using rexx.exe, and vice versa.

    Without knowing the dynamics of variable pool assignment and how it expands when variables are added, I can offer little info on what to focus on other than it occurs during a recursion of subdirectories.

    I have been a system and application programmer most of my adult life and have used various incarnations of ooRexx's parents going back to the 1980's, mostly at the classic rexx level. I was happy when IBM decided to introduce it on the PC with PC DOS. While I do not consider myself a REXX expert, I do have a lot of experience with the language and what it can do. As I stated in previous comments, I am trying to replace several old 16 bit apps that no longer work in 64 bit Windows and there are no known replacements for these apps. 2 of these apps require the use of SysFileTree with the ability to recurse thru the directory tree structure of hard drives.

    Realizing that this problem can be difficult to reproduce, I would be willing to provide assistance in problem determination, if that would be helpful. I am adding my latest incarnation of a test that produces the problem plus a console log and select variable pool stem variables - 3405740-2.zip.

     
  • Bruce
    2011-10-17

    I'm seeing pretty much the same thing on ooRexx 4.2.0/Mac OS X/ 64bit.

    I get one of three things:

     5 *-* RC =  SysFileTree("/Users/bjskelly/*", "f", "S");
    

    REX0040E: Error 40 running /Users/bjskelly/sft.rex line 5: Incorrect call to routine
    REX0372E: Error 40.1: External routine "SYSFILETREE" failed

    or
    Abort

    or
    Bus Error

    The first case occurs when there are massive amounts of files to process, in my case 600k files and subdirectories.

    Abort happens when I give a more qualified filespec, so the data being added is small, but the number of directories to process is large, 90k.

     
  • Mark Miesfeld
    2011-12-28

    I can not reproduce this problem. I have tried your test programs and run them hundreds of times with greater than 100,000 files on my systems. No luck.

    Bruce, if you can reproduce this on your Mac, maybe you could try to debug it.

     
  • Jerry Senowitz
    2012-02-08

    Status Update:

    1) Test4.rex has been the most effective test that demonstrates this problem whether it is using command.com, cmd.exe, rexx.exe or "Try Rexx". It doesn't always occur in each of these environments.

    2) Regressed ooRexx back to 3.1.1 (the newest release that does not show the problem) and have been running successfully on Windows XP since 11-16-11. Unfortunately 3.1.1 has a bug (1630937) that isn't fixed until 3.1.2. The bug prevents 3.1.1 from running on Windows 7 and Vista (x64). 3.1.2 and newer ooRexx releases have all demonstrated the problem of this bug report (3405740).

    3) Walked thru the SysFileTree code in RexxUtil in 3.1.1 and 4.1.0. The only real difference is that there is a structure definition change in the SysFileTree section between the 2 releases.

    4) Upgraded my Windows XP (x86) to SP3. Made no difference.

    5) Installed Visual C redistributables thru 2010. Made no difference.

    6) Thinking that maybe the reason you, Mark, can't reproduce the problem is that you have Visual Studio installed, I installed Visual Studio 2005. The problem no longer appeared. Interesting!

    7) Decided to get rid of the backups from the SP3 update (3087 objects). Running Test4.rex, the problem reappears with an "Access violation" and Visual Studio Debugger gets control. Because I have not compiled 4.1.0, I don't have the necessary data to let the debugger do its thing. (Don't want to do a build of ooRexx either.) I choose to leave this level of problem determination to the experts.

    8) Based on number 7), I have concluded that it is NOT the number of stem variables that causes the problem, but a variable pool management problem. It is the amount of data that causes the problem (each stem variable is of variable length), not the number of variables.

    Here is my theory:

    Variable Pool Manager (call it VPM) allocates memory for the variable pool.

    As VPM runs out of memory, it allocates (malloc?) more memory and puts the next stem variable in the new block of memory. But VPM keeps track of the unused memory from the previous malloc(s) so that if a variable fits in that unused memory it will be used. This would explain why stem variables, after a while, are no longer consecutively numbered when dumped using SysDumpVariables. In hindsight I should not have sorted the variable pool variables prior to submitting additional doc on the problem.

    I think that there are situations where VPM thinks data will fit in the unused part of memory but does not take into account the stem variable name, a pointer adjustment, or both. I further believe that this happens when the data is equal to the amount of unused memory or within 6 bytes of it. This would explain the situation in "CONSOLES.TXT" (in 3405740-2.zip) where another stem variable was overlaid. It would also explain why it occurs so randomly, and why adding additional code to a .rex for problem determination changes the dynamics of its occurrence. It also explains why, when it does occur, it will reoccur the same way in the same instance of a DOS VDM.

    The question I have is whether the VPM does memory defragmentation (sometimes called "data/garbage collection") along the way and whether that impacts the problem?

    Without me having to dig thru all the source code, which module actually performs the VDM function? I thought I could take a peek at it in my spare time to see if anything obvious jumps out at me.

     
  • Bruce
    2012-02-08

    This line causes the rexx program to abort:

    RC = SysFileTree("/Users/bjskelly/b*", "F", "DS");

    Mac OS 10.6.8 / ooRexx 4.1.1 Intel 64 bit.

    See Crash Reporter File rexx_2012-02-08-111628_BookWormMac attached below.

     
  • Jerry Senowitz
    2012-02-08

    In my 2012-02-07 17:00:00 PST comment, last paragraph, I asked which module performed the "VDM" function. This was a typo. I meant "VPM". Sorry for any confusion.

     
  • Mark Miesfeld
    2012-02-09

    Hi Bruce,

    Looking at the crash log and searching on release_file_streams_for_task, I see a number of pages that seem to hint that this could be an Apple bug rather than an ooRexx one. Here is just one, where the person says that Apple admitted they have a bug in 10.6.8, which is what it looks like you are using.

    http://pleasantsoftware.com/blog/

     
  • Mark Miesfeld
    2012-02-09

    We already had this bug opened:

    3149277 SysFileTree causes Segmentation fault

    which I'm going to close as a duplicate of this one (even though it was opened up first, this one currently has more information in it.)

    The total comment of 3149277 so far is:

    I was running ooRexx 4.0.1 and found a seg fault within SysFileTree. I updated to ooRexx-4.1.0-ubuntu1004.i386.deb and the error still persists.

    Ubuntu Server 10.04 x86

    This code never completes, rather seg faults:
    /* Next scan for FILE's... */
    say 'Next scan for FILEs...'
    rc=SysFileTree('*', f., 'FOS')

    I checked the subtree in question and it appears the SysFileTree should return:
    /srv/shares$ find data -name "*" -print | wc -l
    217331

    that many files. I recalled SysFileTree not being able to get through the data share a LONG time ago, before I did massive cleanup. Perhaps then the file count was around 500,000. Puzzling that years later with far less files it still can not make it through.

    I was running the rexx as root via sudo, so that can not be the problem.

    I guess I will take the data share a subtree at a time to get through the files fixing perms. Please let me know how I may help sort this out. Thank you!

    opened by: mdlueck

     
  • Jerryo37
    2012-05-06

    Where can I download ooRexx 3.1.1 for Windows XP? Thank you.

     
  • Mark Miesfeld
    2012-07-24

    Jerry,

    I have coded a second implementation of SysFileTree that eliminates some possible causes of access violations. In my testing it produces the same results as the original implementation.

    Since I can't produce the crash to begin with, I have no way to tell if this actually fixes things.

    I have a debug version of the Windows installation package built that contains both a SysFileTree() and a SysFileTreeB(), with SysFileTreeB() being the new implementation.

    If you could test this build we can see if you still get access violations with SysFileTreeB(). E-mail me to get instructions on how to get the debug package.

    Thanks.

     
  • Mark Miesfeld
    2012-07-24

    I'm going to use this bug report to keep track of some information.

    Both the Windows and the Unix implementations use code of the form:

    sprintf(ldp->Temp, "%s%c%c%c%c%c%c%c%c%c%c  %s", ldp->Temp,
      tp, ...
    

    This is definitely wrong as the result is undefined in both the Windows compiler and versions of gcc. Quote:

    "Some programs imprudently rely on code such as the following

    sprintf(buf, "%s some further text", buf);

    to append text to buf. However, the standards explicitly note that the results are undefined if source and destination buffers overlap when calling sprintf(), snprintf(), vsprintf(), and vsnprintf(). Depending on the version of gcc(1) used, and the compiler options employed, calls such as the above will not produce the expected results."

    I think this is likely to be the cause of the weird Debian bug we saw with the output of SysFileTree.

    I fixed that in the rewrite of the Windows SysFileTree; I guess I need to do the same in the Unix version.
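
    A minimal sketch of one way to append to a buffer without the overlap, not the code that was committed to ooRexx: the hypothetical helper append_format() writes at the current end of the string, so the destination buffer never appears among its own arguments. It assumes a C99-conforming vsnprintf() and that none of the arguments point into the buffer.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not from the ooRexx source): append formatted text
 * to the NUL-terminated string already held in buf, without passing buf to
 * itself as a "%s" argument. Returns 0 on success, or -1 if the result
 * would not fit or formatting fails; on failure the original string is kept.
 */
static int append_format(char *buf, size_t size, const char *fmt, ...)
{
    assert(buf != NULL && fmt != NULL);

    size_t used = strlen(buf);
    if (used >= size) {
        return -1;              /* buf does not hold a valid string of this size */
    }

    va_list args;
    va_start(args, fmt);
    /* Write at the current end of the string: source and destination never overlap. */
    int n = vsnprintf(buf + used, size - used, fmt, args);
    va_end(args);

    if (n < 0 || (size_t)n >= size - used) {
        buf[used] = '\0';       /* undo a truncated or failed append */
        return -1;
    }
    return 0;
}
```

    With a helper like this, a call such as sprintf(buf, "%s some further text", buf) would become append_format(buf, sizeof buf, " some further text"), assuming buf is an array whose size is known at the call site.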

     
  • Mark Miesfeld
    2012-07-30

    Committed revision 8126 in trunk

    Testing looks good, but will do more testing before committing to the 4.1 fixes branch

     