
Unbound / Ping LDAP client - ldap-diff issue with hangs / error

2023-07-30
2023-08-08
  • Eric Tuttle

    Eric Tuttle - 2023-07-30

    We are currently working to sync a directory with some very large groups (roughly 2M members). The system was working well after we added the com.unboundid.ldap.sdk.LDAPConnectionOptions.defaultMaxMessageSizeBytes property and set it to 400 MB (a sketch of how such a property is typically supplied is at the end of this post). To improve performance we moved the work to a bigger host with significantly more processor/memory, and now the system consistently hangs at around 73 million compares and then fails with the following errors:

    Entries identified so far: 73328000.
    An error occurred while attempting to identify the set of entries to examine in the source server: The attempt to
    search for applicable entries failed with result SearchResult(resultCode=81 (server down), diagnosticMessage='An I/O
    error occurred while trying to read the response from the server: IOException(The end of the input stream was reached
    before the first length byte could be read.), ldapSDKVersion=6.0.6, revision=b8c6c463def55758ed8ec0d914c84268c944251c',
    entriesReturned=0, referencesReturned=0).
    and

    Entries identified so far: 73380000.
    An error occurred while attempting to identify the set of entries to examine in the source server: The attempt to search for applicable entries failed with
    result SearchResult(resultCode=84 (decoding error), diagnosticMessage='Unable to read or decode a search result entry: IOException(The end of the input
    stream was reached before the full value could be read.), ldapSDKVersion=6.0.6, revision=b8c6c463def55758ed8ec0d914c84268c944251c', entriesReturned=0,
    referencesReturned=0).

    respectively. I assume it might still be a group object issue, so I am running another pass with the max object size set to 3.9 GB. These passes take hours to complete - in advance of the result, is there anything else you might consider as the cause? Any way to turn on additional debug/output?

    Following the resolution of the issue, any thoughts on making the compare run faster? Java tuning/cleanup?

    Many thanks,

    Eric.
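
    For reference, a property like that is normally handed to the tool's JVM as a system property. A rough sketch, assuming the JAVA_ARGS mechanism that the standalone SDK tool scripts use (the byte value shown is simply 400 MB expressed in bytes):

    rem Sketch only: pass the default max message size to the JVM as a system property.
    set JAVA_ARGS="-Dcom.unboundid.ldap.sdk.LDAPConnectionOptions.defaultMaxMessageSizeBytes=419430400"

    rem For additional client-side debug output, the SDK also honors:
    rem   -Dcom.unboundid.ldap.sdk.debug.enabled=true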

     
  • Jarrett Peterson

    Entries identified so far: 73514000.
    An error occurred while attempting to identify the set of entries to examine in the source server: The attempt to search for applicable entries
    failed with result SearchResult(resultCode=84 (decoding error), diagnosticMessage='Invalid protocol op type 84 encountered in an LDAP message.',
    entriesReturned=0, referencesReturned=0).
    PS E:\unboundid-ldapsdk-6.0.6\unboundid-ldapsdk-6.0.6\tools>

     
  • Eric Tuttle

    Eric Tuttle - 2023-07-30

    Yup - Failed third time... thoughts on cause?

     
  • Neil Wilson

    Neil Wilson - 2023-07-31

    I don't believe that this specific error has anything to do with exceeding the maximum message size on the client, which would produce a more specific error message (although that limit may also have come into play before you raised it, so the property may still be needed). It does sound like the connection between the client and the server is being closed unexpectedly in the middle of processing, and that looks like it happened both times: the "server down" result code 81 received on the first attempt suggests that a previously valid connection now appears to be closed, and the "decoding error" result in the second case sounds like the connection was also closed while the client was in the middle of reading a message from the server.

    The first thing I would recommend would be checking the server access and error logs to see if it provides any indication as to why the connection might be getting closed. If it's a server-side issue, then that could help identify it.

    But given that you are dealing with some very large group entries and a large number of entries overall, I would definitely recommend tuning the amount of memory that the tool is allowed to consume. If memory pressure builds up in the JVM, the garbage collector will take more and more time to run, and during those collections the JVM is essentially paused so that no other Java code can run (which sounds like a real possibility, given that you also report that the tool seems to hang). If those pauses cause the server to accumulate such a large backlog of data to send to the client that its TCP transmit queue fills up, then attempts to send additional data will block, and that could trigger a timeout on the server that causes it to close the connection.

    By default, the tool relies on the underlying JVM to choose the amount of memory that it will try to use, but that's usually a fairly small value. You can specify the maximum amount of memory that the JVM is allowed to use with the "-Xmx" argument, and it's often a good idea to also set the initial amount of memory that the JVM will allocate to the same amount via the "-Xms" argument. For example, "-Xms2g -Xmx2g" sets both the initial and maximum JVM heap size to 2 gigabytes. However, those arguments need to go to the JVM itself and not to the tool, so you can't just provide them directly on the command line.

    Although the process for doing this is a little different if you're running the tool as part of the Ping Identity Directory Server, it looks like you're running this from a standalone LDAP SDK installation, so the way that you specify the arguments to pass to the JVM is by setting the JAVA_ARGS environment variable to the desired string. On Windows, that would look like:

    set JAVA_ARGS="-Xms2g -Xmx2g"

    If you have trouble setting the environment variable for some reason, you could also just edit the ldap-diff.bat batch file and put those arguments directly between the "%JAVA_ARGS%" and "-cp" arguments. In that case, you wouldn't enclose the -Xms and -Xmx arguments in quotes.
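
    The exact contents of ldap-diff.bat vary between SDK releases, but after that edit the invocation line would look roughly like the following; the variable names, classpath, and main class shown here are placeholders rather than the literal script contents, so adapt them to whatever is actually in your copy:

    rem Hypothetical sketch of the edited invocation line in ldap-diff.bat.
    "%JAVA_CMD%" %JAVA_ARGS% -Xms2g -Xmx2g -cp "%CLASSPATH%" {ldap-diff-main-class} %*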

     

    Last edit: Neil Wilson 2023-07-31
  • Jarrett Peterson

    We have the Java memory set at 40/40 right now. Not sure if we need some specific GC settings.

     
  • Jarrett Peterson

    Or is there some extra debugging we could possibly do?

     
  • Neil Wilson

    Neil Wilson - 2023-08-01

    I doubt that changing the garbage collector would make any substantial difference if you've got that much memory available to the process. It shouldn't need anywhere near that much. And I don't think that adding any extra debugging would help, either.

    The ultimate problem is that the server is closing the connection to the client. As I suggested, the best thing to do would be to look at the server logs to see if they provide any indication as to why the connection is being closed. You could also try running just a simple ldapsearch with the same settings to see whether its connection gets closed in the same way, as further confirmation that that's what the issue is.

    If the giant search is an issue, you could try again in ldapsearch with the --simplePageSize argument to cause it to use the simple paged results control (assuming that the server you're using supports it). If that works, then we could consider adding that option to ldap-diff.

    If the giant search works in ldapsearch, either with or without the simple paged results, then you could use it to create files with just the DNs of the entries in each server, and then use the --sourceDNsFile and --targetDNsFile arguments instead of having ldap-diff perform the giant search in each of the servers to find all of the entries (a rough sketch of this approach is at the end of this post).

    If the giant search still fails in the server, then that's something that you should take up with the vendor to see if there is a way to diagnose that problem.
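
    To make the DNs-file approach concrete, a rough sketch might look something like the following. The connection arguments are illustrative, so check each tool's --help output for the exact names and required options, and you may need to reformat the output into whatever DN-per-line format ldap-diff expects:

    rem Sketch only: dump just the DNs from the source server ("1.1" requests no attributes),
    rem optionally using simple paged results, then repeat against the target server.
    ldapsearch --hostname source.example.com --port 389 --bindDN "cn=admin" --bindPassword secret --baseDN "dc=example,dc=com" --simplePageSize 1000 "(objectClass=*)" 1.1 > source-dns.txt

    rem Feed both DN files to ldap-diff (along with its usual source/target connection
    rem arguments) instead of letting it run the giant searches itself.
    ldap-diff --sourceDNsFile source-dns.txt --targetDNsFile target-dns.txt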

     
  • Jarrett Peterson

    It seems like it must be something environmental, and we're not sure why. We have another server that ran at 450 threads and made it past the same point where the new server fails at 450 threads. We're not sure how to determine why; we have the networking team looking at it. The major difference they saw was that one server sent 5-10 GB of data during various packet capture periods, while the other never went over 1 GB.

     
  • Jarrett Peterson

    The server has been sitting in the compare at a single entry for hours, but it's at 70% CPU. Could our groups just be that big?

     
  • Neil Wilson

    Neil Wilson - 2023-08-03

    Does this mean that you were able to get past the initial phase of retrieving the DNs of all of the entries in each of the servers, or is it still in that phase? If it's still in that phase, then it should definitely not be consuming a lot of CPU time on either the client or the server because it's just doing a single search against each directory to retrieve the DNs (the entries without any attributes) of all entries in the server.

    It's unlikely that the process of comparing a single entry between two servers would require a substantial amount of processing time on the server. The ldap-diff tool just needs to retrieve the entry from each server, and then it compares them on the client side in a single thread. If you have lots of concurrent threads (and it sounds like earlier you were using 450 threads, which is probably way too high and is likely to produce worse performance than a much smaller number like 32 threads, although even that depends on the processing capabilities of both the client and the server), then each of them could be comparing entries in parallel, and that could cause higher CPU utilization on the server as it handles all of those requests.

    If ldap-diff looks like it's hung, then you could use jstack to see what it's currently doing. Doing this several times, with several seconds between each jstack invocation, would give us a few snapshots of the processing it's performing, so we can see whether it's stuck on something really expensive or is actually making progress.
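
    For example, something along these lines from another window on the client machine would capture a few snapshots (replace <pid> with the ldap-diff Java process ID, which you can find with jps or Task Manager):

    rem Take several thread dumps roughly ten seconds apart so they can be compared.
    jstack <pid> > jstack-1.txt
    timeout /t 10 /nobreak
    jstack <pid> > jstack-2.txt
    timeout /t 10 /nobreak
    jstack <pid> > jstack-3.txt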

     
  • Jarrett Peterson

    We did a run a day or so after our last successful run (at 150 threads), which took approximately 30 hours. This time, at 32 threads as recommended above, it's approximately 24 hours in and only 4% into the first pass after the read-in, so it seems significantly slower.

     
  • Neil Wilson

    Neil Wilson - 2023-08-08

    It sounds like you're at least able to get past the slowest, single-threaded first phase where it's getting the DNs of the entries. If a large number of threads is working faster for you, then that's the way to go.

    If it seems like the tool is getting stuck in that phase for a period of time, then jstack is still probably the best tool to use to figure out what it's trying to do. You could also look at the server logs to see if operations are taking a long time there in case it's a server-side issue.

     
