Doing system-wide profiling of a large Java application, a user intermittently encountered problems with opjitconv not processing all JIT dump files, and operf displays the message
JIT dump processing exited abnormally: 1
Passing "--verbose=debug" to operf allowed us to collect detailed information on the JIT conversion process. In the log, we see the following error:
record past end of file JIT convert error -1
I believe the "record past end of file" error is due to a timing problem between the Java process writing to its JIT dump file (via the oprofile Java agent library) and the operf JIT conversion process copying the same JIT dump file before begining its conversion process. There is no synchronization between the two processes to ensure that JIT dump records written by the Java process are complete and coherent. Thus, the opjitconv process may read a record header to get the record length and find that this length would go past the end of the file (which it had copied from the original).
Diff:
We've gotten to the root cause of the problem reported by this user. The JIT dump file was being corrupted by the JVM process writing stdout messages to it. This happened because of the manner in which the user's script for starting the JVM was actually closing stdout and stderr. Thus, when the JIT dump file was opened by the agent, it used file descriptor #1 (which is normally stdout.
In spite of the fact that the actual cause of this particular user's problem was of their own making, there is a lack of synchronization between the JIT agent library and opjitconv as I mentioned in my opening description. I will attach a patch that addresses that issue.
Daniel, please review the patch I just posted. Thanks.
I reviewed the patch.
I have seen that flock() can not set the lock for a file that is located in a NFS. In such a scenario the function fcntl() is suggested.
If that scenario is not significant I would suggest to document that limitation in the source code and/or in the docomentation of oprofile.
From what I understand, flock can safely be used by local NFS clients, no matter the kernel version. In a scenario where NFS is being used to back the oprofile_data session directory, we would have two local NFS clients using flock -- opjitconv and the JIT agent library. So I don't think there's a concern here.
I think you are right.
The restriction for NFS located files could be meant in a scenario where one process used flock() to set the lock on one network side and another process want to use flock() on the other network side (i.e. both processes are on different network locations).
In our case of Oprofile both processes are located on the same network side.
So I can give my sign-off to your patch. Thanks.
The patch has been pushed upstream. Thanks for reviewing, Daniel!