Dear all,
I am not sure how to file a bug report, so I am posting here in the hope that one of you will notice and address this issue. I do consider this a bug.
The fact that OpenCCG uses Java's user preference system prevents multiple instances of ccg-realize from being run in parallel. While it would be possible to write parallelized Java code instead, I feel that this is the wrong approach because of resource management. I am currently running several hundred instances of ccg-realize on a cluster at our lab. Writing parallel Java code would mean that I either cannot use all of our computing power or have to implement a separate system of resource management, which defeats the purpose of our grid engine. On top of that, having to deal with this kind of issue is simply an unnecessary time sink.
The problem that arises is this:
1. The first instance of ccg-realize that is started reads the user's preferences, which are stored on the file system.
2. The user preference file is locked for the duration of the realization process.
3. All further instances of ccg-realize cannot obtain the lock and fail for this reason.
Also, all tools for sub-tasks use the same preference node: the one for TextCCG. The TextCCG preferences are a good fallback in cases where no more specific preferences are given, but I feel that sharing a single node cannot be the intended mechanism. I might not care to see all features or derivation trees when debugging my grammars via tccg, but I might still need them for large-scale realization experiments, which I run via scripts.
For these reasons, I feel that it is necessary to re-implement OpenCCG's preference handling. A new system should support parallelization methods that employ multiple JVM instances, and it should allow settings to be configured per task.
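To make the per-task idea concrete, here is a minimal sketch of settings that are read from plain configuration text, with a task-specific layer on top of shared defaults. Everything in it is an assumption for illustration: the class name `TaskSettings` and the keys `showFeatures` and `showDerivations` are hypothetical and are not taken from OpenCCG. It relies only on the defaults chain of `java.util.Properties`, and since the sources are only read, this sketch needs no lock.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical sketch: per-task settings layered over shared defaults. */
final class TaskSettings {
    private final Properties props;

    private TaskSettings(Properties props) {
        this.props = props;
    }

    /** Parses the shared defaults, then a task layer that overrides them. */
    static TaskSettings load(Reader sharedDefaults, Reader taskOverrides) {
        try {
            Properties shared = new Properties();
            shared.load(sharedDefaults);
            // Keys missing from the task layer fall through to `shared`.
            Properties task = new Properties(shared);
            task.load(taskOverrides);
            return new TaskSettings(task);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the setting as a boolean, or `fallback` if no layer defines it. */
    boolean getBoolean(String key, boolean fallback) {
        String value = props.getProperty(key);
        return value == null ? fallback : Boolean.parseBoolean(value.trim());
    }
}
```

With shared defaults of `showFeatures=false` and a realization-specific layer containing `showFeatures=true`, a batch job would see the features while an interactive session that loads only the defaults would not.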
Any feedback? I'd be happy about a discussion.
Best,
Andrea
Hi Andrea,
Yes, OpenCCG's current use of preferences was clearly a design mistake in retrospect. At the time, it seemed like a nice way of propagating settings from the one interactive tool (tccg), but the usual way to manage settings for batch tools is via config files. The developers of visccg in Austin ran into similar issues and did some partial workarounds, but not the complete reworking that would've been ideal.
In my experience, we've not actually had trouble running multiple instances of ccg-realize at once by executing, e.g., 'ccg-build -f build-rz.xml test-perceptron' via separate command line invocations (from openccg/ccgbank). So perhaps there is an easy fix for your lock file issue, but I'm not sure of the details. Maybe someone else understands this issue better?
Reworking OpenCCG's preferences to support config files while remaining backwards compatible would be a great user contribution. We recently had a user contribution via a pull request on GitHub to get rid of the current Eclipse warnings, which was delightfully painless.
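As a rough sketch of what "config files while remaining backwards compatible" could look like: a value from the config file wins, and the existing preference store is consulted only when the file is silent. The class `SettingsResolver`, its methods, and the node path argument are hypothetical names for illustration, not existing OpenCCG code; the only library calls used are the standard `java.util.Properties` and `java.util.prefs.Preferences` ones.

```java
import java.util.Properties;
import java.util.function.UnaryOperator;
import java.util.prefs.Preferences;

/** Hypothetical sketch: config file first, legacy preferences second. */
final class SettingsResolver {
    private final Properties config;
    private final UnaryOperator<String> legacyLookup;

    /** `legacyLookup` returns the old value for a key, or null if there is none. */
    SettingsResolver(Properties config, UnaryOperator<String> legacyLookup) {
        this.config = config;
        this.legacyLookup = legacyLookup;
    }

    /** Wires the legacy lookup to a user preference node (the path is a placeholder). */
    static SettingsResolver withUserPreferences(Properties config, String nodePath) {
        Preferences node = Preferences.userRoot().node(nodePath);
        return new SettingsResolver(config, key -> node.get(key, null));
    }

    /** Resolution order: config file, then legacy preferences, then `fallback`. */
    String get(String key, String fallback) {
        String fromFile = config.getProperty(key);
        if (fromFile != null) {
            return fromFile;
        }
        String fromLegacy = legacyLookup.apply(key);
        return fromLegacy != null ? fromLegacy : fallback;
    }
}
```

Passing the legacy lookup in as a function keeps the resolver independent of the preference store, so batch tools could supply a lookup that always returns null and never touch the user preferences at all.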
Mike