From: Bryan T. <br...@sy...> - 2010-07-15 16:48:05
Fred,

> There seems to be a bug in that KeyDecoder objects cannot decode BTree keys where the key was built with the JDK collator. That is, the JDK CollationKey contains bytes of zero, which the KeyDecoder falsely assumes separates sections of the BTree key. What is the best way to fix this?

Can you add a unit test which demonstrates this failure? This can go in com.bigdata.btree.keys.TestKeyBuilder. I can then take a look at what is going on with the JDK collation and see if there is something to be done.

> The bug was detected when the icu4j jar was inadvertently left out of a classpath in a test environment. The decision to choose a default key encoding based on whether a class can be loaded is undesirable because two different virtual machines can attempt to use different key encodings on the same BTree. Would you object to always using ICU as the default for now (and possibly using an explicit configuration option in the future)?

I would be fine making ICU the default. You can explicitly configure this using KeyBuilder.Options.COLLATOR and specifying "ICU" as the value for that option. (I will note in passing that the ICU4JNI option has not been tested but might offer some performance benefits if it could be made to work. Also, I have not checked the current release of ICU -- we might be able to update our dependency.) I should also note that the KeyBuilder.Options can be specified globally, for an index namespace, or for a specific index.

> Would you object to always using en_US as the default for now?

I think that the locale of the machine should control the default since it is more likely to be correct and more transparent with the OS/Java stack. You can specify this in the bigdataCluster.config file for scale-out or the Properties for a Journal. I am happy to specify en_US in the sample configuration files. That makes it pretty obvious where people can override this behavior.

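As a starting point for the requested unit test, the zero-byte behavior Fred reports can be checked with the JDK alone. The following is a minimal sketch, not bigdata code: the class name `CollationKeyZeroBytes` and its helper methods are hypothetical, and it uses only `java.text.Collator` and `java.text.CollationKey`. It does not touch KeyBuilder or KeyDecoder; it only dumps the bytes of a JDK collation key and reports whether any of them is zero.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Locale;

// Hypothetical helper class for illustration; not part of bigdata.
final class CollationKeyZeroBytes {

    /** Returns the raw sort-key bytes the JDK collator produces for a string. */
    static byte[] jdkSortKey(String s, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        CollationKey key = collator.getCollationKey(s);
        return key.toByteArray();
    }

    /** Returns true if any byte in the array is zero. */
    static boolean containsZeroByte(byte[] bytes) {
        for (byte b : bytes) {
            if (b == 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        byte[] key = jdkSortKey("abc", Locale.US);

        // Dump the key as hex so any embedded zero bytes are visible.
        StringBuilder hex = new StringBuilder();
        for (byte b : key) {
            hex.append(String.format("%02x ", b & 0xff));
        }
        System.out.println("sort key bytes: " + hex.toString().trim());
        System.out.println("contains zero byte: " + containsZeroByte(key));
    }
}
```

If the second line reports a zero byte for an ordinary string, any decoder that treats a zero byte as a field separator cannot safely split a key that embeds such a sort key.
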
> The choice of language appears to be left to default system properties which can also vary from VM to VM.

The choice is made based on the local environment, but it is captured by the IndexMetadata object and imposed on any machine which uses that index.

> What is the role of locale/language and collation in BTree keys? Does the language choice affect RDF or SAIL or lexicon or SPARQL queries?

Bigdata generates Unicode sort keys from Unicode strings when they are encoded into a key. The locale/language/collation options govern the ordering that will be imposed on the tuples in the index based on how they influence the collation order of Unicode strings for that index. This affects the RDF database primarily through the TERM2ID index. For example, if the ASCII collator is chosen then Unicode literals will be collapsed to only those distinctions maintained by ASCII. Likewise, the choice of KeyBuilder.Options.STRENGTH controls what distinctions will be maintained by the Unicode sort keys. RDF Literals which are encoded to the same Unicode sort key are treated as the same literal and only one instance of that literal will be stored in the lexicon.

> There seems to be a fairly odd relationship between three major classes in BTree key handling: Schema, KeyBuilder, KeyDecoder. It would seem architecturally that the Schema object for a BTree should contain all the information needed to create and manipulate keys for that BTree rather than having that information distributed over many classes

Schema and KeyDecoder are part of the SparseRowStore package. They are not used by general purpose B+Tree instances, but only for an index providing a key-value store, for example the global row store. The IndexMetadata class captures all information required to encode/decode keys and values for a given index.

Thanks,
Bryan

________________________________
From: Fred Oliver [mailto:fko...@gm...]
Sent: Thursday, July 15, 2010 12:26 PM
To: Bigdata Developers
Subject: [Bigdata-developers] BTree key mismatch questions

There seems to be a bug in that KeyDecoder objects cannot decode BTree keys where the key was built with the JDK collator. That is, the JDK CollationKey contains bytes of zero, which the KeyDecoder falsely assumes separates sections of the BTree key. What is the best way to fix this?

The bug was detected when the icu4j jar was inadvertently left out of a classpath in a test environment. The decision to choose a default key encoding based on whether a class can be loaded is undesirable because two different virtual machines can attempt to use different key encodings on the same BTree. Would you object to always using ICU as the default for now (and possibly using an explicit configuration option in the future)?

What is the role of locale/language and collation in BTree keys? Does the language choice affect RDF or SAIL or lexicon or SPARQL queries? The choice of language appears to be left to default system properties which can also vary from VM to VM. Would you object to always using en_US as the default for now (and possibly using an explicit configuration option in the future)?

There seems to be a fairly odd relationship between three major classes in BTree key handling: Schema, KeyBuilder, KeyDecoder. It would seem architecturally that the Schema object for a BTree should contain all the information needed to create and manipulate keys for that BTree rather than having that information distributed over many classes. Would you object to Schema objects being the sole constructors of KeyBuilders and KeyDecoders (replacing DefaultKeyBuilderFactory)? If that is right, then it seems that Schema objects ought to be properties of BTrees which can be obtained from a BTree itself [BTree.getSchema()], and should only be constructed by clients during BTree creation. Does that make sense?

Fred
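
On the role of collation strength raised in this thread: the reply notes that strings which encode to the same Unicode sort key are treated as the same literal. The sketch below illustrates the general effect using the JDK collator as a stand-in. It is an assumption that KeyBuilder.Options.STRENGTH behaves analogously, and the class name `CollationStrengthDemo` and its helper are hypothetical, not bigdata API.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical illustration using the JDK collator; not bigdata code.
final class CollationStrengthDemo {

    /** Returns true if both strings yield equal collation keys at the given strength. */
    static boolean sameSortKey(int strength, String a, String b) {
        Collator collator = Collator.getInstance(Locale.US);
        collator.setStrength(strength);
        return collator.getCollationKey(a)
                .compareTo(collator.getCollationKey(b)) == 0;
    }

    public static void main(String[] args) {
        // PRIMARY strength ignores case differences, so the keys collide.
        System.out.println("PRIMARY  abc/ABC same key: "
                + sameSortKey(Collator.PRIMARY, "abc", "ABC"));

        // TERTIARY strength keeps case distinctions, so the keys differ.
        System.out.println("TERTIARY abc/ABC same key: "
                + sameSortKey(Collator.TERTIARY, "abc", "ABC"));
    }
}
```

In an index that deduplicates on the sort key, the first case would store a single entry for both strings, while the second would store two.
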