|
From: Fred O. <fko...@gm...> - 2010-07-15 16:25:47
|
There seems to be a bug in that KeyDecoder objects cannot decode BTree keys where the key was built with the JDK collator. That is, the JDK CollationKey contains bytes of zero, which the KeyDecoder falsely assumes separates sections of the BTree key. What is the best way to fix this?
The bug was detected when the icu4j jar was inadvertently left out of a classpath in a test environment. The decision to choose a default key encoding based on whether a class can be loaded is undesirable because two different virtual machines can attempt to use different key encodings on the same BTree. Would you object to always using ICU as the default for now (and possibly using an explicit configuration option in the future)?
What is the role of locale/language and collation in BTree keys? Does the language choice affect RDF or SAIL or lexicon or SPARQL queries? The choice of language appears to be left to default system properties which can also vary from VM to VM. Would you object to always using en_US as the default for now (and possibly using an explicit configuration option in the future)?
There seems to be a fairly odd relationship between three major classes in BTree key handling: Schema, KeyBuilder, KeyDecoder. It would seem architecturally that the Schema object for a BTree should contain all the information needed to create and manipulate keys for that BTree rather than having that information distributed over many classes. Would you object to Schema objects being the sole constructors of KeyBuilders and KeyDecoders (replacing DefaultKeyBuilderFactory)? If that is right, then it seems that Schema objects ought to be properties of BTrees which can be obtained from a BTree itself [BTree.getSchema()], and should only be constructed by clients during BTree creation. Does that make sense?
Fred |
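The zero-byte problem described above can be reproduced with the JDK alone. The following is a minimal sketch, not the bigdata KeyDecoder: the class name, the helper methods, and the `[sort key][0x00][column name]` layout are assumptions made for illustration only. It shows that a JDK `CollationKey` contains 0x00 bytes, so a decoder that scans for the first zero byte stops inside the sort key instead of at the intended delimiter.

```java
import java.nio.charset.StandardCharsets;
import java.text.Collator;
import java.util.Locale;

// Hypothetical illustration; none of these names come from the bigdata code base.
final class JdkSortKeyZeroBytes {

    /** Sort key bytes produced by the JDK collator (not ICU), fixed to Locale.US. */
    static byte[] jdkSortKey(String s) {
        Collator collator = Collator.getInstance(Locale.US);
        return collator.getCollationKey(s).toByteArray();
    }

    /** Index of the first 0x00 byte, or -1 if there is none. */
    static int firstZero(byte[] bytes) {
        for (int i = 0; i < bytes.length; i++) {
            if (bytes[i] == 0) {
                return i;
            }
        }
        return -1;
    }

    /** Assumed layout for this sketch: [sort key][0x00][column name]. */
    static byte[] compositeKey(String primaryKey, String columnName) {
        byte[] sortKey = jdkSortKey(primaryKey);
        byte[] column = columnName.getBytes(StandardCharsets.US_ASCII);
        byte[] key = new byte[sortKey.length + 1 + column.length];
        System.arraycopy(sortKey, 0, key, 0, sortKey.length);
        key[sortKey.length] = 0; // the intended delimiter
        System.arraycopy(column, 0, key, sortKey.length + 1, column.length);
        return key;
    }

    public static void main(String[] args) {
        byte[] sortKey = jdkSortKey("abc");
        byte[] key = compositeKey("abc", "name");
        // A naive decoder looks for the first zero byte to find the column name.
        System.out.println("intended delimiter index: " + sortKey.length);
        System.out.println("first zero byte found at: " + firstZero(key));
    }
}
```

If the two printed indices differ, a delimiter scan has been fooled by a zero byte inside the sort key, which is the failure mode reported here.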
|
From: Bryan T. <br...@sy...> - 2010-07-15 16:48:05
|
Fred,
> There seems to be a bug in that KeyDecoder objects cannot decode BTree keys where the key was built with the JDK collator. That is, the JDK CollationKey contains bytes of zero, which the KeyDecoder falsely assumes separates sections of the BTree key. What is the best way to fix this?
Can you add a unit test which demonstrates this failure? This can go in com.bigdata.btree.keys.TestKeyBuilder. I can then take a look at what is going on with the JDK collation and see if there is something to be done.
> The bug was detected when the icu4j jar was inadvertently left out of a classpath in a test environment. The decision to choose a default key encoding based on whether a class can be loaded is undesirable because two different virtual machines can attempt to use different key encodings on the same BTree. Would you object to always using ICU as the default for now (and possibly using an explicit configuration option in the future)?
I would be fine making ICU the default. You can explicitly configure this using KeyBuilder.Options.COLLATOR and specifying "ICU" as the value for that option. (I will note in passing that the ICU4JNI option has not been tested but might offer some performance benefits if it could be made to work. Also, I have not checked the current release of ICU -- we might be able to update our dependency.) I should also note that the KeyBuilder.Options can be specified globally, for an index namespace, or for a specific index.
> Would you object to always using en_US as the default for now.
I think that the locale of the machine should control the default since it is more likely to be correct and more transparent with the OS/Java stack. You can specify this in the bigdataCluster.config file for scale-out or the Properties for a Journal. I am happy to specify en_US in the sample configuration files. That makes it pretty obvious where people can override this behavior.
> The choice of language appears to be left to default system properties which can also vary from VM to VM.
The choice is made based on the local environment, but it is captured by the IndexMetadata object and imposed on any machine which uses that index.
> What is the role of locale/language and collation in BTree keys? Does the language choice affect RDF or SAIL or lexicon or SPARQL queries?
Bigdata generates Unicode sort keys from Unicode strings when they are encoded into a key. The locale/language/collation options govern the ordering that will be imposed on the tuples in the index based on how they influence the collation order of Unicode strings for that index. This affects the RDF database primarily through the TERM2ID index. For example, if the ASCII collator is chosen then Unicode literals will be collapsed to only those distinctions maintained by ASCII. Likewise, the choice of the KeyBuilder.Options.STRENGTH controls what distinctions will be maintained by the Unicode sort keys. RDF Literals which are encoded to the same Unicode sort key are treated as the same literal and only one instance of that literal will be stored in the lexicon.
> There seems to be a fairly odd relationship between three major classes in BTree key handling: Schema, KeyBuilder, KeyDecoder. It would seem architecturally that the Schema object for a BTree should contain all the information needed to create and manipulate keys for that BTree rather than having that information distributed over many classes
Schema and KeyDecoder are part of the SparseRowStore package. They are not used by general purpose B+Tree instances, but only for an index providing a key-value store, for example, the global row store. The IndexMetadata class captures all information required to encode/decode keys and values for a given index.
Thanks,
Bryan |
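The remark about collation strength collapsing distinct literals onto one sort key can be illustrated with the JDK collator. This is a sketch under stated assumptions: it uses `java.text.Collator` rather than ICU or bigdata's KeyBuilder, and the class and method names are invented for the example. It only shows the general effect, that two strings which differ in case produce identical sort key bytes at PRIMARY strength and different bytes at TERTIARY strength.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical illustration; this is not the bigdata KeyBuilder.
final class StrengthCollapsesKeys {

    /** Sort key bytes for s under the given java.text.Collator strength. */
    static byte[] sortKey(String s, int strength) {
        Collator collator = Collator.getInstance(Locale.US);
        collator.setStrength(strength);
        return collator.getCollationKey(s).toByteArray();
    }

    /** True if both strings encode to the same sort key at this strength. */
    static boolean sameKey(String a, String b, int strength) {
        return Arrays.equals(sortKey(a, strength), sortKey(b, strength));
    }

    public static void main(String[] args) {
        // At PRIMARY strength case differences are ignored by the collator.
        System.out.println("PRIMARY:  " + sameKey("abc", "ABC", Collator.PRIMARY));
        // At TERTIARY strength case differences are preserved.
        System.out.println("TERTIARY: " + sameKey("abc", "ABC", Collator.TERTIARY));
    }
}
```

An index keyed on such sort keys could hold only one entry for "abc" and "ABC" at the weaker strength, which is the collapsing behavior described for the lexicon.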
|
From: Fred O. <fko...@gm...> - 2010-07-15 18:07:36
|
On Thu, Jul 15, 2010 at 12:47 PM, Bryan Thompson <br...@sy...> wrote:
> Fred,
>
>> There seems to be a bug in that KeyDecoder objects cannot decode BTree
>> keys where the key was built with the JDK collator. That is, the JDK
>> CollationKey contains bytes of zero, which the KeyDecoder falsely assumes
>> separates sections of the BTree key. What is the best way to fix this?
> Can you add a unit test which demonstrates this failure. This can go in
> com.bigdata.btree.keys.TestKeyBuilder. I can then take a look at what is
> going on with the JDK collation and see if there is something to be done.
Done. Svn #3202. I assume that failures will start showing up in the CI tests.
Fred |
|
From: Fred O. <fko...@gm...> - 2010-07-15 18:41:19
|
On Thu, Jul 15, 2010 at 12:47 PM, Bryan Thompson <br...@sy...> wrote:
> Fred,
>
>> The bug was detected when the icu4j jar was inadvertently left out of a
>> classpath in a test environment. The decision to choose a default key
>> encoding based on whether a class can be loaded is undesirable because two
>> different virtual machines can attempt to use different key encodings on the
>> same BTree. Would you object to always using ICU as the default for now (and
>> possibly using an explicit configuration option in the future)?
>
> I would be fine making ICU the default. You can explicitly configure this
> using KeyBuilder.Options.COLLATOR and specifying "ICU" as the value for that
> option. (I will note in passing that the ICU4JNI option has not been tested
> but might offer some performance benefits if it could be made to work.
> Also, I have not checked the current release of ICU -- we might be able to
> update our dependency.)
I've made ICU the default as of svn #3203. By configure, I meant changing a configuration file at deployment time.
Fred |
|
From: Fred O. <fko...@gm...> - 2010-07-15 20:17:57
|
On Thu, Jul 15, 2010 at 12:47 PM, Bryan Thompson <br...@sy...> wrote:
> Fred,
>
> I think that the local of the machine should control the default since it is
> more likely to be correct and more transparent with the OS/Java stack. You
> can specify this in the bigdataCluster.config file for scale-out or the
> Properties for a Journal. I am happy to specify en_US in the sample
> configuration files. That makes it pretty obvious where people can override
> this behavior.
Yes, please add that locale to the sample configuration files. I think it should be made reasonably obvious that the locale of the machine and the locale of the data need not be related.
Otherwise, if different machines (containing different shards or clients) in a single cluster had different locale settings, then would all the machines be creating or processing keys in a single BTree with the same locale?
>> The choice of language appears to be left to default system properties
>> which can also vary from VM to VM.
>
> The choice is made based on the local environment, but it is captured by the
> IndexMetadata object and imposed on any machine which uses that index.
If BTree indices are split and shards moved to machines with different locale settings, then will the IndexMetadata object on both machines necessarily agree? What is the origin of the (content of the) IndexMetadata object?
Fred |
|
From: Bryan T. <br...@sy...> - 2010-07-15 20:29:36
|
Fred,
> Yes, please add that locale to the sample configuration
> files. I think it should be made reasonably obvious that the
> locale of the machine and the locale of the data need not be related.
If you don't mind, can you apply and test the edit? If you look in the configuration files (bigdataStandalone.config, bigdataCluster.config, bigdataCluster16.config), you will see the following line in each file. It is part of the section where we are declaring the properties that will be applied to the triple store created by the batch job:
new NV(BigdataSail.Options.COLLATOR,"ASCII"),
You should be able to just specify additional properties right there to override the locale, collator, etc. The BigdataSail.Options is just inheriting options which include KeyBuilder.Options, so all of the options should be accessible in the BigdataSail.Options namespace. You can also explicitly reference them in the KeyBuilder.Options namespace if you feel that is clearer (but make sure to import that namespace at the top of the configuration file).
> Otherwise, if different machines (containing different shards or
> clients) in a single cluster had different locale settings,
> then would all the machines be creating or processing keys in
> a single BTree with the same locale?
Yes, they would still be using the same locale regardless of their local settings. The locale of the IKeyBuilder that will be used for the index is fixed when the index is created. Otherwise just changing the Locale on the machine could render the data in the index unreadable!
> If BTree indices are split and shards moved to machines with
> different locale settings, then will the IndexMetadata object
> on both machines necessarily agree? What is the origin of the
> (content of the) IndexMetadata object?
The origin is the IndexMetadata with which the B+Tree was originally created. For scale-out, the IndexMetadata template is stored in the MetadataService. Each time a local B+Tree object is created for a shard the same IndexMetadata object is applied to that new B+Tree object. So the local IndexMetadata is consistent with the original settings. (You have to do extra work if you want to propagate a change to the IndexMetadata for all shards of a scale-out index).
Thanks,
Bryan
PS: I am trying to see if I can work around that JDK collator issue with the embedded nulls in the Unicode sort keys. The underlying problem is that we can't find the start of the column name, not that the column name itself can not be decoded.
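The point above, that the locale is fixed when the index is created and then imposed on every machine that opens it, can be sketched as follows. This is not the IndexMetadata class: `CapturedCollation` and its methods are hypothetical names, and the sketch uses the JDK collator. The idea is only that the settings are read from the environment once, stored with the index, and every later key is built from the stored settings rather than from `Locale.getDefault()`.

```java
import java.io.Serializable;
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical stand-in for collation settings persisted with an index.
final class CapturedCollation implements Serializable {

    private static final long serialVersionUID = 1L;

    private final String languageTag; // e.g. "en-US", stored as text
    private final int strength;       // a java.text.Collator strength constant

    CapturedCollation(Locale locale, int strength) {
        this.languageTag = locale.toLanguageTag();
        this.strength = strength;
    }

    /** Reads the local environment once, at index creation time. */
    static CapturedCollation fromEnvironment() {
        return new CapturedCollation(Locale.getDefault(), Collator.TERTIARY);
    }

    /** Builds a sort key from the captured settings, never from the current default. */
    byte[] sortKey(String s) {
        Collator collator = Collator.getInstance(Locale.forLanguageTag(languageTag));
        collator.setStrength(strength);
        return collator.getCollationKey(s).toByteArray();
    }

    public static void main(String[] args) {
        Locale original = Locale.getDefault();
        try {
            Locale.setDefault(Locale.US);
            CapturedCollation captured = fromEnvironment();
            byte[] before = captured.sortKey("Zebra");
            // Simulate the index being opened on a machine with another locale.
            Locale.setDefault(Locale.FRANCE);
            byte[] after = captured.sortKey("Zebra");
            System.out.println("keys agree: " + Arrays.equals(before, after));
        } finally {
            Locale.setDefault(original);
        }
    }
}
```

Because the key builder is derived from the stored settings, changing the default locale of the machine does not change the bytes written to or looked up in the index.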
|
|
From: Fred O. <fko...@gm...> - 2010-07-15 21:39:01
|
On Thu, Jul 15, 2010 at 4:29 PM, Bryan Thompson <br...@sy...> wrote:
> Fred,
>
>> Yes, please add that locale to the sample configuration
>> files. I think it should be made reasonably obvious that the
>> locale of the machine and the locale of the data need not be related.
>
> If you don't mind, can you apply and test the edit. If you look in the configuration file (bigdataStandalone.config, bigdataCluster.config, bigdataCluster16.config), you will see the following line in each file. It is part of the section where we are declaring the properties that will be applied to the triple store created by the batch job:
>
> new NV(BigdataSail.Options.COLLATOR,"ASCII"),
>
> You should be able to just specify additional properties right there to override the locale, collator, etc. The BigdataSail.Options is just inheriting options which include KeyBuilder.Options, so all of the options should be accessible in the BigdataSail.Options namespace. You can also explictly reference them in the KeyBuilder.Options namespace if you feel that is clearer (but make sure to import that namespace at the top of the configuration file).
OK. That covers a few of the cases (maybe the important ones). But there are DefaultKeyBuilderFactories created with empty or null properties objects:
bigdata/src/java/com/bigdata/btree/DefaultTupleSerializer.java:
    return new DefaultKeyBuilderFactory(new Properties());
bigdata/src/java/com/bigdata/btree/keys/KeyBuilder.java:
    return new DefaultKeyBuilderFactory(null/* properties */)
bigdata/src/java/com/bigdata/btree/NOPTupleSerializer.java:
    new DefaultKeyBuilderFactory(new Properties()));
bigdata/src/java/com/bigdata/journal/Name2Addr.java:
    new DefaultKeyBuilderFactory(new Properties())));
What are the consequences of unfortunate collators or locales in these places?
Fred |
|
From: Bryan T. <br...@sy...> - 2010-07-15 21:58:56
|
Fred,
Answers below.
> bigdata/src/java/com/bigdata/btree/DefaultTupleSerializer.java:
> return new DefaultKeyBuilderFactory(new Properties());
^ The actual collator behavior will be captured by the IndexMetadata (the tuple serializer is saved as part of the IndexMetadata). You might add a warning to the constructor.
> bigdata/src/java/com/bigdata/btree/keys/KeyBuilder.java: return
> new DefaultKeyBuilderFactory(null/* properties */)
^ This is the specified behavior - it uses whatever is set in System.properties and otherwise defaults. You might add a warning to the constructor.
What is important for these first two cases is to make sure that we apply the Collator configuration as described for the specific index or triple store when it is created. So the use of these constructor forms could allow an unintended collator configuration to be inherited and made persistent as part of an index, triple store, etc. The critical case for the triple store is handled explicitly in LexiconRelation on line 644 (below) where it uses the properties used to create the AbstractTripleStore to set up the collator for the TERM2ID index. This is the only triple store index which can have Unicode data in the key, and hence the only one for which the collator configuration can have any impact. All of the rest of the indices are based on non-text data in the keys.
protected IndexMetadata getTerm2IdIndexMetadata(final String name) {
    final IndexMetadata metadata = newIndexMetadata(name);
    metadata.setTupleSerializer(new Term2IdTupleSerializer(getProperties()));
    return metadata;
}
> bigdata/src/java/com/bigdata/btree/NOPTupleSerializer.java:
> new DefaultKeyBuilderFactory(new Properties()));
^ This should have no impact. The tuple serializer is not used in this case.
> bigdata/src/java/com/bigdata/journal/Name2Addr.java:
> new DefaultKeyBuilderFactory(new Properties())));
^ This one is worth doing something about. The role of Name2Addr is to map from index names to the address of a checkpoint record for the named index. The journals in the federation really should all have consistent behavior in this regard; otherwise bizarre errors could creep in with different collators on different nodes. E.g., you could have two index names which one node believed to be distinct while another node was using a collator which did not capture the distinction. For this, I think that the only "fix" is to put the information into the service configuration so the data services and the metadata service all share the same collator configuration (all services can share it, but these are the ones which could cause a problem).
Can you file an issue on this?
Thanks,
Bryan
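The Name2Addr concern above, that two nodes could disagree on whether two index names are distinct, can be illustrated with a name-to-address map ordered by a collator. This is a hypothetical sketch and not how Name2Addr is implemented: the class name, the use of a `TreeMap`, and the sample index names are all assumptions. It shows that the same sequence of registrations yields a different number of entries depending on the collator strength in effect on the node.

```java
import java.text.Collator;
import java.util.Locale;
import java.util.TreeMap;

// Hypothetical illustration; Name2Addr itself is a B+Tree, not a TreeMap.
final class NameMapCollatorMismatch {

    /** Registers the names in a map ordered by a collator and counts the entries. */
    static int distinctNames(int strength, String... names) {
        Collator collator = Collator.getInstance(Locale.US);
        collator.setStrength(strength);
        // Keys that the collator treats as equal overwrite each other.
        TreeMap<String, Long> nameToAddr = new TreeMap<>(collator);
        long addr = 0L;
        for (String name : names) {
            nameToAddr.put(name, addr++);
        }
        return nameToAddr.size();
    }

    public static void main(String[] args) {
        // A node configured with PRIMARY strength sees one index name.
        System.out.println("PRIMARY:  " + distinctNames(Collator.PRIMARY, "kb.spo", "KB.SPO"));
        // A node configured with TERTIARY strength sees two.
        System.out.println("TERTIARY: " + distinctNames(Collator.TERTIARY, "kb.spo", "KB.SPO"));
    }
}
```

Sharing one collator configuration across the services, as suggested above, removes this source of disagreement.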
|