1. Summary
  2. Files
  3. Support
  4. Report Spam
  5. Create account
  6. Log in

Ticket #107 (closed defect: fixed)

Opened 3 years ago

Last modified 23 months ago

JDK collator results in decode errors for the SparseRowStore

Reported by: thompsonbry Owned by: thompsonbry
Priority: major Milestone:
Component: Other Version: TERMS_REFACTOR_BRANCH
Keywords: Cc: fkoliver, mrpersonick, mroycsi

Description

Fred:
There seems to be a bug in that KeyDecoder? objects cannot decode BTree keys where the key was built with the JDK collator. That is, the JDK CollationKey? contains bytes of zero, which the KeyDecoder? falsely assumes separates sections of the BTree key. What is the best way to fix this?

Bryan:

Can you add a unit test which demonstrates this failure.  This can go in com.bigdata.btree.keys.TestKeyBuilder?.  I can then take a look at what is going on with the JDK collation and see if there is something to be done.

Fred:

Done. Svn #3202. I assume that failures will start showing up in the CI tests.

Change History

  Changed 3 years ago by thompsonbry

  • status changed from new to closed
  • resolution set to fixed

*** THIS COMMIT BREAKS BACKWARD COMPATIBILITY ***

However, a backward compatibility mode is available. See
SparseRowStore?.Options.PRIMARY_KEY_UNICODE_CLEAN.

The issue: The primary key was originally written using a Unicode sort
key. However, the JDK generates Unicode sort keys with embedded nuls
and that broke the logic to detect the end of the Unicode primary
keys. In order to accommodate this behavior, the Unicode primary key
is now encoded as UTF8 which also has the advantage that we can decode
Unicode primary keys. Standard prefix compression on the B+Tree should
make up for the larger representation of the Unicode primary key in
the B+Tree.

Other changes:

- Added javadoc to indicate that the JDK collator uses embedded nul
bytes, which means that we can not detect the end of the start of the
column name for the SparseRowStore?.

- Refactored the KeyBuilder? unit tests a bit to run unit tests of
Unicode behavior under both the JDK and ICU collators.

- Added an explicit override for the JDK collator for the
GlobalRowStore?.

Committed revision 3206.

  Changed 3 years ago by thompsonbry

  • cc mrpersonick added
  • status changed from closed to reopened
  • resolution fixed deleted

I am reopening this issue. Rather than break compatibility now in the trunk, I am going to defer this change until we reach roll in the lexicon refactor branch. That branch will also break compatibility and will offer people more functionality as well. The primary key unicode clean option is defaulting to false with this commit. This is the backward compatibility mode. I will set it to true when we roll in the lexicon refactor branch.

  Changed 3 years ago by thompsonbry

Also, before closing this issue out again, write more unit tests for the SparseRowStore? encode and decode logic and for the successor semantics when the primary key is a Unicode string (it currently uses the fixed length bit string successor in this case for the unicode clean mode.)

  Changed 3 years ago by fkoliver

One of the benefits of using UTF8 is the ability to reconstruct the String as you now allow. When is this helpful?

The successor function applied to the UTF8 encoded byte[] will eventually result in a sequence which is not legal UTF8 and from which no String can be reconstructed. Does that render the reconstruction of the Strings moot?

The collation order implied by UTF8 is vastly different for everyone, and unhelpful for languages other than English.

Would you consider removing the JDK collator entirely instead of using UTF8? (Less code, less tests, little value?)

Otherwise, would you consider changing the name of the enum from JDK to UTF8 to emphasize the difference since you just made a backward incompatible change?

  Changed 3 years ago by thompsonbry

Fred,

Yes, the ability to reconstruct the primary key for the key-value store is useful. We could already do that for all other datatypes. The tradeoff is that the keys in the underlying B+Tree are now clustered by the UTF8 encoding of the primary key -- for primary keys which use Unicode. I do not see any downside of this since the use cases for key-value stores do not have the same level of consideration for total ordering -- many key-value stores actually have no locality and are based on hash maps. We do use some locality within the context of hierarchical namespaces and UTF8 will preserve those semantics.

The successor function does not produce keys, but exclusive upper bounds for key-range iterators. It does not matter that the key is invalid as UTF8 for those purposes.

I would consider strongly cautioning people about the JDK collator. However, bigdata is also able to run in very small footprints and there may be people who would benefit from not having the ICU libraries bundled, which is the main reason for the JDK collator. There is other reason to use it.

The JDK collator and the key-value store compatibility are nearly entirely separate things. The only reason why we are having this issue is because the JDK collator embeds nul bytes into the generated sort key rather than following the Unicode specification for compressed sort keys. This caused an issue because the key-value store made an assumption that there were no such embedded nul bytes. However, as you discovered when running with a bad classpath, no one is was using the JDK collator or they would have seen the same exception from the global row store.

My main concern with the collation ordering is that it support hierarchical namespaces. However, that may be an overly narrow concern. We should look at broader use cases for the key-value store and see if there are other reasons why collation ordering there could make a difference.

Note that the key-value store is very much a special case. Most applications use the BTree directly and inherent whatever collation behavior is configured for the BTree. For the key-value store, collation is much less critical.

follow-up: ↓ 7   Changed 3 years ago by fkoliver

Would you consider changing the name of the enum from JDK to UTF8 to emphasize the difference since there are no explicit uses in the tree?

in reply to: ↑ 6   Changed 3 years ago by thompsonbry

Replying to fkoliver:

Would you consider changing the name of the enum from JDK to UTF8 to emphasize the difference since there are no explicit uses in the tree?

I may be dense, but I frankly do not understand why a name change would be necessary. If we were to make a name change, the name "UTF8" would be completely misleading. The "unicode clean primary key" option for the SparseRowStore? changes the semantics of the key-value store BUT it has absolutely no affect on the other uses of the IKeyBuilder. The name "JDK" indicates clearly that the collator is the JDK based collator rather than the other choices.

I think that you are giving too much of a focus in terms of the overall architecture to the key-value store here. It is just one user of the B+Tree and IKeyBuilder. The proposed "unicode clean primary key" change affects _only_ the SparseRowStore?. It would not have any affect on any other use of the JDK collator. In fact, I did not make any changes to the code surrounding the JDK collator -- it behaves exactly as it always behaved. The only difference is in the use that the key-value store makes of that collator. That change is to fix a broken assumption about the behavior of the JDK collator based on the evidence that it will embed nul bytes into a generated key.

Other than the use of the key-value store for catalog information, the triple store uses B+Trees directly to model RDF data and is not affected by the presence of nul bytes in the JDK collator generated keys.

If we were to change the JDK collator semantics, then I would certainly agree, but that is not what was changed. It is the key-value store semantics that are being changed.

Does that make sense?

  Changed 3 years ago by thompsonbry

  • status changed from reopened to accepted

  Changed 3 years ago by thompsonbry

  • status changed from accepted to closed
  • resolution set to fixed

This is fixed in the 0.83.0 release.

  Changed 3 years ago by thompsonbry

  • cc mroycsi added
  • status changed from closed to reopened
  • resolution fixed deleted

There is still a problem with com.bigdata.sparse.Schema when it encodes the schema name into the SparseRowStore?. It currently relies on a Unicode sort key generated from the schema name. When combined with the choice of the JDK collator option, this can result in embedded nul bytes in the generated Unicode sort key which would make it impossible to decode the key since we could not correctly detect the end of the schema name in the key.

A unit test should be written which demonstrates this problem. The test will have to use a Unicode schema name which would cause the introduction of embedded nul bytes by the JDK collator - not all Unicode strings will have this effect.

The solution is to encode the schema name as UTF8 into the key rather than as a Unicode sort key.

In practice, the effect of this bug is small to negligible for people using bigdata as a triple store host. However, native uses of the SparseRowStore? with the JDK collator could encounter this bug.

Migration for people using bigdata as a triple store host is simply to rewrite the keys in the global row store. There is only one schema name in use and its value is known, so it does not need to be decodable.

Migration for people making more general purpose use of the SparseRowStore? would require generating the Unicode sort key for the name of each Schema, matching the schema name in the row store tuples, and then rewriting the keys with the UTF8 encoding of the Schema name.

  Changed 3 years ago by thompsonbry

I have deprecated KeyBuilder?.asSortKey(Object). This method is used heavily by the unit tests. However, is not suitable for the core code base for two reasons. First, it uses whatever the default KeyBuilder? configuration happens to be which is perfectly Ok unless you are using Unicode sort key components in a key. Second, it uses a static instances protected by the monitor of that instance which causes the lock to be a bottleneck. The correct pattern is to use a thread-local or per task IKeyBuilder instance configured for a specific index or task.

The only remaining use of KeyBuilder?.asSortKey(Object) in the main code base is in com.bigdata.sparse.Schema to generate a Unicode sort key from the schema name.

  Changed 3 years ago by thompsonbry

I have implemented a unicode clean option for encoding the schema name in the SparseRowStore? keys. This option is currently disabled by default, which provides backward compatibility.

Note that I have not been able to generate a schema name which in fact caused the JDK collation rules to embed a nul byte into the key. I imagine that the constraints on the legal patterns for schema names preclude many cases which might otherwise have caused a problem. However, I would not be surprised to learn that legal schema names could be used to generate Unicode sort keys with embedded nul bytes using the JDK CollatorEnum? option.

I have left the unicode clean option disabled for the moment so we can reflect on the best way to handle this. For example, if we put the bigdata release version number into the code and from the code into the persistence store, then we could automatically detect the version of the code used to create a given persistent data structure. Something along these lines could facilitate automatic data migration.

  Changed 3 years ago by thompsonbry

Removed KeyBuilder?.asSortKey() call from com.bigdata.sparse.Schema. The same logic exists as a private static method with a private static KeyBuilder? instance for backward compatibility on the SparseRowStore?.

Moved KeyBuilder?.asSortKey() into the test suite (on TestKeyBuilder?).

  Changed 3 years ago by thompsonbry

  • status changed from reopened to accepted

Fred provided a stack trace, which I have excerpted below. This is during the triple store create, so it should be easy enough to test out. I'll start as you did by throwing an exception out of ICUSortKeyGenerator and making sure that it is not being used. I can then do a triple store create when the triple store is configured for the JDK collator option and see what I find.

Caused by: java.lang.ArrayIndexOutOfBoundsException?: -1

at com.bigdata.sparse.KeyDecoder?.<init>(KeyDecoder?.java:283)
at com.bigdata.sparse.AbstractAtomicRowReadOrWrite?.atomicRead(AbstractAtomicRowReadOrWrite?.java:260)
at com.bigdata.sparse.AbstractAtomicRowReadOrWrite?.atomicRead(AbstractAtomicRowReadOrWrite?.java:158)
at com.bigdata.sparse.AtomicRowWriteRead?.apply(AtomicRowWriteRead?.java:161)
at com.bigdata.sparse.AtomicRowWriteRead?.apply(AtomicRowWriteRead?.java:23)
at com.bigdata.journal.IndexProcedureTask?.doTask(IndexProcedureTask?.java:56)
at com.bigdata.journal.AbstractTask?$InnerWriteServiceCallable?.call(AbstractTask?.java:2038)
at com.bigdata.journal.AbstractTask?.doUnisolatedReadWriteTask(AbstractTask?.java:1799)
at com.bigdata.journal.AbstractTask?.call2(AbstractTask?.java:1726)
at com.bigdata.journal.AbstractTask?.call(AbstractTask?.java:1592)
at java.util.concurrent.FutureTask?$Sync.innerRun(FutureTask?.java:303)
at java.util.concurrent.FutureTask?.run(FutureTask?.java:138)
at com.bigdata.concurrent.NonBlockingLockManagerWithNewDesign?$LockFutureTask?.run(NonBlockingLockManagerWithNewDesign?.java:1970)
... 3 more

Shutting down: Fri Jul 30 16:16:04 EDT 2010
BUILD SUCCESSFUL (total time: 11 seconds)

  Changed 3 years ago by thompsonbry

Bug fix to DefaultKeyBuilderFactory?, which was passing the default into properties.getProperty(key,def) and therefore was never testing System.getProperty(key,def). This was causing the ICU collator to always be chosen by this code path even when the collator had been explicitly set as a JVM property.

Added final in a bunch of places to ICUSortKeyGenerator.

Removed logic in the GlobalRowStoreHelper? which was forcing the ASCII collator when the JDK collator had been requested due to a historical problem with the JDK collator and the global row store.

In order to use the JDK collator you must now turn on unicode clean support in the SparseRowStore? class. Since this breaks binary compatibility (for both the JDK and ICU collator options), I plan to do this as part of our next release.

Improved the inline comments in KeyDecoder? regarding the ArrayIndexOutOfBoundsException?.

To verify that this fixes the JDK collator issue: (1) modify the ICUSortKeyGenerator to always throw an exception out of its constructor; (2) specify "-Dcom.bigdata.btree.keys.KeyBuilder?.collator=JDK" on the command line to select the JDK collator; (3) specify "-Dcom.bigdata.sparse.Schema.schemaName.unicodeClean=true" on the command line; and (4) run any unit test which create a triple store but does not explicitly specify the CollatorEnum? option.

  Changed 2 years ago by thompsonbry

Before closing out this issue, extend the IndexMetadata? for the SparseRowStore? instances to specify whether they are unicode clean or not. Existing store instances should read and then update their SparseRowStore? IndexMetadata? objects to make these configuration decisions restart safe.

See http://sourceforge.net/apps/trac/bigdata/ticket/171

  Changed 23 months ago by thompsonbry

  • version changed from trunk to TERMS_REFACTOR_BRANCH

This issue slipped on the 1.0 release. I've moved it to the TERMS_REFACTOR_RELEASE.

  Changed 23 months ago by thompsonbry

Since we need to issue a dot release for [1], I am going to fold this issue into that dot release as well. The change of to Unicode clean for the schema name is now present in both the maintenance branch for 1.0.0 and the development branch.

[1] https://sourceforge.net/apps/trac/bigdata/ticket/349 (TermIdEncoder? imposes limit of 2B RDF Values on Journal)

  Changed 23 months ago by thompsonbry

  • status changed from accepted to closed
  • resolution set to fixed
Note: See TracTickets for help on using tickets.