String properties may be used to associate a small set of categorical values with a molecule, like {red, green, blue}, or may represent unique identifiers like SMILES. In the first case it would be nice to show the distribution of molecules over the values in the infobar. However, this does not make sense in the latter case.
Since we do not distinguish these different types of string properties, this is currently not supported. There are three solutions to add support:
1. Distinguish these types of properties, which would require storing the type in the database and allowing the user to select the type when importing data.
2. Count the number of different values a string property takes and decide the type heuristically, e.g., properties with fewer than 10 different values are considered categorical.
3. Allow the user to select all string properties and show a warning message for string properties with many different values, which explains the problem and lets the user confirm the choice.
The first solution is cleaner, but the other solutions should be sufficient for most use cases and do not require a more complicated import process. I propose implementing the third solution. What do you think? Are there other parts of the program that would benefit from the first solution?
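The distinct-value check behind the second solution, which could also trigger the warning in the third, might be sketched as follows. This is only an illustration: the class `CategoricalHeuristic`, the method `looksCategorical`, and the parameter `maxDistinct` are hypothetical names and do not exist in the code base.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not part of the existing code.
final class CategoricalHeuristic {

    // Returns true if the property takes fewer than maxDistinct different values.
    static boolean looksCategorical(List<String> values, int maxDistinct) {
        Set<String> distinct = new HashSet<>();
        for (String value : values) {
            distinct.add(value);
            if (distinct.size() >= maxDistinct) {
                // Stop early: identifier-like properties reach the limit quickly.
                return false;
            }
        }
        return true;
    }
}
```

With a limit of 10, a property such as {red, green, blue} would pass the check, while an identifier-like property would be rejected (second solution) or would trigger the warning (third solution).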
Solution 1 is clean, but not very flexible. The type must be selected during import, which is not very practical from a user's point of view, and it adds another layer of complexity. Solution 2 seems a bit opaque and might be interpreted as a bug by the user (why is property XY not shown?). Furthermore, there might be a few scaffold subtrees that contain only a few distinct category values, so that a mapping still makes sense even if the total number is too high.
=> I would go for solution 3 or solution 4 (see below).
Solution 4: Show a dynamic categorical infobar that puts all categories below a frequency threshold into a category "the rest". This category might have a special visual appearance to distinguish it from the other categories (e.g. a striped filling or something like that).
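The grouping step of solution 4 could look roughly like the sketch below. It assumes the threshold is an absolute count; the class `RestBucketing`, the method `bucket`, and the parameter `minCount` are hypothetical names chosen for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the existing code.
final class RestBucketing {

    // Label of the catch-all category. A real implementation would have to
    // avoid clashes with a property value that happens to have the same name.
    static final String REST = "the rest";

    // Counts each value and merges every value that occurs fewer than
    // minCount times into the REST category.
    static Map<String, Integer> bucket(List<String> values, int minCount) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String value : values) {
            counts.merge(value, 1, Integer::sum);
        }
        Map<String, Integer> result = new LinkedHashMap<>();
        int rest = 0;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() >= minCount) {
                result.put(entry.getKey(), entry.getValue());
            } else {
                rest += entry.getValue();
            }
        }
        if (rest > 0) {
            result.put(REST, rest);
        }
        return result;
    }
}
```

Applied per scaffold subtree, this would keep the infobar readable even for properties whose total number of distinct values is too high.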
Some code in the IntervalPanel already implements a mapping for string properties. It could be used by editing only a few lines of code. However, it looks like there are still some issues: the interval comboboxes are perhaps not updated when another string property is selected. [6feefc] enables support for string properties so far.
Related
Commit: [6feefc]
Fixed the remaining problems with [841d1e] and [4a395d]. The feature is implemented now, I think. Please have a look and check whether it looks okay.
Related
Commit: [4a395d]
Commit: [841d1e]
Subtree accumulation does not work (see screenshot: the two-ring node in the center of the image should have some blue in it).
For some string properties the numerical interval panel is shown. Example: Tutorial Dataset / PUBCHEM_MOLECULAR_FORMULA (see screenshots).
Screenshot: PUBCHEM_MOLECULAR_FORMULA in the table view
The last bug fails silently. Can you please look through the code and check whether exceptions are thrown on errors?
The number of distinct string values is limited to 10 (see SinglePropertyPanel, l.427), but PUBCHEM_MOLECULAR_FORMULA contains more than 500 distinct values. This limitation is reasonable, I think, but how should we handle properties that do not satisfy this condition? Just remove them from the dropdown list?
I think that it should always be possible to select every value, but the default color mapping should only include the most frequent values (let's say with a limit of 10) and only values that have a frequency larger than some X (e.g. no singleton values).
Last edit: Till Schäfer 2017-02-17
I misunderstood your proposal at first, sorry. Sounds good. How should we handle the case where a property consists only of singleton values? There must be at least one string interval to select.
Last edit: Philipp Mewes 2017-03-01
I think there is no need to handle this case differently. I would propose to sort the strings with the same frequency (one in case of singletons) lexicographically and just show the top ten.
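The selection of default values discussed above could be sketched as follows. This is one possible reading of the proposal, not the committed implementation: non-singleton values are preferred, all values are used if only singletons exist, and ties are broken lexicographically. The class `DefaultValueSelection`, the method `select`, and the parameter `limit` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the existing code.
final class DefaultValueSelection {

    // Picks at most `limit` values for the default color mapping, given the
    // frequency of each distinct string value.
    static List<String> select(Map<String, Integer> frequencies, int limit) {
        // Prefer values that occur more than once.
        List<Map.Entry<String, Integer>> candidates = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : frequencies.entrySet()) {
            if (entry.getValue() > 1) {
                candidates.add(entry);
            }
        }
        // Assumption: if the property has only singleton values, use them all,
        // so that there is at least one string interval to select.
        if (candidates.isEmpty()) {
            candidates.addAll(frequencies.entrySet());
        }
        // Most frequent first; equal frequencies in lexicographic order.
        candidates.sort((a, b) -> {
            int byFrequency = Integer.compare(b.getValue(), a.getValue());
            return byFrequency != 0 ? byFrequency : a.getKey().compareTo(b.getKey());
        });
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : candidates) {
            if (selected.size() >= limit) {
                break;
            }
            selected.add(entry.getKey());
        }
        return selected;
    }
}
```

With a limit of 10, a property like PUBCHEM_MOLECULAR_FORMULA would contribute at most ten intervals to the panel, no matter how many distinct values it has.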
After some work, I was able to implement the subtree accumulation. See [985f41] for details.
Related
Commit: [985f41]
The DbManager now supports requesting the frequency of distinct string values ([836e4a]).
Implemented the feature in the view ([c85279]). At most 10 (non-singleton) values are displayed now. This also improves performance for properties with many distinct values, since at most 10 intervals have to be added to the respective panel (see PUBCHEM_MOLECULAR_FORMULA).
Handled the special case of singleton-values as proposed ([2d0092] and [2579df]).
Related
Commit: [2579df]
Commit: [2d0092]
Commit: [836e4a]
Commit: [c85279]
Should the manual also be checked for updates related to this feature?
Yes, please.
Updated the manual ([bf9f33]).
Related
Commit: [bf9f33]
Thank you for implementing this feature. Some minor issues still need to be fixed:
Last edit: Nils Kriege 2017-03-08