Add an “analyse and load” button alongside our current “load” button in the import form. We could then:
(1) identify duplicate columns: things like country code and country will always have the same entries on each row;
(2) identify superfluous columns: things like the code “Q” and the word “Quarterly” would have only one entry in an entire file.
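A minimal sketch of the two checks above, in plain Python rather than Ravel's own code. The function name `analyse_columns` and the sample column names are hypothetical. "Duplicate" is interpreted here as two columns whose values are in one-to-one correspondence across all rows, which is an assumption about what the analysis should test.

```python
import csv
import io
from itertools import combinations


def analyse_columns(rows, header):
    """Return (superfluous, duplicates) for a table given as a list of rows."""
    columns = {name: [row[i] for row in rows] for i, name in enumerate(header)}

    # A column with only one distinct entry carries no information.
    superfluous = [name for name, values in columns.items()
                   if len(set(values)) <= 1]

    # Two columns duplicate each other when their values map one-to-one:
    # the number of distinct (a, b) pairs equals the number of distinct
    # values in each column on its own.
    duplicates = []
    for a, b in combinations(header, 2):
        if a in superfluous or b in superfluous:
            continue
        pairs = set(zip(columns[a], columns[b]))
        if len(pairs) == len(set(columns[a])) == len(set(columns[b])):
            duplicates.append((a, b))
    return superfluous, duplicates


# Hypothetical sample data, made up for illustration only.
SAMPLE = (
    "freq_code,frequency,country_code,country,period,value\n"
    "Q,Quarterly,AU,Australia,2020-Q1,1.0\n"
    "Q,Quarterly,AU,Australia,2020-Q2,1.5\n"
    "Q,Quarterly,US,United States,2020-Q1,2.0\n"
    "Q,Quarterly,US,United States,2020-Q2,2.5\n"
)

reader = csv.reader(io.StringIO(SAMPLE))
header = next(reader)
rows = list(reader)
superfluous, duplicates = analyse_columns(rows, header)
print("superfluous:", superfluous)
print("duplicates:", duplicates)
```

The pairwise comparison is quadratic in the number of columns, which should be acceptable for a file with a few dozen columns but is a design choice of this sketch, not a recommendation.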
There are several options for how we proceed after that. Any dimension with only one entry should obviously be set to ignore; maybe we should also either hide it in the import form, or grey it out so it can’t be selected.
One way to handle duplicated columns would be to colour-code identical ones, and/or have a pop-up window warn users when they mark two identical columns as “axis” or “data”.
The objective is to make sure that as many users as possible can successfully load data into Ravel. We are unlikely to hear from those who try it once or twice and fail to load the data.
This happens now - any axis with a single slice will be removed from the hypercube on loading anyway, regardless of whether the user set it to ignore or axis. Those are not the problem.
As I mentioned in my email response, I'm not convinced this is such a good idea.
I'm inclined to think the solution will have more to do with database backing, where one can extract the slice or rollup desired using the Ravel widget, without having to worry about the 64 bit hypercube limit.
OK, but we do need to warn users when they are likely to breach "curse of dimensionality" limits. Could you include a calculated field showing the data requirements of the current selections? This could be used to advise users to reduce dimensionality on importing by choosing to ignore enough columns while still generating a unique key.
It's a bit more doable. We do, of course, calculate the hypercube size and show an error message when the 64-bit limit is breached. Now that we're reading the whole file in during the metadata specification stage, we can update a display of the hypercube size, and display it in red when it exceeds 2^64.
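As a rough illustration of that calculation, the hypercube size is the product of the number of unique labels on each axis, compared against 2^64. This is a sketch in Python, not Ravel's implementation; `hypercube_size`, `size_report` and the axis names are hypothetical.

```python
from math import prod

LIMIT = 2**64  # the limit named in the discussion


def hypercube_size(unique_labels):
    """Product of the unique-label counts of all axes.

    `unique_labels` maps axis name -> number of distinct labels.
    An axis with a single slice contributes a factor of 1.
    """
    return prod(unique_labels.values())


def size_report(unique_labels):
    """Return the size and whether it should be flagged (shown in red)."""
    size = hypercube_size(unique_labels)
    return {"size": size, "exceeds_limit": size > LIMIT}


# Hypothetical axis counts, for illustration only.
print(size_report({"country": 50, "period": 350, "sector": 12}))
```

Python integers are arbitrary precision, so the product cannot overflow here; an implementation using fixed-width integers would need to guard against overflow while multiplying.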
Good. Even a minimum size warning would help, since 2^64 is a much bigger
number than most people appreciate. Something that gave users an indication
of potential data cube size? Maybe a box like this:
Dimensions:
Max number: 21 (this is the initial size of the BIS database)
Est. minimum for unique key: (user entered--7 for the BIS)
Max values per dimension: 350 (that's roughly the number of quarters)
Min values per dimension: 2 (Adjusted or unadjusted for breaks)
Estimated data cube size (calculated by Ravel):
Max Dimensions:
Min Dimensions:
The size difference between the max and min should be enough to let users
know that they should select for the smallest possible number of dimensions.
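To make the gap concrete, here is a small sketch using the figures from the box above (21 or 7 dimensions, between 2 and 350 values per dimension). The function `cube_size_bounds` is hypothetical, and treating every dimension as having the minimum or the maximum number of values is a simplifying assumption that only gives lower and upper bounds.

```python
LIMIT = 2**64


def cube_size_bounds(n_dims, min_values=2, max_values=350):
    """Lower and upper bound on cube size for n_dims dimensions."""
    return min_values**n_dims, max_values**n_dims


for n in (21, 7):
    low, high = cube_size_bounds(n)
    flag = "exceeds 2^64" if high > LIMIT else "within 2^64"
    print(f"{n} dimensions: between {low:.3g} and {high:.3g} cells ({flag})")
```

With these figures, 7 dimensions stay below 2^64 even in the worst case (350^7 is roughly 6.4 x 10^17), while 21 dimensions at 350 values each is far beyond it.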
The idea I think we've agreed to here is to add the number of unique labels in each axis as another row in the data selection tab, along with the total product of all axes, rendered in red if exceeding the 64 bit address boundary.
Should be done as part of the import form refactor started by Niels.