DISTOD
Distributed discovery of bidirectional order dependencies
The DISTOD data profiling algorithm is a distributed algorithm that discovers bidirectional order dependencies (in set-based form) from relational data. DISTOD is based on the single-threaded FASTOD-BID algorithm [1], but it scales elastically to many machines and outperforms FASTOD-BID by up to orders of magnitude.

Bidirectional order dependencies (bODs) capture order relationships between lists of attributes in a relational table. For example, they can express that sorting books by publication date in ascending order also sorts them by age in descending order. Knowledge of such order relationships is useful for many data management tasks, such as query optimization, data cleaning, and consistency checking. Because the bODs of a specific dataset are usually not given explicitly, they need to be discovered. Discovering all minimal bODs (in set-based canonical form) is a task with exponential complexity in the number of attributes.
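To make the book example concrete, the following minimal sketch checks whether a single candidate bOD holds on a small table. It is a naive pairwise validity check for illustration only, not DISTOD's or FASTOD-BID's discovery procedure; the function name `holds_bod`, its parameters, and the sample `books` data are all hypothetical.

```python
def holds_bod(rows, lhs, rhs, rhs_descending=False):
    """Naive check (illustrative, not DISTOD): does ordering `rows` by
    `lhs` ascending also order them by `rhs` (ascending by default,
    descending if `rhs_descending` is True)?"""
    for s in rows:
        for t in rows:
            # Only pairs where s comes no later than t in the lhs order matter.
            if s[lhs] <= t[lhs]:
                if rhs_descending:
                    ordered = s[rhs] >= t[rhs]
                else:
                    ordered = s[rhs] <= t[rhs]
                if not ordered:
                    return False
    return True


# Hypothetical sample data: newer books have a smaller age.
books = [
    {"title": "A", "published": 1995, "age": 30},
    {"title": "B", "published": 2010, "age": 15},
    {"title": "C", "published": 2010, "age": 15},
    {"title": "D", "published": 2021, "age": 4},
]

# Publication date ascending orders age descending, but not ascending.
asc_to_desc = holds_bod(books, "published", "age", rhs_descending=True)
asc_to_asc = holds_bod(books, "published", "age")
```

This check compares every pair of rows for one fixed candidate. A discovery algorithm instead has to consider candidates over all attribute combinations, which is where the exponential complexity in the number of attributes comes from.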