Menu

Join_filter

silex6

This filter should generate a Cartesian product of records received from input A and B that matches a specified join condition. This filter should have 2 connectors and should be used on the JOIN and WHERE statements. Results should be streamed to the output connector.

Swapping connectors should give same results, but a different operation cost: using connector 1 for A give same result as using connector 1 for B, operation cost can differ, depending of number of rows and cardinalities in A and B.

Algorithms

There should be 3 way to process join operations.

A

Algorithm A: nested loop table scan

Operations: for every row in table A, scan table B and extract matching records.

recordA = tableA.getfirstrecord()

loop A
    if recordA.eof then exit A

    recordB = tableB.getfirstrecord()
    loop B      
        if recordB.eof then exit B

        if recordA.fieldA = recordB.fieldB then
            output.putmessage(concat(recordA, recordB))
        end if

        recordB = tableB.getnextrecord()
    end loop B

    recordA = tableA.getnextrecord()
end loop A

This algorithm does not require indexes. Operation cost is usually more expensive than algorithms B and C. Result record set will not be sorted.

B

Algorithm B: scan and hash join

Operations: for every row in A, search index for matches in B, using the index as a hash table.

recordA = tableA.getfirstrecord()
loop

    if recordA.eof then exit loop A

    recordB = indexB.getfirstkey(recordA.fieldA)

    loop
        if indexB.eof then exit loop B

        if recordA.fieldA

This algorithm does require only one index on join field of table B. result record set will be grouped by B.

C

Algorithm C: merge sorted scans

operations: scan A and B at the same time in ascending order, doing a 'racing' style scanning, and extract matches.

recordA = indexA.getfirstkey()
recordB = indexB.getfirstkey()

loop A
    if recordA.eof and recordB.eof then exit loop

    if recordA.fieldA > recordB.fieldB then
        recordB = indexB.getfirstkey(recordA.fieldA)
    elseif recordA.fieldA

This algorithm requires an index in both join fields. Result record set will be sorted or grouped by field A and B. Operation cost is usually less than the two previous algorithms.

Cost calculation

Cost calculation should depends on the algorithm.

Cost for algorithm A should be: number of rows in A * number of rows in B. this is the same as (unit cost of filter connected to A multiply by unit cost of filter connected to B).

Cost for algorithm B should be: number of rows in A * average cardinality of B * 4 / 3. The average cardinality of a table should be: number of rows / number of values. The number of values could be got from statistical data.

Cost for algorithm C should be: Number of values in A * (average cardinality of B * 4 / 3) *4 / 3 + number of values in B * 4 / 3

See also

[Set_engine]


Related

Wiki: Filters
Wiki: Set_engine

MongoDB Logo MongoDB