This filter should generate a Cartesian product of records received from input A and B that matches a specified join condition. This filter should have 2 connectors and should be used on the JOIN and WHERE statements. Results should be streamed to the output connector.
Swapping connectors should give same results, but a different operation cost: using connector 1 for A give same result as using connector 1 for B, operation cost can differ, depending of number of rows and cardinalities in A and B.
There should be 3 way to process join operations.
Algorithm A: nested loop table scan
Operations: for every row in table A, scan table B and extract matching records.
recordA = tableA.getfirstrecord()
loop A
if recordA.eof then exit A
recordB = tableB.getfirstrecord()
loop B
if recordB.eof then exit B
if recordA.fieldA = recordB.fieldB then
output.putmessage(concat(recordA, recordB))
end if
recordB = tableB.getnextrecord()
end loop B
recordA = tableA.getnextrecord()
end loop A
This algorithm does not require indexes. Operation cost is usually more expensive than algorithms B and C. Result record set will not be sorted.
Algorithm B: scan and hash join
Operations: for every row in A, search index for matches in B, using the index as a hash table.
recordA = tableA.getfirstrecord()
loop
if recordA.eof then exit loop A
recordB = indexB.getfirstkey(recordA.fieldA)
loop
if indexB.eof then exit loop B
if recordA.fieldA
This algorithm does require only one index on join field of table B. result record set will be grouped by B.
Algorithm C: merge sorted scans
operations: scan A and B at the same time in ascending order, doing a 'racing' style scanning, and extract matches.
recordA = indexA.getfirstkey()
recordB = indexB.getfirstkey()
loop A
if recordA.eof and recordB.eof then exit loop
if recordA.fieldA > recordB.fieldB then
recordB = indexB.getfirstkey(recordA.fieldA)
elseif recordA.fieldA
This algorithm requires an index in both join fields. Result record set will be sorted or grouped by field A and B. Operation cost is usually less than the two previous algorithms.
Cost calculation should depends on the algorithm.
Cost for algorithm A should be: number of rows in A * number of rows in B. this is the same as (unit cost of filter connected to A multiply by unit cost of filter connected to B).
Cost for algorithm B should be: number of rows in A * average cardinality of B * 4 / 3. The average cardinality of a table should be: number of rows / number of values. The number of values could be got from statistical data.
Cost for algorithm C should be: Number of values in A * (average cardinality of B * 4 / 3) *4 / 3 + number of values in B * 4 / 3