Remove variant slowness
When I delete a large proportion of the variants from a project, 'vtools remove variants table' takes a very long time. This is because the query
'DELETE FROM variant WHERE variant_id IN (SELECT variant_id FROM table)'
has to update the indexes for every record, and removing records one by one is slow. A better method would be to
1. create a new table temp (with structure from variant)
2. INSERT INTO temp SELECT * FROM variant WHERE variant_id NOT IN (SELECT variant_id FROM table)
3. DROP TABLE variant;
4. rename temp to variant
5. rebuild indexes.
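A minimal sketch of these steps with Python's sqlite3 module, assuming the project database is SQLite, the master table is called variant with a variant_id column, and the rows to drop are listed in another table of the same database. The function name remove_variants_by_rebuild and the scratch table variant_keep are made up for illustration and are not part of vtools. Note that the copy keeps the variants that are NOT listed in the removal table.

```python
import re
import sqlite3


def remove_variants_by_rebuild(conn, table):
    """Remove every variant listed in `table` by rebuilding `variant`.

    `table` is interpolated as an identifier, so it must be a trusted name.
    """
    cur = conn.cursor()
    # Save the index definitions; DROP TABLE removes the indexes with it.
    cur.execute(
        "SELECT sql FROM sqlite_master "
        "WHERE type = 'index' AND tbl_name = 'variant' AND sql IS NOT NULL")
    index_sql = [row[0] for row in cur.fetchall()]
    # 1. Create a new table with the structure of variant.
    cur.execute(
        "SELECT sql FROM sqlite_master "
        "WHERE type = 'table' AND name = 'variant'")
    create_sql = cur.fetchone()[0]
    new_sql, n = re.subn(r'^CREATE TABLE\s+("?)variant\1',
                         'CREATE TABLE variant_keep', create_sql, count=1)
    if n != 1:
        raise ValueError('unexpected CREATE TABLE statement for variant')
    cur.execute(new_sql)
    # 2. Copy only the variants that are not scheduled for removal.
    cur.execute(
        "INSERT INTO variant_keep SELECT * FROM variant "
        "WHERE variant_id NOT IN (SELECT variant_id FROM {})".format(table))
    # 3. Drop the old table, 4. rename the new one into place.
    cur.execute("DROP TABLE variant")
    cur.execute("ALTER TABLE variant_keep RENAME TO variant")
    # 5. Rebuild the indexes once, over the surviving rows only.
    for stmt in index_sql:
        cur.execute(stmt)
    conn.commit()
```

The point of the design is that indexes are built once at the end rather than updated for every deleted row. It only pays off when a large share of the table is removed, since every surviving row has to be copied.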
Agreed. But I see that the slowest part is removing the variants from samples: each genotype table has to be processed to get rid of them. That takes about 3 hrs on my SSD.
I have found that 'vtools init --parent' is much faster at producing a project with the required variants, so 'vtools remove variants' should only be used when a small number of variants need to be removed (e.g. for quality control).
Removing variants from samples is slow because the genotype tables do not have an index. This can be easily fixed.
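A rough sketch of what such a fix could look like, again with Python's sqlite3 module. It assumes each genotype table is an SQLite table with a variant_id column; the function name remove_variants_from_genotype and the index naming scheme are hypothetical, not taken from vtools. How much the index helps in practice would need to be measured on a real project.

```python
import sqlite3


def remove_variants_from_genotype(conn, geno_table, table):
    """Delete the variants listed in `table` from one genotype table.

    Both names are interpolated as identifiers and must be trusted.
    Returns the number of genotype rows deleted.
    """
    cur = conn.cursor()
    # Index variant_id so rows to delete can be located by lookup.
    cur.execute(
        "CREATE INDEX IF NOT EXISTS {0}_index ON {0} (variant_id)"
        .format(geno_table))
    cur.execute(
        "DELETE FROM {0} WHERE variant_id IN "
        "(SELECT variant_id FROM {1})".format(geno_table, table))
    removed = cur.rowcount
    conn.commit()
    return removed
```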