Hello Simon,
After testing subqueries (where I ran into some performance problems), and knowing that you may implement table joins,
I'm wondering whether you have also planned anything for indexes and a direct data access path.
Maybe a system column LINENUM (type = string) that would be a unique identifier for each row in a table.
It would be usable like any other column in a select statement, and would permit direct access to the specified row when used in:
select * from table
where LINENUM='12'
or LINENUM IN (subquery)
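One way such a direct access path could be implemented internally is to record the byte offset of every line during the first sequential scan, and then seek straight to a row on later reads. A minimal sketch, not anything CsvJdbc actually does today (the class name LineOffsetIndex is invented here):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Sketch of a direct access path for a CSV file: one sequential pass
// records the byte offset of every line, after which any row can be
// fetched with a single seek instead of a full scan.
public class LineOffsetIndex {
    private final List<Long> offsets = new ArrayList<>();
    private final Path csvFile;

    public LineOffsetIndex(Path csvFile) throws IOException {
        this.csvFile = csvFile;
        long pos = 0;
        offsets.add(pos);                 // row 0 starts at offset 0
        for (byte b : Files.readAllBytes(csvFile)) {
            pos++;
            if (b == '\n') {
                offsets.add(pos);         // next row starts after the newline
            }
        }
    }

    // Fetch row n (0-based) by seeking directly to its start offset.
    public String row(int n) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(csvFile.toFile(), "r")) {
            raf.seek(offsets.get(n));
            return raf.readLine();        // adequate for ASCII CSV data
        }
    }
}
```

The index itself still costs one full scan to build, but every later `WHERE LINENUM=...` lookup becomes a single seek.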
An index tab_colx_idx could be created as
select
colx,string_agg(LINENUM,',') LineNumList
from tab
group by colx
Then
Select * from tab
where colx=234
could be replaced by
Select * from tab
where colx=234
and LINENUM in ( select string_split(LineNumList,',') from tab_colx_idx where colx=234)
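The tab_colx_idx table above is essentially a value-to-row-numbers map, and inside the driver the same thing could be held in memory as a hash index built in one pass. A hedged sketch of that idea (the class name ColumnIndex is invented, not part of CsvJdbc):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-memory equivalent of the proposed tab_colx_idx: map each value of
// the indexed column to the list of row numbers where it occurs, so a
// WHERE colx=? predicate only touches the matching rows.
public class ColumnIndex {
    private final Map<String, List<Integer>> index = new HashMap<>();

    // rows: parsed CSV rows; col: position of the indexed column.
    public ColumnIndex(List<String[]> rows, int col) {
        for (int i = 0; i < rows.size(); i++) {
            index.computeIfAbsent(rows.get(i)[col], k -> new ArrayList<>()).add(i);
        }
    }

    // Equivalent of: select LineNumList from tab_colx_idx where colx = value
    public List<Integer> rowsFor(String value) {
        return index.getOrDefault(value, Collections.emptyList());
    }
}
```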
The same mechanism could also be used to improve performance for correlated subqueries like:
Select
...
(Select tab.coly from tab where main.id=tab.colx)
from
main
where exists
(Select 1 from tab where main.id=tab.colx)
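For the EXISTS predicate in particular, a single pass over tab could collect the distinct colx values into a hash set, after which each main row is checked in constant time instead of rescanning tab. A sketch under those assumptions (ExistsFilter is an invented name):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hash semi-join: materialize the distinct join keys of the inner table
// once, then the correlated EXISTS check becomes a Set.contains() call.
public class ExistsFilter {
    // Equivalent of: select * from main where exists
    //   (select 1 from tab where main.id = tab.colx)
    public static List<String> filter(List<String> mainIds, List<String> tabColx) {
        Set<String> keys = new HashSet<>(tabColx);   // one scan of tab
        return mainIds.stream()
                      .filter(keys::contains)        // O(1) per main row
                      .collect(Collectors.toList());
    }
}
```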
Such a query will not be fast, because CsvJdbc still has to read the CSV file sequentially from the start to find the matching row. There is no way to jump directly to the 1000th row of a CSV file.
I think the only possibilities to make queries faster are:
- divide your data into separate CSV files based on the column you use for queries, and then query only the relevant file. However, this is not very flexible. For example, with column idx used for querying:
select * from t_4_thousand where idx=4678
select * from t_6_thousand where idx=6324
select * from t_8_thousand where idx=8412
- use a real database and create indexes
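The partitioning suggestion boils down to a routing function from the query column to a file name. A tiny sketch; the 2000-wide buckets and the t_N_thousand naming are assumptions inferred from the example table names above:

```java
// Sketch of the "partition by query column" suggestion. The 2000-wide
// buckets and the t_N_thousand naming scheme are assumptions taken
// from the example table names, not anything CsvJdbc provides.
public class CsvPartitioner {
    // Maps an idx value to the partition table holding its rows,
    // e.g. idx in [4000, 6000) is routed to t_4_thousand.
    public static String tableFor(int idx) {
        return "t_" + (idx / 2000) * 2 + "_thousand";
    }
}
```

The application then substitutes `tableFor(idx)` into the query text before handing it to the driver.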
This was just an architectural question about improving correlated subqueries (and future joins) in CsvJdbc.
LINENUM is useless if it is resolved with a full table scan when used in queries like:
select
(select x from sub where Linenum=BtreeSearchFor(sub_id_index,main.id))
from main
I understand that CsvJdbc has to read the CSV file sequentially from the start to get all rows into memory.
But isn't there a way to index that in-memory data for subsequent reads?
I like CsvJdbc and I don't want to use any other database for the tools I have developed so far.
I'm just trying to push the current limits of CsvJdbc.
Have you thought about:
- a multicore computing model,
- data caching,
- indexing, e.g. using http://algs4.cs.princeton.edu/62btrees/BTree.java.html,
- hash join algorithms (as used in other RDBMSs),
- partitioning,
- ...
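Of the items on this list, the hash join is probably the simplest to sketch: build a hash table on the smaller input, then probe it while streaming the larger one, giving O(n + m) work instead of the O(n * m) nested-loop scan a correlated subquery implies. A generic illustration, not CsvJdbc code:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Classic in-memory hash join: build a hash table on the smaller input,
// then probe it while streaming the larger one.
public class HashJoin {
    // Each row is {key, payload}. Returns {buildPayload, probePayload}
    // pairs for every pair of rows with a matching key.
    public static List<String[]> join(List<String[]> build, List<String[]> probe) {
        Map<String, List<String>> table = new HashMap<>();
        for (String[] row : build) {                       // build phase
            table.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }
        List<String[]> out = new ArrayList<>();
        for (String[] row : probe) {                       // probe phase
            for (String payload : table.getOrDefault(row[0], List.of())) {
                out.add(new String[]{payload, row[1]});
            }
        }
        return out;
    }
}
```

The build side must fit in memory; when it does not, real databases fall back to partitioned (grace) hash join variants.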
Last edit: PAscal 2015-09-27