Menu

Data direct path access (for Indexing)

PAscal
2015-09-23
2015-09-27
  • PAscal

    PAscal - 2015-09-23

    Hello Simon,

    After testing subqueries (where I met some performance problems) and knowing that you may implement table joins,
    I'm wondering if you also planned something for indexes and data direct access path.

    Maybe that a system column LINENUM (type = string) that would be an unique identifier for each row in a table.
    Would be usable as any other column in a select statement and permit a direct access to the specified LINENUM row when used in

    select * from table
    where LINENUM='12'
    or LINENUM IN (subquery)

    An index tab_colx_idx could be created as

    select
    colx,string_agg(LINENUM,',') LineNumList
    from tab
    group by colx

    Then

    Select * from tab
    where colx=234

    could be replaced by

    Select * from tab
    where colx=234
    and LINENUM in ( select string_split(LineNumList,',') from tab_colx_idx where colx=234)

    should be also used to improve performances for subqueries like

    Select
    ...
    (Select tab.coly from tab where main.id=tab.colx)
    from
    main
    where exists
    (Select 1 from tab where main.id=tab.colx)

     
  • Simon Chenery

    Simon Chenery - 2015-09-27

    The query

    select * from table where LINENUM='1000'
    

    will not be fast because CsvJdbc still has to read the CSV file sequentially from the start to find the matching row. There is no way to jump directly to 1000th row in the CSV file.

    I think the only possibilities to make queries faster are:

    1. divide your data into separate CSV files based on the column you use for queries and then query only that table. However, this is not very flexible. For example, with column idx used for querying:

      select * from t_4_thousand where idx=4678

      select * from t_6_thousand where idx=6324

      select * from t_8_thousand where idx=8412

    2. use a real database and create indexes

     
  • PAscal

    PAscal - 2015-09-27

    This was just an architectural question to improve correlated subqueries (and future joins) in csvjdbc.

    LINENUM is useless if performed with a full table scan when used in queries like:

    select
    (select x from sub where Linenum=BtreeSearchFor(sub_id_index,main.id))
    from main

    I understand that CsvJdbc has to read le CSV file sequentially from the start to get all rows in memory.
    But isn't it a way to get that memory indexed for the next reads ?

    I like CsvJdbc and I don't want to use any other database for the tools I have developped yet.
    I'm just searching to push the actual limits of CsvJdbc.

    Have you think about:
    - multicore computing model,
    - data Caching,
    - Indexing using http://algs4.cs.princeton.edu/62btrees/BTree.java.html,
    - Using Hash Join algorithms (as performed in other rdbms)
    - Partitionning
    - ...

     

    Last edit: PAscal 2015-09-27

Log in to post a comment.