From: Ashutosh B. <ash...@en...> - 2013-10-03 12:33:53
Hi All,

I am working on supporting materialized views (called MVs henceforth) in XC. Here are some notes I have written; please provide your comments. I am leaning towards creating MVs at the coordinators.

Any MV implementation needs to provide the following functionality:
1. Creation of an MV (bare creation, without population)
2. Population of an MV
3. Querying an MV
4. Dropping an MV
5. Altering an MV
6. Creating indexes on an MV

The following write-up discusses various approaches for implementing MVs in XC.

1. Create the materialized view only on the coordinators

While creating the materialized view, the CREATE MATERIALIZED VIEW statement is turned into a CTAS query, and the new table is created as a materialized view table. In XC, CTAS is implemented by rewriting the statement as CREATE TABLE ...; INSERT INTO .... Hence we will need to stop the MV's CTAS from being rewritten.

A. Creation of the MV

At the time of creation of the MV table, we will need to create the table locally at the coordinator; the subsequent heap_insert is expected to populate this local table automatically. By propagating the CREATE MATERIALIZED VIEW statement to each of the coordinators (and not to the datanode/s) we will get the same effect. More details below. We will need to set the distribution info of the MV to NULL to indicate that it is a local table. Given that the catalog tables are created locally on each of the coordinators, we should be able to create the MV locally on the coordinators.

B. Population of the MV (REFRESH MATERIALIZED VIEW command)

There are two approaches here.

i. Send the REFRESH MATERIALIZED VIEW command to each of the coordinators, and let every coordinator populate its MV table separately. This means that each coordinator fires the same query and gets the same result, which is a waste of resources. In a sane MV deployment, however, we do not expect MVs to be refreshed frequently, so incurring this waste once in a while should not be a real overhead. Given that this approach needs very little effort, it might be acceptable in version 1. The planner code is sprinkled with !IsConnFromCoord() checks, under which we do not create RemoteQueryPaths for remote relations. For population of the MV we need to lift this restriction; I am not sure yet how to distinguish normal scans from scans performed to refresh an MV.

ii. Run REFRESH MATERIALIZED VIEW on the coordinator where the command is issued, and send the data to the other coordinators using COPY or some bulk protocol (see the second sketch below). This approach needs some extra effort for constructing the COPY command and the corresponding data file to be sent over to the other coordinators. It also needs some changes to REFRESH MATERIALIZED VIEW handling so that such data can be sent along with it. We may try this approach in version 2.

C. Querying the MV

Nothing more needs to be done here. In most places we already treat tables with no distribution info as local tables, so this should not be difficult. There may be corner cases; testing will be required to find them.

D. ALTER MATERIALIZED VIEW

I do not think any change is required here, but we will need to test to make sure that things work as expected. We should propagate the ALTER MATERIALIZED VIEW command to all the coordinators and not to the datanodes.

E. Creating indexes on the MV

Given that we can create indexes on the catalog tables, it should be possible to create indexes on the MV tables as well. Testing will be needed to make sure this works. The CREATE INDEX command will need to be propagated to all the coordinators and not to the datanodes.

Materialized views are a query caching technique: usually a query is served entirely by a single materialized view or by a few of them. So, having them created on the coordinator/s saves one trip to the datanodes. This will also allow us to take a step towards the coordinator-datanode merge.
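To make approach 1 concrete, here is a minimal SQL sketch of the intended flow. The table sales, the view sales_summary and the index name are hypothetical, used only for illustration:

    -- Issued at any coordinator; propagated to the other coordinators
    -- only, never to the datanodes. The CTAS rewrite into
    -- CREATE TABLE + INSERT is suppressed, and the distribution info
    -- of the new relation is left NULL (a coordinator-local table).
    CREATE MATERIALIZED VIEW sales_summary AS
        SELECT region, sum(amount) AS total
        FROM sales
        GROUP BY region;

    -- Approach B.i: sent to every coordinator; each one runs the
    -- defining query against the datanodes and fills its local copy.
    REFRESH MATERIALIZED VIEW sales_summary;

    -- Propagated to all the coordinators, not to the datanodes.
    CREATE INDEX sales_summary_region_idx ON sales_summary (region);

    -- Served entirely from the coordinator-local heap; no datanode trip.
    SELECT region, total FROM sales_summary;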
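For approach B.ii, the flow could look roughly like the following. This is a conceptual sketch only: plain SQL COPY does not accept a materialized view as a target, so the real implementation would have to go through the internal COPY machinery rather than user-visible statements:

    -- On the initiating coordinator: refresh once, then stream the
    -- fresh contents out in bulk.
    REFRESH MATERIALIZED VIEW sales_summary;
    COPY (SELECT * FROM sales_summary) TO STDOUT (FORMAT binary);

    -- On every other coordinator: load the streamed rows into the
    -- local MV table and mark it scannable (internal handling; COPY
    -- into an MV is rejected at the SQL level).
    COPY sales_summary FROM STDIN (FORMAT binary);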
2. Create the MV definition on both the coordinators and the datanodes, but keep the data on the datanode/s

As above, the CREATE MATERIALIZED VIEW statement is turned into a CTAS query and the new table is created as a materialized view table. Since XC implements CTAS by rewriting the statement as CREATE TABLE ...; INSERT INTO ...., we need to filter MVs out of this rewriting.

A. Creating the MV

Pass the MV definition to the datanodes and all the coordinators. This would be done in two phases: first we create just the definition on all the nodes; in the second phase, the MV table created on the datanodes is populated using the method below.

B. Populating the MV

We need to devise a special INSERT INTO protocol which runs the associated query on the initiating coordinator and inserts the results into the MV table at the datanodes (see the first sketch at the end of this mail). REFRESH MATERIALIZED VIEW would be turned into such an INSERT INTO at the initiating coordinator, while every other coordinator would get a plain REFRESH command so that its MV is marked scannable without the query being fired again.

C. Querying the MV

As for normal tables, we will have to create corresponding RemoteQueryPaths for MV relations.

D. ALTER MATERIALIZED VIEW

Passing ALTER MATERIALIZED VIEW down to the datanodes should work, but it needs some testing to make sure it behaves correctly.

E. Creating indexes on the MV

This looks straightforward, since we can create indexes on the datanodes. Needs some testing.

If we create MVs on the datanodes and do not allow a distribution to be specified, those MVs need to be replicated by default (since they are stable by definition), as opposed to the default distribution by hash. Given that materialized views are a query caching technique, we expect a query to be served entirely by one materialized view or a few of them. In that case, having materialized views on the datanode/s adds an unnecessary hop from the coordinator to the datanode.

Distributing MV
---------------

If we need to create a materialized view only on selected nodes (coordinators/datanodes), we will need to
a. extend the CREATE/ALTER MATERIALIZED VIEW syntax to specify the nodes on which to create the MV (see the second sketch at the end of this mail), and
b. disallow REFRESH on the nodes where the MV does not exist.
We may take up this extension in version 2.

What happens to an MV when a node is added or dropped?
------------------------------------------------------

We need to test this. Having the MV at the coordinator helps here, since adding or removing a datanode does not affect the MV. While adding a coordinator, the catalog replication itself would create the MV. Dropping a node should not be a problem. We need to test the feature.
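A conceptual sketch of the approach 2.B population protocol, reusing the hypothetical sales_summary example. The INSERT shown is the internal rewrite performed by the initiating coordinator, not a statement a user would run:

    -- At the initiating coordinator: REFRESH is rewritten into an
    -- INSERT that runs the defining query locally and ships the
    -- result rows to the MV table on the datanodes.
    INSERT INTO sales_summary
        SELECT region, sum(amount) AS total
        FROM sales
        GROUP BY region;

    -- At every other coordinator: a plain REFRESH only marks the MV
    -- as scannable; the defining query is not fired again.
    REFRESH MATERIALIZED VIEW sales_summary;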
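For the "Distributing MV" extension, the syntax could mirror the TO NODE clause that XC already has for CREATE TABLE. This is a hypothetical sketch, not settled syntax; coord1 and coord2 are made-up node names:

    -- Create the MV only on the named nodes; REFRESH would be
    -- disallowed everywhere else.
    CREATE MATERIALIZED VIEW sales_summary AS
        SELECT region, sum(amount) AS total
        FROM sales
        GROUP BY region
        TO NODE (coord1, coord2);

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company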