Menu

#10 Feature: Support Per-Column Date/Timestamp Format Transformations via String-Cast Ingestion

open
nobody
2026-06-06
2026-05-30
Anonymous
No

Originally created by: DarekDan
Originally owned by: SouravRoy-ETL

This feature request stems from observations in [#3].

Currently, DuckDB imposes a strict limitation where it only supports one global date and timestamp format configuration per file during data ingestion. If a dataset contains multiple date or timestamp columns that utilize completely different string formats (for example, one column using dd/mm/yy and another using yyyy-mm-dd), DuckDB cannot natively auto-detect or parse both correctly at the same time.

To bypass this natively in DuckDB, users have to read columns as strings and parse them manually post-ingestion.

Proposed Solution
Introduce a new, explicit transformation layer within duckle that implements a two-step parsing strategy for date/timestamp columns:

Staged Ingestion: Read the specified date or timestamp columns from the source file explicitly as standard text (VARCHAR/STRING), preventing DuckDB's automatic global parsing from failing or misinterpreting the values.

Post-Read Conversion: After or during the initial ingestion phase, apply an explicit type-casting transformation using DuckDB's strptime() function (or an equivalent mapping method), passing a user-specified format string uniquely configured for each individual column.

This approach will decouple format specifications from the file-level constraints and permit true per-column format specification.

Proposed UI / UX Adjustments
Schema Mapping View: When a user updates a column type to Date or Timestamp in the UI schema editor, provide an optional text field to supply a custom format string (e.g., %d/%m/%Y).

Under the Hood: If a custom format is supplied, duckle should construct the internal DuckDB query to ingest that specific column as a VARCHAR initially, appending a SQL transformation rule: strptime(column_name, 'user_format_string')::DATE.

Additional Context
This addresses a common real-world limitation of DuckDB auto-detection handling complex schemas, as mentioned in the conversation on [#3].

Related

Tickets: #3

Discussion

  • Anonymous

    Anonymous - 2026-06-01
     
  • Anonymous

    Anonymous - 2026-06-04

    Originally posted by: SouravRoy-ETL

    Implemented the solution.

     
  • Anonymous

    Anonymous - 2026-06-06

    Originally posted by: mitslabo

    I was able to get it working by specifying a FORMAT in the Schema tab of sync.csv.

    However, after the pipeline is executed once, the setting is removed from the plan.

    Before execution (with FORMAT specified)

    CREATE OR REPLACE VIEW "n_mq1lgav8_nfgv3" AS
    SELECT * REPLACE (
        try_strptime("date", '%Y-%m-%d')::DATE AS "date"
    )
    FROM read_csv_auto(
        '${DUCKLE_PATH}',
        header=true,
        delim=' ',
        quote='"',
        encoding='utf-8',
        dateformat='%Y-%m-%d',
        timestampformat='%Y-%m-%dT%H:%M:%S+%Z',
        types = {'date': 'VARCHAR', 'remote_addr': 'VARCHAR'}
    );
    

    After execution

    The generated plan changes to:

    -- VIEW · TSV (n_mq1lgav8_nfgv3)
    CREATE OR REPLACE VIEW "n_mq1lgav8_nfgv3" AS
    SELECT * FROM read_csv_auto(
        '${DUCKLE_PATH}',
        header=true,
        delim=' ',
        quote='"',
        encoding='utf-8',
        dateformat='%Y-%m-%d',
        timestampformat='%Y-%m-%dT%H:%M:%S+%Z',
        types = {'date': 'DATE', 'remote_addr': 'VARCHAR'}
    );
    

    The try_strptime() conversion is removed, and the column type is changed from VARCHAR to DATE.
    As a result, the original FORMAT setting appears to be lost after execution. Is this expected behavior, or is it a bug?

     

Log in to post a comment.

Auth0 Logo