Originally created by: DarekDan
Originally owned by: SouravRoy-ETL
This feature request stems from observations in [#3].
Currently, DuckDB imposes a strict limitation where it only supports one global date and timestamp format configuration per file during data ingestion. If a dataset contains multiple date or timestamp columns that utilize completely different string formats (for example, one column using dd/mm/yy and another using yyyy-mm-dd), DuckDB cannot natively auto-detect or parse both correctly at the same time.
To bypass this natively in DuckDB, users have to read columns as strings and parse them manually post-ingestion.
Proposed Solution
Introduce a new, explicit transformation layer within duckle that implements a two-step parsing strategy for date/timestamp columns:
Staged Ingestion: Read the specified date or timestamp columns from the source file explicitly as standard text (VARCHAR/STRING), preventing DuckDB's automatic global parsing from failing or misinterpreting the values.
Post-Read Conversion: After or during the initial ingestion phase, apply an explicit type-casting transformation using DuckDB's strptime() function (or an equivalent mapping method), passing a user-specified format string uniquely configured for each individual column.
This approach will decouple format specifications from the file-level constraints and permit true per-column format specification.
Proposed UI / UX Adjustments
Schema Mapping View: When a user updates a column type to Date or Timestamp in the UI schema editor, provide an optional text field to supply a custom format string (e.g., %d/%m/%Y).
Under the Hood: If a custom format is supplied, duckle should construct the internal DuckDB query to ingest that specific column as a VARCHAR initially, appending a SQL transformation rule: strptime(column_name, 'user_format_string')::DATE.
Additional Context
This addresses a common real-world limitation of DuckDB auto-detection handling complex schemas, as mentioned in the conversation on [#3].
Originally posted by: SouravRoy-ETL
Implemented the solution.
Originally posted by: mitslabo
I was able to get it working by specifying a FORMAT in the Schema tab of
sync.csv.However, after the pipeline is executed once, the setting is removed from the plan.
Before execution (with FORMAT specified)
After execution
The generated plan changes to:
The
try_strptime()conversion is removed, and the column type is changed fromVARCHARtoDATE.As a result, the original FORMAT setting appears to be lost after execution. Is this expected behavior, or is it a bug?