
#317 Support embedded linefeeds

Milestone: Pascal
Status: closed
Labels: CSV import (39)
Priority: 6normal
Updated: 2023-08-21
Created: 2023-01-31
Creator: TomScat
Private: No

More a question than a bug: I'm trying to open a db and get the error message:
Something went wrong: invalid data: for value dimensioned column: 4.25 at line 8786, col 10

How do I have to understand this?
- Line 8786: does this include the header lines?
- "dimensioned column: 4.25" I'm not sure where to look.

1 Attachments

Related

Ravel: #317

Discussion

  • High Performance Coder

    • Line 8786: does this include the header lines?
      Yes.
    • "dimensioned column: 4.25" I'm not sure where to look.
      You have labelled the column 4.25, rather than "Interest Rate", which might be more helpful. There should be some sort of issue on line 8786. I'm exporting that dataset now to see if it is the same for me.
     
    • Steve Keen

      Steve Keen - 2023-01-31

      This is a very useful database for testing Ravel. It emphasises the need
      for an in situ tool to see and edit data, since this file is too big to
      load into Excel (it truncates the file at its record limit).

      Maybe adding a viewing window to the import routine that shows the
      offending row and its neighbours and allows the user to edit the
      highlighted error?

      Automated cleaning will also be necessary. Yesterday I located a database
      that used a dash "-" for no data. A tool to convert such things (including
      Excel's N/A) into NaNs (or just empty cells) would be great.
      Best, Steve
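
A minimal sketch of the kind of cleaning step suggested above, written in Python and not taken from Ravel: it maps placeholder cells such as a dash or Excel's N/A to NaN before numeric conversion. The `MISSING_MARKERS` set and the `to_number` helper are hypothetical names chosen for illustration.

```python
import math

# Hypothetical set of placeholders treated as "no data"; extend as needed.
MISSING_MARKERS = {"", "-", "N/A", "#N/A", "NA", "NaN"}

def to_number(cell: str) -> float:
    """Return the cell as a float, or NaN if it is a missing-data marker."""
    text = cell.strip()
    if text in MISSING_MARKERS:
        return math.nan
    return float(text)

# Example row mixing real values and placeholders.
row = ["4.25", "-", "N/A", " 1.97 ", ""]
cleaned = [to_number(cell) for cell in row]
```

Cells that are neither numeric nor in the marker set still raise `ValueError`, so genuinely malformed data is not silently hidden.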

      • High Performance Coder

        The error report option floats the errors to the top of the dataset, so you can edit (i.e. fix) the data. Unfortunately, that is not so useful when the dataset is too large for a spreadsheet to import, or even for a text editor (emacs struggled on this dataset).

        I'm not sure that we could successfully add an editing tool that handles these large datasets either, BTW. Usually at this stage, it's a matter of using unix CLI tools or python scripts to get it done.
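
As one possible shape for such a python script (a sketch only, not part of Ravel), the following streams the file and returns a reported line together with its neighbours, so the whole dataset never has to fit in an editor. The function name `lines_around` and its parameters are hypothetical.

```python
from itertools import islice

def lines_around(path, line_no, context=2, encoding="utf-8"):
    """Return (number, text) pairs for line_no and its neighbours (1-based)."""
    first = max(1, line_no - context)
    last = line_no + context
    # The file is read lazily, so only the requested window is kept in memory.
    with open(path, encoding=encoding, errors="replace") as fh:
        window = islice(fh, first - 1, last)
        return [(n, text.rstrip("\r\n")) for n, text in enumerate(window, start=first)]
```

Calling `lines_around("IBRD_Statement_Of_Loans_-_Historical_Data.csv", 8786)` would list lines 8784 to 8788 with their numbers; the count includes the header line, as with the line numbers in the error message.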

         
  • High Performance Coder

    Interesting "bad boy" example. There are multiple duplicate records in this dataset, nearly 100 of which are exact duplicates.

    zen>sort IBRD_Statement_Of_Loans_-_Historical_Data.csv|uniq|wc -l
    1169592
    zen>wc -l IBRD_Statement_Of_Loans_-_Historical_Data.csv 
    1169685 IBRD_Statement_Of_Loans_-_Historical_Data.csv
    

    I chose to average the values of the duplicate records (choices are typically average, max or min, and if they're exactly duplicate, it doesn't matter).
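
To illustrate what averaging duplicates means here, a small Python sketch (an assumption about the general idea, not Ravel's actual implementation): records that share the same key are collapsed into one by taking the mean of their values. The name `average_duplicates` and the (key, value) record layout are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def average_duplicates(records):
    """Collapse (key, value) records that share a key by averaging their values."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return {key: mean(values) for key, values in groups.items()}

# Two records share the same (date, loan) key; one key is unique.
records = [
    (("01/31/2012", "IBRD42460"), 1.0),
    (("01/31/2012", "IBRD42460"), 3.0),
    (("02/29/2012", "IBRD42460"), 5.0),
]
averaged = average_duplicates(records)
```

Swapping `mean` for `max` or `min` gives the other two choices mentioned; for exact duplicates all three return the same value.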

    I selected "ignore" for the trailing date columns, and "data" for the interest rate and for the numerical columns from original principal to loans held.

    I got a "missing data" error on line 184888. This file is rather too large for spreadsheets, and even my text editor was struggling with it, so I examined the 10 lines starting at line 184888:

    zen>tail -n +184888 IBRD_Statement_Of_Loans_-_Historical_Data.csv|head -10
    01/31/2012 12:00:00 AM,IBRD42460,EUROPE AND CENTRAL ASIA,HR,Croatia,Zagrebacka banka d.d.,HR,Croatia,SNGL CRNCY,Disbursed,1.97,,P040139,INVESTMENT RECOVERY,8000000,3734380.61,0,3636810.02,4451700.03,500253.29,0,500253.29,0,0,0,500253.29,03/15/2003 12:00:00 AM,09/15/2012 12:00:00 AM,12/04/1997 12:00:00 AM,11/18/1997 12:00:00 AM,03/17/1998 12:00:00 AM,12/31/2001 12:00:00 AM,09/27/2001 12:00:00 AM
    02/29/2012 12:00:00 AM,IBRD42460,EUROPE AND CENTRAL ASIA,HR,Croatia,"Zagreba
    ka banka d.d.",HR,Croatia,SNGL CRNCY,Disbursed,1.97,,P040139,INVESTMENT RECOVERY,8000000,3734380.61,0,3636810.02,4451700.03,510294.8,0,510294.8,0,0,0,510294.8,03/15/2003 12:00:00 AM,09/15/2012 12:00:00 AM,12/04/1997 12:00:00 AM,11/18/1997 12:00:00 AM,03/17/1998 12:00:00 AM,12/31/2001 12:00:00 AM,09/27/2001 12:00:00 AM
    03/31/2012 12:00:00 AM,IBRD42460,EUROPE AND CENTRAL ASIA,HR,Croatia,"Zagreba
    ka banka d.d.",HR,Croatia,SNGL CRNCY,Disbursed,1.41,,P040139,INVESTMENT RECOVERY,8000000,3734380.61,0,3636810.02,4698768.85,253603.82,0,253603.82,0,0,0,253603.82,03/15/2003 12:00:00 AM,09/15/2012 12:00:00 AM,12/04/1997 12:00:00 AM,11/18/1997 12:00:00 AM,03/17/1998 12:00:00 AM,12/31/2001 12:00:00 AM,09/27/2001 12:00:00 AM
    04/30/2012 12:00:00 AM,IBRD42460,EUROPE AND CENTRAL ASIA,HR,Croatia,"Zagreba
    ka banka d.d.",HR,Croatia,SNGL CRNCY,Disbursed,1.41,,P040139,INVESTMENT RECOVERY,8000000,3734380.61,0,3636810.02,4698768.85,251425.95,0,251425.95,0,0,0,251425.95,03/15/2003 12:00:00 AM,09/15/2012 12:00:00 AM,12/04/1997 12:00:00 AM,11/18/1997 12:00:00 AM,03/17/1998 12:00:00 AM,12/31/2001 12:00:00 AM,09/27/2001 12:00:00 AM
    05/31/2012 12:00:00 AM,IBRD42460,EUROPE AND CENTRAL ASIA,HR,Croatia,"Zagreba
    ka banka d.d.",HR,Croatia,SNGL CRNCY,Disbursed,1.41,,P040139,INVESTMENT RECOVERY,8000000,3734380.61,0,3636810.02,4698768.85,235924.02,0,235924.02,0,0,0,235924.02,03/15/2003 12:00:00 AM,09/15/2012 12:00:00 AM,12/04/1997 12:00:00 AM,11/18/1997 12:00:00 AM,03/17/1998 12:00:00 AM,12/31/2001 12:00:00 AM,09/27/2001 12:00:00 AM
    06/30/2012 12:00:00 AM,IBRD42460,EUROPE AND CENTRAL ASIA,HR,Croatia,"Zagreba
    

    It looks to me like the export program has started inserting spurious line feeds into the data.

    Not sure what to do to correct that. Maybe just process the first 184887 lines?

     
  • High Performance Coder

    Got it imported (first 184887 lines).

     
  • High Performance Coder

    According to RFC 4180, linefeeds are acceptable within a quoted field. Dang - this complicates our CSV parsing dramatically...
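
The difference between physical lines and logical records can be seen in a short Python sketch. It uses the standard library `csv` module, which accepts linefeeds inside quoted fields, and is only an illustration of the rule, not of how Ravel parses CSV. The sample data is a shortened, made-up version of the rows shown above.

```python
import csv
import io

# Three physical lines, but only two logical records:
# the quoted borrower field contains an embedded linefeed.
sample = 'date,borrower,rate\r\n02/29/2012,"Zagreba\nka banka d.d.",1.97\r\n'

# Splitting on line breaks, as a line-oriented parser would do.
physical_lines = sample.splitlines()

# A quote-aware parser keeps the linefeed inside the field.
# newline="" stops the stream from translating line endings itself.
records = list(csv.reader(io.StringIO(sample, newline="")))
```

A parser that reads one line at a time sees the second record cut short, which matches the "missing data" error reported on the first broken line.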

     
  • High Performance Coder

    • summary: Ravel v19 - error message - invalid data --> Support embedded linefeeds
     
  • High Performance Coder

    • labels: --> CSV import
     
  • High Performance Coder

    • status: open --> closed
    • assigned_to: High Performance Coder
     
  • High Performance Coder

    Done

     
