
#460 add option to sort/merge with multiple processes/threads

GCSORT
open
None
5 - default
2024-06-23
2024-01-24
No

The idea is to split the file into chunks, sort each in a separate process, and then, back in the main process, do a merge/sort.

Using threads instead of processes may be easier, but likely harder to achieve in a portable way.
The user would specify with new commands (or command line switches) how many processes/threads to use (possibly with a "max" named value that queries the number of processors) and how many records each should handle.
The use of the latter: CPU cycles, and therefore sorting time, are data-dependent, so each chunk will quite likely take a different amount of time. In this case it could be useful to have, for example, 8 processes/threads each sort 10000 records and, when finished, take the next 10000 (either directly or in a new process/thread); when every sub-process/thread is done, do the final sort/merge. For a lot of sorted chunks it could even be useful to do that merge multi-threaded.
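As a sketch of the scheme above (in Python for brevity; nothing here is GCSORT code, and the function name, chunk size, and worker count are illustrative): a pool of workers sorts fixed-size chunks, a worker that finishes one chunk automatically picks up the next, and the sorted runs are merged at the end.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def parallel_sort(records, key, chunk_size=10_000, workers=8):
    """Sort fixed-size chunks in a worker pool, then merge the sorted runs.

    A worker that finishes a chunk picks up the next pending one, which
    matches the "take the next 10000 when over" idea: uneven chunk times
    balance out automatically.
    """
    chunks = [records[i:i + chunk_size]
              for i in range(0, len(records), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        runs = list(pool.map(lambda c: sorted(c, key=key), chunks))
    # Final pass back in the main process: k-way merge of sorted runs.
    return list(heapq.merge(*runs, key=key))
```

Note that with CPython threads the sorting itself is still serialized by the interpreter lock; a real implementation would use processes or native (C) threads for true CPU parallelism, which is exactly the portability trade-off discussed above.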

@smenna What do you think? If you like the idea but won't do it yourself, then we could ask the OCamlPro folks.

Discussion

  • Mickey White

    Mickey White - 2024-01-24

I hate to butt in, but wouldn't the DISK (if the files are on disk) be an issue? Will there be 8 reads and 8 writes to the same disk at the same time, or would all 8 (processes/threads) work in memory?
I remember back in the Mainframe IMS days we would do something similar. Our large database was on about 12 big disk drives that the IMS randomizer would split the DB across. Then, using the randomizer with the key for the sequential flat files, we knew we could work on all 16 disks at the same time. We could do a reload quicker... sorts... whatever.
    Just a question/thought.
    Feel free to delete this comment, but send me your thoughts and corrections please

     

    Last edit: Mickey White 2024-01-25
  • Sauro Menna

    Sauro Menna - 2024-01-31

    Hi,
    I am preparing feedback on current GCSort times and a parallelism hypothesis for sort. I will forward the result of the analysis.
Splitting into blocks at read time is only possible for fixed-length files; for variable-length files there will need to be a single Reader and 'n' Sorters.
    Give me a few more days.
In the experiments I initially performed, I found memory-mapped files to perform very well in both writing and reading.
    It must be considered that the final write will be unique, so there will be:
    - a Reader
    - n Sorter
    - a Writer
This will be evident from the feedback I will send.
    Best Regards.
    Sauro

    • Sauro Menna

      Sauro Menna - 2024-02-23

      Hi,
      attached is a document where there are some considerations on the topic of multi threading.
      I have reported some real execution cases and a projection using threads.
      The greatest benefits are obtained with sorting with a large number of sorting keys.
      Performance with CPU and I/O parallelisms must be considered.
Thread usage for sort is triggered when a read block ends, so the thread sorts while the reader reads the next chunk; the read time must therefore be added to the execution time of the sort phase. The threads work in parallel but end one after the other, since they start at different times. That is why multi-threading is worth considering only for sorts with multiple keys; when the sort time is less than the read time, you run the risk of effectively having a single thread.
      I will do some more analysis on the subject.
      Best regards.
      Sauro
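The reader/sorter overlap described above could be sketched roughly like this (Python for brevity; the queue-based hand-off and the sentinel scheme are an assumption for illustration, not how gcsort is implemented). A single Reader pushes blocks, and sorter threads consume them, so sorting one block overlaps reading the next:

```python
import heapq
import queue
import threading

def overlapped_sort(read_blocks, key, sorters=4):
    """Reader/sorter pipeline: sorting a block overlaps reading the next."""
    blocks = queue.Queue(maxsize=sorters)   # bounded: back-pressure on reader
    runs, lock = [], threading.Lock()

    def sorter():
        while True:
            block = blocks.get()
            if block is None:               # sentinel: reader is finished
                break
            run = sorted(block, key=key)
            with lock:
                runs.append(run)

    threads = [threading.Thread(target=sorter) for _ in range(sorters)]
    for t in threads:
        t.start()
    for block in read_blocks:               # the single Reader
        blocks.put(block)
    for _ in threads:                       # one sentinel per sorter thread
        blocks.put(None)
    for t in threads:
        t.join()
    # Threads finish staggered, as noted above; merge their sorted runs.
    return list(heapq.merge(*runs, key=key))
```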

      • Simon Sobisch

        Simon Sobisch - 2024-02-23

Thank you for your thoughts and tests. I think the "single reader" would not be necessary when applying a trick. Example layout: LSQ files and a "read split" at 100k:

        thread 1 is started to read "around 100k", it starts and handles the first records
        thread 2 is started in the meantime to read "around 100k". It positions itself at 100k, then reads until it finds a line feed, then starts from there, reading at least until 200k.

        thread 1 sometime hits its 100k limit in reading, it then reads until the record ends, which is the same place that thread 2 started to handle the records.

That sounds simple - would it solve the non-fixed-length issue? Note: with indexed files the multi-read can only work if the number of records is known and positioning by record number is possible.
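A rough sketch of this splitting trick (Python for brevity; `split_ranges` and `read_range` are hypothetical names, not gcsort functions). Each boundary is moved forward past the next line feed, so every thread starts on a record boundary and the previous thread finishes the record that crosses its limit:

```python
import os

def split_ranges(path, n):
    """Compute per-thread (start, end) byte ranges for a line-sequential file.

    Each interior boundary is aligned forward past the next line feed,
    implementing the "position at 100k, read until a line feed, start from
    there" trick described above.
    """
    size = os.path.getsize(path)
    bounds = [i * size // n for i in range(1, n)]
    aligned = [0]
    with open(path, "rb") as f:
        for b in bounds:
            f.seek(b)
            f.readline()            # skip the partial record at the boundary
            aligned.append(f.tell())
    aligned.append(size)
    return list(zip(aligned, aligned[1:]))

def read_range(path, start, end):
    """Read whole records starting in [start, end); the last record read is
    the one that crosses `end`, matching where the next thread started."""
    records = []
    with open(path, "rb") as f:
        f.seek(start)
        while f.tell() < end:
            records.append(f.readline())
    return records
```

Every record is read exactly once: a thread keeps reading until it passes its end offset, which is precisely the aligned position where the next thread began.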

Your conclusion that the sorting itself is the biggest issue (when memory-mapped files are used), and that multi-thread "max sort thread time" vs. single-thread "sum of sort times" is the biggest part of the difference, is definitely true. Operating "multi process" is of course more expensive than "multi thread", especially on Win32; so that approach is only reasonable when threads don't work in the environment (both approaches can use memory-mapped files).

What I'm not sure about is the reading part - if the source file is memory-mapped too, then the threads themselves can read it (split by offset), so the single reader is not necessary, no?
Multiple input files could themselves be split up to be handled by multiple threads.

         
        • Sauro Menna

          Sauro Menna - 2024-03-10

          Hi Simon,
          I have integrated multithreading into gcsort and have run initial tests in a Windows environment with fixed-length sequential files.
          I will extend the handling to variable-length sequential files and then to line sequential files.
          Added a parameter to the command-line 'mt=<n>' where <n> is the number of threads.
          This will give each thread an assigned block of records to read.
Each thread opens the input file(s), loads the data into memory, and then sorts it according to the sort-key specification.
The gcsort process waits for all threads to finish and then merges and writes the result to the output file.
          The merge step can read either from memory or from TMP (MMF) files depending on the size of the input file.
          I will now test the operation in linux environment.
          I used the qsort_s function for Linux and qsort_r for Windows.
          I will have to update the testcases and documentation.
I will send an update on developments.

          Sauro

           
          • Simon Sobisch

            Simon Sobisch - 2024-03-10

            Thank you for this good news.

            It is reasonable to do fixed length first, adding the others later on.

Do you already have results from runs with different values?

            Simon

             
            • Sauro Menna

              Sauro Menna - 2024-03-10

The execution times between single-thread and multi-thread are similar.
The output data are identical between gcsort with no threads, gcsort multithreaded, and the COBOL sort program.
Unfortunately, the part where gcsort loses time is the merge-and-write phase of the output of the parallel sorts.
I will have to improve this phase.
In general, in a Linux environment, the times are:

              Memory allocation
              GCSORT_MEMSIZE="2048000000"

              Single KEY
              Single Thread
              Total Records Number..............: 3000000
              Total Records Write Sort..........: 0
              Total Records Write Output........: 3000000
              Start : Sun Mar 10 13:20:12 2024
              End : Sun Mar 10 13:20:23 2024
              Elapsed Time 00hh 00mm 11ss 000ms

              MultiThread (4 threads)
              Total Records Number..............: 3000000
              Total Records Write Sort..........: 0
              Total Records Write Output........: 3000000
              Start : Sun Mar 10 13:20:58 2024
              End : Sun Mar 10 13:21:09 2024
              Elapsed Time 00hh 00mm 11ss 000ms

              MultiThread (3 threads)
              Total Records Number..............: 3000000
              Total Records Write Sort..........: 0
              Total Records Write Output........: 3000000
              Start : Sun Mar 10 13:24:55 2024
              End : Sun Mar 10 13:25:08 2024
              Elapsed Time 00hh 00mm 13ss 000ms

              MultiThread (6 threads)
              Total Records Number..............: 3000000
              Total Records Write Sort..........: 0
              Total Records Write Output........: 3000000
              Start : Sun Mar 10 13:25:35 2024
              End : Sun Mar 10 13:25:50 2024
              Elapsed Time 00hh 00mm 15ss 000ms

Multiple Keys
              File Size: 270 MByte
              SORT FIELDS : (8,5,CH,A,13,3,BI,A,16,4,FI,A,20,8,FL,A,28,4,PD,A,32,7,ZD,A)
              record length = 90

              No Thread
              Total Records Number..............: 3000000
              Total Records Write Sort..........: 0
              Total Records Write Output........: 3000000
              Start : Sun Mar 10 13:34:33 2024
              End : Sun Mar 10 13:34:48 2024
              Elapsed Time 00hh 00mm 15ss 000ms

              MultiThread (4 threads)
              Total Records Number..............: 3000000
              Total Records Write Sort..........: 0
              Total Records Write Output........: 3000000
              Start : Sun Mar 10 13:33:15 2024
              End : Sun Mar 10 13:33:39 2024
              Elapsed Time 00hh 00mm 24ss 000ms
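For the merge-and-write phase identified above as the bottleneck, a k-way heap merge keeps the cost at O(N log k) comparisons with constant extra memory, streaming records straight to the writer instead of concatenating and re-sorting. A minimal sketch (Python for brevity; the standard library's `heapq.merge` does the same thing and is expanded here only to show the mechanism; `out_file.append` stands in for a buffered write):

```python
import heapq

def merge_runs_to_output(runs, out_file, key):
    """Stream-merge sorted runs into the output without materializing them.

    The heap holds one candidate record per run, so each output record
    costs O(log k) comparisons for k runs, and the runs can just as well
    be iterators over temp (MMF) files as in-memory lists.
    """
    iters = [iter(run) for run in runs]
    heap = []
    for idx, it in enumerate(iters):
        rec = next(it, None)
        if rec is not None:
            # idx breaks ties so records themselves are never compared
            heap.append((key(rec), idx, rec))
    heapq.heapify(heap)
    while heap:
        _, idx, rec = heapq.heappop(heap)
        out_file.append(rec)          # stand-in for a buffered file write
        nxt = next(iters[idx], None)  # refill from the run we popped from
        if nxt is not None:
            heapq.heappush(heap, (key(nxt), idx, nxt))
```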

               
              • Sauro Menna

                Sauro Menna - 2024-04-17

                Hi,
I have revised the handling of threads and the times have improved.
I had a problem in a Linux environment with cobc version 2.2 that cost me a lot of time; the only way to solve it was to insert a mutex around the sort phase (the sort phase uses cob_fields to compare values).
With a few records it works, but on large files there were sort-order mismatches, even though the same number of records was generated.
In a Windows 10 environment with cobc 3.2 I had no problems.
I will continue testing before naming the first release with multithreaded integration.
                Regards.
                Sauro

                 
  • Sauro Menna

    Sauro Menna - 2024-06-23

    Hello,
    I consolidated the multithreaded version of GCSort.
The version is 1.04.00, tested in Windows, Linux (Ubuntu) and MinGW 64 environments.
    The execution times seem satisfactory to me (I think).
    Attached is a test in Linux environment, with:
    1) Sequential file generation of 3,500,000 records.
    2) Sort COBOL (set the size of the sort in memory).
    3) GCSort single process
    4) comparison of result between gcsort and cobol.
    5) GCSort with mt=2 (two threads)
    6) comparison of the result between gcsort and cobol
    7) GCSort with mt=4 (four threads)
    8) comparison of the result between gcsort and cobol

    Elapsed
    1) Generation of :
    2) Cobol Sort : sbig001 - 0 minutes and 48 seconds elapsed.
    3) GCsort single process : Elapsed Time 00hh 00mm 49ss 000ms
    4) GCsort 2 thread : Elapsed Time 00hh 00mm 36ss 000ms
    5) GCSort 4 thread : Elapsed Time 00hh 00mm 29ss 000ms

    Best Regards.
    Sauro

