How long does it take to process dump files?

2008-12-05
2013-05-30
  • David Milne
    2008-12-05

    Hi,

    If you process a Wikipedia dump, can you please post how long it took and what the specs of your machine are? This will give people an idea of how long things take, and who to ask if they can't process the files themselves.

    Cheers,
    Dave

     
    • David Milne
      2008-12-05

      I'll start things off:

      Extraction times for
         en_20080727 (5.85 million pages, 18 GB of markup)
         on an Intel(R) Core(TM)2 Duo (dual core) CPU E6850 @ 3.00GHz with 4 GB of RAM

      ==extractWikipediaData.pl==

      extracting page summary from dump file: done in 01:35:15
      extracting redirect summary from dump file: done in 01:07:33
      extracting core summaries from dump file: done in 04:16:26
      - adding titles and redirects to anchor summary: done in 00:01:12
      - saving anchors: done in 00:01:47
      summarizing anchors for quick caching
      - reorganizing anchors (pass 1 of 2): done in 00:02:53
      - saving anchor summary (pass 1 of 2): done in 00:01:07
      - reorganizing anchors (pass 2 of 2): done in 00:04:17
      - saving anchor summary (pass 2 of 2): done in 00:01:05
      summarizing generality
      - gathering category links: done in 01:04:31
      - calculating and saving page depths: done in 00:01:33
      summarizing link counts
      - gathering link counts: done in 00:06:14
      - saving link counts: done in 00:00:19
      summarizing links out from each page
      - gathering destination frequencies: done in 00:00:37
      - saving links: done in 00:07:03
      summarizing links in to each page
      - calculating space requirements: done in 00:00:38
      - pass 1 of 2
         - allocating space: done in 00:00:34
         - gathering links: done in 05:19:57
         - saving links: done in 00:00:28
      - pass 2 of 2
         - allocating space: done in 00:00:59
         - gathering links: done in 06:00:36
         - saving links: done in 00:00:28
      extracting content: done in 01:29:08

      Total time: 21:24:40
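
      In case anyone wants to total up their own log the same way, here is a small helper (purely illustrative, not part of the wikipedia-miner scripts) that sums every "done in HH:MM:SS" duration it finds. It blindly counts every matching line, including the sub-steps marked with a leading dash, so the result will not necessarily match the script's own "Total time" figure.

      #!/usr/bin/perl
      # sum_times.pl -- illustrative helper, not part of wikipedia-miner.
      # Sums every "done in HH:MM:SS" duration in the given log file(s)
      # (or on stdin) and prints the grand total in the same format.
      use strict;
      use warnings;

      my $total = 0;
      while (my $line = <>) {
          $total += $1 * 3600 + $2 * 60 + $3
              if $line =~ /done in (\d+):(\d\d):(\d\d)/;
      }
      printf "total: %02d:%02d:%02d\n",
          int($total / 3600), int($total % 3600 / 60), $total % 60;

      For example: perl sum_times.pl extraction.log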

      ==splitData.pl==

      splitting dump file: done in 01:24:10

      ==extractAnchorOccurances.pl==

      Two of the ten splits (two separate processes, running simultaneously):

      Loading anchors: done in 00:00:54
      Gathering anchor occurances: done in 16:05:53
      Printing n-grams and frequencies: done in 00:00:24
       
      Loading anchors: done in 00:00:53
      Gathering anchor occurances: done in 16:17:42
      Printing n-grams and frequencies: done in 00:00:22

      This would take around four days if I only used this machine (fortunately I had some others to throw at it).
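
      If you want to launch all the splits from one machine rather than starting each process by hand, a sketch like the one below does the job. Note that the invocation of extractAnchorOccurances.pl shown here (data directory followed by split number) is only an assumption; check the script's own usage notes before relying on it.

      #!/usr/bin/perl
      # run_splits.pl -- illustrative sketch only, not part of wikipedia-miner.
      # Forks one child process per split and waits for all of them to finish.
      use strict;
      use warnings;

      my ($data_dir, $splits) = @ARGV;   # e.g. perl run_splits.pl /data/en_20080727 10
      my @pids;
      for my $n (1 .. $splits) {
          my $pid = fork();
          die "fork failed: $!" unless defined $pid;
          if ($pid == 0) {
              # Hypothetical argument order -- adjust to however
              # extractAnchorOccurances.pl actually expects to be called.
              exec('perl', 'extractAnchorOccurances.pl', $data_dir, $n)
                  or die "exec failed: $!";
          }
          push @pids, $pid;
      }
      waitpid($_, 0) for @pids;
      print "all $splits splits finished\n";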

      ==mergeAnchorOccurances.pl==

      merging anchor occurances (pass 1 of 10): done in 00:00:50
      merging anchor occurances (pass 2 of 10): done in 00:00:48
      merging anchor occurances (pass 3 of 10): done in 00:00:45
      merging anchor occurances (pass 4 of 10): done in 00:00:44
      merging anchor occurances (pass 5 of 10): done in 00:00:46
      merging anchor occurances (pass 6 of 10): done in 00:00:45
      merging anchor occurances (pass 7 of 10): done in 00:00:44
      merging anchor occurances (pass 8 of 10): done in 00:00:45
      merging anchor occurances (pass 9 of 10): done in 00:00:45
      merging anchor occurances (pass 10 of 10): done in 00:00:46
      saving merged anchor occurances: done in 00:00:29

      Total time: 00:08:07

       
  • Hello everybody,

    I've just processed the Portuguese Wikipedia dump (2.4 GB) on a 64-bit Pentium Dual Core at 1.8 GHz with 4 GB of RAM, running 64-bit Ubuntu Linux 9.10. Despite the regex warning ("Complex regular subexpression recursion limit (32766) exceeded at extractAnchorOccurances.pl line 196"), everything finished normally for the Portuguese version. I chose not to split the data, since this is the first time I've run the scripts, but this year I intend to run it on 4 machines, and I'll come back here to post the timing report.

    Here it goes…

    ==extractWikipediaData.pl==
    extracting core summaries from dump file: done in 00:58:27                         
    - adding titles and redirects to anchor summary: done in 00:00:20                         
    - saving anchors: done in 00:01:48                         
    summarizing anchors for quick caching
    - reorganizing anchors (pass 1 of 2): done in 00:00:36                         
    - saving anchor summary (pass 1 of 2): done in 00:00:11                         
    - reorganizing anchors (pass 2 of 2): done in 00:00:36                         
    - saving anchor summary (pass 2 of 2): done in 00:00:12                         
    summarizing generality
    - gathering category links: done in 00:07:45                         
    - calculating and saving page depths: done in 00:02:54                         
    summarizing link counts
    - gathering link counts: done in 00:01:32                         
    - saving link counts: done in 00:00:06                         
    summarizing links out from each page
    - gathering destination frequencies: done in 00:00:11                         
    - saving links: done in 00:01:43                         
    summarizing links in to each page
    - calculating space requirements: done in 00:00:13                         
    - pass 1 of 2
       - allocating space: done in 00:00:11                         
       - gathering links: done in 02:05:24                         
       - saving links: done in 00:00:08                         
    - pass 2 of 2
       - allocating space: done in 00:00:10                         
       - gathering links: done in 01:45:56                         
       - saving links: done in 00:00:08                         
    extracting content: done in 00:29:13 

    ==splitData.pl==
    splitting dump file: done in 00:26:22

    ==extractAnchorOccurances.pl== (no split was used)
    Loading anchors: done in 00:00:21
    Gathering anchor occurances: done in 33:09:27                      <----- !   
    Printing n-grams and frequencies: done in 00:00:06  

    ==mergeAnchorOccurances.pl==
    loading anchors: done in 00:00:21                         
    merging anchor occurances (pass 1 of 1): done in 00:00:14                         
    saving merged anchor occurances: done in 00:00:09 

     
  • dhoppe
    2010-04-28


    - Hardware -

    Intel(R) Core(TM)2 Quad CPU    Q6600  @ 2.40GHz
    Mem:       8056756 kB (approx. 8 GB)


    - Dump -

    enwiki-20091103-pages-articles.xml, 24GiB


    - extractWikipediaData.pl -

    page content will be extracted.
    data will be split into 2 passes for memory-intesive operations. Try using more passes if you run into problems.
    (in cleanup)
    extracting page summary from dump file: done in 02:25:26                         
    extracting redirect summary from dump file: done in 01:46:22                         
    extracting core summaries from dump file: done in 08:19:16                         
    - adding titles and redirects to anchor summary: done in 00:01:39                         
    - saving anchors: done in 00:05:57                         
    summarizing anchors for quick caching
    - reorganizing anchors (pass 1 of 2): done in 00:02:57                         
    - saving anchor summary (pass 1 of 2): done in 00:01:03                         
    - reorganizing anchors (pass 2 of 2): done in 00:03:07                         
    - saving anchor summary (pass 2 of 2): done in 00:01:14                         
    summarizing generality
    - gathering category links: done in 19:55:07                         
    - calculating and saving page depths: done in 00:30:31                         
    summarizing link counts
    reading page summary from csv file: done in 00:03:26                         
    - gathering link counts: done in 00:08:11                         
    - saving link counts: done in 00:00:30                         
    summarizing links out from each page
    - gathering destination frequencies: done in 00:00:47                         
    - saving links: done in 00:09:10                         
    summarizing links in to each page
    - calculating space requirements: done in 00:00:56                         
    - pass 1 of 2
       - allocating space: done in 00:01:28                         
       - gathering links: done in 07:08:27                         
       - saving links: done in 00:00:44                         
    - pass 2 of 2
       - allocating space: done in 00:00:51                         
       - gathering links: done in 08:23:49                         
       - saving links: done in 00:00:46                         
    extracting content: done in 02:47:21                         


    - extractAnchorOccurances.pl -

    Loading anchors: done in 00:01:33                         
    Gathering anchor occurances: done in 71:03:47                         
    Printing n-grams and frequencies: done in 00:00:35

    Loading anchors: done in 00:01:34                         
    Gathering anchor occurances: done in 69:37:03                         
    Printing n-grams and frequencies: done in 00:00:38  

    Loading anchors: done in 00:01:41                         
    Gathering anchor occurances: done in 71:08:02                         
    Printing n-grams and frequencies: done in 00:00:36  

    Loading anchors: done in 00:01:43                         
    Gathering anchor occurances: done in 71:25:43                         
    Printing n-grams and frequencies: done in 00:00:36    


    - mergeAnchorOccurances.pl -

    loading anchors: done in 00:01:44                         
    merging anchor occurances (pass 1 of 4): done in 00:01:06                         
    merging anchor occurances (pass 2 of 4): done in 00:01:05                         
    merging anchor occurances (pass 3 of 4): done in 00:01:06                         
    merging anchor occurances (pass 4 of 4): done in 00:01:06                         
    saving merged anchor occurances: done in 00:00:41  

     
  • Hey folks!

    I've done it again with the latest Portuguese Wikipedia dump (2.5 GB this time), but this time I chose to work with two splits in order to take advantage of the dual-core architecture of the same computer I used before.

    My previous time for the anchor counting, using no splits, was:
    * Gathering anchor occurances: done in 33:09:27

    The times using two splits processed in parallel were:
    * Gathering anchor occurances: done in 21:53:48 
    * Gathering anchor occurances: done in 24:02:10

    I believe it was a good gain: the wall-clock time dropped from roughly 33 hours to roughly 24. If you have a multi-core computer, try to work with as many splits as you can run in parallel.
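
    If you are not sure how many splits to create, a quick check of the core count on Linux (as on the Ubuntu machine above) looks something like the sketch below; the splitData.pl call it prints simply mirrors the "directory, number of splits" usage shown further down this thread.

    #!/usr/bin/perl
    # cores.pl -- illustrative only: count logical CPUs on Linux to decide
    # how many splits to ask splitData.pl for.
    use strict;
    use warnings;

    open my $fh, '<', '/proc/cpuinfo' or die "cannot read /proc/cpuinfo: $!";
    my $cores = grep { /^processor\s*:/ } <$fh>;
    close $fh;
    print "$cores logical CPUs -> e.g. perl splitData.pl <data dir> $cores\n";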

    Bye!!!

     
  • extracting core summaries from dump file: done in 06:06:34                         
    - adding titles and redirects to anchor summary: done in 00:02:42                         
    - saving anchors: done in 00:02:38                         
    summarizing anchors for quick caching
    - reorganizing anchors (pass 1 of 2): done in 00:02:28                         
    - saving anchor summary (pass 1 of 2): done in 00:00:50                         
    - reorganizing anchors (pass 2 of 2): done in 00:02:33                         
    - saving anchor summary (pass 2 of 2): done in 00:00:49                         
    summarizing generality
    - gathering category links: done in 08:10:51                         
    - calculating and saving page depths: done in 00:16:43                         
    summarizing link counts
    reading page summary from csv file: done in 00:03:04                         
    - gathering link counts: done in 00:07:02                         
    - saving link counts: done in 00:00:39                         
    summarizing links out from each page
    - gathering destination frequencies: done in 00:00:47                         
    - saving links: done in 00:07:32                         
    summarizing links in to each page
    - calculating space requirements: done in 00:00:42                         
    - pass 1 of 2
       - allocating space: done in 00:01:09                         
       - gathering links: done in 09:01:27                         
       - saving links: done in 00:00:40                         
    - pass 2 of 2
       - allocating space: done in 00:00:50                         
       - gathering links: done in 07:09:19                         
       - saving links: done in 00:00:39                         
    extracting content: done in 02:25:07

    amir@amir-desktop:~/wikipedia-miner_1.1/extraction$ perl splitData.pl /home/amir/en_20100312/ 8
    splitting dump file: done in 02:09:30

    Gathering anchor occurances: done in 19:32:57                         
    Printing n-grams and frequencies: done in 00:00:28

     
  • code51
    2010-08-18


    - Hardware -

    Intel(R) Xeon(R) CPU           E5430  @ 2.66GHz
    Mem:       16438616 kB (approx. 16 GB)


    - Dump -

    trwiki-20100406-pages-articles.xml, 756 MB

    ####### extractWikipediaData.pl ########

    page content will be extracted.
    data will be split into 2 passes for memory-intesive operations. Try using more passes if you run into problems.
    extracting page summary from dump file: done in 00:05:01                         
    extracting redirect summary from dump file: done in 00:03:40                         
    extracting core summaries from dump file: done in 00:11:36                         
    - adding titles and redirects to anchor summary: done in 00:00:03                         
    - saving anchors: done in 00:00:07                         
    summarizing anchors for quick caching
    - reorganizing anchors (pass 1 of 2): done in 00:00:05                         
    - saving anchor summary (pass 1 of 2): done in 00:00:01                         
    - reorganizing anchors (pass 2 of 2): done in 00:00:04                         
    - saving anchor summary (pass 2 of 2): done in 00:00:02                         
    summarizing generality
    - gathering category links: done in 00:00:08                         
    - calculating and saving page depths: done in 00:00:05                         
    summarizing link counts
    reading page summary from csv file: done in 00:00:07                         
    - gathering link counts: done in 00:00:13                         
    - saving link counts: done in 00:00:01                         
    summarizing links out from each page
    - gathering destination frequencies: done in 00:00:02                         
    - saving links: done in 00:00:13                         
    summarizing links in to each page
    - calculating space requirements: done in 00:00:02                         
    - pass 1 of 2
       - allocating space: done in 00:00:02                         
       - gathering links: done in 00:07:59                         
       - saving links: done in 00:00:01                         
    - pass 2 of 2
       - allocating space: done in 00:00:02                         
       - gathering links: done in 00:04:00                         
       - saving links: done in 00:00:02                         
    extracting content: done in 00:05:57

    ###### extractAnchorOccurances.pl ######

    ## split to 4 parts ##
    splitting dump file: done in 00:04:54

    ## anchor occurances ##
    Gathering anchor occurances: done in 00:56:56      # 1
    Gathering anchor occurances: done in 00:56:49     # 2
    Gathering anchor occurances: done in 00:56:41     # 3
    Gathering anchor occurances: done in 00:55:15     # 4

    ## merge 4 parts ##
    loading anchors: done in 00:00:03                         
    merging anchor occurances (pass 1 of 4): done in 00:00:02                         
    merging anchor occurances (pass 2 of 4): done in 00:00:02                         
    merging anchor occurances (pass 3 of 4): done in 00:00:02                         
    merging anchor occurances (pass 4 of 4): done in 00:00:02                         
    saving merged anchor occurances: done in 00:00:01