From: Erik H. <eri...@uc...> - 2013-06-03 19:07:20
At Mon, 3 Jun 2013 11:39:40 +0000,
Kristinn Sigurðsson wrote:
>
> Dear all,
>
> We are planning on updating our Wayback installation and I would
> like to poll your collective wisdom on the best approach for
> managing the Wayback index.
>
> Currently, our collection is about 2.2 billion items. It is also
> growing at a rate of approximately 350-400 million records per year.
>
> The obvious approach would be to use a sorted CDX file (or files) as
> the index. I'm, however, concerned about its performance at this
> scale. Additionally, updating a CDX based index can be troublesome.
> Especially as we would like to update it continuously as new
> material is ingested.
>
> Any relevant experience and advice you could share on this topic
> would be greatly appreciated.

Hi Kristinn,

We use 4 different CDX files. One is updated every ten minutes, one
hourly, one daily, and one monthly. We use the unix sort command to
sort. This has worked pretty well for us. We aren’t doing it in the
most efficient manner, and we will probably switch to sorting with
hadoop at some point, but it works pretty well.

best,
Erik
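
To illustrate the tiered layout described above, here is a minimal
sketch of how several already-sorted CDX tiers could be combined into
one sorted stream, for example when rolling a smaller tier up into a
larger one. It is not taken from the setup in this message: the
function name merge_cdx_tiers, the sample records, and the simplified
field layout are all hypothetical, and it assumes every tier is sorted
in plain byte order (as LC_ALL=C sort would produce for ASCII input).

```python
import heapq
from typing import Iterable, Iterator


def merge_cdx_tiers(*tiers: Iterable[str]) -> Iterator[str]:
    """Merge already-sorted CDX line streams into one sorted stream.

    Each tier must already be sorted. Exact duplicate lines, which
    could appear if a record sits in two tiers during a roll-up,
    are emitted only once.
    """
    previous = None
    # heapq.merge streams the inputs lazily, so no tier is loaded
    # into memory as a whole.
    for line in heapq.merge(*tiers):
        if line != previous:
            yield line
        previous = line


# Hypothetical sample records in a simplified, CDX-like layout:
# "<canonicalized url> <timestamp> <original url> <mime> <status>"
monthly = [
    "com,example)/ 20130101000000 http://example.com/ text/html 200",
    "org,example)/ 20130201000000 http://example.org/ text/html 200",
]
daily = [
    "com,example)/ 20130602000000 http://example.com/ text/html 200",
    "org,example)/ 20130201000000 http://example.org/ text/html 200",
]
ten_minute = [
    "com,example)/about 20130603190000 http://example.com/about text/html 200",
]

merged = list(merge_cdx_tiers(monthly, daily, ten_minute))
for record in merged:
    print(record)
```

Because the inputs are merged rather than re-sorted, the cost of a
roll-up is linear in the total number of lines, which is the same idea
as merging pre-sorted files with sort -m.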