From: Naomi D. <nd...@st...> - 2008-10-17 17:23:11
Jeffrey,

I'm not sure I know anything more than you do, but here's what I know:

Have you tweaked the index parameters in solrconfig.xml? See
http://wiki.apache.org/solr/SolrPerformanceFactors and
http://wiki.apache.org/solr/SolrConfigXml

I would recommend:

  generating the new index starting with an empty index.
  increasing the ramBufferSizeMB as high as you can (it speeds up indexing).
  increasing the merge factor, so the index is reorganized fewer times.
  NOTE: you will probably need to raise the ulimit on your box, or the
  indexing will run out of file descriptors. This is documented in the
  pages above.

Here are our values from solrconfig.xml (note that there are two sections in there!):

  <indexDefaults>
    <useCompoundFile>false</useCompoundFile>
    <mergeFactor>20</mergeFactor>
    <!--<maxBufferedDocs>1000</maxBufferedDocs>-->
    <ramBufferSizeMB>10240</ramBufferSizeMB>
    <maxMergeDocs>2147483647</maxMergeDocs>
    <writeLockTimeout>1000</writeLockTimeout>
    <commitLockTimeout>10000</commitLockTimeout>
    <lockType>single</lockType>
    <maxFieldLength>10000</maxFieldLength>
  </indexDefaults>

  <mainIndex>
    <!-- options specific to the main on-disk lucene index -->
    <useCompoundFile>false</useCompoundFile>
    <ramBufferSizeMB>10240</ramBufferSizeMB>
    <mergeFactor>20</mergeFactor>
    <maxMergeDocs>2147483647</maxMergeDocs>
    <maxFieldLength>10000</maxFieldLength>
    <unlockOnStartup>false</unlockOnStartup>
  </mainIndex>

We index in chunks of 500K records. My current script allocates 20g of heap space, but I think that's way more than we actually use. Our machine is big and fast, 'tis true. We also index on a box distinct from the public-facing vufind instance; we don't want indexing to impact UI performance.

And yes, the later chunks take longer than the first ones ... that's normal.

- Naomi

On Oct 17, 2008, at 9:51 AM, Barnett, Jeffrey wrote:

> I didn't mean two days = 48 hrs. I meant two work days. We have
> 8.3 million records, and have found that we need to stop and
> optimize two or three times part way through. Even then, the last
> million take about twice as long as the first. I think you also
> have a somewhat larger, faster machine than we do. We are planning
> to add memory and separate drives to help reduce seek time, but
> being technically still in beta, we also aren't yet sure how much or
> how soon to spend on hardware.
>
> If you have any tuning hints, feel free to share. (I have one I
> shared earlier: When allocating more than 3072M to a JVM, use the
> -d64 param to force 64 bit addressing.) (I've been told this is only
> needed for Solaris.)
>
> -----Original Message-----
> From: Naomi Dushay [mailto:nd...@st...]
> Sent: Friday, October 17, 2008 12:32 PM
> To: vuFind-Tech
> Subject: Re: [VuFind-Tech] author name stemming
>
> Jeffrey,
>
> It takes you two days to generate an index? We do our 5.5 million
> marc21 records in about 6 hours. We're in SOLR 1.2, though I don't
> think that should matter.
>
> - Naomi
>
>
> On Oct 17, 2008, at 8:43 AM, Barnett, Jeffrey wrote:
>
>> To: 'Andrew Nagy'
>> Subject: RE: [VuFind-Tech] author name stemming
>>
>> I hate to ask, but does this imply another index reload? I just
>> finished one in order to get ready for the impending 1.0rc. Will
>> there be more, or is it safe to start the two day process over again?
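
A footnote on the file-descriptor warning in Naomi's message: before starting a long indexing run, it can help to confirm what open-file limit the indexing process actually gets. The following is only a minimal sketch in Python, not part of VuFind or Solr. It uses the standard-library resource module (Unix only). The helper names and the 8192 threshold are hypothetical, chosen purely for illustration; the right number depends on your merge factor and index layout.

```python
import resource


def fd_limits():
    """Return the (soft, hard) open-file limits for the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def has_fd_headroom(soft_limit, wanted):
    """Return True if the soft limit allows at least `wanted` open files."""
    # RLIM_INFINITY means the kernel enforces no limit at all.
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return soft_limit >= wanted


if __name__ == "__main__":
    soft, hard = fd_limits()
    wanted = 8192  # hypothetical threshold, not a Solr recommendation
    print("soft limit:", soft, "hard limit:", hard)
    if not has_fd_headroom(soft, wanted):
        print("Open-file limit looks low for a high merge factor; "
              "consider raising it before indexing.")
```

If the soft limit turns out to be too low, it can be raised with `ulimit -n` in the shell that launches the indexer, up to the hard limit.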