lius-devp Mailing List for LIUS (Lucene Index Update and Search)
Brought to you by:
benjellr
You can subscribe to this list here.
2005 | Jan | Feb | Mar | Apr | May | Jun (30) | Jul | Aug | Sep (1) | Oct | Nov | Dec |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2006 | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct (1) | Nov | Dec (1) |
2007 | Jan | Feb (1) | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
2011 | Jan | Feb (2) | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
From: Jens F. <jf...@te...> - 2011-02-27 09:06:54
|
Hi Indira,

can you please post the relevant excerpt from your source? It sounds to me like you could be using the API incorrectly..

Jens

On Sat, 2011-02-26 at 22:41 +0530, indira priyadharshini wrote:
> Hi,
>
> I am new to LIUS. I am using Lucene to create an index, and the
> index gets created properly.
> I am trying to search for a keyword. When searching, I get back a
> list of LiusHit objects.
> I can get the size() of the list, but while iterating over the list
> I get "Problem: java.util.NoSuchElementException".
> Any idea how to overcome this?
> Thank you for your replies.
>
> Regards,
> Indira |
From: indira p. <ind...@gm...> - 2011-02-26 17:11:56
|
Hi, I am new to LIUS. I am using Lucene to create an index, and the index gets created properly. I am trying to search for a keyword. When searching, I get back a list of LiusHit objects. I can get the size() of the list, but while iterating over the list I get "Problem: java.util.NoSuchElementException". Any idea how to overcome this? Thank you for your replies. Regards, Indira |
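A java.util.NoSuchElementException during iteration often comes from calling next() more often than hasNext() was checked, for example twice within one loop pass. The LIUS search API is not shown in this thread, so the following is only a sketch under that assumption: a plain List<String> stands in for the list of LiusHit objects, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in: a List<String> plays the role of the list of LiusHit objects.
class SafeHitIteration {

    // Calls next() exactly once for every successful hasNext() check.
    static List<String> collect(List<String> hits) {
        List<String> seen = new ArrayList<>();
        for (Iterator<String> it = hits.iterator(); it.hasNext(); ) {
            String hit = it.next(); // the only next() call in the loop body
            seen.add(hit);
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(collect(List.of("doc1.pdf", "doc2.html", "doc3.txt")));
    }
}
```

If the real loop reads several values per hit, fetching the hit once into a local variable and reading everything from that variable avoids the extra next() calls.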
From: Walker, K. 1 <kei...@lm...> - 2007-02-22 21:43:24
|
I'm currently using Lucene wrapped by an EJB and am investigating how it behaves in that multi-threaded environment. I'm looking at adding LIUS, but is it thread-safe? Also, I notice in the sample code that there are several calls to getSingletonInstance(). Assuming LIUS is thread safe and my app server has multiple instances of the EJB running, will LIUS convert/index multiple documents at once or are the threads processed one at a time? Thanks, Keith |
From: Luca C. <luc...@gm...> - 2006-12-09 13:39:16
|
Hi Rida,

I've downloaded LIUS 1.0-RC2 and I think it is a nice project. However, I've found a few things that I think you should correct. I don't mean to criticize your work; I'm posting this message because I want to support you.

1) If you browse the code on SourceForge.net via "Code -> SVN Browse" and click on "src/" and "lius/", you get .class files instead of .java files. One expects to find the source code in the "src" directory, doesn't one?

2) The documentation of LIUS 1.0-RC2 at http://www.doculibre.com/lius/doc-1.0_en.html has an error in Example 3. The line:
    <indexer class="lius.index.PDF.PdfIndexer">
should be:
    <indexer class="lius.index.pdf.PdfIndexer">

3) Line 54 of src/lius/index/IndexerFactory.java is:
    indexer.setDocToIndexPath(file.getAbsolutePath());
but I think that it should be:
    if (indexer != null) {
        indexer.setDocToIndexPath(file.getAbsolutePath());
    }
This is because line 53 might set indexer to null.

4) I suspect that lines 68 and 69 of src/lius/index/IndexerFactory.java should be swapped:
    indexer.setDocToIndexPath(url.toString());
    indexer = getIndexer(url.openStream(), MimeTypeUtils.getMimeType(url), lc);
I guess the method setDocToIndexPath should be called after the indexer has been returned by getIndexer.

Cheers,
Luca |
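Points 3 and 4 above boil down to one ordering rule: obtain the indexer first, check it for null, and only then configure it. The sketch below illustrates that rule with hypothetical stand-in classes; it does not reproduce the real lius.index.IndexerFactory, and the MIME-type map is an assumption made purely for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a LIUS indexer; only the path setter matters here.
class SketchIndexer {
    private String docToIndexPath;
    void setDocToIndexPath(String path) { this.docToIndexPath = path; }
    String getDocToIndexPath() { return docToIndexPath; }
}

// Hypothetical factory: looks an indexer up by MIME type and may find none.
class SketchIndexerFactory {
    private final Map<String, SketchIndexer> byMimeType = new HashMap<>();

    void register(String mimeType, SketchIndexer indexer) {
        byMimeType.put(mimeType, indexer);
    }

    // Returns null when no indexer is registered for the MIME type.
    SketchIndexer getIndexer(String path, String mimeType) {
        SketchIndexer indexer = byMimeType.get(mimeType); // 1. obtain the indexer
        if (indexer != null) {                            // 2. guard against null
            indexer.setDocToIndexPath(path);              // 3. configure it last
        }
        return indexer;
    }

    public static void main(String[] args) {
        SketchIndexerFactory factory = new SketchIndexerFactory();
        factory.register("application/pdf", new SketchIndexer());
        System.out.println(factory.getIndexer("/tmp/a.pdf", "application/pdf").getDocToIndexPath());
        System.out.println(factory.getIndexer("/tmp/a.xyz", "application/x-unknown"));
    }
}
```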
From: <Ja...@we...> - 2006-10-29 16:51:40
|
Hello! We are planning to use Lius for a project in my studies, and I need to know whether it supports Samba (SMB). Can you help me? Thanks & best regards, Jan Sievers. |
From: Rida B. <bi...@he...> - 2005-09-12 15:36:17
|
Hi Jens, I downloaded the latest version from CVS and I found an error in the class FontIndexer. Could you please correct this error? I would like to create a new version with the classes that you have written. I also want to correct some problems. Thanks. Best regards, Rida. |
From: Jens F. <jf...@te...> - 2005-06-22 18:48:42
|
Hi Rida..

> Also, we can add WebDavStreamProvider. I had a discussion this weekend
> with French colleagues who have implemented WebDAV indexing in Lius;
> they told me that they are ready to contribute by adding this
> module to Lius.

I don't have any experience with WebDAV (except for having heard the name). But if it fits.. we'll go for it.

> They are also ready to add web service search capability to Lius.
> I have also started thinking about and testing intranet indexing over HTTP,
> so I'm very happy to see that your UML represents it.

:-)

> I have also started thinking about indexing JCR (Java Content Repository) or
> JSR 170.

Again, I'm not familiar with it. I'll have a look at it when time permits. Could you sketch the content for me?

> I have added some classes in red to your UML.

Looks good.. I'm not sure whether we can agree on a UML tool to use for drawing - which would make it easier to share such diagrams. I'm (voluntarily) restricted to Linux here and would suggest something like 'dia', but it's probably not the best choice for the Windows world. Can you think of anything that everybody could use in a similar way?
The only (good) Java tool I have seen is 'Poseidon UML', but I'm not sure if they still offer a freeware version..
If you have anything in your bookmarks you can recommend: let me know :-)
Otherwise we'll just stick to drawing JPEGs for now..

> How will mixteIndexing work? Currently we index files, JavaBeans and
> nodes; mixed indexing allows indexing a directory and storing it in a single
> Lucene document. This is very useful when we want to index, for example, an
> XML file containing metadata and full text at the same time. In the new
> UML I can't see clearly how it will work.

As I wrote, I have omitted most of the Indexer interface methods. I think we can mostly reuse the available code from the current version after a little refactoring to comply with the new design..

> Best regards and thanks for this great job.

which was not really a lot of work, but thank you very much for your appreciation!

cu,
jens

--
.O. Jens Fendler
..O jf...@te...
OOO http://www.teamskill.de |
From: Jens F. <jf...@te...> - 2005-06-22 18:38:31
|
Hi Rida..

> I don't much like the idea of having XML without attributes,
> because the config file becomes more difficult to use.

I don't like the idea either, but it seems to me like a price we could pay if we decide to use ACC.

> Also, I can't really see the real problem with LiusConfig. Why don't you
> like it?

What do you mean? I thought we agreed on leaving knowledge about component configuration to the component and not LiusConfig (which will only serve as a common entry point into configuration issues).

> I haven't yet had the opportunity to work with ACC. Also, what J.J. Larrea
> suggested is very good, because we can define some classes that represent
> the config and then ask Digester to populate them. This can also be a
> good approach because, if we decide to use another API to parse the config,
> we can repopulate the config classes using the new API without having
> any impact on the core level.

I agree that the Digester concept is nice, but as I wrote earlier: I currently don't see a way of passing a reference to previously unhandled (i.e. unknown) XML elements to the respective components (e.g. Indexer classes). Again, we would end up declaring all possible configuration elements and structures in LiusConfig.

> And finally, if we implement a new config, we shouldn't spend a lot of
> time on it because we have a lot of things to do, like implementing the
> new indexing framework that Jens has suggested.

Indeed, but we should focus on the configuration issues first - otherwise we'd have to modify even the new indexing framework again (which I'd like to touch exactly once for now only :-)

cu,
jens

--
.O. Jens Fendler
..O jf...@te...
OOO http://www.teamskill.de |
From: JM T. <jm....@gm...> - 2005-06-21 16:51:53
|
Hi, me again! When I index a collection of many different files, I get console output for a lot (all?) of the HTML files. This output is the path of the file being indexed with "_liusTmpFileToIndex" appended at the end. Looking at the code, I found that it serves no purpose (it is not printed because of an exception) and is a bit annoying while indexing. It is in HtmlIndexer.getPopulatedCollection(): System.out.println(fileToDelete); I prefer to tell you when I think there are mistakes, instead of modifying the code in CVS myself. Jean-Marie |
From: JM T. <jm....@gm...> - 2005-06-21 08:25:27
|
Hi,

Once again, I'm not really sure about what I'm going to say... I often get an exception when I try to index a file that is in a bad format (for whatever reason; today it is an Excel file that seems to have a problem).

First, an exception is caught by the Indexer, but the program keeps going (and that's fine). But then a NullPointerException is thrown when the IndexWriter is closed (and stops the execution, which is not that fine).

Looking at the line reported for the exception, I saw that it was in the finally block (for example in LuceneActions.index). The problem is, I think, that the writer.close() instruction shouldn't be in both the catch and the finally blocks. The finally block is always executed, so the call should only be in that block and not in the catch. The same problem appears in a few methods in LuceneActions.

Jean-Marie |
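The pattern Jean-Marie describes can be sketched as follows: the writer is closed in exactly one place, the finally block, and only if it was actually opened. CountingWriter and WriterFactory are hypothetical stand-ins for Lucene's IndexWriter and its construction; the real LuceneActions code is not reproduced here.

```java
import java.io.Closeable;
import java.io.IOException;

// Hypothetical stand-in for an index writer; counts how often close() runs.
class CountingWriter implements Closeable {
    int closeCalls = 0;

    void addDocument(String doc) throws IOException {
        if (doc == null) {
            throw new IOException("unreadable document");
        }
    }

    @Override
    public void close() {
        closeCalls++;
    }
}

// Hypothetical factory so that opening the writer itself can fail.
interface WriterFactory {
    CountingWriter open() throws IOException;
}

class CloseInFinally {

    // Returns true if the document was indexed. The writer is closed at most once.
    static boolean index(WriterFactory factory, String doc) {
        CountingWriter writer = null;
        try {
            writer = factory.open();
            writer.addDocument(doc);
            return true;
        } catch (IOException e) {
            // Report only. No close() here: the finally block takes care of it.
            System.err.println("Skipping document: " + e.getMessage());
            return false;
        } finally {
            if (writer != null) { // open() may have failed before assignment
                writer.close();
            }
        }
    }

    public static void main(String[] args) {
        CountingWriter writer = new CountingWriter();
        System.out.println(index(() -> writer, "report.xls") + ", close calls: " + writer.closeCalls);
    }
}
```

On Java 7 and later, try-with-resources gives the same guarantee with less code, provided the writer type implements AutoCloseable.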
From: Rida B. <bi...@he...> - 2005-06-20 10:10:19
|
Hi all,

I don't much like the idea of having XML without attributes, because the config file becomes more difficult to use. Also, I can't really see the real problem with LiusConfig. Why don't you like it?

I haven't yet had the opportunity to work with ACC. Also, what J.J. Larrea suggested is very good, because we can define some classes that represent the config and then ask Digester to populate them. This can also be a good approach because, if we decide to use another API to parse the config, we can repopulate the config classes using the new API without having any impact on the core level.

And finally, if we implement a new config, we shouldn't spend a lot of time on it because we have a lot of things to do, like implementing the new indexing framework that Jens has suggested.

Best Regards. |
From: Rida B. <bi...@he...> - 2005-06-20 09:32:20
|
Hi Jens,

I totally agree with your UML. It is perfect and respects the direction that I want to give to Lius. Very good job :-).

Also, we can add WebDavStreamProvider. I had a discussion this weekend with French colleagues who have implemented WebDAV indexing in Lius; they told me that they are ready to contribute by adding this module to Lius. They are also ready to add web service search capability to Lius.

I have also started thinking about and testing intranet indexing over HTTP, so I'm very happy to see that your UML represents it.

I have also started thinking about indexing JCR (Java Content Repository) or JSR 170.

I have added some classes in red to your UML.

How will mixteIndexing work? Currently we index files, JavaBeans and nodes; mixed indexing allows indexing a directory and storing it in a single Lucene document. This is very useful when we want to index, for example, an XML file containing metadata and full text at the same time. In the new UML I can't see clearly how it will work.

Best regards and thanks for this great job.

-----Original Message-----
From: Jens Fendler <jf...@te...>
To: liu...@li...
Date: Fri, 17 Jun 2005 13:16:42 +0200
Subject: New Lius Indexer framework

> Hi all..
>
> Despite the problem of finding a proper configuration approach, I've UML-ized
> my thoughts regarding a restructuring of the indexer framework in order to
> support heterogeneous data sources.
>
> Please have a look at the provided diagram and tell me what you think..
>
> --
> .O. Jens Fendler
> ..O jf...@te...
> OOO http://www.teamskill.de |
From: Rida B. <bi...@he...> - 2005-06-17 12:06:36
|
Hi Jean-Marie,

I'm sorry for the delay. I'm currently in Europe (France), and I don't have a lot of time to read and answer the forum.

Regarding your questions: you are absolutely right, it's a mistake; we have to set the properties before indexing. You can correct it and commit it to CVS, or I can do it if you want.

As for performance and scalability, you are also absolutely right, and I think that we have to redesign it. Jens has a good idea:

    <reject pattern="/home/jf/Recordings/" />

This can be applied to the files in a directory that we don't want to index. I also think that we have to redesign the indexing process to make it more scalable and multi-threaded. I don't know if you have read the book Lucene in Action? This book suggests a good approach to threaded indexing. I'm sorry, I have to go now, but I will write another message to explain how the new threaded indexing will work.

For the update process, I think that we can create a method for updating a directory.

Best regards.

NB: Thanks, Jean-Marie, for your message in English and French. |
From: Jens F. <jf...@te...> - 2005-06-17 11:17:01
|
Hi all.. Despite the problem of finding a proper configuration approach, I've UML-ized my thoughts regarding a restructuring of the indexer framework in order to support heterogeneous data sources. Please have a look at the provided diagram and tell me what you think.. -- .O. Jens Fendler ..O jf...@te... OOO http://www.teamskill.de |
From: Jens F. <jf...@te...> - 2005-06-17 10:17:59
|
On Friday 17 June 2005 11:38, JM Tinghir wrote:
> Oops, sorry, I totally forgot! And I've just sent another one... I'll
> resend it too.

thanks. helps a lot :-)

> In the LuceneActions.index method, there is:
>     fileDirectoryIndexing(toIndex, lc, writer);
>     setIndexWriterProps(writer, lc);
>
> Shouldn't it rather be:
>     setIndexWriterProps(writer, lc);
>     fileDirectoryIndexing(toIndex, lc, writer);
>
> Initializing the writer before indexing?

You're right. This should definitely be the other way around (unless Rida had some genius hack in mind I don't see right now.. :-)

--
.O. Jens Fendler
..O jf...@te...
OOO http://www.teamskill.de |
From: JM T. <jm....@gm...> - 2005-06-17 09:47:12
|
Well, sorry again... English version! When you want to index a folder and all the files it contains, you can pass the folder's path to the LuceneActions.index() method. But if you want to use updateDoc on this folder, you have to call LuceneActions.updateDoc() on each file. Also, if you only want to index certain files, but not all of them, you need to call LuceneActions.index on each of them. So you open and close the IndexWriter each time, which decreases performance when there are a lot of files to process. Also, the mergeFactor isn't used at all (I think) in this case. What do you think about that? Am I talking nonsense? Jean-Marie |
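One possible way to address both points raised in this thread (writer properties set before indexing, and one writer per batch instead of one per file) is sketched below. RecordingWriter is a hypothetical stand-in, not Lucene's IndexWriter, and indexAll is not an existing LuceneActions method; the sketch only records which merge factor was in effect for each document.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an index writer; records the merge factor in effect per document.
class RecordingWriter {
    private int mergeFactor = 10;
    final List<String> log = new ArrayList<>();

    void setMergeFactor(int mergeFactor) { this.mergeFactor = mergeFactor; }
    void addDocument(String path) { log.add(path + "@" + mergeFactor); }
    void close() { /* nothing to release in this sketch */ }
}

class BatchIndexing {

    // Opens one writer for the whole batch and configures it before the first document.
    static List<String> indexAll(List<String> paths, int mergeFactor) {
        RecordingWriter writer = new RecordingWriter(); // opened once, not once per file
        try {
            writer.setMergeFactor(mergeFactor);         // properties first ...
            for (String path : paths) {
                writer.addDocument(path);               // ... then indexing
            }
        } finally {
            writer.close();                             // closed once, at the end
        }
        return writer.log;
    }

    public static void main(String[] args) {
        System.out.println(indexAll(List.of("a.txt", "b.html"), 50));
    }
}
```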
From: JM T. <jm....@gm...> - 2005-06-17 09:35:44
|
In fact, my previous message originated from the following problem: When you want to index a directory and all the documents it contains, you can pass the directory's path to the LuceneActions.index() method. But if you want to run updateDoc on this directory, you have to call LuceneActions.updateDoc() on each file. Likewise, if you want to index some files, but not necessarily all of them, you have to call LuceneActions.index on each of them. As a result, the IndexWriter is opened and closed each time, which causes performance problems when there are many files to process. Moreover, the mergeFactor no longer serves any purpose (I think) in this case. What do you think? Am I talking nonsense? Have a nice day, Jean-Marie |
From: JM T. <jm....@gm...> - 2005-06-17 09:12:17
|
Hello, In the LuceneActions.index method there is: fileDirectoryIndexing(toIndex, lc, writer); setIndexWriterProps(writer, lc); Shouldn't it be the other way around: initialize the writer's properties before indexing? Jean-Marie |
From: Jens F. <jf...@te...> - 2005-06-13 16:07:26
|
Hi J.J.

> I make that a conditional because, reading the past few messages, I
> am not sure it is a perfect fit.

I _am_ quite sure that it is not :-)
But as I also doubt that such a thing exists (unless we write it ourselves), I'd be quite satisfied with what we'd get from ACC.

> Jens expressed a concern that ACC
> is very heavy weight, with dependencies on many different APIs. This
> is because it incorporates so many different potential sources of
> configuration directives.

True. But I'm quite sure we could leave some major parts out as long as we stick to XML. We could simply add the remaining libs for ACC once we want to support other means of configuration.

> However, I haven't seen that much interest
> expressed by either of you in actually using any of those different
> kinds of sources. Which is fine with me, and perhaps an important
> conclusion to draw.
> Unless there is a strong desire to actually use different types of
> configuration sources, then perhaps ACC isn't providing much actual
> benefit.

Thank you very much for pointing this one out - which I have ignored in favour of playing around with ACC :-)
In the past I've only used two different ways of configuration in my projects: either a properties/preferences based approach (which is nice if you're using JDK >= 1.4, as it maps into the OS's native configuration) or XML (through JDOM). Both have their pros and cons, but as I think it through now for LIUS, I believe that properties/preferences would not offer any advantage. Lius has been designed to make use of different configurations - through different XML files - while a (native) properties file/registry/... should (in my opinion) be used for programs with ONE configuration.
As XML is still going strong and won't likely be replaced by something better in the near future, we should indeed think about what exactly we need and how small/easy/flexible we could implement it.

> Jens points out below that ACC doesn't handle attributes.
> Well, it actually does, using '@' as a path separator rather than
> '.', but the problem is that the @ attribute path is supported ONLY
> by the XML configuration source, and so by using it one has lost the
> ability to transparently work across different sources! And so that
> lack of transparency would force us to modify the Lius XML
> configuration file format (by removing attributes) in order to
> support "dumber" configuration formats. We shouldn't do that unless
> using those dumber formats is a high priority. Is it?

I intentionally left out the non-transparent aspects of ACC. If we don't make use of the different config sources, there is in fact not much gained.

> If we made the decision that we only cared about XML for
> configuration, then there are other possibilities for configuration
> parsing. One which might really simplify the configuration code is
> Apache Commons Digester, which I admit I did not look at carefully
> enough the first time I was looking around for configuration
> frameworks:

Well, at least you DID look around at all - sorry, I still like coding much better than browsing, although the second can often save you a lot of time on the first ;-)

> http://jakarta.apache.org/commons/digester/commons-digester-1.7/docs/apidocs/org/apache/commons/digester/package-summary.html#package_description
>
> It is very lightweight: SAX-based, and depends on very few
> dependencies (Logging, BeanUtils, & Collections). And it uses
> reflection to very good ends.

This one also looks like a nice tool, but after reading over some (by far not all) of the docs I'm not sure how to pass custom elements to certain components.
Bean-style get/set is fine with me, but as I understood it, it only works for 'flat' object structures. How would you handle this case without adding indexer-specific knowledge (i.e. rules) to the main config builder?

    <module class="x.y.z">
      <aaa>
        <bbb>47</bbb>
        <bbb>38</bbb>
      </aaa>

In ACC you'd simply pass a sub-configuration object to the module.. If our config parser has to have knowledge of all rules, we're just where we started from..

> In summary, I'm thinking that such simplicity (the potential to
> collapse pages of configuration code to a few rules) would be a
> bigger win than the potential to use multiple config sources.

It does look promising, but I'm not quite sure if all possible indexer options can (or should) always be mapped into a flat list for get/set. I'd rather leave even the format of everything under <options> to the module authors.
Do you see any (simple) chance here? (The only one I can currently think of would involve passing a reference to the digester (stack) itself to the to-be-configured module, which could add its own rules there. But somehow that seems like an evil hack to me..)

--
.O. Jens Fendler
..O jf...@te...
OOO http://www.teamskill.de |
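The idea of leaving the format of everything under <options> to the module authors can be sketched with the JDK's own DOM parser, without ACC or Digester: the loader only knows the module class name and hands the untouched <options> element to the module. All names below (OptionsAwareModule, AcceptFilterModule, ModuleLoader) are hypothetical and the XML layout is an assumption for illustration, not an agreed Lius format.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical module contract: each module interprets its own <options> subtree.
interface OptionsAwareModule {
    void setup(Element options);
}

// Hypothetical module that understands repeated <accept> elements.
class AcceptFilterModule implements OptionsAwareModule {
    final List<String> accepted = new ArrayList<>();

    public AcceptFilterModule() { }

    @Override
    public void setup(Element options) {
        NodeList nodes = options.getElementsByTagName("accept");
        for (int i = 0; i < nodes.getLength(); i++) {
            accepted.add(nodes.item(i).getTextContent().trim());
        }
    }
}

class ModuleLoader {

    // Instantiates every <module class="..."> and passes its <options> element through unparsed.
    static List<OptionsAwareModule> load(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList modules = doc.getElementsByTagName("module");
            List<OptionsAwareModule> result = new ArrayList<>();
            for (int i = 0; i < modules.getLength(); i++) {
                Element moduleElement = (Element) modules.item(i);
                OptionsAwareModule module = (OptionsAwareModule) Class
                        .forName(moduleElement.getAttribute("class"))
                        .getDeclaredConstructor()
                        .newInstance();
                Element options = (Element) moduleElement.getElementsByTagName("options").item(0);
                if (options != null) {
                    module.setup(options); // the loader knows nothing about the option format
                }
                result.add(module);
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not load modules", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<lius><modules><module class=\"" + AcceptFilterModule.class.getName()
                + "\"><options><accept>*.txt</accept><accept>*.asc</accept></options>"
                + "</module></modules></lius>";
        AcceptFilterModule module = (AcceptFilterModule) load(xml).get(0);
        System.out.println(module.accepted);
    }
}
```

The trade-off is the one discussed above: the central loader stays free of module-specific rules, but each module has to parse its own subtree.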
From: J.J. L. <jj...@pa...> - 2005-06-13 15:00:30
|
Hello all.

Sorry for my recent silence on this list; I've been up to my head in a couple of deadlines.

Jens, thanks for following through with evaluating ACC (Apache Commons Configuration) for LIUS. The example code you provided in your last message was very instructive for seeing the overall approach you are thinking of for module configuration, and so I think it is valuable whether or not we choose ACC as the framework for parsing configuration files.

I make that a conditional because, reading the past few messages, I am not sure it is a perfect fit. Jens expressed a concern that ACC is very heavy weight, with dependencies on many different APIs. This is because it incorporates so many different potential sources of configuration directives. However, I haven't seen that much interest expressed by either of you in actually using any of those different kinds of sources. Which is fine with me, and perhaps an important conclusion to draw.

Unless there is a strong desire to actually use different types of configuration sources, then perhaps ACC isn't providing much actual benefit. Jens points out below that ACC doesn't handle attributes. Well, it actually does, using '@' as a path separator rather than '.', but the problem is that the @ attribute path is supported ONLY by the XML configuration source, and so by using it one has lost the ability to transparently work across different sources! And so that lack of transparency would force us to modify the Lius XML configuration file format (by removing attributes) in order to support "dumber" configuration formats. We shouldn't do that unless using those dumber formats is a high priority. Is it?

If we made the decision that we only cared about XML for configuration, then there are other possibilities for configuration parsing. One which might really simplify the configuration code is Apache Commons Digester, which I admit I did not look at carefully enough the first time I was looking around for configuration frameworks:

http://jakarta.apache.org/commons/digester/commons-digester-1.7/docs/apidocs/org/apache/commons/digester/package-summary.html#package_description

It is very lightweight: SAX-based, and depends on very few dependencies (Logging, BeanUtils, & Collections). And it uses reflection to very good ends. For example, the loop which Jens outlined for creating classes based on the configuration file:

    List moduleKeys = config.getList( "modules.use" );
    for ( Iterator i = moduleKeys.iterator(); i.hasNext(); ) {
        String mId = (String) i.next();
        String mKey = "modules." + mId;
        String className = config.getString( mKey + ".class" );
        Configuration options = config.subset( mKey + ".options" );
        LiusModule module = (LiusModule) Class.forName( className ).newInstance();
        module.setup( options );
    }

could (other than the "use" mechanism, which is a separate topic) be replaced with something like this:

    digester.addObjectCreate("lius/modules/module", moduleClass, "class");

that would parse

    <module class="de.teamskill.lius.FilenameFilter">

Then if all the modules (and the main LiusConfig) use bean-style get/set methods for all configuration options, an additional rule like

    digester.addSetProperties("lius/modules/module");

would automatically map from, say,

    <accept>*.txt</accept>

to

    module.setAccept("*.txt");

In summary, I'm thinking that such simplicity (the potential to collapse pages of configuration code to a few rules) would be a bigger win than the potential to use multiple config sources.

What do you think?

- J.J.

At 3:02 PM +0200 6/13/05, Jens Fendler wrote:
> Hi Rida..
>
>> It sounds good, and I think that we can start the new development. I
>> have some questions. First, if we replace LiusConfig with
>> Commons Configuration, we also have to consider the LiusFields. Currently the user can
>> configure the Lucene fields using the config file; how will this work
>> with the new config? I think that we have to start a discussion about
>> the XML structure of the new config.
>
> I have attached a sample XML config file which could be used with ACC. Maybe
> we can use that as a starting point for the discussion on how to structure a
> new XML config.
> It looks a little bit more complex than the old one, but that's mostly due to the
> fact that ACC doesn't support attributes. Therefore, an element must either
> have a unique name or cannot be treated individually within ACC. For some
> elements this is OK, but for others the names should be read dynamically
> (please have a look at my example).
>
> As different fields are provided by the indexers, they should be configured
> within the respective nodes for the indexers.
>
>> Also, we must finalize the indexer class by replacing the getDocument
>> method with an index method. As for addCustomFields(List
>> liusFieldsOrLuceneFields), this method can be very useful when the user
>> wants to add some custom fields that Lius can't handle automatically, for
>> example a URL path.
>> Finally, the method getContent(Object ressource) will allow users to get
>> the content of the document that they index. This can be very useful
>> when the user wants to store the content outside the index. Lucene recommends
>> storing big content outside the index; this externally stored
>> content could be used, for example, for highlighting.
>
> I think those parts can still be used (more or less unchanged) under the new
> circumstances. We will have to agree on a definition of the interfaces
> anyway.
>
>> Also, we can have a new kind of Lius user who will use Lius as a text
>> extraction API.
>
> ...which would be a nice feature. I propose a new interface "ContentExtractor"
> that a developer can choose to implement on every Indexer. It would simply
> contain a method like
>     StringBuffer getContent();
> that returns raw, textual content if applicable. Thereby, the surrounding
> components (such as content-type determination and automatic handling/passing
> of files to indexers) can be used from the available Lius code should somebody
> want to use Lius for text extraction only.
>
> --
> .O. Jens Fendler
> ..O jf...@te...
> OOO http://www.teamskill.de |
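The bean-style mapping J.J. describes (an option named accept ending up in setAccept) can be illustrated with plain Java reflection. This is a conceptual sketch only and does not use or imitate the Commons Digester API; FilterBean and BeanPopulator are hypothetical names, and only String-typed setters are handled.

```java
import java.lang.reflect.Method;
import java.util.Map;

// Hypothetical bean-style module with one configurable option.
class FilterBean {
    private String accept;

    public void setAccept(String accept) { this.accept = accept; }
    public String getAccept() { return accept; }
}

class BeanPopulator {

    // For each entry "name" -> value, calls setName(String) on the bean if such a setter exists.
    static void populate(Object bean, Map<String, String> values) {
        for (Map.Entry<String, String> entry : values.entrySet()) {
            String name = entry.getKey();
            if (name.isEmpty()) {
                continue;
            }
            String setter = "set" + Character.toUpperCase(name.charAt(0)) + name.substring(1);
            try {
                Method method = bean.getClass().getMethod(setter, String.class);
                method.invoke(bean, entry.getValue());
            } catch (NoSuchMethodException e) {
                // No matching setter: the option is silently ignored in this sketch.
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Could not set option " + name, e);
            }
        }
    }

    public static void main(String[] args) {
        FilterBean bean = new FilterBean();
        populate(bean, Map.of("accept", "*.txt"));
        System.out.println(bean.getAccept());
    }
}
```

A flat name-to-value map is exactly the limitation Jens raises in his reply: nested or repeated option elements do not fit this shape without extra rules.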
From: Jens F. <jf...@te...> - 2005-06-13 13:03:05
|
Hi Rida..

> It sounds good, and I think that we can start the new development. I
> have some questions. First, if we replace LiusConfig with
> Commons Configuration, we also have to consider the LiusFields. Currently the user can
> configure the Lucene fields using the config file; how will this work
> with the new config? I think that we have to start a discussion about
> the XML structure of the new config.

I have attached a sample XML config file which could be used with ACC. Maybe we can use that as a starting point for the discussion on how to structure a new XML config.
It looks a little bit more complex than the old one, but that's mostly due to the fact that ACC doesn't support attributes. Therefore, an element must either have a unique name or cannot be treated individually within ACC. For some elements this is OK, but for others the names should be read dynamically (please have a look at my example).

As different fields are provided by the indexers, they should be configured within the respective nodes for the indexers.

> Also, we must finalize the indexer class by replacing the getDocument
> method with an index method. As for addCustomFields(List
> liusFieldsOrLuceneFields), this method can be very useful when the user
> wants to add some custom fields that Lius can't handle automatically, for
> example a URL path.
> Finally, the method getContent(Object ressource) will allow users to get
> the content of the document that they index. This can be very useful
> when the user wants to store the content outside the index. Lucene recommends
> storing big content outside the index; this externally stored
> content could be used, for example, for highlighting.

I think those parts can still be used (more or less unchanged) under the new circumstances. We will have to agree on a definition of the interfaces anyway.

> Also, we can have a new kind of Lius user who will use Lius as a text
> extraction API.

...which would be a nice feature. I propose a new interface "ContentExtractor" that a developer can choose to implement on every Indexer. It would simply contain a method like
    StringBuffer getContent();
that returns raw, textual content if applicable. Thereby, the surrounding components (such as content-type determination and automatic handling/passing of files to indexers) can be used from the available Lius code should somebody want to use Lius for text extraction only.

--
.O. Jens Fendler
..O jf...@te...
OOO http://www.teamskill.de |
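The proposed ContentExtractor interface might look roughly like the sketch below. Only the interface name and the getContent() signature come from the message above; PlainTextIndexer, MetadataOnlyIndexer and the extract helper are hypothetical, added to show how a caller could treat the interface as optional.

```java
// Interface as proposed in the message: optional raw-text access for an indexer.
interface ContentExtractor {
    StringBuffer getContent();
}

// Hypothetical indexer that can expose its text.
class PlainTextIndexer implements ContentExtractor {
    private final String text;

    PlainTextIndexer(String text) { this.text = text; }

    @Override
    public StringBuffer getContent() {
        return new StringBuffer(text);
    }
}

// Hypothetical indexer that chooses not to implement ContentExtractor.
class MetadataOnlyIndexer { }

class ExtractionDemo {

    // Returns the raw text if the indexer supports extraction, otherwise null.
    static String extract(Object indexer) {
        if (indexer instanceof ContentExtractor) {
            return ((ContentExtractor) indexer).getContent().toString();
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(extract(new PlainTextIndexer("hello lius")));
        System.out.println(extract(new MetadataOnlyIndexer()));
    }
}
```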
From: Rida B. <bi...@he...> - 2005-06-13 10:25:29
|
Hi Jens,

It sounds good, and I think that we can start the new development. I have some questions. First, if we replace LiusConfig with Commons Configuration, we also have to consider the LiusFields. Currently the user can configure the Lucene fields using the config file; how will this work with the new config? I think that we have to start a discussion about the XML structure of the new config.

Also, we must finalize the indexer class by replacing the getDocument method with an index method. As for addCustomFields(List liusFieldsOrLuceneFields), this method can be very useful when the user wants to add some custom fields that Lius can't handle automatically, for example a URL path.

Finally, the method getContent(Object ressource) will allow users to get the content of the document that they index. This can be very useful when the user wants to store the content outside the index. Lucene recommends storing big content outside the index; this externally stored content could be used, for example, for highlighting.

Also, we can have a new kind of Lius user who will use Lius as a text extraction API. |
From: Jens F. <jf...@te...> - 2005-06-10 12:27:52
|
Hi..

ACC looks pretty straightforward to me. It does lack some nice functions that e.g. JDOM would offer, but those can be worked around without too much effort.

I'm providing some sample code showing what a new ConfigBuilder / Module interaction could look like..

--- within the module (e.g. TestIndexer) ---
public void setup( org.apache.commons.configuration.Configuration c ) {
    // fetch options for the specific component
    // e.g. encoding:
    String enc = c.getString( "encoding" );

    // register this component with the factory (however exactly this will look)
    IndexerFactory.register( this );
}
---

--- within ConfigBuilder ---
Configuration config = new XMLConfiguration( "test.xml" );
List moduleKeys = config.getList( "modules.use" );
for ( Iterator i = moduleKeys.iterator(); i.hasNext(); ) {
    String mId = (String) i.next();
    String mKey = "modules." + mId;
    String className = config.getString( mKey + ".class" );
    Configuration options = config.subset( mKey + ".options" );
    LiusModule module = (LiusModule) Class.forName( className ).newInstance();
    module.setup( options );
}
---

--- the XML config for this code could look like this..
<?xml version="1.0" encoding="UTF-8"?>
<lius>
  <modules>
    <use>FilenameFilter1</use>
    <use>TestIndexer1</use>
    <FilenameFilter1>
      <class>de.teamskill.lius.FilenameFilter</class>
      <options>
        <!-- the options configuration subset is passed on to the module for configuration -->
        <accept>*.txt</accept>
        <accept>*.text</accept>
        <accept>*.asc</accept>
      </options>
    </FilenameFilter1>
    <TestIndexer1>
      <class>de.teamskill.lius.TestIndexer</class>
      <options>
        <encoding>ISO-8859-1</encoding>
      </options>
    </TestIndexer1>
  </modules>
</lius>
---

What do you think?
If it's OK with you, I'd like to start working on this interface ASAP..

--
.O. Jens Fendler
..O jf...@te...
OOO http://www.teamskill.de |
From: Rida B. <bi...@he...> - 2005-06-08 00:42:45
|
Hi, I don't know the Apache Commons Configuration library. If the API offers good functionality, it will be a good idea to use it. We also have to consider that the current size of Lius is 7 MB and that, in the upcoming version, we will support other indexers that come with a lot of libraries. Rida. |
From: Jens F. <jf...@te...> - 2005-06-07 09:57:36
|
Hi all..

I've just downloaded Commons Config to play around with it a little..
I noticed one thing that might be a small drawback: (at least in version 1.1) it has almost 4 MB of external dependencies, most of which we don't have in the lius libs yet.

do we call it acceptable?

--
.O. Jens Fendler
..O jf...@te...
OOO http://www.teamskill.de |