DocFetcher / Feature Requests / #29 Web interface

Nam-Quang Tran - 2010-01-23

Overview of the relevant methods in DocFetcher:

When the user enters a query in the search box and presses the Enter key, net.sourceforge.docfetcher.DocFetcher.doSearch(String) is called. This leads to net.sourceforge.docfetcher.model.ScopeRegistry.search(String), which returns an array containing the result objects.

Each result is represented as a ResultDocument object, which has various fields containing information about the result object:
- score
- title
- the query that led to the result
It also inherits some fields from the Document class, e.g.
- file
- author
- which parser was used for text extraction

For the text preview, you'll need to know which parser to use in order to extract text from the document.
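
To make the data flow above a bit more concrete, here is a minimal sketch of how a web layer might turn result objects into an HTML list. The nested `Result` class is a hypothetical stand-in for `ResultDocument`, not DocFetcher's actual class: its field names and types are assumptions based only on the list above, and `ResultRenderer` with its methods is invented for illustration.

```java
import java.util.List;
import java.util.Locale;

final class ResultRenderer {
    /** Hypothetical stand-in for ResultDocument; field names and types are assumptions. */
    static final class Result {
        final float score;       // relevance score of the hit
        final String title;      // document title
        final String query;      // the query that led to this result
        final String file;       // file path (inherited from Document in DocFetcher)
        final String author;     // author (inherited from Document)
        final String parserName; // parser used for text extraction (inherited from Document)

        Result(float score, String title, String query,
               String file, String author, String parserName) {
            this.score = score;
            this.title = title;
            this.query = query;
            this.file = file;
            this.author = author;
            this.parserName = parserName;
        }
    }

    /** Escapes the characters that are unsafe inside HTML text. */
    static String escapeHtml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** Renders a single result as one HTML list item. */
    static String toHtml(Result r) {
        return "<li>" + escapeHtml(r.title)
                + " (" + escapeHtml(r.file) + ")"
                + ", score " + String.format(Locale.ROOT, "%.2f", r.score)
                + ", parser " + escapeHtml(r.parserName)
                + "</li>";
    }

    /** Renders all results as an HTML unordered list. */
    static String toHtmlList(List<Result> results) {
        StringBuilder sb = new StringBuilder("<ul>\n");
        for (Result r : results) {
            sb.append(toHtml(r)).append('\n');
        }
        return sb.append("</ul>").toString();
    }

    public static void main(String[] args) {
        Result r = new Result(0.5f, "Report", "report", "/docs/report.pdf", "Jane", "PdfParser");
        System.out.println(toHtmlList(List.of(r)));
    }
}
```

The parser name is carried along because the text preview needs to know which parser to use for the document.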

Nam-Quang Tran - 2010-03-27

1) Purpose of the web interface:
The purpose of the web interface is to allow DocFetcher to be used in a network so that one computer becomes the "search server" and provides Google-like access to a central document repository to all the other computers in the network. This is what you would want in an enterprise or in a school network, or something like that. (In fact, I once received an e-mail from a school, asking me how to do that.)
A web interface also comes in handy if you're at work and want to access the documents on your home computer, and vice versa.

2) Web interface from other programs:
Beagle, one of DocFetcher's (many) competitors, already has a web interface; you can check it out if you want. (AFAIK, Beagle only runs on Linux.)
VLC also has a web interface, but the configuration is a little bit clumsy, IMO.

3) DocFetcher's web interface:
For starters, a hotkey for turning the web interface on and off should suffice. After the basic functionality is implemented, we can wrap a nicer user interface around it. See, for example, the web interface dialog for Transmission (a BitTorrent client), in the attachment below.
As for the web interface inside the browser, I hope we can make it look (and "feel") as similar to DocFetcher's desktop interface as possible - unless it's too difficult on the technical side.

4) Implementation:
We'll probably need Jetty, an embedded HTTP server for Java. Here's a page that describes how to set it up: http://wiki.eclipse.org/Jetty/Tutorial/Embedding_Jetty
(Actually, I have no idea what this server/HTTP/whatever stuff is all about, that's why I posted the job offer ;-))
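
As a rough illustration of what "embedding an HTTP server" means, here is a minimal sketch. Note that it does not use Jetty: to stay free of external dependencies it uses the JDK's built-in `com.sun.net.httpserver` package, and a Jetty-based version would follow the linked tutorial instead. The `/search` path, the `q` parameter, and the `runSearch` placeholder are all assumptions made for this example, not part of DocFetcher.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

final class WebInterfaceSketch {
    /** Extracts the decoded value of the "q" parameter from a raw query string, or "" if absent. */
    static String extractQuery(String rawQuery) {
        if (rawQuery == null) {
            return "";
        }
        for (String pair : rawQuery.split("&")) {
            if (pair.startsWith("q=")) {
                return URLDecoder.decode(pair.substring(2), StandardCharsets.UTF_8);
            }
        }
        return "";
    }

    /** Placeholder for the real search call; the actual wiring into DocFetcher is not shown. */
    static String runSearch(String query) {
        return "Results for: " + query + "\n";
    }

    /** Handles one request: reads the query, runs the search, writes a plain-text response. */
    static void handle(HttpExchange exchange) throws IOException {
        String query = extractQuery(exchange.getRequestURI().getRawQuery());
        byte[] body = runSearch(query).getBytes(StandardCharsets.UTF_8);
        exchange.getResponseHeaders().set("Content-Type", "text/plain; charset=utf-8");
        exchange.sendResponseHeaders(200, body.length);
        try (OutputStream out = exchange.getResponseBody()) {
            out.write(body);
        }
    }

    /** Starts the embedded server on the given port and returns it so the caller can stop it. */
    static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/search", WebInterfaceSketch::handle);
        server.start();
        return server;
    }
}
```

Calling `WebInterfaceSketch.start(8080)` from the application's startup code and `stop(0)` on the returned server would correspond to turning the web interface on and off.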

Nam-Quang Tran - 2010-03-27

Transmission Web Interface Configuration Dialog

Transmission-Webinterface.png

MasterWizely - 2010-03-27

I will answer using your numbering and add my own numbers for any new questions or items (hopefully this makes the answers easier to trace).

RE: 1) Purpose of the web interface
If it is planned to be used in some enterprise network, what about security? I think there might be some need for at least some kind of authorization and authentication. Maybe something like usergroups and permissions might be useful.
Additionally some kind of encryption might be needed as well.

While the scenario mentioned by you includes a central instance that unifies the results of a set of observed clients, any search result might be extended by the id of the client the file was found on.

RE: 3) DocFetcher's web interface
What about a standalone server? Maybe it could be useful to split the application into a core (search, indexing and other backend stuff) and a user interface (e.g. either the SWT frontend or the webserver). Of course some kind of hybrid / combination might also be useful.
I wonder whether requiring a GUI might limit usage on enterprise servers, since they may run without an X server (or an equivalent). Text-based access and web access (and administration) might be possible there, but the GUI might not.

RE: 4) Implementation
Well, I agree that Jetty is a nice application. As you mentioned, its advantage is that it can be embedded into an application. But in fact any servlet container might do (getting a server like Tomcat working takes some more work, but I think we could offer some choice).

5) I wonder if this part: "In contrast to the desktop interface, it is neither necessary nor desirable
for the web interface to allow the user to modify the indexes (e.g.
add/remove/update/rebuild)." is still up to date. If a centralized server manages the clients the build/rebuild of indexes might be a feature that is not necessary but desirable, isn't it?

6) What about the control? What is to be done to access any index by the webinterface? How is the index created, how published or accessed by the server? And what actions are needed to get a client observed by a centralized server?

Nam-Quang Tran - 2010-03-27

> If it is planned to be used in some enterprise network,
> what about security? I think there might be some need
> for at least some kind of authorization and
> authentication. Maybe something like usergroups and
> permissions might be useful. Additionally some kind
> of encryption might be needed as well.

If I understand correctly, this issue can be addressed in the way Transmission (see attached screenshot) handles it. In the screenshot, you can see two settings: First, access can be restricted with a username and a password, and second, there's the "Only allow these IP addresses to connect" option.
I don't think we need any kind of encryption. At least I haven't seen anything like that in other web interfaces.
Also, I don't think we need user groups, because something like that can be emulated by running multiple instances of DocFetcher, each serving on a different port.
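
The two Transmission-style settings described above (username/password plus an IP allow-list) could be checked roughly as follows. This is only a sketch under assumptions: the class and method names are made up, and HTTP Basic authentication is assumed as the way the browser sends the credentials, which the thread does not specify.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;
import java.util.Set;

final class AccessControlSketch {
    private final String expectedUser;
    private final String expectedPassword;
    private final Set<String> allowedIps; // an empty set means "no IP restriction"

    AccessControlSketch(String user, String password, Set<String> allowedIps) {
        this.expectedUser = user;
        this.expectedPassword = password;
        this.allowedIps = allowedIps;
    }

    /** "Only allow these IP addresses to connect": true if the address passes the allow-list. */
    boolean isIpAllowed(String remoteIp) {
        return allowedIps.isEmpty() || allowedIps.contains(remoteIp);
    }

    /** Checks the value of an HTTP "Authorization: Basic ..." header against the credentials. */
    boolean isAuthorized(String authorizationHeader) {
        if (authorizationHeader == null || !authorizationHeader.startsWith("Basic ")) {
            return false;
        }
        String decoded;
        try {
            byte[] raw = Base64.getDecoder().decode(authorizationHeader.substring(6).trim());
            decoded = new String(raw, StandardCharsets.UTF_8);
        } catch (IllegalArgumentException e) {
            return false; // header did not contain valid Base64
        }
        int colon = decoded.indexOf(':');
        if (colon < 0) {
            return false; // expected the form "user:password"
        }
        String user = decoded.substring(0, colon);
        String password = decoded.substring(colon + 1);
        // Non-short-circuit '&' so both comparisons always run.
        return sameBytes(user, expectedUser) & sameBytes(password, expectedPassword);
    }

    /** Compares two strings via MessageDigest.isEqual instead of String.equals. */
    private static boolean sameBytes(String a, String b) {
        return MessageDigest.isEqual(a.getBytes(StandardCharsets.UTF_8),
                                     b.getBytes(StandardCharsets.UTF_8));
    }

    /** A request is accepted only if both the IP check and the credential check pass. */
    boolean mayConnect(String remoteIp, String authorizationHeader) {
        return isIpAllowed(remoteIp) && isAuthorized(authorizationHeader);
    }
}
```

Basic credentials are only Base64-encoded, not encrypted, so this scheme protects against casual access on a trusted LAN rather than against someone who can read the network traffic.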

> While the scenario mentioned by you includes a central
> instance that unifies the results of a set of observed
> clients, any search result might be extended by the id
> of the client the file was found on.

The server is basically a read-only database which each client connects to, like a LAN version of Google. I'm not sure what you mean by "unifies the results".

> What about a standalone server? Maybe it could be useful
> to split the application into a core (search, indexing and
> other backend stuff) and a user interface (e.g. either the
> SWT frontend or the webserver).

IMO, an easier solution would be to add a command line parameter that allows the user to start the web interface without the desktop interface. This is how the VLC team did it (I think). If we do it that way, I think Jetty is all we need (and note that most DocFetcher users are ordinary non-developer Windows folks who just want to search their files).
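
A sketch of what such a command line parameter might look like is given below. The flag names `--web-only` and `--port=N`, the default port, and the `LaunchOptions` class are all hypothetical; nothing here reflects an actual DocFetcher option.

```java
final class LaunchOptions {
    final boolean webOnly; // true: start only the web interface, without the desktop GUI
    final int port;        // port the web interface would listen on

    LaunchOptions(boolean webOnly, int port) {
        this.webOnly = webOnly;
        this.port = port;
    }

    /** Parses the hypothetical flags --web-only and --port=N; other arguments are ignored. */
    static LaunchOptions parse(String[] args) {
        boolean webOnly = false;
        int port = 8080; // assumed default
        for (String arg : args) {
            if (arg.equals("--web-only")) {
                webOnly = true;
            } else if (arg.startsWith("--port=")) {
                port = Integer.parseInt(arg.substring("--port=".length()));
            }
        }
        return new LaunchOptions(webOnly, port);
    }

    public static void main(String[] args) {
        LaunchOptions options = parse(args);
        if (options.webOnly) {
            System.out.println("Would start only the web interface on port " + options.port);
        } else {
            System.out.println("Would start the desktop interface");
        }
    }
}
```

With this approach the same program serves both ordinary desktop users and headless servers, without splitting the application into separate core and frontend packages.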

> I wonder if this part: "In contrast to the desktop
> interface, it is neither necessary nor desirable for the
> web interface to allow the user to modify the indexes
> (e.g. add/remove/update/rebuild)." is still up to date. If
> a centralized server manages the clients the build/rebuild
> of indexes might be a feature that is not necessary but
> desirable, isn't it?

Only the server should be allowed to perform index operations. If we allow the clients to do it, we'll get into all kinds of trouble. For example, imagine you have one server and 50 clients, and one of the clients decides, for whatever reason (stupidity, malice, etc.), to remove all indexes. Then bam! All indexes are gone, and none of the 50 clients can do any more searches.

> What about the control? What is to be done to access
> any index by the webinterface? How is the index created,
> how published or accessed by the server? And what actions
> are needed to get a client observed by a centralized
> server?

As said before, DocFetcher can be thought of as a LAN version of Google. All documents and all indexes are on the server, and the indexes are created, updated and removed by the server. A client can send a query to the server, and the server searches its indexes and returns a list of results.
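
The rule that clients may only search, while index operations stay with the server, can be written down as a small policy check. This is a sketch with made-up names (`ReadOnlyPolicy`, `Operation`, `Origin`); the operation list simply mirrors the add/remove/update/rebuild actions mentioned in the thread.

```java
import java.util.EnumSet;
import java.util.Set;

final class ReadOnlyPolicy {
    /** Operations discussed in the thread: searching plus the four index operations. */
    enum Operation { SEARCH, ADD_INDEX, REMOVE_INDEX, UPDATE_INDEX, REBUILD_INDEX }

    /** Where a request comes from: the server itself or a remote client via the web interface. */
    enum Origin { SERVER, CLIENT }

    // Clients get read-only access: searching is the only operation they may trigger.
    private static final Set<Operation> CLIENT_OPERATIONS = EnumSet.of(Operation.SEARCH);

    /** The server may do everything; a client may only perform operations in CLIENT_OPERATIONS. */
    static boolean isAllowed(Origin origin, Operation operation) {
        return origin == Origin.SERVER || CLIENT_OPERATIONS.contains(operation);
    }

    public static void main(String[] args) {
        for (Operation op : Operation.values()) {
            System.out.println(op + " from client allowed: " + isAllowed(Origin.CLIENT, op));
        }
    }
}
```

Keeping the allowed client operations in one set means a single misbehaving client cannot remove or rebuild indexes for everyone else.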

Nam-Quang Tran - 2010-03-27

What operating system are you working on?

MasterWizely - 2010-03-27

Windows XP mostly, but also SUSE Linux (Enterprise as well as openSUSE) and CentOS. Currently the Linux distributions are running on servers, although I used to use Linux mainly as my desktop OS (which I stopped for reasons of hardware compatibility).

Tonio Rush - 2010-12-19

Hello guys,

I've got the new version from SVN, and I see there's a lot done for the web interface. But I don't understand where the entry point is. Is it a special URL?

Nam-Quang Tran - 2010-12-19

Yes, there's a special URL :-) You have to launch DocFetcher, then go to localhost:8080.

Btw, I just sent you an e-mail and uploaded the Outlook extractor to a folder named 'sandbox' in the SVN repository.

Bert Bouwen - 2012-09-17

It is possible to search the DocFetcher indexes through Apache Solr.

Just merge the indexes into one ( java -cp lucene-core-3.6.1.jar:lucene-misc-3.6.1.jar org.apache.lucene.misc.IndexMergeTool ./Merged_Index ./DocFetcher/indexes/* ) and copy the merged index to Solr's data directory. Then you only need to modify Solr's schema.xml and the default search field in solrconfig.xml.

Nam-Quang Tran - 2022-09-02

status: open --> closed

Group: --> Next_Release_(example)

Nam-Quang Tran - 2022-09-02

Now that the commercial software DocFetcher Server is out, this feature request can finally be marked as resolved.
