Offnet
===================

Offnet is an open source tool for mirroring web pages. It lets you manage several snapshots of web pages in order to retain the newest as well as older versions. It also has file-based deduplication features to store content efficiently. The application comes with an integrated web server so that you can navigate the captured content directly in your web browser.

Please refer to http://sourceforge.net/p/offnetserver/ for extensive documentation and information.

VERSION
=======

0.2.0 (2017-10-01, alpha version)

AUTHORS AND CONTRIBUTORS
========================

Application developer: Andreas Schwarzkopf. For contact, refer to the URL above.

LICENSE
=======

Offnet is licensed under the terms of the GPLv3. See "COPYING".

CONTACT
=======

If you have questions, comments, or problems, or find bugs, contact us using:

* The application's forum on http://sourceforge.net/p/offnetserver/

INSTALLING
==========

End users have to install several dependencies in order to use Offnet.

* Install a Java Runtime Environment.
  * Windows: refer to the downloads at www.oracle.com
    * You can disable Java content for web pages in the Java settings.
  * Linux: install the package openjdk-7-jre
* Install a 7-Zip variant (use a 64-bit build if possible):
  * Windows: install the corresponding binaries from www.7-zip.org
  * Linux: install p7zip-full
* Configure Offnet:
  * Copy the runnable JAR file to a folder of your choice. It is a standalone executable file.
  * If you use Linux, open the JAR file properties using e.g. Dolphin or Nautilus and mark it as executable.
  * To start it from the command line, use: java -jar Offnet.jar
  * After launching, the application will prompt you to open it in your web browser.
  * Change the 7-Zip archiver command path and select the binary used for opening the archives. On Windows, set the path to 7z.exe. On Linux, the default setting may remain if archives can be read that way.

BUILDING
========

As you are currently reading this README, we presume that you already have the source. If not, check http://sourceforge.net/p/offnetserver/. You can download source archives there.

Required Dependencies:
----------------------

* openjdk-7-jdk (>= 7)
* easyfoldermorpher (>= 0.77)
* maven (>= 3.0.0)
* eclipse
* wicket (>= 6.19.0) (retrieved through Maven)
* p7zip-full (>= 9.20)

Compilation:
------------

If you have installed all required third-party libs, you can compile Offnet using the runnable JAR export of Eclipse:

* Copy the Offnet and Easy Folder Morpher (git clone git://git.code.sf.net/p/easyfoldermorpher/code easyfoldermorpher-code) project folders to your Eclipse workspace folder and import them as Maven projects.
  * Each project folder contains "src", COPYING, README, changelog, etc.
* Make sure that the Eclipse integration for Maven is installed (see the Eclipse Marketplace).
* Import Offnet as a Maven project.
* Open a console in the Easy Folder Morpher folder and run: mvn clean install
* Open a console in the Offnet project folder and run: mvn clean compile assembly:single
* You need to export no other files to make the program run. The configuration file is created when the program is launched. Apart from the Java runtime and 7-Zip, the JAR file is the only file required for execution.

BINARY DISTRIBUTION
===================

If you have received Offnet in binary form, you can always acquire the code at http://sourceforge.net/p/offnetserver/ or from your distributor. The binary packages we provide may include open source, third-party libraries. If so, their licenses are redistributed along with the libraries. I did not modify the sources. However, you can get the sources of these libraries by contacting me.

USING
=====

After starting Offnet, it will open its URL in your web browser automatically. More information is available online at www.sourceforge.net/p/offnetserver/

USAGE EXAMPLES
==============

Main principles:

* Web site captures are managed using projects.
* Projects are created in the project manager and can be nested in folders.
* You can start a capture from the project editor.
* You can make several snapshots of a web site so that you can retain the newest as well as older versions of your desired web site copy.
* It is possible to navigate through captured content while you are offline.

These usage examples assume that you have set up the environment as described in the section INSTALLING.

A. Creating web site capture projects

1. Go to the project manager from the main page and create a new project in the folder of your choice.
2. Enter at least one initial URL so that the program knows where to start saving.
3. Do not forget to define the URL scope (e.g. "domain.com" with sub domains allowed, to cover "www.domain.com/page.html"). At least one entry is needed. Every new web page file is checked against the scopes, and only URLs that correspond to at least one scope are enqueued for download. Web content (e.g. images) is not restricted by the URL scopes.
4. Enter an appropriate iteration count; too large a number may keep the process from ever finishing. The crawling is iteration based: before the start, the initial URLs are put into the queue; in each iteration, every queued URL is opened, stored, and searched for web links, which are put into the queue of the next iteration.
5. Sometimes there are URL patterns that make the file count grow exponentially with each iteration. Regex terms (regular expressions) can be entered to exclude them. Examples:
   * Skipping a page including its parameters: (.*)a.b.com/stuff.html(.*)
   * Skipping pages with the particular parameter values 2-4: (.*)a.b.com/stuff.html[?]param=(2|3|4)
   The leading (.*) is necessary to cover the protocol prefix. A sketch of how scopes, exclusions, and iterations interact follows after this list.
6. To finish the process, either wait for all iterations to complete or pause the process. Then you can either deploy or revert the content. Refresh the page to check whether the snapshot is still finalizing.
7. Please close the application only from the system settings section. This ensures that all jobs finish as intended.
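For orientation, here is a minimal, hypothetical Java sketch of how the initial URLs, the URL scope, the exclusion regexes, and the iteration count from steps 2 to 5 play together. All class and method names, as well as the example scope and URLs, are invented for illustration; they are not taken from the actual Offnet source.

    import java.util.ArrayDeque;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Queue;
    import java.util.Set;
    import java.util.regex.Pattern;

    public class CrawlSketch {

        // Step 3: a link is only enqueued if it corresponds to at least one
        // scope entry; step 5: unless an exclusion regex matches it.
        static boolean accepted(String url, List<String> scopes,
                List<Pattern> excludes) {
            for (Pattern p : excludes) {
                if (p.matcher(url).matches()) {
                    return false;                    // excluded by a regex term
                }
            }
            for (String scope : scopes) {
                if (url.contains(scope)) {
                    return true;                     // inside the URL scope
                }
            }
            return false;
        }

        public static void main(String[] args) {
            List<String> scopes = Arrays.asList("a.b.com");
            // The example pattern from step 5; the leading (.*) covers the
            // protocol prefix such as "http://".
            List<Pattern> excludes =
                    Arrays.asList(Pattern.compile("(.*)a.b.com/stuff.html(.*)"));

            int iterationCount = 3;                  // step 4
            Set<String> stored = new HashSet<String>();
            Queue<String> queue = new ArrayDeque<String>();
            queue.add("http://a.b.com/index.html");  // initial URL, step 2

            for (int i = 0; i < iterationCount && !queue.isEmpty(); i++) {
                Queue<String> nextIteration = new ArrayDeque<String>();
                for (String url : queue) {
                    if (!stored.add(url)) {
                        continue;                    // already opened and stored
                    }
                    // fetchAndArchive(url);         // placeholder for the download
                    for (String link : extractLinks(url)) {
                        if (accepted(link, scopes, excludes)) {
                            nextIteration.add(link);
                        }
                    }
                }
                queue = nextIteration;               // feeds the next iteration
            }
        }

        // Placeholder: the real application parses the stored page for links.
        static List<String> extractLinks(String url) {
            return Arrays.asList();
        }
    }

With these example values, http://a.b.com/stuff.html?param=1 lies within the scope but would still be skipped, because the exclusion pattern matches it including the protocol prefix.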
B. Managing snapshots

1. Downloaded content is managed per snapshot. This means you can retain the newest as well as a very old web site copy and navigate through the copies of all timestamps.
2. You can also delete snapshots, e.g. ones without significant changes. The data is organized as follows: snapshots point to web meta folder structure files, which hold only file checksum information; the last component is the archive context, which stores the actual files. The deduplication is based on entire files (see the sketch after these examples). Deleting a snapshot removes only the snapshot meta files. Web meta files and archived files that are no longer referenced are removed by the cleanup features in the data integrity check section (system settings).
3. You can browse the web sites of older snapshots.

C. Browsing through content

1. Either open the web page query from the main web application page and enter a URL that is already stored,
2. or open a snapshot to directly open one of its initial URLs.
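The following is a minimal, hypothetical Java sketch of the whole-file deduplication described in step B.2. The map stands in for the archive context, and the returned checksum is what a web meta file would record. SHA-256 is an assumption made for the example; the README does not state which checksum Offnet actually uses.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.HashMap;
    import java.util.Map;

    public class DedupSketch {

        // Stands in for the archive context: checksum -> archived file content.
        private final Map<String, byte[]> archive = new HashMap<String, byte[]>();

        // Returns the checksum that a web meta file would record for the given
        // file. Identical content is stored in the archive only once.
        String store(Path file) throws IOException, NoSuchAlgorithmException {
            byte[] content = Files.readAllBytes(file);
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(content);
            String checksum = toHex(digest);
            if (!archive.containsKey(checksum)) {
                archive.put(checksum, content);      // whole-file deduplication
            }
            return checksum;
        }

        private static String toHex(byte[] bytes) {
            StringBuilder sb = new StringBuilder();
            for (byte b : bytes) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        }
    }

Deleting a snapshot then only drops the meta files holding such checksums; an archived file becomes removable once no meta file references its checksum any more, which is what the cleanup in the data integrity check performs.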
DEVELOPING
==========

This tool is still fairly lightweight, so it can be altered if you require mechanisms with a guaranteed, exact ordering of processes or have a special purpose in mind.