Offnet
===================

Offnet is an open source tool for mirroring web pages. It lets you manage several snapshots of web pages in order to retain the newest as well as older versions. It also has file-based deduplication features to store content efficiently. The application comes with an integrated web server so that you can navigate the captured content directly in your web browser.

Please refer to http://sourceforge.net/p/offnetserver/ for extensive documentation and information.

VERSION
=======

0.2.0 (2017-10-01, alpha version)

AUTHORS AND CONTRIBUTORS
========================

Application developer: Andreas Schwarzkopf. For contact, refer to the URL above.

LICENSE
=======

Offnet is licensed under the terms of the GPLv3. See "COPYING".

CONTACT
=======

If you have questions, comments, or problems, or find bugs, contact us using:

* The application's forum on http://sourceforge.net/p/offnetserver/

INSTALLING
==========

End users have to install several dependencies in order to use Offnet.

* Install a Java Runtime Environment.
  * Windows: refer to the downloads at www.oracle.com
    * You can disable Java content for web pages in the Java settings.
  * Linux: install the package openjdk-7-jre
* Install a 7-Zip variant (use a 64-bit build if possible):
  * Windows: install the corresponding binaries from www.7-zip.org
  * Linux: install p7zip-full
* Configure Offnet:
  * Copy the runnable JAR file to a folder of your choice. It is a standalone executable file.
  * If you use Linux, open the JAR file properties using e.g. Dolphin or Nautilus and mark it as executable.
  * To start it from the command line, use: java -jar Offnet.jar
  * After launching, the application will prompt you to open it in your web browser.
  * Change the 7-Zip archiver command path and select the binary used for opening the archives. On Windows, set the path to 7z.exe. On Linux, the default setting may remain if archives can be read that way.

BUILDING
========

As you are currently reading this README, we presume that you already have the source. If not, check http://sourceforge.net/p/offnetserver/. You can download source archives there.

Required Dependencies:
----------------------

* openjdk-7-jdk (>= 7)
* easyfoldermorpher (>= 0.77)
* maven (>= 3.0.0)
* eclipse
* wicket (>= 6.19.0) (retrieved through Maven)
* p7zip-full (>= 9.20)

Compilation:
------------

If you have installed all required third-party libs, you can compile Offnet using the runnable JAR export of Eclipse:

* Copy the Offnet and Easy Folder Morpher (git clone git://git.code.sf.net/p/easyfoldermorpher/code easyfoldermorpher-code) project folders to your Eclipse workspace folder and import them as Maven projects.
  * Each project folder contains "src", COPYING, README, changelog, etc.
* Make sure that the Eclipse integration for Maven is installed (see the Eclipse Marketplace).
* Import Offnet as a Maven project.
* Open a console in the Easy Folder Morpher folder and run: mvn clean install
* Open a console in the Offnet project folder and run: mvn clean compile assembly:single
* You need to export no other files to make the program run. The configuration file is created when the program is launched. Apart from the Java runtime and 7-Zip, the JAR file is the only file required for execution.

BINARY DISTRIBUTION
===================

If you have received Offnet in binary form, you can always acquire the code at http://sourceforge.net/p/offnetserver/ or from your distributor. The binary packages we provide may include open source, third-party libraries. If so, their licenses are redistributed along with the libraries. I did not modify the sources. However, you can get the sources of these libraries by contacting me.

USING
=====

After starting Offnet, it will open its URL in your web browser automatically. More information is available online at www.sourceforge.net/p/offnetserver/

USAGE EXAMPLES
==============

Main principles:

* Web site captures are managed using projects.
* Projects are created in the project manager and can be nested in folders.
* You can start a capture from the project editor.
* You can make several snapshots of a web site so that you can retain the newest as well as older versions of your desired web site copy.
* It is possible to navigate through captured content while you are offline.

These usage examples assume that you have set up the environment as described in the section INSTALLING.

A. Creating web site capture projects

1. Go to the project manager from the main page and create a new project in the folder of your choice.
2. Enter at least one initial URL so that the program knows where to start saving.
3. Do not forget to define the URL scope (e.g. "domain.com" with sub domains allowed, to cover "www.domain.com/page.html"). At least one entry is needed. Every new web page file is checked against the scopes, and only URLs that correspond to at least one scope are enqueued for download. Web content (e.g. images) is not restricted by the URL scopes.
4. Enter an appropriate iteration count; too large a number may keep the process from ever finishing. The crawling is iteration based: before the start, the initial URLs are put into the queue; in each iteration, every queued URL is opened, stored, and searched for web links, which are put into the queue of the next iteration.
5. Sometimes there are URL patterns that make the file count grow exponentially with each iteration. Regex terms (regular expressions) can be entered to exclude them. Examples:
   * Skipping a page including its parameters: (.*)a.b.com/stuff.html(.*)
   * Skipping pages with the particular parameter values 2-4: (.*)a.b.com/stuff.html[?]param=(2|3|4)
   The leading (.*) is necessary to cover the protocol prefix. A sketch of how scopes, exclusions, and iterations interact follows after this list.
6. To finish the process, either wait for all iterations to complete or pause the process. Then you can either deploy or revert the content. Refresh the page to check whether the snapshot is still finalizing.
7. Please close the application only from the system settings section. This ensures that all jobs finish as intended.
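For orientation, here is a minimal, hypothetical Java sketch of how the initial URLs, the URL scope, the exclusion regexes, and the iteration count from steps 2 to 5 play together. All class and method names, as well as the example scope and URLs, are invented for illustration; they are not taken from the actual Offnet source.

    import java.util.ArrayDeque;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Queue;
    import java.util.Set;
    import java.util.regex.Pattern;

    public class CrawlSketch {

        // Step 3: a link is only enqueued if it corresponds to at least one
        // scope entry; step 5: unless an exclusion regex matches it.
        static boolean accepted(String url, List<String> scopes,
                List<Pattern> excludes) {
            for (Pattern p : excludes) {
                if (p.matcher(url).matches()) {
                    return false;                    // excluded by a regex term
                }
            }
            for (String scope : scopes) {
                if (url.contains(scope)) {
                    return true;                     // inside the URL scope
                }
            }
            return false;
        }

        public static void main(String[] args) {
            List<String> scopes = Arrays.asList("a.b.com");
            // The example pattern from step 5; the leading (.*) covers the
            // protocol prefix such as "http://".
            List<Pattern> excludes =
                    Arrays.asList(Pattern.compile("(.*)a.b.com/stuff.html(.*)"));

            int iterationCount = 3;                  // step 4
            Set<String> stored = new HashSet<String>();
            Queue<String> queue = new ArrayDeque<String>();
            queue.add("http://a.b.com/index.html");  // initial URL, step 2

            for (int i = 0; i < iterationCount && !queue.isEmpty(); i++) {
                Queue<String> nextIteration = new ArrayDeque<String>();
                for (String url : queue) {
                    if (!stored.add(url)) {
                        continue;                    // already opened and stored
                    }
                    // fetchAndArchive(url);         // placeholder for the download
                    for (String link : extractLinks(url)) {
                        if (accepted(link, scopes, excludes)) {
                            nextIteration.add(link);
                        }
                    }
                }
                queue = nextIteration;               // feeds the next iteration
            }
        }

        // Placeholder: the real application parses the stored page for links.
        static List<String> extractLinks(String url) {
            return Arrays.asList();
        }
    }

With these example values, http://a.b.com/stuff.html?param=1 lies within the scope but would still be skipped, because the exclusion pattern matches it including the protocol prefix.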
B. Managing snapshots

1. Downloaded content is managed per snapshot. This means you can retain the newest as well as a very old web site copy and navigate through the copies of all timestamps.
2. You can also delete snapshots, e.g. ones without significant changes. The data is organized as follows: snapshots point to web meta folder structure files, which hold only file checksum information; the last component is the archive context, which stores the actual files. The deduplication is based on entire files (see the sketch after these examples). Deleting a snapshot removes only the snapshot meta files. Web meta files and archived files that are no longer referenced are removed by the cleanup features in the data integrity check section (system settings).
3. You can browse the web sites of older snapshots.

C. Browsing through content

1. Either open the web page query from the main web application page and enter a URL that is already stored,
2. or open a snapshot to directly open one of its initial URLs.
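The following is a minimal, hypothetical Java sketch of the whole-file deduplication described in step B.2. The map stands in for the archive context, and the returned checksum is what a web meta file would record. SHA-256 is an assumption made for the example; the README does not state which checksum Offnet actually uses.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.HashMap;
    import java.util.Map;

    public class DedupSketch {

        // Stands in for the archive context: checksum -> archived file content.
        private final Map<String, byte[]> archive = new HashMap<String, byte[]>();

        // Returns the checksum that a web meta file would record for the given
        // file. Identical content is stored in the archive only once.
        String store(Path file) throws IOException, NoSuchAlgorithmException {
            byte[] content = Files.readAllBytes(file);
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(content);
            String checksum = toHex(digest);
            if (!archive.containsKey(checksum)) {
                archive.put(checksum, content);      // whole-file deduplication
            }
            return checksum;
        }

        private static String toHex(byte[] bytes) {
            StringBuilder sb = new StringBuilder();
            for (byte b : bytes) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        }
    }

Deleting a snapshot then only drops the meta files holding such checksums; an archived file becomes removable once no meta file references its checksum any more, which is what the cleanup in the data integrity check performs.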
DEVELOPING
==========

This tool is still fairly lightweight, so it can be altered if you require mechanisms with a guaranteed, exact ordering of processes or have a special purpose in mind.