Heritrix: Internet Archive Web Crawler

Tracker: Feature Requests

5 (contrib) Provide Windows batch file version of scripts - ID: 1514538
Last Update: Comment added ( karl-ia )

It's not that hard to convert the shell scripts in the
bin directory to Windows batch files and it would make
using heritrix on Windows much easier.

Attached is a partial conversion of the bin/heritrix
startup script into a batch file. It must be run from
"cmd /v" to enable dynamic variable substitution (to
build classpath from dir of jars). And, it requires
that JAVA_HOME is set. It is only a partial conversion
because it does not support JMX, so it will not work
unless the JMX_OFF environment variable is set.
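
To show why "cmd /v" matters, here is a minimal sketch (not the attached script) of building a classpath from a directory of jars. HERITRIX_HOME, the lib subdirectory and the file name build_cp.bat are assumptions for illustration only.

```batch
@echo off
rem Sketch only. Run as:  cmd /v:on /c build_cp.bat
rem HERITRIX_HOME and its lib subdirectory are hypothetical here.
set "CP="
for %%J in ("%HERITRIX_HOME%\lib\*.jar") do (
    rem !CP! is read on every iteration (delayed expansion).
    rem %CP% would be expanded once, when the whole loop is parsed.
    if defined CP (set "CP=!CP!;%%~fJ") else (set "CP=%%~fJ")
)
echo %CP%
```

Without delayed expansion the !CP! text is taken literally, so the loop cannot accumulate the jar list. Note that paths containing "!" are not handled by this sketch.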

I suggest someone completes it and includes it
(hopefully with conversions of the other scripts) in an
upcoming release.

Thanks,
Eric Jensen


Submitted: Eric C. Jensen ( ecjensen ) - 2006-06-29 18:10
Priority: 5
Status: Closed
Resolution: None
Assigned: Nobody/Anonymous
Category: scripts
Group: 1.10.0
Visibility: Public


Comments ( 17 )

Date: 2007-03-14 01:48
Sender: karl-ia


This issue is now discussed in the new JIRA tracker at
http://webteam.archive.org/jira/browse/HER-1018 -- please add further
comments at that location.


Date: 2006-09-01 16:40
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Below is a note from Eric confirming the scripts work for him.

Committing (see commit message below). Closing since tested
by a 3rd party (and we don't do windows).

Date: Fri, 1 Sep 2006 10:53:39 -0500
From: Eric <ej@--->
To: Michael Stack <stack--->
Subject: Re: [ archive-crawler-Feature Requests-1514538 ]
(contrib) Provide Windows batch file version of scripts
Message-ID: <20060901155339.GA7107@duvel.ir.iit.edu>

"....The new windows scripts work fine for me..."

Committed new scripts with below comment.


Finish up '[ 1514538 ] (contrib) Provide Windows batch file
version of scripts'. Initial contribution by Eric C. Jensen.
This patch contributed by Max Schöfmann (schoefmax). Fixes
handling of spaces in path names. Also adds windows versions
of other Heritrix bash scripts.

Here are other remarks by Max:

Additionally I fixed some other things:
- command line arguments like --scratch=somedir work now
  (windows separates this into "--scratch" and "somedir" and
  the "=" gets lost)
- The original classpath is restored afterwards

* maven.xml
    Copy over new windows versions of bash scripts.
* src/scripts/heritrix.cmd
    Fixes for spaces in paths and replacement for
    non-existent sleep function.
* src/scripts/arcreader.cmd
* src/scripts/extractor.cmd
* src/scripts/foreground_heritrix.cmd
* src/scripts/htmlextractor.cmd
    Added.



Date: 2006-08-31 22:48
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Mind testing, Eric?


Date: 2006-08-31 14:27
Sender: schoefmax

Logged In: YES
user_id=1585400

Ok, I've fixed that now (hopefully).
And I found out sleep is indeed not a native windows
functionality :-)
It just happened that all three machines I tested the
scripts on had either Cygwin or the Windows Resource
Kit installed. Both come with a "sleep.exe"...
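
A common workaround on Windows versions without a sleep command is to use ping as a timer. The sketch below shows only that idiom; it is not necessarily the replacement used in heritrix.cmd, and the 5-second delay is an arbitrary example value.

```batch
@echo off
rem Sketch only: not necessarily what heritrix.cmd does.
rem ping waits about one second between echo requests, so -n 6 against
rem the loopback address takes roughly 5 seconds. Output is discarded.
ping -n 6 127.0.0.1 >nul
echo done waiting
```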

Additionally I fixed some other things:
- command line arguments like --scratch=somedir work now
(windows separates this into "--scratch" and "somedir" and
the "=" gets lost)
- The original classpath is restored afterwards
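
The "=" problem can be reproduced with a tiny script. cmd.exe treats "=" (like space, comma and semicolon) as a delimiter when filling %1, %2 and so on, while %* keeps the argument text as typed. The script name showargs.cmd is hypothetical, and passing %* through is only one possible workaround, not necessarily the fix used in heritrix.cmd.

```batch
@echo off
rem showargs.cmd (hypothetical name). Try:  showargs.cmd --scratch=somedir
rem %1 and %2 are split at the "=", so the "=" itself is lost.
echo first argument:  %1
echo second argument: %2
rem %* expands to the argument text as typed, "=" included.
echo all arguments:   %*
```

A wrapper that only forwards its arguments to java can use %* and avoid the split entirely; a script that has to inspect individual options must rejoin the pieces itself.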



Date: 2006-08-30 18:58
Sender: schoefmax

Logged In: YES
user_id=1585400

Oh yes, spaces... I will fix that tomorrow and test with a
path with spaces (too stupid I forgot that).
The other scripts do basically just set the CLASS_MAIN var
and then call heritrix.cmd, that's why they also don't work.
But "sleep" not working on your machine would be weird.
Maybe some sort of side effect of the other error.

Thanks for the quick feedback


Date: 2006-08-30 18:33
Sender: ecjensen

Logged In: YES
user_id=705615

These new commands don't work for me on my xp pro laptop.
They have problems because I'm running them from a directory
with spaces in the name. The heritrix.cmd will still start
a working heritrix, but since I'm running it from
C:\Documents and Settings\... it says "'\Documents' is not
recognized as an internal or external command", etc. For
some reason it also says "'sleep' is not recognized as an
internal or external command" even though that's not in my
path...must be a keyword you're using. The other commands
won't run at all...they say "'\Documents' is not recognized
as an internal or external command" and just exit with no
other output.
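
Errors of this kind are typical of unquoted paths, since cmd.exe splits an unquoted command at the first space. A minimal sketch of the usual quoting pattern follows. It assumes JAVA_HOME is set (as the original script requires) and is not taken from the attached scripts.

```batch
@echo off
rem Sketch only: assumes JAVA_HOME is set, possibly to a path with spaces.
rem Not taken from heritrix.cmd.

rem %~dp0 is the drive and directory of this script, with a trailing backslash.
set "SCRIPT_DIR=%~dp0"

rem Quoting the whole assignment keeps stray trailing spaces out of the value.
set "JAVACMD=%JAVA_HOME%\bin\java.exe"

rem Quote paths wherever they are used, so a space does not split them.
if not exist "%JAVACMD%" (
    echo Cannot find "%JAVACMD%"
    exit /b 1
)
"%JAVACMD%" -version
echo Running from "%SCRIPT_DIR%"
```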


Date: 2006-08-30 18:19
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Eric:

Any chance of your trying the attached scripts?


Date: 2006-08-30 13:00
Sender: schoefmax

Logged In: YES
user_id=1585400

I've now converted the other scripts as well and expanded
the heritrix.cmd a little bit more (more java detection,
fake back/foreground option - as the other scripts set the
FOREGROUND var):

http://www.cip.ifi.lmu.de/~schoefma/howto/run_heritrix_on_windows/heritrix_win_cmds.tar.gz

So far I've only tested the other scripts far enough to confirm
that they print the right usage/help messages...

If they also work for Eric, then I think you can close this
issue.

The single files are also uploaded:
http://www.cip.ifi.lmu.de/~schoefma/howto/run_heritrix_on_windows/heritrix.cmd
http://www.cip.ifi.lmu.de/~schoefma/howto/run_heritrix_on_windows/foreground_heritrix.cmd
http://www.cip.ifi.lmu.de/~schoefma/howto/run_heritrix_on_windows/arcreader.cmd
http://www.cip.ifi.lmu.de/~schoefma/howto/run_heritrix_on_windows/extractor.cmd
http://www.cip.ifi.lmu.de/~schoefma/howto/run_heritrix_on_windows/htmlextractor.cmd



Date: 2006-08-29 20:00
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Thanks Max. I committed your changes with the comment below
(Eric gave them the thumbs up). Please test to ensure all
works (I'm unable to).

Going by your remarks, it looks like I should close this
issue. It seems as though the script can't really get any
better than its current state? What do you think?

Any chance you want to take a look at this windows issue:
https://sourceforge.net/tracker/index.php?func=detail&aid=1436241&group_id=73833&atid=539099
There is more info on the issue in the FAQ (look for windows).

Thanks again.


More on [ 1514538 ] (contrib) Provide Windows batch file
version of script
Improvements by Max Schöfmann (schoefmax at sf dot net)
From notes by Max:
- Command extensions and variable expansion are automatically
  enabled (no need to run cmd /v /c heritrix.bat anymore)
- JMX configuration fixed (not the fancy "sed" stuff however.
  This may be difficult, if not impossible without extra
  software on windows)
- Automatically tries to set permissions of the JMX password
  file if Heritrix fails to start and JMX is enabled (should
  even work on XP Home Edition)
- A few more minor improvements: Comments changed from "rem"
  to "::" and file renamed to .cmd (to make it clear that it's
  an NT script and won't work on Win 9x...)
* maven.xml
    Script name changed from heritrix.bat to heritrix.cmd.
* src/scripts/heritrix.cmd
    Rename of ...
* src/scripts/heritrix.bat
    Removed.



Date: 2006-08-28 14:31
Sender: schoefmax

Logged In: YES
user_id=1585400

Hi,

I made a slightly improved version of the windows script:
http://www.cip.ifi.lmu.de/~schoefma/howto/run_heritrix_on_windows/heritrix.cmd
(I can't attach files here)

The changes are:
- Command extensions and variable expansion are
automatically enabled (no need to run cmd /v /c heritrix.bat
anymore)
- JMX configuration fixed (not the fancy "sed" stuff
however. This may be difficult, if not impossible without
extra software on windows)
- Automatically tries to set permissions of the JMX password
file if Heritrix fails to start and JMX is enabled (should
even work on XP Home Edition)
- A few more minor improvements
- Comments changed from "rem" to "::" and file renamed to
.cmd (to make it clear that it's an NT script and won't work
on Win 9x...)
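
Enabling both features from inside a script is usually done with setlocal. A minimal sketch, assuming nothing about the actual contents of heritrix.cmd (the COUNT loop is only there to show that delayed expansion is active):

```batch
@echo off
rem Sketch only: enable command extensions and delayed expansion for this
rem script; the settings are undone at endlocal or when the script ends.
setlocal EnableExtensions EnableDelayedExpansion
set "COUNT=0"
for %%N in (a b c) do (
    set /a COUNT+=1
    rem !COUNT! reflects the value updated inside the loop.
    echo seen !COUNT!
)
endlocal
```

With this header the caller no longer has to start the script through "cmd /v".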

Max


Date: 2006-08-04 18:13
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Committed with below comment. Leaving open at Eric's
suggestion.

Partial implementation of [ 1514538 ] (contrib) Provide
Windows batch file
version of scripts
Contributed by Eric C. Jensen - ecjensen at
users.sourceforge.net
* src/scripts/heritrix.bat
First cut at an incomplete, unsupported, batch script.



Date: 2006-08-04 17:39
Sender: ecjensen

Logged In: YES
user_id=705615

Just tested it, works great. Disclaimer's fine by me.


Date: 2006-08-04 17:11
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

My fault. I uploaded wrong bat file. Here is the
difference and I've reattached the proper bat file for you
to try. Thanks Eric.

stack@debord:~/workspace$ diff /home/stack/heritrix.bat !$
diff /home/stack/heritrix.bat heritrix/src/scripts/heritrix.bat
1c1,17
< rem This script launches the heritrix crawler.
---
> rem This script launches the heritrix crawler on windows. While Heritrix
> rem is unsupported on windows, see 2.1.1.3 in the User Manual
> rem [http://crawler.archive.org/articles/user_manual.html], this script was
> rem provided by Eric Jensen as rem a convenience to the windows-afflicted.
> rem
> rem It is a direct translation of the heritrix linux wrapper script -- and
> rem because windows is not supported on Heritrix, it will likely lag the unix
> rem start script. It is also incomplete; the JMX setup needs finishing.
> rem That said, it should be sufficent to get a windows user up and running
> rem using Heritrix.
> rem
> rem To run, JAVA_HOME and JMX_OFF environment variables must be set and the
> rem script must be run using 'cmd /v'. See
> rem
> rem See https://sourceforge.net/tracker/index.php?func=detail&aid=1514538&group_id=73833&atid=539102
> rem




Date: 2006-08-04 14:50
Sender: ecjensen

Logged In: YES
user_id=705615

I don't see the disclaimer in the batch file you uploaded.
In any case, this should remain open until someone finishes
converting the other parts of this script and the rest from
the bin dir.


Date: 2006-07-25 15:57
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

Eric. I added disclaimer at head of script. Would you
check it out and make sure script still runs. If so, I'll
commit. Thanks.


Date: 2006-07-17 20:08
Sender: ecjensen

Logged In: YES
user_id=705615

No, I don't run into any problems on Windows XP Professional
using "cmd /v" (as is required).

Disclaim away, it'd be good to have it checked in. But this
feature request should remain open until someone translates
the rest of this script and the others...it's not hard work,
I just don't have time.

Thanks,
eric.


Date: 2006-07-17 19:52
Sender: stack-sf (Project Admin)

Logged In: YES
user_id=924942

This script looks great, Eric. It works for you without
issue? In the past, the CLASSPATH could grow longer than some
windows string size limit, so its tail would get dropped.
You don't run into this issue? (I don't know much about
windows batch files).

If I was going to check this in, I'd add a disclaimer saying
windows is unsupported and script is a sample only, use at
your own risk, etc. That ok with you? And, I'd add you as
author?


Attached File ( 1 )

Filename      Description
heritrix.bat  Reattempt at upload

Changes ( 10 )

Field              Old Value                                             Date              By
artifact_group_id  0.10.0                                                2006-09-08 03:13  gojomo
status_id          Open                                                  2006-09-01 16:40  stack-sf
close_date         -                                                     2006-09-01 16:40  stack-sf
artifact_group_id  1.8.0                                                 2006-08-04 18:13  stack-sf
summary            should provide Windows batch file version of scripts  2006-08-04 18:13  stack-sf
File Deleted       186188:                                               2006-08-04 17:39  ecjensen
File Added         187644: heritrix.bat                                  2006-08-04 17:11  stack-sf
File Deleted       183288:                                               2006-08-04 14:50  ecjensen
File Added         186188: heritrix.bat                                  2006-07-25 15:57  stack-sf
File Added         183288: heritrix.bat                                  2006-06-29 18:10  ecjensen