-
After upgrading to Snow Leopard and then upgrading fink according to the directions on the fink website, I now have what appears to be a fully functional fink but a broken FinkCommander. fink list looks fine, but every time I run Update Table in FinkCommander it logs the following to the console and leaves me with a blank package list:
8/30/09 10:36:29 AM...
2009-08-30 17:47:49 UTC in FinkCommander
-
Attached is a follow-up patch that allows a list of local addresses (instead of a single one) to be specified. These are then assigned to ToeThreads round-robin, ensuring that the same local address is assigned to a given ToeThread each time it is in FetchHTTP. For example, if the number of local addresses specified equals max-toe-threads there will be a one-to-one mapping between ToeThreads...
2007-01-31 19:58:38 UTC in Heritrix: Internet Archive Web Crawler
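The round-robin assignment described in the post above can be illustrated with a minimal sketch. This is not the attached patch: `LocalAddressPool`, `addressFor`, and the use of an integer thread index in place of a ToeThread are all hypothetical names and assumptions made for illustration, and addresses are kept as plain strings for brevity.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Heritrix or the attached patch.
final class LocalAddressPool {
    private final List<String> addresses;

    LocalAddressPool(List<String> addresses) {
        if (addresses.isEmpty()) {
            throw new IllegalArgumentException("at least one local address is required");
        }
        // Defensive copy so later changes to the caller's list do not alter the mapping.
        this.addresses = new ArrayList<>(addresses);
    }

    // threadIndex stands in for a ToeThread's serial number (an assumption).
    // The same index always maps to the same address, and indexes wrap
    // around the list, which gives the round-robin assignment.
    String addressFor(int threadIndex) {
        return addresses.get(Math.floorMod(threadIndex, addresses.size()));
    }
}
```

With as many addresses as threads, each index maps to its own address; with fewer addresses, several threads share one.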
-
Attached is an Extractor that allows you to specify XPath expressions to extract/construct links from XML/RSS. Each node matching each expression has the configured prefix/suffix strings appended to it and is added as a link.
2007-01-31 19:16:36 UTC in Heritrix: Internet Archive Web Crawler
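As a rough sketch of the idea in the post above, the following uses the standard `javax.xml.xpath` API to evaluate one expression and build links from the matching nodes. It is not the attached Extractor: the class and method names are hypothetical, and it assumes the prefix goes before the node text and the suffix after it.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration, not the Extractor attached to the tracker item.
final class XPathLinkSketch {
    // Evaluates one XPath expression against the XML text and wraps the text
    // content of every matching node in the configured prefix and suffix.
    static List<String> extractLinks(String xml, String expression,
                                     String prefix, String suffix) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes;
        try {
            nodes = (NodeList) xpath.evaluate(
                    expression,
                    new InputSource(new StringReader(xml)),
                    XPathConstants.NODESET);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("bad expression or unparsable XML", e);
        }
        List<String> links = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            String value = nodes.item(i).getTextContent().trim();
            links.add(prefix + value + suffix);
        }
        return links;
    }
}
```

For an RSS feed, an expression such as `//item/link` would select the link element of each item; the prefix and suffix are useful when the feed carries only identifiers or relative paths.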
-
It would be nice if the WUI provided some method of finding how many URLs match the regex you enter when searching the frontier. Even if it was just the ability to specify a page number in your request so you can narrow it down that would be good (such as the "Start at match" when you regex a log). Currently the only way you could do such a thing is to set the number of URLs to retrieve to...
2007-01-31 18:24:14 UTC in Heritrix: Internet Archive Web Crawler
-
Logged In: YES
user_id=705615
Just added "stop" to API to stop the replicator and a call
to it in crawlEnded. I also tested to make sure it patches
and builds on 1.9 again. Since I added serialVersionUID for
1.8, that chunk fails on 1.9, which is the desired behavior.
This patch doesn't require a serialVersionUID change since
the replicator is transient. This should be the final API...
2006-08-30 18:05:20 UTC in Heritrix: Internet Archive Web Crawler
-
You're absolutely right, this does nothing for partitioning
the crawler state to enable larger crawls. Its sole
purpose is to make the crawls Heritrix can already run
faster. However, I think it's important to keep in mind
that many people are not using Heritrix to crawl the entire
web, but rather to do focused crawls of particular
domains. In...
2006-08-29 22:32:57 UTC in Heritrix: Internet Archive Web Crawler
-
I was looking for the simplest way to run a crawl using
multiple instances of the current version of Heritrix. As
such, this API is basic, but it is not incomplete. In
fact, I'm currently using it and the very basic replicator
class also attached to run a very large crawl with more
than ten instances of Heritrix on a LAN. In the long run,
there's...
2006-08-29 22:03:47 UTC in Heritrix: Internet Archive Web Crawler
-
I was looking for the simplest way to run a crawl using
multiple instances of the current version of Heritrix. As
such, this API is basic, but it is not incomplete. In
fact, I'm currently using it and the very basic replicator
class also attached to run a very large crawl with more
than ten instances of Heritrix on a LAN. In the long run,
there's...
2006-08-29 20:28:30 UTC in Heritrix: Internet Archive Web Crawler
-
I have made some small modifications to this API; it is now
final. The patch applies to 1.8 and the latest head with some
fuzziness.
2006-08-28 21:58:41 UTC in Heritrix: Internet Archive Web Crawler
-
Again, the previous patch didn't work. This one is actually
tested and does work, and it reverts to the simple behavior
of the first one. :)
For those who are interested, here is the trace of how
localAddress is passed.
createSocket in HeritrixProtocolSocketFactory instantiates
java.net.Socket, but didn't have bind! (until now)
HttpConnection calls...
2006-08-28 15:40:20 UTC in Heritrix: Internet Archive Web Crawler
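The missing bind step mentioned in the post above can be sketched with plain `java.net` calls. This is not the code from the patch or from HeritrixProtocolSocketFactory: the class and method names are hypothetical, and the sketch only shows creating an unconnected socket and binding it to a chosen local address before any connect.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical illustration, not the patched HeritrixProtocolSocketFactory.
final class BoundSocketSketch {
    // Creates an unconnected socket and binds it to the given local address.
    // Port 0 lets the operating system pick a free local port.
    static Socket openBound(InetAddress localAddress) {
        Socket socket = new Socket();
        try {
            socket.bind(new InetSocketAddress(localAddress, 0));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // A caller would follow with socket.connect(remoteAddress, timeout).
        return socket;
    }
}
```

Binding before connecting is what makes outgoing traffic leave from the chosen local address rather than the system default.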