
Read Me

Full names and SEAS login names of project members:
  Siyin Gu (gusiyin), Christian DeLozier (delozier),
  Xiaojun Feng (xiaojunf), Ming-Chen Zhao (mizhao)

Description of features implemented:
  The main search servlet features a simple interface that allows the user to search and to adjust the ranking
    of search results.  The cache servlet allows users to retrieve pages that have been cached by the web-crawler.
    The search servlet communicates with the index server through a REST interface.  The WeightUpdateServlet and
    the RerankServlet handle AJAX calls from the main SearchServlet to rerank the search results and display them
    to the user.  The search servlet implements partial stemming: it generates possible stems of each query word
    and searches for both the original term and the stem (instead of just the stem).  This behavior retrieves
    more results for the user and avoids the poor behavior a stemmer can exhibit when a stem isn't actually a word.
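
  As an illustration of this partial-stemming idea, here is a minimal sketch; the class and the
  suffix-stripping rule are hypothetical stand-ins, not the project's actual stemmer:

	import java.util.LinkedHashSet;
	import java.util.Set;

	public class PartialStemSketch {
	    /** Returns the original term plus any candidate stem, so the index
	     *  is queried for both rather than for the stem alone. */
	    public static Set<String> expand(String term) {
	        Set<String> terms = new LinkedHashSet<>();
	        terms.add(term); // always keep the original word
	        // Candidate stems from a few common English suffixes (illustrative only).
	        for (String suffix : new String[] {"ing", "ed", "es", "s"}) {
	            if (term.endsWith(suffix) && term.length() > suffix.length() + 2) {
	                terms.add(term.substring(0, term.length() - suffix.length()));
	                break; // strip at most one suffix
	            }
	        }
	        return terms;
	    }

	    public static void main(String[] args) {
	        // "running" -> [running, runn]: the stem is not a real word, which is
	        // why the original term is always searched as well.
	        System.out.println(expand("running"));
	    }
	}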

A Pastry-ring-based distributed crawler
A Pastry-ring-based distributed tf-idf calculator
A Hadoop-based PageRank calculator
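
  As an illustration of what the Hadoop job computes, here is a minimal single-machine sketch of
  the PageRank iteration; the graph representation and names are illustrative, not the project's
  actual data format:

	import java.util.HashMap;
	import java.util.List;
	import java.util.Map;

	public class PageRankSketch {
	    static final double DAMPING = 0.85;

	    /** links maps each crawled page to the pages it links to. */
	    static Map<String, Double> pageRank(Map<String, List<String>> links, int iterations) {
	        int n = links.size();
	        Map<String, Double> rank = new HashMap<>();
	        for (String page : links.keySet()) rank.put(page, 1.0 / n);
	        for (int it = 0; it < iterations; it++) {
	            // Start each page at the "random jump" baseline.
	            Map<String, Double> next = new HashMap<>();
	            for (String page : links.keySet()) next.put(page, (1 - DAMPING) / n);
	            // Each page distributes its rank evenly over its out-links; in the
	            // MapReduce version, the mapper emits one (target, share) pair per
	            // link and the reducer sums the shares for each target.
	            for (Map.Entry<String, List<String>> e : links.entrySet()) {
	                double share = rank.get(e.getKey()) / Math.max(1, e.getValue().size());
	                for (String target : e.getValue()) {
	                    if (next.containsKey(target)) { // ignore links to uncrawled pages
	                        next.put(target, next.get(target) + DAMPING * share);
	                    }
	                }
	            }
	            rank = next;
	        }
	        return rank;
	    }
	}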

Extra credits claimed:
  Duplicate content detector: The crawler caches a digest of each page's content so that it can
    avoid storing duplicate copies of the same page (a sketch follows).
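
    A minimal sketch of the digest check, assuming an MD5 digest over the raw page bytes (the
    project's actual digest algorithm and storage may differ):

	import java.security.MessageDigest;
	import java.security.NoSuchAlgorithmException;
	import java.util.HashSet;
	import java.util.Set;

	public class ContentDedupSketch {
	    private final Set<String> seenDigests = new HashSet<>();

	    /** Returns true if this content was already stored under another URL. */
	    public boolean isDuplicate(byte[] pageContent) throws NoSuchAlgorithmException {
	        MessageDigest md = MessageDigest.getInstance("MD5");
	        byte[] hash = md.digest(pageContent);
	        StringBuilder hex = new StringBuilder();
	        for (byte b : hash) hex.append(String.format("%02x", b));
	        // Set.add returns false when the digest is already present, i.e. the
	        // same content was crawled before under a different URL.
	        return !seenDigests.add(hex.toString());
	    }
	}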

  AJAX for Searching:  Users can rank query results by relevance, authority, or normal weights.  Users
    can also adjust the weights by telling the search servlet how useful an individual result was.
    When informed, the servlet adjusts the overall weights based on the difference between the
    helpful result and the results ranked above it (or between an unhelpful result and the results
    ranked below it).  For example, suppose "yahoo.com" has a high index score but a low PageRank,
    "google.com" has a low index score but a high PageRank, and google appears below yahoo in the
    search results.  If the user marks google as helpful, the engine raises the weight of PageRank
    and reduces the weight of the index score (a sketch of this heuristic follows).
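
    A minimal sketch of this reweighting heuristic; the class and field names are hypothetical,
    not the project's actual API:

	import java.util.List;

	public class WeightUpdateSketch {
	    // Current blend weights; assumed to start equal and sum to 1.0.
	    private double indexWeight = 0.5;
	    private double pageRankWeight = 0.5;
	    private static final double STEP = 0.05; // assumed adjustment step

	    public static class Result {
	        final double indexScore; // tf-idf score from the index server
	        final double pageRank;   // PageRank score from the Hadoop job
	        Result(double i, double p) { indexScore = i; pageRank = p; }
	    }

	    /** The user marked results.get(k) as helpful: shift weight toward the
	     *  component on which it beats the results ranked above it. */
	    public void markHelpful(List<Result> results, int k) {
	        if (k == 0) return; // already ranked first, nothing to learn
	        double avgIndex = 0, avgRank = 0;
	        for (int i = 0; i < k; i++) {
	            avgIndex += results.get(i).indexScore;
	            avgRank += results.get(i).pageRank;
	        }
	        avgIndex /= k;
	        avgRank /= k;
	        Result helpful = results.get(k);
	        // e.g. google.com: low index score, high PageRank, marked helpful
	        // => raise the PageRank weight, lower the index-score weight.
	        if (helpful.pageRank > avgRank && helpful.indexScore < avgIndex) {
	            pageRankWeight += STEP;
	            indexWeight -= STEP;
	        } else if (helpful.indexScore > avgIndex && helpful.pageRank < avgRank) {
	            indexWeight += STEP;
	            pageRankWeight -= STEP;
	        }
	    }

	    public double blendedScore(Result r) {
	        return indexWeight * r.indexScore + pageRankWeight * r.pageRank;
	    }
	}
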
  Search Content Cache:  Users can view the cached content of pages that were crawled by the web-crawler.

List of source files included (generated with `find src | grep java`):
./src/test/edu/upenn/cis/cis555/RunAllTests.java
./src/test/edu/upenn/cis/cis555/index/ServerUtility.java
./src/test/edu/upenn/cis/cis555/index/DatabaseUtility.java
./src/test/edu/upenn/cis/cis555/index/DatabaseTest.java
./src/test/edu/upenn/cis/cis555/index/HandlerTest.java
./src/test/edu/upenn/cis/cis555/webserver/ContextTest.java
./src/test/edu/upenn/cis/cis555/webserver/SessionTest.java
./src/test/edu/upenn/cis/cis555/webserver/ConfigTest.java
./src/test/edu/upenn/cis/cis555/webserver/ResponseTest.java
./src/test/edu/upenn/cis/cis555/webserver/RequestTest.java
./src/test/edu/upenn/cis/cis555/search/WeightUpdateTest.java
./src/test/edu/upenn/cis/cis555/search/CacheTest.java
./src/test/edu/upenn/cis/cis555/search/SearchTest.java
./src/test/edu/upenn/cis/cis555/crawler/HTMLParserTest.java
./src/test/edu/upenn/cis/cis555/crawler/ContextUtility.java
./src/test/edu/upenn/cis/cis555/crawler/HttpParserTest.java
./src/test/edu/upenn/cis/cis555/crawler/FileVerifyTest.java
./src/test/edu/upenn/cis/cis555/crawler/CrawlerTest.java
./src/test/edu/upenn/cis/cis555/crawler/testWebDB.java
./src/test/edu/upenn/cis/cis555/PageRank/DatabaseUtility.java
./src/test/edu/upenn/cis/cis555/PageRank/ManagerTest.java
./src/test/edu/upenn/cis/cis555/PageRank/DatabaseTest.java
./src/edu/upenn/cis/cis555/timing/MethodTimer.java
./src/edu/upenn/cis/cis555/index/KeywordKey.java
./src/edu/upenn/cis/cis555/index/KeywordData.java
./src/edu/upenn/cis/cis555/index/IndexKeywordDatabase.java
./src/edu/upenn/cis/cis555/index/IndexDocumentInformationResultQueue.java
./src/edu/upenn/cis/cis555/index/IndexSearchServlet.java
./src/edu/upenn/cis/cis555/index/IndexMessageQueue.java
./src/edu/upenn/cis/cis555/index/DocumentData.java
./src/edu/upenn/cis/cis555/index/SearchServerHandlerThread.java
./src/edu/upenn/cis/cis555/index/DocumentKey.java
./src/edu/upenn/cis/cis555/index/IndexDocumentDatabase.java
./src/edu/upenn/cis/cis555/index/SearchClient.java
./src/edu/upenn/cis/cis555/index/SearchServerThread.java
./src/edu/upenn/cis/cis555/index/TotalDocumentNumberData.java
./src/edu/upenn/cis/cis555/index/IndexApp.java
./src/edu/upenn/cis/cis555/index/TotalDocumentNumberKey.java
./src/edu/upenn/cis/cis555/index/IndexMessage.java
./src/edu/upenn/cis/cis555/index/SearchServerSocketQueue.java
./src/edu/upenn/cis/cis555/webserver/MyHttpRequest.java
./src/edu/upenn/cis/cis555/webserver/MyHttpResponse.java
./src/edu/upenn/cis/cis555/webserver/ClientRequest.java
./src/edu/upenn/cis/cis555/webserver/MySession.java
./src/edu/upenn/cis/cis555/webserver/Worker.java
./src/edu/upenn/cis/cis555/webserver/TestHarness.java
./src/edu/upenn/cis/cis555/webserver/MyServletContext.java
./src/edu/upenn/cis/cis555/webserver/HttpServer.java
./src/edu/upenn/cis/cis555/webserver/Parser.java
./src/edu/upenn/cis/cis555/webserver/ThreadPool.java
./src/edu/upenn/cis/cis555/webserver/MyServletConfig.java
./src/edu/upenn/cis/cis555/search/StressTest.java
./src/edu/upenn/cis/cis555/search/ResultParser.java
./src/edu/upenn/cis/cis555/search/RerankServlet.java
./src/edu/upenn/cis/cis555/search/SearchResult.java
./src/edu/upenn/cis/cis555/search/SearchServlet.java
./src/edu/upenn/cis/cis555/search/CacheServlet.java
./src/edu/upenn/cis/cis555/search/SearchResultList.java
./src/edu/upenn/cis/cis555/search/SearchClient.java
./src/edu/upenn/cis/cis555/search/WeightUpdateServlet.java
./src/edu/upenn/cis/cis555/search/BaseServlet.java
./src/edu/upenn/cis/cis555/search/SearchResultComparator.java
./src/edu/upenn/cis/cis555/crawler/message/UrlMsg.java
./src/edu/upenn/cis/cis555/crawler/message/VeriReqMsg.java
./src/edu/upenn/cis/cis555/crawler/message/VeriReplyMsg.java
./src/edu/upenn/cis/cis555/crawler/message/AddVeriMsg.java
./src/edu/upenn/cis/cis555/crawler/message/CrawlerEndMsg.java
./src/edu/upenn/cis/cis555/crawler/message/PastryMessage.java
./src/edu/upenn/cis/cis555/crawler/message/RawContentMsg.java
./src/edu/upenn/cis/cis555/crawler/common/MyInteger.java
./src/edu/upenn/cis/cis555/crawler/common/GlobalVar.java
./src/edu/upenn/cis/cis555/crawler/common/MsgType.java
./src/edu/upenn/cis/cis555/crawler/common/UtilityFunction.java
./src/edu/upenn/cis/cis555/crawler/common/WebKey.java
./src/edu/upenn/cis/cis555/crawler/common/RobotRule.java
./src/edu/upenn/cis/cis555/crawler/common/SystemLog.java
./src/edu/upenn/cis/cis555/crawler/common/Param.java
./src/edu/upenn/cis/cis555/crawler/common/WebData.java
./src/edu/upenn/cis/cis555/crawler/common/WebDatabase.java
./src/edu/upenn/cis/cis555/crawler/common/HttpParser.java
./src/edu/upenn/cis/cis555/crawler/queue/SynMessageQu

Outside sources used:


Special instructions for building or running:
  Go to the directory cis555final.
  Building:
    If you run the project in the given Virtual Machine, you can compile it
    using build.sh. Otherwise, you can use Eclipse or the usual javac command. If you
    did not use build.sh in the given Virtual Machine, please change the commands for
    running the project accordingly.
  Running:
    To run the project in EC2, first follow the steps in
    http://www.cis.upenn.edu/~ahae/teaching/cis455-s11/materials/aws-guide.pdf
    to set up Amazon Web Services (AWS). Then follow steps 1 to 12 in
    http://www.cis.upenn.edu/~cis399sc/homeworks/Homework%203-v2.pdf
    to set up a Hadoop cluster with Amazon EC2 instances.
    
    When doing step 8, please use the following settings.
    In conf/core-site.xml, use the following content.
		<?xml version="1.0"?>
		<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
		
		<!-- Put site-specific property overrides in this file. -->
		
		<configuration>
			<property>
				<name>fs.default.name</name>
				<value>hdfs://masterIP:9000</value>
				<final>true</final>
			</property>
			<property>
				<name>hadoop.tmp.dir</name>
				<value>/tmp/hadoop/</value>
			</property>
		</configuration>
    In conf/hdfs-site.xml, use the following content.
		<?xml version="1.0"?>
		<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
		
		<!-- Put site-specific property overrides in this file. -->
		
		<configuration>
			<property>
				<name>dfs.data.dir</name>
				<value>/mnt/data</value>
				<final>true</final>
			</property>
			<property>
				<name>dfs.replication</name>
				<value>1</value>
			</property>
		</configuration>
    In conf/mapred-site.xml, use the following content.
		<?xml version="1.0"?>
		<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
		
		<!-- Put site-specific property overrides in this file. -->
		
		<configuration>
			<property>
				<name>mapred.job.tracker</name>
				<value>masterIP:9001</value>
				<final>true</final>
			</property>
			<property>
				<name>mapred.local.dir</name>
				<value>/mnt/local</value>
				<final>true</final>
			</property>
		</configuration>
    For masters and slaves, follow the instructions in the PDF file.
     
    In the file conf/web.xml, please change the value of the context-param
    indexServer to the value of bootAddr in conf/localconf.xml, and change
    the value of the context-param indexServerPort to the value of indexPort
    in conf/localconf.xml. A hypothetical excerpt follows.
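
	<!-- Hypothetical excerpt of conf/web.xml; the context-param names are as
	     described above, and the values shown are placeholders that must
	     match your conf/localconf.xml. -->
	<context-param>
		<param-name>indexServer</param-name>
		<param-value>10.0.0.1</param-value>   <!-- = bootAddr in localconf.xml -->
	</context-param>
	<context-param>
		<param-name>indexServerPort</param-name>
		<param-value>9090</param-value>       <!-- = indexPort in localconf.xml -->
	</context-param>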
    
    Then copy the compiled project to each EC2 instance. Each instance
    should contain a conf/localconf.xml file. Most of the parameters can be
    kept the same across all instances (an illustrative example follows this list).
        The parameters <bootAddr> and <bootPort> give the IP address and port of the
        first machine started with run-searchServer.sh. <bootAddr> and <bootPort> should
        be the same across all instances.
        The parameter <indexPort> should have the same value as the context-param
        indexServerPort in conf/web.xml.
        The parameter <startPage> is the URL of the crawler's start page. Only
        the last machine to start the server with run-searchServer.sh should
        contain this field; on all other nodes, please delete the field together
        with its XML tags.
        The parameter <pageLimit> is the number of pages to be fetched by
        each instance. Please keep it the same across all instances.
        The parameter <isMaster> is true only on the master instance.
        On all other instances, the value should be false.
        The parameter <totalNode> is the total number of instances you will start.
        Please keep it the same across all instances.
        The other parameters do not need to be changed.
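
	<!-- Illustrative conf/localconf.xml for the last-started master instance.
	     The parameter names are those listed above; the values and the root
	     element are assumptions. -->
	<localconf>
		<bootAddr>10.0.0.1</bootAddr>                 <!-- same on every instance -->
		<bootPort>10001</bootPort>                    <!-- same on every instance -->
		<indexPort>9090</indexPort>                   <!-- must match indexServerPort in web.xml -->
		<startPage>http://www.upenn.edu/</startPage>  <!-- last-started instance only -->
		<pageLimit>1000</pageLimit>                   <!-- same on every instance -->
		<isMaster>true</isMaster>                     <!-- true on the master only -->
		<totalNode>4</totalNode>                      <!-- same on every instance -->
	</localconf>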
        
    Please run the following commands on one machine (master or slave) before you run
    the searchServer. You also need to run these commands if you stop the searchServer
    completely and want to restart it.
		hadoop dfs -rmr in
		hadoop dfs -mkdir in
    If you stop the searchServer completely and want to restart it, please also
    delete the directories data, <rankDBHome>, <hpInDir>, <hpOutDir>, <IndexDBHome>,
    and <webDBHome>.

     Then you can use run-searchServer.sh to start the distributed crawler and search server
     on each instance. Note that the first and the last instances you start should
     be consistent with the settings in the conf/localconf.xml files. After you start
     the last instance, the crawler and indexer should begin to work. After all
     crawlers have fetched the required number of pages, the Hadoop MapReduce job will
     start automatically and compute the PageRank scores.

     Then you can use run-httpServer.sh to start the web server, and visit
     http://masterDNS:80/search to enjoy the search.
     
     
If you have any questions, feel free to contact our team members. Thank you for reading.
