cis555final Code
| File | Date | Author | Commit |
|---|---|---|---|
| conf | 2011-05-08 | mizhao-upenn | [r181] remove lengh check |
| doc | 2011-05-11 | crdelozier | [r201] Added graph for crawler |
| lib | 2011-04-21 | mizhao-upenn | [r41] modify PageRank |
| resources | 2011-04-27 | crdelozier | [r50] List for Search Stress Test |
| script | 2011-04-22 | mizhao-upenn | [r43] update |
| src | 2011-05-10 | crdelozier | [r198] Updated for more drastic weight changes |
| www | 2011-04-19 | crdelozier | [r26] Updated code to get results from index server a... |
| README | 2011-05-03 | mizhao-upenn | [r176] change README |
| build.sh | 2011-05-01 | sofiagu | [r112] |
| build.xml | 2011-04-19 | sofiagu | [r22] add build |
| mapreduce-1.jar | 2011-05-08 | xiaojunf | [r182] |
| mapreduce-20.jar | 2011-05-02 | xiaojunf | [r164] |
| mapreduce.jar | 2011-05-02 | xiaojunf | [r163] |
| run-httpServer.sh | 2011-05-08 | mizhao-upenn | [r179] add url ack |
| run-searchServer.sh | 2011-05-07 | crdelozier | [r178] Set heap space to be larger so we don't run out... |
Full names and SEAS login names of project members: Siyin Gu (gusiyin), Christian DeLozier (delozier), Xiaojun Feng (xiaojunf), Ming-Chen Zhao (mizhao)

Description of features implemented:

- The main search servlet provides a simple interface that lets the user search and adjust the ranking of search results.
- The cache servlet lets users retrieve pages that have been cached by the web crawler.
- The search servlet communicates with the index server through a REST interface.
- The WeightUpdateServlet and the RerankServlet handle AJAX calls from the main SearchServlet to rerank the search results and display them to the user.
- The search servlet implements partial stemming: it generates possible stems of each word and searches for both the original term and its stems, instead of just the stem. This retrieves more results for the user and avoids the poor behavior a stemmer can exhibit when a stem is not actually a word. (A sketch of this idea appears after this list.)
- A Pastry-ring-based distributed crawler.
- A Pastry-ring-based distributed TF-IDF calculator.
- A Hadoop-based PageRank calculator.

Extra credits claimed:

- Duplicate content detector: the crawler caches a digest of each page's content to avoid storing duplicate copies of the same content. (Sketched below.)
- AJAX for searching: users can rank query results by relevance, authority, or normal weights. Users can also adjust the weights by telling the search servlet how useful individual results were. When informed, the servlet adjusts the overall weights based on the difference between a helpful result and the results above it, or between an unhelpful result and the results below it. For example, suppose "yahoo.com" has a high index score but a low PageRank, "google.com" has a low index score but a high PageRank, and google appears below yahoo in the search results: if the user says that google was helpful, the engine raises the weight of PageRank and lowers the weight of the index score. (Sketched below.)
- Search content cache: users can view the cached content of pages that were crawled by the web crawler.
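A minimal sketch of the partial-stemming idea, assuming a toy suffix-stripping rule (the class name and suffix list are illustrative stand-ins, not the project's actual stemmer): the query is expanded to contain both the original term and its candidate stems, so a bad stem never hides results for the exact word.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Illustrative sketch only: expand a query term into the term itself plus
// candidate stems, so the engine searches for both instead of just the stem.
public class PartialStemmer {
    private static final String[] SUFFIXES = { "ing", "ed", "es", "s" };

    public static Set<String> expand(String term) {
        Set<String> terms = new LinkedHashSet<>();
        terms.add(term); // always keep the original word
        for (String suffix : SUFFIXES) {
            // Only strip when enough of the word remains to be a plausible stem.
            if (term.endsWith(suffix) && term.length() > suffix.length() + 2) {
                terms.add(term.substring(0, term.length() - suffix.length()));
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(expand("searching")); // prints [searching, search]
    }
}
```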
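The duplicate content detector can be pictured as follows: hash each fetched page body and refuse to store it when the digest has already been seen. The digest algorithm (MD5 here) and the in-memory set are assumptions for illustration; the actual crawler caches its digests as described above.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only: detect duplicate page content by digest.
public class DuplicateDetector {
    private final Set<String> seenDigests = new HashSet<>();

    // Returns true if a page with identical content was already stored.
    public boolean isDuplicate(String pageBody) throws NoSuchAlgorithmException {
        byte[] hash = MessageDigest.getInstance("MD5")
                .digest(pageBody.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : hash) {
            hex.append(String.format("%02x", b)); // bytes as unsigned hex
        }
        return !seenDigests.add(hex.toString()); // add() is false if already present
    }
}
```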
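The weight adjustment in the yahoo/google example can be sketched like this. The two-signal model, field names, and fixed step size are assumptions for illustration; the real logic lives in WeightUpdateServlet and RerankServlet and may differ.

```java
// Illustrative sketch only: shift ranking weight toward the signal that
// favors a result the user marked as helpful.
public class WeightUpdater {
    private static final double STEP = 0.1;
    private double indexWeight = 0.5; // weight of the index (relevance) score
    private double rankWeight = 0.5;  // weight of the PageRank (authority) score

    static class Result {
        final double indexScore;
        final double pageRank;
        Result(double indexScore, double pageRank) {
            this.indexScore = indexScore;
            this.pageRank = pageRank;
        }
    }

    // Called when 'helpful' was marked useful but 'above' outranked it.
    // (Per the description above, the servlet considers all results above
    // the helpful one; a single comparison is shown here for brevity.)
    public void markHelpful(Result helpful, Result above) {
        if (helpful.pageRank > above.pageRank && helpful.indexScore < above.indexScore) {
            rankWeight += STEP;   // PageRank identified the helpful result
            indexWeight -= STEP;
        } else if (helpful.indexScore > above.indexScore && helpful.pageRank < above.pageRank) {
            indexWeight += STEP;  // the index score identified it
            rankWeight -= STEP;
        }
        double sum = indexWeight + rankWeight; // renormalize to sum to 1
        indexWeight /= sum;
        rankWeight /= sum;
    }
}
```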
List of source files included (via `find src | grep java`):

```
./src/test/edu/upenn/cis/cis555/RunAllTests.java
./src/test/edu/upenn/cis/cis555/index/ServerUtility.java
./src/test/edu/upenn/cis/cis555/index/DatabaseUtility.java
./src/test/edu/upenn/cis/cis555/index/DatabaseTest.java
./src/test/edu/upenn/cis/cis555/index/HandlerTest.java
./src/test/edu/upenn/cis/cis555/webserver/ContextTest.java
./src/test/edu/upenn/cis/cis555/webserver/SessionTest.java
./src/test/edu/upenn/cis/cis555/webserver/ConfigTest.java
./src/test/edu/upenn/cis/cis555/webserver/ResponseTest.java
./src/test/edu/upenn/cis/cis555/webserver/RequestTest.java
./src/test/edu/upenn/cis/cis555/search/WeightUpdateTest.java
./src/test/edu/upenn/cis/cis555/search/CacheTest.java
./src/test/edu/upenn/cis/cis555/search/SearchTest.java
./src/test/edu/upenn/cis/cis555/crawler/HTMLParserTest.java
./src/test/edu/upenn/cis/cis555/crawler/ContextUtility.java
./src/test/edu/upenn/cis/cis555/crawler/HttpParserTest.java
./src/test/edu/upenn/cis/cis555/crawler/FileVerifyTest.java
./src/test/edu/upenn/cis/cis555/crawler/CrawlerTest.java
./src/test/edu/upenn/cis/cis555/crawler/testWebDB.java
./src/test/edu/upenn/cis/cis555/PageRank/DatabaseUtility.java
./src/test/edu/upenn/cis/cis555/PageRank/ManagerTest.java
./src/test/edu/upenn/cis/cis555/PageRank/DatabaseTest.java
./src/edu/upenn/cis/cis555/timing/MethodTimer.java
./src/edu/upenn/cis/cis555/index/KeywordKey.java
./src/edu/upenn/cis/cis555/index/KeywordData.java
./src/edu/upenn/cis/cis555/index/IndexKeywordDatabase.java
./src/edu/upenn/cis/cis555/index/IndexDocumentInformationResultQueue.java
./src/edu/upenn/cis/cis555/index/IndexSearchServlet.java
./src/edu/upenn/cis/cis555/index/IndexMessageQueue.java
./src/edu/upenn/cis/cis555/index/DocumentData.java
./src/edu/upenn/cis/cis555/index/SearchServerHandlerThread.java
./src/edu/upenn/cis/cis555/index/DocumentKey.java
./src/edu/upenn/cis/cis555/index/IndexDocumentDatabase.java
./src/edu/upenn/cis/cis555/index/SearchClient.java
./src/edu/upenn/cis/cis555/index/SearchServerThread.java
./src/edu/upenn/cis/cis555/index/TotalDocumentNumberData.java
./src/edu/upenn/cis/cis555/index/IndexApp.java
./src/edu/upenn/cis/cis555/index/TotalDocumentNumberKey.java
./src/edu/upenn/cis/cis555/index/IndexMessage.java
./src/edu/upenn/cis/cis555/index/SearchServerSocketQueue.java
./src/edu/upenn/cis/cis555/webserver/MyHttpRequest.java
./src/edu/upenn/cis/cis555/webserver/MyHttpResponse.java
./src/edu/upenn/cis/cis555/webserver/ClientRequest.java
./src/edu/upenn/cis/cis555/webserver/MySession.java
./src/edu/upenn/cis/cis555/webserver/Worker.java
./src/edu/upenn/cis/cis555/webserver/TestHarness.java
./src/edu/upenn/cis/cis555/webserver/MyServletContext.java
./src/edu/upenn/cis/cis555/webserver/HttpServer.java
./src/edu/upenn/cis/cis555/webserver/Parser.java
./src/edu/upenn/cis/cis555/webserver/ThreadPool.java
./src/edu/upenn/cis/cis555/webserver/MyServletConfig.java
./src/edu/upenn/cis/cis555/search/StressTest.java
./src/edu/upenn/cis/cis555/search/ResultParser.java
./src/edu/upenn/cis/cis555/search/RerankServlet.java
./src/edu/upenn/cis/cis555/search/SearchResult.java
./src/edu/upenn/cis/cis555/search/SearchServlet.java
./src/edu/upenn/cis/cis555/search/CacheServlet.java
./src/edu/upenn/cis/cis555/search/SearchResultList.java
./src/edu/upenn/cis/cis555/search/SearchClient.java
./src/edu/upenn/cis/cis555/search/WeightUpdateServlet.java
./src/edu/upenn/cis/cis555/search/BaseServlet.java
./src/edu/upenn/cis/cis555/search/SearchResultComparator.java
./src/edu/upenn/cis/cis555/crawler/message/UrlMsg.java
./src/edu/upenn/cis/cis555/crawler/message/VeriReqMsg.java
./src/edu/upenn/cis/cis555/crawler/message/VeriReplyMsg.java
./src/edu/upenn/cis/cis555/crawler/message/AddVeriMsg.java
./src/edu/upenn/cis/cis555/crawler/message/CrawlerEndMsg.java
./src/edu/upenn/cis/cis555/crawler/message/PastryMessage.java
./src/edu/upenn/cis/cis555/crawler/message/RawContentMsg.java
./src/edu/upenn/cis/cis555/crawler/common/MyInteger.java
./src/edu/upenn/cis/cis555/crawler/common/GlobalVar.java
./src/edu/upenn/cis/cis555/crawler/common/MsgType.java
./src/edu/upenn/cis/cis555/crawler/common/UtilityFunction.java
./src/edu/upenn/cis/cis555/crawler/common/WebKey.java
./src/edu/upenn/cis/cis555/crawler/common/RobotRule.java
./src/edu/upenn/cis/cis555/crawler/common/SystemLog.java
./src/edu/upenn/cis/cis555/crawler/common/Param.java
./src/edu/upenn/cis/cis555/crawler/common/WebData.java
./src/edu/upenn/cis/cis555/crawler/common/WebDatabase.java
./src/edu/upenn/cis/cis555/crawler/common/HttpParser.java
./src/edu/upenn/cis/cis555/crawler/queue/SynMessageQu
```

Outside sources used:

Special instructions for building or running:

Go to the directory cis555final.

Building: If you run the project in the provided Virtual Machine, you can compile it using build.sh. Otherwise, you can use Eclipse or the usual javac command. If you did not build with build.sh in the provided Virtual Machine, adjust the commands for running the project accordingly.

Running: To run the project on EC2, first follow the steps in http://www.cis.upenn.edu/~ahae/teaching/cis455-s11/materials/aws-guide.pdf to set up Amazon Web Services (AWS). Then follow steps 1 to 12 in http://www.cis.upenn.edu/~cis399sc/homeworks/Homework%203-v2.pdf to set up a Hadoop cluster on Amazon EC2 instances. In step 8, use the following settings.

In conf/core-site.xml, use the following content:

```xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://masterIP:9000</value>
    <final>true</final>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop/</value>
  </property>
</configuration>
```

In conf/hdfs-site.xml, use the following content:

```xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/mnt/data</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```

In conf/mapred-site.xml, use the following content (replacing 10.88.213.38 with your master's IP address):

```xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>10.88.213.38:9001</value>
    <final>true</final>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/mnt/local</value>
    <final>true</final>
  </property>
</configuration>
```

For the masters and slaves files, follow the instructions in the PDF.

In conf/web.xml, change the value of the context-param indexServer to the value of bootAddr in conf/localconf.xml, and change the value of the context-param indexServerPort to the value of indexPort in conf/localconf.xml. Then copy the compiled project to each EC2 instance. Each instance should contain a conf/localconf.xml file; most parameters can be kept the same across instances.
The localconf.xml parameters are as follows (a hypothetical example of the file appears at the end of this README):

- <bootAddr> and <bootPort>: the IP address and port of the first machine that starts the server by running run-searchServer.sh. These must be the same across all instances.
- <indexPort>: must have the same value as the context-param indexServerPort in conf/web.xml.
- <startPage>: the URL of the crawler's start page. Only the last machine to start the server by running run-searchServer.sh should contain this field; on all other nodes, delete the field along with its XML tags.
- <pageLimit>: the number of pages to be fetched by each instance. Keep this the same across all instances.
- <isMaster>: true only on the master instance; false on all other instances.
- <totalNode>: the total number of instances you will start. Keep this the same across all instances.
- Other parameters do not need to be changed.

Before you run the searchServer (and again whenever you stop it completely and want to restart it), run the following commands on one machine (master or slave):

```
hadoop dfs -rmr in
hadoop dfs -mkdir in
```

If you stop the searchServer completely and want to restart it, also delete the directories data, <rankDBHome>, <hpInDir>, <hpOutDir>, <IndexDBHome>, and <webDBHome>.

Then use run-searchServer.sh to start the distributed crawler and search server on each instance. Note that the first and last instances to start must be consistent with the settings in the conf/localconf.xml files. After you start the last instance, the crawler and indexer begin to work. Once all crawlers have fetched the required number of pages, the Hadoop MapReduce job starts automatically and computes the PageRank scores. Then use run-httpServer.sh to start the web server and visit http://masterDNS:80/search to use the search engine.

If you have any questions, feel free to contact our team members. Thank you for reading.
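For reference, here is a minimal sketch of what a conf/localconf.xml combining the parameters above might look like. The root element, field order, and all values are assumptions for illustration only; consult the file shipped with the project for the real schema and the full parameter set.

```xml
<?xml version="1.0"?>
<!-- Hypothetical localconf.xml sketch; all values are placeholders. -->
<localconf>
  <bootAddr>10.0.0.1</bootAddr>                 <!-- same on every instance -->
  <bootPort>20000</bootPort>                    <!-- same on every instance -->
  <indexPort>20001</indexPort>                  <!-- must match indexServerPort in conf/web.xml -->
  <startPage>http://www.upenn.edu/</startPage>  <!-- only on the last instance started -->
  <pageLimit>1000</pageLimit>                   <!-- same on every instance -->
  <isMaster>true</isMaster>                     <!-- true only on the master -->
  <totalNode>4</totalNode>                      <!-- same on every instance -->
</localconf>
```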