From: Eric A. <de...@us...> - 2004-03-16 23:28:22
Update of /cvsroot/sprawler/sprawler/docs
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv7756/docs

Modified Files:
  how-to-index.txt
Log Message:
Added more functions to client.pm (indexes more types of data)

Index: how-to-index.txt
===================================================================
RCS file: /cvsroot/sprawler/sprawler/docs/how-to-index.txt,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** how-to-index.txt 15 Mar 2004 21:32:24 -0000 1.2
--- how-to-index.txt 16 Mar 2004 23:18:30 -0000 1.3
***************
*** 17,70 ****
  the "offline" index.
! Here's a list of info we need to grab from the page:
! Title <TITLE>
! Headings (text in <H1>, <H2>, and so on)
! Bold words <B>
! Large text <font size=+x>
! Italic words <i>
! Underlined words <ul>
! Linked words <a ..>
! Capitalized words (like This and like THIS)
! Words linked from other pages to the current page
! Word quantity
! word proximity - how close is one word to another
! size of page (in bytes)
! link/non-link text ratio - if half the page is links, how much content can there really be?
! URL of the page - including domain name
! Is the text at the top of the page more important than text at the bottom?
! If text color is same as background color - it's probably search engine fodder <font color=..>
! How many pages link to this page, and which pages
! phone numbers (international ones, too)
! addresses
! email addresses
! domain names
! product numbers/model numbers
! ISBN book numbers (there's and algorithm for this)
! company names
! meta description <meta desc..>
! meta keywords <meta keywords..>
! meta expires <meta expires..>
! filenames
! postal/zip codes
! stock symbols
! abbreviations for province/state names
! em tagged words
! blinking words <blink>
! marquee words <marquee>
! small font words <font size=-x..>
! table headers <???>
! words in table data tags <td>...</td>
! alt tags (for commenting images) <a ... alt=xxxx>
! image file names <img src=xxxx >
! quoted words <??>
! block text quoted words <block>
! listed text words <li>
! preformatted text <pre>
! text/image ratio
! individual words, and their frequency
! phrases (Panama canal routine)
! size of entire file
! size of data after html removed
! text/html ratio
--- 17,70 ----
  the "offline" index.
! Here's a list of info we need to grab from the page: (- is to do, * is done, and ? is unknown state)
! * Title <TITLE>
! * Headings (text in <H1>, <H2>, and so on)
! * Bold words <B>
! - Large text <font size=+x>
! * Italic words <i>
! - Underlined words <ul>
! ? Linked words <a ..>
! - Capitalized words (like This and like THIS)
! - Words linked from other pages to the current page
! - Word quantity
! - word proximity - how close is one word to another
! ? size of page (in bytes)
! - link/non-link text ratio - if half the page is links, how much content can there really be?
! * URL of the page - including domain name
! - Is the text at the top of the page more important than text at the bottom?
! - If text color is same as background color - it's probably search engine fodder <font color=..>
! - How many pages link to this page, and which pages
! - phone numbers (international ones, too)
! - addresses (snail mail)
! * email addresses
! - domain names
! - product numbers/model numbers
! - ISBN book numbers (there's and algorithm for this)
! - company names
! - meta description <meta desc..>
! - meta keywords <meta keywords..>
! - meta expires <meta expires..>
! - filenames
! - postal/zip codes
! - stock symbols
! - abbreviations for province/state names
! - em tagged words <em>
! - blinking words <blink>
! * marquee words <marquee>
! - small font words <font size=-x..>
! - table headers <???>
! - words in table data tags <td>...</td>
! - alt tags (for commenting images) <a ... alt=xxxx>
! - image file names <img src=xxxx >
! - quoted words <??>
! * block text quoted words <block>
! * listed text words <li>
! * preformatted text <pre>
! - text/image ratio
! - individual words, and their frequency
! - phrases (Panama canal routine)
! - size of entire file
! - size of data after html removed
! - text/html ratio
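
The to-do list mentions that there is an algorithm for ISBN book numbers. As a rough sketch of what that item could look like (not taken from client.pm, which is Perl and is not shown in this commit), the standard ISBN-10 check weights the ten characters 10 down to 1 and requires the sum to be divisible by 11, with a final "X" standing for 10. The names is_valid_isbn10, find_isbn10s and ISBN10_CANDIDATE below are hypothetical, and the candidate pattern is only an assumption about how ISBNs appear in page text.

```python
import re
from typing import List

# Hypothetical pattern: ten characters (nine digits plus a digit or X),
# optionally separated by single hyphens or spaces.
ISBN10_CANDIDATE = re.compile(r"\b\d(?:[- ]?\d){8}[- ]?[\dXx]\b", re.ASCII)


def is_valid_isbn10(candidate: str) -> bool:
    """Return True if the string passes the ISBN-10 check-digit test."""
    chars = [c for c in candidate.upper() if c not in "- "]
    if len(chars) != 10:
        return False
    total = 0
    for position, char in enumerate(chars):
        if char == "X" and position == 9:
            value = 10  # "X" is only allowed as the check character
        elif char in "0123456789":
            value = int(char)
        else:
            return False
        # Weights run 10, 9, ..., 1 from left to right.
        total += (10 - position) * value
    return total % 11 == 0


def find_isbn10s(text: str) -> List[str]:
    """Return ISBN-10-looking substrings of text whose checksum is valid."""
    return [
        match.group(0)
        for match in ISBN10_CANDIDATE.finditer(text)
        if is_valid_isbn10(match.group(0))
    ]
```

The checksum filters out most ten-digit strings that merely look like ISBNs (phone numbers, product codes), though roughly one in eleven random candidates will still pass, so an indexer would probably want extra context such as a nearby "ISBN" label.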