File Date Author Commit
 CHANGELOG.txt 2023-03-03 Ron Spain Ron Spain [d6fc54] Corrected info
 LICENSE 2021-09-15 Ron Spain Ron Spain [310f4e] v2.0 with web interface, scheduler, new words f...
 NuzeBot.png 2021-10-15 Ron Spain Ron Spain [d039be] fixed possible bug with strings in config, move...
 README.htm 2023-03-04 Ron Spain Ron Spain [ea3b22] Informative config file
 README.md 2021-10-15 Ron Spain Ron Spain [d039be] fixed possible bug with strings in config, move...
 README.txt 2023-03-01 Ron Spain Ron Spain [03701f] v3: 2-letter opts & more
 TODO.txt 2023-03-01 Ron Spain Ron Spain [03701f] v3: 2-letter opts & more
 compile.sh 2023-03-01 Ron Spain Ron Spain [03701f] v3: 2-letter opts & more
 compileCrawler.sh 2021-10-15 Ron Spain Ron Spain [d039be] fixed possible bug with strings in config, move...
 config.txt 2023-03-04 Ron Spain Ron Spain [ea3b22] Informative config file
 convertWords.c 2021-10-15 Ron Spain Ron Spain [d039be] fixed possible bug with strings in config, move...
 crawl.c 2021-10-15 Ron Spain Ron Spain [d039be] fixed possible bug with strings in config, move...
 crossCompile.sh 2023-03-03 Ron Spain Ron Spain [2d17a7] Fixed file names in scripts
 favicon.ico 2021-10-15 Ron Spain Ron Spain [d039be] fixed possible bug with strings in config, move...
 index.htm 2023-03-03 Ron Spain Ron Spain [d6fc54] Corrected info
 junk.c 2021-10-15 Ron Spain Ron Spain [d039be] fixed possible bug with strings in config, move...
 mimes.txt 2023-03-03 Ron Spain Ron Spain [d6fc54] Corrected info
 mkTextHelp.sh 2021-07-08 ronspain ronspain [2e0aa5] Cleaned code, added -?, pcre
 nuze-lib.c 2023-03-03 Ron Spain Ron Spain [d6fc54] Corrected info
 nuze-lib.h 2023-03-01 Ron Spain Ron Spain [03701f] v3: 2-letter opts & more
 nuze.c 2021-10-15 Ron Spain Ron Spain [d039be] fixed possible bug with strings in config, move...
 nuze.css 2023-03-01 Ron Spain Ron Spain [03701f] v3: 2-letter opts & more
 reg.txt 2021-10-15 Ron Spain Ron Spain [d039be] fixed possible bug with strings in config, move...
 run.bat 2023-03-03 Ron Spain Ron Spain [d6fc54] Corrected info
 run.sh 2023-03-04 Ron Spain Ron Spain [ea3b22] Informative config file
 sched.c 2023-03-01 Ron Spain Ron Spain [03701f] v3: 2-letter opts & more
 sched.h 2023-03-01 Ron Spain Ron Spain [03701f] v3: 2-letter opts & more
 sched.txt 2021-10-15 Ron Spain Ron Spain [d039be] fixed possible bug with strings in config, move...
 setup.sh 2023-03-01 Ron Spain Ron Spain [03701f] v3: 2-letter opts & more
 sites.txt 2021-10-18 Ron Spain Ron Spain [9c1588] Web interface working for Win32
 std.c 2023-03-01 Ron Spain Ron Spain [03701f] v3: 2-letter opts & more
 std.h 2023-03-01 Ron Spain Ron Spain [03701f] v3: 2-letter opts & more
 web-ui.c 2023-03-01 Ron Spain Ron Spain [03701f] v3: 2-letter opts & more
 web-ui.h 2023-03-01 Ron Spain Ron Spain [03701f] v3: 2-letter opts & more
 words.txt 2023-03-01 Ron Spain Ron Spain [03701f] v3: 2-letter opts & more

Read Me

<!doctype html><html lang="en"><head><title>NuzeBot Documentation</title>

<meta charset="utf-8">

<meta name="description" content="Instructions, help, and information for Ron Spain's NuzeBot. This news robot is free and open source software for gathering headlines.">

<meta name="keywords" content="nuze,bot,news,robot,software">

<meta name="author" content="Ron Spain">

<meta name="viewport" content="width=device-width, initial-scale=1">

<style>

body{font-family:sans-serif;background-color:#666}
div{background-color:#aaa;color:#000;padding:1em;border-radius:1em;border:1px solid #000}
td{padding:0 0.25em}
a{text-decoration:none}
a:hover{text-decoration:underline}

@media(min-width:640px){
 body>div{width:640px;display:table;margin:auto}
}

</style>

</head><body>

<div>

<h1>NuzeBot Documentation</h1>

<p>The NuzeBot is a robot that is designed to find interesting new headlines. The headlines are in the form of hyperlinks, allowing further reading at the source. The output of the NuzeBot is an HTML file that you can conveniently view with your favorite web browser.

<p>The NuzeBot is designed to remember the hyperlinks that it sees. Old links are penalized, moving them down the list until they are no longer shown on the page.

<p>Though the NuzeBot is functional without editing anything, you should probably customize a few files so the bot can provide the kinds of results you want. The bot is designed to use plain text files for configuration, so when you're editing these files, you should only use a simple text editor that doesn't add any formatting or markup.

<p>Comments in the configuration files have the '#' character at the beginning of the line.

<p>The NuzeBot is free and open source software.



<h2>Purpose</h2>

<p>The NuzeBot can serve different purposes:

<ul>
<li>It provides interesting headlines for personal information.</li>
<li>It provides free content for your websites.</li>
<li>It supports an enhanced business intelligence by letting you keep up-to-date with the latest news about topics related to your industry.</li>
</ul>


<p>"You should be on top of all the news within your industry, and beyond that all local, national, and global news as well." ~ Donald Trump, <em>Think Like a Billionaire</em>


<h2>Usage</h2>

<p>To use the NuzeBot, simply run the run.sh (on Linux) or run.bat (on Windows) file. The NuzeBot might take a few minutes if you have many sites in your sites.txt file. The output file will be named nuze.htm by default. Open that file with your favorite web browser when the bot is finished.

<p>It may be good to observe how the NuzeBot works before continuing, but for the kind of results you really want, you will need to edit a couple of files that are used for controlling the NuzeBot.

<p>If you want to create multiple output pages of headlines, use the -oh &lt;file.ext&gt; option to store the headlines for repeated use by the NuzeBot, which avoids unnecessary network activity and load on servers. Then use the -ih &lt;file.ext&gt; option to load the stored headlines when generating each page. Where zlib is enabled in the build options, headlines are stored in a compressed format.

<p>Using the option -r nw will run the bot now ('n') and start the web interface ('w') on the default port 8888 (or change via the -wp option), which you may access at address 127.0.0.1:8888 via your web browser. To be available from other computers on the LAN, use the appropriate LAN address such as 192.168.0.8 that is found by checking your connection information. To be available from behind a router, enable port forwarding. Running the web interface is not required to use the NuzeBot, but it provides a built-in server and some extra features.



<h2>Automation</h2>

<p>Version 2 of NuzeBot includes a built-in scheduler / timer system, if you choose to use it. The additional configuration file "sched.txt" uses a cron-like syntax except ',' and '/' are not yet supported. There are only four numbers per schedule entry telling when to run the bot: minute, hour, day (of month), and the day of the week. You can use a '*' as a wildcard to match any value.

<p>To edit scheduled times to run the bot, edit the sched.txt file.

<p>To run every day at 5 PM:

<table>
<tr><td>#Minute</td>	<td>Hour</td>	<td>Day</td>	<td>Weekday</td></tr>
<tr><td>0</td>		<td>17</td>	<td>*</td>	<td>*</td></tr>
</table>
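
<p>A minimal sketch in C of how a four-field entry like the one above could be checked against the current time. The names sched_entry, field_matches, entry_matches, and SCHED_ANY are hypothetical and are not taken from sched.c. Using -1 to stand for the '*' wildcard and cron-style weekday numbering (0 = Sunday) are assumptions.

```c
#include <stdbool.h>
#include <time.h>

#define SCHED_ANY (-1) /* hypothetical stand-in for the '*' wildcard */

/* One schedule entry: minute, hour, day of month, day of week. */
struct sched_entry {
    int minute;
    int hour;
    int mday;
    int wday; /* assumed cron-style numbering: 0 = Sunday */
};

/* A field matches if it is the wildcard or equals the current value. */
static bool field_matches(int field, int value)
{
    return field == SCHED_ANY || field == value;
}

/* True when every field of the entry matches the broken-down time. */
static bool entry_matches(const struct sched_entry *e, const struct tm *t)
{
    return field_matches(e->minute, t->tm_min)
        && field_matches(e->hour, t->tm_hour)
        && field_matches(e->mday, t->tm_mday)
        && field_matches(e->wday, t->tm_wday);
}
```

<p>With the entry {0, 17, SCHED_ANY, SCHED_ANY}, entry_matches returns true only during the minute 17:00, on any day.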

<p>To activate this scheduler, you must use the -r t option. To run the NuzeBot now and also run the scheduler, provide -r nt as an argument.

<p>Otherwise, to run NuzeBot automatically every day on Linux, you could use cron. Assuming cron is installed and running, type crontab -e to edit tasks, and append the following line:

<p>0 11 * * * cd /full/path/to/nuzebot;./run.sh

<p>That will run the NuzeBot every day at 11 AM. To run daily at 5 PM and 9 PM, use this:

<p>0 17,21 * * * cd /full/path/to/nuzebot;./run.sh

<p>Edit the path to match the path of the NuzeBot folder on your computer.



<h2>Web Interface</h2>

<p>The web interface uses a simple built-in HTTP server to eliminate the need for a separate web server to share the news found by the NuzeBot on your LAN or WAN. The web interface also provides additional features such as search.

<p>The web interface starts on port 8888 by default, so if it is running on your computer, go to <a href="http://127.0.0.1:8888">http://127.0.0.1:8888</a> in your favorite web browser to use it.

<p>The web interface is designed with security in mind, but HTTPS is not yet supported, so you must use an unencrypted/unsecured HTTP connection for now.



<h2>Files</h2>

<p>The NuzeBot package that you download might contain these files:

<table>

<tr><td>config.txt</td><td>configuration file</td></tr>

<tr><td>sites.txt</td><td>web addresses to scan for headlines</td></tr>

<tr><td>words.txt</td><td>words for scoring</td></tr>
<tr><td>reg.txt</td><td>regular expressions for scoring</td></tr>

<tr><td>nuze.htm</td><td>the output file</td></tr>
<tr><td>nuze.css</td><td>css file</td></tr>

<tr><td>nuze-lib.c</td><td>source code specific to NuzeBot</td></tr>
<tr><td>nuze-lib.h</td><td>source code specific to NuzeBot</td></tr>
<tr><td>std.c</td><td>source code, general</td></tr>
<tr><td>std.h</td><td>source code, general</td></tr>
<tr><td>nuze.c</td><td>source code, main</td></tr>

<tr><td>compile.sh</td><td>Linux script for compiling</td></tr>
<tr><td>crossCompile.sh</td><td>Linux script for compiling for Win32</td></tr>

<tr><td>mem.dat</td><td>memory file</td></tr>

<tr><td>index.htm</td><td>home page for the web interface</td></tr>
<tr><td>mimes.txt</td><td>mime types for files for the web interface</td></tr>

</table>

<p>Below, more explanation is provided for some of the files.



<h3>sites.txt</h3>

<p>This file should contain the addresses of all of the pages that you want scanned for news headlines. Customize it to list pages that present interesting headlines in the form of hyperlinks.



<h3>words.txt</h3>

<p>This file contains the words to be used to score the headlines. The format has recently changed to be more efficient. The first line of each group is a number (specifically an integer) specifying a score. Each line after it contains a word with that score, until an empty line is reached. Then comes another integer score and more words, and so on until the end of the file.

<p>Scoring of words allows headlines to be ranked, placing the most interesting headlines at the top of the output page. Scoring also allows unwanted results to be penalized, moving them down the page or even out of the results entirely. When scanning web pages, the bot doesn't know the difference between links that are news versus links that are ads or even links to those boring "terms of service" pages, so you must use the words.txt file to tell the bot what kind of content you want.

<p>Of course, the words.txt file must be customized if NuzeBot is to find the headlines that suit your personal interests, so simply edit the words.txt to reflect your interests. When the bot has finished running, check the output page (nuze.htm) to see which headlines should have scored higher and which links should have been penalized. Then you have some clues about how to improve your words.txt file.

<p>For example, if you happen to like cheese and butter, you might append the following three lines to your words.txt file:

<p>50
<br>cheese
<br>butter

<p>Any headline containing "cheese" or "butter" is given fifty points. A headline containing both words "cheese" and "butter" gets a hundred points. Neat, eh?
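
<p>The scoring described above can be sketched in C roughly as follows. This is an illustration only, not the code from nuze-lib.c: the function names are hypothetical, matching is assumed to ignore case, and each word is assumed to count at most once per headline.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Case-insensitive substring test (assumption: matching ignores case). */
static int contains_nocase(const char *hay, const char *needle, size_t n)
{
    if (n == 0)
        return 0;
    for (; *hay; hay++) {
        size_t i = 0;
        while (i < n && hay[i] &&
               tolower((unsigned char)hay[i]) ==
               tolower((unsigned char)needle[i]))
            i++;
        if (i == n)
            return 1;
    }
    return 0;
}

/* Score a headline against word-list text in the words.txt layout:
   a score line, then one word per line, then an empty line. */
static int score_headline(const char *words, const char *headline)
{
    int total = 0, score = 0, need_score = 1;
    const char *p = words;

    while (*p) {
        const char *end = strchr(p, '\n');
        size_t len = end ? (size_t)(end - p) : strlen(p);

        if (len == 0) {
            need_score = 1;        /* empty line ends the group */
        } else if (*p == '#') {
            /* comment line: ignored */
        } else if (need_score) {
            score = atoi(p);       /* first line of a group is the score */
            need_score = 0;
        } else if (contains_nocase(headline, p, len)) {
            total += score;        /* each listed word counts once */
        }
        p += len;
        if (*p == '\n')
            p++;
    }
    return total;
}
```

<p>Given the text "50\ncheese\nbutter\n", a headline mentioning only cheese scores 50 in this sketch, and one mentioning both cheese and butter scores 100.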

<p>As another example, if you want to get rid of all links about toadstools, you could simply append the following two lines to your words.txt file:

<p>-9999
<br>toadstool

<p>This gives "toadstool" a score of -9999, so any link containing that string of characters is penalized by 9999 points.

<p>Every headline/hyperlink starts with a score of zero, and depending on the words in the name and address parts of the link, its score is increased or decreased. Headlines with negative scores are usually not shown, but you can change that via the lower-limit option (-ol) described under Command Line Options. It can be useful during testing to see which headlines are being penalized.

<p>As you add more and more words and their scores, the words.txt file can become too big and disorganized to handle. That's why we've made it so you can use the following syntax to include another file:

<p>@otherfile.txt

<p>The '@' tells the bot to look for more words in the file with the name that is specified directly after the @. This way, you can organize your words by topic, with a separate file for each topic.

<p>Sometimes, you might want to match a phrase containing multiple words. But headlines in HTML might contain multiple spaces or even a newline between words, making precise matching difficult with normal string matching. Furthermore, web addresses used for links often contain clues about their content, but they typically contain dashes or other characters between words instead of spaces. That's why we've designed the NuzeBot so that the '-' character in the words.txt file will match any number (including zero) of non-alphanumeric characters.

<p>8
<br>peter-pan

<p>That will boost any headline about Peter Pan by 8 points, whether it appears as Peter Pan or Peter-Pan or Peter.Pan or PeterPan. However, it will also match Peter Panda and Peter Panama, and that's why we've made it so that the '_' character matches a single character of a non-alphanumeric type.

<p>8
<br>_peter-pan_

<p>That will exclude Peter Panda, Peter Panama, and other false positives. The '_' at the beginning and the end will help make sure the bot only boosts relevant links. You could use a space in place of any '_' character, but spaces are hard to see in most text editors, so we prefer the '_' character.
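
<p>A minimal sketch in C of the '-' and '_' matching rules described above. The function names are hypothetical and this is not the matcher from nuze-lib.c. It assumes case-insensitive comparison, handles '-' greedily without backtracking, and does not treat the start or end of the text as a boundary for '_'.

```c
#include <ctype.h>
#include <stdbool.h>

/* Try to match the whole pattern starting exactly at text. */
static bool match_here(const char *pat, const char *text)
{
    for (; *pat; pat++) {
        if (*pat == '-') {
            /* zero or more non-alphanumeric characters (greedy) */
            while (*text && !isalnum((unsigned char)*text))
                text++;
        } else if (*pat == '_') {
            /* exactly one non-alphanumeric character */
            if (!*text || isalnum((unsigned char)*text))
                return false;
            text++;
        } else {
            /* ordinary character, compared without regard to case */
            if (tolower((unsigned char)*pat) !=
                tolower((unsigned char)*text))
                return false;
            text++;
        }
    }
    return true;
}

/* True if the pattern matches anywhere in text. */
static bool pattern_found(const char *pat, const char *text)
{
    for (;; text++) {
        if (match_here(pat, text))
            return true;
        if (!*text)
            return false;
    }
}
```

<p>In this sketch, the pattern peter-pan is found in "Peter Panda", while _peter-pan_ is found in "the Peter-Pan show" but not in "see Peter Panda today".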



<h3>reg.txt</h3>

<p>This is the file for scoring headlines via regular expressions. You can leave this file empty if you don't like regular expressions. The format is slightly different from the words.txt file: The score is given on one line, then all regular expressions with that score are given on consecutive lines. A blank line must be found at the end of each list of regular expressions.

<p>For example:

<p>5
<br>\b(f|do|bur)rito
<br>\bcheeto
<br>\btacos?\b
<br>
<br>3
<br>\b(bubble|chewing)\s*gum
<br>

<p>Regular expressions are a new feature for NuzeBot and have not been tested.



<h2>Command Line Options</h2>

<p>The command line options allow you to change how NuzeBot works without your needing to figure out how to edit the C programming code and recompile. For boolean yes/no true/false options, use 1 for yes or true and 0 for no or false. The same options can be used in the config.txt file. Options that are given on the command line will override the options in the config.txt file.

<p>Input options:
<br>Input options begin with the letter 'i'.

<p>-iw wordfile.txt
<br>The specified file contains the words that will be used for scoring headlines.

<p>-is sitefile.txt
<br>The file contains the web addresses that will be scanned for headlines.

<p>-ih headlines.txt
<br>This tells NuzeBot to load the headlines from the headlines.txt file. This is useful when you want to create multiple output pages about different topics.

<p>-ip "mycmd -myoption"
<br>This tells NuzeBot to use a pipe instead of a library function to load pages. You could specify a custom command, perhaps using wget or curl, or specify 'c' or 'w' to use generic curl or wget command lines.


<p>Output options:
<br>Output options begin with the letter 'o'.

<p>-of outfile.htm
<br>name of the output file (nuze.htm by default)
<br>Specify - for stdout.

<p>-ot "My News Page"
<br>title of your news page

<p>-oh headlines.txt
<br>This tells NuzeBot to save the headlines to a file and quit. These headlines can be loaded later using the -ih headlines.txt option. When creating multiple news pages, save time and avoid unnecessary HTTP requests by saving headlines to a file to use for generating all news pages.

<p>-ox cal
<br>This tells NuzeBot to execute a command when done.

<p>-om 100
<br>maximum number of headlines to show

<p>-ol -1
<br>lower limit for scores of headlines to show

<p>-op 0
<br>whether to show the full page or just headlines
<br>Set to zero if you only want the headlines.

<p>-oi 0
<br>whether to include informative stats at the bottom of the output page
<br>Set to zero to disable stats.

<p>-oe 1
<br>whether to show extra information about hyperlinks on mouseover

<p>-ov 1
<br>set verbosity


<p>Web interface options:
<br>Web interface options begin with the letter 'w'.

<p>-wp 8008
<br>port to use for web interface




<p>Other options:

<p>-r ntw
<br>There are three ways to run the NuzeBot: now, timer, and web interface. Specify any combination of 'n', 't', and 'w'.

<p>-d 5
<br>delay in seconds to wait between page loads

<p>-c 9
<br>sets compression level (0-9) for the mem.dat file

<p>-?
<br>shows help



<h2>Compiling</h2>

<p>To compile the NuzeBot on Linux, simply run the new ./setup.sh script.

<p>In an effort to make it easy for you to get started with minimal hassle, we've made it so you can compile the NuzeBot without any of the libraries that add extra features. To set one of the build options below, define its name, which is written in capital letters. The best way is to pass the -D argument to GCC, which is normally called via the compile.sh file when compiling NuzeBot. For example, to define NOCURL when compiling with GCC, add -DNOCURL to the command line. Build options work on Linux or Windows.

<p>NOCURL
<br>Define this to build without libcurl support. You will need to use the -ip option to specify a pipe.

<p>NOZLIB
<br>Define this to build without zlib support. This option simply disables compression of the memory file.

<p>NOREG
<br>Define this to build without support for regular expressions. You can use the pre-existing syntax for matching with the words.txt file as explained above.

<p>NOWEB
<br>Define this to build without the web interface.

<p>STRNDUP
<br>Define this to build the bot with its own strndup function in case you get a compile error for missing strndup.

<p>LITE
<br>Define this to build the lite NuzeBot, which uses no unusual DLLs. Defining LITE is the same as defining NOREG, NOCURL, and NOZLIB.
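
<p>As an illustration of how such build options are typically wired up with the C preprocessor, here is a small sketch. It is not copied from the NuzeBot sources, and the function page_loader is hypothetical. Only the macro names LITE, NOREG, NOCURL, and NOZLIB come from the list above.

```c
/* Hypothetical sketch: LITE switches on the three library-free options. */
#ifdef LITE
#  ifndef NOREG
#    define NOREG
#  endif
#  ifndef NOCURL
#    define NOCURL
#  endif
#  ifndef NOZLIB
#    define NOZLIB
#  endif
#endif

/* Report which page loader this build would use. */
static const char *page_loader(void)
{
#ifdef NOCURL
    return "pipe";    /* a pipe must then be given with -ip at run time */
#else
    return "libcurl";
#endif
}
```

<p>Compiling this sketch with -DLITE or -DNOCURL makes page_loader() return "pipe". Without either, it returns "libcurl".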

<p>Build options are a new feature for NuzeBot and have not been thoroughly tested.



<h2>Websites</h2>

<p>Keep up with the NuzeBot project at the following web address:

<p><a href="https://sourceforge.net/p/nuzebot/">
https://sourceforge.net/p/nuzebot/</a>



<h2>Contact</h2>

<p>If you don't want to use the websites, you can send your comments, bug reports, and feature requests to the following email address:
<br>wrspain@gmx.us



<h2>License</h2>

<p>The files in the NuzeBot project are Copyright &copy;2021 Ron Spain and are provided under the MIT license, a comparatively permissive license for open source projects.

<p><b>Share and enjoy.</b>



</div>

</body></html>