HTML Comments Parser Code

Extracts the HTML and JS style (single and multi-line) comments

Brought to you by: sandeepst

File	Date	Author	Commit
dist	2014-08-18	sandeepst	[ffe3b3] updated url.xlsx under dist folder
Readme.txt	2016-05-02	sandeep	[5ed855] minor changes
htmlcomments.py	2014-08-06	sandeepst	[b9a2f4] Initial commit
htmlcomments.spec	2014-08-06	sandeepst	[b9a2f4] Initial commit
keysrch.py	2014-08-06	sandeepst	[b9a2f4] Initial commit
url.xlsx	2014-08-18	sandeepst	[9fccfc] updated url.xlsx and corrected Readme.txt's wit...
urlread.py	2014-08-06	sandeepst	[b9a2f4] Initial commit

Read Me

/***************************************************************
*
*
* First Author : Sandeep Tuppad
* website : https://in.linkedin.com/in/sandeep-tuppad-b4b87840
* Licence : MIT
* contact : sandeep.tuppad@gmail.com
*
***************************************************************/
About the software:
The software visits the user-provided URLs, extracts the comments (HTML style, JavaScript single-line and multi-line) and writes a summary to a file.
It also searches the extracted comments for user-provided keywords (such as password, pswd, author) and writes the lines containing them to
a second file. This is useful for testing whether any sensitive information has been left in the comments. Visiting every page manually and checking its
comments is time-consuming; this software automates the check.

Source Files:
There are three source files written in Python.
a) htmlcomments.py
This is the main file; it uses the functions exported by the two modules below. It reads the list
of URLs specified in the spreadsheet file. For each URL it visits the page, extracts the comments and writes them to a file. It then searches the contents
of that file for each keyword specified in the spreadsheet. If a keyword is found, the line containing it is written to another file. This is repeated
for every URL and keyword in the spreadsheet. The end result is two output log files: one containing the comments from each URL and
another containing, for each URL, the comments that contain the keywords.
b) urlread.py
This module provides function(s) to read a specified column from a specified spreadsheet file.
c) keysrch.py
This module provides function(s) to read lines from a specified file, search them for a specified keyword and write each matching line
to a specified file.
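
The keyword search described for keysrch.py can be sketched roughly as follows. This is an illustration only, not the project's code: the function name find_keyword_lines is hypothetical, and the case-insensitive matching is an assumption, since the Readme does not say how keysrch.py compares text.

```python
def find_keyword_lines(lines, keywords):
    """Return (keyword, line) pairs for every line that contains a keyword.

    Assumption: matching ignores case. The Readme does not state this.
    """
    hits = []
    for keyword in keywords:
        needle = keyword.lower()
        for line in lines:
            if needle in line.lower():
                # Drop the trailing line ending before reporting the hit.
                hits.append((keyword, line.rstrip("\r\n")))
    return hits


# Made-up comment lines standing in for the contents of the comments log.
comments = [
    "<!-- TODO remove test password: hunter2 -->\n",
    "// author: jdoe\n",
    "/* layout tweak */\n",
]

for keyword, line in find_keyword_lines(comments, ["password", "author"]):
    print("%s: %s" % (keyword, line))
```

Looping over keywords first keeps the hits grouped by keyword, which matches the description of the second log file.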

Input Files:
a) url.xlsx
The spreadsheet holds a column of URLs (column A of sheet 1) and a column of keywords (column A of sheet 2). Edit the variable WHAT in cell A[2]
of sheet 3 to configure the type of comments you would like to extract. This file is the input for the software.
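
Reading this layout with xlrd might look like the sketch below. It is not taken from urlread.py: the names clean_column and load_inputs are hypothetical, "cell A[2]" is assumed to mean spreadsheet cell A2, and columns A of sheets 1 and 2 are assumed to have no header row.

```python
def clean_column(values):
    """Strip whitespace and drop empty cells from a list of cell values."""
    cleaned = []
    for value in values:
        text = str(value).strip()
        if text:
            cleaned.append(text)
    return cleaned


def load_inputs(path="url.xlsx"):
    """Return (urls, keywords, what) read from the three sheets."""
    # xlrd is third-party; it is imported here so that clean_column
    # can be used without it being installed.
    import xlrd

    book = xlrd.open_workbook(path)
    urls = clean_column(book.sheet_by_index(0).col_values(0))
    keywords = clean_column(book.sheet_by_index(1).col_values(0))
    # Assumption: "A[2]" is cell A2, i.e. row index 1, column index 0.
    what = int(book.sheet_by_index(2).cell_value(1, 0))
    return urls, keywords, what
```

Note that xlrd releases from 2.0 onward read only .xls files, so opening url.xlsx this way assumes an older xlrd release.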

Generated Log Files:
a) comlog
A text file containing the comments extracted from each URL's page, in a formatted way.
b) keysearch
A text file containing, for each URL, the comments that contain each keyword.

Configuration:
a) The software finds comments using the SOM (start of message) and EOM (end of message) markers selected in sheet 3 of the url.xlsx file. The supported values are:
1) 0: HTML style comments: SOM="<!--" and EOM="-->"
2) 1: JavaScript multi-line comments: SOM="/*" and EOM="*/"
3) 2: JavaScript single-line comments: SOM="//" and EOM="\r\n"
4) A custom option is planned, in which SOM and EOM can be edited to extract any other line(s) of a page, not just the comments.
b) Create a file named url.xlsx. Fill column A of sheet 1 with the URLs to be processed and column A of sheet 2 with the keywords to search for. Set the variable
"WHAT" in cell A[2] of sheet 3 in url.xlsx to the type of comments you would like to extract: 0, 1 or 2 for HTML comments, JavaScript multi-line
comments and JavaScript single-line comments respectively.
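
The SOM/EOM search can be illustrated with a short sketch. It is an assumption about how such a scan could work, not the code in htmlcomments.py; the names MARKERS and extract_between are hypothetical, and the HTML markers used for value 0 are the standard HTML comment delimiters.

```python
# WHAT value -> (SOM, EOM) pair, following the Configuration list.
MARKERS = {
    0: ("<!--", "-->"),   # HTML style comments
    1: ("/*", "*/"),      # JavaScript multi-line comments
    2: ("//", "\r\n"),    # JavaScript single-line comments
}


def extract_between(text, som, eom):
    """Return every span that starts at SOM and ends at the next EOM."""
    found = []
    pos = 0
    while True:
        start = text.find(som, pos)
        if start == -1:
            break
        end = text.find(eom, start + len(som))
        if end == -1:
            # No closing marker: keep the rest of the text and stop.
            found.append(text[start:])
            break
        found.append(text[start:end + len(eom)])
        pos = end + len(eom)
    return found


# Made-up page content for illustration.
page = ("<p>hi</p><!-- build 42 -->\r\n"
        "<script>/* init */ var x = 1; // counter\r\n</script>")

for what in sorted(MARKERS):
    som, eom = MARKERS[what]
    print(what, extract_between(page, som, eom))
```

Each returned span includes its markers, so a caller that wants only the comment text would strip SOM and EOM afterwards.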

Folder structure:
1) The Python source files and the url.xlsx file are in the main folder.
2) The folder "dist" contains the executable, the dependent DLLs and other files generated from the Python source files.

Limitations of the software:
a) The software searches for SOM and then for the next EOM. Everything in between is treated as a comment, so in some cases the software
wrongly extracts lines of active code as comments.
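
A small sketch of this limitation, under the assumption that the scan is a plain substring search (the helper first_span is hypothetical and is not part of the project): the "//" inside a URL string is picked up as the start of a single-line comment.

```python
def first_span(text, som, eom):
    """Return the text from the first SOM up to the next EOM, or None."""
    start = text.find(som)
    if start == -1:
        return None
    end = text.find(eom, start + len(som))
    if end == -1:
        return text[start:]
    return text[start:end]


# Active JavaScript code, no comment in it (made-up example).
line = 'var api = "http://example.com/v1"; load(api);\r\n'

print(first_span(line, "//", "\r\n"))
# prints: //example.com/v1"; load(api);
```

Avoiding such false positives would need a scanner that tracks string literals, which a plain SOM/EOM search does not do.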

Dependencies:
1) Python 2.7 must be installed (if running the Python source files).
2) The Python xlrd package, in a version compatible with Python 2.7, must be installed (if running the Python source files).
3) The software was developed and tested on Windows 7; it should be possible to port it to other platforms with few or no changes.

How to Run:
1) Edit the url.xlsx file as described above.
2) Run the file htmlcomments.exe in the "dist\htmlcomments" folder.
3) Alternatively, run the main Python script htmlcomments.py (if the dependencies are installed).
4) The output files comlog and keysrch are generated. These are text files.
