Crawlstat

The crawlstat code has two major parts:
1 backend code in Python
2 frontend code in PHP and jQuery

Backend
The backend consists of several files, mostly written in Python. The file names are self-descriptive, so the name indicates what each file does.
Path
/home/hduser/crawlstatJobs_hafiz
1 make_dump.py => gets url and cld2_str from Solr and writes a dump file.
2 python.py => reads domains from dump5.txt and writes them to the domains.txt file.
3 crawlstatJobs.sh => runs all the crawlstat jobs.
4 domains_DB.py => reads domains from domains.txt and writes them to the DB.domains table.
5 sld_mapper.py & sld_reducer.py => read the SLDs (second-level domains) and write them to the DB.
6 tld_mapper.py & tld_reducer.py => same as above, for TLDs (top-level domains).
7 Language_Extraction.py => reads the first, second and third language from dump.txt and writes them to first_language.txt, second_language.txt and third_language.txt respectively.
8 first_language_DB.py, second_language_DB.py, third_language_DB.py => read the first, second and third languages from the text files and save them in the DB.
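
The domain extraction step (item 2 above) can be sketched roughly as follows. This is only an illustration, not the actual python.py: the dump line layout (url, a tab, then cld2_str) and the function name count_domains are assumptions.

```python
from collections import Counter
from urllib.parse import urlparse


def count_domains(dump_lines):
    """Count host names in dump lines assumed to look like 'url<TAB>cld2_str'."""
    counts = Counter()
    for line in dump_lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        url = line.split("\t", 1)[0]  # assumed: the URL is the first tab-separated field
        host = urlparse(url).hostname  # None when the URL has no network location
        if host:
            counts[host] += 1
    return counts


# Hypothetical sample lines standing in for the real dump file.
sample = [
    "http://example.com/a\tURDU",
    "http://example.com/b\tURDU",
    "https://news.example.org/x\tENGLISH",
]
domain_counts = count_domains(sample)
```

The resulting counter maps each domain to its number of occurrences, which is the kind of data the domains table holds.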

Path
/home/hduser/crawlstatJobs_hafiz/fetch_phase_stats/
1 daily_job_runner.sh => runs the code that needs to run on a daily basis.
2 solrClass.py => class for accessing Solr through a single point, so that in future the Solr path can be changed in a single line of code.
3 DB.py => class for the database. Through this file the DB path can be changed in a single line of code.
4 index_doc_info.py => gets the total number of documents indexed and the total number of web_group, then stores them in MySQL in the index_doc_info table.
5 language_detection_info.py => gets cle_score and cld2_score from Solr and saves them in MySQL in the Language_detector table.
6 main.py => gets from the JobHistory server the total no. of docs that appeared in fetch, the no. of docs successfully fetched, low urdu_contents docs, timeouts and other errors, then saves them to the fetch_info table in the DB.
7 jobCounterClass.py => helper class used in main.py.
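
The "single point" idea behind solrClass.py and DB.py can be sketched as below. This is a minimal illustration only: the class name SolrConfig, the base URL and the core name are hypothetical and not taken from the real code.

```python
class SolrConfig:
    """Single place that holds the Solr location (all values are hypothetical)."""

    BASE_URL = "http://localhost:8983/solr"  # assumed address; change it here only
    CORE = "crawl"                           # assumed core name

    @classmethod
    def select_url(cls):
        # Every job builds its query URL through this method, so changing the
        # Solr path means editing one line above.
        return "{}/{}/select".format(cls.BASE_URL, cls.CORE)


query_url = SolrConfig.select_url()
```

A DB class can follow the same pattern, keeping the host, user and database name as class attributes.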

Database explanation
All the tables are in the dashboard database.
cld2 => has 10 columns named zero, ten, twenty and so on up to ninety.
The zero column contains the no. of documents whose cld2 value is from 0 to 10,
and so on for the other columns.

cle => also has ten columns, named zero, ten, twenty and so on up to ninety.
The zero column contains the no. of documents whose CLE value is from zero to ten, and so on for the other columns.
Language_detector => this table generates the foreign key and date for the cld2 and cle tables.

The primary key of the cld2 and cle tables is a foreign key taken from the Language_detector table.
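
The ten-column layout of the cld2 and cle tables amounts to a histogram with buckets of width 10. A rough sketch, under two assumptions not confirmed by the wiki: a score of exactly 10 goes in the ten column, and a score of 100 is counted in the last column. The function name bucket_scores is hypothetical.

```python
# Column names as described for the cld2 and cle tables (spelling assumed).
BUCKETS = ["zero", "ten", "twenty", "thirty", "forty",
           "fifty", "sixty", "seventy", "eighty", "ninety"]


def bucket_scores(scores):
    """Count scores per column, assuming half-open ranges [0, 10), [10, 20), ..."""
    counts = {name: 0 for name in BUCKETS}
    for score in scores:
        index = int(score) // 10
        index = max(0, min(index, 9))  # clamp so that 100 lands in the last column
        counts[BUCKETS[index]] += 1
    return counts


histogram = bucket_scores([0, 9.5, 10, 55, 99, 100])
```

Each run would then produce one row of ten counts, keyed by the id generated in Language_detector.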

domains => this table contains each unique domain name and its number of occurrences.
fetch_info => this table contains the no. of URLs that appeared in the fetch phase, the no. of URLs successfully fetched, and the low URDU_CONTENT, timeout, unexpected, invalid_uri, failed_to_respond and unknown_host errors.

first_language => this table contains the first_language name and the no. of documents that contain this language.

second_language => same as described above, for the second language.
third_language => same as described above, for the third language.

section_stats => this table contains the Solr groups, i.e. web, youtube, books etc., and the no. of documents in each group.

slds => this table contains the second-level domains.
tlds => this table contains the top-level domains.
total_domains => this table contains the total no. of unique domains indexed in Solr so far.
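
How the values for the slds and tlds tables might be derived from a domain name can be sketched as follows. This is a naive illustration, not the logic of the real mapper and reducer scripts: it splits on dots and does not handle multi-part suffixes such as co.uk, and whether the real tables store counts is an assumption. The names split_domain and count_levels are hypothetical.

```python
from collections import Counter


def split_domain(host):
    """Return (tld, sld) using a naive split on dots (no public suffix handling)."""
    labels = host.lower().strip(".").split(".")
    tld = labels[-1]               # last label, e.g. "org"
    sld = ".".join(labels[-2:])    # last two labels, e.g. "example.org"
    return tld, sld


def count_levels(hosts):
    """Count TLDs and SLDs over a list of host names."""
    tld_counts, sld_counts = Counter(), Counter()
    for host in hosts:
        tld, sld = split_domain(host)
        tld_counts[tld] += 1
        sld_counts[sld] += 1
    return tld_counts, sld_counts


# Hypothetical host names for illustration.
hosts = ["news.example.org", "example.org", "blog.example.com"]
tld_counts, sld_counts = count_levels(hosts)
```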

Last edit: hafiz naser aslam 2018-10-30
