Corrected bullet points on frequency distribution and bag of words slides.
Added layout for corpus analysis. Added slides to clarify bag of words scoring, slides that describe Word2Vec, and slides that describe GloVe. Normalized styling of slides and added references for word embedding.
Added worked solutions to stemmer in-class problems. Clarified description slides of lemmatization. Added 2 in-class problems for lemmatization.
Added slide comparing Lancaster and Porter stemmer. Capitalized Lancaster and Porter. Consolidated code to create side-by-side comparison between stemmers.
Combined basic text analysis and text classification slides. Added slides for bag of words and word embedding. Created Python file for corpus analysis examples. Clarified POS tagger.
Added slides and code for lemmatization. Added references. Clarified stemming slides, and added stemming example code.
Added details to stemmer slides and part-of-speech slides. Clarified class exercise slide for tokenization. Moved separated stemming slide back with the rest of the stemming section.
Added code for stemmers. Added code to generate part-of-speech acronyms and their meanings. Cleaned up var names and comments.
Combined other text preparation code into one file. Added section for part-of-speech tagging. Fixed issues printing concordance output. Added section for removing stop words.
Added slides for part-of-speech tagging. Added details to nomenclature on language mixing. Added goal slide. Included in-class example and in-class exercise cue slides for text preparation section.
Reorganized and cleaned up. Added more explicit example for viewing raw text.
Added details to slides on stop words and obtaining other corpuses. Added code to get text from other corpuses, remove stop words, and show words tagged with parts of speech.
Updated slides to include background for text preparation. Added slides for in-class exercises for intro, tokenization, and stemming. Added example book in raw text.
Created file for brief intro to NLTK, tokenization, and stemming.
Uncommented code for class.
Fixed text in print statement to match actual output.
Adjusted number of estimators. Added import for AdaBoost.
Updated slides with overviews. Filled in some of the sections on nomenclature, text classification, and background.
Fixed data to match previous year's data.
Clarified variable names. White space cleanup.
Added section for making predictions on future data. Added code to data preparation that standardizes values for sector and industry categories that had different names.
Added fake data for prediction section of regressor review example code.
Added real-world coastal economic data. Finished section on data preparation with real data.
Added NOAA coastal economic data.
Finished slides and code for regressor review with toy data.
Fixed equations for MSE cost function.
Started slides and code for regressor review.
Removed listed regressors that were not reviewed in slides.
Updated license header text.
Added example code for XML parsing and scraping a website for XML data.
Reorganized code to push sanity check right after data preparation.
Clarified name of data repository listing file. This file contains websites that house large sets of diverse data.
Added scaling example. Added sanity check for perfect scores, comments on the cause, and code to remove the overtrained model.
Added metrics to code. Fixed code so classifiers would run. Polished slides.
Fixed arguments to RocCurveDisplay.
Touched up slides, finished preparing data in code.
Started classifier review code.
Updated data sources list. Added data and description for Detroit fire alarm inspections.
Added slide deck for classifier review.
Corrected overview slide.
Added slides for gradient boosting. Corrected slides for AdaBoost. Added code for gradient boosting.
Added details for algorithm options.
Added additional model to score comparison.
Added clarification to type of models allowed for AdaBoost.
Moved ML slides to dedicated directory. Split slides into separate files. Updated and added details to random forest slides in the ensemble methods slide deck.
Fixed wording in slides for AdaBoost. Added clarification comments to AdaBoost code.
Added slides and code for AdaBoost. Added slides for enriched random forests and regression-enhanced random forests. Corrected random forests description slide.
Testing adding config file to repo.
Testing new SourceForge URL.
Fixed typos in slides, advanced bookmark slide.
Added slides and code for random forests. Added slides for boosting.
Added slides and example code for bagging and pasting ensemble methods.
Added coffee data.
Added make_moons to file.
Fixed plot functions that no longer work in newer versions of Scikit-Learn. Copied classifier and regressor evaluation code to new files.
Added slides and code for voting ensemble methods. Advanced placeholder slide.
Added details for MSE slide. Added example code using MSE and coefficient of determination, including how to find best values.
Commented out no-longer-used Boston housing data load function. Added slides and code for regressor metrics.
Added slides and code for classification evaluation.
Added namespace_example.xml file path for XML exercises.
Fixed hyperlink in XML in-class problem 3.
Added JSON encode example.
Add supporting files.
Created new files for each example code section for organizational purposes.
Added JSON example code, slides, and files.
Added slides to cue in-class examples for BeautifulSoup and XML.
Created example XML file with namespaces, added code to read XML namespaces, fixed XML namespace slide to incorporate xmlns URL, added in-class problems for parsing XML.
Fixed menu items XML example to properly store attributes. Added slides for parsing HTML with BeautifulSoup, and using urllib. Added spacing for code output to separate out sections.
Added BeautifulSoup examples for HTML parsing.
In code: fixed typos, added different read SQL functions, showed how to get list of tables and views. Slides: moved index slide.
Added XML slides and example code for basic XML structure and reading through a non-attributed set of XML data. Added code to pull text data from website.
Added chinook database and license.
Rearranged CSV code to match presentation slides order.
Added code and slides for pulling data from a database into a pandas dataframe. Example uses SQLite. Added layout for pulling web data.
Slides and code for reading CSV, Excel, and ODS data. Added ODS example file.
Updated slides for manual engineered features, updated slides for scikit-learn built-in data sets, code to show types of built-in data sets and examples to plot them.
Added code examples for S-curve and swiss roll before and after LLE.
Added LLE example code.
Added slides for LLE algorithm, added slides for section summary.
Clarified incremental PCA, added code and slides for kernel PCA.
Clarified how incremental PCA reduces data size.
Updated slides for incremental and randomized PCA, added code for incremental and randomized PCA.
Added code to invert PCA and show original feature weights, added code to automatically determine min number of dimensions given desired variance, updated placeholder slide in slides.
Code cleanup - removed erroneous files.
Added PCA example, including scaling and ML training.
Added placeholder slides for other dimensionality reduction slides, added notes for PCA.
Reorganized slides, added details to dimensionality reduction slides, added slides for PCA.
Added slides for dimensionality reduction, clarified code comments and vars.
Added more explanation for p-values. Prepared code for class by commenting out code not in first examples.
Added explanation of p-value along with citation. Reinforcement of how to check correlations.
Added review file that concisely goes over all ideas in the class thus far.
Fixed spelling errors in slides, split Spearman correlation factor slide into 2, commented out future code.
Added slides and example for feature importances.
Added examples for correlation factors, updated slides with more correlation info.
Slides and code for imputer added.
String category encoding slides and example code added. Started slides on adding missing data.
Added new data source that has missing data and string categories. Will be used for showing how to use Scikit-Learn Imputer and category encoding.
Pipeline example for data scaling, moved .pptx to .odp due to move to Debian (.pptx size > 100 MB GitHub limit), clarified interquartile range.
Syncing up files.
Examples of data scaling added, placeholder for categorizing string values.