| File | Date | Author | Commit |
|---|---|---|---|
| CommonGround.py | 2017-01-27 | petebleackley | [r11] Fixed addressing issues |
| Licence.txt | 2016-12-13 | petebleackley | [r1] Initial release of CommonGround reference imple... |
| README.md | 2017-01-13 | petebleackley | [r5] Switched to NLTK implementation of VADER sentim... |
| requirements.txt | 2017-01-13 | petebleackley | [r5] Switched to NLTK implementation of VADER sentim... |
This project is the reference implementation of the Common Ground algorithm, which is intended to provide a means of bypassing filter bubbles in social media recommendation systems.
Confirmation bias, the tendency of people to seek out opinions they find agreeable and avoid those they find disagreeable, can be trained into a recommendation system, leading to the phenomenon known as a filter bubble, where social media will show people only what they agree with to start with, thus reinforcing their entrenched opinions. To counteract this effect, it is desirable to create a recommendation system that will connect people whose opinions differ.
However, simply exposing people to content from those with opposing opinions will not in itself overcome confirmation bias - indeed, it risks reinforcing it by provoking hostile reactions. Therefore, it is necessary to identify common ground between such people - those topics on which they are likely to agree with each other despite their overall differences.
The algorithm represents a user's opinions in terms of an Attitude Vector.
Consider a set of documents $D_u$ posted by a given user $u$. Each document $d$ may be associated with a topic vector $\mathbf{t}_d$ and a sentiment $s_d$. The means by which these are calculated is specific to the implementation. A sentiment is defined to be positive when the emotions expressed in the document are positive, and negative when they are negative.
The Attitude Vector is then defined as

$$\mathbf{A}_u = \sum_{d \in D_u} s_d \mathbf{t}_d.$$
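The Attitude Vector (the sentiment-weighted sum of a user's topic vectors) can be sketched in plain Python. The function name and list representation are illustrative; the reference implementation stores these in pandas structures.

```python
def attitude_vector(topic_vectors, sentiments):
    """Sum of sentiment-weighted topic vectors for one user's documents."""
    dims = len(topic_vectors[0])
    attitude = [0.0] * dims
    for t_d, s_d in zip(topic_vectors, sentiments):
        # Each document contributes its topic vector scaled by its sentiment
        for i, component in enumerate(t_d):
            attitude[i] += s_d * component
    return attitude

# Two documents: one positive about topic 0, one mildly negative about topic 1
attitude_vector([[1.0, 0.0], [0.0, 1.0]], [0.8, -0.3])  # [0.8, -0.3]
```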
The sympathy between two users $u$ and $v$ is defined as the cosine similarity of their Attitude Vectors, i.e.

$$S(u, v) = \frac{\mathbf{A}_u \cdot \mathbf{A}_v}{\lvert\mathbf{A}_u\rvert\,\lvert\mathbf{A}_v\rvert}.$$

This is positive when the users' opinions generally agree, and negative when they generally disagree.
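The sympathy score is ordinary cosine similarity, which can be sketched with the standard library alone (names here are illustrative):

```python
import math

def sympathy(a_u, a_v):
    """Cosine similarity of two attitude vectors."""
    dot = sum(x * y for x, y in zip(a_u, a_v))
    norm_u = math.sqrt(sum(x * x for x in a_u))
    norm_v = math.sqrt(sum(y * y for y in a_v))
    return dot / (norm_u * norm_v)

# Exactly opposed attitudes give a sympathy close to -1
sympathy([1.0, 2.0], [-1.0, -2.0])
```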
The first stage of finding recommended content for a user is to filter the set of candidate users to those with whom the given user's sympathy score is negative.
For two users $u$ and $v$, the Common Ground Vector is defined as the element-wise product of their Attitude Vectors, i.e. $\mathbf{C}_{uv} = \mathbf{A}_u \circ \mathbf{A}_v$. Positive components of this vector correspond to subjects on which users are likely to find common ground.
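The element-wise product is a one-liner in plain Python (function name illustrative):

```python
def common_ground(a_u, a_v):
    """Element-wise product of two attitude vectors."""
    return [x * y for x, y in zip(a_u, a_v)]

# Both users feel positively about topic 0, so its component is positive;
# they disagree about topic 1, so its component is negative.
common_ground([0.9, 0.5], [0.4, -0.7])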
Given a set of documents posted by users $v$ who have negative sympathy scores with respect to user $u$, the Recommendation Score of a document $d$ posted by user $v$ is defined as

$$R_u(d) = \mathbf{C}_{uv} \cdot \mathbf{t}_d.$$
A high value of this indicates a document written by a person the given user is likely to disagree with overall, but which primarily concerns topics on which that user is likely to agree with them.
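A minimal sketch of this scoring, taking the Recommendation Score as the dot product of the Common Ground Vector with a document's topic vector (a plain-Python sketch; names are illustrative):

```python
def recommendation_score(c_uv, t_d):
    """Score a document's topic vector against a Common Ground Vector."""
    return sum(c * t for c, t in zip(c_uv, t_d))

# A document mostly about the shared-ground topic scores high...
recommendation_score([0.5, -0.4], [1.0, 0.0])   # 0.5
# ...while one about the contested topic scores low.
recommendation_score([0.5, -0.4], [0.0, 1.0])   # -0.4
```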
The reference implementation can be found in the file CommonGround.py. It is written in Python 2.7 and uses the following libraries.
Topic modelling is performed with gensim, using Latent Semantic Indexing with 256 topics. The topic model is initialised with a training corpus.
Documents are tokenized and optionally passed through a preprocessing stage. The tokenized document is then converted to a list of $(w, \mathrm{TF}\delta H)$ values, defined as

$$\mathrm{TF}\delta H(w) = \frac{n_w}{N}\,\delta H_w, \qquad \delta H_w = \sum_{D} P(D \mid w) \log_2 \frac{P(D \mid w)}{P(D)},$$

where

- $n_w$ is the number of occurrences of token $w$ in the document,
- $N$ is the total number of tokens in the document,
- $P(D \mid w)$ is the probability that an instance of $w$ randomly selected from the training corpus is found in document $D$, and
- $P(D)$ is the probability that a token randomly selected from the training corpus is found in document $D$.
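The weighting can be sketched directly from these definitions, under the assumption that $\delta H_w$ is the entropy reduction $\sum_D P(D \mid w) \log_2 \bigl(P(D \mid w)/P(D)\bigr)$; the function names and the list-of-lists corpus representation are illustrative, not the reference implementation's own API:

```python
import math
from collections import Counter

def delta_h(token, corpus):
    """Entropy reduction for a token over a corpus of tokenized documents."""
    total_tokens = sum(len(doc) for doc in corpus)
    occurrences = [doc.count(token) for doc in corpus]
    n_w = sum(occurrences)
    dh = 0.0
    for doc, occ in zip(corpus, occurrences):
        if occ == 0:
            continue  # zero terms contribute nothing
        p_d_given_w = occ / n_w          # P(D|w)
        p_d = len(doc) / total_tokens    # P(D)
        dh += p_d_given_w * math.log(p_d_given_w / p_d, 2)
    return dh

def tf_delta_h(document, corpus):
    """List of (token, TFdH) pairs for one tokenized document."""
    counts = Counter(document)
    n_tokens = len(document)
    return [(w, (c / n_tokens) * delta_h(w, corpus)) for w, c in counts.items()]
```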
Sentiment Analysis is carried out using nltk.sentiment.vader. Each document is split into sentences, and the sum of the compound scores for each sentence in the document is calculated.
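The split-then-sum pattern can be sketched without NLTK; the toy word-polarity lexicon and naive sentence splitter below are stand-ins for VADER's compound scores and NLTK's sentence tokenizer, used only to keep the example self-contained:

```python
import re

# Toy stand-in for the VADER lexicon: word -> polarity
TOY_LEXICON = {"good": 0.5, "great": 0.8, "bad": -0.5, "awful": -0.8}

def compound_score(sentence):
    """Crude per-sentence score; the real implementation uses
    nltk.sentiment.vader's compound score."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return sum(TOY_LEXICON.get(w, 0.0) for w in words)

def document_sentiment(document):
    """Split into sentences, then sum the per-sentence scores."""
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    return sum(compound_score(s) for s in sentences)
```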
class CommonGround(object):
- `Topics`: a `pandas.DataFrame` containing the topic vectors for each document.
- `Sentiments`: a `pandas.DataFrame` containing the sentiment score for each document.
- `Attitudes`: a `pandas.DataFrame` containing the Attitude Vector for each user.
- `modulus`: a `pandas.Series` containing the magnitudes of the Attitude Vectors.
- `dH`: a `pandas.Series` containing the $\delta H$ weights used for weighting the topic model.
- `dictionary`: a `gensim.corpora.dictionary.Dictionary` object mapping words to token ids.
- `model`: a `gensim.models.LsiModel` object which performs topic modelling (Latent Semantic Indexing).
- `preprocess`: an optional callable which preprocesses documents prior to topic modelling.
def __init__(self,training_corpus,processing_pipeline=None):
"""sets up data structures
training_corpus is an iterable of documents
processing_pipeline is an optional callable that performs tasks
such as stemming, POS tagging and word sense disambiguation on a
tokenized document"""
training_corpus is used to initialise the topic model. It is an iterable containing documents.
processing_pipeline (optional) is a callable used to perform preprocessing on documents prior to topic modelling. This may involve stemming, tagging, word sense disambiguation, or similar. It should accept a list of strings and return a list of hashable objects.
def __call__(self,user,n=10):
"""Finds content from people that the user normally disagrees with
that reflects his/her common ground with those people"""
user a hashable object identifying a user for whom recommendations are to be found
n (optional, default=10) is the number of results to return
A pandas.Series indexed with (user,uri) for the n documents with the highest Recommendation Score for the given user, and containing their Recommendation Scores.
def add_document(self,user,uri,document):
"""Calculates the sentiment score, and topic vector for the document,
and updates the user's attitude vector"""
user A hashable object identifying the user who posted the document
uri A string representing the uri of the document
document The document itself (a string)
def SetupDH(self,training_corpus):
"""Sets up the deltaH weights used for topic modelling"""
Called by __init__, using the training_corpus.
def tokenize(self,document):
"""For a document, returns a list of tokens.
For a corpus (list of documents), returns a list of tokenized documents.
Performs preprocessing if a pipeline was passed in the constructor."""
document A string representing a document or a list of strings representing a corpus of documents
A list of (optionally preprocessed) words if document was a string, or a nested list if it was a list.
def get_features(self,document,is_corpus=False):
"""Extracts features, weighted by TFdH for a document or corpus"""
document A list of (optionally preprocessed) words representing a document, or a nested list of these representing a corpus
is_corpus (optional,default False) indicates whether document represents a corpus
A list of (token,TFdH) tuples representing the document, or a nested list of these representing the corpus
This software is released under the MIT licence.
Copyright (c) 2016 Dr Peter J Bleackley
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.