masha_ivenskaya's blog

Text-based fake news detection: DONE!

In the final phase of the Google Summer of Code, Masha fine-tuned the classifier for sensationalism detection and added a left-right bias classifier.

The system is a bit too resource-heavy to run as an on-site demo, but all of the code and data to run the classifier locally is available from Github. The Bias Classifier directory on Github contains the trained model, as well as Python code to train a model and to classify new data, the data used for training, and a sample of test data (a held-out set) with true labels and the classifier scores. In the training data, label ‘0’ corresponds to the ‘least-biased’ class, ‘1’ corresponds to ‘left’, and ‘2’ corresponds to ‘right’.

This classifier takes as input a 2-column CSV file, where the first column corresponds to the headlines and second one corresponds to the article texts.

Usage for the Python code:

python -args

The arguments are:
-t, --trainset: Path to training data (if you are training a model)
-m, --model: Path to model (if you are using a pre-trained model)
-d, --dump: Dump trained model? Default is False
-v, --verbose: Default is non-verbose
-c, --classify: Path to new inputs to classify
-s, --save: Path to the output file (default is 'output.csv')


The output is a number between -1 and 1, where -1 is most left-biased, 1 is most right-biased, and 0 is least-biased.


The articles come from the crawled data - a hand-picked subset of sites that were labeled as "right", "right-center", "left", "left-center", and "least-biased" by   I used one subset of sources for the training data and a different subset of sources for the testing data in order to avoid overfitting.   I also trained a separate model on all of the sources I had available - since it is trained on more data, it may perform better. This model is also available in the Github directory under the name “trained_model_all_sources.pkl”

It is worth noting that articles from  'right-center' and 'left-center' sources often exhibit only a subtle bias, if any at all.  This is because the bias of these sources is often not evident on a per-article basis, but only on a per-source basis.  It may exhibit itself, for example, through story selection rather than through loaded language.  For this reason I did not include articles from 'right-center' and 'left-center' sources in the training data, but I did use them for evaluation. 


The classifier has a two-tiered architecture, where first the unbiased articles are filtered out, and then a second model distinguishes between right and left bias.  Both models are Logistic Regressions based on lexical n-gram features, implemented through scikit-learn.


Both models rely on bag-of-word n-gram features (unigrams, bigrams, trigrams).


The output is a number between -1 and 1, where -1 is most left-biased, 1 is most right-biased, and 0 is least-biased. For evaluation purposes, scores below 0 are considered “left”, above 0 are considered “right”, and 0 is considered “least-biased”. 

As previously mentioned, along with the 3 classes that are present in the training data, there are two addition in-between classes that I used for evaluation only.  

In order to be counted as correct for recall, right-center can be predicted as either 'right' or 'least-biased', and left-center can be predicted as 'left' or 'least-biased'.  In addition, when calculating the precision of the 'least-biased' class,  'least-biased', 'right-center' and 'left-center' true classes all count as correct. 

Class Precision Recall Right 45% 82% Left 70% 71% Right-center N/A 70% Left-center N/A 60% Least-biased 96% 33%


Unlike the Sensationalism classifier, this classifier relies on lexical features, which may be specific to the current political climate etc.  This means that the training data might "expire" and as a result the accuracy could decrease.  

Text-based fake news detection: Phase II

During Phase 2 of Google Summer of Code, I continued my data-aggregation efforts, developed the Source Checker tool, and trained a model that detects sensationalist news articles.

1. Data Aggregation

Throughout Phase 2, I crawled over 200 domains daily, and continued researching news domains and adding them to my crawler. As of today, I have aggregated over 30k news articles. As I plan to use these articles for classification models, below is the breakdown by each potential class:

Sensationalism Classifier:

Sensationalist: 13k Objective: 8.5k

Bias Classifier:

Right: 12k

Right-center: 1k

Least-biased: 3.5k

Left-center: 2k

Left: 4.5k

2. Source Checker

This is a tool that was requested by GSOC-mentors, @vincent_merckx and @amra_dorjbayar. It takes as input a snippet of text - presumably, a news article or part of a news article. It returns a graph output that shows what types of domains publish the text (or parts of the text)"

Example Graph:

  • The circles correspond to returned domains.

  • Circle size corresponds to amount of overlap between the input snippet and the domain.

  • Circle border color corresponds to bias: blue = left, red = right, green = neutral, grey = unknown.

  • Circle fill corresponds to unreliability: black circles are classified by one of the lists as either fake, unreliable, clickbait, questionable, or conspiracy. The blacker the circle - the more unreliable it is.

  • Edges that connect circles correspond to overlap of statements - the thicker the edge, the bigger the overlap.

After GSOC ends, we will localize this tool for Dutch articles as well.

Architecture of the tool:

The text snippet is broken down into n-grams using the Pattern n-gram module. N-grams that consist primarily of stop-words or named entities are discarded. A sample of the remaining n-grams is reconstructed into the original strings and run through the Google API as an exact phrase (in quotation marks) . The returned domains are then rated by the amount of queries that returned that domain (more than 6 out of 10 = "high overlap", 3 to 6 = "some overlap", less than 3 = "minimal overlap"), and matched against our database. The graph is rendered using the Pattern Graph module.

3. Sensationalism Classifier

I used the aforementioned crawled data to train a model that classifies a news article as either sensationalist or not. This model currently achieves an F1-score of 92% (obtained through 5-fold cross-validation).

It takes as input a 2-column CSV file, where the first column corresponds to the headlines and second one corresponds to the article texts. The output file contains a third column with the label - 1 if the input is categorized as sensationalist, 0 if not.

The classifier is an SVM, and it uses the following features:

  • POS tags (unigrams and bigrams)

  • Punctuation

  • Sentence length

  • Number of capitalized tokens (normalized by length of text)

  • Number of words that overlap with the Pattern Profanity word list (normalized by length of text)

  • Polarity and subjectivity scores (obtained through the Pattern Sentiment module)

Text-based fake news detection: Phase I

During the first month of our Google Summer of Code, I have been working along 3 distinct avenues:

1. Compiling news domains

Coming into the project, we had several lists of questionable domains:

  • The OpenSources list that I worked with previously (the BS-detector Chrome extension is based on this list)

  • Guy posted a list from Politifact

  • We were also looking at using since they seem to have a very comprehensive list, with categorization that may align with our needs (for ex. least-biased vs right-biased vs left-biased), as well as some information about each source.

I wanted to aggregate all of this information/categorization in one place, so I put together a CSV of all domains from the three sources above (~2k domains), along with the categories assigned by each, any additional comments, etc. It's been interesting to look at the overlap as well as at the discrepancies among these. This file will probably have several applications throughout the course of the summer and will be made available to the general public.

2. Crawling news domains

Later this summer we may end up building one or more text classifier that would classify a news article based on its content (rather than the source where it was published). For example, we may build a classifier for distinguishing sensationalist vs. objective news style, a classifier for detecting right vs. left bias, etc. The first step for any of these endeavors, of course, is to collect data.

I have started to crawl the domains from the compiled file mentioned above. My approach is to tread carefully and thoughtfully in order to ensure "clean", cohesive datasets, rather than to try to automatically crawl all domains and gather as much data as possible. I hand-pick each domain to be crawled, based on information from MBFC, Open Sources, and Politifact, as well as my own judgement - only picking those domains that clearly exhibit characteristics of a potential category (ex. sensationalist, objective, pseudoscience etc.)

I am still in the process of checking the domains and adding them to the crawler. As of today (6/24), I am crawling over 100 domains, accumulating more than 1k articles daily.

3. Source Checker

GSOC mentor Amra Dorjbayar (VRT) pitched an idea for a useful demo tool - a source checker that takes a text, chops into pieces, googles the result, and returns the sources that publish this text, as well as a warning if one of the sources is not reputable. I have started putting together a prototype for this:

  • Using Pattern's n-gram module, I break the text into n-grams

  • I discard n-grams that would not be useful for googling, such as n-grams that consist primarily of named entities (ex. 'Rand', 'Paul', 'of', 'Kentucky', 'Ted', 'Cruz', 'of', 'Texas', 'Mike', 'Lee') or of stop-words (ex. 'to', 'being', 'able', 'to', 'boast', 'about', 'the', 'adoption', 'of', 'a')

  • I pick a random subset of the remaining n-grams and run them through Pattern's Google API

  • I use Pattern's Intertextuality module to choose only those results that match the text

  • These results can then be matched against our file of domains, and we can return to the user information about the sources that publish the text, potentially along with some sort of graph visualization

For evaluation, I am using a random subset of the crawled news articles (see above) - I break each article into snippets of various lengths, run each snippet through the tool, and check whether the domain from which the article was crawled matches one of the domains returned by the tool.

Unfortunately, this work got stalled because of Google's API query limit, so the parameters have not yet been tested and tuned. We are currently looking into using a peer-to-peer search engine like Faroo and YACY, as well as into getting budget to continue work on the Google functionality.

Overall, I believe our project is off to a great start, and I am excited to see what we achieve in July and August.