Text-based fake news detection: Phase II

During Phase 2 of Google Summer of Code, I continued my data-aggregation efforts, developed the Source Checker tool, and trained a model that detects sensationalist news articles.

1. Data Aggregation

Throughout Phase 2, I crawled over 200 domains daily and kept researching news domains to add to my crawler. As of today, I have aggregated over 30k news articles. Because I plan to use these articles to train classification models, here is the breakdown by potential class:

Sensationalism Classifier:

Sensationalist: 13k

Objective: 8.5k

Bias Classifier:

Right: 12k

Right-center: 1k

Least-biased: 3.5k

Left-center: 2k

Left: 4.5k

2. Source Checker

This tool was requested by my GSoC mentors, @vincent_merckx and @amra_dorjbayar. It takes as input a snippet of text - presumably a news article or part of one. It returns a graph that shows which types of domains publish the text (or parts of the text).

Example Graph:

  • The circles correspond to returned domains.

  • Circle size corresponds to amount of overlap between the input snippet and the domain.

  • Circle border color corresponds to bias: blue = left, red = right, green = neutral, grey = unknown.

  • Circle fill corresponds to unreliability: a black fill means that at least one of the lists classifies the domain as fake, unreliable, clickbait, questionable, or conspiracy. The darker the circle, the more unreliable the domain.

  • Edges that connect circles correspond to overlap of statements - the thicker the edge, the bigger the overlap.

After GSoC ends, we will localize this tool for Dutch articles as well.

Architecture of the tool:

The text snippet is broken down into n-grams using the Pattern n-gram module. N-grams that consist primarily of stop words or named entities are discarded. A sample of the remaining n-grams is reconstructed into the original strings, and each string is run through the Google API as an exact-phrase query (in quotation marks). Each returned domain is then rated by the number of queries that returned it (more than 6 out of 10 = "high overlap", 3 to 6 = "some overlap", fewer than 3 = "minimal overlap") and matched against our database. The graph is rendered using the Pattern Graph module.
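To make the pipeline above more concrete, here is a minimal sketch of the query-building and rating steps in plain Python. It is not the tool's actual code: the function names, the tiny stop-word list, the n-gram length of 5 and the 50% stop-word cutoff are all assumptions made for illustration. The real tool uses the Pattern n-gram module and also filters named entities, which this sketch leaves out, as it does the search call itself. Only the overlap thresholds are taken from the description above.

```python
import random

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "to", "and", "is",
              "was", "that", "it", "for", "with", "as", "by", "at"}


def word_ngrams(text, n=5):
    """Return all runs of n consecutive whitespace-separated tokens."""
    tokens = text.split()
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]


def mostly_stop_words(ngram, threshold=0.5):
    """True if more than `threshold` of the tokens are stop words (assumed cutoff)."""
    stops = sum(1 for t in ngram if t.lower().strip(".,;:!?\"'") in STOP_WORDS)
    return stops / len(ngram) > threshold


def build_queries(text, n=5, sample_size=10, seed=0):
    """Sample surviving n-grams and wrap each one in quotes as an exact-phrase query."""
    candidates = [g for g in word_ngrams(text, n) if not mostly_stop_words(g)]
    rng = random.Random(seed)
    sample = rng.sample(candidates, min(sample_size, len(candidates)))
    return ['"%s"' % " ".join(g) for g in sample]


def rate_overlap(hits):
    """Map the number of queries (out of 10) that returned a domain to a label."""
    if hits > 6:
        return "high overlap"
    if hits >= 3:
        return "some overlap"
    return "minimal overlap"
```

With 10 sampled queries, a domain returned by 7 of them would be rated "high overlap", and one returned by 2 would be rated "minimal overlap".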

3. Sensationalism Classifier

I used the aforementioned crawled data to train a model that classifies a news article as either sensationalist or not. This model currently achieves an F1-score of 92% (obtained through 5-fold cross-validation).

It takes as input a 2-column CSV file, where the first column contains the headlines and the second column contains the article texts. The output file adds a third column with the label: 1 if the input is categorized as sensationalist, 0 if not.
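The input/output format can be illustrated with a short sketch. The function name `label_csv` and the `classify` callable are hypothetical stand-ins for the trained model, and the sketch assumes the file has no header row, which this post does not specify.

```python
import csv
import io


def label_csv(infile, outfile, classify):
    """Copy each (headline, text) row to outfile, appending a 0/1 label.

    `classify` is a hypothetical callable taking (headline, text) and
    returning 1 for sensationalist, 0 otherwise.
    """
    writer = csv.writer(outfile)
    for headline, text in csv.reader(infile):
        writer.writerow([headline, text, classify(headline, text)])


# Usage with in-memory files and a toy rule standing in for the real model.
source = io.StringIO('"SHOCKING!!!","text one"\n"Council meets","text two"\n')
target = io.StringIO()
label_csv(source, target, lambda headline, text: 1 if "!" in headline else 0)
print(target.getvalue())
```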

The classifier is an SVM, and it uses the following features:

  • POS tags (unigrams and bigrams)

  • Punctuation

  • Sentence length

  • Number of capitalized tokens (normalized by length of text)

  • Number of words that overlap with the Pattern Profanity word list (normalized by length of text)

  • Polarity and subjectivity scores (obtained through the Pattern Sentiment module)
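As a rough illustration of the surface-level features in this list, the sketch below computes punctuation, sentence-length, capitalization and profanity features with the standard library only. It is not the classifier's actual feature extractor. The function name, the two-word profanity set, the reading of "capitalized" as all-caps tokens and the normalization of punctuation counts by token count are assumptions. POS-tag n-grams and the polarity and subjectivity scores come from Pattern in the real model and are left out here.

```python
import re
import string

# Hypothetical stand-in for the Pattern profanity word list.
PROFANITY = {"damn", "hell"}


def surface_features(text):
    """Return a dict of simple per-article features, normalized by token count."""
    tokens = text.split()
    n_tokens = len(tokens) or 1  # avoid division by zero on empty input
    # Naive sentence split on runs of terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_sentences = len(sentences) or 1
    words = [t.strip(string.punctuation) for t in tokens]
    return {
        "exclamation_marks": text.count("!") / n_tokens,
        "question_marks": text.count("?") / n_tokens,
        "avg_sentence_length": len(tokens) / n_sentences,
        # Assumption: "capitalized" means all-caps tokens longer than one letter.
        "capitalized_ratio": sum(1 for w in words if len(w) > 1 and w.isupper()) / n_tokens,
        "profanity_ratio": sum(1 for w in words if w.lower() in PROFANITY) / n_tokens,
    }


features = surface_features("SHOCKING news! What the hell happened?")
```

A dictionary like this could be turned into a numeric vector and combined with the POS and sentiment features before being passed to an SVM.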