GSOC Reports

Phase 2 - Report: Maja Gwozdz

In the second phase of GSoC, I continued annotating political tweets and corrected some typos in the dataset. I created a more varied corpus by collecting tweets related to American, British, Canadian, and Australian socio-political affairs (I am also collecting New Zealand tweets but they are really rare). As regards the annotation guidelines, I improved the document stylistically and added relevant examples to each section. I also created a short appendix containing the most important politicians and their party affiliations, so as to facilitate future annotations.

As for the dataset itself, I am happy to announce that there were far more idioms and proverbs than in the previous stage. The following list presents the top ten most frequent hashtags extracted from the tweets (the figures in brackets represent the relative frequency of respective hashtags):

1. #Brexit (3.93)

2. #TrudeauMustGo (3.23)

3. #JustinTrudeau (3.07)

4. #MAGA (2.99)

5. #Tories (2.53)

6. #Drumpf (2.23)

7. #Corbyn (2.19)

8. #Labour (2.08)

9. #Tory (1.98)

10. #ImpeachTrump (1.73)

Our core set of hashtags (balanced with respect to political bias) was as follows: #MAGA, #GOP, #resist, #ImpeachTrump, #Brexit, #Labour, #Tory, #TheresaMay, #Corbyn, #UKIP, #auspol, #PaulineHanson, #Turnbull, #nzpol, #canpoli, #cpc, #NDP, #JustinTrudeau, #TrudeauMustGo, #MCGA. Many more hashtags are being used but they usually yield fewer results than the above set.

Below are a few figures that aptly summarise the current shape of the corpus:

Left-wing bias: ca 55%

Male authors: ca 49%

Polarity: ca 44% negative, ca 47% neutral, ca 9% positive

Mood: ca 50% agitated, ca 21% sarcasm, ca 13% anger, ca 9% neutral, ca 4% joy

Offensive language: present in approximately 17% of all tweets

Swearing by gender: ca 53% males

Speech acts: ca 76% assertive, ca 38% expressive, ca 10% directive, ca 3% commissive, 0.2% metalocutionary

In the third stage I will continue annotating political tweets and write a comprehensive report about the task. My mentors have also kindly suggested that they could hire another student to provide additional judgments on the subjective categories (especially polarity and mood). Having more annotators will undoubtedly make the dataset a more valuable resource.

GSOC 2018 Phase 1 Reports

  • Posted on: 29 June 2018
  • By: Guy

The write-ups for Phase 1 of GSOC 2018 are ready! Check out the progress reports of our talented team of coders: Alexander Rossa, Maja Gwozdz, Rudresh Panchal and Максим Филин.

Phase 1 - Report: Maja Gwozdz

In the first phase of GSoC 2018, I started annotating political tweets. The corpus includes, for instance, tweets related to US, Canadian, UK, and Australian politics and current social affairs. The database records information about the author, their gender, the political bias, the polarity of a given entry (I'm using a discrete scale: -1 for a negative utterance, 0 for a neutral one, 1 for a positive entry), speech acts, the mood of the tweet (for instance, sarcasm or anger), any swear words or offensive language, and the keywords, that is, the concrete parts of the tweet that led to the polarity judgment.

In order to obtain the relevant political tweets, I used Grasp and a list of popular political hashtags (to mention but a few: #MAGA, #TrudeauMustGo, #auspoli, #Brexit, #canpoli, #TheresaMay). I also prepared the annotation guidelines, so that other people interested in the project could offer their own judgment and provide additional annotations. Having more judgments will render the corpus more valuable. In the next stage of GSoC, I hope to have enough judgments from other people to estimate the agreement score and arrive at (more) objective scores.
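
For illustration, a hashtag-based collection loop might look like the minimal sketch below. It uses Pattern's Twitter wrapper rather than Grasp (which I actually used), and the hashtag, result count and plain print output are placeholder choices.

    # Minimal sketch of hashtag-based tweet collection. Illustrative only:
    # the actual collection was done with Grasp, and '#MAGA', the count and
    # the print output are placeholder choices.
    from pattern.web import Twitter, plaintext

    twitter = Twitter(language='en')
    for tweet in twitter.search('#MAGA', start=1, count=100, cached=False):
        # plaintext() strips any markup from the tweet body
        print(tweet.author, tweet.date, plaintext(tweet.text))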

The database is currently available as a Google Sheet --- this is a relatively easy way to store data and allow for parallel annotation.

Phase 1 - Report: Rudresh Panchal

With the first coding phase of GSoC 2018 coming to an end, this post reflects upon some of the milestones achieved in the past month.

I first worked on finalizing the architecture of the Text Anonymization system. This system is being built with the European Union's General Data Protection Regulation (GDPR) in mind and seeks to offer a seamless solution to a company's text anonymization needs. Most existing GDPR solutions focus on anonymizing database entries rather than plain-text snippets.

My system pipeline consists of two principal components.

  1. Entity Recognition: In this part, the entity is recognized using various approaches, including Named Entity Recognition (implemented), Regular Expression based patterns (implemented) and TF-IDF based scores (to be implemented in Phase 2).

  2. Subsequent action: Once the entity is recognized, the system looks up the configuration mapped to that particular attribute, and carries out one of the following actions to anonymize the data:

       • Suppression (implemented)

       • Deletion/Replacement (implemented)

       • Generalization (to be implemented in Phase 2)

The methods to generalize the attribute include a novel word vector based generalization and extraction of part holonyms.
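
To make the recognize-then-act flow concrete, here is a minimal sketch of the two-step pipeline. It uses spaCy for NER and a regular expression for e-mail addresses, with a hard-coded dictionary standing in for the database-backed configuration lookup; the model name, patterns and actions are illustrative assumptions rather than the project's actual code, and the generalization step is left out since it is planned for Phase 2.

    # Illustrative sketch of the pipeline: (1) recognize entities via NER and
    # regex patterns, (2) act on them (suppress or replace) according to a
    # per-label configuration. The CONFIG dict stands in for the DB lookup.
    import re
    import spacy

    nlp = spacy.load("en_core_web_sm")            # assumed small English model
    EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

    # Hypothetical per-label actions; the real system reads these from the DB.
    CONFIG = {"PERSON": "replace", "ORG": "replace", "EMAIL": "suppress"}

    def anonymize(text):
        doc = nlp(text)
        spans = [(e.start_char, e.end_char, e.label_) for e in doc.ents]
        spans += [(m.start(), m.end(), "EMAIL") for m in EMAIL.finditer(text)]
        # Apply actions right to left so earlier character offsets stay valid.
        for start, end, label in sorted(spans, reverse=True):
            action = CONFIG.get(label)
            if action == "suppress":
                text = text[:start] + "XXX" + text[end:]
            elif action == "replace":
                text = text[:start] + "<" + label + ">" + text[end:]
        return text

    print(anonymize("John Smith (john@example.com) works at Acme Corp."))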

Some of the coding milestones achieved include:

  • Set up the coding environment for the development phase.

  • Set up the Django web app and the database.

  • Wrote a function to carry out the text pre-processing, including removal of illegal characters, tokenization, expansion of contractions etc.

  • Wrote and integrated wrappers for the Stanford NER system. Wrote the entity replacement function for the same.

  • Wrote and integrated wrappers for the Spacy NER system. Wrote the entity replacement function for this too.

  • Wrote the suppression and deletion functions. Integrated the two with a DB lookup for configurations.

  • Wrote the Regular Expression based pattern search function.

  • Implemented the backend and frontend of the entire Django WebApp.

Major things planned for Phase 2:

  • Implement a dual TF-IDF system: one scorer gives scores based on the documents the user has uploaded, and the other gives scores based on TF-IDF trained on a larger, external corpus (a minimal sketch of the idea follows this list).

  • Implement a word vector closest neighbor based generalization.

  • Implement the holonym lookup and extraction functions.
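
To illustrate the dual TF-IDF idea, the sketch below fits one vectorizer on the user's own documents and one on a larger reference corpus, and flags terms that score high for the user but never occur in the reference corpus as candidate identifiers. The documents and the threshold are placeholder values and the scoring rule is deliberately simplified.

    # Sketch of a dual TF-IDF scorer: terms that are distinctive in the user's
    # documents but unknown to a larger external corpus are candidate
    # identifiers. Documents and threshold below are placeholder values.
    from sklearn.feature_extraction.text import TfidfVectorizer

    user_docs = [
        "Jane Roe visited the Berlin office in May.",
        "Jane Roe signed the new supplier contract.",
    ]
    reference_docs = [
        "The office reviewed the contract last month.",
        "A new supplier was selected after the visit.",
    ]

    user_vec = TfidfVectorizer().fit(user_docs)
    ref_vec = TfidfVectorizer().fit(reference_docs)

    def candidate_identifiers(doc, threshold=0.2):
        row = user_vec.transform([doc]).toarray()[0]
        known = set(ref_vec.vocabulary_)
        # High user-side TF-IDF score and absent from the reference vocabulary.
        return [term for term, idx in user_vec.vocabulary_.items()
                if row[idx] > threshold and term not in known]

    print(candidate_identifiers(user_docs[0]))   # terms flagged as potentially identifying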

Picture 1: The user dashboard, which allows users to add new attribute configurations, modify existing configurations, add aliases for the NER lookup, add regex patterns, and carry out text anonymization.

Picture 2: Shows the text anonymization system in action. The various entities and regex patterns were recognized and replaced as per the configuration.

Phase 1 - Report: Максим Филин

During the first month of Google Summer of Code, I have been working on the following tasks:

  1. Refactoring social media Twitter API.

  2. Trying all framework examples and fixing bugs and errors.

The project's master branch currently supports only Python 2, but we want to add support for Python 3. This requires refactoring the code in several places, and most of the resulting errors have now been fixed.

  3. Compiling libsvm and liblinear binaries for macOS, Ubuntu and Windows and adding them to Pattern so that pattern.vector works out of the box. SVM examples now work on all platforms.

  4. Looking at failing Travis CI unit tests.

There were errors in TestParser.test_find_keywords and TestClassifier.test_igtree. The problem was a wrong interpretation of the test examples, and those tests have now been rewritten. However, some search engine tests still fail occasionally because of API licences and permissions.

  5. Working on the VK API.

We think it would be great to add an API for VKontakte, the biggest Russian social network. It is used in many countries and supports more than 50 languages. It also has a very versatile API, so it is a good fit for Pattern. The structure for working with the API has already been created.
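
As a rough illustration of what the wrapper will sit on top of, the sketch below fetches posts from a public community wall through VK's REST API. The token, owner id and API version are placeholders, and the actual Pattern integration will expose this through Pattern's own search interface.

    # Illustrative call to VK's wall.get method over plain HTTP. The access
    # token, owner_id and API version are placeholder values; the Pattern
    # wrapper will hide these details behind its usual search interface.
    import requests

    params = {
        "owner_id": "-1",              # hypothetical public community id
        "count": 10,
        "access_token": "YOUR_TOKEN",  # placeholder token
        "v": "5.92",                   # placeholder API version
    }
    response = requests.get("https://api.vk.com/method/wall.get", params=params)
    for post in response.json().get("response", {}).get("items", []):
        print(post.get("text", "")[:80])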

In the next part of Google Summer of Code we will focus on the Python 3 release of Pattern and on testing. We will also try to implement a pattern.ru module.

Phase 1 - Report: Alexander Rossa

There are two distinct focal points in my GSoC work. The first is a functional Twitter bot, achieved through a pipeline consisting of a Python machine learning backend for tweet analysis, a Node.js frontend for accessing the Twitter API, and integration of Seed into the frontend for templated procedural text generation. The other is extending the capabilities of Seed, turning it into a Node.js module (available through npm) and adding conditional generation.

The work I have done during my first month gravitated largely around the first task. The tasks that were completed include:

  • Node.js frontend for manipulating Twitter API, retrieving and reacting to Tweets, sending back responses etc.

  • Python ML backend for Topic Analysis and Sentiment Analysis of Tweets

  • API for the Python backend so that it can be accessed as a microservice rather than as one monolithic deployment bundled with the Node.js frontend (a minimal sketch of such an endpoint follows this list)

  • Seed miniprogram for generating templated responses. This does not yet have conditional generation enabled and serves more like a testing rig for trying to generate text conforming to non-violent communication principles.
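
The sketch below shows roughly what such a microservice endpoint could look like, wrapping Pattern's sentiment analysis behind a small HTTP route. Flask, the route name and the port are assumptions made for this sketch, not the project's actual stack.

    # Illustrative microservice endpoint: the Node.js frontend POSTs a tweet
    # and receives polarity/subjectivity scores. Flask, the /analyze route and
    # the port are assumptions for this sketch, not the project's actual code.
    from flask import Flask, request, jsonify
    from pattern.en import sentiment

    app = Flask(__name__)

    @app.route("/analyze", methods=["POST"])
    def analyze():
        text = request.get_json().get("text", "")
        polarity, subjectivity = sentiment(text)
        return jsonify({"polarity": polarity, "subjectivity": subjectivity})

    if __name__ == "__main__":
        app.run(port=5000)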

Apart from these tasks, I spent time exploring approaches for deployment and getting to grips with new technologies.

The next focus for this project will be:

  • Extending Seed with conditional generation and API for easier access

  • Extending Seed miniprogram with newly acquired conditional generation capabilities

  • Testing and improving the bot. This may include:

      • adding an ML module for paraphrasing (which will enable the bot to revisit topics mentioned in communication in a more natural way)

      • improving the quality of the Seed miniprogram's generation capabilities (more templates for sentences, more randomness...)

      • adding new rules for participation (currently working by hashtagging)

After the first two bullet points are completed, the work on the Twitter bot is basically done; what remains is improving the quality of the text it is able to produce.

Google Summer of Code 2017: wrap up

  • Posted on: 24 September 2017
  • By: Guy

The Google Summer of Code 2017 is officially over and we just can't believe how much work @markus and @masha_ivenskaya have done over the last couple of months. Pattern3 is close to being ready for public release, and an amazing left-right bias classifier has been finished and published on GitHub. A valuable tool in the fight against fake news! Check out their blog posts below.

THANK YOU!

We want to thank the mentors Vincent Merckx and Amra Dorjbayar for their most valuable input. We will now start localizing the fake news algorithms to Dutch and are happy to report that they have agreed to lend their continued support. Also a big shout out to Google for putting all of this together! The Google Summer of Code is an amazing project and we can't recommend joining this initiative enough!

Finally, of course a huuuuuuge thank you to our students. It was an honor for us to work with such talented people and we wish Masha Ivenskaya and Markus Beuckelmann all the best in their future careers. We are sure that they will be very successful in their future endeavors!

Porting Pattern to Python 3: DONE!

  • Posted on: 24 September 2017
  • By: markus

The final days of this year's Google Summer of Code have arrived and I am wrapping up my project. The last three months have been full of intense coding on the Pattern library and I'm happy to say that all milestones described in my project proposal have been knocked off within the official coding period.

An exhaustive list of all my commits to the clips/pattern repository can be found here. A very nice commit–based comparison is available here (full diff and full patch). The official commit graph can be seen here as soon as the changes have been merged into the master branch. The Travis CI build for different branches can be looked at here together with the automated unit test coverage reports on coveralls.io. The last official GSoC commit is ec95f97 on the development branch.

Overview & Synopsis

This is what the official project description reads:

The purpose of this GSoC project will be to modernize Pattern, a Python library for machine learning, natural language processing (NLP) and web mining. A substantial part of this undertaking is to port a majority of the code base to Python 3. This involves porting the individual modules and sub–modules piece by piece, where the whole process will be guided by unit tests. In the beginning, I will remove all tests from the pipeline that do not pass for Python 3 and take this pared–down code base as a starting point, porting parts of the code and putting the respective unit tests back in as I go along. Missing unit tests must be added before moving on. Since porting Python 2 code to Python 3 code is a standard problem for the Python community, there are many different tools available that can help in this regard. In addition to that, I'd like to extend this project to a bit of a Hausmeister project (housekeeping for Pattern), and optimize/modernize the code base in terms of execution speed, memory usage and documentation.

At the beginning of the project in May (launch time machine), Pattern was in a position where it wasn't actively maintained due to time constraints. Many unit tests were failing, some features were deprecated (e.g. in pattern.web) but most importantly, it lacked Python 3 support, which effectively made it unavailable for a large user base. Now, three months later, we are at a point where all of Pattern's modules (i.e. pattern.text, pattern.vector, pattern.web, pattern.graph, pattern.db, pattern.metrics and the language modules pattern.en, pattern.de, pattern.nl, pattern.fr, pattern.es, pattern.it) except for pattern.server are fully ported to Python 3. This task included working on some other major milestones such as removing the bundled PyWordNet in favor of NLTK's WordNet interface, transitioning to BeautifulSoup 4, removing sgmllib, etc. However, the biggest challenge for a joint Python 2 / Python 3 code base is always to carefully deal with unicode handling in all parts of the library, which can sometimes be tedious. Whenever possible we attempted to write forward–compatible code, i.e. code that handles Python 3 as the default and Python 2 as the exception, which required some extra effort, but will hopefully make the code more readable in the long term and make it easier to drop Python 2 support entirely at some point. The next release will deprecate Python 2.6- support in favor of Python 2.7 – which will be the last Python 2 version – and Python 3.6+.

Furthermore, several general maintenance tasks have been performed such as code cleanup, documentation, refactoring of duplicate code to an additional pattern.helpers as well as general PEP 8 compliance.

Roadmap & Milestones

By far the largest chunk of work was dealing with the subtle differences between Python 2 and Python 3 to ensure that the code works identically regardless of the interpreter. Moving to a joint code base is a major undertaking, since there are many differences when it comes to strings (unicode vs. byte strings), generators and iterators, package import precedence, division and even fundamental data types such as dict and set. It is more or less hopeless to obtain a joint code base for Python 2.5- and Python 3, but fortunately it is possible to make it work for Python 2.6+ with some precautions, even without using six.

However, the following points on the roadmap were important milestones that don't necessarily have anything to do with the actual porting per se:

  • The following bundled packages and (vendorized) libraries have been removed in favor of external dependencies: feedparser, BeautifulSoup, pdfminer, simplejson, docx, cherrypy, PyWordNet. This also involved adapting Pattern's code to changes introduced in these external libraries.
  • The removal of PyWordNet came with the need for a new interface for Pattern to wrap NLTK's WordNet interface. This was quite time–consuming, since there had of course been many incompatible changes over the years that needed to be dealt with.
  • We set up Travis CI, a continuous integration platform to keep track of passing or failing unit tests on different branches / Python versions. This will run automatically for every PR and report changes in unit test coverage.
  • libSVM and libLINEAR have been updated to the latest versions. The pre–compiled libraries have been removed for now because they were incompatible with the newer libsvm/liblinear versions.
  • The unit tests were refactored to work with pytest. There is more work that can be done and it might be a good idea to leave the unittest entirely behind at some point in the future.
  • In the last days of the official coding period we went through a big PEP 8 (Style Guide for Python Code) cleanup which aims for a more consistent code base. However, we decided not to aggressively enforce all PEP 8 guidelines.

The new release will introduce the following external dependencies: future, mysqlclient, beautifulsoup4, lxml, feedparser, pdfminer (or pdfminer.six for Python 3), numpy, scipy, nltk, python-docx, cherrypy. For a more in–depth discussion of each of these items, check out my detailed progress reports (phase #1, phase #2).

Statistics

Let's play the numbers game: Over the course of the last three months, I have pushed > 403 commits to four different branches on the clips/pattern repository. This affected 238 files with a total of 11129 insertions and 53735 deletions (git diff --stat). My first commit was 1e17011 on the python3 branch. My last commit was ec95f97 on the development branch.

Here is what the contribution graph and the heat map on GitHub look like: GSoC: Commits

The following panel shows deletions (left) and insertions (right) as a function of time: GSoC: Insertions and Deletions

This graph seems to reflect the roadmap pretty accurately. The majority of the deletions in the first period correspond to the removal of vendorized libraries. As the project progressed, more and more insertions took place and new or modified lines found their way into the code base.

Future Work & Next Steps

We will do some more testing and release the next major version of Pattern in autumn. The following items are predominantly independent of my particular project, but should be tackled before the next major release:

  • The only Python 3 related issue currently remaining is a bug in pattern.vector that affects the information gain tree classifier IGTree. It's hard to debug but it looks to me like it has something to do with order differences when iterating over dict objects. In any case, someone needs to take a closer look. This issue will be tracked on GitHub in the near future.
  • The pattern.server module has the major parts ported, but since there are no unit tests available for this module, it's hard to test it in a systematic way apart from running the examples. I believe it's currently not fully functional on Python 3.
  • The pattern.web module contains code to access some popular web APIs. Some of the APIs are deprecated or changed in some other way that requires refactoring. Some APIs have moved to paid subscription models without free quotas.
  • The current unit test coverage seems to be around 70%. This is okay for now, but there certainly is room for improvement.

Resources

The following resources proved to be invaluable during the porting, especially when it comes to the more subtle differences between Python 2 and Python 3:

Acknowledgments

So this is it – the end of the official GSoC coding period, time to sign off for a couple days. Thank you to the Google Open Source Team for bringing this project to life, and of course special thanks to my mentors Tom and Guy for their valuable feedback and guidance throughout the entire project.

Altogether, this was a great experience and I will remain an active contributor for the foreseeable future. Happy coding!

Text-based fake news detection: DONE!

In the final phase of the Google Summer of Code, Masha fine-tuned the classifier for sensationalism detection and added a left-right bias classifier.

The system is a bit too resource-heavy to run as an on-site demo, but all of the code and data needed to run the classifier locally are available from GitHub. The Bias Classifier directory on GitHub contains the trained model, as well as Python code to train a model and to classify new data, the data used for training, and a sample of test data (a held-out set) with true labels and the classifier scores. In the training data, label ‘0’ corresponds to the ‘least-biased’ class, ‘1’ corresponds to ‘left’, and ‘2’ corresponds to ‘right’.

This classifier takes as input a 2-column CSV file, where the first column corresponds to the headlines and the second one to the article texts.

Usage for the Python code:

python bias_classifier.py -args

The arguments are:
-t, --trainset: Path to training data (if you are training a model)
-m, --model: Path to model (if you are using a pre-trained model)
-d, --dump: Dump trained model? Default is False
-v, --verbose: Default is non-verbose
-c, --classify: Path to new inputs to classify
-s, --save: Path to the output file (default is 'output.csv')
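
For example, a hypothetical invocation that scores new articles with the bundled pre-trained model and writes the results to a custom output file (the input and output file names here are placeholders) could look like:

python bias_classifier.py --model trained_model_all_sources.pkl --classify new_articles.csv --save scored_articles.csv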

Output:

The output is a number between -1 and 1, where -1 is most left-biased, 1 is most right-biased, and 0 is least-biased.

Data:

The articles come from the crawled data - a hand-picked subset of sites that were labeled as "right", "right-center", "left", "left-center", and "least-biased" by mediabiasfactcheck.com. I used one subset of sources for the training data and a different subset of sources for the testing data in order to avoid overfitting. I also trained a separate model on all of the sources I had available; since it is trained on more data, it may perform better. This model is also available in the GitHub directory under the name “trained_model_all_sources.pkl”.

It is worth noting that articles from  'right-center' and 'left-center' sources often exhibit only a subtle bias, if any at all.  This is because the bias of these sources is often not evident on a per-article basis, but only on a per-source basis.  It may exhibit itself, for example, through story selection rather than through loaded language.  For this reason I did not include articles from 'right-center' and 'left-center' sources in the training data, but I did use them for evaluation. 

Architecture:

The classifier has a two-tiered architecture, where first the unbiased articles are filtered out, and then a second model distinguishes between right and left bias.  Both models are Logistic Regressions based on lexical n-gram features, implemented through scikit-learn.

Features:

Both models rely on bag-of-word n-gram features (unigrams, bigrams, trigrams).
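
Roughly, the two-tiered setup can be sketched as follows: a first n-gram logistic regression separates 'least-biased' from 'biased' articles, and a second one labels the remaining articles as 'left' or 'right'. The tiny training examples below are purely symbolic placeholders; the real models are trained on the crawled corpus described above.

    # Rough sketch of the two-tiered bias classifier: tier 1 filters out
    # least-biased articles, tier 2 separates left from right. The toy
    # training texts/labels are placeholders for the real crawled data.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    def ngram_logreg():
        # Bag-of-words unigrams, bigrams and trigrams, as described above.
        return make_pipeline(CountVectorizer(ngram_range=(1, 3)),
                             LogisticRegression(max_iter=1000))

    texts = ["calm factual report on the new budget",
             "the radical left is destroying our country",
             "corporate greed is crushing ordinary workers"]
    tier1 = ngram_logreg().fit(texts, [0, 1, 1])    # 0 = least-biased, 1 = biased
    tier2 = ngram_logreg().fit(texts[1:], [2, 1])   # 1 = left, 2 = right

    def classify(text):
        if tier1.predict([text])[0] == 0:
            return "least-biased"
        return "left" if tier2.predict([text])[0] == 1 else "right"

    print(classify("ordinary workers deserve better wages"))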

Results:

For evaluation purposes, scores below 0 are considered “left”, scores above 0 are considered “right”, and 0 is considered “least-biased”.

As previously mentioned, along with the 3 classes that are present in the training data, there are two additional in-between classes that I used for evaluation only.

In order to be counted as correct for recall, right-center can be predicted as either 'right' or 'least-biased', and left-center can be predicted as 'left' or 'least-biased'.  In addition, when calculating the precision of the 'least-biased' class,  'least-biased', 'right-center' and 'left-center' true classes all count as correct. 

Class         Precision   Recall
Right         45%         82%
Left          70%         71%
Right-center  N/A         70%
Left-center   N/A         60%
Least-biased  96%         33%

Note:

Unlike the Sensationalism classifier, this classifier relies on lexical features, which may be specific to the current political climate etc.  This means that the training data might "expire" and as a result the accuracy could decrease.  

Phase II Completed!

  • Posted on: 3 August 2017
  • By: Guy

Phase II saw our students picking up even more steam. @markus has almost completed his Python 3 port of Pattern and @masha_ivenskaya has developed two tools that will aid in the detection of fake news, with more to come. One of the tools is available as a demo on this site. Let us know what you think!

On to the final stage!

Text-based fake news detection: Phase II

During Phase 2 of Google Summer of Code, I continued my data-aggregation efforts, developed the Source Checker tool, and trained a model that detects sensationalist news articles.

1. Data Aggregation

Throughout Phase 2, I crawled over 200 domains daily, and continued researching news domains and adding them to my crawler. As of today, I have aggregated over 30k news articles. As I plan to use these articles for classification models, below is the breakdown by each potential class:

Sensationalism Classifier:

Sensationalist: 13k

Objective: 8.5k

Bias Classifier:

Right: 12k

Right-center: 1k

Least-biased: 3.5k

Left-center: 2k

Left: 4.5k

2. Source Checker

This is a tool that was requested by the GSoC mentors, @vincent_merckx and @amra_dorjbayar. It takes as input a snippet of text - presumably a news article or part of a news article - and returns a graph output that shows what types of domains publish the text (or parts of the text).

Example Graph:

  • The circles correspond to returned domains.

  • Circle size corresponds to amount of overlap between the input snippet and the domain.

  • Circle border color corresponds to bias: blue = left, red = right, green = neutral, grey = unknown.

  • Circle fill corresponds to unreliability: black circles are those classified by one of the lists as fake, unreliable, clickbait, questionable, or conspiracy. The blacker the circle, the more unreliable it is.

  • Edges that connect circles correspond to overlap of statements - the thicker the edge, the bigger the overlap.

After GSOC ends, we will localize this tool for Dutch articles as well.

Architecture of the tool:

The text snippet is broken down into n-grams using the Pattern n-gram module. N-grams that consist primarily of stop words or named entities are discarded. A sample of the remaining n-grams is reconstructed into the original strings and run through the Google API as exact phrases (in quotation marks). The returned domains are then rated by the number of queries that returned that domain (more than 6 out of 10 = "high overlap", 3 to 6 = "some overlap", fewer than 3 = "minimal overlap") and matched against our database. The graph is rendered using the Pattern Graph module.
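
A condensed sketch of this pipeline is shown below. It uses Pattern's n-gram and Google modules, but the stop-word list, n-gram length, sample size and domain counting are simplified placeholders, and the named-entity filter is omitted.

    # Condensed sketch of the source-checker pipeline: content n-grams from
    # the snippet are googled as exact phrases and the returned domains are
    # counted. Stop-word list, n-gram length, sample size and licence handling
    # are simplified placeholders; the named-entity filter is omitted.
    import random
    from collections import Counter
    try:
        from urllib.parse import urlparse      # Python 3
    except ImportError:
        from urlparse import urlparse          # Python 2
    from pattern.web import Google
    from pattern.en import ngrams

    STOPWORDS = set("the a an of to and in that is for on with as at by".split())

    def check_sources(snippet, license=None, sample=5):
        google = Google(license=license)       # requires a Google API license key
        grams = [g for g in ngrams(snippet, n=8)
                 if sum(w.lower() in STOPWORDS for w in g) < len(g) // 2]
        domains = Counter()
        for gram in random.sample(grams, min(sample, len(grams))):
            query = '"%s"' % " ".join(gram)    # exact-phrase search
            for result in google.search(query, count=10):
                domains[urlparse(result.url).netloc] += 1
        # The more queries return a domain, the higher its overlap with the text.
        return domains.most_common()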

3. Sensationalism Classifier

I used the aforementioned crawled data to train a model that classifies a news article as either sensationalist or not. This model currently achieves an F1-score of 92% (obtained through 5-fold cross-validation).

It takes as input a 2-column CSV file, where the first column corresponds to the headlines and the second one to the article texts. The output file contains a third column with the label: 1 if the input is categorized as sensationalist, 0 if not.

The classifier is an SVM, and it uses the following features (a rough feature-extraction sketch follows the list):

  • POS tags (unigrams and bigrams)

  • Punctuation

  • Sentence length

  • Number of capitalized tokens (normalized by length of text)

  • Number of words that overlap with the Pattern Profanity word list (normalized by length of text)

  • Polarity and subjectivity scores (obtained through the Pattern Sentiment module)
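
The sketch below gives a rough idea of the per-article feature extraction, using Pattern for POS tags and sentiment. The tiny profanity set is a placeholder for Pattern's actual word list, the normalizations are simplified, and POS bigrams are omitted.

    # Rough sketch of per-article feature extraction for the sensationalism
    # classifier. PROFANITY is a tiny placeholder for Pattern's word list;
    # POS bigrams are omitted and the normalizations are simplified.
    from collections import Counter
    from pattern.en import tag, sentiment

    PROFANITY = {"damn", "hell"}                   # placeholder word list

    def features(headline, text):
        content = headline + " " + text
        tagged = tag(content)                      # [(word, POS), ...]
        tokens = [w for w, pos in tagged]
        n = float(max(len(tokens), 1))
        polarity, subjectivity = sentiment(content)
        return {
            "pos_unigrams": Counter(pos for w, pos in tagged),
            "punctuation": sum(content.count(c) for c in "!?") / n,
            "avg_sentence_length": n / max(content.count("."), 1),
            "capitalized": sum(1 for w in tokens if w[:1].isupper()) / n,
            "profanity": sum(w.lower() in PROFANITY for w in tokens) / n,
            "polarity": polarity,
            "subjectivity": subjectivity,
        }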

Porting Pattern to Python 3: Phase II

  • Posted on: 3 August 2017
  • By: markus

The second GSoC coding period is over and has brought substantial progress. As of today, all of the submodules with the exception of pattern.server have been ported to Python 3.6. Pattern now shows consistent behavior for both Python 2.7 and Python 3.6 across all modules. All unit tests for pattern.db, pattern.metrics, pattern.graph and the language modules pattern.en, pattern.nl, pattern.de, pattern.fr, pattern.it and pattern.es pass, but there are still one or two failing test cases in pattern.text and pattern.vector, as well as some skipped tests in pattern.web due to changes in some web services' APIs.

Specifically, I have been working on the following issues in the second coding period:

June, 26 – July 26

  • I continued working on the removal of the bundled pywordnet version, which has been deprecated for many years. A good part of the functionality is now integrated into NLTK; however, there have been many backward incompatible changes to the interface over the years, which required significant changes to en/wordnet/__init__.py. I tried my best to hide all the changes in the backend from the Pattern user wherever possible, wrapping the new interface and maintaining the current Pattern en.wordnet interface. Since we now make use of NLTK's WordNet interface, this also makes the nltk package a dependency from now on. The bundled pywordnet version is completely removed now.

  • Pattern comes with a bundled version of libsvm and liblinear which provide various fast, low–level routines for support vector machines (SVMs) and linear classification in pattern.vector. Both bundled versions were quite old, so I replaced both libraries with the most recent release and made the necessary changes to make them work with the Pattern code base and support Python 3. The pre–compiled libraries have been removed for now because they were incompatible with the newer libsvm/liblinear versions. However, we might put some pre–compiled binaries for some platforms back in at some point.

  • Another major issue was some refactoring in pattern.web, most importantly the removal of sgmllib which is deprecated in Python 3. Fortunately, we are able to base HTMLParser in pattern.web upon the same class in html.parser with some small adjustments (da00ff).

  • In the first coding period, I removed the bundled version of BeautifulSoup from the code base and made it an external dependency. This period, I upgraded the code to make use of the most recent version BeautifulSoup 4 which also supports Python 3. As a result of this, some refactoring was done in pattern.web to account for backward incompatible changes to the parser interface. Furthermore, we now explicitly make use of the fast lxml parser for HTML/XML and consequently, the lxml package is another dependency now.

  • I removed the custom JSON parser in pattern.db since the json module is part of the standard library now.

  • pattern.web contains routines to deal with PDF documents through the pdfminer library. There have been some inconsistencies between Python 2.7 and Python 3.6 which resulted in weird exceptions being raised. Currently, the problem is solved by using the pdfminer package for Python 2 and pdfminer.six for Python 3, however, this should ideally be refactored and unified at some point.

  • There has been a long-standing bug with the single layer perceptron (SLP) (#182) that was haunting me and that I couldn't resolve for weeks. As a consequence of this bug, the majority of the unit tests for pattern.en failed. Last week, I ended up manually going through the commit history using essentially a binary search approach until I narrowed down the cause of the problem. Finally, all the problems are fixed as of 93235fe and the unit test landscape looks much cleaner now!

  • I also spent a lot of time making Python 2 and Python 3 behave consistently throughout all modules. This involved taking care of many of the subtle differences under the hood that I talked about in my first report. In order to avoid surprises for future developers who might not be aware of the differences between Python 2 and Python 3, I decided to put the following imports at the top of every non-trivial file to enforce consistent behavior for the most important parts:

    from __future__ import unicode_literals
    from __future__ import print_function
    from __future__ import absolute_import
    from __future__ import division
    
    from builtins import str, bytes, int
    from builtins import map, zip, filter
    from builtins import object, range
    

    This should cover the most important differences and enforce Python 3–like division, imports, handling of literals and classes derived from object. Hunting down bugs in either Python 2 or Python 3 is laborious and time-consuming when you are unaware of what is really happening and different interpreters yield different results. Consequently, there should be a "no surprises" philosophy when it comes to the behavior of rudimentary data types such as str, bytes, int or functions such as map(), zip(), filter(), justifying the above explicit declarations even if not all of them are strictly necessary right now.

  • There were many encoding issues to be covered in various modules this period to make the code base work with both Python 2 and Python 3, predominantly in pattern.text, pattern.en, pattern.vector and pattern.web. All string literals are now unicode by default (from __future__ import unicode_literals), and functions expect unicode inputs if not stated otherwise. The str object from future makes Python 2 behave like a Python 3 str (which is always unicode).

Phase I Completed!

  • Posted on: 27 June 2017
  • By: Guy

It's amazing what the GSOC team has accomplished in just a little over a month. Be sure to check out their blog posts below, in which they detail their progress.

Text-based fake news detection: Phase I

During the first month of our Google Summer of Code, I have been working along 3 distinct avenues:

1. Compiling news domains

Coming into the project, we had several lists of questionable domains:

  • The OpenSources list that I worked with previously (the BS-detector Chrome extension is based on this list)

  • Guy posted a list from Politifact

  • We were also looking at using MediaBiasFactCheck.com since they seem to have a very comprehensive list, with categorization that may align with our needs (for ex. least-biased vs right-biased vs left-biased), as well as some information about each source.

I wanted to aggregate all of this information/categorization in one place, so I put together a CSV of all domains from the three sources above (~2k domains), along with the categories assigned by each, any additional comments, etc. It's been interesting to look at the overlap as well as at the discrepancies among these. This file will probably have several applications throughout the course of the summer and will be made available to the general public.

2. Crawling news domains

Later this summer we may end up building one or more text classifier that would classify a news article based on its content (rather than the source where it was published). For example, we may build a classifier for distinguishing sensationalist vs. objective news style, a classifier for detecting right vs. left bias, etc. The first step for any of these endeavors, of course, is to collect data.

I have started to crawl the domains from the compiled file mentioned above. My approach is to tread carefully and thoughtfully in order to ensure "clean", cohesive datasets, rather than to try to automatically crawl all domains and gather as much data as possible. I hand-pick each domain to be crawled, based on information from MBFC, Open Sources, and Politifact, as well as my own judgement - only picking those domains that clearly exhibit characteristics of a potential category (ex. sensationalist, objective, pseudoscience etc.)

I am still in the process of checking the domains and adding them to the crawler. As of today (6/24), I am crawling over 100 domains, accumulating more than 1k articles daily.

3. Source Checker

GSOC mentor Amra Dorjbayar (VRT) pitched an idea for a useful demo tool - a source checker that takes a text, chops it into pieces, googles the pieces, and returns the sources that publish this text, as well as a warning if one of the sources is not reputable. I have started putting together a prototype for this:

  • Using Pattern's n-gram module, I break the text into n-grams

  • I discard n-grams that would not be useful for googling, such as n-grams that consist primarily of named entities (ex. 'Rand', 'Paul', 'of', 'Kentucky', 'Ted', 'Cruz', 'of', 'Texas', 'Mike', 'Lee') or of stop-words (ex. 'to', 'being', 'able', 'to', 'boast', 'about', 'the', 'adoption', 'of', 'a')

  • I pick a random subset of the remaining n-grams and run them through Pattern's Google API

  • I use Pattern's Intertextuality module to choose only those results that match the text

  • These results can then be matched against our file of domains, and we can return to the user information about the sources that publish the text, potentially along with some sort of graph visualization

For evaluation, I am using a random subset of the crawled news articles (see above) - I break each article into snippets of various lengths, run each snippet through the tool, and check whether the domain from which the article was crawled matches one of the domains returned by the tool.

Unfortunately, this work got stalled because of Google's API query limit, so the parameters have not yet been tested and tuned. We are currently looking into using a peer-to-peer search engine like Faroo or YaCy, as well as into getting a budget to continue work on the Google functionality.

Overall, I believe our project is off to a great start, and I am excited to see what we achieve in July and August.

Detection of image manipulation: Phase I

During the initial phase, I have been doing a lot of work involving both edited images and website appearance, to see if we can statistically model what a "fake news" site might look like. To find edited images, I have been using a method called Error Level Analysis (ELA), which can detect different levels of compression in JPEG images. The technique has been very effective so far in finding edited images, using a training set from reddit's r/photoshopbattles, although it has taken quite some time to collect a training set from this source.
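
A compressed sketch of the ELA step is shown below: the image is re-saved as JPEG at a fixed quality and the per-channel differences are summarized as features. The quality setting and the summary statistics are placeholder choices; the real system feeds richer per-image statistics into the random forest.

    # Sketch of Error Level Analysis: re-save the image as JPEG at a fixed
    # quality and measure how strongly each region differs from the original.
    # Edited regions tend to show a different error level than the rest.
    # Quality setting and summary statistics are illustrative choices.
    import io
    from PIL import Image, ImageChops, ImageStat

    def ela_features(path, quality=90):
        original = Image.open(path).convert("RGB")
        buffer = io.BytesIO()
        original.save(buffer, "JPEG", quality=quality)   # re-compress once
        buffer.seek(0)
        resaved = Image.open(buffer)
        diff = ImageChops.difference(original, resaved)
        stat = ImageStat.Stat(diff)
        # Mean and peak error per channel serve as simple classifier features.
        return stat.mean + [band_max for _, band_max in diff.getextrema()]

    # Each image's ela_features(...) vector can then be fed to a random forest.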

Using ELA, I have trained a random forest classifier to quite accurately detect edited and non-edited images, which will be an input to a meta classifier that we will develop later this summer.

On the website analysis part, I have been using Masha's excellent sources list plus PhantomJS to take screenshots of credible news sites as well as historically non-credible news sites. Again, the training set has been the biggest hurdle to overcome, but progress is good. While these two features may not be indicative of a fake news / real news article on their own, they have seemed to be very good indicators of fake / real news in preliminary analyses. As we train our metaclassifier in the coming months, I see us weighing the NLP features much more than the images, but using image-based features as a way to verify or confirm our beliefs when we are on the fence about how to automatically classify images.

I would like to see myself doing some more NLP work in addition to the image processing work, as that (as soon as we finish and clean up our training data) will be done in a few days and be ready for implementation alongside a metaclassifier. I am eager to reconvene with the team to see where I can help with more text-based analysis, and I am so excited to see what we can accomplish in the coming months!
