GSoC Reports

Text-based fake news detection: Phase I

During the first month of our Google Summer of Code, I have been working along three distinct avenues:

1. Compiling news domains

Coming into the project, we had several lists of questionable domains:

  • The OpenSources list that I worked with previously (the BS-detector Chrome extension is based on this list)

  • A list from PolitiFact, posted by Guy

  • We were also looking at using MediaBiasFactCheck.com, since they seem to have a very comprehensive list, with categorization that may align with our needs (e.g. least-biased vs. right-biased vs. left-biased), as well as some information about each source.

I wanted to aggregate all of this information and categorization in one place, so I put together a CSV of all domains from the three sources above (~2,000 domains), along with the categories assigned by each source, any additional comments, etc. It has been interesting to look at the overlap as well as the discrepancies among them. This file will likely have several applications over the course of the summer and will be made available to the general public.
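A merge along these lines can be sketched in a few lines of Python. The file names, column names, and the normalization rule below are assumptions for illustration, not the actual format of the compiled file:

```python
import csv
from collections import defaultdict

# Hypothetical per-source CSV files, each assumed to have at least
# 'domain' and 'category' columns.
SOURCE_FILES = {
    "opensources": "opensources.csv",
    "politifact": "politifact.csv",
    "mbfc": "mediabiasfactcheck.csv",
}

def normalize(domain):
    """Lowercase a domain and strip a leading 'www.' so the lists line up."""
    domain = domain.strip().lower()
    return domain[4:] if domain.startswith("www.") else domain

def merge(rows_by_source):
    """Map each domain to a {source: category} dict across all lists."""
    merged = defaultdict(dict)
    for source, rows in rows_by_source.items():
        for row in rows:
            merged[normalize(row["domain"])][source] = row.get("category", "")
    return merged

def load_and_merge(source_files=SOURCE_FILES):
    """Read each CSV and merge, ready to be written out as one aggregate file."""
    return merge({
        source: list(csv.DictReader(open(path, newline="", encoding="utf-8")))
        for source, path in source_files.items()
    })
```

Domains present in several lists end up with several entries in their inner dict, which makes both the overlap and the discrepancies easy to inspect.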

2. Crawling news domains

Later this summer we may end up building one or more text classifiers that would classify a news article based on its content (rather than the source where it was published). For example, we may build a classifier for distinguishing sensationalist vs. objective news style, a classifier for detecting right vs. left bias, etc. The first step for any of these endeavors, of course, is to collect data.

I have started to crawl the domains from the compiled file mentioned above. My approach is to tread carefully and thoughtfully in order to ensure "clean", cohesive datasets, rather than to try to automatically crawl all domains and gather as much data as possible. I hand-pick each domain to be crawled, based on information from MBFC, OpenSources, and PolitiFact, as well as my own judgement, only picking those domains that clearly exhibit characteristics of a potential category (e.g. sensationalist, objective, pseudoscience, etc.).

I am still in the process of checking the domains and adding them to the crawler. As of today (6/24), I am crawling over 100 domains, accumulating more than 1k articles daily.
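A minimal, polite single-domain crawler in the spirit described above might look like the following stdlib-only sketch. The real crawler is not shown here; `fetch` is a stand-in for the actual HTTP layer, and the politeness delay and page limit are illustrative defaults:

```python
import time
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collect absolute URLs from every <a href=...> in a page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))

def same_domain_links(html, base_url):
    """Keep only links that stay on the crawled domain."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    domain = urlparse(base_url).netloc
    return [url for url in parser.links if urlparse(url).netloc == domain]

def crawl(start_url, fetch, delay=1.0, limit=100):
    """Breadth-first crawl of one domain; 'fetch' returns a page's HTML."""
    seen, queue, pages = set(), [start_url], {}
    while queue and len(pages) < limit:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        html = fetch(url)
        pages[url] = html
        queue.extend(same_domain_links(html, url))
        time.sleep(delay)  # be polite: throttle requests to each site
    return pages
```

Restricting the queue to same-domain links keeps each dataset cohesive, which matches the goal of one clean corpus per hand-picked source.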

3. Source Checker

GSoC mentor Amra Dorjbayar (VRT) pitched an idea for a useful demo tool: a source checker that takes a text, chops it into pieces, googles the pieces, and returns the sources that publish this text, along with a warning if one of those sources is not reputable. I have started putting together a prototype for this:

  • Using Pattern's n-gram module, I break the text into n-grams

  • I discard n-grams that would not be useful for googling, such as n-grams that consist primarily of named entities (e.g. 'Rand', 'Paul', 'of', 'Kentucky', 'Ted', 'Cruz', 'of', 'Texas', 'Mike', 'Lee') or of stop-words (e.g. 'to', 'being', 'able', 'to', 'boast', 'about', 'the', 'adoption', 'of', 'a')

  • I pick a random subset of the remaining n-grams and run them through Pattern's Google API

  • I use Pattern's Intertextuality module to choose only those results that match the text

  • These results can then be matched against our file of domains, and we can return to the user information about the sources that publish the text, potentially along with some sort of graph visualization
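The n-gram filtering steps above can be sketched as follows. The stop-word list, the thresholds, and the capitalization heuristic used as a stand-in for named-entity detection are all illustrative; the prototype itself relies on Pattern's modules rather than this code:

```python
import random
import re

# Illustrative stop-word list; Pattern ships a far more complete one.
STOPWORDS = {"the", "a", "of", "to", "and", "in", "about", "being", "able"}

def ngrams(text, n=8):
    """All n-token windows of the text."""
    tokens = re.findall(r"\w+", text)
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]

def is_searchable(gram, max_stop=0.5, max_caps=0.5):
    """Drop n-grams dominated by stop words or by capitalized tokens
    (a crude proxy for named entities, which google too broadly)."""
    stop = sum(t.lower() in STOPWORDS for t in gram) / len(gram)
    caps = sum(t[0].isupper() for t in gram) / len(gram)
    return stop < max_stop and caps < max_caps

def query_candidates(text, n=8, k=3, seed=0):
    """Pick a random subset of the searchable n-grams to send to the search API."""
    good = [" ".join(g) for g in ngrams(text, n) if is_searchable(g)]
    random.Random(seed).shuffle(good)
    return good[:k]
```

Sampling only a few distinctive n-grams per text keeps the number of search queries small, which matters given API query limits.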

For evaluation, I am using a random subset of the crawled news articles (see above) - I break each article into snippets of various lengths, run each snippet through the tool, and check whether the domain from which the article was crawled matches one of the domains returned by the tool.

Unfortunately, this work has stalled because of Google's API query limit, so the parameters have not yet been tested and tuned. We are currently looking into using a peer-to-peer search engine such as Faroo or YaCy, as well as into securing budget to continue work on the Google functionality.

Overall, I believe our project is off to a great start, and I am excited to see what we achieve in July and August.

Detection of image manipulation: Phase I

During the initial phase, I have been doing a lot of work with two kinds of images, edited photos and website screenshots, to see if we can statistically model what a "fake news" site might look like. To find edited images, I have been using a method called Error Level Analysis (ELA), which can detect different levels of compression within a JPEG image. The technique has been very effective so far at finding edited images in a training set from reddit's r/photoshopbattles, although collecting that training set took quite some time.

Using ELA, I have trained a random forest classifier that quite accurately distinguishes edited from unedited images; its output will be an input to a meta-classifier that we will develop later this summer.
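For illustration, the kind of ELA statistics such a classifier could consume can be computed from an image and a recompressed copy of it. The sketch below is stdlib-only and assumes the pixel values have already been extracted (e.g. with Pillow, as hinted in the docstring); it is not necessarily how the actual pipeline is implemented:

```python
def ela_features(original_pixels, resaved_pixels):
    """Summary statistics of the per-pixel error between an image and a
    recompressed copy of it. Edited regions tend to recompress differently,
    which shows up as locally elevated error levels. In practice the two
    pixel sequences would come from an image library, e.g. with Pillow:

        im = Image.open("photo.jpg").convert("L")
        im.save("resaved.jpg", "JPEG", quality=90)
        # then compare list(im.getdata()) with the resaved image's data
    """
    diffs = [abs(a - b) for a, b in zip(original_pixels, resaved_pixels)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / n
    return {"mean": mean, "max": max(diffs), "var": var}
```

Feature vectors like these (possibly computed per image block rather than globally) are what a random forest can then learn to separate into edited vs. unedited classes.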

On the website-analysis side, I have been using Masha's excellent list of sources together with PhantomJS to take screenshots of both credible and historically non-credible news sites. Again, building the training set has been the biggest hurdle to overcome, but progress is good. While these two features may not be indicative of whether an individual article is fake or real, they have proven to be very good indicators of fake vs. real news sites in preliminary analyses. As we train our meta-classifier in the coming months, I expect us to weight the NLP features much more heavily than the image features, but to use the image-based features to verify or confirm our beliefs when we are on the fence about how to classify automatically.

I would like to do some more NLP work in addition to the image processing, as the latter (once we finish cleaning up our training data) will be done in a few days and ready for implementation alongside the meta-classifier. I am eager to reconvene with the team to see where I can help with more text-based analysis, and I am so excited to see what we can accomplish in the coming months!

Porting Pattern to Python 3: Phase I

  • Posted on: 27 June 2017
  • By: markus

The first weeks of GSoC are coming to an end, so let's take some time to reflect on the overall progress during the first phase of the coding period.

The following decisions have been taken in phase one:

  • We will aim for a single code base for Pattern that supports both Python 2.7 and Python 3.5+. Notably, this involves dropping support for Python 2.6 and earlier.
  • We will aim to write forward-compatible code, i.e. code that handles Python 3 as the default and Python 2 as the exception. This requires some effort, but it will hopefully make the code more readable in the long term and will make it easy to drop Python 2 support entirely at some point.
  • Wherever possible, we will avoid using the six module since it tends to obfuscate the source code. We will however make use of the future package wherever suitable.
  • I will not touch the master branch on the clips/pattern GitHub repository, but decided to commit changes working towards a stable Python 2.7 version to the development branch. Apart from this, I am mostly working on the python3 branch which will incrementally build towards the Python 2/3 code base. Two minor branches debundle and wordnet have been created to rip out vendorized libraries and help with moving away from pywordnet.

The following will list the steps taken in the last weeks in roughly chronological order:

May, 30 – June, 11

  • I set up Travis CI, a continuous integration platform that helps us keep track of which unit tests pass or fail on different Python versions. Every time changes are committed to one of the branches, Travis CI runs the unit tests, shows a build matrix on the project's status page and lists the log of all unit tests.

  • In the current version, Pattern comes bundled with many libraries that are directly integrated into the Pattern code base, especially in the pattern.web module. However, this practice should be discouraged, since it requires keeping up with the development of each library individually and merging upstream changes back into Pattern, which is quite laborious. Since decent setup procedures for resolving dependencies are available nowadays, these modules should be removed from the code base entirely and added as external dependencies in setup.py. Specifically, the following bundled libraries were removed from the code base and now remain external dependencies: feedparser, BeautifulSoup, pdfminer, simplejson, docx and cherrypy.

  • There used to be a <> operator in Python 2 which is no longer available in Python 3. I replaced all occurrences with the equivalent != (i.e. not equal) operator.

  • In Python 3, only absolute imports and explicit relative imports are supported. I adapted a good part of the import statements in various modules.

  • There were some changes to the way numerals are handled by the interpreter. Numbers with leading zeros (e.g. 01) are unsupported in Python 3, as are explicit long-integer literals such as 1L. I adapted the code base accordingly.

  • Python 3 removes one of the two Python 2 ways to catch exceptions, except SomeException, e:, in favor of the universal except SomeException as e. Similarly, when raising exceptions, raise SomeException, "Something is wrong!" is replaced by raise SomeException("Something is wrong!"). I adapted the code base accordingly.
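For illustration, here is the surviving syntax in a small made-up function, with the removed Python 2 forms shown in comments:

```python
def parse_int(s):
    try:
        return int(s)
    # Python 2 only:       except ValueError, e:
    # Both 2.7 and 3:      except ValueError as e:
    except ValueError as e:
        # Python 2 only:   raise RuntimeError, "not an integer"
        # Both 2.7 and 3:  the call form below
        raise RuntimeError("not an integer: %r (%s)" % (s, e))
```

The `as` form and the call form both already work on Python 2.7, which is what makes a single 2/3 code base possible here.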

  • Some of the packages or functions in the standard library have been renamed or refactored in some other way, e.g. urllib and htmlentitydefs. In general, Python 3 provides a more consistent naming scheme. I adapted the source code to deal with this, either using try: ... except: around import statements, or making use of the future library.
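The try/except import pattern looks like this for two of the renamed modules mentioned above:

```python
# Try the Python 3 locations first; fall back to Python 2 on ImportError.
try:
    from urllib.parse import quote_plus           # Python 3
    from html.entities import name2codepoint      # Python 3
except ImportError:
    from urllib import quote_plus                 # Python 2
    from htmlentitydefs import name2codepoint     # Python 2

# Both names now work identically on either interpreter:
encoded = quote_plus("fake news?")
amp = name2codepoint["amp"]  # codepoint of '&'
```

Putting the Python 3 import first keeps the code forward-compatible: the exception path is the legacy one, in line with the project's stated policy.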

  • Furthermore, in Python 3, range() returns a lazy sequence object instead of a list, while zip(), map() and filter() return one-shot iterators; reduce() must now be imported from functools. This required some code refactoring, since iterators can neither be indexed nor sliced and are exhausted after a single pass.
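A few lines demonstrate the behavior that forced the refactoring:

```python
import itertools
from functools import reduce   # reduce() is no longer a builtin in Python 3

m = map(str.upper, "abc")          # an iterator, not a list
letters = list(m)                  # materialize before indexing/slicing
exhausted = list(m)                # a second pass yields nothing

z = zip([1, 2, 3], "xyz")
first_two = list(itertools.islice(z, 2))   # "slice" an iterator lazily

total = reduce(lambda a, b: a + b, [1, 2, 3, 4])
```

Code that previously did `map(...)[0]` or iterated the same result twice has to either materialize with `list()` or use `itertools.islice` as above.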

  • I did a bit of community work on GitHub, closing resolved or ancient issues or pull requests and opening some issues to address more recent developments. I plan to expand on this during the next two periods.

June, 12 – June, 25

  • The sorted() function no longer accepts custom comparison functions via the cmp keyword in Python 3. Instead, one must switch to key functions. The helper function cmp_to_key in functools handles the conversion quite easily.
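As a small made-up example of the conversion:

```python
from functools import cmp_to_key

def by_length_then_alpha(a, b):
    """Old-style comparison function returning negative / zero / positive."""
    if len(a) != len(b):
        return len(a) - len(b)
    return (a > b) - (a < b)   # the Python 3 idiom replacing cmp(a, b)

words = ["pear", "fig", "apple", "kiwi"]
# Python 2:  sorted(words, cmp=by_length_then_alpha)
ordered = sorted(words, key=cmp_to_key(by_length_then_alpha))
# Where the comparison is this simple, a plain key function is simpler still:
same = sorted(words, key=lambda w: (len(w), w))
```

`cmp_to_key` is the mechanical fix for existing comparison functions; rewriting to a plain key function, where feasible, is the more idiomatic end state.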

  • In pattern.graph, Node objects could not be added to sets because they became unhashable in Python 3: overriding the __eq__() method sets __hash__ to None. The solution was to simply specify __hash__ = object.__hash__ in the class definition to explicitly restore the default hashing procedure.
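A stripped-down sketch of the problem and the fix (this Node class is illustrative, not Pattern's actual implementation):

```python
class Node:
    """Defining __eq__ in Python 3 sets __hash__ to None, so instances
    become unhashable unless a hash is restored explicitly."""
    def __init__(self, id):
        self.id = id

    def __eq__(self, other):
        return isinstance(other, Node) and self.id == other.id

    __hash__ = object.__hash__   # restore default identity-based hashing

class Broken:
    def __eq__(self, other):     # no __hash__ restored: unhashable
        return True

nodes = {Node(1)}                # works again
try:
    {Broken()}
    unhashable = False
except TypeError:
    unhashable = True
```

One caveat of identity hashing: two distinct but equal nodes can still both end up in the same set, since set lookup consults the hash before equality.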

  • In Python 3, the __getslice__() method for slicing is gone; everything is deferred to __getitem__(), which receives a slice object. I had to do some code refactoring to account for this, mostly for the Datasheet object in db/__init__.py.
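The required shape of the refactoring, on a minimal stand-in for the real Datasheet class:

```python
class Datasheet:
    """Minimal sketch: a list-like table that supports slicing, not
    Pattern's actual Datasheet implementation."""
    def __init__(self, rows):
        self.rows = list(rows)

    # Python 2 would call __getslice__(i, j) for d[i:j];
    # Python 3 passes a slice object to __getitem__ instead.
    def __getitem__(self, index):
        if isinstance(index, slice):
            return Datasheet(self.rows[index])  # slicing yields a new sheet
        return self.rows[index]                 # indexing yields one row

    def __len__(self):
        return len(self.rows)
```

Checking `isinstance(index, slice)` inside `__getitem__` is the standard way to support both `d[0]` and `d[1:]` from a single method.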

  • I refactored the unit test test_db.py to do the initialization work (mostly setup of MySQL/MariaDB database handle) in a slightly different way. This is because when running the test with nose or pytest, sometimes the initialization failed, resulting in failing unit tests due to a closed database handle or similar problems.

  • I noticed that the MySQLdb package is not available on Python 3, so some of the tests in test_db.py were not actually discovered until after the refactoring. However, there is a package called mysqlclient which can substitute MySQLdb and supports both Python 2 and Python 3.

  • I added an option for pytest to report code coverage information.

  • I made numpy and scipy dependencies in setup.py.

  • Since pywordnet is deprecated (since 2001!) and has been integrated into NLTK, I refactored the code in pattern/text/en/wordnet/__init__.py to support the new interface, which has changed extensively. This is still a work in progress as of today...

  • I decided to move over to pytest instead of nose for unit testing, since it has become the de facto standard over the last years and nose has been deprecated for some time now. This does not require any refactoring right now, because pytest is able to discover and run the classical unittest test cases. At some point it might be desirable to port all of the unit tests to pytest, but this is not a high priority right now.

Next Steps

  • As of right now, the following modules have been ported to Python 3: pattern.metrics, pattern.graph, pattern.db, pattern.vector.
  • In the upcoming weeks, I will work towards porting the two juicy modules, pattern.text and pattern.web, which both require a lot of unicode handling.

Ready, Set, Go!

  • Posted on: 4 June 2017
  • By: Guy

Coding has officially begun in our Google Summer of Code! This year, we have two tasks ahead:

(1) porting Pattern to Python 3, and
(2) developing tools that will help detect fake news.

No small feat, but we are blessed with three great candidates to help us out: Markus Beuckelmann is in charge of Pattern's migration to Python 3 and is already coding and committing. Masha Ivenskaya is ready to expand on her previous work on fake news detection through text analytics and is now looking at best practices for evaluating the task ahead. Bhairav Mehta will be finding ways to detect suspicious patterns in images, but will also apply his thorough knowledge of deep neural nets to the text analytics side of things.

Add to this a diverse group of mentors to help us along the way, and I'm sure we'll make GSoC 2017 one to remember.

Stay tuned!
