Porting Pattern to Python 3: Phase II

  • Posted on: 3 August 2017
  • By: markus

The second GSoC coding period is over and has brought substantial progress. As of today, all of the submodules with the exception of pattern.server have been ported to Python 3.6. Pattern now shows consistent behavior for both Python 2.7 and Python 3.6 across all modules. All unit tests for pattern.db, pattern.metrics, pattern.graph and the language modules pattern.en, pattern.nl, pattern.de, pattern.fr, pattern.it, pattern.es pass, but there are still one or two failing test cases in pattern.text, and pattern.vector, as well as some skipped tests in pattern.web due to changes in some web services' APIs.

Specifically, I have been working on the following issues in the second coding period:

June, 26 – July 26

  • I continued working on the removal of a bundled pywordnet version which has been deprecated since many years. A good part of the functionality is now integrated into NLTK, however, there have been many backward incompatible changes to the interface over the years, which required significant changes to en/wordnet/__init__.py. I tried my best to hide all the changes in the backend from the Pattern user wherever possible, wrapping the new interface and maintaining the current Pattern en.wordnet interface. Since we now make use of NLTK's WordNet interface, this also makes the nltk package a dependency from now on. The bundled pywordnet version is completely removed now.

  • Pattern comes with a bundled version of libsvm and liblinear which provide various fast, low–level routines for support vector machines (SVMs) and linear classification in pattern.vector. Both bundled versions were quite old, so I replaced both libraries with the most recent release and made the necessary changes to make them work with the Pattern code base and support Python 3. The pre–compiled libraries have been removed for now because they were incompatible with the newer libsvm/liblinear versions. However, we might put some pre–compiled binaries for some platforms back in at some point.

  • Another major issue was some refactoring in pattern.web, most importantly the removal of sgmllib which is deprecated in Python 3. Fortunately, we are able to base HTMLParser in pattern.web upon the same class in html.parser with some small adjustments (da00ff).

  • In the first coding period, I removed the bundled version of BeautifulSoup from the code base and made it an external dependency. This period, I upgraded the code to make use of the most recent version BeautifulSoup 4 which also supports Python 3. As a result of this, some refactoring was done in pattern.web to account for backward incompatible changes to the parser interface. Furthermore, we now explicitly make use of the fast lxml parser for HTML/XML and consequently, the lxml package is another dependency now.

  • I removed the custom JSON parser in pattern.db since the json module is part of the standard library now.

  • pattern.web contains routines to deal with PDF documents through the pdfminer library. There have been some inconsistencies between Python 2.7 and Python 3.6 which resulted in weird exceptions being raised. Currently, the problem is solved by using the pdfminer package for Python 2 and pdfminer.six for Python 3, however, this should ideally be refactored and unified at some point.

  • There has been a long-standing bug with the single layer perceptron (SLP) (#182) that was haunting me and that I couldn't resolve for weeks. As a consequence of this bug, the majority of the unit tests for pattern.en failed. Last week, I ended up manually going through the commit history using essentially a binary search approach until I narrowed down the cause of the problem. Finally, all the problems are fixed as of 93235fe and the unit test landscape looks much cleaner now!

  • I also spend a lot of time making Python 2 and Python 3 behave consistently throughout all modules. This involved taking care of many of the subtle differences under the hood that I talked about in my first report. In order to avoid surprises for future developers who might not be aware of the differences between Python 2 and Python 3, I decided to put the following imports to the top of every non–trivial file to enforce consist behavior for the most important parts:

    from __future__ import unicode_literals
    from __future__ import print_function
    from __future__ import absolute_import
    from __future__ import division
    from builtins import str, bytes, int
    from builtins import map, zip, filter
    from builtins import object, range

    This should cover the most important differences and enforce Python 3–like division, imports, handling of literals and classes derived from object. Hunting down bugs in either Python 2 or Python 3 is laborious and time-consuming when you are unaware of what is really happening and different interpreters yield different results. Consequently, there should be a "no surprises" philosophy when it comes the behavior of rudimentary data types such as str, bytes, int or functions such as map(), zip(), filter(), justifying the above explicit declarations even if not all of them are strictly necessary right now.

  • There were many encoding issues to be covered in various modules this period to make the code base work with both Python 2 and Python 3, predominantly in pattern.text, pattern.en, pattern.vector and pattern.web. All string literals are now unicode by default (from __future__ import unicode literals), and functions expect unicode inputs if not stated otherwise. The str object from future makes Python 2 behave like a Python 3 str (which is always unicode).