Porting Pattern to Python 3: Phase II
The second GSoC coding period is over and has brought substantial progress. As of today, all of the submodules with the exception of pattern.server have been ported to Python 3.6. Pattern now shows consistent behavior for both Python 2.7 and Python 3.6 across all modules. All unit tests for pattern.graph and the language modules pattern.es pass, but there are still one or two failing test cases in pattern.vector, as well as some skipped tests in pattern.web due to changes in some web services' APIs.
Specifically, I have been working on the following issues in the second coding period:
June 26 – July 26
I continued working on the removal of a bundled pywordnet version which has been deprecated for many years. A good part of the functionality is now integrated into NLTK; however, there have been many backward-incompatible changes to the interface over the years, which required significant changes to en/wordnet/__init__.py. I tried my best to hide all the backend changes from the Pattern user wherever possible, wrapping the new interface and maintaining the current Pattern en.wordnet interface. Since we now make use of NLTK's WordNet interface, the nltk package becomes a dependency from now on. The bundled pywordnet version is now completely removed.
Pattern comes with bundled versions of libsvm and liblinear, which provide various fast, low-level routines for support vector machines (SVMs) and linear classification in pattern.vector. Both bundled versions were quite old, so I replaced both libraries with the most recent release and made the necessary changes to make them work with the Pattern code base and support Python 3. The pre-compiled libraries have been removed for now because they were incompatible with the newer liblinear versions. However, we might put pre-compiled binaries for some platforms back in at some point.
Another major issue was some refactoring in pattern.web, most importantly the removal of sgmllib, which was deprecated in Python 2.6 and removed in Python 3. Fortunately, we are able to base pattern.web upon the same class in html.parser with some small adjustments (da00ff).
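To give a rough idea of what a parser built on html.parser looks like, here is a minimal sketch using only the standard library. The class PlainTextExtractor and the helper extract() are hypothetical names chosen for illustration; they are not taken from pattern.web and do not reflect how the actual port is structured.

```python
from html.parser import HTMLParser


class PlainTextExtractor(HTMLParser):
    """Hypothetical example: collects text content and link targets."""

    def __init__(self):
        # With convert_charrefs=True, character references such as &amp;
        # are decoded before handle_data() is called.
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

    def handle_data(self, data):
        self.chunks.append(data)

    def text(self):
        return "".join(self.chunks)


def extract(html):
    """Return (plain text, list of link targets) for an HTML string."""
    parser = PlainTextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.text(), parser.links
```

The callback names (handle_starttag, handle_data) are the ones html.parser defines, so a class written against sgmllib's callbacks needs its handlers adapted accordingly.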
In the first coding period, I removed the bundled version of BeautifulSoup from the code base and made it an external dependency. This period, I upgraded the code to make use of the most recent version, BeautifulSoup 4, which also supports Python 3. As a result, some refactoring was done in pattern.web to account for backward-incompatible changes to the parser interface. Furthermore, we now explicitly make use of the fast lxml parser for HTML/XML, and consequently the lxml package is another dependency now.
I removed the custom JSON parser, since the json module is part of the standard library now.
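For reference, the standard library module covers the usual parse and serialize round trip. The payload below is made-up sample data, not an actual web service response handled by Pattern.

```python
import json

# Made-up sample payload for illustration only.
payload = '{"query": "pattern", "results": [1, 2, 3], "cached": false}'

# Parse a JSON document into Python objects (dict, list, bool, ...).
data = json.loads(payload)

# Serialize back to text; ensure_ascii=False keeps non-ASCII characters
# readable instead of escaping them as \uXXXX.
roundtrip = json.dumps(data, sort_keys=True, ensure_ascii=False)
```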
pattern.web contains routines to deal with PDF documents through the pdfminer library. There have been some inconsistencies between Python 2.7 and Python 3.6 which resulted in weird exceptions being raised. Currently, the problem is solved by using the pdfminer package for Python 2 and pdfminer.six for Python 3; however, this should ideally be refactored and unified at some point.
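One way such a version-dependent requirement could be expressed is sketched below. The helper pdf_dependency() is a hypothetical name used for illustration; how Pattern's setup script actually selects the package is not shown here.

```python
import sys


def pdf_dependency(version_info=sys.version_info):
    """Return the PDF package name for the given interpreter version.

    Hypothetical helper: mirrors the split described in the text
    (pdfminer on Python 2, pdfminer.six on Python 3).
    """
    if version_info[0] >= 3:
        return "pdfminer.six"
    return "pdfminer"
```

Passing the version tuple as an argument keeps the choice easy to check for both interpreters without having to run the code under each of them.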
There has been a long-standing bug with the single layer perceptron (SLP) (#182) that was haunting me and that I couldn't resolve for weeks. As a consequence of this bug, the majority of the unit tests for pattern.en failed. Last week, I ended up manually going through the commit history, essentially using a binary search approach, until I narrowed down the cause of the problem. Finally, all the problems are fixed as of 93235fe and the unit test landscape looks much cleaner now!
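The binary search over the commit history can be sketched as follows. This is a generic illustration, not the procedure actually used: first_bad_commit() and is_bad are hypothetical names, and the sketch assumes that the commits are ordered from oldest to newest, that the newest one is broken, and that every commit after the culprit is broken as well. The git bisect command automates the same idea.

```python
def first_bad_commit(commits, is_bad):
    """Return the earliest commit for which is_bad(commit) is True.

    Assumes commits are ordered oldest to newest, the last commit is bad,
    and once a commit is bad every later commit is bad as well.
    """
    lo, hi = 0, len(commits) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid        # the culprit is mid or an earlier commit
        else:
            lo = mid + 1    # the culprit comes after mid
    return commits[lo]
```

In practice, is_bad would check out the commit and run the failing unit test. Each step halves the remaining range, so the number of test runs grows only logarithmically with the number of commits.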
I also spent a lot of time making Python 2 and Python 3 behave consistently throughout all modules. This involved taking care of many of the subtle differences under the hood that I talked about in my first report. In order to avoid surprises for future developers who might not be aware of the differences between Python 2 and Python 3, I decided to put the following imports at the top of every non-trivial file to enforce consistent behavior for the most important parts:
    from __future__ import unicode_literals
    from __future__ import print_function
    from __future__ import absolute_import
    from __future__ import division
    from builtins import str, bytes, int
    from builtins import map, zip, filter
    from builtins import object, range
This should cover the most important differences and enforce Python 3-like division, imports, handling of literals and classes derived from object. Hunting down bugs in either Python 2 or Python 3 is laborious and time-consuming when you are unaware of what is really happening and different interpreters yield different results. Consequently, there should be a "no surprises" philosophy when it comes to the behavior of rudimentary data types such as int or functions such as map(), zip(), filter(), justifying the above explicit declarations even if not all of them are strictly necessary right now.
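The behavior these declarations are meant to guarantee can be illustrated with a few lines. The sketch below assumes Python 3, where this is simply the interpreter's default; on Python 2 the same results would additionally require the __future__ imports listed above and the builtins module provided by the future package.

```python
# On Python 3 these are the interpreter's own built-ins; on Python 2 the
# `future` package provides a `builtins` module with backported versions.
from builtins import map, zip, filter, range

# True division: "/" always returns a float, "//" is floor division.
half = 7 / 2
floored = 7 // 2

# map(), zip() and filter() return lazy iterators rather than lists.
squares = map(lambda x: x * x, range(4))
is_lazy = not isinstance(squares, list)
squares_list = list(squares)

pairs = list(zip("ab", range(2)))
evens = list(filter(lambda x: x % 2 == 0, range(6)))
```

Code that indexes into or takes the length of the result of map() works on a plain Python 2 list but fails on an iterator, which is exactly the kind of surprise the explicit imports are meant to rule out.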
There were many encoding issues to be covered in various modules this period to make the code base work with both Python 2 and Python 3, predominantly in pattern.web. All string literals are now unicode by default (from __future__ import unicode_literals), and functions expect unicode inputs if not stated otherwise. The future package makes str in Python 2 behave like a Python 3 str (which is always unicode).
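A common way to honor the "unicode inside, bytes at the boundary" convention is to normalize values as soon as they enter the code. The helpers below are a minimal sketch with hypothetical names (to_unicode, to_bytes) and an assumed UTF-8 default; they are not Pattern's own functions.

```python
def to_unicode(value, encoding="utf-8"):
    """Return value as text, decoding byte strings if necessary."""
    if isinstance(value, bytes):
        # errors="replace" substitutes U+FFFD for undecodable bytes
        # instead of raising UnicodeDecodeError.
        return value.decode(encoding, errors="replace")
    return value


def to_bytes(value, encoding="utf-8"):
    """Return value as bytes, encoding text if necessary."""
    if isinstance(value, str):
        return value.encode(encoding)
    return value
```

Decoding once at the input boundary (for example, right after downloading a page) means the rest of the code only ever has to deal with one string type.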