markus's blog

Porting Pattern to Python 3: DONE!

  • Posted on: 24 September 2017
  • By: markus

The final days of this year's Google Summer of Code have arrived and I am wrapping up my project. The last three months have been full of intense coding on the Pattern library and I'm happy to say that all milestones described in my project proposal are knocked off within the official coding period.

An exhaustive list of all my commits to the clips/pattern repository can be found here. A very nice commit–based comparison is available here (full diff and full patch). The official commit graph can be seen here as soon as the changes have been merged into the master branch. The Travis CI build for different branches can be looked at here together with the automated unit test coverage reports on coveralls.io. The last official GSoC commit is ec95f97 on the development branch.

Overview & Synopsis

This is what the official project description reads:

The purpose of this GSoC project will be to modernize Pattern, a Python library for machine learning, natural language processing (NLP) and web mining. A substantial part of this undertaking is to port a majority of the code base to Python 3. This involves porting the individual modules and sub–modules piece by piece, where the whole process will be guided by unit tests. In the beginning, I will remove all tests from the pipeline that do not pass for Python 3 and take this pared–down code base as a starting point, porting parts of the code and putting the respective unit tests back in as I go along. Missing unit tests must be added before moving on. Since porting Python 2 code to Python 3 code is a standard problem for the Python community, there are many different tools available that can help in this regard. In addition to that, I'd like to extend this project to a bit of a Hausmeister project (housekeeping for Pattern), and optimize/modernize the code base in terms of execution speed, memory usage and documentation.

At the beginning of the project in May (launch time machine), Pattern was in a position where it wasn't actively maintained due to time constraints. Many unit tests were failing, some features were deprecated (e.g. in pattern.web) but most importantly, it lacked Python 3 support, which effectively made it unavailable for a large user base. Now, three months later, we are at a point where all of Pattern's modules (i.e. pattern.text, pattern.vector, pattern.web, pattern.graph, pattern.db, pattern.metrics and the language modules pattern.en, pattern.de, pattern.nl, pattern.fr, pattern.es, pattern.it) except for pattern.server are fully ported to Python 3. This task included working on some other major milestones such as removing the bundled PyWordNet in favor of NLTK's WordNet interface, transitioning to BeautifulSoup 4, removing sgmllib etc. However, the biggest challenge for a joint Python 2 / Python 3 code base is always to carefully deal with unicode handling in all parts of the library, which can sometimes be tedious. Whenever possible we attempted to write forward–compatible code, i.e. code that handles Python 3 as the default and Python 2 as the exception, which required some extra effort, but will hopefully make the code more readable in the long term and make it easier to drop Python 2 support entirely at some point. The next release will deprecate support for Python 2.6 and earlier in favor of Python 2.7 (which will be the last Python 2 version) and Python 3.6+.

Furthermore, several general maintenance tasks have been performed such as code cleanup, documentation, refactoring of duplicate code to an additional pattern.helpers as well as general PEP 8 compliance.

Roadmap & Milestones

By far the largest chunk of work was dealing with the subtle differences between Python 2 and Python 3 to ensure that the code works identically regardless of the interpreter. Moving to a joint code base is a major undertaking, since there are many differences when it comes to strings (unicode vs. byte strings), generators and iterators, package import precedence, division and even fundamental data types such as dict and set. It is more or less hopeless to obtain a joint code base for Python 2.5- and Python 3, but fortunately it is possible to make it work for Python 2.6+ with some precautions, even without using six.
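To make a few of these differences concrete, here is a small sketch of idioms that behave the same on Python 2.7 and Python 3. The function names are made up for illustration and are not taken from the Pattern code base.

```python
def mean(values):
    # Force true division on both interpreters. On Python 2, int / int
    # truncates unless "from __future__ import division" is in effect.
    values = list(values)
    return sum(values) / float(len(values))


def half(n):
    # "//" is floor division on Python 2 and Python 3 alike.
    return n // 2


def sorted_keys(d):
    # dict.keys() is a list on Python 2 but a view on Python 3, so
    # d.keys()[0] or d.keys().sort() only work on Python 2.
    return sorted(d)


def first_n(iterable, n):
    # Works for lists and lazy iterators alike, unlike iterable[:n].
    result = []
    for i, item in enumerate(iterable):
        if i >= n:
            break
        result.append(item)
    return result
```

Writing helpers in this interpreter-neutral way is more verbose than the Python 2 original, but it removes the need to remember which interpreter a given line is running on.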

However, the following points on the roadmap were important milestones that don't necessarily have anything to do with the actual porting per se:

  • The following bundled packages and (vendorized) libraries have been removed in favor of external dependencies: feedparser, BeautifulSoup, pdfminer, simplejson, docx, cherrypy, PyWordNet. This also involved adapting Pattern's code to changes introduced in these external libraries.
  • The removal of PyWordNet came with the need for a new interface for Pattern to wrap NLTK's WordNet interface. This was quite time–consuming, since there had of course been many incompatible changes over the years that needed to be dealt with.
  • We set up Travis CI, a continuous integration platform to keep track of passing or failing unit tests on different branches / Python versions. This will run automatically for every PR and report changes in unit test coverage.
  • libSVM and libLINEAR have been updated to the latest versions. The pre–compiled libraries have been removed for now because they were incompatible with the newer libsvm/liblinear versions.
  • The unit tests were refactored to work with pytest. There is more work that can be done and it might be a good idea to leave the unittest entirely behind at some point in the future.
  • In the last days of the official coding period we went through a big PEP 8 (Style Guide for Python Code) cleanup which aims for a more consistent code base. However, we decided not to aggressively enforce all PEP 8 guidelines.

The new release will introduce the following external dependencies: future, mysqlclient, beautifulsoup4, lxml, feedparser, pdfminer (or pdfminer.six for Python 3), numpy, scipy, nltk, python-docx, cherrypy. For a more in–depth discussion of each of these items, check out my detailed progress reports (phase #1, phase #2).
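As a rough illustration of how the pdfminer / pdfminer.six split could be expressed when assembling the dependency list, consider the sketch below. It is not Pattern's actual setup.py; the helper names pdf_requirement and install_requires are hypothetical, and only the package names come from the list above.

```python
import sys


def pdf_requirement(version_info=sys.version_info):
    # pdfminer for Python 2, pdfminer.six for Python 3 (see the list above).
    return "pdfminer.six" if version_info[0] >= 3 else "pdfminer"


def install_requires(version_info=sys.version_info):
    # Hypothetical helper: returns the dependency list for one interpreter.
    return [
        "future",
        "mysqlclient",
        "beautifulsoup4",
        "lxml",
        "feedparser",
        pdf_requirement(version_info),
        "numpy",
        "scipy",
        "nltk",
        "python-docx",
        "cherrypy",
    ]
```

Passing the version tuple as a parameter keeps the choice easy to check for both interpreters without actually switching between them.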

Statistics

Let's play the numbers game: Over the course of the last three months, I have pushed > 403 commits to four different branches on the clips/pattern repository. This affected 238 files with a total of 11129 insertions and 53735 deletions (git diff --stat). My first commit was 1e17011 on the python3 branch. My last commit was ec95f97 on the development branch.

Here is what the contribution graph and the heat map on GitHub look like: GSoC: Commits

The following panel shows deletions (left) and insertions (right) as a function of time: GSoC: Insertions and Deletions

This graph seems to reflect the roadmap pretty accurately. The majority of the deletions in the first period correspond to the removal of vendorized libraries. As the project progressed, more and more insertions took place and new or modified lines found their way into the code base.

Future Work & Next Steps

We will do some more testing and release the next major version of Pattern in autumn. The following items are predominantly independent of my particular project, but should be tackled before the next major release:

  • The only Python 3 related issue currently remaining is a bug in pattern.vector that affects the information gain tree classifier IGTree. It's hard to debug but it looks to me like it has something to do with order differences when iterating over dict objects. In any case, someone needs to take a closer look. This issue will be tracked on GitHub in the near future.
  • The pattern.server module has the major parts ported, but since there are no unit tests available for this module, it's hard to test it in a systematic way apart from running the examples. I believe it's currently not fully functional on Python 3.
  • The pattern.web module contains code to access some popular web APIs. Some of the APIs are deprecated or changed in some other way that requires refactoring. Some APIs have moved to paid subscription models without free quotas.
  • The current unit test coverage seems to be around 70%. This is okay for now, but there certainly is room for improvement.
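I have not pinned down the cause of the IGTree issue, so the following is only a sketch of the general class of problem suspected above: code whose result depends on the order in which a dict is iterated. The function best_feature and its scores argument are hypothetical and not part of pattern.vector.

```python
def best_feature(scores):
    # max() returns the first maximal element it encounters, so with tied
    # scores the winner depends on iteration order. Iterating over the
    # sorted keys makes ties resolve the same way on every interpreter.
    return max(sorted(scores), key=lambda name: scores[name])
```

With this approach, best_feature({"b": 0.5, "a": 0.5}) and best_feature({"a": 0.5, "b": 0.5}) pick the same feature, whereas max(scores, key=scores.get) may not.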

Resources

The following resources proved to be invaluable during the porting, especially when it comes to the more subtle differences between Python 2 and Python 3:

Acknowledgments

So this is it – the end of the official GSoC coding period, time to sign off for a couple days. Thank you to the Google Open Source Team for bringing this project to life, and of course special thanks to my mentors Tom and Guy for their valuable feedback and guidance throughout the entire project.

Altogether, this was a great experience and I will remain an active contributor for the foreseeable future. Happy coding!

Porting Pattern to Python 3: Phase II

  • Posted on: 3 August 2017
  • By: markus

The second GSoC coding period is over and has brought substantial progress. As of today, all of the submodules with the exception of pattern.server have been ported to Python 3.6. Pattern now shows consistent behavior for both Python 2.7 and Python 3.6 across all modules. All unit tests for pattern.db, pattern.metrics, pattern.graph and the language modules pattern.en, pattern.nl, pattern.de, pattern.fr, pattern.it, pattern.es pass, but there are still one or two failing test cases in pattern.text and pattern.vector, as well as some skipped tests in pattern.web due to changes in some web services' APIs.

Specifically, I have been working on the following issues in the second coding period:

June, 26 – July, 26

  • I continued working on the removal of a bundled pywordnet version which has been deprecated for many years. A good part of the functionality is now integrated into NLTK. However, there have been many backward incompatible changes to the interface over the years, which required significant changes to en/wordnet/__init__.py. I tried my best to hide all the changes in the backend from the Pattern user wherever possible, wrapping the new interface and maintaining the current Pattern en.wordnet interface. Since we now make use of NLTK's WordNet interface, this also makes the nltk package a dependency from now on. The bundled pywordnet version is completely removed now.

  • Pattern comes with a bundled version of libsvm and liblinear which provide various fast, low–level routines for support vector machines (SVMs) and linear classification in pattern.vector. Both bundled versions were quite old, so I replaced both libraries with the most recent release and made the necessary changes to make them work with the Pattern code base and support Python 3. The pre–compiled libraries have been removed for now because they were incompatible with the newer libsvm/liblinear versions. However, we might put some pre–compiled binaries for some platforms back in at some point.

  • Another major issue was some refactoring in pattern.web, most importantly the removal of sgmllib, which no longer exists in Python 3. Fortunately, we are able to base HTMLParser in pattern.web upon the same class in html.parser with some small adjustments (da00ff).

  • In the first coding period, I removed the bundled version of BeautifulSoup from the code base and made it an external dependency. This period, I upgraded the code to make use of the most recent version BeautifulSoup 4 which also supports Python 3. As a result of this, some refactoring was done in pattern.web to account for backward incompatible changes to the parser interface. Furthermore, we now explicitly make use of the fast lxml parser for HTML/XML and consequently, the lxml package is another dependency now.

  • I removed the custom JSON parser in pattern.db since the json module is part of the standard library now.

  • pattern.web contains routines to deal with PDF documents through the pdfminer library. There have been some inconsistencies between Python 2.7 and Python 3.6 which resulted in weird exceptions being raised. Currently, the problem is solved by using the pdfminer package for Python 2 and pdfminer.six for Python 3. However, this should ideally be refactored and unified at some point.

  • There has been a long-standing bug with the single layer perceptron (SLP) (#182) that was haunting me and that I couldn't resolve for weeks. As a consequence of this bug, the majority of the unit tests for pattern.en failed. Last week, I ended up manually going through the commit history using essentially a binary search approach until I narrowed down the cause of the problem. Finally, all the problems are fixed as of 93235fe and the unit test landscape looks much cleaner now!

  • I also spent a lot of time making Python 2 and Python 3 behave consistently throughout all modules. This involved taking care of many of the subtle differences under the hood that I talked about in my first report. In order to avoid surprises for future developers who might not be aware of the differences between Python 2 and Python 3, I decided to put the following imports at the top of every non–trivial file to enforce consistent behavior for the most important parts:

    from __future__ import unicode_literals
    from __future__ import print_function
    from __future__ import absolute_import
    from __future__ import division
    
    from builtins import str, bytes, int
    from builtins import map, zip, filter
    from builtins import object, range
    

    This should cover the most important differences and enforce Python 3–like division, imports, handling of literals and classes derived from object. Hunting down bugs in either Python 2 or Python 3 is laborious and time-consuming when you are unaware of what is really happening and different interpreters yield different results. Consequently, there should be a "no surprises" philosophy when it comes to the behavior of rudimentary data types such as str, bytes, int or functions such as map(), zip(), filter(), justifying the above explicit declarations even if not all of them are strictly necessary right now.

  • There were many encoding issues to be covered in various modules this period to make the code base work with both Python 2 and Python 3, predominantly in pattern.text, pattern.en, pattern.vector and pattern.web. All string literals are now unicode by default (from __future__ import unicode_literals), and functions expect unicode inputs if not stated otherwise. The str object from future makes Python 2 behave like a Python 3 str (which is always unicode).
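The "unicode inside, bytes only at the boundaries" convention described in the last item can be sketched with two small helpers. The names to_text and to_bytes are hypothetical and chosen for this example; they are not necessarily the helpers Pattern uses. The sketch relies only on the standard library and assumes UTF-8 as the default encoding.

```python
def to_text(value, encoding="utf-8"):
    # Decode byte strings to unicode text; pass text through untouched.
    # On Python 2, bytes is an alias for the native str type.
    if isinstance(value, bytes):
        return value.decode(encoding, "replace")
    return value


def to_bytes(value, encoding="utf-8"):
    # Encode unicode text to bytes; pass byte strings through untouched.
    if isinstance(value, bytes):
        return value
    return value.encode(encoding)
```

For example, to_text(b"caf\xc3\xa9") yields a text string of four characters, while the original byte string has a length of five. Decoding once at the point of input and encoding once at the point of output keeps the functions in between free of encoding logic.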

Porting Pattern to Python 3: Phase I

  • Posted on: 27 June 2017
  • By: markus

The first weeks of GSoC are coming to an end, so let's take some time to reflect on the overall progress during the first phase of the coding period.

The following decisions have been taken in phase one:

  • We will aim for a single code base for Pattern that supports both Python 2.7 and Python 3.5+. Notably, this involves dropping support for Python 2.6 and earlier.
  • We will aim to write forward–compatible code, i.e. code that handles Python 3 as the default and Python 2 as the exception. This requires some effort but will hopefully make the code more readable in the long term and make it easy to drop Python 2 support entirely at some point.
  • Wherever possible, we will avoid using the six module since it tends to obfuscate the source code. We will however make use of the future package wherever suitable.
  • I will not touch the master branch on the clips/pattern GitHub repository, but decided to commit changes working towards a stable Python 2.7 version to the development branch. Apart from this, I am mostly working on the python3 branch which will incrementally build towards the Python 2/3 code base. Two minor branches debundle and wordnet have been created to rip out vendorized libraries and help with moving away from pywordnet.

The following will list the steps taken in the last weeks in roughly chronological order:

May, 30 – June, 11

  • I set up Travis CI, a continuous integration platform that helps us keep track of which unit tests pass or fail for different Python versions. Every time changes are committed to one of the branches, Travis CI will run the unit tests, show a build matrix on the project's status page and list the log of all unit tests.

  • In the current version, Pattern comes bundled with many libraries that are directly integrated into the Pattern code base, especially in the pattern.web module. However, this should be discouraged since it requires keeping up with the development of each library individually and merging upstream changes back to Pattern, which is quite laborious. Since we nowadays have decent setup procedures available that can deal with resolving dependencies, these modules should be entirely removed from the code base and added as external dependencies to setup.py. Specifically, the following bundled libraries were removed from the code base and now merely remain external dependencies: feedparser, BeautifulSoup, pdfminer, simplejson, docx and cherrypy.

  • There used to be a <> operator in Python 2 which is no longer available in Python 3. I replaced all occurrences with the equivalent != (i.e. not equal) operator.

  • In Python 3, only absolute imports and explicit relative imports are supported. I adapted a good part of the import statements in various modules.

  • There were some changes to the way numerals are handled by the interpreter. Numbers with leading zeros, e.g. 01, are unsupported in Python 3 (octal literals must be written with the 0o prefix, e.g. 0o1), as are explicit long integer declarations such as 1L. I adapted the code base accordingly.

  • Python 3 removes one of the two ways in Python 2 to catch exceptions, except SomeException, e: in favor of the universal except SomeException as e:. Similarly, when raising exceptions, raise SomeException, "Something is wrong!" is no longer valid and must be written as raise SomeException("Something is wrong!"). I adapted the code base accordingly.

  • Some of the packages or functions in the standard library have been renamed or refactored in some other way, e.g. urllib and htmlentitydefs. In general, Python 3 provides a more consistent naming scheme. I adapted the source code to deal with this, either using try: ... except: around import statements, or making use of the future library.

  • Furthermore, Python 3 makes functions like zip() and map() return lazy iterators instead of lists, and range() returns a lazy range object. reduce() must be separately imported from functools. This required some code refactoring since iterators can neither be indexed nor sliced.

  • I did a bit of community work on GitHub, closing resolved or ancient issues or pull requests and opening some issues to address more recent developments. I plan to expand on this during the next two periods.
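Several of the syntax and library changes listed above can be shown side by side in one short sketch that runs on both Python 2.7 and Python 3. The functions parse_int, product and first_pair and the constant DEFAULT_MODE are invented for illustration; only the standard library names are real.

```python
from functools import reduce  # a builtin on Python 2, functools-only on Python 3

try:
    # Python 3 locations
    from urllib.parse import urlparse
    from html.entities import name2codepoint
except ImportError:
    # Python 2 locations
    from urlparse import urlparse
    from htmlentitydefs import name2codepoint

# Octal literals need the 0o prefix; a bare 0644 is a SyntaxError on Python 3.
DEFAULT_MODE = 0o644


def parse_int(text):
    # "except ValueError, e:" is a SyntaxError on Python 3.
    try:
        return int(text), None
    except ValueError as e:
        return None, str(e)


def product(numbers):
    return reduce(lambda a, b: a * b, numbers, 1)


def first_pair(xs, ys):
    # zip() is lazy on Python 3, so materialize it before indexing.
    return list(zip(xs, ys))[0]
```

The try: ... except ImportError: pattern puts the Python 3 location first, in line with the decision to treat Python 3 as the default and Python 2 as the exception.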

June, 12 – June, 25

  • The sorted() function no longer accepts custom comparison functions with the cmp keyword in Python 3. Instead, one must move over to using key functions. There is a helper function cmp_to_key available in functools which can deal with this quite easily.

  • In pattern.graph, Node objects could not be added to sets because they became unhashable in Python 3: a class that overrides __eq__() without defining __hash__() has its __hash__ implicitly set to None. The solution was to simply specify __hash__ = object.__hash__ in the class definition to explicitly use the default hashing procedure.

  • In Python 3, the __getslice__() method for slicing no longer exists. Instead, slicing is routed through __getitem__(), which receives a slice object. I had to do some code refactoring to account for this, mostly for the Datasheet object in db/__init__.py.

  • I refactored the unit test test_db.py to do the initialization work (mostly setup of MySQL/MariaDB database handle) in a slightly different way. This is because when running the test with nose or pytest, sometimes the initialization failed, resulting in failing unit tests due to a closed database handle or similar problems.

  • I noticed that the MySQLdb package is not available on Python 3, so some of the tests in test_db.py were not actually discovered until after the refactoring. However, there is a package called mysqlclient which can substitute MySQLdb and supports both Python 2 and Python 3.

  • I added an option for pytest to report code coverage information.

  • I made numpy and scipy dependencies in setup.py.

  • Since pywordnet is deprecated (since 2001!) and integrated into nltk, I refactored the code in pattern/text/en/wordnet/__init__.py to support the new interface, which has changed extensively. This is still work in progress as of today...

  • I decided to move over to pytest instead of nose for unit testing since it has become the de–facto standard over the last years and nose has been deprecated for some time now. This does not require any refactoring right now because pytest is able to discover and run the classical unittest test cases. However, at some point it might be desirable to port all of the unit tests to pytest, but this is not exactly of high priority right now.
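The three code-level changes from this period (key functions instead of cmp, restoring __hash__, and slicing through __getitem__()) can be sketched as follows. The Node and Table classes here are simplified stand-ins written for this example and do not reproduce the actual classes in pattern.graph or pattern.db.

```python
from functools import cmp_to_key


def compare_length(a, b):
    # Old-style comparison function: negative, zero or positive.
    return len(a) - len(b)


def sort_by_length(words):
    # Python 2 allowed sorted(words, cmp=compare_length).
    return sorted(words, key=cmp_to_key(compare_length))


class Node(object):
    def __init__(self, id):
        self.id = id

    def __eq__(self, other):
        return isinstance(other, Node) and self.id == other.id

    def __ne__(self, other):
        return not self.__eq__(other)

    # Defining __eq__ sets __hash__ to None on Python 3; restore the default.
    __hash__ = object.__hash__


class Table(object):
    def __init__(self, rows):
        self.rows = list(rows)

    def __getitem__(self, index):
        # Python 3 passes a slice object here instead of calling __getslice__().
        if isinstance(index, slice):
            return Table(self.rows[index])
        return self.rows[index]

    def __len__(self):
        return len(self.rows)
```

One caveat of the __hash__ = object.__hash__ approach is that hashing is identity-based, so two distinct Node objects that compare equal can still end up as separate entries in a set. Whether that matters depends on how the objects are created and compared.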

Next Steps

  • As of right now, the following modules have been ported to Python 3: pattern.metrics, pattern.graph, pattern.db, pattern.vector.
  • In the upcoming weeks, I will work towards porting the two juicy modules, pattern.text and pattern.web which both require a lot of unicode handling.