Porting Pattern to Python 3: Phase I

  • Posted on: 27 June 2017
  • By: markus

The first weeks of GSoC are coming to an end, so let's take some time to reflect on the overall progress during the first phase of the coding period.

The following decisions have been taken in phase one:

  • We will aim for a single code base for Pattern that supports both Python 2.7 and Python 3.5+. Notably, this involves dropping support for Python 2.6 and earlier.
  • We will aim to write forward-compatible code, i.e. code that handles Python 3 as the default and Python 2 as the exception. This requires some effort but will hopefully make the code more readable in the long term and make it easy to drop Python 2 support entirely at some point.
  • Wherever possible, we will avoid using the six module since it tends to obfuscate the source code. We will however make use of the future package wherever suitable.
  • I will not touch the master branch on the clips/pattern GitHub repository. Instead, changes working towards a stable Python 2.7 version are committed to the development branch. Apart from this, I am mostly working on the python3 branch, which will incrementally build towards the Python 2/3 code base. Two minor branches, debundle and wordnet, have been created to rip out vendorized libraries and to help with moving away from pywordnet.
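
To make the second decision more concrete, here is a minimal sketch of what "Python 3 as the default, Python 2 as the exception" can look like for a standard-library module that was renamed. The helper hostname() is made up for this illustration and is not taken from the Pattern code base.

```python
# Python 3 is the default path; Python 2 is handled as the exception.
try:
    from urllib.parse import urlparse  # Python 3 location
except ImportError:
    from urlparse import urlparse      # Python 2 fallback


def hostname(url):
    """Return the host part of a URL (hypothetical helper for illustration)."""
    return urlparse(url).netloc


print(hostname("https://github.com/clips/pattern"))
```

With this layout, dropping Python 2 support later mostly amounts to deleting the except branch.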

The following lists the steps taken over the last weeks, in roughly chronological order:

May 30 – June 11

  • I set up Travis CI, a continuous integration platform that helps us keep track of which unit tests pass or fail for different Python versions. Every time changes are committed to one of the branches, Travis CI runs the unit tests, shows a build matrix on the project's status page and lists the log of all unit tests.

  • In the current version, Pattern comes bundled with many libraries that are directly integrated into the Pattern code base, especially in the pattern.web module. This practice should be discouraged, since it requires keeping up with the development of each library individually and merging upstream changes back into Pattern, which is quite laborious. Since we nowadays have decent setup procedures that can resolve dependencies, these modules should be removed from the code base entirely and added as external dependencies to setup.py. Specifically, the following bundled libraries were removed from the code base and are now external dependencies only: feedparser, BeautifulSoup, pdfminer, simplejson, docx and cherrypy.

  • There used to be a <> operator in Python 2 which is no longer available in Python 3. I replaced all occurrences with the equivalent != (i.e. not equal) operator.

  • In Python 3, only absolute imports and explicit relative imports are supported. I adapted a good part of the import statements in various modules.

  • There were some changes to the way numeric literals are handled by the interpreter. Integer literals with leading zeros, e.g. 01, are a syntax error in Python 3 (octal literals must be written with the 0o prefix, e.g. 0o1), as are explicit long integer literals such as 1L. I adapted the code base accordingly.

  • Python 3 removes one of the two ways in Python 2 to catch exceptions, except SomeException, e:, in favor of the universal except SomeException as e:. Similarly, when raising exceptions, raise SomeException, "Something is wrong!" is no longer valid syntax and must be written as raise SomeException("Something is wrong!"). I adapted the code base accordingly.

  • Some of the packages or functions in the standard library have been renamed or refactored in some other way, e.g. urllib and htmlentitydefs. In general, Python 3 provides a more consistent naming scheme. I adapted the source code to deal with this, either using try: ... except: around import statements, or making use of the future library.

  • Furthermore, functions like zip() and map() return lazy iterators instead of lists in Python 3, and range() returns a lazy range object. reduce() is no longer a builtin and must be imported from functools. This required some code refactoring since iterators can neither be indexed nor sliced.

  • I did a bit of community work on GitHub, closing resolved or ancient issues or pull requests and opening some issues to address more recent developments. I plan to expand on this during the next two periods.
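
Several of the syntax and library changes from this period fit into one small snippet. This is only a sketch written for this post, not code from Pattern; the function parse_mode() and the variable names are invented for illustration. It runs unchanged on Python 2.7 and Python 3.

```python
from __future__ import print_function  # makes print() a function on 2.7 too

from functools import reduce  # a builtin in Python 2, only in functools in Python 3


def parse_mode(text):
    """Parse an octal permission string such as '755' (illustrative helper)."""
    try:
        mode = int(text, 8)
    except ValueError as e:   # Python 2 only: except ValueError, e:
        # Python 2 only: raise ValueError, "Not an octal number"
        raise ValueError("Not an octal number: %r (%s)" % (text, e))
    if mode != 0o755:         # `<>` is gone, and 0755 must be spelled 0o755
        print("non-default mode:", oct(mode))
    return mode


# zip() and map() return one-shot iterators in Python 3, so they are
# wrapped in list() before indexing or slicing.
pairs = list(zip("abc", range(3)))
first_two = pairs[:2]

squares = list(map(lambda x: x * x, range(5)))
total = reduce(lambda a, b: a + b, squares)

print(first_two, squares[-1], total)
```

Wrapping in list() is the simplest fix, but it gives up laziness; where only iteration is needed, the iterator can be used directly.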

June 12 – June 25

  • The sorted() function no longer accepts custom comparison functions with the cmp keyword in Python 3. Instead, one must move over to using key functions. There is a helper function cmp_to_key available in functools which can deal with this quite easily.

  • In pattern.graph, Node objects could not be added to sets because they became unhashable in Python 3: when a class overrides __eq__() without also defining __hash__(), Python 3 sets its __hash__ to None. The solution was simply to specify __hash__ = object.__hash__ in the class definition to explicitly use the default hashing procedure.

  • In Python 3, the __getslice__() method for slicing is no longer supported. Instead, everything is deferred to __getitem__(), which receives a slice object. I had to do some code refactoring to account for this, mostly for the Datasheet object in db/__init__.py.

  • I refactored the unit test test_db.py to do the initialization work (mostly setting up the MySQL/MariaDB database handle) in a slightly different way. When running the test with nose or pytest, the initialization sometimes failed, resulting in failing unit tests due to a closed database handle or similar problems.

  • I noticed that the MySQLdb package is not available on Python 3, so some of the tests in test_db.py were not actually discovered until after the refactoring. However, there is a package called mysqlclient which can substitute MySQLdb and supports both Python 2 and Python 3.

  • I added an option for pytest to report code coverage information.

  • I made numpy and scipy dependencies in setup.py.

  • Since pywordnet is deprecated (since 2001!) and integrated into nltk, I refactored the code in pattern/text/en/wordnet/__init__.py to support the new interface, which has changed extensively. This is still work in progress as of today...

  • I decided to move over to pytest instead of nose for unit testing since it has become the de facto standard over the last years and nose has been deprecated for some time now. This does not require any refactoring right now because pytest is able to discover and run the classical unittest test cases. At some point it might be desirable to port all of the unit tests to pytest, but this is not a high priority right now.
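
The three language-level changes from this period (cmp_to_key, restoring __hash__, and slicing through __getitem__()) can be sketched as follows. The classes Node and Rows below are toy stand-ins written for this post; they are not Pattern's actual Node or Datasheet implementations, and by_length_then_alpha() is an invented comparison function.

```python
from functools import cmp_to_key


def by_length_then_alpha(a, b):
    """Old-style comparison function: negative, zero or positive result."""
    return (len(a) - len(b)) or ((a > b) - (a < b))


words = ["pear", "fig", "apple", "kiwi"]
# Python 2 only: sorted(words, cmp=by_length_then_alpha)
ordered = sorted(words, key=cmp_to_key(by_length_then_alpha))


class Node(object):
    """Toy stand-in for a graph node, compared by id (not Pattern's Node)."""

    def __init__(self, id):
        self.id = id

    def __eq__(self, other):
        return isinstance(other, Node) and self.id == other.id

    # Defining __eq__ sets __hash__ to None in Python 3; restore the default.
    __hash__ = object.__hash__


class Rows(object):
    """Toy container showing slice handling without __getslice__."""

    def __init__(self, items):
        self._items = list(items)

    def __getitem__(self, index):
        if isinstance(index, slice):  # rows[1:3] arrives here as a slice object
            return Rows(self._items[index])
        return self._items[index]

    def __len__(self):
        return len(self._items)


node = Node("a")
rows = Rows("abcde")
print(ordered, node in {node}, len(rows[1:3]), rows[0])
```

One caveat with the default hash: it is based on object identity, so two distinct objects that compare equal can still end up as separate set members. Where that matters, a hash derived from the same attribute that __eq__() compares would be the stricter choice.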

Next Steps

  • As of right now, the following modules have been ported to Python 3: pattern.metrics, pattern.graph, pattern.db, pattern.vector.
  • In the upcoming weeks, I will work towards porting the two juicy modules, pattern.text and pattern.web, which both require a lot of Unicode handling.