Porting Pattern to Python 3: Phase I
The first weeks of GSoC are coming to an end, so let's take some time to reflect on the overall progress during the first phase of the coding period.
The following decisions have been taken in phase one:
- We will aim for a single code base for Pattern that supports both Python 2.7 and Python 3.5+. Notably, this involves dropping support for Python 2.6 and earlier.
- We will aim to write forward-compatible code, i.e. code that handles Python 3 as the default and Python 2 as the exception. This requires some effort but will hopefully make the code more readable in the long term and make it easy to drop Python 2 support entirely at some point.
- Wherever possible, we will avoid using the `six` module since it tends to obfuscate the source code. We will however make use of the `future` package wherever suitable.
- I will not touch the `master` branch on the `clips/pattern` GitHub repository, but decided to commit changes working towards a stable Python 2.7 version to the `development` branch. Apart from this, I am mostly working on the `python3` branch which will incrementally build towards the Python 2/3 code base. Two minor branches, `debundle` and `wordnet`, have been created to rip out vendorized libraries and help with moving away from `pywordnet`.
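To make the second decision a bit more concrete, here is a minimal sketch of what "Python 3 as the default, Python 2 as the exception" can look like. The names `PY2`, `text_type` and `to_text` are hypothetical and were chosen for this example only; they are not taken from the Pattern code base.

```python
from __future__ import print_function, unicode_literals

import sys

# Python 3 is the default; Python 2 is handled as the special case.
PY2 = sys.version_info[0] == 2

text_type = str
if PY2:
    text_type = unicode  # noqa: F821 (this name only exists on Python 2)


def to_text(value, encoding="utf-8"):
    """Return value as a text (unicode) string on both Python versions."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return text_type(value)


print(to_text(b"caf\xc3\xa9") == "caf\u00e9")
```

Once Python 2 support is dropped, only the `if PY2:` branch has to go; the rest of the code already reads like plain Python 3.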
The following lists the steps taken over the last weeks in roughly chronological order:
May 30 – June 11
- I set up Travis CI, a continuous integration platform that helps us keep track of which unit tests pass or fail for different Python versions. Every time changes are committed to one of the branches, Travis CI runs the unit tests, shows a build matrix on the project's status page and lists the log of all unit tests.
- In the current version, Pattern comes bundled with many libraries that are directly integrated into the Pattern code base, especially in the `pattern.web` module. This practice should be discouraged since it requires keeping up with the development of each library individually and merging upstream changes back into Pattern, which is quite laborious. Since decent setup procedures that resolve dependencies are available nowadays, these modules should be removed from the code base entirely and added as external dependencies to `setup.py`. Specifically, the following bundled libraries were removed from the code base and now remain only as external dependencies: `feedparser`, `BeautifulSoup`, `pdfminer`, `simplejson`, `docx` and `cherrypy`.
- There used to be a `<>` operator in Python 2 which is no longer available in Python 3. I replaced all occurrences with the equivalent `!=` (i.e. not equal) operator.
- In Python 3, only absolute imports and explicit relative imports are supported. I adapted a good part of the `import` statements in various modules.
- There were some changes to the way numerals are handled by the interpreter. Numbers with leading zeros, e.g. `01`, are unsupported in Python 3 (octal literals now require the `0o` prefix), as are explicit long integer declarations such as `1L`. I adapted the code base accordingly.
- Python 3 removes one of the two ways in Python 2 to catch exceptions, `except SomeException, e:`, in favor of the universal `except SomeException as e:`. Similarly, when raising exceptions, `raise SomeException, "Something is wrong!"` is no longer valid and must be written as `raise SomeException("Something is wrong!")`. I adapted the code base accordingly.
- Some of the packages or functions in the standard library have been renamed or refactored in some other way, e.g. `urllib` and `htmlentitydefs`. In general, Python 3 provides a more consistent naming scheme. I adapted the source code to deal with this, either using `try: ... except:` around `import` statements, or making use of the `future` library.
- Furthermore, Python 3 makes functions like `range()`, `zip()` and `map()` lazy: they return iterators or lazy sequences rather than lists. `reduce()` must be imported separately from `functools`. This required some code refactoring since the iterators returned by `zip()` and `map()` can neither be indexed nor sliced.
- I did a bit of community work on GitHub, closing resolved or ancient issues and pull requests and opening some issues to address more recent developments. I plan to expand on this during the next two periods.
June 12 – June 25
- The `sorted()` function no longer accepts custom comparison functions with the `cmp` keyword in Python 3. Instead, one must move over to using key functions. There is a helper function `cmp_to_key` available in `functools` which can deal with this quite easily.
- In `pattern.graph`, `Node` objects could not be added to sets because they became unhashable in Python 3 due to the fact that the `__eq__()` method was overridden. The solution was to simply specify `__hash__ = object.__hash__` in the class definition to explicitly use the default hashing procedure.
- In Python 3, the `__getslice__()` method for slicing is no longer supported. Instead, slicing is deferred to `__getitem__()`, which receives a `slice` object. I had to do some code refactoring to account for this, mostly for the `Datasheet` object in `db/__init__.py`.
- I refactored the unit test `test_db.py` to do the initialization work (mostly setting up the MySQL/MariaDB database handle) in a slightly different way. This is because when running the test with `nose` or `pytest`, the initialization sometimes failed, resulting in failing unit tests due to a closed database handle or similar problems.
- I noticed that the `MySQLdb` package is not available on Python 3, so some of the tests in `test_db.py` were not actually discovered until after the refactoring. However, there is a package called `mysqlclient` which can substitute `MySQLdb` and supports both Python 2 and Python 3.
- I added an option for `pytest` to report code coverage information.
- I made `numpy` and `scipy` dependencies in `setup.py`.
- Since `pywordnet` is deprecated (since 2001!) and integrated into `nltk`, I refactored the code in `pattern/text/en/wordnet/__init__.py` to support the new interface, which has changed extensively. This is still work in progress as of today...
- I decided to move over to `pytest` instead of `nose` for unit testing since it has become the de facto standard over the last years and `nose` has been deprecated for some time now. This does not require any refactoring right now because `pytest` is able to discover and run the classical `unittest` test cases. At some point it might be desirable to port all of the unit tests to `pytest`, but this is not a high priority right now.
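The three data model changes from this period (comparison functions, hashing and slicing) can be sketched as follows. `Node` and `Rows` are simplified stand-ins written for this post, not the actual `pattern.graph.Node` or `Datasheet` classes, and `by_length_then_alpha` is a hypothetical comparison function.

```python
from functools import cmp_to_key


def by_length_then_alpha(a, b):
    """Old-style comparison function: returns a negative, zero or positive number."""
    if len(a) != len(b):
        return len(a) - len(b)
    return (a > b) - (a < b)  # stands in for the cmp() builtin removed in Python 3


# Python 2 only: sorted(words, cmp=by_length_then_alpha)
words = sorted(["pear", "fig", "apple", "kiwi"], key=cmp_to_key(by_length_then_alpha))


class Node(object):
    def __init__(self, id):
        self.id = id

    def __eq__(self, other):
        return isinstance(other, Node) and self.id == other.id

    def __ne__(self, other):  # Python 2 does not derive != from ==
        return not self.__eq__(other)

    # Defining __eq__ sets __hash__ to None on Python 3; restore the default.
    __hash__ = object.__hash__


class Rows(object):
    """A small container whose slices are Rows again, using __getitem__ only."""

    def __init__(self, items):
        self._items = list(items)

    def __len__(self):
        return len(self._items)

    def __getitem__(self, index):
        if isinstance(index, slice):  # rows[1:3] arrives here as a slice object
            return Rows(self._items[index])
        return self._items[index]


print(words)
print(Node(1) in {Node(1), Node(2)} or "distinct objects hash by identity")
print(len(Rows("abcde")[1:3]))
```

One caveat of `__hash__ = object.__hash__` is that hashing is based on object identity, so two distinct `Node` objects that compare equal can still both end up in the same set. Whether that matters depends on how the nodes are created and looked up.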
Next Steps
- As of right now, the following modules have been ported to Python 3: `pattern.metrics`, `pattern.graph`, `pattern.db`, `pattern.vector`.
- In the upcoming weeks, I will work towards porting the two juicy modules, `pattern.text` and `pattern.web`, which both require a lot of unicode handling.