Максим Филин's blog

Final Report: Pattern 3

Project overview

During the Google Summer of Code event I was focused on the development of the Pattern 3 framework. Pattern consists of many modules which help users to work with web data, use machine learning algorithms, apply natural language processing technics and many other useful functions. The main task was to complete porting Pattern to Python 3 and refactor code. The problem of fixing bugs was also important and as a result all tests in the automatic testing system Travis CI are executed successfully. The new functions and modules have also been added. The pattern.ru module allows users to work with Russian texts. The pattern.web module was improved by adding the VKontakte API class.

Main completed tasks

  • Compiling libsvm and liblinear binaries for macos, ubuntu and windows and adding them in pattern to make pattern.vector work out of box.
  • Refactoring social media Twitter API.
  • Testing all modules, fixing bugs and Travis CI tests.
  • Creating VKontakte API class which allows users to get information from the biggest Russian social network. With this you can retrieve user's profile description and profile picture, user's posts from the profile wall and posts from the newsfeed for a search keyword.
  • Creating pattern.ru module and collecting the necessary data: Named Entities List, Frequency Dictionaries, Part of Speech Wordlist, Spelling List. The parser for part of speech tagging and spellchecker are now available for Russian language.
  • Pattern Python 3 Release

Future Work

There are many opportunities to continue improving the Pattern framework and introduce new functionality. For example, the web mining module can be extended with some other features helping users to analyze the collected data from social media. Also it is important to add the sentiment analysis to pattern.ru part.

While I was working on Pattern Python 3 release I was collecting the political tweets from Twitter and posts from VKontakte to make big dataset which can help to analyze political debate tweets and political discussions. When the collection process is completed, the data set will be available to researchers.

Phase 2 - Report: Максим Филин

During the second month of Google Summer of Code, I continued working with Pattern 3 Framework and focused on the following tasks:

1. VKontakte API

I’ve implemented several methods which allow users to get information from VK social network.

Possibilities: 1) retrieving a user's profile description and profile picture 2) retrieving user's posts from the profile wall 3) retrieving posts from the newsfeed for a search keyword.

I’ve taken into account the request limitations and added the rules in VK class.

We’ve considered the authorization method in the social network VKontakte by a direct link through the VKontakte API (based on the OAuth protocol), called Implicit flow. Authorization by this method is performed through the VKontakte application specified in the form of an ID. This is the most secure method of authorization.

Also the instruction about how to obtain access_token was prepared. It’s in class description now but later will be added to pattern documentation.

2. Testing the most important parts

Twitter, SVM, Datasheet and DOM (the HTML parser in pattern.web), adding new tests and fixing bugs.

3. pattern.ru

The structure of the module for the Russian language was created and part of the functional was implemented. Also the necessary data for module was collected: - Named Entities List - Frequency Dictionary - Part of Speech Wordlist - Spelling List - The following functional was implemented: 1) Parser for part of speech tagging 2) Spellchecker

The new methods will be gradually added.

4. Collecting sentiment political Russian tweets and posts from Twitter and VKontakte.

There is a lot of data from Twitter and other social networks in English, German, Arabic and other languages but nothing in Russian. So, we want to collect Russian Twitter/VKontakte dataset, particulary Russian political debate tweets. Then we can compare the data that we already have and determine general political and social trends.

The main idea is to use Yandex Toloka for dataset creation. It is a crowdsourcing platform which helps developers and researches to perform various tasks including marking texts. The creation of a task on this platform is not a trivial and should be formulated as accurately as possible, but with the help of the assessors it is possible to get the result more quickly than by own forces.

5. Prepare Pattern 3 for release.

It was important that all tests running through Travis CI were executed without errors. Now all tests pass. In the following weeks we are going to make the release of Pattern Py3.

Phase 1 - Report: Максим Филин

During the first month of Google Summer of Code, I have been working with the following tasks:

  1. Refactoring social media Twitter API.

  2. Trying all framework examples and fixing bugs and errors.

The project in master branch now supports only python 2, but we want to add the ability of using python 3. In this regard, it is necessary in some places to refactor the code. So, most of errors have been fixed.

3. Compiling libsvm and liblinear binaries for macos, ubuntu and windows and adding them in pattern to make pattern.vector work out of box. SVM examples now work on all platforms.

4. Looking at Travis CI failing unit tests.

There were some errors: in TestParser.test_find_keywords and TestClassifier.test_igtree. The problem was in wrong interpretation of test examples and they are now rewritten. But there are still sometimes errors in Search Engine because of APIs licences and permissions.

5. Working on VK API.

We think that it would be great to add the api to VKontakte, the biggest Russian social network. It is used in many countries and supports over than 50 languages. It also has very multifunctional api, so it can be added in pattern. The structure for working with api have already been created.

In the following part of Google Summer of Code we will focus on Python 3 Pattern release and testing. Also we will try to implement pattern.ru module.