Phase 2 - Report: Максим Филин
During the second month of Google Summer of Code, I continued working on the Pattern 3 framework and focused on the following tasks:
1. VKontakte API
I’ve implemented several methods that let users retrieve information from the VK social network.
Possibilities:
1) retrieving a user's profile description and profile picture
2) retrieving a user's posts from the profile wall
3) retrieving posts from the newsfeed for a search keyword
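To make the three possibilities concrete, here is a minimal sketch of how the corresponding request URLs could be built against the public VK API. It is not the code of the VK class in Pattern. The helper name `vk_method_url`, the placeholder token, the chosen `fields` values and the API version are assumptions made for illustration. The method names `users.get`, `wall.get` and `newsfeed.search` are VK API methods.

```python
from urllib.parse import urlencode

API_BASE = "https://api.vk.com/method/"
API_VERSION = "5.131"  # assumption: any API version supported by VK would do


def vk_method_url(method, access_token, **params):
    """Build the URL for one VK API method call (hypothetical helper)."""
    query = dict(params, access_token=access_token, v=API_VERSION)
    # Sort the parameters so that the resulting URL is deterministic.
    return API_BASE + method + "?" + urlencode(sorted(query.items()))


# 1) profile description and profile picture
profile_url = vk_method_url("users.get", "TOKEN", user_ids="1", fields="about,photo_max")
# 2) posts from the profile wall
wall_url = vk_method_url("wall.get", "TOKEN", owner_id=1, count=10)
# 3) newsfeed posts for a search keyword
search_url = vk_method_url("newsfeed.search", "TOKEN", q="выборы", count=10)
```

Fetching any of these URLs with a valid token returns a JSON document; the sketch stops at URL construction so that it runs without network access.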
I’ve taken the API's request limits into account and added the corresponding rules to the VK class.
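One simple way to respect a request limit is to enforce a minimum interval between calls. The sketch below illustrates that idea only; it is not the rule set added to the VK class. The class name `Throttle` and its parameters are hypothetical, and the default of three requests per second is an assumption that should be checked against the current VK documentation.

```python
import time


class Throttle:
    """Spaces calls so that at most `rate` of them start per `per` seconds."""

    def __init__(self, rate=3, per=1.0, clock=time.monotonic, sleep=time.sleep):
        self.interval = per / float(rate)  # minimum gap between two calls
        self.clock = clock                 # injectable for testing
        self.sleep = sleep
        self.last = None                   # start time of the previous call

    def wait(self):
        """Block until the next request is allowed to start."""
        now = self.clock()
        if self.last is not None:
            delay = self.interval - (now - self.last)
            if delay > 0:
                self.sleep(delay)
                now = self.clock()
        self.last = now
```

Calling `throttle.wait()` right before each API request keeps the request rate under the configured limit.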
We’ve considered authorization in VKontakte by a direct link through the VKontakte API (based on the OAuth protocol), known as the Implicit flow. With this method, authorization is performed through a VKontakte application identified by its ID. This is the most secure method of authorization.
Instructions for obtaining an access_token have also been prepared. They are in the class description for now and will later be added to the Pattern documentation.
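The Implicit flow described above can be sketched in two steps: build the authorization link that the user opens in a browser, then read the access_token from the fragment of the URL the browser is redirected to. This is an illustration under stated assumptions, not the instructions from the class description. The function names, the default scope and the API version are hypothetical choices; the application ID in the usage note is a placeholder.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

AUTHORIZE_URL = "https://oauth.vk.com/authorize"
REDIRECT_URI = "https://oauth.vk.com/blank.html"


def build_auth_url(app_id, scope="wall,friends", version="5.131"):
    """Return the link a user opens to grant access to the application."""
    params = [
        ("client_id", app_id),        # ID of the VKontakte application
        ("redirect_uri", REDIRECT_URI),
        ("display", "page"),
        ("scope", scope),             # assumption: permissions to request
        ("response_type", "token"),   # "token" selects the Implicit flow
        ("v", version),               # assumption: API version
    ]
    return AUTHORIZE_URL + "?" + urlencode(params)


def extract_access_token(redirected_url):
    """Read access_token from the URL fragment after the redirect."""
    fragment = urlsplit(redirected_url).fragment
    tokens = parse_qs(fragment).get("access_token")
    return tokens[0] if tokens else None
```

For example, the user opens `build_auth_url(1234567)`, confirms access, and copies the address of the resulting page into `extract_access_token`, which returns `None` if the redirect carries no token.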
2. Testing the most important parts
I tested Twitter, SVM, Datasheet and DOM (the HTML parser in pattern.web), adding new tests and fixing bugs.
3. Module for the Russian language
The structure of the module for the Russian language was created and part of the functionality was implemented. The necessary data for the module was also collected:
- Named Entities List
- Frequency Dictionary
- Part of Speech Wordlist
- Spelling List
The following functionality was implemented:
1) Parser for part-of-speech tagging
2) Spellchecker
New methods will be added gradually.
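To show how a frequency dictionary can drive a spellchecker, here is a minimal sketch in the style of Peter Norvig's well-known spelling corrector: generate all words one edit away and pick the most frequent known one. It is not the code of the Russian module. The function names and the tiny `FREQUENCIES` dictionary with its counts are made up for illustration.

```python
ALPHABET = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"

# Toy frequency dictionary; the counts are invented for this example.
FREQUENCIES = {"привет": 120, "привит": 3, "мир": 200, "мор": 5}


def edits1(word):
    """Return all strings one deletion, swap, replacement or insertion away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def suggest(word, frequencies=FREQUENCIES):
    """Return the most frequent known word within one edit of `word`."""
    word = word.lower()
    if word in frequencies:
        return word  # already a known word
    candidates = [w for w in edits1(word) if w in frequencies]
    if not candidates:
        return word  # nothing better is known, keep the input
    return max(candidates, key=frequencies.get)
```

With the toy dictionary above, `suggest("привт")` finds both "привет" and "привит" one insertion away and returns the more frequent "привет".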
4. Collecting Russian political tweets and posts from Twitter and VKontakte for sentiment analysis.
There is a lot of data from Twitter and other social networks in English, German, Arabic and other languages but nothing in Russian. So we want to collect a Russian Twitter/VKontakte dataset, particularly tweets from Russian political debates. We can then compare it with the data we already have and determine general political and social trends.
The main idea is to use Yandex Toloka to create the dataset. It is a crowdsourcing platform that helps developers and researchers perform various tasks, including labeling texts. Creating a task on this platform is not trivial, and the task should be formulated as accurately as possible, but with the help of the assessors the result can be obtained more quickly than by doing the work ourselves.
5. Preparing Pattern 3 for release.
It was important that all tests run through Travis CI complete without errors. Now all tests pass. In the coming weeks we are going to release Pattern Py3.