The progress reports for Phase 2 of GSoC 2018 are in! Alexander Rossa, Maja Gwozdz, Rudresh Panchal and Максим Филин have documented their work and are ready for the final phase!
GSOC 2018 Reports
During the second month of Google Summer of Code, I continued working with the Pattern 3 framework and focused on the following tasks:
1. VKontakte API
I’ve implemented several methods that allow users to get information from the VK social network.
Possibilities: 1) retrieving a user's profile description and profile picture 2) retrieving a user's posts from the profile wall 3) retrieving posts from the newsfeed for a search keyword.
I’ve taken the request limits into account and added the corresponding rules to the VK class.
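To give an idea of what such a rule can look like, here is a minimal sketch of a sliding-window throttle. It is not the code from the VK class: the `Throttle` name is hypothetical, and the default of 3 calls per second is an assumption that should be checked against the current VK documentation.

```python
import time
from collections import deque


class Throttle:
    """Blocks so that at most `max_calls` calls start within `period` seconds.

    Hypothetical helper for illustration; the default limit is an assumption.
    """

    def __init__(self, max_calls=3, period=1.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self._clock = clock
        self._sleep = sleep
        self._starts = deque()  # start times of the most recent calls

    def wait(self):
        """Call this right before sending a request."""
        now = self._clock()
        # Forget calls that have already left the sliding window.
        while self._starts and now - self._starts[0] >= self.period:
            self._starts.popleft()
        if len(self._starts) >= self.max_calls:
            # Window is full: sleep until the oldest call expires.
            self._sleep(self.period - (now - self._starts[0]))
            now = self._clock()
            self._starts.popleft()
        self._starts.append(now)
```

Calling `throttle.wait()` before every request keeps the client under the limit without the caller having to track timestamps. The clock and sleep functions are injectable only so that the behaviour can be tested without real delays.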
We’ve considered the Implicit flow authorization method, in which the user authorizes via a direct link through the VKontakte API (based on the OAuth protocol). Authorization by this method is performed through a VKontakte application identified by its ID. This is the most secure method of authorization.
I also prepared instructions on how to obtain an access_token. They are in the class description for now and will later be added to the Pattern documentation.
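The first steps of that instruction can be sketched in a few lines. This is only an illustration, not the code in the VK class: the function names are hypothetical, the `scope` and `api_version` defaults are placeholder assumptions, and the endpoint and parameter names follow VK's public OAuth documentation as far as I know it.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

AUTHORIZE_URL = "https://oauth.vk.com/authorize"
REDIRECT_URI = "https://oauth.vk.com/blank.html"


def build_authorize_url(app_id, scope="wall,offline", api_version="5.80"):
    """Returns the link the user opens in a browser to grant access.

    `scope` and `api_version` are placeholder values (assumptions).
    """
    params = {
        "client_id": app_id,       # ID of the VKontakte application
        "display": "page",
        "redirect_uri": REDIRECT_URI,
        "scope": scope,
        "response_type": "token",  # "token" selects Implicit flow
        "v": api_version,
    }
    return AUTHORIZE_URL + "?" + urlencode(params)


def extract_access_token(redirected_url):
    """Reads access_token from the URL fragment after the redirect."""
    fragment = urlsplit(redirected_url).fragment
    tokens = parse_qs(fragment).get("access_token")
    return tokens[0] if tokens else None
```

The user opens the link, confirms access, and is redirected to a blank page whose address carries the token after the `#` sign. Copying that address into `extract_access_token` returns the token, or `None` if authorization was refused.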
2. Testing the most important parts
This covered Twitter, SVM, Datasheet and DOM (the HTML parser in pattern.web), and involved adding new tests and fixing bugs.
3. Module for the Russian language
The structure of the module for the Russian language was created and part of the functionality was implemented. The necessary data for the module was also collected:
- Named Entities List
- Frequency Dictionary
- Part of Speech Wordlist
- Spelling List
The following functionality was implemented: 1) Parser for part-of-speech tagging 2) Spellchecker
The new methods will be gradually added.
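To show how a frequency dictionary can drive a spellchecker, here is a small sketch in the spirit of Peter Norvig's well-known spelling corrector. It is not necessarily how the Russian module works: the `suggest` and `edits1` names are only for illustration, and the tiny dictionary below, including its counts, is made up.

```python
ALPHABET = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"

# Toy frequency dictionary; the words and counts are invented for illustration.
FREQUENCIES = {"привет": 120, "привёл": 15, "молоко": 80, "город": 200}


def edits1(word):
    """All strings one edit (delete, swap, replace, insert) away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    swaps = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + swaps + replaces + inserts)


def suggest(word, frequencies=FREQUENCIES):
    """Returns the most frequent known word within one edit of `word`."""
    word = word.lower()
    if word in frequencies:
        return word  # already a known word
    candidates = [w for w in edits1(word) if w in frequencies]
    if not candidates:
        return word  # nothing better found, leave the word unchanged
    return max(candidates, key=frequencies.get)
```

With this toy data, `suggest("превет")` picks "привет" and `suggest("молко")` picks "молоко". A real frequency dictionary makes the same approach choose the more common word when several candidates are one edit away.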
4. Collecting political Russian-language tweets and posts from Twitter and VKontakte for sentiment analysis.
There is a lot of data from Twitter and other social networks in English, German, Arabic and other languages but nothing in Russian. So, we want to collect a Russian Twitter/VKontakte dataset, particularly Russian political debate tweets. Then we can compare it with the data that we already have and determine general political and social trends.
The main idea is to use Yandex Toloka for dataset creation. It is a crowdsourcing platform that helps developers and researchers perform various tasks, including labeling texts. Creating a task on this platform is not trivial, and the task should be formulated as accurately as possible, but with the help of the assessors it is possible to get the result more quickly than by doing the labeling ourselves.
5. Prepare Pattern 3 for release.
It was important that all tests running through Travis CI were executed without errors. Now all tests pass. In the coming weeks we are going to release Pattern Py3.
During my second GSoC term I focused on finishing various parts of the Twitter Deescalation bot and on extending the Seed module.
Created a dataset of several thousand tweets for both topic prediction (keyword labeled and checked for correctness) and anger/participation classification (manually labeled)
Improved and tested neural network models used on said dataset
Did some pretotyping work with the bot - participating in online discussions and sort of impersonating an ideal version of the bot to see what the bot will have to deal with "in the real world" - the logs are from "bot's perspective" and closely follow the actual execution of the bot
Transformed Seed into an NPM module
Wrote up some documentation for using Seed as an NPM module
Almost finished implementing the conditional generation, still need to do a bit of work on connecting all the outputs and do some testing for correctness of the solution
The next focus for this project will be:
Finishing the conditional generation for Seed
Reworking the collected dataset a bit (it turns out that there were too many classes for too little data, which made the test set accuracy plateau at about 60% even with heavy regularization) - I collected more data for a smaller number of classes and am hand labeling it right now
Testing and improving the bot in the real world
Retrospectively rewriting the original Seed repository to use Seed as a Node module, and adding the ability to easily create Twitter bots from the Seed website https://seed.emrg.be/
This post reflects upon some of the milestones achieved in GSoC 2018's Phase two.
Phase 2 mainly concentrated on expanding the rare entity detection pipeline, adding the generalization features and making the system being built more accessible. The following features were successfully implemented:
Built a custom TF-IDF system to help recognize rare entities. The TF-IDF system saves the intermediate token counts, so that whenever a new document/text snippet is added to the knowledge base, the TF-IDF scores for all the tokens do not have to be recalculated from scratch. The stored counts are loaded and incremented, and the relevant scores are calculated.
Implemented the "Part Holonym" based generalization feature. This feature relies on lexical databases like WordNet to extract part holonyms. This generalizes tokens to their lexical supersets. For example: London gets generalized to England and Beijing to China at level one generalization, and to Europe and Asia respectively at level two generalization. The user is given the option of choosing the generalization level for each attribute.
Implemented the "Word Vector" based generalization feature. This finds the nearest vector space neighbour of a token in pretrained embeddings like GloVe and replaces the token with it. For example: India gets replaced with Pakistan.
Implemented a general anonymization RESTful API. This gives people the option to utilize our system across different tech stacks.
Implemented a token-level RESTful API. This API endpoint gives token-level information, including the original token, the replaced token, the entity type and the anonymization type.
The API utilizes Django's token based authentication system. Implemented a dashboard to manage the authentication tokens for the users.
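The idea behind the TF-IDF feature listed above can be sketched as follows. This is a simplified illustration rather than the project's code: the class and method names are hypothetical, and the smoothed IDF formula is one common variant chosen here as an assumption.

```python
import json
import math
from collections import Counter


class IncrementalTfidf:
    """Keeps document-frequency counts so new texts only increment them."""

    def __init__(self):
        self.doc_count = 0
        self.doc_freq = Counter()  # token -> number of documents containing it

    def add_document(self, tokens):
        """Adds one tokenized document to the stored counts."""
        self.doc_count += 1
        self.doc_freq.update(set(tokens))  # count each token once per document

    def idf(self, token):
        # Smoothed inverse document frequency (an assumed variant).
        return math.log((1 + self.doc_count) / (1 + self.doc_freq[token])) + 1.0

    def scores(self, tokens):
        """TF-IDF score of every distinct token in one document."""
        counts = Counter(tokens)
        total = len(tokens)
        return {t: (c / total) * self.idf(t) for t, c in counts.items()}

    def to_json(self):
        """Serializes the intermediate counts so they can be stored."""
        return json.dumps({"doc_count": self.doc_count,
                           "doc_freq": dict(self.doc_freq)})

    @classmethod
    def from_json(cls, text):
        """Restores an instance from previously stored counts."""
        data = json.loads(text)
        instance = cls()
        instance.doc_count = data["doc_count"]
        instance.doc_freq = Counter(data["doc_freq"])
        return instance
```

Because only the counts are stored, adding a document costs one pass over its tokens, and scores are computed on demand from the current counts. Tokens that appear in few documents get a higher IDF, which is what makes rare entities stand out.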
Some of the major things planned for the 3rd and final phase are:
Code cleanup: As the project progressed, some of the code has become redundant and needs to be removed.
Documentation: While the code is well commented and easy to understand, the project currently lacks thorough external documentation. A quick usage guide for non-programmer end users also could be helpful.
A simple scaffolding system for the user. The system currently ships without any predefined configurations (including entities, aliases etc). Script(s) which can quickly set up a ready-to-use system with certain default values (including pre-defined attribute actions, threshold values etc) would be useful.
GUI based and API based file upload system. The user currently has to paste plaintext in the GUI or set it as a parameter in the API. The option to directly upload text files will increase user convenience.
Experiment with language localization. The system currently works well with the English language, but it needs to be tried out with other languages.
Picture 1: The Token level API in action
In the second phase of GSoC, I continued annotating political tweets and corrected some typos in the dataset. I created a more varied corpus by collecting tweets related to American, British, Canadian, and Australian socio-political affairs (I am also collecting New Zealand tweets but they are really rare). As regards the annotation guidelines, I improved the document stylistically and added relevant examples to each section. I also created a short appendix containing the most important politicians and their party affiliations, so as to facilitate future annotations.
As for the dataset itself, I am happy to announce that there were far more idioms and proverbs than in the previous stage. The following list presents the top ten most frequent hashtags extracted from the tweets (the figures in brackets represent the relative frequency of respective hashtags):
1. #Brexit (3.93)
2. #TrudeauMustGo (3.23)
3. #JustinTrudeau (3.07)
4. #MAGA (2.99)
5. #Tories (2.53)
6. #Drumpf (2.23)
7. #Corbyn (2.19)
8. #Labour (2.08)
9. #Tory (1.98)
10. #ImpeachTrump (1.73)
Our core set of hashtags (balanced with respect to political bias) was as follows: #MAGA, #GOP, #resist, #ImpeachTrump, #Brexit, #Labour, #Tory, #TheresaMay, #Corbyn, #UKIP, #auspol, #PaulineHanson, #Turnbull, #nzpol, #canpoli, #cpc, #NDP, #JustinTrudeau, #TrudeauMustGo, #MCGA. Many more hashtags are being used but they usually yield fewer results than the above set.
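For readers who want to compute a similar ranking on their own data, here is a minimal sketch. It assumes that relative frequency means the percentage of all hashtag occurrences in the corpus, which may differ from the definition behind the figures above, and the `hashtag_frequencies` name is only for illustration.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#\w+")


def hashtag_frequencies(tweets, top=10):
    """Returns the `top` hashtags with their share (in %) of all hashtag uses.

    Hashtags are counted as written, so "#Tory" and "#tory" stay separate.
    """
    counts = Counter(tag for tweet in tweets for tag in HASHTAG.findall(tweet))
    total = sum(counts.values())
    if total == 0:
        return []
    return [(tag, round(100.0 * n / total, 2))
            for tag, n in counts.most_common(top)]
```

Passing a list of tweet texts returns pairs such as `("#Brexit", 50.0)`, ordered from the most to the least frequent hashtag.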
Below are a few figures that aptly summarise the current shape of the corpus:
Left-wing bias: ca 55%
Male authors: ca 49%
Polarity: ca 44% negative, ca 47% neutral, ca 9% positive
Mood: ca 50% agitated, ca 21% sarcasm, ca 13% anger, ca 9% neutral, ca 4% joy
Offensive language: present in approximately 17% of all tweets
Swearing by gender: ca 53% males
Speech acts: ca 76% assertive, ca 38% expressive, ca 10% directive, ca 3% commissive, 0.2% metalocutionary
In the third stage I will continue annotating political tweets and write a comprehensive report about the task. My mentors have also kindly suggested that they could hire another student to provide additional judgments on the subjective categories (especially polarity and mood). Having more annotators will undoubtedly make the dataset a more valuable resource.