The write-ups for Phase 1 of GSoC 2018 are ready! Check out the progress reports of our talented team of coders: Alexander Rossa, Maja Gwozdz, Rudresh Panchal and Maksim Filin.
GSoC 2018 Reports
In the first phase of GSoC 2018, I started annotating political tweets. The corpus includes, for instance, tweets related to US, Canadian, UK and Australian politics and current social affairs. Each entry in the database is annotated with information about the author, their gender, the political bias, the polarity of the tweet (on a discrete scale: -1 for a negative utterance, 0 for a neutral one, 1 for a positive one), speech acts, the mood of the tweet (for instance, sarcasm or anger), any swear words or offensive language, and the keywords, that is, the concrete parts of the tweet that led to the polarity judgment.
In order to obtain the relevant political tweets, I used Grasp and a list of popular political hashtags (to mention but a few: #MAGA, #TrudeauMustGo, #auspoli, #Brexit, #canpoli, #TheresaMay). I also prepared the annotation guidelines, so that other people interested in the project could offer their own judgment and provide additional annotations. Having more judgments will render the corpus more valuable. In the next stage of GSoC, I hope to have enough judgments from other people to estimate the agreement score and arrive at (more) objective scores.
The database is currently available as a Google Sheet: this is a relatively easy way to store the data that also allows for parallel annotation.
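To make the annotation schema concrete, here is a hypothetical example of one annotated entry represented as a Python dict. The field names and values are illustrative, not the actual column headers of the sheet:

```python
# One hypothetical annotated tweet, following the fields described above:
# author, gender, political bias, polarity, speech act, mood, offensive
# language, and the keywords that motivated the polarity judgment.
tweet = {
    "author": "@example_user",   # hypothetical handle
    "gender": "f",
    "bias": "left",
    "polarity": -1,              # -1 negative, 0 neutral, 1 positive
    "speech_act": "assertion",
    "mood": "sarcasm",
    "offensive": False,
    "keywords": ["must go"],     # fragment that led to the score
}

assert tweet["polarity"] in (-1, 0, 1)
```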
With the first coding phase of GSoC 2018 coming to an end, this post reflects upon some of the milestones achieved in the past month.
I first worked on finalizing the architecture of the text anonymization system. The system is being built with the European Union's General Data Protection Regulation (GDPR) in mind and seeks to offer a seamless solution to a company's text anonymization needs. Most existing GDPR solutions focus on anonymizing database entries, not plain-text snippets.
My system pipeline consists of two principal components.
Entity recognition: The entity is recognized using various approaches, including Named Entity Recognition (implemented), Regular Expression based patterns (implemented) and TF-IDF based scores (to be implemented in Phase 2).
Subsequent action: Once the entity is recognized, the system looks up the configuration mapped to that particular attribute and carries out one of the following actions to anonymize the data:
Suppression (implemented).
Deletion (implemented).
Generalization (to be implemented in Phase 2).
The methods to generalize the attribute include a novel word vector based generalization and extraction of part holonyms.
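The word-vector idea can be illustrated with a minimal sketch: replace an entity with the candidate generic term whose embedding lies closest to it. The tiny 3-dimensional vectors below are made up for illustration; the real system would use pretrained embeddings:

```python
import math

# Toy word vectors (hypothetical 3-d embeddings, for illustration only).
VECTORS = {
    "London": [0.9, 0.1, 0.2],
    "Paris":  [0.8, 0.2, 0.1],
    "city":   [0.7, 0.3, 0.3],
    "banana": [0.0, 0.9, 0.8],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def generalize(word, candidates=("city", "banana")):
    """Pick the candidate generic term closest to `word` in vector space."""
    vec = VECTORS[word]
    return max(candidates, key=lambda c: cosine(vec, VECTORS[c]))

print(generalize("London"))  # → city
```

The same interface could be backed by WordNet part holonyms instead (replacing a word with the larger whole it belongs to), which is the second generalization method mentioned above.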
Some of the coding milestones achieved include:
Set up the coding environment for the development phase.
Set up the Django web app and the database.
Wrote a text pre-processing function, covering removal of illegal characters, tokenization, expansion of contractions, etc.
Wrote and integrated wrappers for the Stanford NER system, together with its entity replacement function.
Wrote and integrated wrappers for the spaCy NER system, together with its entity replacement function.
Wrote the suppression and deletion functions. Integrated the two with a DB lookup for configurations.
Wrote the Regular Expression based pattern search function.
Implemented the backend and frontend of the entire Django WebApp.
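The interplay of the regex pattern search with the per-attribute configuration lookup can be sketched as follows. The patterns and configuration values are illustrative; the real system stores these per attribute in the database:

```python
import re

# Illustrative regex patterns; the real system stores them per attribute.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

# Hypothetical per-attribute configuration: "suppress" keeps a placeholder,
# "delete" removes the match entirely.
CONFIG = {"email": "suppress", "phone": "delete"}

def anonymize(text):
    """Replace or delete every configured pattern found in the text."""
    for attr, pattern in PATTERNS.items():
        action = CONFIG.get(attr, "suppress")
        repl = "<%s>" % attr.upper() if action == "suppress" else ""
        text = pattern.sub(repl, text)
    return text

print(anonymize("Mail me at jane.doe@example.com or call +1 555 123 4567."))
```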
Major things planned for Phase 2:
Implement a dual TF-IDF system: one scorer based on the documents the user has uploaded, and one trained on a larger, external corpus.
Implement a word vector closest neighbor based generalization.
Implement the holonym lookup and extraction functions.
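The intuition behind the TF-IDF scorers is that a token frequent in the user's document but rare in the background corpus is likely identifying, and therefore a candidate for anonymization. A minimal self-contained sketch of such a scorer (either of the two planned systems would follow this shape, differing only in the background corpus used):

```python
import math
from collections import Counter

def tfidf_scores(doc_tokens, corpus):
    """Score each token of one document against a background corpus:
    frequent-in-document but rare-in-corpus tokens score high."""
    tf = Counter(doc_tokens)
    n_docs = len(corpus)
    return {
        tok: (count / len(doc_tokens))
             * math.log((1 + n_docs) / (1 + sum(tok in d for d in corpus)))
        for tok, count in tf.items()
    }

scores = tfidf_scores(
    ["acme", "report", "the"],
    [["the", "report"], ["the", "memo"]],
)
# "acme" never occurs in the background corpus, so it outscores "the".
```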
Picture 1: The user dashboard, which allows users to add new attribute configurations, modify existing ones, add aliases for the NER lookup, add regex patterns, and carry out text anonymization.
Picture 2: Shows the text anonymization system in action. The various entities and regex patterns were recognized and replaced as per the configuration.
During the first month of Google Summer of Code, I have been working on the following tasks:
1. Refactoring the social media Twitter API.
2. Trying all framework examples and fixing bugs and errors.
The project in the master branch currently supports only Python 2, but we want to add support for Python 3, which requires refactoring the code in some places. Most of these errors have now been fixed.
3. Compiling libsvm and liblinear binaries for macOS, Ubuntu and Windows, and adding them to pattern so that pattern.vector works out of the box. SVM examples now work on all platforms.
4. Looking at failing Travis CI unit tests.
There were errors in TestParser.test_find_keywords and TestClassifier.test_igtree. The problem was a wrong interpretation of the test examples; those tests have now been rewritten. There are still occasional errors in the search engine tests because of API licences and permissions.
5. Working on VK API.
We think it would be great to add an API wrapper for VKontakte, the biggest Russian social network. It is used in many countries and supports more than 50 languages. It also has a very versatile API, so it is a good fit for pattern. The structure for working with the API has already been created.
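A minimal sketch of what such a wrapper could look like, using only the standard library; the method name, API version and error handling below are illustrative, not the actual pattern integration:

```python
import json
import urllib.parse
import urllib.request

VK_API = "https://api.vk.com/method/"
VERSION = "5.131"  # API version (illustrative)

def build_vk_url(method, access_token, **params):
    """Build the request URL for a VK API method call."""
    params.update(v=VERSION, access_token=access_token)
    return VK_API + method + "?" + urllib.parse.urlencode(params)

def vk_call(method, access_token, **params):
    """Call a VK API method and return its `response` payload."""
    with urllib.request.urlopen(build_vk_url(method, access_token, **params)) as r:
        data = json.load(r)
    if "error" in data:
        raise RuntimeError(data["error"].get("error_msg", "VK API error"))
    return data["response"]

# Example (requires a valid access token):
# posts = vk_call("wall.get", token, owner_id=1, count=10)
```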
In the next part of Google Summer of Code we will focus on the Python 3 release of Pattern and on testing. We will also try to implement the pattern.ru module.
There are two distinct focal points in my GSoC work. The first is a functional Twitter bot, achieved by a pipeline consisting of a Python machine learning backend for tweet analysis, a Node.js frontend for accessing the Twitter API, and an integration of Seed into the frontend for templated procedural text generation. The other is extending the capabilities of Seed, turning it into a Node.js module (available through npm) and adding conditional generation.
The work I have done during my first month gravitated largely around the first task. The tasks that were completed include:
Node.js frontend for working with the Twitter API: retrieving and reacting to tweets, sending back responses, etc.
Python ML backend for Topic Analysis and Sentiment Analysis of Tweets
API for Python backend so that it can be accessed as a microservice rather than one monolithic deployment bundled with Node.js frontend
Seed miniprogram for generating templated responses. This does not yet have conditional generation enabled and serves more like a testing rig for trying to generate text conforming to non-violent communication principles.
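The contract between the frontend and the Python microservice can be sketched as a single JSON-in, JSON-out handler. The endpoint shape and field names below are hypothetical, and the keyword rules are stubs standing in for the real topic and sentiment models:

```python
import json

# Stub rules standing in for the real ML models (illustrative only).
TOPIC_KEYWORDS = {"brexit": "politics", "election": "politics"}
NEGATIVE_WORDS = {"hate", "awful", "terrible"}

def analyze(request_body):
    """Handle one request to a hypothetical /analyze endpoint:
    accept a JSON-encoded tweet, return topic and sentiment labels."""
    text = json.loads(request_body)["text"].lower()
    topic = next((t for w, t in TOPIC_KEYWORDS.items() if w in text), "other")
    sentiment = "negative" if any(w in text for w in NEGATIVE_WORDS) else "positive"
    return json.dumps({"topic": topic, "sentiment": sentiment})

print(analyze(json.dumps({"text": "Brexit is awful"})))
```

Keeping this behind an HTTP API lets the Node.js frontend stay ignorant of the Python tooling, which is the point of the microservice split.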
Apart from these tasks, I spent time exploring approaches for deployment and getting to grips with new technologies.
The next focus for this project will be:
Extending Seed with conditional generation and API for easier access
Extending Seed miniprogram with newly acquired conditional generation capabilities
Testing and improving the bot. This may include:
adding an ML module for paraphrasing (which will enable the bot to revisit topics mentioned in the conversation in a more natural way)
improving quality of Seed miniprogram generation capabilities (more templates for sentences, more randomness...)
adding new rules for participation (currently the bot is triggered by hashtags)
After the first two bullet points are completed, the work on the Twitter bot will essentially be done; what remains is improving the quality of the text it is able to produce.