GSOC Reports

Final Report: GDPR Anonymization Tool

  • GitHub Page
  • Installation Guide
  • Usage Guide
  • Live Demo


What is text anonymization?

Text anonymization refers to processing text to strip it of any attributes or identifiers, thereby hiding sensitive details and protecting the identity of the users involved.

Architecture

This system consists of two main components.

Sensitive Attribute Detection System

Before text can be anonymized, the sensitive attributes in the text which give away information have to be identified. We use two methods for this, which can be used in tandem or as standalone systems:

  1. Named Entity Recognition Based Detection: This relies on tagging of sensitive entities in the text. The user can set up different configurations for different entities, which determine how a given entity is anonymized. The available options are Deletion/Replacement, Suppression and Generalization. The system currently ships with spaCy's NER system, but it can easily be swapped for other NER models (see the sketch after this list).

  2. TF-IDF Based Rare Entity Detection: Certain sensitive attributes in text might not necessarily be tagged by the NER system. These sensitive tokens can be identified by the TF-IDF system. Term frequency–inverse document frequency identifies possible rare entities in text based on the distribution and occurrence of tokens across sample text snippets supplied by the user. Once the TF-IDF score threshold is set, tokens with scores above it are considered sensitive and anonymized.
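
The sketch below illustrates both detection routes: spaCy entity tagging checked against a per-entity action configuration, and a TF-IDF score threshold over user-supplied snippets. The configuration mapping, threshold value and sample snippets are illustrative assumptions, not the tool's shipped settings.

    # Minimal sketch of the two detection methods; configuration, threshold and
    # sample data are illustrative, not the tool's actual defaults.
    import spacy  # requires the 'en_core_web_sm' model to be installed
    from sklearn.feature_extraction.text import TfidfVectorizer

    nlp = spacy.load("en_core_web_sm")

    # Hypothetical per-entity configuration: entity label -> anonymization action.
    CONFIG = {"PERSON": "delete", "GPE": "generalize", "CARDINAL": "suppress"}

    def ner_detect(text):
        """Return (token, label, action) triples for configured entity types."""
        return [(ent.text, ent.label_, CONFIG[ent.label_])
                for ent in nlp(text).ents if ent.label_ in CONFIG]

    def tfidf_detect(snippets, threshold=0.5):
        """Flag tokens whose TF-IDF score in any snippet exceeds the threshold."""
        vectorizer = TfidfVectorizer()
        scores = vectorizer.fit_transform(snippets).toarray()
        vocab = vectorizer.get_feature_names_out()
        return {vocab[i] for row in scores for i, s in enumerate(row) if s > threshold}

    print(ner_detect("My name is John Doe and I live in Beijing."))
    print(tfidf_detect(["The invoice was sent to the client.",
                        "The client works at Acme Biotech."]))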

Sensitive Attribute Anonymization System

Once the sensitive attributes/tokens are detected, they need to be anonymized depending on the kind of token they are. The user can set different anonymization actions for different tokens. The currently available options are listed below (a code sketch of these actions follows the list):

  1. Deletion/Replacement: To be used in cases where retaining even a part of the attribute through the other anonymization methods is not appropriate. Completely replaces the attribute with a pre-set replacement. Example: My name is John Doe would be replaced by My name is <Name>.

  2. Suppression: To be used when hiding part of the information is enough to protect the user's anonymity. The user can supply the percentage or the number of characters they want suppressed. Example: My phone number is 9876543210 would be replaced by My phone number is 98765***** if the user chooses 50% suppression.

  3. Generalization: To be used when the entity is sensitive enough to need anonymization but can still be partially retained to provide information. The system has two methods of carrying out generalization:

    • Word Vector Based: In this option, the nearest neighbour of the word in the vector space is used to generalize the attribute. Example: I live in India gets generalized to I live in Pakistan. This method, while completely changing the word, largely retains vector space information useful in most NLP and text processing tasks.

    • Part Holonym Based: In this option, the system parses the WordNet lexical database to extract part holonyms. This method works exceptionally well with geographical entities, and the user can choose the level of generalization. Example: I live in Beijing gets generalized to I live in China at level 1 and to I live in Asia at level 2.
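
Below is a rough sketch of the suppression and generalization actions described above, assuming NLTK's WordNet data has been downloaded and using a small GloVe model fetched through gensim; the masking character, rounding rule and model name are assumptions, not the tool's actual implementation.

    # Rough sketch of suppression and the two generalization strategies.
    # Assumes NLTK's WordNet data (nltk.download('wordnet')) is available and
    # downloads a small GloVe model via gensim on first use.
    import math
    import gensim.downloader
    from nltk.corpus import wordnet as wn

    def suppress(value, percent=50, mask="*"):
        """Mask the trailing `percent` of characters in `value`."""
        keep = len(value) - math.ceil(len(value) * percent / 100)
        return value[:keep] + mask * (len(value) - keep)

    def holonym_generalize(word, level=1):
        """Climb `level` part-holonym steps in WordNet (Beijing -> China -> Asia)."""
        synsets = wn.synsets(word)
        for _ in range(level):
            if not synsets or not synsets[0].part_holonyms():
                break
            synsets = synsets[0].part_holonyms()
        return synsets[0].lemma_names()[0] if synsets else word

    def vector_generalize(word, vectors):
        """Replace a word with its nearest neighbour in the embedding space."""
        return vectors.most_similar(word, topn=1)[0][0]

    print(suppress("9876543210", percent=50))        # -> 98765*****
    print(holonym_generalize("Beijing", level=2))    # -> e.g. Asia
    vectors = gensim.downloader.load("glove-wiki-gigaword-50")
    print(vector_generalize("india", vectors))       # -> a nearby word, e.g. pakistan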

Final Report: Pattern 3

Project overview

During Google Summer of Code I focused on the development of the Pattern 3 framework. Pattern consists of many modules which help users work with web data, use machine learning algorithms, apply natural language processing techniques and much more. The main task was to complete the port of Pattern to Python 3 and refactor the code. Fixing bugs was also important, and as a result all tests in the automatic testing system Travis CI now execute successfully. New functions and modules have also been added: the pattern.ru module allows users to work with Russian texts, and the pattern.web module was improved by adding the VKontakte API class.

Main completed tasks

  • Compiling libsvm and liblinear binaries for macOS, Ubuntu and Windows and adding them to Pattern so that pattern.vector works out of the box.
  • Refactoring social media Twitter API.
  • Testing all modules, fixing bugs and failing Travis CI tests.
  • Creating the VKontakte API class, which allows users to get information from the biggest Russian social network. With it you can retrieve a user's profile description and profile picture, the user's posts from the profile wall, and posts from the newsfeed for a search keyword.
  • Creating the pattern.ru module and collecting the necessary data: Named Entities List, Frequency Dictionaries, Part of Speech Wordlist, Spelling List. A parser for part-of-speech tagging and a spellchecker are now available for Russian.
  • Pattern Python 3 Release

Future Work

There are many opportunities to continue improving the Pattern framework and introduce new functionality. For example, the web mining module can be extended with further features that help users analyze the data collected from social media. It is also important to add sentiment analysis to the pattern.ru module.

While working on the Pattern Python 3 release I have been collecting political tweets from Twitter and posts from VKontakte to build a large dataset for analyzing political debate tweets and political discussions. When the collection process is complete, the dataset will be made available to researchers.

Final Report: De-escalation Twitter Bot

Project Overview

My Google Summer of Code project aimed to create a Twitter bot that would be able to participate in debates around Twitter, find tweets and debates that are overly angry and try to calm them down. While creating this bot, it was also my task to take the existing Seed application, transform it into a Node module and add some capabilities to it. This secondary aim tied in nicely with the main focus, as Seed was used for response generation. All of the aims of the project were fulfilled, but there is still potential for future work extending the bot's capabilities.

Twitter Bot

The Twitter bot is a mixture of a Node.js frontend and a Python backend. The backend handles the analysis of tweets using pretrained neural network models. It is implemented as a microservice, served by a Flask server that routes requests to the relevant analysers. Apart from sending tweet analysis requests, the frontend handles connecting to Twitter for reading and sending tweets and connecting to Seed for generating responses. A detailed description of all of its parts can be found in the official GSoC repo.
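
As an illustration of this split, here is a minimal sketch of a Flask microservice of the kind described; the route name and the placeholder classifiers are assumptions, the real analysers and routes live in the official GSoC repo.

    # Minimal sketch of a Flask microservice in the spirit of the backend above;
    # placeholder functions stand in for the pretrained neural network models.
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    def predict_anger(text):
        # Placeholder standing in for the pretrained anger classifier.
        return 0.0

    def predict_topic(text):
        # Placeholder standing in for the pretrained topic classifier.
        return "unknown"

    @app.route("/analyse", methods=["POST"])
    def analyse():
        text = request.get_json()["text"]
        return jsonify({"anger": predict_anger(text), "topic": predict_topic(text)})

    if __name__ == "__main__":
        app.run(port=5000)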

Dataset

Since the task at hand is quite specific, there was a need for custom-made datasets. There are two datasets in total, one for each analysis task. final_dataset.csv contains about 7000 tweets, downloaded, cleaned and annotated by me, with columns for text, topic and anger level. anger_dataset.csv contains over 5000 tweets, split 50/50 into angry and non-angry ones; these are partly outsourced and partly drawn from final_dataset. The code for creating the datasets is in the GSoC repository, along with both datasets.
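
A quick way to inspect the datasets with pandas, assuming the column layout follows the description above (the exact headers in the repository may differ):

    # Quick inspection of the two datasets; column names are assumptions based
    # on the description above.
    import pandas as pd

    tweets = pd.read_csv("final_dataset.csv")    # ~7000 tweets: text, topic, anger level
    anger = pd.read_csv("anger_dataset.csv")     # 5000+ tweets: angry vs. non-angry

    print(tweets.columns.tolist())
    print(tweets.head())
    print(anger.iloc[:, -1].value_counts())      # distribution of the (assumed) label column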

Context

There are many approaches to tackling de-escalation in online settings. In this project we decided to follow the rules of non-violent communication and conflict resolution. This type of communication aims for a calm discourse with the other party, regardless of their manner of communication. The aim is to shift focus from the sentiment (anger, hate, etc.) onto the real source of the sentiment. It is characterized by honest attempts to understand what the other party feels and why. This can be achieved, for example, by posing questions about the topic, giving the other party space to express themselves, arriving at the root of the actual problem and so on. The ability to analyse the topic of a tweet, together with access to the historic anger levels of the conversation, makes this Twitter bot quite adept at carrying out tasks like these. With the use of Seed, the generated text is random and human-like enough that the 'botness' of the bot does not pose as big of a problem.

Pretotyping

Pretotyping is a type of pre-release product testing which focuses on making sure the final product is worth making, rather than on how to make it. Before releasing the bot into the wild waters of the internet, I ran a number of sessions on a Twitter account called OpenDiscussion during which I acted like a bot myself, looking for angry tweets to respond to, taking detailed notes of each step and 'generating' responses. This was done as a way of actually seeing how people would react to our bot, what sort of interaction makes them even more angry and which actually seems to help. It also benefited the creation of the bot by making us aware of many pitfalls the bot would have to deal with. The notes can be seen in the pretotyping subfolder of the official repo.

All in all, the pretotyping sessions were a success. Although the response rates were not that great (people mostly did not respond) this is not an issue for a bot. When the people did respond, however, in the majority of cases, the anger levels dropped or disappeared completely. Being a bot for a minute, trying to act without any preconceived notions or opinions that I have, I could see that the biggest flaws of a bot when leading discussions with humans (no preconceived notions or opinions, does not experience anger, does not get offended) are also its biggest strengths. By not getting angry itself, the bot automatically takes down the anger levels of discussion, making it more civil and calm.

Seed Improvements

Over the course of this project, the original Seed application has been made into a Node module, which is now available as seedtext at the NPM repository. Here is the GitHub repository of seedtext. It has also been extended with capabilities of conditional generation, mentioned in this GitHub issue of the original Seed app, and with the possibility to define custom methods for importing Seed sketches or using built-in methods.

Seed by default generates totally random variations of the text. Through conditional generation, this randomness can be controlled. The Twitter bot uses this functionality to vary its wording and tone depending on the range of the anger level (using words like 'upset' at anger levels between 0.5 and 0.6 and words like 'enraged' at anger levels above 0.9). This makes the generated text much more subtle and human-like.
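
The real logic is expressed in seedtext sketches on the Node.js side; the following Python snippet only mirrors the thresholding idea, with made-up cut-offs and vocabulary.

    # Illustrative only: maps an anger score in [0, 1] to a word choice, in the
    # spirit of the conditional generation described above.
    def tone_word(anger_level):
        if anger_level > 0.9:
            return "enraged"
        if anger_level > 0.5:
            return "upset"
        return "concerned"

    print(f"You seem {tone_word(0.93)} about this.")   # -> You seem enraged about this.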

Changes to the importing of sketches and the transformation into a Node module make Seed available to a much wider audience. Previously, the capabilities of the application were tied to the web environment; now it can be used virtually anywhere.

Future Work

There are several possible ways in which my summer work on this GSoC project could be extended. One of the possible ways would be by extending the Twitter bot capabilities. Currently, the bot lacks a comprehensive database solution for participating in true debates. There is also the possibility of improving the Seed source files, making the generated text better. Improving the capabilities of the analysis backend and the neural networks in it is also a possibility. At the moment the test set accuracy for topic analysis is around 81%. With a bigger and better dataset, this could easily go higher.

I already talked with my supervisor about possible ways in which I could collaborate on the project in the future. We talked about rewriting the original Seed repository so that it uses the newly made seedtext module and adding the possibility of creating bots with the push of a button to the web application that the original Seed currently has. I am looking forward to helping make these plans reality.

Acknowledgements

At the end of this report, I would like to express my thanks to the whole team of people from CLiPS and beyond whom I had the honour to meet this summer. Special shoutout goes to my supervisor, Frederik, who has been really great over the whole duration of the internship. I am also happy that I got the chance to meet other GSoC students this year and I hope we'll stay in touch in the future. Last, but not least, I would like to thank Google for making GSoC possible, it was a wonderful experience.

GSOC 2018 Phase 2 Reports

  • Posted on: 24 July 2018
  • By: Guy

The progress reports for Phase 2 of GSOC 2018 are in! Alexander Rossa, Maja Gwozdz, Rudresh Panchal and Максим Филин have documented their work and are ready for the final phase!

Phase 2 - Report: Максим Филин

During the second month of Google Summer of Code, I continued working with Pattern 3 Framework and focused on the following tasks:

1. VKontakte API

I’ve implemented several methods which allow users to get information from the VK social network.

Possibilities:

  1. retrieving a user's profile description and profile picture
  2. retrieving a user's posts from the profile wall
  3. retrieving posts from the newsfeed for a search keyword

I’ve taken the request limitations into account and added the corresponding rules to the VK class.

We’ve settled on the authorization method of the VKontakte API known as Implicit Flow, in which the user authorizes via a direct link (based on the OAuth protocol). Authorization with this method is performed through a VKontakte application specified by its ID. This is the most secure method of authorization.

Instructions on how to obtain an access_token have also been prepared. They are currently in the class description and will later be added to the Pattern documentation.
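
A purely hypothetical usage sketch, assuming the new class follows the conventions of the other pattern.web APIs; the import path and method names below are placeholders, the real ones are given in the class description.

    # Hypothetical sketch only: class and method names are placeholders, not the
    # actual pattern.web API added in this project.
    from pattern.web import VK          # assumed import path

    vk = VK(access_token="...")         # token obtained via the Implicit Flow
    profile = vk.profile("some_user")   # profile description and picture
    wall = vk.posts("some_user")        # posts from the profile wall
    news = vk.search("elections")       # newsfeed posts for a search keyword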

2. Testing the most important parts

I tested Twitter, SVM, Datasheet and DOM (the HTML parser in pattern.web), added new tests and fixed bugs.

3. pattern.ru

The structure of the module for the Russian language was created and part of the functionality was implemented. The necessary data for the module was also collected:

  • Named Entities List
  • Frequency Dictionary
  • Part of Speech Wordlist
  • Spelling List

The following functionality was implemented:

  1. Parser for part-of-speech tagging
  2. Spellchecker

The new methods will be gradually added.
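
A hypothetical usage sketch, assuming pattern.ru mirrors the pattern.en interface (parse for tagging, suggest for spelling); the actual function names are defined by the new module and may differ.

    # Hypothetical sketch only: assumes pattern.ru follows pattern.en conventions.
    from pattern.ru import parse, suggest

    print(parse("Я живу в Москве"))   # part-of-speech tagged output
    print(suggest("превет"))          # spelling suggestions, e.g. "привет"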

4. Collecting Russian political sentiment tweets and posts from Twitter and VKontakte

There is a lot of data from Twitter and other social networks in English, German, Arabic and other languages, but almost nothing in Russian. So we want to collect a Russian Twitter/VKontakte dataset, particularly Russian political debate tweets. Then we can compare it with the data we already have and determine general political and social trends.

The main idea is to use Yandex Toloka for dataset creation. It is a crowdsourcing platform which helps developers and researchers perform various tasks, including labelling texts. Creating a task on this platform is not trivial and it has to be formulated as accurately as possible, but with the help of the assessors results can be obtained much more quickly than by working alone.

5. Prepare Pattern 3 for release.

It was important that all tests run through Travis CI execute without errors. Now all tests pass. In the following weeks we are going to make the Pattern Python 3 release.

Phase 2 - Report: Alexander Rossa

During my second GSoC term I was focusing on finishing various parts of the Twitter Deescalation bot and on extending the Seed module.

Twitter bot:

  • Created datasets of several thousand tweets for both topic prediction (keyword-labeled and checked for correctness) and anger/participation classification (manually labeled)

  • Improved and tested neural network models used on said dataset

  • Did some pretotyping work with the bot: participating in online discussions and impersonating an ideal version of the bot to see what it will have to deal with "in the real world". The logs are written from the "bot's perspective" and closely follow how the actual bot would operate

Seed module:

  • Transformed Seed into an NPM module

  • Wrote up some documentation for using Seed as an NPM module

  • Almost finished implementing conditional generation; I still need to do a bit of work on connecting all the outputs and some testing for correctness of the solution

The next focus for this project will be:

  • Finishing the conditional generation for Seed

  • Reworking the collected dataset a bit (it turns out that there were too many classes for too little data, which plateaued the test set accuracy at about 60% even with heavy regularization). I have collected more data for a smaller number of classes and am hand-labeling it right now

  • Testing and improving the bot in the real world

  • Rewriting the original Seed repository to use Seed as a Node module instead, and adding the ability to easily create Twitter bots from the Seed website https://seed.emrg.be/

Phase 2 - Report: Rudresh Panchal

This post reflects upon some of the milestones achieved in GSoC 2018's Phase two.

Phase 2 mainly concentrated on expanding the rare entity detection pipeline, adding the generalization features and increasing the accessibility of the system being built. The following features were successfully implemented:

  • Built a custom TF-IDF system to help recognize rare entities. The TF-IDF system saves the intermediate token counts, so that whenever a new document or text snippet is added to the knowledge base, the TF-IDF scores for all tokens do not have to be recalculated; the stored counts are loaded, incremented and the relevant scores calculated (a simplified sketch of this incremental update follows this list).

  • Implemented the "Part Holonym" based generalization feature. This relies on lexical databases like WordNet to extract part holonyms and generalizes tokens to their lexical supersets. For example, London gets generalized to England and Beijing to China at level one generalization, and to Europe and Asia respectively at level two. The user is given the option of choosing the generalization level for each attribute.

  • Implemented the "Word Vector" based generalization feature. This finds the nearest vector-space neighbour of a token in pretrained embeddings like GloVe and replaces the token with it. For example, India gets replaced with Pakistan.

  • Implemented a general anonymization RESTful API. This gives people the option to utilize our system across different tech stacks.

  • Implemented a token-level RESTful API. This endpoint gives token-level information, including the original token, the replacement token, the entity type and the anonymization type.

  • The API uses Django's token-based authentication system. Implemented a dashboard to manage the users' authentication tokens.
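
A simplified sketch of the incremental TF-IDF idea from the first bullet above: raw document-frequency counts are stored, so adding a snippet only updates counts rather than recomputing everything. The storage format and scoring formula are simplified assumptions, not the project's actual code.

    # Simplified sketch of an incremental TF-IDF index: document-frequency counts
    # are stored and updated, so adding a snippet does not force a full recompute.
    import math
    from collections import Counter

    class IncrementalTfidf:
        def __init__(self):
            self.doc_freq = Counter()   # token -> number of documents containing it
            self.num_docs = 0

        def add_document(self, tokens):
            """Update the stored counts with one new document."""
            self.num_docs += 1
            self.doc_freq.update(set(tokens))

        def scores(self, tokens):
            """TF-IDF scores for one document, using the stored counts."""
            tf = Counter(tokens)
            return {t: (tf[t] / len(tokens)) * math.log(self.num_docs / self.doc_freq[t])
                    for t in tf if self.doc_freq[t]}

    index = IncrementalTfidf()
    index.add_document("my name is john doe".split())
    index.add_document("john works at acme biotech".split())
    print(index.scores("john works at acme biotech".split()))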

Some of the major things planned for the 3rd and final phase are:

  • Code cleanup: as the project progressed, some of the code became redundant and needs to be removed.

  • Documentation: while the code is well commented and easy to understand, the project currently lacks thorough external documentation. A quick usage guide for non-programmer end users could also be helpful.

  • A simple scaffolding system for the user. The system currently ships without any predefined configurations (entities, aliases, etc.). Script(s) which can quickly set up a ready-to-use system with sensible defaults (pre-defined attribute actions, threshold values, etc.) would be useful.

  • GUI based and API based file upload. The user currently has to paste plaintext into the GUI or set it as a parameter in the API. The option to directly upload text files will increase user convenience.

  • Experiment with language localization. The system currently works well with the English language, but it needs to be tried out with other languages.

Picture 1: The token-level API in action

Phase 2 - Report: Maja Gwozdz

In the second phase of GSoC, I continued annotating political tweets and corrected some typos in the dataset. I created a more varied corpus by collecting tweets related to American, British, Canadian, and Australian socio-political affairs (I am also collecting New Zealand tweets but they are really rare). As regards the annotation guidelines, I improved the document stylistically and added relevant examples to each section. I also created a short appendix containing the most important politicians and their party affiliations, so as to facilitate future annotations.

As for the dataset itself, I am happy to announce that there were far more idioms and proverbs than in the previous stage. The following list presents the top ten most frequent hashtags extracted from the tweets (the figures in brackets represent the relative frequency of respective hashtags):

1. #Brexit (3.93)

2. #TrudeauMustGo (3.23)

3. #JustinTrudeau (3.07)

4. #MAGA (2.99)

5. #Tories (2.53)

6. #Drumpf (2.23)

7. #Corbyn (2.19)

8. #Labour (2.08)

9. #Tory (1.98)

10. #ImpeachTrump (1.73)

Our core set of hashtags (balanced with respect to political bias) was as follows: #MAGA, #GOP, #resist, #ImpeachTrump, #Brexit, #Labour, #Tory, #TheresaMay, #Corbyn, #UKIP, #auspol, #PaulineHanson, #Turnbull, #nzpol, #canpoli, #cpc, #NDP, #JustinTrudeau, #TrudeauMustGo, #MCGA. Many more hashtags are being used but they usually yield fewer results than the above set.
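
As a rough illustration of how such relative frequencies can be derived from the corpus (the file and column names here are assumptions, not the actual corpus layout):

    # Rough illustration of computing relative hashtag frequencies.
    import re
    from collections import Counter

    import pandas as pd

    tweets = pd.read_csv("political_tweets.csv")            # hypothetical corpus export
    tags = Counter(tag.lower()
                   for text in tweets["text"].astype(str)   # 'text' column is assumed
                   for tag in re.findall(r"#\w+", text))

    total = sum(tags.values())
    for tag, count in tags.most_common(10):
        print(f"{tag}: {100 * count / total:.2f}")           # relative frequency in percent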

Below are a few figures that aptly summarise the current shape of the corpus:

Left-wing bias: ca 55%

Male authors: ca 49%

Polarity: ca 44% negative, ca 47% neutral, ca 9% positive

Mood: ca 50% agitated, ca 21% sarcasm, ca 13% anger, ca 9% neutral, ca 4% joy

Offensive language: present in approximately 17% of all tweets

Swearing by gender: ca 53% males

Speech acts: ca 76% assertive, ca 38% expressive, ca 10% directive, ca 3% commissive, 0.2% metalocutionary

In the third stage I will continue annotating political tweets and write a comprehensive report about the task. My Mentors have also kindly suggested that they could hire another student to provide additional judgments on the subjective categories (especially, polarity and mood). Having more annotators will undoubtedly make the dataset a more valuable resource.

GSOC 2018 Phase 1 Reports

  • Posted on: 29 June 2018
  • By: Guy

The write-ups for Phase 1 of GSOC 2018 are ready! Check out the progress reports of our talented team of coders: Alexander Rossa, Maja Gwozdz, Rudresh Panchal and Максим Филин.

Phase 1 - Report: Maja Gwozdz

In the first phase of GSoC 2018, I started annotating political tweets. The corpus of political tweets includes, for instance, tweets related to US, Canadian, UK and Australian politics and current social affairs. The categories in the database include information about the author and their gender, the political bias, the polarity of a given entry (on a discrete scale: -1 for a negative utterance, 0 for a neutral one, 1 for a positive one), speech acts, the mood of the tweet (for instance, sarcasm or anger), any swear words / offensive language, and the keywords, that is, the concrete parts of the tweet that led to the polarity judgment.

In order to obtain the relevant political tweets, I used Grasp and a list of popular political hashtags (to mention but a few: #MAGA, #TrudeauMustGo, #auspoli, #Brexit, #canpoli, #TheresaMay). I also prepared the annotation guidelines, so that other people interested in the project could offer their own judgment and provide additional annotations. Having more judgments will render the corpus more valuable. In the next stage of GSoC, I hope to have enough judgments from other people to estimate the agreement score and arrive at (more) objective scores.

The database is currently available as a Google Sheet --- this is a relatively easy way to store data and allow for parallel annotation.

Phase 1 - Report: Rudresh Panchal

With the first coding phase of GSoC 2018 coming to an end, this post reflects upon some of the milestones achieved in the past month.

I first worked on finalizing the architecture of the text anonymization system. The system is being built with the European Union's General Data Protection Regulation (GDPR) in mind and seeks to offer a seamless solution to a company's text anonymization needs. Many existing GDPR solutions focus mainly on anonymizing database entries, not plain-text snippets.

My system pipeline consists of two principal components.

  1. Entity Recognition: The entity is recognized using various approaches, including Named Entity Recognition (implemented), Regular Expression based patterns (implemented) and TF-IDF based scores (to be implemented in the 2nd phase). A sketch of the regex-based detection follows this list.

  2. Subsequent action: Once an entity is recognized, the system looks up the configuration mapped to that particular attribute and carries out one of the following actions to anonymize the data:

    • Suppression (implemented)

    • Deletion/Replacement (implemented)

    • Generalization (to be implemented in the 2nd phase)
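
The following minimal sketch shows the regular-expression-based detection from step 1 combined with a configuration lookup as in step 2; the patterns and action names are illustrative, not the project's actual configuration.

    # Minimal sketch of regex-based detection paired with a configured action.
    import re

    PATTERNS = {
        "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
        "PHONE": re.compile(r"\b\d{10}\b"),
    }
    ACTIONS = {"EMAIL": "replace", "PHONE": "suppress"}

    def detect(text):
        """Yield (match, entity label, configured action) triples."""
        for label, pattern in PATTERNS.items():
            for match in pattern.finditer(text):
                yield match.group(), label, ACTIONS[label]

    print(list(detect("Mail me at john.doe@example.com or call 9876543210.")))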

The methods to generalize the attribute include a novel word vector based generalization and extraction of part holonyms.

Some of the coding milestones achieved include:

  • Setup the coding environment for the development phase.

  • Setup the Django Web App and the Database.

  • Wrote a function to carry out text pre-processing, including removal of illegal characters, tokenization, expansion of contractions, etc. (a small sketch follows this list).

  • Wrote and integrated wrappers for the Stanford NER system. Wrote the entity replacement function for it.

  • Wrote and integrated wrappers for the Spacy NER system. Wrote the entity replacement function for this too.

  • Wrote the suppression and deletion functions. Integrated the two with a DB lookup for configurations.

  • Wrote the Regular Expression based pattern search function.

  • Implemented the backend and frontend of the entire Django WebApp.
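
A small sketch of the pre-processing step mentioned above (illegal-character removal, tokenization, contraction expansion); the contraction table and the definition of "illegal" characters here are illustrative, not the project's actual rules.

    # Small sketch of the described pre-processing steps.
    import re

    CONTRACTIONS = {"don't": "do not", "can't": "cannot", "i'm": "i am"}

    def preprocess(text):
        text = re.sub(r"[^\x20-\x7E]", " ", text)            # drop non-printable characters
        for short, full in CONTRACTIONS.items():
            text = re.sub(short, full, text, flags=re.IGNORECASE)
        return text.split()                                   # naive whitespace tokenization

    print(preprocess("I'm calling John\u200b, don't wait!"))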

Major things planned for Phase 2:

  • Implement a dual TF-IDF system: one that gives scores based on the documents the user has uploaded, and one that gives scores based on TF-IDF trained on a larger, external corpus.

  • Implement a word vector closest neighbor based generalization.

  • Implement the holonym lookup and extraction functions.

Picture 1: The user dashboard, which allows users to add new attribute configurations, modify existing configurations, add aliases for the NER lookup, add regex patterns and carry out text anonymization.

Picture 2: The text anonymization system in action. The various entities and regex patterns were recognized and replaced as per the configuration.

Phase 1 - Report: Максим Филин

During the first month of Google Summer of Code, I have been working with the following tasks:

  1. Refactoring social media Twitter API.

  2. Trying all framework examples and fixing bugs and errors.

The project in the master branch currently supports only Python 2, but we want to add support for Python 3. This requires refactoring the code in some places. Most of the errors have now been fixed.

3. Compiling libsvm and liblinear binaries for macOS, Ubuntu and Windows and adding them to Pattern so that pattern.vector works out of the box. SVM examples now work on all platforms.

4. Looking at Travis CI failing unit tests.

There were some errors in TestParser.test_find_keywords and TestClassifier.test_igtree. The problem was a wrong interpretation of the test examples, and those tests have now been rewritten. However, there are still occasional errors in the search engine tests because of API licences and permissions.

5. Working on VK API.

We think it would be great to add an API for VKontakte, the biggest Russian social network. It is used in many countries and supports more than 50 languages. It also has a very versatile API, so it is a good fit for Pattern. The structure for working with the API has already been created.

In the following part of Google Summer of Code we will focus on the Python 3 Pattern release and testing. We will also try to implement the pattern.ru module.

Phase 1 - Report: Alexander Rossa

There are two distinct focal points in my GSoC work. The first is a functional Twitter bot, achieved with a pipeline consisting of a Python machine learning backend for tweet analysis, a Node.js frontend for accessing the Twitter API, and the integration of Seed into the frontend for templated procedural text generation. The other is extending the capabilities of Seed, making it into a Node.js module (available through npm) and adding conditional generation.

The work I have done during my first month gravitated largely around the first task. The tasks that were completed include:

  • A Node.js frontend for working with the Twitter API: retrieving and reacting to tweets, sending back responses, etc.

  • Python ML backend for Topic Analysis and Sentiment Analysis of Tweets

  • An API for the Python backend so that it can be accessed as a microservice rather than as one monolithic deployment bundled with the Node.js frontend

  • A Seed miniprogram for generating templated responses. This does not yet have conditional generation enabled and serves more as a testing rig for generating text that conforms to non-violent communication principles.

Apart from these tasks, I spent time exploring approaches for deployment and getting to grips with new technologies.

The next focus for this project will be:

  • Extending Seed with conditional generation and API for easier access

  • Extending Seed miniprogram with newly acquired conditional generation capabilities

  • Testing and improving the bot. This may include:

    • adding an ML module for paraphrasing (which will enable the bot to revisit topics mentioned in the conversation in a more natural way)

    • improving the quality of the Seed miniprogram's generation capabilities (more templates for sentences, more randomness...)

    • adding new rules for participation (the bot currently decides to participate based on hashtags)

After the first two bullet points are completed, the work on the Twitter bot is essentially done; what remains is improving the quality of the text it is able to produce.
