Final Report: GDPR Anonymization Tool

Github Page

Installation Guide

Usage Guide

Live Demo


What is text anonymization?

Text anonymization is the processing of text to strip it of identifying attributes, thereby hiding sensitive details and protecting the identity of users.

Architecture

This system consists of two main components.

Sensitive Attribute Detection System

Before text can be anonymized, the sensitive attributes that reveal information have to be identified. We use two methods for this, which can be used in tandem or as standalone systems:

  1. Named Entity Recognition Based Detection: This relies on tagging sensitive entities in text. The user can set up a different configuration for each entity type, which determines how entities of that type are anonymized. The available options are Deletion/Replacement, Suppression and Generalization. The system currently ships with spaCy's NER system, which can easily be swapped out for other NER models.

  2. TF-IDF Based Rare Entity Detection: Certain sensitive attributes in text might not necessarily be tagged by the NER system. These sensitive tokens can be identified by the TF-IDF system. Term frequency–inverse document frequency (TF-IDF) identifies possible rare entities in text based on the distribution and occurrence of tokens across sample text snippets supplied by the user. Once a TF-IDF score threshold is set, tokens scoring above it are treated as sensitive and anonymized.
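
The thresholding step can be illustrated with a minimal sketch in plain Python. This is not the tool's own implementation: the function names, the tokenizer, and the choice to add the target text to the corpus when counting document frequencies are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rare_tokens(target, samples, threshold):
    """Return the tokens of `target` whose TF-IDF score exceeds `threshold`.

    `samples` are the reference snippets supplied by the user. The target is
    added to the corpus (an assumption of this sketch) so that every token
    has a document frequency of at least one.
    """
    docs = [set(tokenize(s)) for s in samples] + [set(tokenize(target))]
    tokens = tokenize(target)
    counts = Counter(tokens)
    flagged = set()
    for token, count in counts.items():
        tf = count / len(tokens)                   # term frequency in target
        df = sum(1 for d in docs if token in d)    # documents containing token
        idf = math.log(len(docs) / df)             # rarer tokens score higher
        if tf * idf > threshold:
            flagged.add(token)
    return flagged


samples = [
    "the patient was seen by the doctor",
    "the doctor saw the patient today",
    "the patient is well",
]
print(rare_tokens("the patient was seen by dr zhivago", samples, 0.15))
```

With these snippets, only the tokens that appear nowhere in the samples score above the threshold; common words such as "the" and "patient" score zero because they occur in every document.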

Sensitive Attribute Anonymization System

Once the sensitive attributes/tokens are detected, they need to be anonymized depending on the kind of token they are. The user can set different anonymization actions for different tokens. The currently available options are:

  1. Deletion/Replacement: To be used where retaining even part of the attribute, as the other anonymization methods do, is not appropriate. Completely replaces the attribute with a pre-set replacement. Example: My name is John Doe would be replaced by My name is <Name>.

  2. Suppression: To be used when hiding part of the information is enough to protect the user's anonymity. The user can supply the percentage or the number of characters to be suppressed. Example: My phone number is 9876543210 would be replaced by My phone number is 98765***** if the user chooses 50% suppression.

  3. Generalization: To be used when the entity is sensitive enough to need anonymization but can still be partially retained to provide information. The system has two methods of carrying out generalization:

    • Word Vector Based: In this option, the nearest neighbor of the word in the vector space is used to generalize the attribute. Example: I live in India gets generalized to I live in Pakistan. This method, while completely changing the word, largely retains the vector space information useful in most NLP and text processing tasks.

    • Part Holonym Based: In this option, the system parses the WordNet lexical database to extract part holonyms. This method works exceptionally well with geographical entities. The user can choose the level of generalization. Example: I live in Beijing gets generalized to I live in China at level 1 generalization and to I live in Asia at level 2 generalization.
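
The three actions above can be sketched in a few lines of Python. This is an illustrative sketch, not the tool's code: the function names are hypothetical, suppression is assumed to mask trailing characters (as in the phone number example), and the holonym chain is a hand-written table standing in for the WordNet lookup.

```python
import math


def replace(text, entity, placeholder):
    """Deletion/replacement: swap the whole entity for a pre-set placeholder."""
    return text.replace(entity, placeholder)


def suppress(value, percent, mask="*"):
    """Suppression: mask the trailing `percent` of the characters in `value`."""
    n_masked = math.ceil(len(value) * percent / 100)
    kept = len(value) - n_masked
    return value[:kept] + mask * n_masked


# Hypothetical part-holonym chain; the tool extracts these from WordNet.
HOLONYMS = {"Beijing": "China", "China": "Asia"}


def generalize(entity, level):
    """Generalization: climb up to `level` steps along the holonym chain."""
    for _ in range(level):
        if entity not in HOLONYMS:
            break  # no broader region known, stop early
        entity = HOLONYMS[entity]
    return entity


print(replace("My name is John Doe", "John Doe", "<Name>"))
print(suppress("9876543210", 50))
print(generalize("Beijing", 1), generalize("Beijing", 2))
```

Keeping each action as a separate function mirrors the per-entity configuration described earlier: a configuration can then map an entity type to whichever action suits it.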

Final Report: Pattern 3

Project overview

During Google Summer of Code I focused on the development of the Pattern 3 framework. Pattern consists of many modules that help users work with web data, use machine learning algorithms, apply natural language processing techniques and more. The main task was to complete the port of Pattern to Python 3 and to refactor the code. Fixing bugs was also important, and as a result all tests in the Travis CI automated testing system now pass. New functions and modules have also been added. The pattern.ru module allows users to work with Russian texts. The pattern.web module was improved by adding the VKontakte API class.

Main completed tasks

  • Compiling libsvm and liblinear binaries for macOS, Ubuntu and Windows and adding them to Pattern to make pattern.vector work out of the box.
  • Refactoring social media Twitter API.
  • Testing all modules, fixing bugs and Travis CI tests.
  • Creating the VKontakte API class, which allows users to get information from the biggest Russian social network. With it you can retrieve a user's profile description and profile picture, posts from the user's profile wall, and newsfeed posts matching a search keyword.
  • Creating the pattern.ru module and collecting the necessary data: Named Entities List, Frequency Dictionaries, Part of Speech Wordlist, Spelling List. A parser for part-of-speech tagging and a spellchecker are now available for the Russian language.
  • Pattern Python 3 Release

Future Work

There are many opportunities to continue improving the Pattern framework and to introduce new functionality. For example, the web mining module can be extended with features that help users analyze data collected from social media. It is also important to add sentiment analysis to pattern.ru.

While working on the Pattern Python 3 release, I was collecting political tweets from Twitter and posts from VKontakte to build a large dataset for analyzing political debate tweets and political discussions. When the collection process is completed, the dataset will be made available to researchers.

Final Report: De-escalation Twitter Bot

Project Overview

My Google Summer of Code project aimed to create a Twitter bot that would be able to participate in debates around Twitter, find tweets and debates that are overly angry and try to calm them down. While creating this bot, it has also been my task to take the existing Seed application, transform it into a Node module and add some capabilities to it. This secondary aim tied in nicely with the main focus as Seed was used for response generation. All of the aims of the project were fulfilled. However, there still is potential for improvements and future work on the bot by extending its capabilities.

Twitter Bot

The Twitter bot is a mixture of a Node.js frontend and a Python backend. The backend is there for analysis of tweets using pretrained neural network models. It is implemented as a microservice, served by a Flask server routing requests to relevant analysers. Apart from sending the tweet analysis requests, the frontend provides capabilities for connecting to Twitter for reading and sending Tweets and connecting to Seed for generating responses. A detailed description of all of its parts can be seen in the official GSoC repo.
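
The routing idea behind the backend can be sketched without any web framework. The following is a framework-agnostic illustration in plain Python rather than the bot's actual Flask code: the paths, function names and placeholder analysers are all assumptions, and the real analysers call pretrained neural network models instead of returning fixed values.

```python
import json


def analyse_topic(text):
    """Placeholder for the topic model; returns a fixed label."""
    return {"topic": "unknown"}


def analyse_anger(text):
    """Placeholder for the anger model; returns a fixed score."""
    return {"anger": 0.0}


# Route table: request path -> analyser function (paths are assumptions).
ANALYSERS = {"/topic": analyse_topic, "/anger": analyse_anger}


def handle_request(path, body):
    """Dispatch a JSON request body to the analyser registered for `path`.

    Returns a (status code, JSON string) pair, as a web server route would.
    """
    analyser = ANALYSERS.get(path)
    if analyser is None:
        return 404, json.dumps({"error": "unknown analyser"})
    payload = json.loads(body)
    return 200, json.dumps(analyser(payload["text"]))


status, response = handle_request("/anger", json.dumps({"text": "some tweet"}))
print(status, response)
```

In a Flask server, each entry of the route table would correspond to a registered route, and the frontend would send the tweet text as the JSON body of an HTTP request.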

Dataset

Since the task at hand is quite specific, there was a need for custom made datasets. There are two datasets in total, one for each analysis task. The final_dataset.csv is a dataset of about 7000 tweets, downloaded, cleaned and annotated by me, including columns for text, topic and anger level. The anger_dataset.csv contains 5000+ tweets, split 50/50 into angry and non-angry ones. These are partially outsourced and partially made up of the tweets from the final_dataset. The code for creating the datasets is in the GSoC repository, along with both of the datasets.

Context

There are many approaches to de-escalation in online settings. In this project we decided to follow the rules of non-violent communication and conflict resolution. This type of communication aims for a calm discourse with the other party, regardless of their manner of communication. The aim is to shift focus from the sentiment (anger, hate and so on) to the real source of the sentiment. It is characterized by honest attempts to understand what the other party feels and why. This can be achieved, for example, by posing questions about the topic and by giving the other party space to express themselves and arrive at the root of the actual problem. The ability to analyse the topic of a tweet, together with access to the historic anger levels of the conversation, makes this Twitter bot quite adept at carrying out tasks like these. With the use of Seed, the generated text is random and human-like enough that the 'botness' of the bot is less of a problem.

Pretotyping

Pretotyping is a type of pre-release product testing that focuses on making sure the final product is worth making, rather than on how to make it. Before releasing the bot into the wild waters of the internet, I ran a number of sessions on a Twitter account called OpenDiscussion during which I acted like a bot myself: looking for angry tweets to respond to, taking detailed notes of each step and 'generating' responses. This was a way of seeing how people would actually react to our bot, which kinds of interaction make them even angrier and which seem to help. It also benefited the creation of the bot by making us aware of many pitfalls the bot would have to deal with. The notes can be seen in the pretotyping subfolder of the official repo.

All in all, the pretotyping sessions were a success. Although the response rates were not that great (people mostly did not respond) this is not an issue for a bot. When the people did respond, however, in the majority of cases, the anger levels dropped or disappeared completely. Being a bot for a minute, trying to act without any preconceived notions or opinions that I have, I could see that the biggest flaws of a bot when leading discussions with humans (no preconceived notions or opinions, does not experience anger, does not get offended) are also its biggest strengths. By not getting angry itself, the bot automatically takes down the anger levels of discussion, making it more civil and calm.

Seed Improvements

Over the course of this project, the original Seed application has been made into a Node module, which is now available as seedtext in the NPM repository, with its source in the seedtext GitHub repository. It has also been extended with conditional generation, as requested in a GitHub issue of the original Seed app, and with the possibility to define custom methods for importing Seed sketches or to use the built-in ones.

Seed by default generates totally random variations of the text. Through conditional generation, this randomness can be controlled. The Twitter bot uses this functionality to vary its wording and tone depending on the range of the anger level (using words like 'upset' at anger level > 0.5 and < 0.6 and using words like 'enraged' at anger level > 0.9). This enables the generated text to be much more subtle and human-like.
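
The idea of conditioning word choice on the anger level can be sketched as follows. This is a Python illustration of the concept, not seedtext syntax or the bot's code: only the 'upset' and 'enraged' bands come from the description above, while the intermediate 'angry' band, the boundary handling and the response template are assumptions.

```python
# (lower bound, upper bound, word). The 'upset' and 'enraged' bands follow the
# report; the 'angry' band in between is an assumption made for this sketch.
ANGER_WORDS = [
    (0.5, 0.6, "upset"),
    (0.6, 0.9, "angry"),
    (0.9, 1.0, "enraged"),
]


def anger_word(level):
    """Pick the word whose band contains `level`, or None below every band.

    Bands are treated as open at the lower bound and closed at the upper
    bound, which is an assumption about how the boundaries are handled.
    """
    for low, high, word in ANGER_WORDS:
        if low < level <= high:
            return word
    return None


def respond(level):
    """Build a response whose wording depends on the detected anger level."""
    word = anger_word(level)
    if word is None:
        return "What do you think about this topic?"
    return f"It sounds like you are {word}. What is it that bothers you most?"


print(respond(0.55))
print(respond(0.95))
```

In the bot itself, the random variation within each band comes from Seed; the sketch only shows how the anger level narrows down which wording is eligible.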

The changes to sketch importing and the transformation into a Node module make Seed available to a much wider audience. Previously, the capabilities of the application were tied to the web environment; now it can be used virtually anywhere.

Future Work

There are several ways in which my summer work on this GSoC project could be extended. One would be to extend the capabilities of the Twitter bot. Currently, the bot lacks a comprehensive database solution for participating in true debates. The Seed source files could also be improved to make the generated text better. Another possibility is improving the analysis backend and its neural networks. At the moment the test set accuracy for topic analysis is around 81%. With a bigger and better dataset, this could likely be improved.

I have already talked with my supervisor about ways in which I could collaborate on the project in the future. We discussed rewriting the original Seed repository so that it uses the newly made seedtext module, and adding the possibility of creating bots at the push of a button to the web application that the original Seed currently has. I am looking forward to helping make these plans a reality.

Acknowledgements

At the end of this report, I would like to express my thanks to the whole team of people from CLiPS and beyond whom I had the honour to meet this summer. A special shoutout goes to my supervisor, Frederik, who has been really great over the whole duration of the internship. I am also happy that I got the chance to meet other GSoC students this year and I hope we'll stay in touch in the future. Last but not least, I would like to thank Google for making GSoC possible. It was a wonderful experience.