GSOC Reports

English Profanity and Offensive Word List Constitution

As part of Google Summer of Code 2019, I built an annotated list of English terms related to the notions of profanity, hatefulness and offensiveness. In this post, I will describe the different steps taken to build and annotate it.

The first step was to determine what technique to use to generate a list that would reflect the aspects being researched. My choice was to use 2 comparable corpora whose main difference would be the presence or absence of offensive and hateful language. A technique called pointwise mutual information (PMI) can then be applied to see which words are more typical of one corpus relative to the other. It is good at ignoring common and (usually) uninteresting words such as “the”, “an”, etc. while singling out terms typical of a given corpus.
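The scoring step can be sketched as follows. This is a simplified PMI variant that compares a word's probability in the target corpus to its probability in the combined corpora; the toy token lists below are illustrative stand-ins for the real Gab and Reddit data.

```python
import math
from collections import Counter

def pmi_scores(target_tokens, reference_tokens):
    """Rank words by how typical they are of the target corpus:
    PMI(w, target) = log( P(w | target) / P(w) )."""
    target = Counter(target_tokens)
    combined = target + Counter(reference_tokens)
    n_target = sum(target.values())
    n_combined = sum(combined.values())
    return {w: math.log((c / n_target) / (combined[w] / n_combined))
            for w, c in target.items()}

# Toy corpora: words frequent in the target but rare overall score highest,
# while shared function words like "the" score around zero.
gab_tokens = "the user posted hateful hateful posts".split()
reddit_tokens = "the user posted cute cute posts".split()
scores = pmi_scores(gab_tokens, reddit_tokens)
ranked = sorted(scores, key=scores.get, reverse=True)
```

Sorting by score and keeping the top of the ranking is exactly the "top 2000 words" step described later in the post.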

To that end, I used textual data collected from Gab, a controversial social media platform that came into the public spotlight in the aftermath of the Tree of Life shooting, when it was reported that the shooter was a Gab user and that the platform might have played a role in his radicalization. Manually going through a couple of posts quickly hints at why such claims were made, as the platform is filled with openly racist, conspiracist, anti-Semitic and overall hateful and toxic content. It thus seemed like a “good” place to start. I manually selected a few dozen openly racist and hateful users to be scraped, in the hope that their posts would indeed reflect the toxic language I was looking for. In total, around 250,000 posts were retrieved from approximately 60 users over a span of about two and a half years (from late August 2016, when Gab first came online, until late February 2019). The data was cleaned of URLs and usernames, as these convey no useful information for our task and are not privacy-friendly.
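That cleaning step might look roughly like this — a sketch, not the exact script used:

```python
import re

URL_RE = re.compile(r"https?://\S+")      # matches http(s) URLs
MENTION_RE = re.compile(r"@\w+")          # matches @usernames

def clean_post(text):
    """Strip URLs and @usernames from a post, then collapse whitespace."""
    text = URL_RE.sub("", text)
    text = MENTION_RE.sub("", text)
    return re.sub(r"\s+", " ", text).strip()
```

Applied to every retrieved post, this removes the privacy-sensitive identifiers while keeping the running text intact.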

The second step was to collect a reference corpus against which our toxic-language corpus could be compared. The main point when applying such techniques is to find data that is as close as possible to the target corpus except for the one dimension being researched, in this case profanity and offensiveness. I thus collected data from another social media platform, namely Reddit. The advantage here is that mere Internet slang would be less likely to show up after the comparison of both corpora, which might have been a problem if the reference corpus had been, e.g., the Brown corpus, which is much too standard for our current purpose. A downside, however, is that Reddit, while more mainstream, moderate and moderated than Gab, is also not free from toxic content, and this could lead to some offensive language slipping through. Yet the platform has recently been taking action against hateful and toxic content by banning posts, users and even entire subreddits deemed inappropriate, so Reddit still felt like a good reference in contrast to Gab. Reddit posts were simply retrieved using a public archive, and there was more than enough data to match that of Gab.

Once both corpora had been put together, we applied a PMI analysis with Gab as the target corpus and kept the top 2000 words (ranked by PMI score). It yielded rather intuitive results, with “Jew”, “nigger”, “kike” (offensive word for “Jew”) and other niceties showing up at the very top. However, there were also many non-offensive and semi-related terms, such as “America”, “white” or “election”, that would be interesting for topic modeling but did not entirely fit our purpose. Of course, it also output a lot of entirely unrelated words that would need to be cleaned up during the annotation phase. We thus needed another way to enrich the list.

The idea was to use lexical proximity between words represented as embeddings in a high-dimensional vector space. When applied to a sufficient amount of data, this technique can deliver surprisingly intuitive results. Given that words are represented in a mathematical form, they can be added and subtracted to and from one another, such that “Merkel” – “Germany” + “France” yields “Macron”. Needless to say, such models are powerful tools for capturing all sorts of lexical relationships. For our purpose, we trained a basic word-embedding model on our Gab corpus. However, lexical relationships don’t just jump out on their own, and I needed seed words with which to compute lexical proximity within the embedding space. Those were found heuristically by searching the web for lists of insults and rude language in general. We used 2 lists: a list of insults (thus excluding merely “rude” words such as “fucking”, as it is not an insult) put together collaboratively in wiki format, and an “Offensive/Profane Word List” by Luis von Ahn (creator of the language-learning app Duolingo, among other things).

Each seed word was itself added to the final list before being compared to the other words in the vector space, using cosine distance as the measure of similarity. The 10 most similar words were kept, and each one's distance to the seed word was added to any distance previously accumulated for that word. For instance, using “nigger” as a seed word yields “niggar” as a very similar one; if “niggar” had previously been retrieved, the current cosine distance between “nigger” and “niggar” was added to that of the previous occurrence of “niggar”. In the end, we had generated a list of words mapped to accumulated cosine distances that could be sorted to retrieve the words most commonly associated with the insults and other offensive words from our 2 original lists. Accumulating the cosine distances of each retrieved word proved useful because the Gab vector space was trained on a rather small amount of data for such a task (250,000 posts), so this cosine-distance-based retrieval technique also generated noise and irrelevant data.
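A minimal sketch of this accumulation step, using toy 2-d vectors in place of the trained Gab embeddings. For ranking purposes the sketch sums cosine similarities (larger = closer) rather than distances; the effect is the same — words that repeatedly appear near known insults rise to the top:

```python
from math import sqrt

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def accumulate_neighbours(seed_words, vectors, top_n=10):
    """For each seed word, keep its top_n nearest neighbours and sum the
    similarity scores across all seeds."""
    totals = {}
    for seed in seed_words:
        if seed not in vectors:
            continue
        sims = sorted(((w, cosine_similarity(vectors[seed], v))
                       for w, v in vectors.items() if w != seed),
                      key=lambda pair: pair[1], reverse=True)
        for word, sim in sims[:top_n]:
            totals[word] = totals.get(word, 0.0) + sim
    return totals

# Toy "embeddings"; the real ones come from the model trained on Gab.
toy_vectors = {
    "insult_a": [1.0, 0.0],
    "insult_b": [0.9, 0.1],
    "slang":    [0.8, 0.3],
    "neutral":  [0.0, 1.0],
}
totals = accumulate_neighbours(["insult_a", "insult_b"], toy_vectors, top_n=2)
```

Here "slang" is retrieved by both seeds, so its accumulated score beats words that are close to only one seed — the noise-damping effect described above.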

Each word in the list was then annotated along 2 axes/dimensions: one representing the degree of offensiveness (from 0 to 4) and another reflecting the nature or topic associated with the word (racial, political, religious, etc.), based on previous work by CLiPS on German and Dutch. Topics were not mutually exclusive, and multiple topics could be associated with one word. Reviewing words one by one was also the opportunity to get rid of irrelevant words. However, the line between relevant and irrelevant can sometimes be fuzzy, as sensationalist or controversial words (“refugee”, “supremacist”, etc.) can also prove useful. Thus, when in doubt, the word remained in the list: deleted words cannot be retrieved, while irrelevant words can always be removed later if necessary.

I hope this post was enjoyable to read and gave a good overview of how to filter out specific data by comparison. I think the method described above works well for high-resource languages like English, given the quantitative nature of the techniques involved. Should it be transposed to other (and more specific) topics, or to languages less represented online, more precise techniques should be considered.

First month in CLiPS: outcomes and thoughts.

At the beginning of the month I knew there was a single danger awaiting me, and it wasn't connected to software development: I was afraid of being too fascinated by the variety of opportunities ahead of me. Of course, my goal was to concentrate on one thing and bring it to some sort of satisfying completion. But there were so many things I wanted to do... I wanted to work with Pattern, CLiPS's package written in Python; to experiment with cross-language analysis; and I always held some psycholinguistics-related thoughts in my head.

Luckily for me, my mentors were very careful from the very beginning. They did everything possible to keep me from working on all the projects at once. This month I had two tasks, and I will talk about them in detail below.

Part one: Pattern.

Pattern makes a strong impression on a young developer. It resembles some sophisticated architectural structure: many things happen in it at the same time, and it seems that Pattern can be used for any purpose imaginable.

In the beginning it was difficult to run the whole program at once. For example, several tests had lost their relevance during the months in which no new build was launched. The program behaved differently on Windows and Linux, and wasn't working on Python 2.7. After a few days it got a little easier. The main issue I worked on was the required package “mysqlclient”, which failed to install on some computers for no apparent reason. It was annoying for users who didn’t plan to use databases at all. I figured out a way around it, but more issues kept coming up and slowed me down a little.
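The usual way around a hard dependency like this is to make the import optional, so that only users who actually touch the database code need the package installed. This is an illustrative sketch, not Pattern's actual code:

```python
# Make the MySQL backend optional: importing the package at module load
# time must not fail just because mysqlclient isn't installed.
try:
    import MySQLdb  # provided by the mysqlclient package
    HAS_MYSQL = True
except ImportError:
    MySQLdb = None
    HAS_MYSQL = False

def connect(host, user, password, database):
    """Open a database connection, or explain what's missing."""
    if not HAS_MYSQL:
        raise RuntimeError(
            "Database support requires the optional 'mysqlclient' package; "
            "install it with: pip install mysqlclient")
    return MySQLdb.connect(host=host, user=user, passwd=password, db=database)
```

Users without databases never hit the error; users who need them get an actionable message instead of a cryptic install failure.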

For example, Travis CI was causing strange problems with the combination of Python 2.7 and the cherrypy package, because of which the Python 2.7 build kept failing, to my annoyance. Some tests were outdated and used things like raising “StopIteration” instead of a simple “return”, contradicting recent PEP updates.
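The StopIteration issue refers to PEP 479: since Python 3.7, a StopIteration raised inside a generator is converted into a RuntimeError, so a plain `return` must be used to end iteration. In miniature:

```python
def first_n_old_style(items, n):
    """Anti-pattern: breaks on Python 3.7+ under PEP 479."""
    for i, item in enumerate(items):
        if i >= n:
            raise StopIteration  # now surfaces as RuntimeError
        yield item

def first_n(items, n):
    """Correct: a bare return ends the generator cleanly."""
    for i, item in enumerate(items):
        if i >= n:
            return
        yield item
```

Swapping `raise StopIteration` for `return` is exactly the kind of mechanical fix those outdated tests needed.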

In the process, I also accepted a pull request improving English verb inflection (the request also offered alternative fixes for the test problems) and added a list of profanity words in Russian.

By the end of the month, only one test in the whole program still behaves semi-randomly. Very soon I will merge everything into the master branch and will be able to say that the task is completed!

Part two: Hate speech

Earlier, I was planning to compare politically motivated hate speech concerning American and Russian relations; but there were a lot of problems with that, and it looked like too complicated a task for the GSoC period. So instead I decided to focus on hate speech in Russian, with the possibility of combining it later with cross-lingual analysis. I read a few papers before starting the project and tried to write down all the findings, comparing approaches, error analyses and datasets. The table of all recorded findings is freely available here and will be updated.

The definition of hate speech is itself already quite complicated, and a lot of people confuse it with offensive speech (as noted in the article "Measuring the Reliability of Hate Speech Annotations..." by Ross et al.), so I decided to concentrate on one type of hate speech: sexism. I had several reasons to pick it: first, Russia has recently passed several sexist laws, which always provokes discussion; and second, I found a few sexism datasets in English online, which could help me a lot if I decided to work on cross-lingual analysis.

I collected two types of data.

For the first type of data, I collected comments from news posts on the Russian social network “Vkontakte”. I picked three major news pages, each with a slightly different audience and censorship rules. I looked for news that could plausibly trigger sexist discussions, using the VK API and a query of search words. The code I wrote for that can be found here. The second type of data is denser with sexism. It is a forum whose name could literally be translated as “Anti-females”, created by people who are trying to live by “patriarchy standards”. They seriously discuss things such as “female logic” (as an example of females supposedly being incapable of thinking logically), the place of a female in the world, etc.

It seems perfect for sexism detection; the only problem is the form of the data, which differs from the previously collected comments. (But there are several approaches I am planning to try out on that.)

So far I have collected approximately 16,000 units of sexist and non-sexist comments. The annotated corpora can be found here, and the ones still to be annotated can be found here. Right now the annotation is imperfect, because it is made only by me; for higher accuracy we would have to hire a few more Russian speakers. I hope to be done with my annotation by 3 July, and then I am planning to start working on the model.
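As a hedged illustration of the first collection step: a VK API search call can be built as below. The `newsfeed.search` method and its `q`, `count`, `access_token` and `v` parameters follow VK's public API documentation, but the token and query here are placeholders, not the actual collection code linked above.

```python
# Build (but do not send) a VK API request for posts matching a keyword.
API_URL = "https://api.vk.com/method/newsfeed.search"
API_VERSION = "5.131"  # assumed API version; pin whatever version you test against

def build_search_request(query, access_token, count=200):
    """Return the URL and query parameters for one newsfeed.search page.
    VK caps a single page at 200 items; paging is done via start_from."""
    return API_URL, {
        "q": query,                 # search keyword, e.g. a news topic
        "count": count,
        "access_token": access_token,
        "v": API_VERSION,
    }

url, params = build_search_request("декрет", "ACCESS_TOKEN")
# requests.get(url, params=params) would fetch the first page.
```

Looping such calls over a list of trigger keywords, and then pulling the comment threads of the matched posts, yields the kind of comment corpus described above.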

Fabricio Layedra: Report - Phase 1

Initial work on viNLaP:

  • Exploratory Data Analysis of the MAGA Corpus
  • Creation of scripts upon scattertext, a python library, for the topical charts of viNLaP.
  • Front-end development version 1.0
  • Creation of scripts for downloading the tweets of MAGA Corpus in order to create the next bunch of visualizations based on geospatial data (latitude and longitude of the tweets).
  • Literature review for novel visualizations of TF-IDF scores.

CLiPS GSOC 2019 kicks off

  • Posted on: 22 May 2019
  • By: Guy

We are happy and proud to once again have an extremely talented team of GSOC students this year.

Fabricio Layedra will be working on viNLaP, an interactive and data-driven web dashboard with three modules. Each module is based on one of three main analyses: spatial, temporal and statistical/traditional. Each module will include traditional visualizations related to the respective analysis, but also novel visualizations based on those proposed in the literature. In this first scenario, viNLaP is to visualize polarized data, but it is built to be useful for new types of datasets that may come in the future.

Final Report: GDPR Anonymization Tool

Github Page

Installation Guide

Usage Guide

Live Demo

What is text anonymization?

Text anonymization refers to the processing of text to strip it of any attributes/identifiers, thus hiding sensitive details and protecting the identity of users.


This system consists of two main components.

Sensitive Attribute Detection System

Before text can be anonymized, the sensitive attributes in the text which give out information have to be identified. We use two methods to do so (they can be used in tandem or as standalone systems):

  1. Named Entity Recognition Based Detection: This relies on tagging of sensitive entities in text. The user can set up different configurations for different entities, which determine how a given entity is anonymized. The available options are: Deletion/Replacement, Suppression and Generalization. The system currently ships with spaCy's NER system, but it can very easily be switched out for other NER models.

  2. TF-IDF Based Rare Entity Detection: Certain sensitive attributes in text might not necessarily be tagged/identified by the NER system. These sensitive tokens can be identified by the TF-IDF system. Term frequency–inverse document frequency identifies possible rare entities in text based on the distribution and occurrence of tokens across sample text snippets supplied by the user. Once the TF-IDF score threshold is set, tokens with scores above it are deemed sensitive and anonymized.
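The idea behind the TF-IDF detector can be sketched in a few lines. This is a minimal illustration of the scoring-and-thresholding scheme, not the tool's actual implementation:

```python
import math
from collections import Counter

def tfidf_scores(documents):
    """Plain TF-IDF per document: tokens concentrated in few documents
    get high scores and are candidates for anonymization."""
    docs = [doc.lower().split() for doc in documents]
    n_docs = len(docs)
    df = Counter()                      # document frequency per token
    for tokens in docs:
        df.update(set(tokens))
    scores = []
    for tokens in docs:
        tf = Counter(tokens)
        scores.append({t: (c / len(tokens)) * math.log(n_docs / df[t])
                       for t, c in tf.items()})
    return scores

def sensitive_tokens(doc_scores, threshold):
    """Tokens scoring above the user-set threshold are flagged."""
    return {t for t, s in doc_scores.items() if s > threshold}
```

On the three snippets below, the rare name "bob" clears a 0.2 threshold while the boilerplate words score zero, which is exactly the behaviour the detector relies on.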

Sensitive Attribute Anonymization System

Once the sensitive attributes/tokens are detected, they need to be anonymized depending on the kind of token they are. The user can set different anonymization actions for different tokens. The currently available options are:

  1. Deletion/Replacement: To be used in cases where retaining any part of the attribute through the other anonymization methods is not appropriate. Completely replaces the attribute with a pre-set replacement. Example: My name is John Doe would be replaced by My name is <Name>.

  2. Suppression: To be used when hiding a part of the information is enough to protect the user's anonymity. The user can supply the percentage or the number of characters they want suppressed. Example: My phone number is 9876543210 would be replaced by My phone number is 98765***** if the user chooses 50% suppression.

  3. Generalization: To be used when the entity is sensitive enough to need anonymization but can still be partially retained to provide information. The system has two methods of carrying out generalization:

    • Word Vector Based: In this option, the nearest neighbour of the word in the vector space is utilized to generalize the attribute. Example: I live in India gets generalized to I live in Pakistan. While completely changing the word, this method largely retains vector-space information useful in most NLP and text processing tasks.

    • Part Holonym Based: In this option, the system parses the WordNet lexical database to extract part holonyms. This method works exceptionally well with geographical entities, and the user can choose the level of generalization. Example: I live in Beijing gets generalized to I live in China at level 1 generalization and to I live in Asia at level 2.
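The suppression and word-vector actions above can be sketched as follows. The helper names are illustrative (not the tool's API), and the toy vectors stand in for pretrained embeddings such as GloVe:

```python
from math import sqrt

def suppress(value, fraction=0.5, mask="*"):
    """Keep the leading characters and mask the rest,
    e.g. 50% of '9876543210' -> '98765*****'."""
    keep = len(value) - int(len(value) * fraction)
    return value[:keep] + mask * (len(value) - keep)

def generalize(word, vectors):
    """Replace a word with its nearest neighbour in the vector space."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))
    candidates = [(w, cos(vectors[word], v))
                  for w, v in vectors.items() if w != word]
    return max(candidates, key=lambda pair: pair[1])[0]

# Toy embeddings: "india" and "pakistan" sit close together, "banana" far away.
toy_vectors = {"india": [1.0, 0.1], "pakistan": [0.9, 0.2], "banana": [0.0, 1.0]}
```

With these toy vectors, `generalize("india", toy_vectors)` picks "pakistan", mirroring the example in the text.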

Final Report: Pattern 3

Project overview

During the Google Summer of Code event I focused on the development of the Pattern 3 framework. Pattern consists of many modules which help users work with web data, use machine learning algorithms, apply natural language processing techniques and much more. The main task was to complete the port of Pattern to Python 3 and to refactor the code. Fixing bugs was also important, and as a result all tests in the automatic testing system Travis CI now execute successfully. New functions and modules have also been added: a module that allows users to work with Russian texts, and an improved pattern.web module with a new VKontakte API class.

Main completed tasks

  • Compiling libsvm and liblinear binaries for macOS, Ubuntu and Windows and adding them to Pattern so that pattern.vector works out of the box.
  • Refactoring social media Twitter API.
  • Testing all modules, fixing bugs and Travis CI tests.
  • Creating a VKontakte API class which allows users to get information from the biggest Russian social network. With it you can retrieve a user's profile description and profile picture, a user's posts from their profile wall, and posts from the newsfeed for a search keyword.
  • Creating a module for the Russian language and collecting the necessary data: Named Entities List, Frequency Dictionaries, Part of Speech Wordlist, Spelling List. A parser for part-of-speech tagging and a spellchecker are now available for Russian.
  • Pattern Python 3 Release

Future Work

There are many opportunities to continue improving the Pattern framework and introduce new functionality. For example, the web mining module can be extended with other features helping users analyze the data collected from social media. It is also important to add sentiment analysis.

While I was working on the Pattern Python 3 release, I was collecting political tweets from Twitter and posts from VKontakte to build a big dataset that can help analyze political debate tweets and political discussions. When the collection process is completed, the dataset will be made available to researchers.

Final Report: De-escalation Twitter Bot

Project Overview

My Google Summer of Code project aimed to create a Twitter bot that would be able to participate in debates on Twitter, find tweets and debates that are overly angry and try to calm them down. While creating this bot, it was also my task to take the existing Seed application, transform it into a Node module and add some capabilities to it. This secondary aim tied in nicely with the main focus, as Seed was used for response generation. All of the aims of the project were fulfilled, though there is still potential for improvements and future work on the bot by extending its capabilities.

Twitter Bot

The Twitter bot is a mixture of a Node.js frontend and a Python backend. The backend is there for analysis of tweets using pretrained neural network models. It is implemented as a microservice, served by a Flask server routing requests to relevant analysers. Apart from sending the tweet analysis requests, the frontend provides capabilities for connecting to Twitter for reading and sending Tweets and connecting to Seed for generating responses. A detailed description of all of its parts can be seen in the official GSoC repo.


Since the task at hand is quite specific, there was a need for custom made datasets. There are two datasets in total, one for each analysis task. The final_dataset.csv is a dataset of about 7000 tweets, downloaded, cleaned and annotated by me, including columns for text, topic and anger level. The anger_dataset.csv contains 5000+ tweets, split 50/50 into angry and non-angry ones. These are partially outsourced and partially made up of the tweets from the final_dataset. The code for creating the datasets is in the GSoC repository, along with both of the datasets.


There are many approaches to tackling de-escalation in online settings. In this project we decided to follow the rules of non-violent communication and conflict resolution. This type of communication aims for a calm discourse with the other party, regardless of their manner of communication. The aim is to shift focus from the sentiment (anger, hate, ...) to the real source of the sentiment. It is characterized by honest attempts to understand what the other party feels and why. This can be achieved, for example, by posing questions about the topic, giving the other party space to express themselves, arriving at the root of the actual problem, and so on. The ability to analyse the topic of a tweet, as well as access to the historic anger levels of the conversation, makes this Twitter bot quite adept at carrying out tasks like these. With the use of Seed, the generated text is random and human-like enough that the 'botness' of the bot does not pose too big a problem.


Pretotyping is a type of pre-release product testing which focuses on making sure the final product is worth making, rather than on how to make that product. Before releasing the bot into the wild waters of the internet, I ran a number of sessions on a Twitter account called OpenDiscussion during which I acted like a bot myself, looking for angry tweets to respond to, taking detailed notes of each step and 'generating' responses. This was a way of actually seeing how people would react to our bot, what sort of interaction makes them even angrier and what actually seems to help. It also benefited the creation of the bot by making us aware of many pitfalls the bot would have to deal with. The notes can be seen in the pretotyping subfolder of the official repo.

All in all, the pretotyping sessions were a success. Although the response rates were not great (people mostly did not respond), this is not an issue for a bot. When people did respond, however, in the majority of cases the anger levels dropped or disappeared completely. Being a bot for a minute, trying to act without any of my own preconceived notions or opinions, I could see that the biggest flaws of a bot in discussions with humans (no preconceived notions or opinions, no anger, no taking offence) are also its biggest strengths. By not getting angry itself, the bot automatically takes down the anger level of a discussion, making it more civil and calm.

Seed Improvements

Over the course of this project, the original Seed application has been made into a Node module, which is now available as seedtext at the NPM repository. Here is the GitHub repository of seedtext. It has also been extended with capabilities of conditional generation, mentioned in this GitHub issue of the original Seed app, and with the possibility to define custom methods for importing Seed sketches or using built-in methods.

Seed by default generates totally random variations of the text. Through conditional generation, this randomness can be controlled. The Twitter bot uses this functionality to vary its wording and tone depending on the range of the anger level (using words like 'upset' at anger levels between 0.5 and 0.6, and words like 'enraged' at anger levels above 0.9). This makes the generated text much more subtle and human-like.
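The band-to-wording lookup that drives this conditioning might be sketched like this; the thresholds and words are illustrative, not the bot's actual vocabulary:

```python
# Bands are checked from the highest threshold down; the first band
# whose threshold the anger level reaches determines the wording.
ANGER_BANDS = [
    (0.9, "enraged"),
    (0.5, "upset"),
    (0.0, "concerned"),
]

def wording_for(anger_level):
    """Pick a tone word for the given anger level in [0, 1]."""
    for threshold, word in ANGER_BANDS:
        if anger_level >= threshold:
            return word
    return "calm"
```

The chosen word is then handed to the Seed grammar as a condition, so a tweet scored 0.95 is answered in much stronger terms than one scored 0.55.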

Changes to the importing of sketches and transformation into Node module make Seed available to a much wider audience. Previously, the capabilities of the application were tied to the web environment, now it can be used virtually anywhere.

Future Work

There are several possible ways in which my summer work on this GSoC project could be extended. One of the possible ways would be by extending the Twitter bot capabilities. Currently, the bot lacks a comprehensive database solution for participating in true debates. There is also the possibility of improving the Seed source files, making the generated text better. Improving the capabilities of the analysis backend and the neural networks in it is also a possibility. At the moment the test set accuracy for topic analysis is around 81%. With a bigger and better dataset, this could easily go higher.

I already talked with my supervisor about possible ways in which I could collaborate on the project in the future. We talked about rewriting the original Seed repository so that it uses newly made seedtext module and adding the possibility of creating bots with the push of a button to the web application that the original Seed currently has. I am looking forward to helping make these plans reality.


At the end of this report, I would like to express my thanks to the whole team of people from CLiPS and beyond whom I had the honour to meet this summer. Special shoutout goes to my supervisor, Frederik, who has been really great over the whole duration of the internship. I am also happy that I got the chance to meet other GSoC students this year and I hope we'll stay in touch in the future. Last, but not least, I would like to thank Google for making GSoC possible, it was a wonderful experience.

GSOC 2018 Phase 2 Reports

  • Posted on: 24 July 2018
  • By: Guy

The progress reports for Phase 2 of GSOC 2018 are in! Alexander Rossa, Maja Gwozdz, Rudresh Panchal and Максим Филин have documented their work and are ready for the final phase!

Phase 2 - Report: Максим Филин

During the second month of Google Summer of Code, I continued working with Pattern 3 Framework and focused on the following tasks:

1. VKontakte API

I’ve implemented several methods which allow users to get information from the VK social network.

Possibilities:
1) retrieving a user's profile description and profile picture
2) retrieving a user's posts from their profile wall
3) retrieving posts from the newsfeed for a search keyword

I’ve taken the request limitations into account and added the corresponding rules to the VK class.

We’ve adopted an authorization method in the social network VKontakte via a direct link through the VKontakte API (based on the OAuth protocol), called Implicit Flow. Authorization by this method is performed through a VKontakte application specified by its ID. This is the most secure method of authorization available to us.

Instructions on how to obtain an access_token were also prepared. They are in the class description for now, but will later be added to the Pattern documentation.

2. Testing the most important parts

This covered Twitter, SVM, Datasheet and DOM (the HTML parser in pattern.web), as well as adding new tests and fixing bugs.


3. Module for the Russian language

The structure of the module for the Russian language was created and part of the functionality was implemented. The necessary data for the module was also collected:

  • Named Entities List
  • Frequency Dictionary
  • Part of Speech Wordlist
  • Spelling List

The following functionality was implemented: 1) a parser for part-of-speech tagging, 2) a spellchecker.

The new methods will be gradually added.

4. Collecting sentiment political Russian tweets and posts from Twitter and VKontakte.

There is a lot of data from Twitter and other social networks in English, German, Arabic and other languages, but nothing in Russian. So we want to collect a Russian Twitter/VKontakte dataset, particularly Russian political debate tweets. Then we can compare it with the data we already have and determine general political and social trends.

The main idea is to use Yandex Toloka for dataset creation. It is a crowdsourcing platform which helps developers and researchers perform various tasks, including labeling texts. Creating a task on this platform is not trivial and it should be formulated as accurately as possible, but with the help of the assessors it is possible to get results more quickly than on one's own.

5. Prepare Pattern 3 for release.

It was important that all tests run through Travis CI execute without errors. Now all tests pass. In the coming weeks we are going to make the Pattern Py3 release.

Phase 2 - Report: Alexander Rossa

During my second GSoC term I was focusing on finishing various parts of the Twitter Deescalation bot and on extending the Seed module.

Twitter bot:

  • Created a dataset of several thousand tweets for both topic prediction (keyword-labeled and checked for correctness) and anger/participation classification (manually labeled)

  • Improved and tested neural network models used on said dataset

  • Did some pretotyping work with the bot: participating in online discussions and impersonating an ideal version of the bot to see what it will have to deal with "in the real world". The logs are written from the "bot's perspective" and closely follow how the actual bot would execute

Seed module:

  • Transformed Seed into an NPM module

  • Wrote up some documentation for using Seed as an NPM module

  • Almost finished implementing the conditional generation, still need to do a bit of work on connecting all the outputs and do some testing for correctness of the solution

The next focus for this project will be:

  • Finishing the conditional generation for Seed

  • Reworking the collected dataset a bit (it turned out there were too many classes for too little data, which plateaued the test set accuracy at about 60% even with heavy regularization); I have collected more data for a smaller number of classes and am hand-labeling it right now

  • Testing and improving the bot in the real world

  • Retrospectively rewriting the original Seed repository to use Seed as a Node module instead, and adding the ability to easily create Twitter bots from the Seed website

Phase 2 - Report: Rudresh Panchal

This post reflects upon some of the milestones achieved in GSoC 2018's Phase two.

Phase 2 mainly concentrated on expanding the rare entity detection pipeline, adding the generalization features and increasing the accessibility of the system being built. The following features were successfully implemented:

  • Built a custom TF-IDF system to help recognize rare entities. The TF-IDF system saves the intermediate token counts, so that whenever a new document/text snippet is added to the knowledge base, the TF-IDF scores for all tokens do not have to be recalculated: the stored counts are loaded, incremented and the relevant scores computed.

  • Implemented the "Part Holonym" based generalization feature. This feature relies on lexical databases like Wordnet to extract part holonyms. This generalizes tokens to their lexical supersets. For example: London gets generalized to England, Beijing to China at level one generalization and to Europe and Asia Respectively for level two generalization. The user is given the option of choosing the generalization level for each attribute.

  • Implemented the "Word Vector" based generalization feature. This maps the nearest vector space neighbour of a token in pretrained embeddings like GLoVE and replaces it with the same. For example: India gets replaced with Pakistan.

  • Implemented a general anonymization RESTful API. This gives people the option to utilize our system across different tech stacks.

  • Implemented a token-level RESTful API. This endpoint gives token-level information, including the original token, the replaced token, the entity type and the anonymization type.

  • The API utilizes Django's token-based authentication system. Implemented a dashboard to manage the users' authentication tokens.
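The count-caching idea behind the first bullet above can be sketched as follows; the class and method names are illustrative, not the project's actual code:

```python
import math
from collections import Counter

class IncrementalTfidf:
    """Keep raw document-frequency counts so adding a new snippet only
    increments counters instead of rescanning the whole knowledge base."""

    def __init__(self):
        self.df = Counter()   # in how many documents each token occurs
        self.n_docs = 0

    def add_document(self, tokens):
        """Incremental update: one pass over the new document only."""
        self.n_docs += 1
        self.df.update(set(tokens))

    def idf(self, token):
        """IDF computed on demand from the stored counts."""
        return math.log(self.n_docs / self.df[token]) if self.df[token] else 0.0
```

Persisting `df` and `n_docs` (e.g. as JSON) between runs is what makes the update cheap: loading, incrementing and recomputing scores replaces a full recalculation.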

Some of the major things planned for the 3rd and final phase are:

  • Code cleanup: As the project progressed, some of the code has become redundant which needs to be removed.

  • Documentation: While the code is well commented and easy to understand, the project currently lacks thorough external documentation. A quick usage guide for non-programmer end users also could be helpful.

  • A simple scaffolding system for the user. The system currently ships without any predefined configurations (entities, aliases, etc.). Script(s) that can quickly set up a ready-to-use system with certain default values (pre-defined attribute actions, threshold values, etc.) would be useful.

  • GUI based and API based file upload system. The user currently has to paste plaintext into the GUI or set it as a parameter in the API. The option to directly upload text files will increase user convenience.

  • Experiment with language localization. The system currently works well with the English language, but it needs to be tried out with other languages.

Picture 1: The Token level API in action

Phase 2 - Report: Maja Gwozdz

In the second phase of GSoC, I continued annotating political tweets and corrected some typos in the dataset. I created a more varied corpus by collecting tweets related to American, British, Canadian, and Australian socio-political affairs (I am also collecting New Zealand tweets but they are really rare). As regards the annotation guidelines, I improved the document stylistically and added relevant examples to each section. I also created a short appendix containing the most important politicians and their party affiliations, so as to facilitate future annotations.

As for the dataset itself, I am happy to announce that there were far more idioms and proverbs than in the previous stage. The following list presents the top ten most frequent hashtags extracted from the tweets (the figures in brackets represent the relative frequency of respective hashtags):

1. #Brexit (3.93)

2. #TrudeauMustGo (3.23)

3. #JustinTrudeau (3.07)

4. #MAGA (2.99)

5. #Tories (2.53)

6. #Drumpf (2.23)

7. #Corbyn (2.19)

8. #Labour (2.08)

9. #Tory (1.98)

10. #ImpeachTrump (1.73)

Our core set of hashtags (balanced with respect to political bias) was as follows: #MAGA, #GOP, #resist, #ImpeachTrump, #Brexit, #Labour, #Tory, #TheresaMay, #Corbyn, #UKIP, #auspol, #PaulineHanson, #Turnbull, #nzpol, #canpoli, #cpc, #NDP, #JustinTrudeau, #TrudeauMustGo, #MCGA. Many more hashtags are being used but they usually yield fewer results than the above set.

Below are a few figures that aptly summarise the current shape of the corpus:

Left-wing bias: ca 55%

Male authors: ca 49%

Polarity: ca 44% negative, ca 47% neutral, ca 9% positive

Mood: ca 50% agitated, ca 21% sarcasm, ca 13% anger, ca 9% neutral, ca 4% joy

Offensive language: present in approximately 17% of all tweets

Swearing by gender: ca 53% males

Speech acts: ca 76% assertive, ca 38% expressive, ca 10% directive, ca 3% commissive, 0.2% metalocutionary

In the third stage I will continue annotating political tweets and write a comprehensive report about the task. My mentors have also kindly suggested that they could hire another student to provide additional judgments on the subjective categories (especially polarity and mood). Having more annotators will undoubtedly make the dataset a more valuable resource.

GSOC 2018 Phase 1 Reports

  • Posted on: 29 June 2018
  • By: Guy

The write-ups for Phase 1 of GSOC 2018 are ready! Check out the progress reports of our talented team of coders: Alexander Rossa, Maja Gwozdz, Rudresh Panchal and Максим Филин.