GSOC 2019 reports

English Profanity and Offensive Word List Constitution

As part of Google Summer of Code 2019, I undertook the constitution of an annotated list of English terms related to the notions of profanity, hatefulness and offensiveness. In this post, I will describe the different steps taken towards building it up and annotating it.

The first step was to determine what technique to use to generate a list that would reflect the aspects being researched. My choice was to use 2 comparable corpora whose main difference would be the presence or absence of offensive and hateful language. A technique called pointwise mutual information (PMI) can then be applied to see which words are more typical of one corpus relative to the other. It is good at ignoring common and (usually) uninteresting words such as “the”, “an”, etc. while singling out terms typical of a given corpus.

To that end, I used textual data collected from the controversial social media platform gab.com. Gab came into the public spotlight in the aftermath of the Tree of Life shooting, when it was reported that the shooter was a Gab user and that the platform might have played a role in his radicalization. Manually going through a couple of posts quickly gives one a hint of why such claims were made, as the platform is filled with openly racist, conspiracy-minded, anti-Semitic and overall hateful and toxic content. It thus seemed like a “good” place to start. I manually selected a few dozen users who were openly racist and hateful to be scraped, in the hope that they would indeed reflect the toxic language I was looking for. In total, around 250,000 posts were retrieved from approximately 60 users over a span of 3 years (from late August 2016, when Gab first came online, until late February 2019). The data was cleaned of URLs and usernames, as these convey no useful information for our task and are not privacy-friendly.
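As a rough illustration, that cleaning step can be done with a couple of regular expressions. This is a minimal sketch rather than the exact preprocessing script; in particular, the @-mention pattern is an assumption about how usernames appeared in the posts:

    import re

    def clean_post(text):
        # Remove URLs, @-mentions and redundant whitespace (illustrative patterns).
        text = re.sub(r"https?://\S+", " ", text)
        text = re.sub(r"@\w+", " ", text)
        return re.sub(r"\s+", " ", text).strip()

    print(clean_post("@user88 read this https://example.com NOW"))  # -> "read this NOW"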

The second step was to collect a reference corpus against which our toxic-language corpus could be compared. The main point when applying such techniques is to find data that is as close as possible to the target corpus except for the one dimension being researched, profanity and offensiveness in this case. I thus collected data from another social media platform, namely Reddit. The advantage here is that mere Internet slang is less likely to show up in the comparison of both corpora, which might have been a problem if the reference corpus had been, e.g., the Brown corpus, which is much too standard for our current purpose. A downside, however, is that Reddit, while more mainstream, moderate and moderated than Gab, is not free from toxic content either, and this could lead to some offensive language slipping through. Yet the platform has recently been taking action against hateful and toxic content by banning posts, users and even entire subreddits deemed inappropriate, so Reddit still felt like a good reference in contrast to Gab. Reddit posts were simply retrieved from a public archive, and there was more than enough data to match that of Gab.

Once both corpora had been put together, we applied a PMI analysis with Gab as the target corpus and kept the top 2000 words (ranked by PMI score). It yielded rather intuitive results, with “Jew”, “nigger”, “kike” (offensive word for “Jew”) and other niceties showing up at the very top. However, a lot of non-offensive and only semi-related terms also showed up, such as “America”, “white” or “election”, which would be interesting for topic modeling but did not entirely fit our purpose. Of course, it also output a lot of entirely unrelated words that would need to be cleaned up during the annotation phase. We thus needed another way to enrich the list.
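Before describing that second approach, here is a minimal sketch of the PMI step above over two tokenised corpora. This is one common formulation of corpus-comparison PMI; the exact scoring and filtering used in the project may have differed:

    import math
    from collections import Counter

    def pmi_keywords(target_tokens, reference_tokens, min_count=10, top_n=2000):
        # Score each word by how typical it is of the target corpus vs. both corpora combined.
        target = Counter(target_tokens)
        combined = target + Counter(reference_tokens)
        n_target, n_combined = sum(target.values()), sum(combined.values())
        scores = {}
        for word, freq in target.items():
            if combined[word] < min_count:
                continue  # rare words give unreliable scores
            p_word_given_target = freq / n_target
            p_word = combined[word] / n_combined
            scores[word] = math.log2(p_word_given_target / p_word)
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

    # top_2000 = pmi_keywords(gab_tokens, reddit_tokens)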

The idea was to use lexical proximity between words represented as embeddings in a high-dimensional vector space. When applied to a sufficient amount of data, this technique can deliver surprisingly intuitive results. Given that words are represented in a mathematical form, they can be added and subtracted to and from one another, such that “Merkel” – “Germany” + “France” yields “Macron”. Needless to say, such models are powerful tools for capturing all sorts of lexical relationships. For our purpose, we trained a basic word-embedding model on our Gab corpus. However, lexical relationships don’t just jump out of the blue, and I needed seed words with which to compute lexical proximity within the embedding space. Those were found heuristically by searching the web for lists of insults and rude language in general. We used 2 lists: a list of insults (thus excluding “rude” words such as “fucking”, as it is not an insult) put together collaboratively in “Wiki” format, and an “Offensive/Profane Word List” by Luis von Ahn (creator of the language-learning app Duolingo, among other things).
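Training such a model takes only a few lines with a library like gensim. The snippet below is a minimal sketch; the library choice, hyperparameters and toy data are assumptions rather than the exact setup used:

    from gensim.models import Word2Vec

    # One tokenised Gab post per list; a tiny toy example stands in for the real corpus here.
    posts = [["build", "the", "wall"], ["build", "the", "wall", "now"], ["the", "wall", "now"]]

    model = Word2Vec(
        sentences=posts,
        vector_size=100,   # called `size` in gensim versions before 4.0
        window=5,
        min_count=1,       # a higher threshold (e.g. 5) makes more sense on the full corpus
        workers=4,
    )

    # With enough data, analogy queries like the one in the text become possible:
    # model.wv.most_similar(positive=["merkel", "france"], negative=["germany"], topn=3)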

Each seed word was itself added to the final list before being compared to the other words in the vector space, using cosine distance as the means of comparison. The 10 most similar words were kept, and each one’s distance to the seed word was added to the distances from previous retrievals of that word. For instance, we used “nigger” as a seed word, which yielded “niggar” as a very similar one; if “niggar” had previously been retrieved, the current cosine distance between “nigger” and “niggar” was added to that of the previous occurrence of “niggar”. In the end, we had generated a list of words mapped to accumulated cosine distances that could be sorted to retrieve the words most commonly associated with insults and other offensive words from our 2 original lists. Adding up the cosine distances of each retrieved word proved useful because the Gab vector space had been trained on a rather small amount of data for such a task (250,000 posts), and this cosine-distance-based retrieval technique also generated noise and irrelevant data.
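The retrieval-and-accumulation loop itself is short. A sketch with gensim might look like this (gensim’s most_similar returns cosine similarities rather than distances, but the ranking and accumulation work the same way; the two list variables are placeholders):

    from collections import defaultdict

    def expand_seed_list(model, seed_words, topn=10):
        # For every seed word, retrieve its nearest neighbours and accumulate their scores,
        # so that words retrieved from several seeds rise to the top.
        accumulated = defaultdict(float)
        for seed in seed_words:
            if seed not in model.wv:
                continue  # ignore seeds missing from the (small) Gab vocabulary
            for word, similarity in model.wv.most_similar(seed, topn=topn):
                accumulated[word] += similarity
        return sorted(accumulated.items(), key=lambda kv: kv[1], reverse=True)

    # candidates = expand_seed_list(model, insult_list + profanity_list)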

Each word in the list was then annotated along 2 axes/dimensions: one representing the level/degree of offensiveness (from 0 to 4) and another reflecting the nature or topic associated with said word (racial, political, religious, etc.), based on previous work by CLiPS on German and Dutch. Topics were not mutually exclusive, and multiple topics can be associated with one word. Reviewing the words manually, one by one, was also the opportunity to get rid of irrelevant words. However, it must be noted that the line between relevant and irrelevant can sometimes be fuzzy, as sensationalist or controversial words (“refugee”, “supremacist”, etc.) can also prove useful. Thus, when in doubt, the word remained in the list: deleted words cannot be retrieved, while irrelevant words can always be removed later if necessary.
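To give an idea of what such annotations look like, here is a purely illustrative example of the two dimensions for a couple of entries; the severity values, label names and storage format are hypothetical, not the actual annotation file:

    # severity: 0 (harmless) to 4 (extremely offensive); topics: zero or more labels per word.
    annotations = {
        "kike":        {"severity": 4, "topics": ["racial", "religious"]},
        "supremacist": {"severity": 1, "topics": ["political"]},
    }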

I hope this post was enjoyable to read and gave a good overview of how to filter out specific data by comparison. I think the method described above works well for high-resource languages like English, given the quantitative nature of the techniques involved. Should it be transposed to other (and more specific) topics, or to languages less represented online, more precise techniques should be considered.

First month in CLiPS: outcomes and thoughts.

At the beginning of the month I knew that there was a single danger awaiting me, and it wasn't connected to software development. I was afraid of being too fascinated by the variety of opportunities that lay ahead of me. Of course, my goal was to concentrate on one thing and bring it to some sort of satisfying completion. But there were also so many things I wanted to do... I wanted to work with Pattern, CLiPS's package written in Python; to experiment with cross-language analysis; and I always held some psycholinguistics-related thoughts in my head.

Luckily for me, my mentors were very careful from the very beginning. They did everything possible so that I wouldn't start working on all the projects at once. This month I had two tasks, and I will talk about them in detail below.

Part one: Pattern.

Pattern makes a strong impression on a young developer. It resembles some sophisticated architectural structure: there are many things happening in it at the same time, and it seems that Pattern can be used for any purpose imaginable.

In the beginning it was difficult to run the whole program at once. For example, several tests had lost their relevance during the months in which no new build had been launched. The program behaved differently on Windows and Linux, and wasn't working on Python 2.7. In a few days it got a little easier. The main issue I worked on was the required package named “mysqlclient”, which failed to install on some computers for no apparent reason. It was annoying for users who didn't plan to use databases at all. I figured out a way around it, but more issues kept coming up and slowed me down a little.
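I won't reproduce the exact patch here, but the general idea of making such a dependency optional looks roughly like this (a sketch under my own assumptions, not the actual Pattern code):

    # Import the MySQL driver lazily, so users who never touch databases are unaffected.
    try:
        import MySQLdb
    except ImportError:
        MySQLdb = None

    def connect(host, user, password, database):
        # Fail only at the moment database support is actually requested.
        if MySQLdb is None:
            raise ImportError("Database support requires mysqlclient: pip install mysqlclient")
        return MySQLdb.connect(host=host, user=user, passwd=password, db=database)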

For example, Travis CI was causing strange problems with the combination of Python 2.7 and the cherrypy package, because of which the Python 2.7 build was always failing, to my annoyance. Some tests were outdated and used things like raising “StopIteration” instead of a simple “return”, contradicting recent PEP updates.
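For illustration, this StopIteration issue boils down to PEP 479, which requires generators to end with a return instead of letting StopIteration escape. The functions below are not the actual test code, just the general pattern:

    def take_old(n, iterable):
        it = iter(iterable)
        for _ in range(n):
            yield next(it)   # next() may raise StopIteration, which leaks out: an error under PEP 479

    def take_fixed(n, iterable):
        it = iter(iterable)
        for _ in range(n):
            try:
                yield next(it)
            except StopIteration:
                return       # end the generator cleanly instead of raising StopIteration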

In the process, I also accepted one of the pull requests to improve English verb inflection, etc. (the request also offers alternatives for solving problems with tests) and added a list of profanity words in Russian.

By the end of the month there is only one test in the whole program that still behaves semi-randomly. Very soon I will merge everything into the master branch and will be able to say that the task is completed!

Part two: Hate speech

Initially I was planning to compare politically motivated hate speech concerning the American-Russian relationship, but there were a lot of problems with that and it seemed like too complicated a task for the GSOC period. So instead I decided to focus on hate speech in Russian, with the possibility of combining it later with cross-lingual analysis. I read a few papers before starting the project and tried to write down all the findings, comparing approaches, error analyses and data sets. The table of all recorded findings is freely available here and will be updated.

The definition of hate speech is itself already quite complicated, and a lot of people confuse it with offensive speech (as noted in the article "Measuring the Reliability of Hate Speech Annotations..." by Ross et al.), so I decided to concentrate on one type of hate speech: sexism. I had several reasons to pick it: first, Russia has recently passed several sexist laws, which keeps provoking discussions; and second, I found a few sexist datasets in English online, which could help me a lot if I decided to work on cross-lingual analysis.

I collected two types of data.

For my first type of data, I collected comments from news posts on the Russian social network VKontakte. I picked three major news pages, each of them having a slightly different audience and censorship rules. I looked for news that could possibly trigger sexist discussions; for that I used the VK API with a query of search words. The code I wrote for this can be found here. The second type of data is denser with sexism. It is a forum whose name could literally be translated as “Anti-females”, created by people who are trying to live by “patriarchy standards”. They seriously discuss such things as “female logic” (as an example of females being incapable of thinking logically), the place of a female in the world, etc.

It seems to be perfect for sexism detection; the only problem is the form of the data, which differs from the previously collected comments. (But there are several approaches I am planning to try out on that.)
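The collection script for the VKontakte comments is linked above; at its core it comes down to two VK API calls, roughly like this (the token, API version and parameter choices here are placeholders, not my actual configuration):

    import requests

    API = "https://api.vk.com/method/"
    TOKEN = "..."      # access token (placeholder)
    VERSION = "5.95"   # API version (illustrative)

    def search_posts(query, count=100):
        # Find news posts whose text matches a query of trigger words.
        params = {"q": query, "count": count, "access_token": TOKEN, "v": VERSION}
        return requests.get(API + "newsfeed.search", params=params).json()["response"]["items"]

    def get_comments(owner_id, post_id, count=100):
        # Download the comments under a single post.
        params = {"owner_id": owner_id, "post_id": post_id, "count": count,
                  "access_token": TOKEN, "v": VERSION}
        return requests.get(API + "wall.getComments", params=params).json()["response"]["items"]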

So far I have collected approximately 16,000 sexist and non-sexist comments. The annotated corpora can be found here, and those still to be annotated can be found here. Right now the annotation is imperfect, because it is done only by me; for higher accuracy we would have to hire a few more Russian speakers. I hope to be done with my annotation by 3 July, and then I am planning to start working on the model.

Fabricio Layedra: Report - Phase 1

Initial work on viNLaP:

  • Exploratory Data Analysis of the MAGA Corpus
  • Creation of scripts built on scattertext, a Python library, for the topical charts of viNLaP (a minimal sketch follows this list).
  • Front-end development, version 1.0
  • Creation of scripts for downloading the tweets of the MAGA Corpus in order to create the next batch of visualizations based on geospatial data (latitude and longitude of the tweets).
  • Literature review on novel visualizations of TF-IDF scores.
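As referenced in the list above, a minimal scattertext script for a polarized-corpus chart looks roughly like this. The data frame, column names and category labels are placeholders, not the actual viNLaP setup (which works on the MAGA Corpus):

    import pandas as pd
    import spacy
    import scattertext as st

    nlp = spacy.load("en_core_web_sm")
    # One row per tweet with a text column and a binary stance label (toy placeholder data).
    df = pd.DataFrame({"text": ["great rally today folks", "this policy is a disaster for everyone"],
                       "stance": ["pro", "anti"]})

    corpus = st.CorpusFromPandas(df, category_col="stance", text_col="text", nlp=nlp).build()
    html = st.produce_scattertext_explorer(corpus,
                                           category="pro",
                                           category_name="Pro",
                                           not_category_name="Anti",
                                           width_in_pixels=1000)
    open("topical_chart.html", "w").write(html)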

CLiPS GSOC 2019 kicks off

  • Posted on: 22 May 2019
  • By: Guy

We are happy and proud to once again have an extremely talented team of GSOC students this year.

Fabricio Layedra will be working on viNLaP, an interactive and data-driven web dashboard with three modules. Each module is based on one of three main analyses: spatial, temporal and statistical/traditional. Each module will include traditional visualizations related to the respective analysis, but also novel visualizations based on ones proposed in the literature. In this first scenario, viNLaP will visualize polarized data, but it is built to be useful for new types of datasets that may come in the future.