Sexism detection in Russian language

Since this report is quite long, I recommend reading it here, where it is split into chapters, which should make it easier to read.

The repository of the project can be found here, and a description of its contents is given further down.

N.B.: I decided against the 4-page arXiv publication template for the sake of completeness and full documentation of the process. The references are given at the end of each subtopic.

In the course of work on the issue of "hate speech", two compilations have been made, both of which may be useful for further research in this area. The first one is an attempt to systematize research on hate speech. This file will be updated.

The second one is a compilation of known open source corpora on hate speech. The list includes more than ten of them. This file is not planned to be updated.

1. The definition of “hate speech” and problems and difficulties of the task.

Hate speech is difficult to define, and even after agreeing on some definition, it still proves complicated to decide whether something (a tweet or a message) counts as hate speech.

In the article "Measuring the Reliability of Hate Speech Annotations: The Case of the European Refugee Crisis", the authors noted that the agreement between the annotators of their hate speech corpus (measured with Krippendorff's alpha) ranged from 0.18 to 0.29, much lower than the recommended value of 0.8.
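For readers unfamiliar with the coefficient, here is a minimal sketch of Krippendorff's alpha for nominal labels, written from the standard coincidence-matrix definition. The function name and the input layout (one list of labels per annotated item) are assumptions made for illustration; this is not the code used in the cited paper.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    `units` is a list of items; each item is the list of labels that the
    annotators gave to it. Items with fewer than two labels are ignored.
    """
    # Coincidence matrix: every ordered pair of labels given to the same
    # item, weighted by 1 / (number of labels on that item - 1).
    coincidences = Counter()
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue
        for a, b in permutations(ratings, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n < 2:
        raise ValueError("need at least one item with two or more labels")

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        # Only one label was ever used, so alpha is undefined.
        return float("nan")
    return 1 - observed / expected
```

Values close to 1 mean near-perfect agreement, values around 0 mean agreement no better than chance, and negative values mean systematic disagreement.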

Even more importantly, acquainting the annotators with a definition of hate speech did not lead to a significant improvement of the coefficient. Similar observations are found in the paper "Are You a Racist or Am I Seeing Things?", whose author faced a similar problem. That article compares the decisions of expert and amateur annotators. It examines their influence on attribution decisions and pays special attention to the annotation of tweets in two different corpora (one was made specifically for the article, the other for the article "Hateful symbols or hateful people? predictive features for hate speech detection on twitter"). How different the labels turned out to be gives further evidence of how subjective attribution is. The author claims that annotations made by amateur annotators should be accepted only if there is complete agreement. The author also supports the previous authors in the idea that attribution is difficult without intimate knowledge of the topic.

Similar ideas can be found in the article "Abusive Language Detection in Online User Content", where the authors conducted a so-called "Amazon Turk Experiment": they hired several amateur annotators (not allowing any of them to annotate more than 50 text entities). The annotators were acquainted with the guidelines used by the expert annotators. The agreement rate was acceptable only for binary classification (coefficient of 0.867) and dropped significantly for the categorical task. This could suggest either that more extensive training is needed or that amateurs are much less suited to the annotation task.

Some of the problems in attribution can arise from focusing on the word level instead of on the tweet or message as a whole. This focus on words can be problematic in two ways.

First, the problem of attributing something to the "hate speech" category is not limited to the question of what hate speech is. The article "Automated Hate Speech Detection and the Problem of Offensive Language" demonstrates how easily hate speech and offensive language are confused with each other.

The second problem is connected with sarcasm and with quoting an opponent's arguments; it will be discussed later.

In this work we hoped that, because we picked such a specific area of the problem (sexism detection) and because the annotator is familiar with the problem on a more personal level, it would be easier to distinguish between the categories. Nevertheless, several problems arose. It is also worth noting that early in the research, when we first thought about simply picking "hate speech" as the topic, we made a list of many publicly or semi-publicly available corpora. They are mainly in English and may follow different annotation approaches, but the list could be helpful for further hate speech research. You can find the whole list here.


All references are provided with links to the articles

Thomas Davidson, Dana Warmsley, Michael Macy, and Ingmar Weber. 2017. Automated Hate Speech Detection and the Problem of Offensive Language. 11th International Conference on Web and Social Media (ICWSM), pages 512–515, Montreal, Quebec, Canada

C. Nobata, J. Tetreault, A. Thomas, Y. Mehdad, and Y. Chang. 2016. Abusive language detection in online user content. In 25th International World Wide Web Conference, WWW 2016, pages 145–153.

Bjorn Ross, Michael Rist, Guillermo Carbonell, Benjamin Cabrera, Nils Kurowsky, and Michael Wojatzki. 2016. Measuring the reliability of hate speech annotations: The case of the European refugee crisis. Bochum, Germany, September.

Zeerak Waseem. 2016. Are you a racist or am I seeing things? Annotator influence on hate speech detection on Twitter. 1st Workshop on Natural Language Processing and Computational Social Science, pages 138–142.

Zeerak Waseem and Dirk Hovy. 2016. Hateful symbols or hateful people? predictive features for hate speech detection on twitter. NAACL Student Research Workshop, San Diego, California, June. Association for Computational Linguistics.

1.2. The process of collecting corpora.

1.2.1 Media sources

The first source was the Russian social network Vkontakte. It is well equipped for developers' needs: it comes with a built-in API with detailed documentation.

We chose three popular groups on VK, representing three different Russian media outlets. They all have more than 500,000 subscribers, and ideologically they represent different directions. The media we chose are:

"Lentach" was a highly oppositional outlet in the past and now has a slightly less obvious focus; it is especially interesting because of its huge number of subscribers (over 2 million). Comments are cleaned by a bot: comments shorter than five words are prohibited, which community members get around with padding, for example by writing five-word comments like "bot blow [me] three four five". Obscene words are also prohibited in the comments, but the filter is not too ingenious, which gives rise to a lot of obscene vocabulary in which words are spelled backwards or letters are omitted. The remaining restrictions seem insignificant to us.

"Medusa" is a liberal, oppositional outlet (more than 500,000 subscribers) and the only media outlet that promotes feminist views (at least in words). However, there may also be very long sexist debates in the comments. (Comments are open to all.)

"RT News in Russian" is part of Russia Today and basically just automatically posts articles published on the portal. Despite the impressive list of rules for commenters, comments are not particularly moderated. Since there are so many posts, the comments per post are not so numerous, despite the large number of subscribers (over a million).

We approached all these media in the same way, trying to find the comments we are interested in. This function was responsible for collecting the corpus (it can be found in this file):

def make_corpus(name,community, query_list, service_token,vk_api_vers):

We used it to find posts containing any of the words in the query list, because we assumed that news related to these topics would cause the most discussion. Here is a translation of the query list:

query_list= {'sexism', 'meToo', 'sexual harassment', 'decriminalization of domestic violence', 'rape', 'feminism', 'Shurygina', 'harassment' }

For each post we took only the first hundred comments: quite often news posts were compilations (they included several different news items), and then the whole dialogue could be devoted to something else. Often this also turned out to be a type of hate speech; for example, our corpus contains a rather long racist dialogue related to Yakutia. In the resulting corpus we recorded the post id, the comment id and a label (by default the label was "sexist", and I then manually checked the text of the comment). When I checked each comment, I set the label to one of three possible values: "sexist", "not sexist" and "sexist in context". The last category was used for messages that were not sexist in isolation but worked that way in conjunction with the post. I did not include them in my train/test datasets, but they offer interesting potential for analysis. (Read more about this in the last part of our report.)
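The selection logic described above can be sketched roughly as follows. This is not the actual make_corpus function: the name build_corpus_rows, the in-memory post and comment structures and the simple substring matching are all assumptions made for illustration. The real function takes a service token and an API version, so it presumably fetches the data from the VK API instead.

```python
def build_corpus_rows(posts, comments_by_post, query_list, max_comments=100):
    """Keep the first `max_comments` comments of every post matching a query.

    posts: list of dicts with keys "id" and "text" (hypothetical layout)
    comments_by_post: dict mapping a post id to a list of dicts with
        keys "id" and "text" (hypothetical layout)
    """
    queries = [query.lower() for query in query_list]
    rows = []
    for post in posts:
        post_text = post["text"].lower()
        # Skip posts that mention none of the query words.
        if not any(query in post_text for query in queries):
            continue
        for comment in comments_by_post.get(post["id"], [])[:max_comments]:
            rows.append({
                "post_id": post["id"],
                "comment_id": comment["id"],
                "text": comment["text"],
                # Default label, to be corrected by hand to "sexist",
                # "not sexist" or "sexist in context".
                "label": "sexist",
            })
    return rows
```

Capping the comments per post keeps long off-topic threads from dominating the corpus, which is the motivation given above.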

1.2.2. Forum sources.

The main source of sexist comments was the Antibab website, whose name can be literally translated as "Anti-Women". There, a group of users (presumably mostly male) discuss women and their own ultra-patriarchal views of life.

A lot of material was collected, more than 10,000 posts, but I decided to reduce the sample of messages taken from this forum: people there have a very specific vocabulary, slang and manner of speech (in addition, the forum is mostly for people over thirty), so there was always a danger that, instead of detecting sexism, the model would simply detect the users of this forum. To balance this, I found a couple of topics there with mostly non-sexist comments and used them to get enough non-sexist material with a similar manner of speech. Data collection was handled by the function:

def make_corpus_ant_forum(name,link_to_topic):

I wrote it using Beautiful Soup. The function takes the name of the new file where the corpus will be saved and a link to the topic to scrape.

The main source of non-sexist speech was the Holywar forum, from which I extracted a large topic devoted to family relations (problems with parents, close relatives, etc.). This was again done in the hope of balancing the sexist and non-sexist material already at the level of the data: it would not be desirable, for example, for the model to treat any mention of women as sexist. The function that extracted this data worked similarly to the previous one.

1.2.3. The description of the resulting corpora.

All of them can be found in the folder gsoc2019_crosslang/russian sexist corpora/annotated/

Name of the corpus   Description
ant_1.csv            Non-sexist comments (with some exceptions) from the Antibab forum
ant_2.csv            Sexist comments (with some exceptions) from the Antibab forum
media_1.csv          Non-sexist and sexist comments from Lentach
media_2.csv          Non-sexist and sexist comments from Medusa
media_3.csv          Non-sexist and sexist comments from Russia Today
ns_1.csv             Very large, purely non-sexist corpus from the Holywar forum

In the end we have 2577 annotated sexist comments and 21526 non-sexist comments, which makes our corpus quite imbalanced.

1.3. Problems with the collected corpora.

1.3.1. Sarcasm.

One of the main problems was distinguishing between sarcasm and seriously meant sexist speech. Example:

“Aren, такую хрень несешь. Еще скажи, что мужик полигамен и должен осеменить больше самок, потому что так заложено природой.»


"Aren, you’re saying some bullshit. What next, are you going to say that all men are polygamous and should inseminate more females, because it is in the nature?”

The comment is not sexist in itself, but it contains sexist vocabulary. Here the problem could be solved with word co-occurrence: the word "bullshit" could work as a signal not to take the message literally.

But another example proves even more complicated. This message is a reaction to the news that fewer female football fans will be shown on TV during matches, to avoid objectification:

“Извините, но это какой пиздц. Может вообще женщинам запретим матчи посещать? Ну чтоб прям наверняка? Можно и в паранджу всех закутать. Это вообще 100% вариант.”


"I'm sorry, but that's fucked up. Maybe we should ban women from attending matches at all then? Just to be sure? You can also dress them all in burqas. That should work 100%.”

1.3.2. Special slang.

The forum used as the main source of sexist speech has a very specific slang that is not typical of the Russian language overall. This could be problematic, because the system should not be limited to one small group of people. We tried to compensate for this by using non-sexist topics from the same forum (a discussion about phones and attempts to agree on a date for a real-life meeting), to ensure that there would be enough non-sexist material with the same manner of speech.

1.3.3. Single annotator.

Of course, since we had a single annotator, the labeling process was far from perfect and quite subjective. Even though the definition and guidelines were discussed in detail with my mentor, some cases still posed a challenge, and possibly imperfect solutions were picked in the end (more about this in the guidelines part). That is also why I cannot recommend the data for immediate use right now. It still needs several other annotators to look at it, preferably ones familiar with sexism at an expert level.

1.4. The guidelines.
  1. Any generalizations that demean or degrade the female sex were considered sexist.

1.1. Sexist statements and generalizations made by women were also considered sexist.

2. Sexism in dialogues with women.

2.1. The use of unmotivated obscene vocabulary specifically emphasizing the female sex of the interlocutor was considered sexist (words such as "whore").

2.1.1. Racist slurs or swear words implying low intelligence of the interlocutor were not considered sexist (except when they accompanied or were derived from some generalization).

2.2. Another sign of a sexist comment was an interlocutor who, in an attempt to insult a woman, attacked exclusively her appearance. This was considered hidden sexism.

2.3. Victim-blaming (aimed at the victim or during the talk about the victim) was considered a case of sexism.

3. Sexism and politics.

3.1. Criticism of feminism or simply aggressive statements about feminism were not considered sexist, because they could be politically motivated.

3.2. Criticism of female politicians was not considered sexism until it was reduced to criticism of appearance or generalizations about women in politics.

3.3. A special case: the infamous story of Diana Shurygina's rape, which became particularly popular through television. Every piece of news about Miss Shurygina caused a lot of comments, and many people speculated that she was not a victim.

This has always sparked a lot of discussion, including a lot of sexist rhetoric. The news portals raised this topic so often (even a few years after the incident) that every news item mentioning Diana Shurygina drew a large number of comments reacting negatively to the fact that news about her continued to be published. We did not consider these comments sexist, nor did we think it was sexist to use her name as a byword, as long as it stood for "hyperbolized news"[1]. If her name was used as a byword for a rape victim, the comment was considered sexist.

In retrospect, this may not have been the best solution. This is one of those cases where the view of another annotator would have been very useful.

[1] - in the sense that the news outlets were exploiting the story and trying to spark more and more debate about whether she was guilty or not (AFTER the court's decision).

4. Sexism in dialogues with men.

4.1. We did not mark every comment that had sexist expressions as sexist. (See Part 1, on the difference between offensive speech and hate speech)

4.1.1. Therefore, for example, many comments from the Antibab forum are marked as non-sexist, unless the purpose of the comment was some kind of offensive generalization.

2.1. Preprocessing

The file was responsible for the preprocessing of the text. At first I planned to divide the data into test and train sets myself and wrote a separate function for this purpose. Later this no longer seemed the best way to handle it, so in the end I used the function from the sklearn library to split the data.

Nevertheless, this way of handling the distribution of comments proved useful when preparing to save the processed corpora (it helped to process the data in small portions).

In the end, a number of functions were written to preprocess the text, mostly using the Russian support in nltk. They strip punctuation, remove quotes and references using regular expressions, remove stop words, and perform lemmatization and stemming. (However, it seems to me that the latter two are not very useful for Russian hate speech detection.) All the resulting texts were saved in the folder "TemporalCorpora" to ease preprocessing for those who wish to use this corpus in their research.
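To give an idea of what such steps look like, here is a minimal sketch using only the standard library. It is not the code from the repository: the function names, the tiny stop-word set (the report used the Russian list from nltk) and the assumed shapes of quotes and references (text inside « » or double quotes, and VK-style mentions such as [id123|Name]) are all illustrative assumptions.

```python
import re
import string

# Hypothetical stand-in for nltk's Russian stop-word list.
STOP_WORDS = {"и", "в", "не", "на", "что"}

# Assumed shape of a quote: text inside « » or inside double quotes.
QUOTE_RE = re.compile(r'«[^»]*»|"[^"]*"')
# Assumed shape of a reference: a VK-style mention like "[id123|Name], ".
REFERENCE_RE = re.compile(r"\[(?:id|club)\d+\|[^\]]*\],?\s*")
# ASCII punctuation plus a few characters common in Russian text.
PUNCT_TABLE = str.maketrans("", "", string.punctuation + "«»…")


def strip_quotes_and_references(text):
    """Remove mentions first, then quoted fragments."""
    return QUOTE_RE.sub("", REFERENCE_RE.sub("", text))


def strip_punctuation(text):
    """Delete punctuation characters without touching the words."""
    return text.translate(PUNCT_TABLE)


def remove_stop_words(text):
    """Drop stop words, comparing case-insensitively."""
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)
```

Keeping each step in its own function makes it easy to evaluate a classifier at different preprocessing stages, as is done in the result tables of this report.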

2.2. Attempted models and their results.

I ambitiously planned three different kinds of embedding and several models of different complexity. I started with simple tf-idf features combined with a naive Bayes classifier.

Since our corpus is very imbalanced, I used the balanced accuracy score as the metric instead of the F1 score, as is recommended in such cases. To be safe, we compared the results at different stages of text preprocessing and also compared logistic regression against the naive Bayes classifier. The results can be found in the table below.
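To illustrate why plain accuracy is misleading on this corpus, here is a small sketch that computes balanced accuracy (the mean of the per-class recalls) by hand. The class counts come from the corpus description in this report; the "always predict non-sexist" classifier is a hypothetical baseline, and scikit-learn offers the same metric as balanced_accuracy_score in sklearn.metrics.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls."""
    recalls = []
    for cls in sorted(set(y_true)):
        indices = [i for i, label in enumerate(y_true) if label == cls]
        correct = sum(1 for i in indices if y_pred[i] == cls)
        recalls.append(correct / len(indices))
    return sum(recalls) / len(recalls)


# 1 = sexist, 0 = non-sexist, with the class sizes of the full corpus.
y_true = [1] * 2577 + [0] * 21526
# Hypothetical baseline that labels every comment as non-sexist.
y_pred = [0] * len(y_true)

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(plain_accuracy, 3))             # 0.893
print(balanced_accuracy(y_true, y_pred))    # 0.5
```

The baseline looks strong under plain accuracy although it never finds a single sexist comment, while balanced accuracy puts it at chance level.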

The difference in results between types of preprocessing was not significant. The only interesting observation was that the result seems to be surprisingly worse when both punctuation and stop words are removed from the text. This probably deserves a separate analysis in the future.

Obviously, none of the results below are particularly stable or set in stone: the accuracy in this case was more like an interval. I dealt with this in a possibly slightly crude way: I obtained the results five times for each setting and took the mean, in an attempt to catch the logic behind the changes.

type                          tf-idf + NB   tf-idf + Logistic Regression
no preprocessing              63%           62%
minus quotes and references   64%           64%
minus punctuation             61%           64%
minus all above               62%           62%
lemmatization + all above     66%           63%

In the last few days, while I was in a hurry to finish this report, I had a late but interesting idea. I realized that I could still make my corpus less imbalanced, which could change the results somehow.

To do this, I excluded the completely non-sexist corpus (ns_1.csv) from my final train/test data. My corpus remained imbalanced, so I still had to use the balanced accuracy score as the metric, but now the ratio was more like 2,000 to 8,000 rather than 2,000 to 20,000.

The results improved immediately and significantly. Unfortunately, the idea came to me too late, so I couldn't test it as thoroughly as I wanted, but the results can still be found in the table below.

type                          tf-idf + Logistic Regression
no preprocessing              73%
minus quotes and references   71%
minus punctuation             70%
minus all above               69%
lemmatization + all above     74%

A more advanced type of embedding that I tried is ELMo. Our architecture was quite simple: we used ELMo embeddings (first a pretrained Russian model, then one trained on our own data; the resulting options and weight files can be found here in the repository), fed them into an LSTM, and then into a simple one-layer feedforward neural network.

Because training ELMo took what seemed like a million hours, I couldn't play around much with the data and tried it only on my very imbalanced corpus. Nevertheless, the results were relatively good.

type               ELMo (finetuned by me)   ELMo (pretrained and finetuned by me)
no preprocessing   67%                      74%

Both numbers are given in the same balanced accuracy score.

Word embeddings trained from 4chan and 8chan data

In this post, we will explain how we collected data from the imageboard forums 4chan and 8chan and how we trained word embeddings from them, and we will describe an experiment illustrating their academic potential.

Related code and files may be found here: GSoC 2019 - 4chan and 8chan Word Embeddings

4chan and 8chan are either popular or obscure corners of the Internet, depending on your knowledge of the Internet ecosystem. Now and then they attract attention from media outlets because of their potential ties to certain shootings and terrorist attacks, most often associated with white supremacism and neo-Nazism. Most recently, 8chan was associated with the Christchurch shooter Brenton Tarrant, who allegedly posted a livestream link and a manifesto prior to his attack. The platform made the headlines again with the El Paso shooting and, as of writing (22/08/2019), has been definitively shut down in the aftermath of the turmoil it sparked. We therefore thought it fit to gather data from these controversial platforms so as to be able to perform sound research on them.

Both platforms are very similar: they allow their users to anonymously create "threads" on a "board" (a sub-forum centered around a certain topic), to which other users can respond. Their main difference lies in that 4chan has a set number of boards dedicated to different subject matters, whereas 8chan allows users to create their own boards around the topic of their choosing, leading to a much higher number of boards but also to much sparser content. However, because of their structural similarity, they share boards centered around the same topics, and we chose to investigate the so-called "/pol/" board, i.e. the board dedicated to discussing (international) politics. On both forums, /pol/ is a popular board known for hosting what we could call "toxic" content: racist and fascist posts along with a wide variety of otherwise dubious content, the virulence of which is probably fueled by the 'absolute' anonymity allowed on the platform.

We thus proceeded to collect the data from the /pol/ board of both platforms. Each platform has an API through which thread or board-related data can be requested programmatically. However, another specificity of those platforms is that they regularly clean up older content and retain only the most recent threads, so that, despite a limited archiving mechanism on the sites themselves, the content available through that channel is too limited for training sufficiently good word embeddings. As a consequence, we looked for other means of gathering the relevant data. Luckily, multiple sites are dedicated to archiving the content posted on 4chan, but we did not manage to find one for 8chan. For 4chan, we found 4plebs, which reaches back to late 2013; for 8chan, we resolved to collect the little information available on the site itself. 4plebs has not set up any API, so we decided to collect the data through scraping.

The collection of 8chan data was rather quick since, as mentioned above, the site does not host a large amount of data at any given time and 8chan seems to attract fewer users than 4chan. For 4chan, however, the quantity of data available for /pol/ alone is rather impressive. We kept our crawler running from the beginning until the end of July and collected approximately 30 million entries spanning 6 years (between 2013 and 2019) for 4chan and 8chan combined. However, because our crawling scheme let 2 scrapers run in opposite directions (i.e. from past to present and from present to past), some data are missing for the years 2015 and 2017.

Once the data was collected, we cleaned it through some pre-processing steps, including removing URLs and reposts and tokenizing it. We reflected for some time on how best to approach the training of the word embeddings: should we remove or keep stop words (common words such as "the", "about", "on", etc. that might be detrimental to certain tasks), what window should we use, should we use CBOW or skip-gram, and so on. We also had to decide which package to work with, and we elected to go with gensim, a popular Python library optimized for training dense word embeddings. It appeared that the default parameters were rather well suited to our task as they stand, and we stuck to them.
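As an illustration of the cleaning steps mentioned above, here is a minimal standard-library sketch. It is not our actual pipeline: the function name, the regular expressions and the reading of "reposts" as posts whose cleaned text is an exact duplicate of an earlier one are assumptions made for the example.

```python
import re

# Simple URL pattern; the patterns really used are not documented here.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Naive tokenizer: lowercase runs of letters, digits and apostrophes.
TOKEN_RE = re.compile(r"[a-z0-9']+")


def clean_posts(posts):
    """Strip URLs, tokenize, and drop empty or duplicate posts."""
    seen = set()
    cleaned = []
    for post in posts:
        text = URL_RE.sub(" ", post).lower()
        tokens = TOKEN_RE.findall(text)
        key = " ".join(tokens)
        # Assumption: a repost is a post with identical cleaned text.
        if not tokens or key in seen:
            continue
        seen.add(key)
        cleaned.append(tokens)
    return cleaned
```

The result is a list of token lists, which is the shape gensim's Word2Vec accepts as its training sentences.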

We then set up a toy experiment to show a potential use of those embeddings: we compared the distance between two given words in the 4/8chan vector space with the distance between the same two words in another vector space trained on Reddit data. We won't go through the whole description of the training of the Reddit word embeddings again. Let us simply note here that we collected the Reddit data from an archiving site. The words we chose to put under scrutiny (henceforth 'target words') are the first 50 words of the English Profanity and Offensive Words (POW) list, which we generated and annotated as part of another project of this GSoC. The words against which they were compared in each vector space were very basic ones, conditioned on their part-of-speech tag (ADJ, NOUN or VERB), to give us a basic idea of the different representations of the target words. Each target word was compared against 2 words of the same grammatical category, a positive and a negative one. For adjectives, we had 'good' and 'bad', for nouns 'human' and 'monster', and for verbs 'love' and 'hate'.

Let us take 2 target words to illustrate how it worked: ‘jew’ and ‘communist’. As a noun, ‘jew’ was compared to both ‘human’ and ‘monster’ in both vector spaces (making 4 comparisons in total) using their cosine similarity. The result is as follows:

‘human’ in Reddit: 0.27 VS ‘human’ in 4/8chan: 0.08
‘monster’ in Reddit: 0.32 VS ‘monster’ in 4/8chan: 0.31

For the target word ‘communist’ (ADJ):

‘good’ in Reddit: 0.12 VS ‘good’ in 4/8chan: -0.03
‘bad’ in Reddit: 0.21 VS ‘bad’ in 4/8chan: 0.04
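The comparison above boils down to computing a cosine similarity in each vector space. Here is a minimal sketch in plain Python; the function names and the dictionary layout (word mapped to a list of floats) are assumptions for illustration, and with gensim one would normally query the trained model's vectors directly.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def compare_spaces(target, anchors, space_a, space_b):
    """Similarity of `target` to each anchor word in two vector spaces.

    Each space is a dict mapping a word to its vector (hypothetical
    layout). Returns {anchor: (similarity_in_a, similarity_in_b)}.
    """
    return {
        anchor: (
            cosine_similarity(space_a[target], space_a[anchor]),
            cosine_similarity(space_b[target], space_b[anchor]),
        )
        for anchor in anchors
    }
```

Called with a target word, the two anchor words of its grammatical category and the two embedding spaces, this returns pairs of numbers of the kind listed above.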

Evidently, there are a lot of methodological shortcomings to the toy experiment above, but this is not intended as academic work, for now. The aim was to show the possible use of those embeddings and possible trends to be explored in language representation for each community. We hope this post gave a clearer view of the process of gathering data from platforms such as 4chan and 8chan, as well as the use that can be made of the word embeddings resulting from such data.

French Profanity and Offensive Word List Constitution

For the second project we worked on during GSoC 2019, we decided to transpose the work we had done before, i.e. we replicated the generation and annotation of a list of profanity and offensive words (POW) for the French language. We should note up front that, despite the name of such lists, they also cover the potentially polarizing dimension of certain words in addition to the offensive and profane one. Below, we explain the data and techniques leveraged to generate the list.

Related code and files may be found here: GSoC 2019 - French Profanity and Offensive Word List

The first step to find POW is to find a dataset/corpus of data containing significant quantities of those. We thus started looking for online sources that fit that description and ended up with 2 paths to explore. The first one was a specific section of the French-speaking videogame website “” (JVC) called “Blabla 18-25 ans” (literally “Chitchat 18-25 years”). Even though it might sound like a benign and perfectly innocent forum, it has gained attention over the recent years for being a rather toxic part of the whole forum, one teeming with Pepe the Frog gifs, trolls or even straight up right-wing or Islamist radicals. French newspapers have been reporting on this phenomenon for a few years, linking such online activity to political activism and the recent success of the French populist right-wing party “Front National” (now rebranded “Rassemblement National”). Manual exploration of the forum also revealed that the toxicity of 18-25 had carried over to the sub-forum dedicated to discussing politics. Both of those sub-forums seemed like promising places to find POW and words indicative of polarized content.

The second identified source of data was a French-speaking board of the imageboard 8chan. 8chan is similar to the nearly eponymous 4chan, with the difference that users are allowed to create their own new boards on the site, much like users can create subreddits on Reddit. As a consequence, 8chan is full of niche boards centered around specific topics, communities or languages. One in particular came to our attention while we were browsing the Internet on the lookout for extremist websites or forums. While exploring the reactionary, identitarian and right-wing extremist website "Démocratie Participative" ("Participative Democracy"), we noted that an 8chan board called "dempart" was featured on the site. After exploring the board for some time, we found it fit for our task in terms of both data quantity and quality.

After identifying those 2 sources, we proceeded to gather data from them by scraping. From JVC, we arbitrarily decided to scrape around 200 pages from the 18-25 as well as the political subforum. Collecting all the data present there would yield an absurd amount of data, since the site has been active for more than 10 years. The 8chan board "dempart" was, for its part, fully scraped, as it did not contain large volumes of content. Out of curiosity, we also tried looking for the hashtag "#dempart" on Twitter, and it yielded posts actually discussing genuine participative democracy, with no ties to the racist group of the same name.

We then used the same technique as for the English POW list, i.e. the statistical metric of pointwise mutual information (PMI), to filter the relevant words from a target corpus compared to a reference corpus. Our reference corpus consisted of 88,000 French text messages published in an open-source format to reflect colloquial French. We noticed, however, that including only the dempart data from 8chan yielded more relevant results than including the JVC 18-25 data as well, so we ended up not using the latter.

Finally, we also added to the general list smaller lists of less common insults found through heuristic Internet searches. We manually reviewed them to ensure they were fit for our purpose before including them. Those sources include lists made by cruciverbists (crossword puzzle enthusiasts) as well as a list that used to be present in Android phones' dictionaries, with a special flag for words suspected to be offensive.

viNLaP: the MAGA corpus case

viNLaP is a web-based visualizer available here. It opens with a brief description of the corpora I am using as a case study. The system offers three tabs for exploring the data with the focuses mentioned before. I will explain each one below.



Speech Igniters


English Profanity and Offensive Word List Constitution

As part of Google Summer of Code 2019, I undertook the constitution of an annotated list of English terms related to the notions of profanity, hatefulness and offensiveness. In this post, I will describe the different steps taken towards building it up and annotating it.

Related code and data can be found here: GSoC 2019 - English Profanity and Offensive Word List

The first step was to determine what technique to use to generate a list that would reflect the aspects being researched. My choice was to use 2 comparable corpora whose main difference would be the presence or absence of offensive and hateful language. A technique called pointwise mutual information (PMI) can then be applied to see which words are more typical of one corpus relative to the other. It is good at ignoring common and (usually) uninteresting words such as "the", "an", etc. while singling out typical terms of a given corpus.
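To make the idea concrete, here is a minimal sketch of one common way to score words with PMI between a target corpus and a reference corpus: the log ratio of a word's probability inside the target corpus to its probability in both corpora combined. The function name, the min_count parameter and this exact formulation are assumptions for illustration; the variant I actually used may differ in details such as smoothing or frequency cut-offs.

```python
import math
from collections import Counter


def pmi_scores(target_tokens, reference_tokens, min_count=1):
    """PMI of each target-corpus word with the target corpus.

    PMI(word, target) = log2( P(word | target) / P(word) ), where
    P(word) is estimated on the target and reference corpora combined.
    """
    target = Counter(target_tokens)
    reference = Counter(reference_tokens)
    n_target = sum(target.values())
    n_total = n_target + sum(reference.values())

    scores = {}
    for word, count in target.items():
        if count < min_count:
            continue  # skip rare words, whose estimates are unreliable
        p_word_given_target = count / n_target
        p_word = (count + reference[word]) / n_total
        scores[word] = math.log2(p_word_given_target / p_word)
    return scores
```

Words that are frequent everywhere get a score near or below zero, while words that occur mostly in the target corpus get high scores, so sorting by score surfaces the candidates for the list.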

To that end, I used textual data collected from the controversial social media platform Gab, which came into the public spotlight in the aftermath of the Tree of Life shooting, when it was reported that the shooter was a Gab user and that the platform might have played a role in his radicalization. Manually going through a couple of posts quickly gives one a hint of why such claims were made, as the platform is filled with openly racist, conspiracist, anti-Semitic and overall hateful and toxic content. It thus seemed like a “good” place to start. I manually selected a few dozen users that were openly racist and hateful to be scraped, in the hope that they would indeed reflect the toxic language I was looking for. In total, around 250,000 posts were retrieved from approximately 60 users over a span of roughly two and a half years (from late August 2016, when Gab first came online, until late February 2019). URLs and usernames were stripped from the data, as they convey no useful information for our task and are not privacy-friendly.
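The cleaning step could look roughly like the sketch below. The regular expressions and the function name `clean_post` are assumptions, not the project's actual code; in particular, I assume that usernames appear as "@name" mentions.

```python
import re

# Assumed patterns: http(s)/www links and "@name" style mentions.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")

def clean_post(text):
    """Remove URLs and user mentions, then collapse leftover whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return " ".join(text.split())

example = clean_post("@some_user look at this https://example.com/page now")
# example == "look at this now"
```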

The second step was to collect a reference corpus against which our toxic-language corpus could be compared. The main point when applying such techniques is to find data that is as close as possible to our target corpus, except for the one dimension we are researching, profanity and offensiveness in this case. I thus collected data from another social media platform, i.e. Reddit. The advantage here is that mere Internet slang would be less likely to show up after the comparison of both corpora, which might have been a problem if the reference corpus had been, e.g., the Brown corpus, which is much too standard for our current purpose. A downside, however, is that Reddit, while being more mainstream, moderate and moderated than Gab, is also not free from toxic content, and this could lead to some offensive language slipping through. Yet, the platform has recently been taking action against hateful and toxic content by banning posts, users and even entire subreddits deemed inappropriate, so Reddit still felt like a good reference in contrast to Gab. Reddit posts were simply retrieved from a public archive, and there was more than enough data to match that of Gab.

Once both corpora had been put together, we applied a PMI analysis with Gab as the target corpus, and kept the top 2000 words (ranked by PMI score). It yielded rather intuitive results, with “Jew”, “nigger”, “kike” (offensive word for “Jew”) and other niceties showing up at the very top. However, many non-offensive and semi-related terms also showed up, such as “America”, “white” or “election”, which would be interesting for topic modeling but did not entirely fit our purpose. Of course, it also output a lot of entirely unrelated words that would need to be cleaned up during the annotation phase. We thus needed another way to enrich the list.

The idea was to use lexical proximity between words represented as embeddings in a high-dimensional vector space. When applied to a sufficient amount of data, this technique can deliver surprisingly intuitive results. Given that words are represented in a mathematical form, they can be added to and subtracted from one another, such that “Merkel” – “Germany” + “France” yields “Macron”. Needless to say, such models are powerful tools to capture all sorts of lexical relationships. For our purpose, we trained a basic word-embedding model on our Gab corpus. However, lexical relationships don’t appear out of the blue, and I needed seed words with which to compute the lexical proximity within the embedding space. Those were found heuristically by searching the web for lists of insults and rude language in general. We used 2 lists: a list of insults (thus excluding “rude” words such as “fucking”, as it is not an insult) put together collaboratively in “Wiki” format and an “Offensive/Profane Word List” by Luis von Ahn (creator of the language-learning app Duolingo, among other things).
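The vector arithmetic mentioned above can be illustrated with a tiny, hand-made example. The three-dimensional vectors below are invented so that the analogy works; a real model learns hundreds of dimensions from data, and the helper names `cosine_similarity` and `analogy` are assumptions for this sketch only.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy vectors with dimensions (is_leader, germany, france).
vectors = {
    "Merkel": [1.0, 1.0, 0.0],
    "Macron": [1.0, 0.0, 1.0],
    "Germany": [0.0, 1.0, 0.0],
    "France": [0.0, 0.0, 1.0],
    "Paris": [0.2, 0.0, 1.0],
}

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    query = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine_similarity(query, vectors[w]))

result = analogy("Merkel", "Germany", "France", vectors)  # "Macron"
```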

Each seed word was itself added to the final list, and then compared to the other words in the vector space using the cosine distance as a means of comparison. The 10 most similar words were kept, and their respective distance to the seed word was added to that of any previous retrieval of the same word. For instance, we used “nigger” as a seed word, yielding “niggar” as a very similar one, and if “niggar” had previously been retrieved, the current cosine distance between “nigger” and “niggar” was added to that of the previous occurrence of “niggar”. In the end, we had generated a list of words mapped to accumulated cosine distances that could be sorted to retrieve the words most commonly associated with insults and other offensive words from our 2 original lists. Adding up the cosine distances of each retrieved word proved useful, as the vector space of Gab was trained on a rather small amount of data for such a task (250,000 posts), so this cosine-distance-based retrieval technique also generated noise and irrelevant data.
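The accumulation procedure can be sketched as follows. This is an illustration under assumptions, not the original code: `expand_seeds` and `most_similar` are hypothetical names, the neighbour table is made up, and I assume the accumulated score is a cosine similarity (higher means closer), which is what embedding libraries usually return, rather than a distance in the strict sense.

```python
def expand_seeds(seed_words, most_similar, topn=10):
    """Accumulate similarity scores of the neighbours of every seed word.

    `most_similar(word, topn)` is a hypothetical callable returning
    (neighbour, similarity) pairs, most similar first.
    """
    scores = {}
    for seed in seed_words:
        scores.setdefault(seed, 0.0)  # every seed is kept in the final list
        for neighbour, similarity in most_similar(seed, topn):
            scores[neighbour] = scores.get(neighbour, 0.0) + similarity
    # Words retrieved by several seeds accumulate higher scores.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Made-up neighbour table standing in for a trained embedding model.
toy_neighbours = {
    "seed_a": [("word_x", 0.9), ("word_y", 0.5)],
    "seed_b": [("word_x", 0.8), ("word_z", 0.6)],
}
ranked = expand_seeds(
    ["seed_a", "seed_b"],
    lambda word, topn: toy_neighbours[word][:topn],
)
```

With a trained gensim model, the callable could be something like `lambda word, topn: model.wv.most_similar(word, topn=topn)`. Words returned for several seeds rise to the top, which helps to push one-off noise from a small model down the list.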

Each word in the list was then annotated along 2 axes/dimensions: one representing the level/degree of offensiveness (from 0 to 4) and another reflecting the nature or the topic associated with said word (racial, political, religious, etc.), based on previous work by CLiPS in German and Dutch. Topics were not mutually exclusive, and multiple topics could be associated with one word. The manual review of words one by one was also an opportunity to get rid of irrelevant words. However, it must be noted that the boundary between relevant and irrelevant can sometimes be fuzzy, as sensationalist or controversial words (“refugee”, “supremacist”, etc.) can also prove useful. Thus, when in doubt, the word remained in the list, as deleted words cannot be retrieved, while irrelevant words can always be removed later if necessary.
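One possible way to represent such an annotation in code is shown below. The class name, field names and the example entry are assumptions for illustration; only the 0 to 4 scale and the non-exclusive topics come from the description above.

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    word: str
    offensiveness: int                         # degree of offensiveness, 0 to 4
    topics: set = field(default_factory=set)   # topics are not mutually exclusive

    def __post_init__(self):
        if not 0 <= self.offensiveness <= 4:
            raise ValueError("offensiveness must be between 0 and 4")

# Made-up entry: the word and its values are placeholders, not real annotations.
entry = Annotation("example_word", 2, {"political", "religious"})
```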

I hope this post was enjoyable to read and gave a good overview of how to filter out specific data by comparison. I think the method described above works well for high-resource languages like English, given the quantitative nature of the techniques involved. Should it be transposed to other (and more specific) topics, or to languages less represented online, more precise techniques should be considered.

First month at CLiPS: outcomes and thoughts.

At the beginning of the month I knew that there was a single danger awaiting me, and it wasn't connected to software development. I was afraid of being too fascinated by the variety of opportunities that lay ahead of me. Of course, my goal was to concentrate on one thing and bring it to some sort of satisfying completion. But there were also so many things I wanted to do... I wanted to work with Pattern, CLiPS's package written in Python, to experiment with cross-language analysis, and I always had some psycholinguistics-related thoughts in my head.

Luckily for me, my mentors were very careful from the very beginning. They did everything possible to keep me from starting to work on all the projects at once. This month I had two tasks, and I will describe both in detail below.

Part one: Pattern.

Pattern makes a strong impression on a young developer. It resembles a sophisticated architectural structure. There are many things happening in it at the same time, and it seems that Pattern can be used for any purpose imaginable.

In the beginning it was difficult to get the whole program running at once. For example, several tests had lost their relevance during the months in which no new build was launched. The program behaved differently on Windows and Linux, and wasn't working on Python 2.7. Within a few days it got a little easier. The main issue I worked on was the required package “mysqlclient”, whose installation failed on some computers for no apparent reason. This was annoying for users who didn’t plan to use databases at all. I found a way around it, but more issues kept coming up and slowed me down a little.

For example, Travis CI was causing strange problems with the combination of Python 2.7 and the cherrypy package; because of that, the Python 2.7 build was always failing, to my annoyance. Some tests were outdated and used things like “StopIteration” instead of a simple “return”, which contradicts recent PEP updates.
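The “StopIteration” issue can be illustrated with the sketch below. I assume the PEP in question is PEP 479, which turns a “StopIteration” raised inside a generator into a “RuntimeError” (the default behaviour since Python 3.7); the generator names are made up and not taken from Pattern.

```python
def take_until_negative_old(numbers):
    """Outdated style: ends the generator by raising StopIteration."""
    for n in numbers:
        if n < 0:
            raise StopIteration  # becomes RuntimeError on Python 3.7+
        yield n

def take_until_negative(numbers):
    """Current style: a plain return ends the generator."""
    for n in numbers:
        if n < 0:
            return
        yield n

values = list(take_until_negative([1, 2, -1, 3]))  # [1, 2]

try:
    list(take_until_negative_old([1, 2, -1, 3]))
    old_style_error = None
except RuntimeError as error:
    old_style_error = error  # "generator raised StopIteration"
```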

In the process, I also accepted one of the pull requests that improves English verb inflection, among other things (the request also includes alternative solutions to the problems with the tests), and added a list of profanity words in Russian.

By the end of the month, there was only one test in the whole program that still behaved semi-randomly. Very soon I will merge everything into the master branch, and will be able to say that the task is completed!

Part two: Hate speech

Initially, I was planning to compare politically motivated hate speech concerning the American-Russian relationship, but there were a lot of problems with that, and it seemed too complicated a task for the GSOC period. So instead I decided to focus on hate speech in Russian, with the possibility of combining it later with cross-lingual analysis. I read a few papers before starting the project and tried to write down all the findings, comparing approaches, error analyses and data sets. The table of all recorded findings is freely available here and will be updated.

Defining hate speech was already quite complicated in itself, and many people confuse it with offensive speech (as noted in the article "Measuring the Reliability of Hate Speech Annotations..." by Ross et al.), so I decided to concentrate on one type of hate speech: sexism. I had several reasons to pick it: first, several sexist laws had recently been passed in Russia, which kept provoking discussion, and second, I found a few datasets on sexism in English online, which could help me a lot if I decided to work on cross-lingual analysis.

I collected two types of data.

For my first type of data, I collected comments on news posts in the Russian social network “Vkontakte”. I picked three major news pages, each with a slightly different audience and different censorship rules. I looked for news that could trigger sexist discussions, using the VK API and a list of search words. The code I wrote for that can be found here. The second type of data is denser in sexism. It comes from a forum whose name could literally be translated as “Anti-females”, created by people who are trying to live by “patriarchy standards”. They discuss (in all seriousness!) such things as “female logic” (as an example of females being incapable of thinking logically), the place of a female in the world, etc.

It seems to be perfect for sexism detection; the only problem is the format of the data, which differs from that of the previously collected comments. (But there are several approaches I am planning to try out for that.)

So far I have collected approximately 16,000 sexist and non-sexist comments. The annotated corpora can be found here, and those still to be annotated can be found here. Right now the annotation is imperfect, because it is done only by me; for higher accuracy we would have to hire a few more Russian speakers. I hope to be done with my annotation by July 3, and I am then planning to start working on the model.

Fabricio Layedra: Report - Phase 1

Initial work on viNLaP:

  • Exploratory Data Analysis of the MAGA Corpus
  • Creation of scripts built on scattertext, a Python library, for the topical charts of viNLaP.
  • Front-end development version 1.0
  • Creation of scripts for downloading the tweets of the MAGA Corpus in order to create the next batch of visualizations based on geospatial data (latitude and longitude of the tweets).
  • Literature review for novel visualizations of TF-IDF scores.

CLiPS GSOC 2019 kicks off

  • Posted on: 22 May 2019
  • By: Guy

We are happy and proud to once again have an extremely talented team of GSOC students this year.

Fabricio Layedra will be working on viNLaP, an interactive and data-driven web dashboard with three modules. Each module is based on one of three main analyses: Spatial, Temporal and Statistical/Traditional. Each module will include traditional visualizations related to the respective analysis, as well as novel visualizations based on those proposed in the literature. In this first scenario, viNLaP is meant to visualize polarized data, but it is built to be useful for new types of datasets that may come in the future.