GSOC 2019 reports

First month at CLiPS: outcomes and thoughts.

At the beginning of the month I knew that a single danger awaited me, and it had nothing to do with software development. I was afraid of being too fascinated by the variety of opportunities that lay ahead of me. Of course, my goal was to concentrate on one thing and bring it to a satisfying completion. But there were so many things I wanted to do... I wanted to work with Pattern, the CLiPS package written in Python, and to experiment with cross-language analysis, and I always had some psycholinguistics-related thoughts in my head.

Luckily for me, my mentors were very careful from the very beginning. They did everything possible so that I wouldn't start working on all the projects at once. This month I had two tasks, which I describe in detail below.

Part one: Pattern.

Pattern makes a strong impression on a young developer. It resembles a sophisticated architectural structure. Many things happen in it at the same time, and it seems that Pattern can be used for any purpose imaginable.

In the beginning it was difficult to run the whole program at once. For example, several tests had lost their relevance during the months in which no new build was launched. The program behaved differently on Windows and Linux, and did not work on Python 2.7. After a few days it got a little easier. The main issue I worked on was the required package “mysqlclient”, which failed on some computers for no apparent reason. This was annoying for users who didn’t plan to use databases at all. I figured out a way around it, but more issues kept coming up and slowed me down a little.
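
One common way to keep an optional database driver from getting in the way is to import it lazily and fail only when a database is actually requested. The sketch below illustrates that general idea. It is not necessarily the workaround used in Pattern, and the helper names `optional_import` and `require` are hypothetical. The only library-specific assumption is that the mysqlclient package is imported under the module name `MySQLdb`.

```python
import importlib


def optional_import(name):
    """Return the module called `name`, or None if it is not installed."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


# mysqlclient provides a module named MySQLdb; it may be absent here.
MySQLdb = optional_import("MySQLdb")


def require(module, name):
    """Raise a clear error only when the optional module is really needed."""
    if module is None:
        raise ImportError(
            "%s is not installed; install it to use database features" % name
        )
    return module
```

With this layout, code paths that never touch a database never trigger the error; only the functions that need the driver call `require(MySQLdb, "mysqlclient")` first.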

For example, Travis CI was causing strange problems with the combination of Python 2.7 and the cherrypy package; because of that, the Python 2.7 build was always failing, to my annoyance. Some tests were outdated and used things like “StopIteration” instead of a simple “return”, contradicting recent PEP updates.
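
The “StopIteration” issue can be illustrated with a small sketch. The PEP in question is presumably PEP 479, under which a `StopIteration` raised inside a generator body is turned into a `RuntimeError` (the default behaviour since Python 3.7). The two functions below are hypothetical examples, not code from Pattern's test suite.

```python
def first_n_old_style(items, n):
    """Outdated pattern: ends the generator by raising StopIteration."""
    for i, item in enumerate(items):
        if i >= n:
            raise StopIteration  # becomes RuntimeError on Python 3.7+
        yield item


def first_n(items, n):
    """Current pattern: a plain return ends the generator cleanly."""
    for i, item in enumerate(items):
        if i >= n:
            return
        yield item


print(list(first_n("abcde", 2)))

try:
    list(first_n_old_style("abcde", 2))
except RuntimeError as error:
    print("old style failed:", error)
```

Replacing the `raise StopIteration` with `return` keeps the behaviour the same on older interpreters while avoiding the error on newer ones.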

In the process, I also accepted one of the pull requests to improve English verb inflexion etc. (the request also includes alternatives for solving problems with tests) and added a list of Russian profanity words.

By the end of the month, only one test in the whole program still behaves semi-randomly. Very soon I will merge everything into the master branch and will be able to say that the task is complete!

Part two: Hate speech

Originally I was planning to compare politically motivated hate speech concerning the American-Russian relationship, but there were a lot of problems with that, and it seemed too complicated a task for the GSOC period. So instead I decided to focus on hate speech in Russian, with the possibility of combining it later with cross-lingual analysis. Before starting the project I read a few papers and tried to write down all the findings, comparing approaches, error analyses and data sets. The table of all recorded findings is freely available here and will be updated.

The definition of hate speech is in itself quite complicated, and a lot of people confuse it with offensive speech (as noted in the article "Measuring the Reliability of Hate Speech Annotations..." by Ross et al.), so I decided to concentrate on one type of hate speech: sexism. I had several reasons to pick it. First, Russia recently passed several sexist laws, which always provoked discussion. Second, I found a few English-language sexism datasets online, which could help me a lot if I decide to work on cross-lingual analysis.

I collected two types of data.

For the first type, I collected comments on news posts in the Russian social network “Vkontakte”. I picked three major news pages, each with a slightly different audience and censorship rules. I looked for news that could trigger sexist discussions, using the VK API and a query of search words. The code I wrote for that can be found here.

The second type of data is denser in sexism. It is a forum whose name could literally be translated as “Anti-females”, created by people who are trying to live by “patriarchy standards”. They discuss (in all seriousness!) such things as “female logic” (as an example of females being incapable of thinking logically), the place of a female in the world, etc.
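
The search-word selection of news posts used for the first type of data can be sketched roughly as follows. This is a simplified illustration, not the collection script mentioned above: it does not call the VK API, the `posts` records and their `id` and `text` fields are hypothetical, and the English search words stand in for the Russian query.

```python
def matches_query(text, search_words):
    """True if any search word occurs in the text (case-insensitive)."""
    lowered = text.lower()
    return any(word.lower() in lowered for word in search_words)


def select_posts(posts, search_words):
    """Keep only the posts whose text matches at least one search word."""
    return [post for post in posts if matches_query(post["text"], search_words)]


# Hypothetical records standing in for posts fetched from a news page.
posts = [
    {"id": 1, "text": "New law on maternity leave is discussed"},
    {"id": 2, "text": "Weather forecast for the weekend"},
    {"id": 3, "text": "Feminism rally held in the city centre"},
]
search_words = ["maternity", "feminism"]

selected = select_posts(posts, search_words)
print([post["id"] for post in selected])
```

In a real pipeline the comments under the selected posts would then be downloaded in a separate step.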

It seems to be perfect for sexism detection; the only problem is the form of the data, which differs from that of the previously collected comments. (But there are several approaches I am planning to try out for that.)

So far I have collected approximately 16,000 sexist and non-sexist comments. The annotated corpora can be found here, and the ones that are yet to be annotated can be found here. Right now the annotation is imperfect, because it is done only by me; for higher accuracy we would have to hire a few more Russian speakers. I hope to finish my annotation by 3 July, and then I plan to start working on the model.

Fabricio Layedra: Report - Phase 1

Initial work on viNLaP:

  • Exploratory Data Analysis of the MAGA Corpus
  • Creation of scripts built on scattertext, a Python library, for the topical charts of viNLaP.
  • Front-end development version 1.0
  • Creation of scripts for downloading the tweets of the MAGA Corpus in order to create the next batch of visualizations based on geospatial data (latitude and longitude of the tweets).
  • Literature review for novel visualizations of TF-IDF scores.
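
For readers unfamiliar with the TF-IDF scores mentioned in the last item, here is a minimal sketch of one common textbook formulation (term frequency multiplied by the logarithm of inverse document frequency). It is not code from viNLaP; the function name `tf_idf` and the toy documents are made up for illustration, and libraries often use smoothed variants of the formula.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per tokenized document.

    tf  = count of term in document / number of tokens in document
    idf = log(number of documents / number of documents containing term)
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document

    scores = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Toy, already-tokenized documents (hypothetical data).
docs = [
    "great wall great again".split(),
    "build the wall".split(),
]
for doc_scores in tf_idf(docs):
    print(doc_scores)
```

A term that appears in every document (here "wall") gets a score of zero under this formulation, which is why such scores are useful for highlighting the words that distinguish one group of texts from another.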

CLiPS GSOC 2019 kicks off

  • Posted on: 22 May 2019
  • By: Guy

We are happy and proud to once again have an extremely talented team of GSOC students this year.

Fabricio Layedra will be working on viNLaP, an interactive and data-driven web dashboard with three modules. Each module is based on one of three main types of analysis: Spatial, Temporal and Statistical/Traditional. Each module will include traditional visualizations related to the respective analysis, as well as novel visualizations based on those proposed in the literature. In this first scenario viNLaP is meant to visualize polarized data, but it is built to be useful for new types of datasets in the future.