For the second project we worked on during GSoC 2019, we decided to transpose the work we had done before, i.e. we replicated the generation and annotation of a list of profanity and offensive words (POW) for the French language. We should note up front that, despite their name, such lists also cover words that are potentially polarizing, in addition to those that are offensive or profane. Below, we explain the data and techniques used to generate the list.
The first step in finding POW is to locate a corpus that contains them in significant quantities. We therefore looked for online sources that fit that description and ended up with two paths to explore. The first was a specific section of the French-speaking video game website “jeuxvideo.com” (JVC) called “Blabla 18-25 ans” (literally “Chitchat 18-25 years”). Although the name suggests a benign, perfectly innocent forum, it has gained attention in recent years for being a rather toxic part of the site, one teeming with Pepe the Frog GIFs, trolls and even outright right-wing or Islamist radicals. French newspapers have been reporting on this phenomenon for a few years, linking such online activity to political activism and to the recent success of the French populist right-wing party “Front National” (now rebranded “Rassemblement National”). Manual exploration of the forum also revealed that the toxicity of 18-25 had carried over to the sub-forum dedicated to discussing politics. Both sub-forums seemed like promising places to find POW and words indicative of polarized content.
The second source we identified was a French-speaking board on the imageboard 8chan. 8chan resembles the similarly named 4chan, with the difference that users can create their own boards on the site, much as users can create subreddits on Reddit. As a consequence, 8chan is full of niche boards centered on specific topics, communities or languages. One in particular came to our attention while we were browsing the Internet for extremist websites and forums. While exploring the reactionary, identitarian and right-wing extremist website “Démocratie Participative” (“Participative Democracy”), we noted that an 8chan board called “dempart” was featured on the site. After exploring the board for some time, we concluded that it suited our task in terms of both data quantity and quality.
After identifying those two sources, we gathered data from them by scraping. From JVC, we arbitrarily decided to scrape around 200 pages from the 18-25 sub-forum as well as the politics sub-forum. Collecting everything available there would have yielded an unmanageable amount of data, since the site has been active for more than 10 years. The 8chan board “dempart”, for its part, was scraped in full, as its volume of content was small. Out of curiosity, we also searched for the hashtag “#dempart” on Twitter, but it returned posts that actually discussed genuine participative democracy and had no ties to the racist group of the same name.
We then used the same technique as for the English POW list, i.e. the statistical metric of pointwise mutual information (PMI), to filter the words that are characteristic of a target corpus compared with a reference corpus. Our reference corpus was a collection of 88,000 French text messages, published in an open-source format and chosen to reflect colloquial French. We noticed, however, that including only the dempart data from 8chan yielded more relevant results than also including the JVC 18-25 data, so we ended up not using the latter.
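To make the PMI filtering step more concrete, here is a minimal Python sketch of one common way to score words of a target corpus against a reference corpus. It is an illustration under stated assumptions, not the exact code used in the project: the function name `pmi_scores`, the `min_count` threshold, the use of base-2 logarithms and the assumption that both corpora are already tokenized are all hypothetical choices. The score of a word is log2(P(word | target) / P(word)), where P(word) is estimated on the two corpora combined, so words that are over-represented in the target corpus get the highest scores.

```python
import math
from collections import Counter


def pmi_scores(target_tokens, reference_tokens, min_count=2):
    """Score each target-corpus word by its PMI with the target corpus.

    Assumes both inputs are already tokenized lists of strings.
    `min_count` (a hypothetical threshold) drops rare words, whose
    PMI estimates tend to be unreliable.
    """
    target_counts = Counter(target_tokens)
    combined_counts = target_counts + Counter(reference_tokens)
    n_target = sum(target_counts.values())
    n_total = sum(combined_counts.values())

    scores = {}
    for word, count in target_counts.items():
        if count < min_count:
            continue
        p_word_given_target = count / n_target
        p_word = combined_counts[word] / n_total
        scores[word] = math.log2(p_word_given_target / p_word)
    return scores


if __name__ == "__main__":
    # Toy placeholder tokens, not real corpus data.
    target = ["a", "a", "b", "b"]
    reference = ["b", "b", "c", "c"]
    ranked = sorted(pmi_scores(target, reference).items(),
                    key=lambda item: item[1], reverse=True)
    print(ranked)  # [('a', 1.0), ('b', 0.0)]
```

In this toy example, "a" occurs only in the target corpus and gets a positive score, while "b" is equally frequent in both corpora and scores zero. In practice, the top-ranked words would still need the kind of manual review described in this post before being added to a list.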
Finally, we also added to the general list smaller lists of less common insults found through ad hoc Internet searches. We manually reviewed them to ensure they fit our purpose before including them. Those sources include lists compiled by cruciverbalists (crossword puzzle enthusiasts) as well as one that used to ship in Android phones’ dictionaries, with a special flag for words suspected to be offensive.