Word embeddings trained from 4chan and 8chan data
In this post, we explain how we collected data from the imageboard forums 4chan and 8chan and how we trained word embeddings from them, and we present a small experiment illustrating their academic potential.
Related code and files may be found here: GSoC 2019 - 4chan and 8chan Word Embeddings
4chan and 8chan are either popular or obscure corners of the Internet, depending on your familiarity with its ecosystem. They periodically attract attention from media outlets due to their ties with certain shootings and terrorist attacks, most often associated with white supremacism and neo-Nazism. Most recently, 8chan was associated with the Christchurch shooter Brenton Tarrant, who allegedly posted a livestream link and a manifesto there prior to his attack. The platform made the headlines again with the El Paso shooting and, as of writing (22/08/2019), has been permanently shut down in the aftermath of the turmoil it sparked. We therefore thought it fit to gather data about these controversial platforms so as to enable sound research on them.
Both platforms are very similar: they allow users to anonymously create “threads” on a “board” (a sub-forum centered around a certain topic), to which other users can respond. Their main difference lies in that 4chan has a fixed set of boards dedicated to different subject matters, whereas 8chan allows users to create their own boards around topics of their choosing, leading to a much higher number of boards but also to much sparser content. Because of their structural similarity, however, they share boards centered around the same topics, and we chose to investigate the so-called “/pol/” board, i.e. the board dedicated to discussing (international) politics. On both forums, /pol/ is a popular board known for hosting what we could call “toxic” content: racist and fascist posts, along with a wide variety of otherwise dubious content, the virulence of which is probably fueled by the ‘absolute’ anonymity the platforms allow.
We thus proceeded to collect the data from the /pol/ board of both platforms. Each platform exposes an API through which thread- or board-related data can be requested programmatically. However, another specificity of these platforms is that they regularly clean up older content and only retain the most recent threads, so that, despite a limited archiving mechanism on the sites themselves, the content available through that channel is too limited for training sufficiently good word embeddings. We consequently looked for other means of gathering the relevant data. Luckily, multiple sites are dedicated to archiving the content posted on 4chan, though we did not manage to find one for 8chan. For 4chan, we found archive.4plebs.org, which reaches back to late 2013; for 8chan, we resolved to collect the little information available on the site itself. 4plebs has not set up any API, so we decided to collect the data through scraping.
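To give an idea of what the programmatic channel looks like, here is a minimal sketch of pulling a thread through 4chan’s public read-only JSON API. The endpoint layout follows the public 4chan-API documentation, but the parsing helper and the sample payload below are our own illustrative inventions, not the crawler we actually ran.

```python
# Sketch of reading a /pol/ thread via 4chan's read-only JSON API.
# The sample payload is hand-made for illustration; in practice it would
# come from e.g. urllib.request.urlopen(thread_url("pol", thread_no)).
import json

API_BASE = "https://a.4cdn.org"

def thread_url(board, thread_no):
    """Build the JSON endpoint for a single thread."""
    return f"{API_BASE}/{board}/thread/{thread_no}.json"

def extract_comments(thread_json):
    """Pull the raw comment bodies (the 'com' field) out of a thread payload."""
    return [post["com"] for post in thread_json.get("posts", []) if "com" in post]

sample = json.loads('{"posts": [{"no": 1, "com": "first post"}, {"no": 2}]}')
print(thread_url("pol", 1))        # https://a.4cdn.org/pol/thread/1.json
print(extract_comments(sample))    # ['first post']
```

Note that posts without a text body (image-only posts) simply lack the `com` key, which is why the extraction filters on its presence.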
The collection of 8chan data was rather quick since, as mentioned above, the site does not host a large amount of data at any given time and seems to attract fewer users than 4chan. For 4chan, however, the quantity of data available for /pol/ alone is rather impressive. We left our crawlers running from the beginning of July until the end of July and collected approximately 30 million entries spanning 6 years (between 2013 and 2019) for 4chan and 8chan combined. However, given our crawling scheme, which let 2 scrapers run in opposite directions (i.e. from past to present and from present to past), some data are missing for the years 2015 and 2017.
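The two-direction scheme can be pictured as two workers walking the archive from opposite ends until they meet. The following generator is a simplified sketch of that idea, assuming hypothetical sequential page numbers (the real archive is organized by thread and date):

```python
# Illustrative sketch of the two-directional crawling scheme: one worker
# walks archive pages from oldest to newest, the other from newest to
# oldest, until the two meet in the middle.
def two_way_schedule(first_page, last_page):
    """Yield (direction, page) pairs, alternating between both crawlers."""
    forward, backward = first_page, last_page
    while forward <= backward:
        yield ("past->present", forward)
        forward += 1
        if forward <= backward:
            yield ("present->past", backward)
            backward -= 1

schedule = list(two_way_schedule(1, 5))
print(schedule)  # pages 1 and 5 first, meeting at page 3
```

If either worker is interrupted before the meeting point, an interior range goes uncovered, which is exactly how our gaps for 2015 and 2017 arose.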
Once the data was collected, we cleaned it through some pre-processing steps, including removing URLs and reposts, as well as tokenizing it. We reflected for some time over how best to approach the training of the word embeddings: should we remove or keep stop words (common words such as “the”, “about” or “on” that might be detrimental to certain tasks), what context window should we use, should we use CBOW or skip-gram, etc.? We also had to decide which package to work with, and we elected to go with gensim, a popular Python library optimized for training dense word embeddings. The default parameters turned out to be rather well-suited to our task, so we stuck with them.
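The cleanup steps above can be sketched as follows. This is a deliberately naive version (regex URL stripping, lowercase whitespace-style tokenization, exact-duplicate removal); the real pipeline can be richer, but the shape of the output, a list of token lists, is what gensim expects:

```python
# Rough sketch of the pre-processing described above: strip URLs,
# lowercase and tokenize each post, and drop exact duplicates (reposts).
import re

URL_RE = re.compile(r"https?://\S+")

def preprocess(posts):
    seen, sentences = set(), []
    for post in posts:
        text = URL_RE.sub(" ", post).lower()
        tokens = re.findall(r"[a-z']+", text)
        key = tuple(tokens)
        if tokens and key not in seen:   # skip empty posts and reposts
            seen.add(key)
            sentences.append(tokens)
    return sentences

sentences = preprocess([
    "Check this out http://example.com NOW",
    "check this out now",          # a repost once the URL is stripped
    "completely different post",
])
print(sentences)
# The token lists can then be fed to gensim with the defaults we kept, e.g.:
#   from gensim.models import Word2Vec
#   model = Word2Vec(sentences)   # gensim 3.x defaults: CBOW, window=5, size=100
```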
We then set up a toy experiment to show a potential use of those embeddings: we compared the distance between two given words in the 4/8chan vector space with the distance between the same two words in another vector space trained from Reddit data. We won’t go through the whole description of the training of the Reddit word embeddings again; let us simply note that we collected the Reddit data from the archiving site redditsearch.io. The words we chose to put under scrutiny (henceforth ‘target words’) are the first 50 words of the English Profanity and Offensive Words (POW) list, which we generated and annotated as part of another project of this GSoC. The words they were compared against in each vector space were very basic ones, conditioned on their part-of-speech tag (ADJ, NOUN or VERB), to give us a basic idea of the different representations of the target words. Each target word was compared against 2 words of the same grammatical category, a positive and a negative one: for adjectives, ‘good’ and ‘bad’; for nouns, ‘human’ and ‘monster’; for verbs, ‘love’ and ‘hate’.
Let us take 2 target words to illustrate how it worked: ‘jew’ and ‘communist’. As a noun, ‘jew’ was compared to both ‘human’ and ‘monster’ in both vector spaces (making 4 comparisons in total) using cosine similarity. The results are as follows:
‘human’ in Reddit: 0.27 VS ‘human’ in 4/8chan: 0.08
‘monster’ in Reddit: 0.32 VS ‘monster’ in 4/8chan: 0.31
For the target word ‘communist’ (ADJ):
‘good’ in Reddit: 0.12 VS ‘good’ in 4/8chan: -0.03
‘bad’ in Reddit: 0.21 VS ‘bad’ in 4/8chan: 0.04
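The comparison itself boils down to one cosine similarity per (target, anchor, space) triple. Below is a minimal sketch of that loop; the tiny 3-dimensional vectors are made up purely for illustration (in practice each lookup would be a `model.wv['word']` call on the two trained gensim models, and the numbers above came from those):

```python
# Minimal sketch of the toy experiment: cosine similarity between a target
# word and a positive/negative anchor, in each of the two vector spaces.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Made-up 3-d vectors standing in for the real 100-d embeddings.
space_reddit = {"jew": [1.0, 0.2, 0.0], "human": [0.9, 0.5, 0.1], "monster": [0.4, 0.9, 0.2]}
space_chans  = {"jew": [0.1, 1.0, 0.3], "human": [0.9, 0.5, 0.1], "monster": [0.4, 0.9, 0.2]}

for anchor in ("human", "monster"):
    r = cosine(space_reddit["jew"], space_reddit[anchor])
    c = cosine(space_chans["jew"], space_chans[anchor])
    print(f"'{anchor}' in Reddit: {r:.2f} VS '{anchor}' in 4/8chan: {c:.2f}")
```

Cosine similarity ranges from -1 to 1, which is why slightly negative values such as the -0.03 above are possible: the two words’ vectors point in mildly opposite directions in that space.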
Evidently, the toy experiment above has many methodological shortcomings, but it is not intended as academic work, for now. The aim was to show a possible use of these embeddings and possible trends to be explored in each community’s language representation. We hope this post gave a clearer view of the process of gathering data from platforms such as 4chan and 8chan, as well as of the uses that can be made of the word embeddings resulting from such data.