Final Report: GDPR Anonymization Tool
What is text anonymization?
Text anonymization is the process of stripping identifying attributes from text, so that sensitive details are hidden and the identity of users is protected.
This system consists of two main components.
Sensitive Attribute Detection System
Before text can be anonymized, the sensitive attributes that reveal information must be identified. The system offers two detection methods, which can be used together or on their own:
Named Entity Recognition Based Detection: This relies on tagging sensitive entities in the text. The user can set up a different configuration for each entity type, which determines how entities of that type are anonymized. The available options are Deletion/Replacement, Suppression, and Generalization. The system currently ships with spaCy's NER system, but this can easily be swapped for another NER model.
TF-IDF Based Rare Entity Detection: Some sensitive attributes may not be tagged by the NER system. These tokens can be caught by the TF-IDF system. Term frequency–inverse document frequency (TF-IDF) scoring identifies possibly rare entities based on the distribution and occurrence of tokens across sample text snippets supplied by the user. Once a TF-IDF score threshold is set, tokens scoring above it are treated as sensitive and anonymized.
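The TF-IDF thresholding described above can be sketched in a few lines of standard-library Python. This is a minimal illustration and not the tool's actual implementation: the function names `tokenize` and `rare_tokens`, the sample snippets, the smoothed IDF formula, and the threshold value are all assumptions chosen for the example.

```python
import math
import re
from collections import Counter

# Hypothetical sample snippets; in the tool these are supplied by the user.
SAMPLES = [
    "i live in a small town",
    "i work in a large office",
    "i live in a flat",
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rare_tokens(target, samples, threshold):
    """Return {token: score} for tokens of `target` scoring above `threshold`.

    Assumption: a smoothed IDF, log((1 + N) / (1 + df)) + 1, so that tokens
    never seen in the samples still get a finite score.
    """
    docs = [tokenize(s) for s in samples]
    n_docs = len(docs)

    # Document frequency: in how many sample snippets each token appears.
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))

    tokens = tokenize(target)
    if not tokens:
        return {}

    term_freq = Counter(tokens)
    flagged = {}
    for token, count in term_freq.items():
        tf = count / len(tokens)
        idf = math.log((1 + n_docs) / (1 + doc_freq[token])) + 1
        score = tf * idf
        if score > threshold:
            flagged[token] = score
    return flagged


flagged = rare_tokens("I live in Zanzibar", SAMPLES, threshold=0.5)
print(sorted(flagged))
```

With these made-up samples, common words such as "i" and "in" appear in every snippet and score low, while a token that never occurs in the samples scores highest. The right threshold depends on the corpus and would need tuning.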
Sensitive Attribute Anonymization System
Once the sensitive attributes/tokens are detected, they need to be anonymized depending on the kind of token they are. The user can set different anonymization actions for different tokens. The currently available options are:
Deletion/Replacement: To be used where retaining even part of the attribute, as the other anonymization methods do, is not appropriate. The attribute is completely replaced with a pre-set replacement. Example:
"My name is John Doe" would be replaced by
"My name is <Name>".
Suppression: To be used when hiding part of the information is enough to protect the user's anonymity. The user can supply the percentage or the number of characters to be suppressed. Example:
"My phone number is 9876543210" would be replaced by
"My phone number is 98765*****" if the user chooses 50% suppression.
Generalization: To be used when the entity is sensitive enough to need anonymization but can still be partially retained to provide information. The system has two methods of carrying out generalization:
Word Vector Based: In this option, the nearest neighbor of the word in the vector space is used to generalize the attribute. Example:
"I live in India" gets generalized to
"I live in Pakistan". This method, while completely changing the word, largely retains the vector-space information that is useful in most NLP and text processing tasks.
Part Holonym Based: In this option, the system queries the WordNet lexical database to extract part holonyms. This method works especially well with geographical entities. The user can choose the level of generalization. Example:
"I live in Beijing" gets generalized to
"I live in China" at level 1 of generalization and to
"I live in Asia" at level 2 of generalization.
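To make the anonymization actions above concrete, the following sketch implements replacement, suppression, and level-based generalization as plain Python functions and dispatches them by entity label. It is an illustration under stated assumptions, not the tool's code: the names `replace`, `suppress`, `generalize`, `anonymize`, and `ACTIONS`, the entity labels, and the hard-coded `HOLONYM_CHAIN` (standing in for a WordNet part-holonym lookup) are all hypothetical. Masking the trailing characters is assumed because it matches the phone number example. Word vector based generalization is not covered here.

```python
import math

# Hypothetical part-holonym chains standing in for a WordNet lookup.
HOLONYM_CHAIN = {
    "Beijing": ["China", "Asia"],
}


def replace(value, placeholder):
    """Deletion/replacement: discard the value and return the placeholder."""
    return placeholder


def suppress(value, fraction=0.5, mask="*"):
    """Suppression: mask the trailing `fraction` of the characters."""
    if not 0 <= fraction <= 1:
        raise ValueError("fraction must be between 0 and 1")
    n_masked = math.ceil(len(value) * fraction)
    return value[: len(value) - n_masked] + mask * n_masked


def generalize(value, level=1, chains=HOLONYM_CHAIN):
    """Generalization: climb `level` steps up the holonym chain.

    Unknown values are returned unchanged; levels beyond the end of the
    chain fall back to the most general entry.
    """
    chain = chains.get(value, [])
    if level < 1 or not chain:
        return value
    return chain[min(level, len(chain)) - 1]


# Assumed per-label configuration; the labels are illustrative only.
ACTIONS = {
    "PERSON": lambda v: replace(v, "<Name>"),
    "PHONE": lambda v: suppress(v, fraction=0.5),
    "GPE": lambda v: generalize(v, level=1),
}


def anonymize(text, entities):
    """Apply the configured action to each detected (surface, label) pair."""
    for surface, label in entities:
        action = ACTIONS.get(label)
        if action is not None:
            text = text.replace(surface, action(surface))
    return text


print(anonymize("My name is John Doe", [("John Doe", "PERSON")]))
print(anonymize("My phone number is 9876543210", [("9876543210", "PHONE")]))
print(anonymize("I live in Beijing", [("Beijing", "GPE")]))
```

Keeping the per-label configuration in a single mapping mirrors the idea that the user chooses a different action for each kind of token. Entities with no configured action are left untouched.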