Phase 1 - Report: Rudresh Panchal
With the first coding phase of GSoC 2018 coming to an end, this post reflects upon some of the milestones achieved in the past month.
I first worked on finalizing the architecture of the text anonymization system. The system is being built with the European Union's General Data Protection Regulation (GDPR) in mind and seeks to offer a seamless solution to a company's text anonymization needs. Most existing GDPR-oriented solutions focus on anonymizing database entries, not on anonymizing plain text snippets.
My system pipeline consists of two principal components.
Entity Recognition: In this part, entities are recognized using several approaches, including Named Entity Recognition (implemented), regular expression based patterns (implemented), and TF-IDF based scores (to be implemented in Phase 2).
Subsequent action: Once an entity is recognized, the system looks up the configuration mapped to that particular attribute and carries out one of the following actions to anonymize the data: suppression, deletion, or generalization (generalization to be implemented in Phase 2).
The planned methods to generalize an attribute include a novel word vector based generalization and the extraction of part holonyms.
Some of the coding milestones achieved include:
Set up the coding environment for the development phase.
Set up the Django web app and the database.
Wrote a function to carry out text pre-processing, including removal of illegal characters, tokenization, and expansion of contractions.
Wrote and integrated wrappers for the Stanford NER system, along with its entity replacement function.
Wrote and integrated wrappers for the spaCy NER system, along with its entity replacement function.
Wrote the suppression and deletion functions. Integrated the two with a DB lookup for configurations.
Wrote the regular expression based pattern search function.
Implemented the backend and frontend of the entire Django WebApp.
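To give a flavour of the pre-processing step, here is a minimal, self-contained sketch: the contraction table and character whitelist below are simplified stand-ins for what the actual system uses, not its real tables.

```python
import re

# Small illustrative contraction table; the system's real table is larger.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "it's": "it is",
    "don't": "do not",
}

def preprocess(text):
    """Expand contractions, strip illegal characters, and tokenize."""
    # Expand contractions (case-insensitive lookup).
    pattern = re.compile("|".join(re.escape(c) for c in CONTRACTIONS),
                         re.IGNORECASE)
    text = pattern.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)
    # Remove characters outside a simple whitelist.
    text = re.sub(r"[^A-Za-z0-9\s.,!?@'-]", "", text)
    # Tokenize into words and standalone punctuation marks.
    return re.findall(r"[A-Za-z0-9'@.-]+|[.,!?]", text)
```

For example, `preprocess("I can't go!")` yields `["I", "cannot", "go", "!"]`.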
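The replacement step that follows entity recognition can be sketched like this. This is a simplified, self-contained version: the entity spans here are hard-coded where the real system gets them from the Stanford or spaCy wrappers, and the configuration lookup uses an in-memory dict instead of the database.

```python
def anonymize(text, entities, config):
    """Apply the configured action to each recognized entity span.

    entities: list of (start, end, label) spans, e.g. from an NER wrapper.
    config:   maps an entity label to an action: 'suppress' replaces the
              entity with a placeholder, 'delete' removes it entirely.
    """
    # Process spans right-to-left so earlier offsets stay valid after edits.
    for start, end, label in sorted(entities, reverse=True):
        action = config.get(label, "suppress")  # default to suppression
        replacement = "" if action == "delete" else "<{}>".format(label)
        text = text[:start] + replacement + text[end:]
    return text
```

For instance, with `entities = [(0, 5, "PERSON"), (15, 21, "LOCATION")]` on the text `"Alice moved to Berlin."` and a config that suppresses persons and deletes locations, the output is `"<PERSON> moved to ."`.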
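The regex-based pattern search can be illustrated with a couple of common patterns. The email and phone patterns below are deliberately simple examples for illustration; the real system stores its patterns per user in the database.

```python
import re

# Illustrative patterns only; not the system's actual pattern set.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d -]{7,}\d"),
}

def find_patterns(text):
    """Return (start, end, label) spans for every regex match."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)
```

The returned spans use the same (start, end, label) shape as the NER output, so the same replacement logic can anonymize both.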
Major things planned for Phase 2:
Implement a dual TF-IDF system: one model scores terms based on the documents the user has uploaded, and the other scores them using TF-IDF trained on a larger, external corpus.
Implement a word vector closest neighbor based generalization.
Implement the holonym lookup and extraction functions.
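The word vector generalization idea can be sketched with toy vectors. In practice the embeddings would come from a pretrained model (e.g. word2vec or GloVe); the tiny vector table and candidate list below are made up purely for illustration.

```python
import math

# Toy embedding table; a real system would load pretrained word vectors.
VECTORS = {
    "paris":  [0.9, 0.1, 0.0],
    "london": [0.8, 0.2, 0.1],
    "city":   [0.7, 0.3, 0.0],
    "banana": [0.0, 0.1, 0.9],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def generalize(word, candidates):
    """Replace a word with its nearest neighbour among generic candidates."""
    vec = VECTORS[word]
    return max(candidates, key=lambda c: cosine(vec, VECTORS[c]))

# e.g. generalize("paris", ["city", "banana"]) picks "city"
```

The same nearest-neighbour machinery could sit behind the holonym-based generalization too, with candidate terms drawn from a lexical resource such as WordNet's part holonyms instead of a fixed list.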
Picture 1: Shows the user dashboard, which allows users to add new attribute configurations, modify existing configurations, add aliases for the NER lookup, add regex patterns, and carry out text anonymization.
Picture 2: Shows the text anonymization system in action; the various entities and regex patterns have been recognized and replaced according to the configuration.