Phase 2 - Report: Rudresh Panchal

This post reflects on some of the milestones achieved in Phase 2 of GSoC 2018.

Phase 2 concentrated mainly on expanding the rare entity detection pipeline, adding the generalization features, and making the system being built more accessible. The following features were successfully implemented:

  • Built a custom TF-IDF system to help recognize rare entities. The system saves the intermediate token counts, so that whenever a new document or text snippet is added to the knowledge base, the TF-IDF scores do not have to be recalculated from scratch for all the tokens. The stored counts are loaded and incremented, and the relevant scores are then calculated.

  • Implemented the "Part Holonym" based generalization feature. This feature relies on lexical databases like WordNet to extract part holonyms, generalizing tokens to their lexical supersets. For example, London gets generalized to England and Beijing to China at level one generalization, and to Europe and Asia respectively at level two. The user is given the option of choosing the generalization level for each attribute.

  • Implemented the "Word Vector" based generalization feature. This finds a token's nearest neighbour in the vector space of pretrained embeddings like GloVe and replaces the token with that neighbour. For example, India gets replaced with Pakistan.

  • Implemented a general anonymization RESTful API. This gives people the option to utilize our system across different tech stacks.

  • Implemented a token-level RESTful API. This API endpoint returns information about each token, including the original token, the replaced token, the entity type and the anonymization type.

  • The API utilizes Django's token based authentication system. Implemented a dashboard to manage the authentication tokens for the users.
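
To make the incremental TF-IDF idea from the first item above more concrete, here is a minimal sketch. It is not the project's actual code: the class name `IncrementalTfidf`, its methods, the JSON storage format and the smoothed IDF formula are all assumptions made for illustration. It only shows how stored document counts can be loaded, incremented and reused, so that older documents never have to be read again.

```python
import json
import math
from collections import Counter


class IncrementalTfidf:
    """Hypothetical helper: keeps raw counts so scores can be derived on demand."""

    def __init__(self, num_docs=0, doc_freq=None):
        self.num_docs = num_docs
        # token -> number of documents that contain the token
        self.doc_freq = Counter(doc_freq or {})

    def add_document(self, tokens):
        """Increment the stored counts with one new document."""
        self.num_docs += 1
        self.doc_freq.update(set(tokens))

    def scores(self, tokens):
        """Return TF-IDF scores for one document (assumed smoothed IDF)."""
        term_freq = Counter(tokens)
        total = len(tokens)
        return {
            tok: (count / total)
            * (math.log((1 + self.num_docs) / (1 + self.doc_freq[tok])) + 1)
            for tok, count in term_freq.items()
        }

    def dump(self):
        """Serialize the intermediate counts so they can be stored."""
        return json.dumps({"num_docs": self.num_docs, "doc_freq": dict(self.doc_freq)})

    @classmethod
    def load(cls, payload):
        """Rebuild the index from previously stored counts."""
        data = json.loads(payload)
        return cls(data["num_docs"], data["doc_freq"])


# Build an index, store it, then load it and add one more document.
index = IncrementalTfidf()
index.add_document("the patient lives in london".split())
index.add_document("the patient was seen in the clinic".split())

restored = IncrementalTfidf.load(index.dump())
new_doc = "the clinic is in beijing".split()
restored.add_document(new_doc)
scores = restored.scores(new_doc)
```

In this toy run, a token that appears in only one document ("beijing") ends up with a higher score than a token that appears in every document ("the"), which is the signal a rare entity detector would look for.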
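
The "Word Vector" based generalization described above can be sketched in a similar way. The snippet below is an illustration under stated assumptions, not the project's implementation: the function names are hypothetical, and the tiny hand-written `TOY_EMBEDDINGS` table stands in for real pretrained vectors such as GloVe. It looks up the nearest neighbour of a token by cosine similarity and returns it as the replacement.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbour(token, embeddings):
    """Return the vocabulary entry closest to `token`, excluding the token itself.

    Returns None when the token is not in the vocabulary.
    """
    if token not in embeddings:
        return None
    target = embeddings[token]
    candidates = (word for word in embeddings if word != token)
    return max(
        candidates,
        key=lambda word: cosine(target, embeddings[word]),
        default=None,
    )


# Made-up 3-dimensional vectors, used only for illustration.
TOY_EMBEDDINGS = {
    "india": [0.9, 0.8, 0.1],
    "pakistan": [0.85, 0.75, 0.15],
    "london": [0.1, 0.2, 0.9],
    "beijing": [0.2, 0.1, 0.95],
}

replacement = nearest_neighbour("india", TOY_EMBEDDINGS)
```

With real embeddings the vocabulary is far larger, so a linear scan like this one would usually be replaced by a library that offers an indexed similarity search.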

Some of the major things planned for the 3rd and final phase are:

  • Code cleanup: As the project progressed, some of the code became redundant and now needs to be removed.

  • Documentation: While the code is well commented and easy to understand, the project currently lacks thorough external documentation. A quick usage guide for non-programmer end users could also be helpful.

  • A simple scaffolding system for the user. The system currently ships without any predefined configurations (including entities, aliases etc). Script(s) which can quickly set up a ready-to-use system with certain default values (including pre-defined attribute actions, threshold values etc) would be useful.

  • GUI based and API based file upload system. The user currently has to paste plaintext in the GUI and set it as a parameter in the API. The option to directly upload text files will increase user convenience.

  • Experiment with language localization. The system currently works well with the English language, but it needs to be tried out with other languages.

Picture 1: The Token level API in action