This repository contains the source code for my Master's thesis project.
This thesis explores the potential of natural language processing (NLP) in the social sciences, specifically the clustering of contextual word embeddings. However, the limited interpretability of these techniques makes it difficult to understand what each cluster represents. To address this issue, the thesis proposes a strategy that gives social scientists a human-friendly explanation of each word cluster, derived from the contextual information around each clustered word.
Using various explainability techniques, salience scores are generated to rank the contextual elements of sentences in order of importance. Then, a probing classifier evaluates the information highlighted by each explainability technique and predicts the cluster to which each embedded word belongs.
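The ranking step described above can be illustrated with a minimal sketch. It assumes the salience scores have already been computed by some explainability technique and are given as one number per token. The function name `rank_context_by_salience`, the example sentence, and the score values are hypothetical and are not taken from the thesis code.

```python
from typing import List, Tuple


def rank_context_by_salience(
    tokens: List[str],
    scores: List[float],
    target_index: int,
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the top_k context tokens, most salient first.

    The target word itself is excluded, since the goal is to rank
    the context around it. All names here are illustrative.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    # Keep every token except the target word, paired with its score.
    context = [
        (token, score)
        for i, (token, score) in enumerate(zip(tokens, scores))
        if i != target_index
    ]
    # Sort by salience score, highest first.
    context.sort(key=lambda pair: pair[1], reverse=True)
    return context[:top_k]


# Made-up scores for illustration only; "bank" (index 1) is the target word.
tokens = ["the", "bank", "approved", "the", "loan"]
scores = [0.05, 0.40, 0.30, 0.02, 0.55]
top_context = rank_context_by_salience(tokens, scores, target_index=1, top_k=2)
print(top_context)
```

In a setup like this, the highest-ranked context tokens would be the information passed on to the probing classifier, which then tries to predict the cluster of the target word from them.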
The results of this thesis indicate that explainability techniques can generate informative explanations that help us understand the distinctions between different clusters of contextual word embeddings. Ultimately, we believe that our work can help social scientists be more confident in using contextual word embeddings for various NLP tasks.
Install the module from the main folder:
pip install MasterThesis
The available arguments are:
--sentences_generation: generates sentences from the datasets.
--clustering_embeddings: clusters the embeddings.
--extract_sentences_with_target: extracts the sentences that contain the target word.
--salience_extraction: extracts the salience scores.
--training_classifier: trains the classifier.
To run the script, use the following command:
python -m marc_thesis [argument]
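For readers unfamiliar with flag-style command lines, the sketch below shows one way such arguments could be parsed with Python's standard `argparse` module. It is an illustration only: the `STAGES` mapping, `build_parser`, and `selected_stages` are hypothetical names, and the actual `marc_thesis` entry point may be implemented differently. Whether several flags can be combined in one call is also an assumption.

```python
import argparse
from typing import List

# Flag names and descriptions copied from the list above.
STAGES = {
    "sentences_generation": "generates sentences from the datasets",
    "clustering_embeddings": "clusters the embeddings",
    "extract_sentences_with_target": "extracts sentences with target",
    "salience_extraction": "extracts salience",
    "training_classifier": "trains the classifier",
}


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with one boolean flag per stage (hypothetical layout)."""
    parser = argparse.ArgumentParser(prog="marc_thesis")
    for name, help_text in STAGES.items():
        parser.add_argument(f"--{name}", action="store_true", help=help_text)
    return parser


def selected_stages(argv: List[str]) -> List[str]:
    """Return the names of the stages requested on the command line."""
    args = build_parser().parse_args(argv)
    return [name for name in STAGES if getattr(args, name)]


print(selected_stages(["--clustering_embeddings"]))
```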
This project requires the environment variable MARC_THESIS_EXPERIMENT_FOLDER to be set. It must point to a folder that stores everything generated by the different functionalities and serves as their shared read/write storage.
mkdir experiments_folder
export MARC_THESIS_EXPERIMENT_FOLDER=/absolute/path/experiments_folder
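Before running a long stage, it can be useful to check that the variable is set and points to an existing folder. The following is a minimal sketch of such a check. The helper `get_experiment_folder` is hypothetical and not part of the project, and the project itself may validate the variable differently.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

ENV_VAR = "MARC_THESIS_EXPERIMENT_FOLDER"


def get_experiment_folder(environ: Optional[Mapping[str, str]] = None) -> Path:
    """Return the experiment folder as a Path, or raise if it is unusable.

    `environ` defaults to os.environ; passing a plain dict makes the
    function easy to try out without touching the real environment.
    """
    env = os.environ if environ is None else environ
    value = env.get(ENV_VAR)
    if not value:
        raise RuntimeError(f"{ENV_VAR} is not set")
    folder = Path(value)
    if not folder.is_dir():
        raise RuntimeError(f"{ENV_VAR} does not point to an existing folder: {folder}")
    return folder


try:
    print(f"Using experiment folder: {get_experiment_folder()}")
except RuntimeError as error:
    # Reported instead of raised so the check can run in any shell session.
    print(f"Configuration problem: {error}")
```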