The project was developed through the following process:
Dataset selection
Tweets Dataset - Top 20 most followed users in Twitter social platform https://dataverse.harvard.edu/dataset.xhtml?id=3047332
I chose this dataset mainly because its ratio of tweets per person is high, which gives better clustering.
Preprocessing of the dataset
- Filtered out all non-English tweets, using the language column of the dataset.
- Removed URLs from all the files.
- Replaced the # and @ symbols with '' while keeping the hashtag and mention text itself, because hashtags and mentions can contain contextual information.
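The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the column names `text` and `lang`, the language code `en`, and the URL pattern are all assumptions and may differ from the dataset.

```python
import re

# Assumed URL pattern: anything starting with http(s):// or www.
URL_RE = re.compile(r"https?://\S+|www\.\S+")


def clean_tweet(text: str) -> str:
    """Remove URLs, drop the # and @ symbols, and collapse whitespace."""
    text = URL_RE.sub("", text)
    # Keep the hashtag/mention words; only the symbols are removed.
    text = text.replace("#", "").replace("@", "")
    return " ".join(text.split())


def preprocess(rows, text_key="text", lang_key="lang"):
    """Keep English rows only and clean their text.

    `text_key` and `lang_key` are hypothetical column names.
    """
    return [
        {**row, text_key: clean_tweet(row[text_key])}
        for row in rows
        if row.get(lang_key) == "en"
    ]
```

For example, clean_tweet("Loving #AI with @nasa https://t.co/abc123 today") keeps the words "AI" and "nasa" but drops the symbols and the link.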
Generating embeddings
I tried several sentence transformers to obtain contextual embeddings of the tweet text, then applied Uniform Manifold Approximation and Projection (UMAP) to reduce their dimensionality and help visualize the clusters of Twitter emotions.
Based on the resulting clusters, I chose the best sentence transformer for this task: all-mpnet-base-v2.
- Clustering in all-distilroberta-v1 embedding space
- Clustering in all-mpnet-base-v2 embedding space
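A sketch of how the embedding and projection step might look, assuming the sentence-transformers and umap-learn packages. The function name, the 2-D target dimension, the cosine metric, and the random seed are illustrative assumptions, not settings taken from the project.

```python
# The two candidate models named above.
MODEL_NAMES = ["all-distilroberta-v1", "all-mpnet-base-v2"]


def embed_and_project(tweets, model_name, n_components=2, random_state=42):
    """Embed tweets with a sentence transformer, then reduce them with UMAP.

    Returns (embeddings, projection). Parameter values are assumptions.
    """
    # Imported here so the module loads even without these packages installed.
    from sentence_transformers import SentenceTransformer
    import umap

    model = SentenceTransformer(model_name)
    embeddings = model.encode(tweets, show_progress_bar=False)

    reducer = umap.UMAP(
        n_components=n_components,
        metric="cosine",
        random_state=random_state,
    )
    projection = reducer.fit_transform(embeddings)
    return embeddings, projection
```

Calling embed_and_project(tweets, name) once per entry in MODEL_NAMES and plotting each projection is one way to compare the embedding spaces side by side.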
Clustering and result
Finally, HDBSCAN was applied to the generated contextual embeddings to obtain a density-based hierarchical clustering, which assigns each tweet a numeric topic label. Gephi was then used to build a network graph linking each celebrity to their topic clusters.
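The clustering and graph export could be sketched as below, assuming the hdbscan package. The `min_cluster_size` value, the `topic_N` node naming, and the choice to weight each edge by tweet count are assumptions for illustration. The Source/Target/Weight header follows the column names Gephi recognizes when importing an edge list from a spreadsheet.

```python
import csv
import io
from collections import Counter


def cluster_labels(vectors, min_cluster_size=15):
    """Return one HDBSCAN label per vector; -1 marks noise points."""
    import hdbscan  # imported lazily; min_cluster_size is an assumed value

    return hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(vectors)


def topic_edges(users, labels):
    """Count tweets per (user, topic) pair, skipping noise (label -1)."""
    counts = Counter(
        (user, int(label)) for user, label in zip(users, labels) if label != -1
    )
    return [(user, f"topic_{topic}", n) for (user, topic), n in sorted(counts.items())]


def edges_to_csv(edges):
    """Serialize edges as CSV text with a Source,Target,Weight header."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    writer.writerows(edges)
    return buf.getvalue()
```

Writing the output of edges_to_csv to a file gives an edge list that can be loaded into Gephi, with each celebrity and each topic as a node.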