Towards Building a Recommender System to improve Diversity in Online News Feed

Date of Award

12-2024

Degree Name

Master of Science

Department

Computer Science

First Advisor

Shameek Bhattacharjee, Ph.D.

Second Advisor

Ajay Gupta, Ph.D.

Third Advisor

Li Yang, Ph.D.

Keywords

BERT, diversity index, information retrieval, information theory, LDA, topic modeling

Access Setting

Masters Thesis-Abstract Only

Restricted to Campus until

12-1-2026

Abstract

This research proposes a method for computing a diversity index for a news feed, with the aim of presenting a spectrum of the news articles available on the Internet and giving fair representation to most perspectives and opinions. Current social media recommendation algorithms are tuned to each user's interests in order to increase engagement on the platform. This creates a gap in the information the user receives. The objective of the research is to fill this gap by offering a much more diverse set of news articles from the Internet.

First, the required data set on a subject is collected from the Internet using the SerpDev search API [20] and CrewAI’s website scraping tool [5]. Once the articles are extracted from the different websites, their characteristics are brought to a common format through pre-processing steps.
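The abstract does not list the individual pre-processing steps, so the following is only a minimal sketch of what bringing scraped articles to a common format might look like. The function name `preprocess`, the short stop-word list, and the choice of steps (tag stripping, entity decoding, lowercasing, stop-word removal) are all assumptions made for illustration.

```python
import html
import re

# Hypothetical stop-word list for illustration only; a real pipeline
# would use a much fuller list.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "on", "for"}


def preprocess(raw_text):
    """Normalize one scraped article into a list of lowercase tokens."""
    text = re.sub(r"<[^>]+>", " ", raw_text)      # drop leftover HTML tags
    text = html.unescape(text)                    # decode entities such as &amp;
    tokens = re.findall(r"[a-z]+", text.lower())  # keep alphabetic tokens only
    # Remove stop words and very short tokens.
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 2]
```

For example, `preprocess("<p>The Senate &amp; the House debate climate policy.</p>")` yields the tokens `senate`, `house`, `debate`, `climate`, `policy`.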

Multiple themes in each document of the corpus are then explored using Latent Dirichlet Allocation (LDA) [20], which uses Bayesian inference to estimate the posterior probability of topics in each document as well as the posterior probability of each word in the document. A topic is essentially a collection of words that frequently appear together in documents and represent a certain theme or subject matter. Since topics are not manually seeded in LDA, the ideal number of topics for the corpus is computed using the coherence score [4] and the perplexity score [16].
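To make the two model-selection scores concrete, the sketch below computes perplexity and a UMass-style coherence from the outputs of an already fitted topic model. It is an illustration under stated assumptions, not the thesis implementation: the exact coherence measure used in [4] is not given in the abstract, and the names `perplexity`, `umass_coherence`, `theta` (document-topic probabilities), and `phi` (topic-word probabilities) are hypothetical. In practice one would fit LDA for several candidate topic counts and prefer a count with low perplexity and high coherence.

```python
import math


def perplexity(docs, theta, phi):
    """Per-token perplexity of a fitted topic model.

    docs:  list of token lists
    theta: theta[d][k] = P(topic k | document d)
    phi:   phi[k] is a dict mapping word -> P(word | topic k)
    """
    log_lik, n_tokens = 0.0, 0
    for d, tokens in enumerate(docs):
        for w in tokens:
            # P(w | d) = sum over topics of P(k | d) * P(w | k)
            p = sum(theta[d][k] * phi[k].get(w, 0.0) for k in range(len(phi)))
            log_lik += math.log(max(p, 1e-12))  # floor avoids log(0)
            n_tokens += 1
    return math.exp(-log_lik / n_tokens)


def umass_coherence(top_words, docs):
    """UMass-style coherence of one topic (assumed measure, see lead-in).

    top_words: the topic's top words, most probable first; each word is
               assumed to occur in at least one document.
    """
    doc_sets = [set(tokens) for tokens in docs]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(all(w in s for w in words) for s in doc_sets)

    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            co = doc_freq(top_words[i], top_words[j])
            score += math.log((co + 1) / doc_freq(top_words[j]))
    return score
```

Lower perplexity means the model assigns higher probability to the observed words, while a coherence closer to zero means the topic's top words tend to co-occur in the same documents.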

To allow topics to be supplied manually, topic modeling based on Bidirectional Encoder Representations from Transformers (BERT) [6] is further explored. A set of topics is prepared to best represent the spectrum of possibilities for the corpus. The posterior probability of topics in each document, combined with a diversity scoring technique [7], is used to generate a diversity index for the article set. Here, diversity means the variability in the information being processed, transmitted, or encoded.
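The abstract does not specify the diversity scoring technique of [7]. As one plausible information-theoretic reading, the sketch below averages the per-document topic posteriors over the article set and reports the normalized Shannon entropy of the result. The function name `diversity_index` and the choice of Shannon entropy are assumptions for illustration, not the method of the thesis.

```python
import math


def diversity_index(doc_topic_probs):
    """Normalized Shannon entropy of the average topic distribution.

    doc_topic_probs: one row per article, each row a list of topic
                     probabilities that sums to 1.
    Returns a value in [0, 1]: 0 when every article covers the same
    single topic, 1 when the feed is spread evenly over all topics.
    """
    n_docs = len(doc_topic_probs)
    n_topics = len(doc_topic_probs[0])
    if n_topics < 2:
        return 0.0
    # Average topic mixture across the whole article set.
    mean = [sum(row[k] for row in doc_topic_probs) / n_docs
            for k in range(n_topics)]
    # Shannon entropy; zero-probability topics contribute nothing.
    entropy = -sum(p * math.log(p) for p in mean if p > 0)
    # Divide by the maximum possible entropy, log(number of topics).
    return entropy / math.log(n_topics)
```

Under this assumed definition, a feed of two articles on two different topics scores 1, while a feed of two articles on the same topic scores 0.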

This document is currently not available here.