Project 3 Final Proposal

August 5th, 2020


Team 30: cise rox!

Jack Polk

  • Packaging (for user and developer consumption)
  • CLI interface

Jack Driscoll

  • Program logic
  • CLI interface

Samantha Dorfman

  • Media coordinator and presentation manager
  • Team organizer

Presentation

https://www.youtube.com/watch?v=LytI9WQ-h0A

Project Page

https://public.ranvier.net/Sites/cop3530-project3/


Problem & Motivation

Social media has the potential to be a powerful tool for both positive and negative ends. On Twitter, the latter tends to dominate, with toxic behavior being more prevalent on the platform than on comparable social networks[1]. At the same time, Twitter has long provided tools that make it easy to scrape and collect data from the platform, such as tweets, user accounts, and trending topics. One example of this being put to use is the Sentiment140 dataset[2], which contains 1.6 million tweets along with their timestamps, users, and a rating of each tweet's positivity or negativity. In an attempt to quantify the level of negativity found on Twitter in an efficient and scalable manner, we propose a command-line interface (CLI) tool called the Twitter Polarity Indexer (TPI), which sorts through these tweets and categorizes users by the average of their tweets' polarities (where positive tweets have a polarity of 4 and negative tweets a polarity of 0).

Features

The TPI has several key features that, once complete, will indicate that it is ready to serve as a viable polarity-indexing solution for the platform:

  • Parse the full Sentiment140 dataset and compute each user's average polarity
  • Look up individual users and browse polarity groups
  • Report the most negative and most positive users in the dataset
  • An interactive CLI for exploring the results


Tools

The following languages, libraries, and tools will be used:

Language

  • Python 3[3] for the CLI and program logic

Libraries

  • tqdm[4] and rich[5] for report formatting

Tools

  • Visual Studio Code for our development environment
  • GitHub for version control
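
As a rough illustration of how tqdm and rich fit together for report formatting, the sketch below draws a progress bar while rows are processed and renders the result as a table; the sample data and column names are placeholders, not our final report format.

    from tqdm import tqdm
    from rich.console import Console
    from rich.table import Table

    # Placeholder rows standing in for computed (username, average polarity) pairs.
    results = [("robotickilldozr", 4.0), ("grumpyuser", 0.0)]

    table = Table(title="User Polarity Report")
    table.add_column("User")
    table.add_column("Avg. Polarity", justify="right")

    # tqdm wraps any iterable and renders a progress bar as it is consumed.
    for user, avg in tqdm(results, desc="Building report"):
        table.add_row(user, f"{avg:.2f}")

    Console().print(table)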

Data

Dataset Schema

Our dataset follows a single schema, derived from Sentiment140[6]:

The data is a CSV with the emoticons removed. Each row has six fields:

0 - the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)

1 - the id of the tweet (2087)

2 - the date of the tweet (Sat May 16 23:58:44 UTC 2009)

3 - the query (lyx). If there is no query, then this value is NO_QUERY.

4 - the user that tweeted (robotickilldozr)

5 - the text of the tweet (Lyx is cool)
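
To make the schema concrete, here is a minimal parsing sketch using Python's built-in csv module. The Tweet record and read_tweets names are our own illustrative choices, and the latin-1 encoding is an assumption about how the Sentiment140 CSV is distributed.

    import csv
    from collections import namedtuple

    # One record per CSV row, mirroring the six fields listed above.
    Tweet = namedtuple("Tweet", ["polarity", "tweet_id", "date", "query", "user", "text"])

    def read_tweets(path):
        """Yield Tweet records from a Sentiment140-style CSV (no header row)."""
        with open(path, newline="", encoding="latin-1") as f:
            for row in csv.reader(f):
                yield Tweet(int(row[0]), row[1], row[2], row[3], row[4], row[5])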

Data Structures and Algorithms

* Permission from Professor Kapoor was given for us to use the standard Python dictionary structure rather than implementing our own.

Analysis

Initially, we intended to implement an AVL tree to sort usernames and a min-max heap to retrieve the most negative and positive users based on their average polarity. Each node in the AVL tree would hold the average polarity of a user, and the nature of the min-max heap would let us pull the extremes from the dataset. Once we began implementing this, however, we realized the min-max heap was redundant: the extremes can be determined while parsing the dataset. Additionally, because the dataset is discrete, with every tweet rated exactly 0 or 4, the extremes always end up at one of those two values due to the number of users with only one or two tweets. Storing the usernames in a dictionary instead proved to be the more efficient approach, and after discussing the possibility of implementing our own, we opted for the standard implementation for simplicity. Since we still implemented our own AVL tree for the polarities, we were able to compare the two data structures' retrieval times for both polarity groups and user searches (we stored users and polarity groups in both AVL trees and dictionaries to measure access time for each structure).
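
The dictionary approach reduces to a single pass over the tweets: accumulate a running (sum, count) per user, then divide. The sketch below, which consumes the hypothetical read_tweets helper from the Data section, also shows why the min-max heap was redundant: once the averages exist, the extremes fall out of a plain min and max.

    from collections import defaultdict

    def average_polarities(tweets):
        """One pass over the tweets: per-user (sum, count), reduced to averages."""
        totals = defaultdict(lambda: [0, 0])
        for t in tweets:
            totals[t.user][0] += t.polarity
            totals[t.user][1] += 1
        return {user: s / c for user, (s, c) in totals.items()}

    # "training.csv" is a placeholder path for the Sentiment140 training file.
    averages = average_polarities(read_tweets("training.csv"))
    most_negative = min(averages, key=averages.get)  # no heap required
    most_positive = max(averages, key=averages.get)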

For the user interface, our original plan was a non-interactive CLI that would allow for scripted use of the program, paired with an HTML report the user could view and interact with in the browser. As we developed the frontend, we quickly realized that the amount of information that would need to be embedded in the HTML report made meaningful interaction infeasible without complex JavaScript, which we had not anticipated needing. Stepping away from that approach, we instead implemented an interactive CLI with the same functionality as the planned HTML export, but with far more speed and flexibility than the web solution would have provided.
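
For a flavor of the interactive CLI, here is a stripped-down command loop in the same spirit; the prompt and command names are illustrative placeholders, not the tool's actual interface.

    from rich.console import Console

    console = Console()

    def repl(averages):
        """Minimal interactive loop over a precomputed {user: average} mapping."""
        while True:
            cmd = input("tpi> ").strip()
            if cmd == "quit":
                break
            if cmd.startswith("user "):
                name = cmd.split(maxsplit=1)[1]
                console.print(averages.get(name, "[red]user not found[/red]"))
            else:
                console.print("commands: [bold]user <name>[/bold], [bold]quit[/bold]")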

Time Complexities

Function             Complexity
-------------------  ----------
User().polarity()    O(n)
main()               ???
AVL().insert()       O(log(n))
AVL().search()       O(log(n))
AVL().print_tree()   O(n)
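
The logarithmic bounds on AVL().insert() and AVL().search() come from the rebalancing rotations keeping the tree's height at O(log(n)). Below is a condensed sketch of the core insert and search logic in that style; it is a generic AVL outline, not our exact class.

    class Node:
        """Single AVL node; height is cached for balance checks."""
        def __init__(self, key, value):
            self.key, self.value = key, value
            self.left = self.right = None
            self.height = 1

    def height(node):
        return node.height if node else 0

    def rotate_right(y):
        x = y.left
        y.left = x.right
        x.right = y
        y.height = 1 + max(height(y.left), height(y.right))
        x.height = 1 + max(height(x.left), height(x.right))
        return x

    def rotate_left(x):
        y = x.right
        x.right = y.left
        y.left = x
        x.height = 1 + max(height(x.left), height(x.right))
        y.height = 1 + max(height(y.left), height(y.right))
        return y

    def insert(node, key, value):
        """Standard recursive AVL insert; rebalances on the way back up."""
        if node is None:
            return Node(key, value)
        if key < node.key:
            node.left = insert(node.left, key, value)
        elif key > node.key:
            node.right = insert(node.right, key, value)
        else:
            node.value = value  # duplicate key: update in place
            return node
        node.height = 1 + max(height(node.left), height(node.right))
        balance = height(node.left) - height(node.right)
        if balance > 1 and key < node.left.key:    # left-left
            return rotate_right(node)
        if balance < -1 and key > node.right.key:  # right-right
            return rotate_left(node)
        if balance > 1:                            # left-right
            node.left = rotate_left(node.left)
            return rotate_right(node)
        if balance < -1:                           # right-left
            node.right = rotate_right(node.right)
            return rotate_left(node)
        return node

    def search(node, key):
        """Walk down one branch per step: O(log(n)) in a balanced tree."""
        while node and node.key != key:
            node = node.left if key < node.key else node.right
        return node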

Reflection

As a group, we worked really well together. We each brought strengths to the table, and we were able to deliver an efficient program despite the wrong turns we took. We had planned to implement an AVL tree for the usernames and a min heap for the polarities. After working on this approach for a few days, we realized that the purpose of the project was to compare two data structures, and with the min heap storing the polarities and the AVL tree the usernames, we were not following the project guidelines at all. With only a few days until the due date, we restructured our ideas, code, and output to fit the project requirements and the time we had left. It was in this period of pressure that we produced our best ideas; it took spending time and effort on the wrong approach to arrive at the right one. We also made heavy use of GitHub, each of us testing the code that another pushed. This group really worked the way a real-world software team would.


[1] https://dl.acm.org/doi/pdf/10.1145/2818052.2869107

[2] http://www.sentiment140.com/

[3] https://docs.python.org/3/

[4] https://github.com/tqdm/tqdm

[5] https://github.com/willmcgugan/rich

[6] http://help.sentiment140.com/for-students