I hope you’ll enjoy this first post and find it particularly useful if you want to dig deeper into the underlying ML infrastructure. I would like to walk through one use case of a big data platform implementation for a machine learning exercise. At the end of this series of posts, I would like to open a discussion on how you would tackle this topic.
Outline
In this part, I am going to describe the problem I was trying to solve and its origins. I will, of course, guide you through a description of the data. In the second part, I will brief you on some monetary aspects of the cloud solution, which will help you prototype at little cost. And, more importantly, I would love to concentrate on the technical aspects of the infrastructure as well as the machine learning approach.
Intro
The inspiration for this project was my master’s thesis. Initially, all I knew was that I wanted to do NLP-related work. But as you might know, NLP is a very broad term and consists of many different subtopics: for instance, conversational systems (chatbots), translators, summarization, or text-correction applications. These are just a part of a wider NLP universe. Obviously, I would not be able to cover all of them.
Picture 1. Modern NLP applications and project inspiration
Therefore I decided to focus on one element, namely sentiment analysis. At that time I was completely new to NLP, so I simply picked the piece that seemed most appealing to me. I was curious whether and how it is possible to apply sentiment analysis in real-world scenarios and potentially solve some business cases.
Data
Having the direction set, I started to look for the data, and you would probably agree with me that this is the most boring and tiresome part of the work. But finally, I ran across the web page of Professor Julian McAuley from the University of California, San Diego. There I found complete and, more importantly, clean datasets of historical Amazon reviews. I narrowed the range of datasets down to book reviews, just so as not to go completely crazy with the analysis. I also assumed that book reviews must be the most interesting in terms of sentiment analysis (e.g. there was a dataset for the category ‘Apps for Android’ – I don’t think it would be as exciting as the one I picked). But of course, it might be cool to analyze all the categories and, for example, compare sentiment frequencies/types.
Picture 2. Data description
This is an example of a record from the aforementioned dataset. Each review contains:
user (reviewer) ID and name
product ID
helpfulness of the review
the raw text of the review
the user’s overall score for the book
short summary
review time in Unix and regular formats
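To make the record structure concrete, here is a minimal sketch of parsing one review. The dataset is distributed as one JSON object per line; the field names below (`reviewerID`, `asin`, `helpful`, `reviewText`, `overall`, `summary`, `unixReviewTime`) follow the published Amazon review dumps, but double-check them against the exact files you download.

```python
import json

# One line from the reviews file (JSON-per-line format).
# Field names and values are assumed from the public Amazon review dumps.
raw = ('{"reviewerID": "A2SUAM1J3GNN3B", "asin": "0000013714", '
       '"reviewerName": "J. McDonald", "helpful": [2, 3], '
       '"reviewText": "I bought this for my husband who plays the piano...", '
       '"overall": 5.0, "summary": "Heavenly Highway Hymns", '
       '"unixReviewTime": 1252800000, "reviewTime": "09 13, 2009"}')

review = json.loads(raw)

# "helpful" is a pair: [upvotes, total votes on the review].
upvotes, total = review["helpful"]
print(review["overall"], f"{upvotes}/{total} found this review helpful")
```

The `helpful` pair is worth unpacking early, since a ratio like 2/3 is a more useful model feature than the raw list.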
The metadata table contains general information about the books, such as title, price, sales rank, brand, and which products were bought together with a given book.
In total there were slightly more than 41M reviews and almost 9.5M products in the metadata table. The main dataset was around 10 GB, plus 3 GB of metadata, which gives us 13 GB. It might be difficult to fit the entire data structure in your RAM, and even more difficult to run complex calculations on it. That is what actually brought me to the AWS cloud. I picked AWS as the target solution purely out of personal interest; I am sure the same exercise can be performed within any other cloud provider’s ecosystem: Azure, Google, or IBM.
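Even before reaching for a cluster, you can explore a dataset larger than RAM by streaming it line by line instead of loading it whole. A sketch, assuming the reviews come as a gzipped JSON-lines file (the filename below is hypothetical):

```python
import gzip
import json

def iter_reviews(path):
    """Yield reviews one at a time instead of loading ~10 GB into RAM."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

# Hypothetical filename; counts 5-star reviews with constant memory use.
# five_star = sum(1 for r in iter_reviews("reviews_Books.json.gz")
#                 if r.get("overall") == 5.0)
```

This generator style is fine for one-off scans; anything involving joins with the metadata or repeated passes is where Spark on the cluster earns its keep.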
Problem statement
What I decided to do was to create a classifier that would leverage knowledge from (learn from) the VADER sentiment analyzer in the NLTK library, and would be able to identify sentiment from the raw text plus some additional features (e.g. the helpfulness of the review or the overall rating).
In the next part, I am going to describe the technical and cost side of cloud architecture and Spark Cluster set-up.
References:
“Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering”, R. He, J. McAuley, WWW, 2016
Amazon EC2 User Guide, “Spot Instances”: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-spot-instances.html