
NLP project with cloud architecture: Part 3

Updated: Apr 16, 2020

In the first and second parts, I described the outline of the project and some details of the cloud infrastructure set-up. There you can also find some information on how to perform complex calculations at scale and at little cost.

This part I would like to dedicate fully to the machine learning approach I took for the study and to guide you through some of the algorithms I used. The notebook with the full code is available on GitHub.


Machine learning scenario

For the data crunching and modeling, I used the PySpark library and Spark MLlib. The steps I took are as follows:

  1. Performing simple EDA to find, analyze and eliminate outliers in most of the numerical variables.

  2. Constructing a feature extraction pipeline. (I find pipelines particularly useful when it comes to chaining transformations on one dataset.)

  3. Creating the models: Logistic Regression, Naïve Bayes and Support Vector Machines.

  4. Evaluation and model selection.

  5. Prediction.

(Logistic regression showed the worst performance, both computationally and predictively, which is why I skip its explanation here.)



Picture 1. Overview of the entire analysis


Simple EDA and text feature extraction

I first performed sentiment analysis on the raw text without any augmentation. I used the VADER (Valence Aware Dictionary and sEntiment Reasoner) sentiment analyzer because it is specifically attuned to sentiments expressed in social media, which makes it well suited to what I am trying to accomplish. With that done, I went through text data exploration.
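
Below is a minimal sketch of how VADER can be applied to a single review text. The thresholds on the compound score and the example sentence are illustrative, not the exact settings from the project:

    # Minimal VADER example; thresholds and names are illustrative
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()

    def label_sentiment(text):
        # 'compound' is a normalized score in [-1, 1]
        score = analyzer.polarity_scores(text)["compound"]
        if score >= 0.05:
            return "positive"
        elif score <= -0.05:
            return "negative"
        return "neutral"

    print(label_sentiment("Great product, fast delivery!"))  # -> positive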



Picture 2. Sample EDA output


When it comes to text feature extraction, what I did in the pipeline is:

  1. Concatenation of the raw comment and its short summary.

  2. Tokenization using a regular expression.

  3. Removing stop words.

  4. Vectorization of the text.

  5. Weighting it by inverse document frequency (IDF).



Picture 3. The outline of the pipeline
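
As a minimal sketch of these steps in Spark ML (the column names Text and Summary are assumptions for illustration; the actual names depend on the dataset):

    # Sketch of the feature extraction pipeline; column names are illustrative
    from pyspark.sql.functions import concat_ws, col
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, CountVectorizer, IDF

    # 1. Concatenate the raw comment and its short summary
    df = reviews.withColumn("full_text", concat_ws(" ", col("Summary"), col("Text")))

    # 2-5. Tokenize, remove stop words, vectorize, weight by IDF
    tokenizer = RegexTokenizer(inputCol="full_text", outputCol="tokens", pattern="\\W+")
    remover = StopWordsRemover(inputCol="tokens", outputCol="filtered")
    vectorizer = CountVectorizer(inputCol="filtered", outputCol="tf")
    idf = IDF(inputCol="tf", outputCol="features")

    pipeline = Pipeline(stages=[tokenizer, remover, vectorizer, idf])
    features = pipeline.fit(df).transform(df)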


For transformations, Spark adds them to a DAG of computation, and only when the driver requests some data does this DAG actually get executed. One advantage of this is that Spark can make many optimization decisions after it has had a chance to look at the DAG in its entirety. This would not be possible if it executed everything as soon as it received it.

This is the so-called ‘lazy evaluation’ – what does that actually mean?

With eager evaluation, you would have to materialize many intermediate datasets in memory. This is evidently not efficient – for one, it would increase your computational costs – because you are really not interested in those intermediate results as such; they are just convenient abstractions while writing the program. So instead, you tell Spark what eventual answer you are interested in, and it figures out the best way to get there. As a result, it optimizes performance and achieves fault tolerance.
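
As a quick illustration (a sketch; the input path and column names are made up), the transformations below only build up the DAG, and nothing is computed until the final action:

    # Transformations are lazy: nothing runs until an action is called
    df = spark.read.json("reviews.json")          # hypothetical input path
    filtered = df.filter(df["Score"] != 3)        # transformation - added to the DAG
    selected = filtered.select("Text", "Score")   # transformation - still nothing executed

    selected.count()                              # action - only now is the DAG executed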


Naïve Bayes for text classification

There are some important multinomial Naïve Bayes independence assumptions I need to point out before going further:

  1. Bag of Words assumption (the order of words in a sentence does not matter)

  2. Conditional independence (all the features independently contribute to the likelihood that document d falls into class c)


Naïve Bayes is easily applicable to document classification (the classic example you are all probably familiar with is spam detection).

After extracting sentiments with the VADER analyzer, it turned out that there were significantly more positive reviews than negative or neutral ones. To reduce this bias I decided to down-sample the positive observations and remove the neutral sentiment for simplicity of illustration. The accuracy achieved was around 79%, which was a relatively good result, taking into account that only text was used for this particular modeling iteration.
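
A minimal down-sampling sketch (the label values and sampling fractions are illustrative, not the exact figures from the project):

    # Drop the neutral class and down-sample the over-represented positive class
    balanced = features.filter("sentiment != 'neutral'") \
                       .stat.sampleBy("sentiment",
                                      fractions={"positive": 0.3, "negative": 1.0},
                                      seed=42)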



Picture 4. Naive Bayes initial output


I used the default smoothing (regularization parameter) equal to 1 – so-called additive or Laplace smoothing, which basically helps to avoid dividing by zero when calculating probabilities. (By the way, this kind of smoothing is called Laplace smoothing only when the parameter equals 1 – in general, it is called Lidstone smoothing.)
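
A minimal sketch of the Naïve Bayes step in Spark ML, assuming a numeric label column (e.g. produced with StringIndexer from the VADER sentiments) and the features column from the pipeline above:

    # Multinomial Naive Bayes with Laplace smoothing (smoothing=1.0 is the default)
    from pyspark.ml.classification import NaiveBayes
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    train, test = balanced.randomSplit([0.8, 0.2], seed=42)

    nb = NaiveBayes(smoothing=1.0, modelType="multinomial",
                    featuresCol="features", labelCol="label")
    model = nb.fit(train)
    predictions = model.transform(test)

    evaluator = MulticlassClassificationEvaluator(labelCol="label",
                                                  metricName="accuracy")
    print(evaluator.evaluate(predictions))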


SVM with n-grams

For the support vector machines algorithm, I decided to change the approach slightly and implement bigrams. For those who don’t know what an n-gram is: in the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sample of text or speech. An n-gram model is a type of probabilistic language model for predicting the next item in such a sequence in the form of an (n − 1)-order Markov model.
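
In Spark ML, bigrams can be produced with the NGram transformer; a minimal sketch (column names again illustrative):

    # Generate bigrams from the stop-word-filtered tokens
    from pyspark.ml.feature import NGram

    bigram = NGram(n=2, inputCol="filtered", outputCol="bigrams")
    with_bigrams = bigram.transform(features)
    # ["great", "product", "fast", "delivery"] -> ["great product", "product fast", "fast delivery"]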


This estimator implements regularized linear models with stochastic gradient descent (SGD) learning: the gradient of the loss is estimated one sample at a time and the model is updated along the way with a learning rate. SGD allows minibatch (online/out-of-core) learning. The regularizer is a penalty added to the loss function that shrinks model parameters towards the zero vector using either the squared Euclidean norm (L2) or the absolute norm (L1). By default, L2 regularization is used. Regularization significantly reduces the variance of the model without a substantial increase in its bias.
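
A minimal sketch of this step using SVMWithSGD from the RDD-based MLlib API (the iteration count and regularization parameter are illustrative; the label is assumed to be binary 0/1):

    # Linear SVM trained with SGD and L2 regularization (RDD-based MLlib API)
    from pyspark.mllib.classification import SVMWithSGD
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.linalg import Vectors as MLLibVectors

    # Convert the ML DataFrame rows into LabeledPoint records
    train_rdd = train.rdd.map(
        lambda row: LabeledPoint(row["label"], MLLibVectors.fromML(row["features"])))

    svm_model = SVMWithSGD.train(train_rdd, iterations=100, regParam=0.01, regType="l2")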


It hardly improves the results (81% accuracy) compared to Naïve Bayes. I believe that for this case Naïve Bayes is actually the better model because it is much simpler. But of course, if all the parameters in SVMWithSGD are tuned precisely, it can potentially bring better results.



Picture 5. SVM implementation with bigrams


Recap

  1. Being able to set up your own cluster using AWS Spot Instances is a nice-to-have skill because it helps you understand how the underlying ML infrastructure works. But of course, you can avoid going into these details and just use EMR.

  2. Data sparsity is more of an issue in NLP than in other machine learning fields because we typically deal with large vocabularies where it is impossible to have enough data to actually observe examples of all the things that people can say. There will be many real phrases that we will just never see in the training data.

  3. I recommend Naïve Bayes if you want a model that is easier to interpret. SVM can potentially give better performance but is harder to interpret.


How to improve?

  1. Use compute-optimized instances instead of general-purpose machines – it will speed up the training.

  2. Extract more features and utilize the metadata – it can also be used for a different purpose (e.g. to create a recommender based on the comments).

  3. Do stemming on words – it will significantly reduce dimensionality.

  4. Try to extract topics using, for example, Latent Dirichlet Allocation – it would be interesting to see whether the topic sentiments coincide with the review sentiments. If so, it is worth thinking about a joint model with LDA as the inner layer and Naïve Bayes (or SVM) as the outer one. I started an LDA implementation, which you will also find at the end of the published notebook (a minimal sketch follows this list).
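
A minimal sketch of the LDA step in Spark ML (the number of topics and iterations are illustrative; LDA is typically fit on raw term counts rather than TF-IDF):

    # Topic extraction with Latent Dirichlet Allocation on the term-frequency vectors
    from pyspark.ml.clustering import LDA

    lda = LDA(k=10, maxIter=20, featuresCol="tf")
    lda_model = lda.fit(features)
    topics = lda_model.describeTopics(maxTermsPerTopic=10)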

