
NLP project with cloud architecture: Part 2

Updated: Apr 16, 2020

In the first blog post, I introduced the problem I set out to solve in this study, the inspiration behind the project, and the data I used.

In this part I will guide you through the cost aspects of the cloud architecture and explain why I used AWS EC2 instances. And, of course, you will see what the entire architecture looks like. You can find all scripts and detailed guidelines in my GitHub repository.


Cost-effectiveness

The idea was to create a cluster of affordable machines and perform all calculations on them. There were two possible ways to achieve this.

The first one was simply to set up the AWS Elastic MapReduce (EMR) service and run a Spark cluster there. The second was a 'do-it-yourself' approach: installing the cluster manually on AWS EC2. The first option actually has a lot of pros: you do not have to take care of anything, the cluster and notebook (in this case Apache Zeppelin) are set up automatically – you just log in and do ML magic. But there is no fun in having things set up automatically – I was curious about how things work in the background. That's why I went with the second solution, whose main advantage is cost-effectiveness.



Picture 1. AWS calculator estimation for North Virginia


EMR is very expensive, but at the same time very convenient, because you do not have to worry about what happens in the background. EC2 on-demand is not far from EMR in terms of cost. Fortunately, there is a third option – EC2 Spot Instances. The difference between 'on-demand' and 'spot' is that a Spot Instance is unused EC2 capacity offered for less than the On-Demand price – you can think of it as a bargain price. With newer releases you can hibernate a Spot Instance and resume it later; unfortunately, this feature was not available when I was doing the project.
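To make this concrete, here is a minimal sketch of how such a Spot request could be scripted with boto3. Everything in it – the AMI ID, instance type, key pair, security group and maximum price – is a placeholder, not the exact configuration I used.

# Sketch: request three Spot Instances in North Virginia with boto3.
# All IDs and names below are placeholders – substitute your own values.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # North Virginia

response = ec2.request_spot_instances(
    InstanceCount=3,                 # one request per cluster node
    Type="one-time",
    SpotPrice="0.10",                # optional: max price you are willing to pay
    LaunchSpecification={
        "ImageId": "ami-0123456789abcdef0",        # Ubuntu AMI (placeholder)
        "InstanceType": "m5.xlarge",
        "KeyName": "my-key-pair",
        "SecurityGroupIds": ["sg-0123456789abcdef0"],
    },
)

# Print the request IDs so you can track when the instances are fulfilled
for req in response["SpotInstanceRequests"]:
    print(req["SpotInstanceRequestId"], req["State"])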


Simple – yet efficient and scalable architecture

I started by creating multiple independent machines. In my case there were 3 of them, but you can create as many as you'd like. One of the main advantages of Apache Spark is that it is highly scalable: if you have 3 instances and you are running out of memory, you can simply create a script that automatically spins up a new instance and attaches it to the existing cluster (a sketch of this idea follows below). All machines are completely clean at the beginning – only Ubuntu with Python 2.7 installed. I also created an Amazon S3 bucket where I placed all the data. To ensure the best latency, the instances were created in the same region as the S3 bucket (North Virginia in my case). An IAM (Identity and Access Management) role (S3AdminFullAccess) allows all instances to read data directly from the bucket without any limitation.
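Below is a hedged boto3 sketch of that 'spin up one more worker' idea: it creates the bucket in the same region and launches an additional Ubuntu machine with the S3 access role attached. The bucket name, AMI ID, key pair and instance profile name are placeholders, and after the VM boots you would still start a Spark worker process on it pointed at the master.

# Sketch: create the data bucket and launch one extra worker in the same
# region. Names and IDs are placeholders, not the project's real values.
import boto3

REGION = "us-east-1"  # North Virginia, same region as the S3 bucket

s3 = boto3.client("s3", region_name=REGION)
s3.create_bucket(Bucket="my-nlp-project-data")  # us-east-1 needs no LocationConstraint

ec2 = boto3.resource("ec2", region_name=REGION)
new_workers = ec2.create_instances(
    ImageId="ami-0123456789abcdef0",           # Ubuntu AMI (placeholder)
    InstanceType="m5.xlarge",
    MinCount=1,
    MaxCount=1,
    KeyName="my-key-pair",
    # Instance profile carrying the S3 access role mentioned above
    IamInstanceProfile={"Name": "S3AdminFullAccess"},
)
print("launched:", [i.id for i in new_workers])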



Picture 2. Final solution architecture


Once I was done with everything in the AWS console, I moved on to the VMs' terminals by logging in with PuTTY. Then I set up the Spark and Python environments on all machines at the same time – it is crucial to have them set up in exactly the same way (all libraries of the same versions, etc.), because the application will run the same commands on all instances simultaneously. Unlike the general environment, Jupyter Notebook only needs to be installed on the master node, although installing it on the workers will not hurt. Don't forget to create a security group for your cluster that lets you access it from your local computer.
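As a quick sanity check of such a setup, a notebook cell on the master node could connect to the standalone cluster manager and read a file straight from the bucket. This is only a sketch: the master's private IP, the bucket path, and the presence of the hadoop-aws package on the classpath (needed for the s3a:// scheme) are assumptions, not part of the original walkthrough.

# Sketch: smoke-test the cluster from Jupyter on the master node.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://<master-private-ip>:7077")  # standalone cluster manager (placeholder IP)
    .appName("cluster-smoke-test")
    .getOrCreate()
)

# Read a sample file straight from the S3 bucket; no keys are hard-coded
# here because the instances carry the S3 access role.
df = spark.read.csv("s3a://my-nlp-project-data/sample.csv", header=True)
print(df.count())

If this cell returns a row count, the workers can see the master, the environments match, and the IAM role grants the cluster access to the data.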

For me as a data scientist, who has always been interested mainly in statistics and machine learning and not at all in cloud architecture, it was exciting to learn from scratch how the environments we use for modeling are created. I hope it excites you as well 🙂

In the next and final part, I will describe the machine learning approach and the feature extraction pipeline I applied in the study.

