19 Nov, 2017

Cloud Services For Data Scientists

Nowadays, many businesses use cloud-based services, and as a result both small and large companies have started building and investing in such services. To name some famous ones, Google, Microsoft, IBM and Amazon are the leaders in this field. As far as I know, Amazon was the first company to start offering such systems.

I don’t know if the rumors are true, but apparently Amazon in its early stage (probably when it was only selling books online) bought many servers to handle its peak traffic. Later they realized that the servers were only working at maximum capacity around Christmas (I guess because people were buying books as Christmas gifts back then!) and that most of the servers were idle at other times of the year. Therefore, they decided to rent out the idle capacity, and that's how Amazon Web Services (AWS) was launched.

Now Amazon makes 14.5 billion dollars in 12-month revenue [1] through these services. For small companies, the main advantage of these services is the low cost of using them on demand compared to the high cost of buying servers. Besides, these services run with optimized energy use and maintenance, so the servers end up more reliable and efficient.

I have been using Amazon Web Services for the past two years and I am quite familiar with the services Amazon provides, although I have also used Google and Microsoft cloud services from time to time. Here I am going to introduce some of the AWS services that I think would be (potentially) useful for working with data:

Elastic Compute Cloud - EC2:
Simply put, this service is the core of most AWS services. EC2 instances are in fact servers that you can rent from Amazon and set up or run any program/application on. These servers come with different operating systems, and Amazon charges you based on the computing power and capacity of the server (i.e. hard drive capacity, CPU speed, RAM capacity, ...) and the duration the server has been up. For instance, you can rent a Linux or Windows server with specifications (RAM, CPU, etc.) that fit your specific needs. Note that AWS previously charged for at least one hour for each instance you ran, although they recently changed their policy to per-second billing [2].
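
To get a feel for what the switch from hourly to per-second billing means for short jobs, here is a minimal sketch of the arithmetic. The price and runtime are made-up numbers, and the 60-second minimum is my understanding of the announcement in [2], so treat both as assumptions rather than official figures.

```python
import math


def hourly_billing_cost(runtime_seconds, hourly_rate):
    """Older model: every started hour is billed as a full hour."""
    billed_hours = max(1, math.ceil(runtime_seconds / 3600))
    return billed_hours * hourly_rate


def per_second_billing_cost(runtime_seconds, hourly_rate, minimum_seconds=60):
    """Per-second model with an assumed minimum charge of `minimum_seconds`."""
    billed_seconds = max(runtime_seconds, minimum_seconds)
    return billed_seconds * hourly_rate / 3600


# Hypothetical numbers: a 10-minute job on an instance priced at $0.10 per hour.
runtime_seconds = 10 * 60
hourly_rate = 0.10

old_cost = hourly_billing_cost(runtime_seconds, hourly_rate)
new_cost = per_second_billing_cost(runtime_seconds, hourly_rate)
print(f"hourly billing:     ${old_cost:.4f}")
print(f"per-second billing: ${new_cost:.4f}")
```

With these numbers the 10-minute job is billed as a full hour under the old model but only as 600 seconds under the new one, which is where the saving for short, on-demand jobs comes from.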

Simple Storage Service - S3:
S3 is Amazon's storage service for files. It looks somewhat like Dropbox or Google Drive; however, I believe the cost is lower, depending on the usage. The system doesn’t provide a user-friendly interface, since it is designed to work with online applications rather than end users. Working with it through APIs and SDKs is much easier than through its web console, and there are many libraries and APIs for this service in various languages. Boto3, for instance, is a Python library for S3 (in fact it works with many other AWS services as well).
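
As a rough illustration of working with S3 from code, the sketch below uploads a local file and downloads it again with Boto3's `upload_file` and `download_file` client methods. The bucket name, key and file names are placeholders, and it assumes Boto3 is installed and AWS credentials are already configured on the machine.

```python
def round_trip_through_s3(s3_client, bucket, key, local_in, local_out):
    """Upload a local file to S3, then download it back to another path."""
    s3_client.upload_file(local_in, bucket, key)
    s3_client.download_file(bucket, key, local_out)


def make_s3_client():
    """Create a real S3 client (needs boto3 and configured AWS credentials)."""
    import boto3  # imported here so the helper above can be tried without boto3

    return boto3.client("s3")


# Example call (bucket, key and file names are placeholders):
# round_trip_through_s3(make_s3_client(), "my-example-bucket",
#                       "datasets/sales.csv", "sales.csv", "sales_copy.csv")
```

Passing the client in as an argument keeps the helper easy to test with a stand-in object, and the same pattern works for other AWS services that Boto3 supports.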

Relational Database Service - RDS:
To put it simply, RDS is like a SQL database system that you can rent from Amazon. Again, you can specify the features based on your needs, and it supports SQL Server, MySQL, PostgreSQL, Oracle and a couple of other SQL-based engines.

Redshift:
Redshift is Amazon's data warehouse service. It's a distributed system (something like the Hadoop framework) that lets you store huge amounts of data and run queries on it. The difference between this service and RDS is its high capacity and its ability to work with Big Data (terabytes and petabytes of data). Redshift is very scalable, meaning that (depending on the query, network structure and design, service specification, etc.) querying 1 terabyte of data and 1 petabyte of data can take about the same time if you scale up the system (i.e. add more nodes to the cluster). RDS is not suitable for petabyte-scale data, and on the other hand distributed systems are not cost-effective for data in the range of gigabytes or a couple of terabytes.

Elastic MapReduce - EMR:
This service is suitable for setting up Hadoop clusters with Spark and other cluster-based applications. A Hadoop cluster can be used as a compute engine or as a (distributed) storage system. However, if the data is so large that you need a distributed system to handle it, Redshift is more suitable and much cheaper than storing it in EMR.
Since you can set EMR to install Apache Spark, this service is suitable for cleaning, reformatting and analyzing Big Data (it is usually not that easy to set up and tune Spark on a cluster!). You can use EMR on demand, meaning you can set it to grab the code and data from a source (e.g. S3 for the code and data, and RDS for the data), run the task on the cluster, store the results somewhere (again S3, RDS or Redshift, for instance) and terminate the cluster. Using the service this way, you can reduce the cost of your cluster significantly. In my opinion, EMR is one of the most useful AWS services for data scientists.
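
The on-demand pattern described above (start a cluster, run one job, shut down) can be sketched with Boto3's `run_job_flow` call for EMR. Everything specific here is an assumption for illustration: the cluster name, release label, instance types and count, the S3 paths, and the default role names, which need to exist in your account. The Spark script is assumed to take an input and an output location as its two arguments.

```python
def build_transient_cluster_request(script_uri, input_uri, output_uri):
    """Build run_job_flow arguments for a Spark cluster that stops after one step."""
    return {
        "Name": "transient-spark-job",  # hypothetical cluster name
        "ReleaseLabel": "emr-5.36.0",  # assumed release; pick one available to you
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "MasterInstanceType": "m5.xlarge",  # assumed instance types
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            # False means the cluster terminates once all steps have finished.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [
            {
                "Name": "run-spark-script",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", "--deploy-mode", "cluster",
                             script_uri, input_uri, output_uri],
                },
            }
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",  # default roles, assumed to exist
        "ServiceRole": "EMR_DefaultRole",
    }


def launch_transient_cluster(emr_client, request):
    """Start the cluster and return its id."""
    response = emr_client.run_job_flow(**request)
    return response["JobFlowId"]


# Example call (all S3 paths are placeholders):
# import boto3
# request = build_transient_cluster_request(
#     "s3://my-example-bucket/code/clean_data.py",
#     "s3://my-example-bucket/raw/",
#     "s3://my-example-bucket/clean/",
# )
# cluster_id = launch_transient_cluster(boto3.client("emr"), request)
```

Because the cluster is told not to stay alive when it has no steps left, you only pay for the time the job actually runs, which is the cost saving mentioned above.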

1. IBM Beats Amazon In 12-Month Cloud Revenue, $15.1 Billion To $14.5 Billion
2. Per-Second Billing for EC2 Instances and EBS Volumes