Job roles in Data Science
Have you ever wondered how big companies like Google and Amazon deal with your data? Who are the people who actually process your data and make sense of it? Have you ever wondered how, after you search for headphones on Amazon, you start seeing advertisements for branded ones on Facebook or other e-commerce websites? Or how, after you search for a particular product from a particular brand, you start getting recommendations for similar products from the same brand? I mean, who are these people who see your data and say, “You should buy this product!”? Well, after watching “Iron Man” and “Avengers”, you will probably say that this is not done by people but by Artificial Intelligence, particularly Machine Learning algorithms. And you are absolutely right. But there are people who not only build these algorithms but also maintain them. And it is not just one person in a company, but many. Within a single company, your data passes through different people with different job roles. Let’s explore these job roles.
The four major job roles are:
· Data Engineer
· Data Analyst
· Data Scientist
· ML Engineer
Before starting with the discussion of individual job roles, let us explore a scenario:
Let’s say you want to perform some analysis on Facebook’s database. Will you go directly to the live database and start running your analysis? Of course not. One of the main reasons is that it is an OLTP (Online Transaction Processing) database, which is busy handling customer transactions, so we do not run analysis on it directly. One way to perform the analysis is to take the data out (I mean ‘copying’ it) and then work on the copy. The database that holds this copy is known as an OLAP (Online Analytical Processing) database. Let’s see the problems involved.
Let’s say you have a local e-commerce company. You have some products and you have your customers. A customer can buy multiple products at a time. How will you keep a record of each transaction that takes place? How many tables will you create to store the data?
Let’s say your answer is 1 Table where you store Customer name, Customer Id, Product Name, Product Id, and other info.
The problem with this design is that if a customer has bought multiple products, the table ends up with one row per purchase, and every one of those rows repeats the same customer details.
In other words, there are multiple entries with the same customer name. This is called Data Redundancy, and we avoid it in an OLTP database. Moreover, with one table you will end up with a huge number of columns, such as Customer Name, Customer ID, Product Name, Product ID, Product Location, Customer Location, Price and so on.
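To make the redundancy concrete, here is a minimal Python sketch of the single-table design. The rows, customer names and product names are invented for illustration and are not taken from any real dataset.

```python
from collections import Counter

# Hypothetical rows of a single "everything in one table" design.
# Columns: (customer_id, customer_name, product_id, product_name)
flat_table = [
    (1, "Asha", 10, "Headphones"),
    (1, "Asha", 11, "Phone case"),
    (1, "Asha", 12, "Charger"),
    (2, "Ravi", 11, "Phone case"),
]

# Count how many rows carry each customer name.
name_counts = Counter(row[1] for row in flat_table)

# Any name stored more than once is redundant data.
repeated = {name: count for name, count in name_counts.items() if count > 1}
print(repeated)  # {'Asha': 3}
```

One customer buying three products means the same name is stored three times, and the repetition grows with every new purchase.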
Now, you might say that we will need 2 Tables. One will store Customer Information and the other will store Product Information.
But then, can you tell me where you are storing the transactional data? Where are you storing which customer is buying which product? Therefore, we are going to need 3 Tables: one to store Customer Info, a second to store Product Info, and a third to store the Transactional Info.
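The three-table layout can be sketched with Python’s built-in sqlite3 module. This is only an illustration under assumptions: the table names, column names and sample rows are hypothetical, and a real company’s schema would have many more columns and constraints.

```python
import sqlite3


def build_oltp_database():
    """Create an in-memory database with the three tables described above."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (
            customer_id   INTEGER PRIMARY KEY,
            customer_name TEXT NOT NULL
        );
        CREATE TABLE products (
            product_id   INTEGER PRIMARY KEY,
            product_name TEXT NOT NULL,
            price        REAL NOT NULL
        );
        -- Each transaction only stores IDs that point at the other two tables.
        CREATE TABLE transactions (
            transaction_id INTEGER PRIMARY KEY,
            customer_id    INTEGER NOT NULL REFERENCES customers(customer_id),
            product_id     INTEGER NOT NULL REFERENCES products(product_id)
        );
    """)
    # Hypothetical sample data.
    conn.executemany("INSERT INTO customers VALUES (?, ?)",
                     [(1, "Asha"), (2, "Ravi")])
    conn.executemany("INSERT INTO products VALUES (?, ?, ?)",
                     [(10, "Headphones", 1500.0), (11, "Phone case", 300.0)])
    conn.executemany("INSERT INTO transactions VALUES (?, ?, ?)",
                     [(100, 1, 10), (101, 1, 11), (102, 2, 11)])
    conn.commit()
    return conn


conn = build_oltp_database()
```

Even though the customer with ID 1 made two purchases, the name is stored only once in the customers table. The transactions table refers to it by ID.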
This is an OLTP Database. Data Redundancy is at its lowest level here, but this design gives rise to a new problem.
If we want to run a query on this database, it becomes a hectic task. First, you have to take the data out of the OLTP Database and then perform the analysis. If we want to know which customer buys the most expensive things every month, we have to search three tables. First, we need the cost details of the products from the second table. Next, using the third table, we work out which products each customer bought and what they cost, which gives us the Customer ID of the customer who spends the most. Finally, we need the first table to get the name of the customer from that Customer ID. So, to run one simple query, we have to work with 3 tables. In bigger companies, there might be thousands of tables. You might say that we could first join all the tables using Customer ID and Product ID (as they are the unique identifiers in their respective tables) and then perform the analysis on the result. Yes, that is what is actually done. Now you might ask: we made three tables to solve the problem of Data Redundancy in the OLTP database, but here, to perform an analysis, we are joining those tables and creating redundant data again. Does it not seem that we are working hard for no reason?
There is actually a reason we accept Data Redundancy here. As mentioned before, OLTP and OLAP are two different databases. The database that actually stores the company’s data and handles transactions with users is known as the OLTP Database, and the database in which we perform analysis is known as the OLAP Database. The OLAP Database is used specifically for analysis, not for transactions. To store company information, we should use as little storage space as possible, as there may be several million customers and products and storage is not cheap. To perform an analysis, as mentioned before, we do not analyse the data directly in the OLTP Database. We first extract the data and put it into a form that is easier to analyse (the OLAP Database). We make it easier by joining the different tables up front, so that there is less work to do during the analysis itself. In this simple picture, redundant data is acceptable in the OLAP Database because the extracted copy is of no use once the analysis is done, so there is no long-term storage problem. If we want to perform the analysis again, the data is extracted and engineered again.
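As a rough sketch of the join described above, the following Python snippet uses the built-in sqlite3 module to find the customer who spent the most. It simplifies the question by ignoring the “every month” part, since the example tables have no date column. The table names, column names and sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
    CREATE TABLE products (product_id INTEGER PRIMARY KEY, product_name TEXT, price REAL);
    CREATE TABLE transactions (transaction_id INTEGER PRIMARY KEY,
                               customer_id INTEGER, product_id INTEGER);
    -- Hypothetical sample data.
    INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ravi');
    INSERT INTO products VALUES (10, 'Headphones', 1500.0), (11, 'Phone case', 300.0);
    INSERT INTO transactions VALUES (100, 1, 10), (101, 1, 11), (102, 2, 11);
""")

# Join all three tables on their IDs, total the spend per customer,
# and keep only the customer with the highest total.
TOP_SPENDER_QUERY = """
    SELECT c.customer_name, SUM(p.price) AS total_spent
    FROM transactions AS t
    JOIN customers AS c ON c.customer_id = t.customer_id
    JOIN products  AS p ON p.product_id  = t.product_id
    GROUP BY c.customer_id, c.customer_name
    ORDER BY total_spent DESC
    LIMIT 1
"""

top_spender = conn.execute(TOP_SPENDER_QUERY).fetchone()
print(top_spender)  # ('Asha', 1800.0)
```

Notice that the joined result repeats each customer’s name once per purchase before it is grouped, which is exactly the redundancy the three-table design was built to avoid.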
Now, let’s come to a big question. WHERE DO THE ANALYSTS GET THE DATA FROM? Well, there are different methods to extract data:
· The classic method is to take the Data from your own company’s database.
· Another method can be an API which gives you data. You hit the API and it throws you back the data.
· Another method can be open source datasets that are freely available online. One of the most famous websites offering such datasets is Kaggle (owned by Google).
· Another method can be Web Scraping. We basically run a Python script against different websites and it gathers the required data.
· Another method is that you yourself create the data.
Well, retrieving data is one of the most hectic tasks. That is the reason it has a specific job role assigned to it. Can you guess which job role it is?
If you guessed Data Engineer, you guessed absolutely right.
Data Engineer
By the formal definition, Data Engineers are the people who prepare the “big data” infrastructure to be analyzed by the Data Analysts. Frankly speaking, they are responsible for how a company’s data is stored: how many tables there should be to keep storage requirements low, and what each table should contain. They are also the people who retrieve the data and engineer it, that is, clean it (as much as possible), join tables and perform other operations, and then hand it over to Data Scientists or Data Analysts. Their job is to make sure that the data is easily accessible, and their goal is to optimize the performance of their company’s big data infrastructure.
Skills:- Hadoop, MapReduce, Hive, Pig, Data Streaming, NoSQL, SQL, programming.
Tools:- DashDB, MySQL, MongoDB, Cassandra.
Responsibilities:-
· Management- The Data Engineer manages the Big Data infrastructure of the company.
· Analytics- The Data Engineer plays an analytical role, performing analyses of the data stored in the databases. They also troubleshoot data issues within and across the business and present solutions to these issues.
· Collaborative role- The Data Engineer plays a collaborative role: together with senior data engineering management, they develop and implement scripts for database maintenance, monitoring and performance tuning to be applied across the business.
· Knowledge- It is also the Data Engineer’s duty to keep up with industry trends and best practices.
Salary Trend:-
The current salary trend from PayScale shows the salary of Data Engineers according to their experience. The salary of a Data Engineer with less than a year of experience is ₹407K, and it increases in a roughly linear fashion as experience increases.
Data Analyst
After the data has been extracted from the database, it is the Data Analyst who performs the major analysis on it for the betterment of the company. Basically, a Data Analyst interprets data and turns it into information that can suggest ways to improve a business, thus influencing business decisions.
Let us get back to the earlier example where you have an e-commerce company. Let us say that a big occasion is coming up. Imagine the month of December. You know it’s Christmas and people will spend a lot. To increase your profit, you would want to increase the stock of your products. But will you increase the stock of every product that you have? There can be millions of products, and some of them might not be worth stocking up on during Christmas. Is it worth increasing the stock of Summer Clothes during Christmas? Of course not. With human intuition, we can say that we will increase the quantities only of those products that we are most certain will bring us better profits. For example, giftable products are probably worth stocking up on, as people give each other many gifts at Christmas. This is where the Data Analyst kicks in. The Data Analyst reduces the hard work of the company CEO in deciding which products are worth stocking up on for maximum profit. The analyst can look at the previous year’s data and name the particular products whose sales shot up last December.
Skills:- SQL, Microsoft Excel, Critical Thinking, statistical programming in Python or R, Data Visualization, Presentation Skills.
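As a small illustration of that kind of analysis, the sketch below compares each product’s December sales with its average over the rest of the year. The sales figures, the product names and the 1.5 threshold are all invented assumptions. A real analyst would work from the company’s own data and would likely use more careful statistics.

```python
# Hypothetical unit sales for last year (index 0 = January, 11 = December).
last_year_sales = {
    "Gift box":     [20, 22, 19, 21, 20, 18, 22, 21, 20, 25, 40, 120],
    "Summer shirt": [30, 35, 60, 90, 110, 120, 100, 70, 40, 25, 15, 10],
    "Phone case":   [50, 52, 49, 51, 50, 48, 53, 50, 51, 52, 55, 60],
}


def december_lift(monthly_units):
    """Ratio of December sales to the average of the other eleven months."""
    *other_months, december = monthly_units
    return december / (sum(other_months) / len(other_months))


def products_to_stock_up(sales, min_lift=1.5):
    """Names of products whose December lift is at least min_lift, best first."""
    lifts = {name: december_lift(units) for name, units in sales.items()}
    selected = [name for name, lift in lifts.items() if lift >= min_lift]
    return sorted(selected, key=lambda name: lifts[name], reverse=True)


print(products_to_stock_up(last_year_sales))  # ['Gift box']
```

With these made-up numbers, only the gift box clears the threshold, while the summer shirt sells far below its yearly average in December.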
Responsibilities:-
· Data Cleaning and Preparation, also known as Data Pre-processing.
· Data Analysis and Exploration.
· Statistical Knowledge
· Creating Data Visualizations
· Creating Dashboards and/or reports.
· Writing and Communication
· Domain Knowledge
· Problem-solving.
Salary trend:-
This job role starts with a minimum salary of ₹317K at an experience level of less than a year, and the salary increases steeply with experience.
Data Scientist
This is the key person in the company. The Data Scientist takes up the analysis from the Data Analyst and applies Machine Learning algorithms to predict the future outcomes of a particular decision taken by the company, or to predict the market value of a product in future years, among many other tasks. Data Scientists are analytical experts who utilize their skills in both technology and social science to find trends and manage data. They analyze, process and model data, and then interpret the results to create actionable plans for companies and other organizations.
Let us get back to the old example where you have an e-commerce website. I have been using this example throughout the article so as to keep things tied to one simple company. You can relate it to bigger companies as well.
Let’s say, for a little motivation, your company is running pretty well and you want to expand your network of warehouses to increase the amount of stock. Will you buy land at random spots and build warehouses there? No. This is because delivery time is one of your main concerns. You want to serve your customers as fast as possible. In order to do so, you would want your warehouse to be near your customers. As a warehouse, once built, cannot easily be moved (it can only be renovated), you would want it to be located in a place where many people buy your products. This is one of the roles of the Data Scientist. The Data Scientist takes the analyzed data and applies a model to it to select a location that will optimize your delivery time and your profits. And this is just one example of a Data Scientist’s job. The role covers far more than a single example can explain.
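To give a feel for the warehouse problem, here is a deliberately simple sketch that places the warehouse at the order-weighted centre of the customer areas. This is a basic heuristic and not a Machine Learning model. It minimizes weighted squared distance rather than real delivery time, and the coordinates and order counts are invented. A real Data Scientist would also account for roads, costs and available land.

```python
def weighted_centroid(points):
    """Return the (x, y) centre of a list of (x, y, weight) tuples."""
    total_weight = sum(weight for _, _, weight in points)
    x = sum(x * weight for x, _, weight in points) / total_weight
    y = sum(y * weight for _, y, weight in points) / total_weight
    return x, y


# Hypothetical customer areas: (x in km, y in km, orders per month)
# on a flat local grid, not latitude/longitude.
customer_areas = [
    (0.0, 0.0, 100),
    (10.0, 0.0, 300),
    (10.0, 10.0, 100),
]

warehouse_location = weighted_centroid(customer_areas)
print(warehouse_location)  # (8.0, 2.0)
```

The suggested spot is pulled towards the area with 300 orders per month, which matches the intuition of building where most of your customers are.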
Tools:- Python, R, Julia, PyCharm, Orange and IBM Watson, Tableau and so on.
Responsibilities:-
· Programming
· Machine Learning Techniques
· Data Visualization and Reporting
· Risk Analysis
· Statistical Analysis
· Effective Communication
· Research
Salary Trend:-
According to PayScale, the starting salary of a Data Scientist with less than a year’s experience can be up to ₹509K, and the salary rises steeply with experience.
ML Engineer
Imagine your company’s Data Scientist has created a model which recommends products similar to the ones people buy. You want this model to be integrated with your website so that anybody visiting it gets recommendations. But currently this recommendation engine resides on the Data Scientist’s laptop, as the Data Scientist does not know how to apply the model to the website. The Web Developer who built the website does not have any idea about Machine Learning models. There is a knowledge gap between the people in your company. This is where the ML Engineer fits in. The ML Engineer is responsible for communicating with both the Data Scientist and the Web Developer and for integrating the ML model created by the Data Scientist into the website.
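One way to picture the ML Engineer’s glue work is a small function that accepts a JSON request from the website, calls the model, and returns a JSON response. Everything in this sketch is an assumption made for illustration: the SimilarProductModel class stands in for whatever the Data Scientist actually trained, and the request format is made up. A real deployment would sit behind a web framework and add logging, monitoring and scaling.

```python
import json


class SimilarProductModel:
    """Hypothetical stand-in for the Data Scientist's trained model."""

    def __init__(self, similar_products):
        # Maps a product ID to a ranked list of similar product IDs.
        self._similar_products = similar_products

    def predict(self, product_id, k=3):
        """Return up to k similar product IDs (empty list if unknown)."""
        return self._similar_products.get(product_id, [])[:k]


def handle_recommendation_request(model, request_body):
    """Turn a JSON request from the website into a JSON response."""
    try:
        product_id = json.loads(request_body)["product_id"]
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, missing key, or a body that is not a JSON object.
        return json.dumps({"error": "expected a JSON object with a product_id"})
    recommendations = model.predict(product_id)
    return json.dumps({"product_id": product_id,
                       "recommendations": recommendations})


model = SimilarProductModel({10: [11, 12, 13, 14]})
response = handle_recommendation_request(model, '{"product_id": 10}')
print(response)
```

The Web Developer only needs to know the request and response format, and the Data Scientist only needs to provide a model with a predict method. The ML Engineer owns the part in between.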
Skills:-
· Computer Science Fundamentals and Programming
· Probability and Statistics
· Data Modeling and Evaluation
· Applying Machine Learning Algorithms and Libraries
· Software Engineering and System Design
Salary Trend:-
From the above chart (source: PayScale), you can see that the salary of an ML Engineer is almost the same as that of a Data Scientist. The salary starts from ₹503K with less than a year of experience and increases steeply with experience.
Comparison of Skills versus Different job roles in Data Science
Here is a chart from Udacity showing what Data Science skills are required for different Job roles in Data Science.
The Data Engineer requires a lot of Programming Tools, as they have to extract and clean the data and perform other operations. They do not need much Data Visualization, nor do they need much Data Intuition or Statistics. They do not need Machine Learning or Calculus. But they do need Data Wrangling and Software Engineering skills.
The Data Analyst requires a lot of Programming and Data Visualization skills, and also needs Data Intuition and Statistics.
The Data Scientist requires almost everything. The only areas where they can be given a little relief are Software Engineering and Calculus.
The ML Engineer does not need to know how to wrangle the data, as the Analysts and Scientists will do it for them. But they do need Machine Learning and Calculus.
Conclusion
Data Science is a vast field and one of the fastest-growing and most promising ones. As it grows, the demand for engineers grows with it, so the number of vacancies in this field is very large. You can search any job portal, and if you have the right skills, you stand a good chance of getting a job in this field with a good salary.