
Data Science Explained – Pronto!

December 16, 2021 — by Rosaria Silipo & Casiana Rîmbu

We have reached the end of our first season of “Data Science Pronto!” videos: thirteen short videos explaining why some operations have become best practice in data science. For example, we know that the R^2 metric is not a good way to evaluate time series prediction performances, but why? We know we need regularization terms to avoid overfitting, but what exactly is a regularization term? And yes, we know we need to carefully partition our data when uploading them to a Spark platform, and that this is not the same partitioning that creates the training and test sets: but what is it then? The “Data Science Pronto!” videos try to explain why some practices have become best practices in data science.

As we wrap up the first season, let’s revisit the questions we asked ourselves. We covered all sorts of data science topics: use cases, data loading and preparation, machine learning, time series analysis, deep learning, and, of course, deployment. In retrospect, we managed to cover one or two details from every step of the data science creation cycle in less than 30 minutes. How is that possible? Shall we check?

Sentiment Analysis

Sentiment analysis has been something of a mantra in recent years. Articles, blog posts, and applications on sentiment analysis abounded, and it seemed mysterious and tricky to implement. Does it require a grammar-based solution or a machine learning solution? How does deep learning fit into the picture? And, most of all, what is it exactly? Lada explains all of this while stealing only 2:20 minutes of your time.

Data Loading

Before we can apply any data science steps, we have to load our data into a file, a database, or a big data platform. Due to the structure of some big data platforms, data often has to be partitioned (or re-partitioned) to be processed as efficiently as possible. Now, although this partitioning operation is often mentioned in big data environments, it is NOT the same partitioning operation that separates a dataset into a training set and a test set. Do not confuse them! So, what is it? And do I really need it? Here is a concise answer to avoid confusion down the road.
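To make the distinction concrete, here is a minimal pure-Python sketch of the two very different operations that share the name “partitioning”. This is an illustration of the concepts only, not the actual Spark API (in real Spark code you would call something like `DataFrame.repartition`); all function names here are hypothetical.

```python
import random

# Big-data "partitioning": splitting one dataset into chunks so they can be
# processed in parallel. Every row is kept exactly once, just spread around.
def partition(rows, n_partitions):
    return [rows[i::n_partitions] for i in range(n_partitions)]

# Train/test "partitioning": a random split used to evaluate a model.
# The two parts play different roles: one for fitting, one for scoring.
def train_test_split(rows, test_fraction=0.25, seed=42):
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

data = list(range(100))
parts = partition(data, 4)            # 4 chunks for parallel processing
train, test = train_test_split(data)  # 75/25 split for model evaluation
```

The first operation is about throughput; the second is about honest evaluation. Only the names collide.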

Data Preparation

Data preparation seems like an easy step within the data science creation cycle, but it hides a few pitfalls of its own.

Data Leakage

You’ve probably heard that “data leakage” should be avoided, and you probably know what data leakage is. What I am not so sure about is whether you really know what it means. The term “data leakage” was introduced fairly recently to describe the well-known problem of polluting the model training process with information from the test data. We explain data leakage in this Data Science Pronto video.
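A classic way data leakage sneaks in is through preprocessing. A minimal sketch, under the assumption that our preprocessing is a simple mean-centering step (the function names are made up for illustration):

```python
# Leaky: the mean is computed on ALL data, including the test rows, so
# information about the test set leaks into the preprocessing of the
# training set.
def normalize_leaky(train, test):
    full = train + test
    mean = sum(full) / len(full)
    return [x - mean for x in train], [x - mean for x in test]

# Safe: the mean is computed on the training rows only, then applied
# unchanged to the test rows.
def normalize_safe(train, test):
    mean = sum(train) / len(train)
    return [x - mean for x in train], [x - mean for x in test]

train, test = [1.0, 2.0, 3.0], [100.0]
print(normalize_leaky(train, test))  # statistics contaminated by the test row
print(normalize_safe(train, test))   # statistics from the training rows only
```

The same reasoning applies to scaling, imputation, and feature selection: fit on training data only, then apply to the test data.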

Bootstrapping

Often in the data preparation phase, we encounter the problem of not having enough data. This is probably where you’ve heard the term “bootstrapping”. Bootstrapping is a statistical technique that resamples the original dataset with replacement to artificially create new datasets with the same statistical properties. It is used not only in data preparation, but also inside some machine learning algorithms. We briefly explain how it works in the following video.
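The core of bootstrapping is sampling with replacement. A small self-contained sketch (the data values are invented for illustration):

```python
import random

def bootstrap_sample(data, rng):
    # Draw len(data) items WITH replacement: some rows repeat, others are
    # left out, but the resampled set has similar statistical properties.
    return rng.choices(data, k=len(data))

def bootstrap_means(data, n_resamples=1000, seed=0):
    # The spread of these resampled means estimates the uncertainty
    # of the sample mean, without any distributional assumptions.
    rng = random.Random(seed)
    return [sum(s) / len(s)
            for s in (bootstrap_sample(data, rng) for _ in range(n_resamples))]

data = [2.1, 2.5, 2.8, 3.0, 3.3, 3.9, 4.2]
means = bootstrap_means(data)
```

The same resampling trick is what bagging algorithms, such as random forest, use internally to train each model on a slightly different dataset.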

Machine Learning 

We have now reached the heart of data science: the machine learning algorithms. 

Let’s start from the basics.

Linear vs. Logistic Regression

What is the difference between linear regression and binary logistic regression? Do you know how to interpret their coefficients to quantify the importance of their input features? If not, here is a 2-minute “Data Science Pronto!” video on this topic.
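The core difference can be shown in a few lines. A sketch with made-up coefficient values: linear regression outputs the linear combination directly, while logistic regression squeezes the same combination through a sigmoid to get a class probability.

```python
import math

def linear_predict(x, coefs, intercept):
    # Linear regression: the output is a number, and each coefficient is
    # the change in the prediction per unit change of its input feature.
    return intercept + sum(c * xi for c, xi in zip(coefs, x))

def logistic_predict(x, coefs, intercept):
    # Logistic regression: the same linear combination passed through the
    # sigmoid. exp(coefficient) is the multiplicative change in the ODDS
    # of the positive class per unit change of the feature.
    z = intercept + sum(c * xi for c, xi in zip(coefs, x))
    return 1 / (1 + math.exp(-z))

coefs, intercept = [0.8, -0.3], 0.1
y_num = linear_predict([2.0, 1.0], coefs, intercept)    # a real number
y_prob = logistic_predict([2.0, 1.0], coefs, intercept)  # a value in (0, 1)
```

So the coefficients are interpreted on different scales: directly for linear regression, and through the odds ratio exp(coefficient) for logistic regression.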

Bagging and Boosting

I am sure you know about random forest. You know it is an ensemble algorithm. Is it a bagging or a boosting algorithm? What is the difference between bagging and boosting? Check out this “Data Science Pronto!” video for the details.
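The video has the details, but the contrast can be sketched with a deliberately trivial "weak model" that just predicts a mean (everything here is a toy illustration, not a real random forest or gradient boosting implementation):

```python
import random

def fit_mean(targets):
    # Our "weak model": it simply predicts the mean of its training targets.
    return sum(targets) / len(targets)

def bagging(targets, n_models=10, seed=0):
    # Bagging: models are trained INDEPENDENTLY, each on a bootstrap
    # sample, and their predictions are averaged (random forest is a
    # bagging algorithm).
    rng = random.Random(seed)
    models = [fit_mean(rng.choices(targets, k=len(targets)))
              for _ in range(n_models)]
    return sum(models) / n_models

def boosting(targets, n_models=10, learning_rate=0.5):
    # Boosting: models are trained SEQUENTIALLY, each one fitted to the
    # residual errors left by the ensemble built so far.
    prediction = 0.0
    for _ in range(n_models):
        residuals = [y - prediction for y in targets]
        prediction += learning_rate * fit_mean(residuals)
    return prediction

targets = [1.0, 2.0, 3.0, 4.0]
```

Bagging reduces variance by averaging independent models; boosting reduces bias by letting each new model correct the mistakes of the previous ones.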

Overfitting

Now we move to the evaluation phase, where we have to make sure that the model we have trained is not overfitting. Overfitting is a common pitfall in machine learning, and we must take all possible measures to avoid it, e.g., add a regularization term to the cost function used during training. All clear? If not, here is a video to explain:

Regularization

And then also a video to explain what regularization terms are and why they help.
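In code, "adding a regularization term" is exactly what it sounds like. A minimal sketch, assuming an L2 (ridge-style) penalty on top of a mean squared error cost:

```python
def mse(predictions, targets):
    # Plain mean squared error: how far the predictions are from the truth.
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def regularized_cost(predictions, targets, weights, lam=0.1):
    # The L2 regularization term penalizes large weights, nudging training
    # toward simpler models that are less likely to overfit. `lam` controls
    # how strong the nudge is.
    penalty = lam * sum(w ** 2 for w in weights)
    return mse(predictions, targets) + penalty
```

With `lam=0` the cost reduces to plain MSE; the larger `lam`, the more the optimizer prefers small weights over a perfect fit to the training data.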

Time Series Analysis

Still in the evaluation phase, we always hear that we can use error metrics to evaluate models for numeric predictions. However, in the case of time series analysis, R^2 should be avoided. We have all accepted that, but have you ever wondered why?
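A quick numeric sketch of the problem. On a random walk, even the "naive" forecast that simply repeats yesterday's value gets a high R^2, because consecutive values are strongly correlated; the metric rewards the series' own inertia, not any real forecasting skill. (The data below is synthetic, generated for illustration.)

```python
import random

def r_squared(actual, predicted):
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# A random walk: tomorrow = today + unpredictable noise.
rng = random.Random(42)
series = [0.0]
for _ in range(500):
    series.append(series[-1] + rng.gauss(0, 1))

# "Naive" forecast: predict that tomorrow equals today. It forecasts
# nothing, yet its R^2 looks impressive.
actual = series[1:]
naive = series[:-1]
r2 = r_squared(actual, naive)
```

This is why time series models are better judged with error metrics such as MAE or RMSE, often compared against that same naive baseline.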

Deep Learning

Of course, we also dived into deep learning. It could not be otherwise. Deep learning is based on neural networks, whose popularity started back with the backpropagation algorithm, which in turn is based on the gradient descent technique. Gradient descent is the foundation of the whole deep learning field. Curious? It is explained quickly but exhaustively in this video.
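The idea fits in a few lines: repeatedly take a small step against the gradient of the cost function, so each step moves a little downhill. A minimal sketch on a one-dimensional toy function (real networks do the same thing over millions of weights at once):

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    # Each iteration moves against the gradient, i.e. "downhill" on the
    # cost surface; the learning rate controls the step size.
    x = x0
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
minimum = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```

Backpropagation is "just" an efficient way to compute that gradient for every weight in a network, so that this same update rule can be applied to all of them.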

Gradient Descent

LSTM Units

One famous paradigm in deep learning is the Long Short-Term Memory (LSTM) unit, which is so good at learning from sequences of past examples. Here is a quick summary of what LSTM units are and why they are so popular.
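At the heart of an LSTM unit are three gates that decide what to forget, what new information to store, and what to output. A deliberately scalar, toy sketch of one cell step (real LSTMs use weight matrices and bias vectors; the weight values here are invented):

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    # One step of a toy, scalar LSTM cell.
    f = sigmoid(w["f"] * x + w["uf"] * h_prev)          # forget gate
    i = sigmoid(w["i"] * x + w["ui"] * h_prev)          # input gate
    o = sigmoid(w["o"] * x + w["uo"] * h_prev)          # output gate
    c_tilde = math.tanh(w["c"] * x + w["uc"] * h_prev)  # candidate memory
    c = f * c_prev + i * c_tilde  # cell state: the long-term memory
    h = o * math.tanh(c)          # hidden state: the short-term output
    return h, c

weights = {"f": 0.5, "uf": 0.1, "i": 0.5, "ui": 0.1,
           "o": 0.5, "uo": 0.1, "c": 0.5, "uc": 0.1}
h, c = 0.0, 0.0
for x in [1.0, -0.5, 0.8]:  # feed a short input sequence, one step at a time
    h, c = lstm_step(x, h, c, weights)
```

The cell state `c` is the trick: because it is updated additively, gradients can flow through long sequences without vanishing, which is why LSTMs learn long-range patterns so well.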

Deployment

The model is now ready to exit the data science creation cycle and go out into the real world. The moment for deployment has come. There are many options for deploying a machine learning model, or any other data-related application: as a web application, as a dashboard, as a web service via REST API, and so on.

Deployment

If you are not familiar with those terms, you should at least learn quickly what deployment is:

REST Service

And what a REST service (web service via REST) is:
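In essence, a REST service wraps a model behind an HTTP endpoint: the client sends a request with the input features, and the service answers with the prediction. A minimal Python sketch using only the standard library's WSGI machinery; the `predict` function and the JSON shape are invented stand-ins for a real trained model:

```python
import json
from wsgiref.simple_server import make_server  # used only to serve for real

def predict(features):
    # Stand-in for a trained model: here, just the mean of the features.
    return sum(features) / len(features)

def app(environ, start_response):
    # A minimal REST-style endpoint: the client sends a JSON body such as
    # {"features": [1, 2, 3]} and gets a JSON prediction back over HTTP.
    size = int(environ.get("CONTENT_LENGTH") or 0)
    body = json.loads(environ["wsgi.input"].read(size) or b"{}")
    result = json.dumps({"prediction": predict(body["features"])})
    start_response("200 OK", [("Content-Type", "application/json")])
    return [result.encode()]

# To actually serve it: make_server("", 8000, app).serve_forever()
```

A production deployment would add routing, input validation, and authentication, but the request/response contract above is the whole idea of a REST prediction service.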

That was the entire first season of “Data Science Pronto!” videos. Did you know all the answers to all the “whys”? Did you learn new things in this 30-minute tour of data science? If you knew all of the answers already, maybe you should star in one of the next videos. If you did not, then stay tuned for the next series of videos!

Indeed, we will now take a pause for a few months, recharge, and hunt for those data science questions for which you know the formal answers but not the why. We will be back sooner than you think!

Send an email to blog@knime.com with your burning questions about obscure best practices, incomprehensible parts of algorithms, or frequently mentioned use cases, and we will try to answer them.

Thank you for watching… from the “Data Science Pronto!” team.

