To SQL or Not to SQL, UFOs & Data Science

It was my pleasure to recently interview Tosin Adekanye as part of the My Data Guest interview series.

Tosin Adekanye started her journey towards data science during her undergrad studies in psychology, where she was heavily involved in statistics and academic research. She developed these skills further during her MBA, where she started learning and utilizing ML algorithms and data science software. But it was during the lockdown, last year, that she became obsessed with data science! And that is also when she got to learn the No Code/Low Code “powerhouse” – as she calls KNIME software.

Tosin is a very active member of the KNIME Community and an influencer in the data science space on social media. She writes articles for a number of journals, like the "Low code for Advanced Data Science" journal on Medium, describing very interesting solutions to the most diverse tasks. I repeat “to the most diverse tasks”. For example, in an article she provides a solution for fraud detection on credit card transactions; in another article she talks about UFO sightings; then in another article she describes the possibility to bypass SQL coding with visual programming, and so on. Quite a diverse array of projects.

Rosaria: Hi Tosin, tell us more about these different projects. Which of the projects you’ve written articles about were for work and which ones are hobbies?

Tosin: Some of the articles were motivated by work projects. When I was at FIS, the global credit card processing company, a colleague in the fraud department told me about how they use models to predict credit card frauds. That’s what got me curious to build something like that myself and write about it. Another work-inspired article came about because I use databases heavily every day. That led to the article on SQL. The UFO article on the other hand, well, that's a hobby of mine. I’m obsessed with science fiction and alien sightings. But it’s hobbies that also motivate me to look into different topics.

Rosaria: I’d like to talk about your article for predicting credit card fraud. Can you tell us a bit more about your interest in working with financial data?

Tosin: Yes, I really like finance. My background is actually psychology and business, but finance is a synergy of all my passions. Fraud is a real pain for both the customer and the company. It’s cost so many companies in the U.S. so much money... I was curious to see if I could build a model that could catch these fraudulent transactions and maybe tell us which variables seem to go with fraud. The dataset I used was synthetic data, but it was based on real-life trends in this field.

Someone's age, for example, is associated with a higher likelihood of fraud: fraudsters like to pick on potentially gullible, older people. Online fraud also takes place more at night or early in the morning. Being aware of insights like these can help companies and customers better protect themselves.

Rosaria: Have you ever tried to predict stock prices? And do you think this is actually possible?

Tosin: Let me rewind a bit! During my MBA in finance we tried to predict stock prices based off of things such as financial ratios from past company statements. Our best models probably predicted to an accuracy of 10 to 14 and only accounted for a certain amount of variance. I believe that to really predict a stock price you need to be holistic. You need to look at everything from news articles to social media posts to historical stock prices to the company's financials. I am still actively working on this. By the way, this project was also the reason why I started using KNIME. I needed something that lets me process data from all these kinds of sources.

In the past, I also ran some analysis between sentiment on Twitter and the movement of S&P values, and detected some relationships there. So yes, I think it's possible to predict stock prices. Definitely not to 100% accuracy, but given enough data from lots of different sources and good models, yes it's possible to predict with some accuracy.

Rosaria: In your article on fraud detection, you compared the performance of several models (Decision Tree, XGBoost, Gradient Boosted Trees) and declared the XGBoost model to be the best. Why do you think XGBoost performed better than the other algorithms in this particular use case?

Tosin: For classification XGBoost tends to perform better for several reasons. It’s an ensemble model so it's building multiple trees, which usually gives better performance. Also, XGBoost improves on weak learners - on features that are not leading to good predictions. It keeps getting better as it progresses. It also limits overfitting by introducing a cost function, so all these different controls usually make it perform better than other models.

Rosaria: You say that it was your work with databases that led you to write the article “to SQL or not to SQL?”. This is a controversial topic. Some people say, "You must SQL otherwise you forget how to program in SQL". Others say, "If I cannot SQL, then visual programming allows me to SQL anyway". What is your opinion?

Tosin: Oftentimes I feel that in this field of data science people can become too attached to tools. I would say do what works best for you. If you like to see code and it’s the best thing for you then do that! I prefer to have a blended approach. I have workflows with lots of SQL code, but you’ll also see me using SQL nodes [e.g., the nodes in the KNIME Database extension implementing SQL queries in the background] because oftentimes that's what works best for me. Essentially whatever works best and whatever is most efficient for you is what I would recommend.

Rosaria: Some time ago, somebody posted on Twitter that people use low code tools only if forced by their bosses, that people would never use it for their hobbies. You seem to contradict that tweet. Do you use KNIME Analytics Platform for all your projects?

Tosin: I’m not sure how they got that point of view. I brought KNIME to my workplace! I don’t use KNIME exclusively but it would be hard for me to go somewhere and not be able to use KNIME.

Rosaria: So, you use KNIME Analytics Platform in combination with other tools?

Tosin: Yes, usually Power BI and KNIME. Sometimes I use PyCharm if I need to program in Python. KNIME does have a Python node, so you can program in there too, but I usually use KNIME, PyCharm, and Power BI.

Rosaria: Your other big passion in addition to data science is science fiction. In your UFO article, you found a way to combine them. Tell us more about that story.

Tosin: I'm a huge fan of sci-fi and psychology. In psychology we often notice that the more something is talked about the more people seem to experience it. So I wondered if there could be some sort of relationship between sightings of UFOs and movies about UFOs. Based on data of UFO sightings and release dates of movies about aliens I did a visualization of the correlation. And the correlation was highly significant! But of course that's not causation. I would probably need to do some more digging and get better data - a richer movie database plus more recent UFO sightings - and then see if I can isolate what's causing what.

Rosaria: How did KNIME help you decipher the relationship between the number of movies and number of sightings?

Tosin: I had multiple datasets so KNIME really helped me to blend them. The movie datasets are pretty different, so I had to do some standardization and join everything together. KNIME was very helpful for that as well as for correlation. The correlation was super easy to run just using the Linear Correlation node and then I was able to quickly look at the relationship.

Tip: Check out the workflow UFO Sightings Data Prep in Tosin’s space on KNIME Hub.
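
Tip: For readers who prefer to see the idea in code, the linear (Pearson) correlation that Tosin mentions can be sketched in a few lines of Python. This is only a minimal illustration, not Tosin's workflow: the yearly counts below are made-up numbers and the function name `pearson_r` is a hypothetical helper.

```python
import math

def pearson_r(xs, ys):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations from the mean (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up yearly counts, for illustration only (not the data from the article).
alien_movies_per_year = [2, 3, 5, 4, 8, 9]
ufo_sightings_per_year = [110, 150, 240, 200, 390, 420]

r = pearson_r(alien_movies_per_year, ufo_sightings_per_year)
print(f"Pearson r = {r:.3f}")
```

A value of r close to 1 means the two series rise and fall together, but, as Tosin points out, it says nothing about which one (if either) drives the other.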

Rosaria: Are you planning to write more articles like those? I am following you and I cannot wait till the next one appears.

Tosin: I haven’t quite decided about my next article but there are two things I’m working on. One is flight cancellations - looking at what factors go with flight cancellations and when flights are most likely to be canceled.

My other project is fun! There's a Twitter API Connector node in KNIME that makes it super easy to pull tweets. So I’m going to get tweets from different countries that mention the word happiness and then see which words are most associated with happiness in different parts of the world.

Rosaria: How can data scientists in the audience follow your work and access your workflows?

Tosin: I write articles for KNIME's Low Code for Advanced Data Science journal on Medium and I also have my website TosinLitics where I publish my work. I can recommend checking out my KNIME Hub space where I keep my workflows. I think it's very helpful to see what other people have done - not just my workflows but all workflows in general on the KNIME Hub. It's great for idea generation or to help you get unstuck. Being able to refer to examples by others helps a lot.

Rosaria: Your TomTom component is in your KNIME Hub space, right? It was a very popular component on the KNIME Hub. What does it do? Can I download it?

Tosin: Yes, I was motivated to do this because I didn't really see many solutions around that let you quickly look up the distance between two points using longitude and latitude. I did some research and figured TomTom was the best tool to go with. So now you just get your API key and put it in the component. Once you have your data files containing the latitude and longitude information, you can run those through the component and get the drive time and the distance, traffic delays, and all related information, to go from one point to the other.

Tip: Download the component Drivetime and Distance Query - Latitude Longitude from the KNIME Hub.

Rosaria: Are you planning to implement more components?

Tosin: Yes. There’s so much you can get from the TomTom API! I’m planning to do a couple more. This could be a family of geospatial components.

Rosaria: You are a very active KNIME community member but let's go back in time: How long have you been using KNIME, and how did you get started with KNIME?

Tosin: This might come as a surprise but I actually only started using KNIME in January 2021. I’d already had some exposure to software like KNIME - I used SPSS Modeler from 2017. Then I used Alteryx but the licensing was a barrier for me. I needed something that was efficient, that could let me do so many things for data science. That's when I found KNIME.

Even though I haven't been using KNIME that long, you can really climb the learning curve quickly because of the resources that are available. KNIME also has some of the most approachable, most passionate employees and that's really helped me come along in my learning curve.

Rosaria: Tell us about the biggest challenge you had to solve in your professional life as a data scientist.

Tosin: Dealing with textual data! I had been running away from this for many many years. But in January I wanted to learn how to process and analyze textual data and get comfortable with it. Example workflows have helped a lot with this, you know. I wanted to do some sentiment analysis. I realized that with textual analysis, once the data’s cleaned properly and processed, it can be reduced to mathematics! Now it just seems a lot easier.

Rosaria: Tell us the biggest mistake you’ve learned from.

Tosin: My biggest mistake was the basis for an article I wrote about class imbalance and why accuracy is not always the best - especially when you have imbalanced classes. I posted about my first detection model because it had an accuracy of 99%. The LinkedIn data science crowd was super helpful because they pointed out that when you have imbalanced classes, accuracy is not necessarily the best metric. Indeed, you can have high accuracy values, but your minority class can still perform really badly in terms of classification. That was something I knew, but sometimes you know something in theory and don't really realize it until you see it happening in practice.
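
Tip: The point about accuracy and imbalanced classes is easy to reproduce. The sketch below uses synthetic labels (not Tosin's data) and a hypothetical "model" that always predicts the majority class. It reaches 99% accuracy while catching no fraud at all.

```python
# Synthetic labels, for illustration only: 1 = fraud, 0 = legitimate.
# 990 legitimate transactions and 10 fraudulent ones (1% minority class).
y_true = [0] * 990 + [1] * 10

# A hypothetical "model" that always predicts the majority class.
y_pred = [0] * 1000

def accuracy(y_true, y_pred):
    """Share of all predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def recall(y_true, y_pred, positive=1):
    """Share of actual positives (here: frauds) that the model detected."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn) if (tp + fn) else 0.0

print(f"accuracy: {accuracy(y_true, y_pred):.2%}")     # accuracy: 99.00%
print(f"fraud recall: {recall(y_true, y_pred):.2%}")   # fraud recall: 0.00%
```

This is why metrics that focus on the minority class, such as recall or precision, are usually reported alongside accuracy for problems like fraud detection.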

Rosaria: Yes, imbalanced classes can give you false expectations of how a model is performing. Do you have any advice for all young aspiring data scientists who are in the audience?

Tosin: I have three primary pieces of advice:

One, read a lot. Medium is a good platform. You don't have to fully understand everything you read but you’ll be familiarized with the topic and this will help in the future.
Also, don't be afraid to share your work. The first things I shared weren’t always that good but having shared them I got feedback which helped me improve more.
Just get started! You don't have to be perfect but you're going to grow and build up from there.

Rosaria: Any book to recommend for the ones in the audience always eager to learn something new?

Tosin: Yes, Data Analytics Made Easy by Andrea De Mauro. I really like it because it teaches you the theory for analytics and for data science, which is so important. Sometimes, programs for data science just jump right into Python, but I think the theory is more important. At the end of the day Python is just a tool. When you learn about theory, this knowledge helps you know how to overcome obstacles in practice and be better in this field. This book also teaches how to use KNIME.

Rosaria: Books are definitely an essential tool to get a solid basis but what are your usual readings to keep you up-to-date on new, exciting data stories?

Tosin: I read a lot of Medium articles and I'm very active on LinkedIn, connected with people, so I usually see a lot of things that are being talked about. Staying in the loop, reading, and googling helps to keep up to date.

Rosaria: While following your #30daysofknime initiative, I also discovered that you are a very talented video-maker and actually ... "surprise" ... a very talented singer. Would you like to conclude this interview with the song you sang in the first video posted within the #30daysofknime initiative?

Find Tosin’s song in the video of the original interview below!
