We recently had the opportunity to chat with Daniel Contaifer, a Persist.AI researcher (formerly VCU) who started using KNIME a little while ago. In his work, Daniel studies associations between biomarkers and PCOS, but verifying their statistical validity can be challenging because the number of samples he has available is not very large. Fortunately, KNIME has been helping him tackle this issue: with oversampling functionalities like the SMOTE node, Daniel was able to gather insights about these associations in a much more robust fashion. And the best part? No coding was involved.
Daniel, a seasoned biomedical researcher, was frustrated with having to learn how to code in order to conduct his research. After being introduced to KNIME, he quickly grasped the basic concepts of the platform and put together a comprehensive statistical workflow for his research in a matter of weeks. We were so impressed with his KNIME journey that we decided to write two articles together: in this one, we interview Daniel to understand a bit more about the research he does and how he has been using KNIME; in a second article, we will focus on the workflow he built to identify potential associations between different biomarkers and PCOS, an ovarian syndrome that affects 6% to 12% (as many as 5 million) of US women of reproductive age. Now, without further ado… Daniel Contaifer and KNIME.
(Aline) Hi Daniel, it is great to talk to you again! We were wondering if you could share your KNIME journey with us. But first things first: tell us a little bit about your work as a research specialist at VCU.
(Daniel) My story of how I became a life science researcher is at the center of how I discovered the potential of KNIME. After I finished my bachelor’s degree in Sciences at the State University of Rio de Janeiro, Brazil, I didn’t have a chance to complete my master’s degree in biophysics, but I kept learning and updating my scientific knowledge until I was given the opportunities I was craving. I was invited to come to the US 18 years ago as a lab assistant on a J1 visa, substituting for a research specialist who was on maternity leave for 6 months. My boss was so impressed with my work and with how quickly I grasped the technical aspects of the project that he offered me a position to continue on it. After a few years he changed my status to an H-1B visa as a research specialist.
Through curiosity and an effort to learn, I became an independent researcher in his group and worked on different projects at VCU. At this time I met Dr. Dayanjan Wijesinghe (also known as Shanaka), who was still doing postdoctoral work in the Department of Biochemistry. He likes to say that I was the only one who made sense of the issues the lab was confronting when he participated as a collaborator in our lab meetings. We became friends, and when he started his lab at the VCU School of Pharmacy, he invited me to be his research specialist and lab manager.
His belief in my potential as a researcher and in my capacity for self-learning pushed me towards data analysis, using multivariate statistical methods to tackle the complexity of the high-dimensional data found in metabolomics and lipidomics studies. After acquiring knowledge of lipidomics and metabolomics analysis, and of the multivariate statistical tools to discern the correlation of the biochemical pathways associated with chronic diseases, I became more interested in finding putative biomarkers linked to the pathophysiological signs of these diseases. The struggles to achieve this goal are well known to researchers: although we are able to acquire a huge amount of data, we are limited in our search for putative biomarkers by the small sample size of clinical studies.
(Aline) Thanks for the details on your trajectory, including research struggles. In this context, how did you first hear about KNIME?
(Daniel) The cumbersome tasks of data transformation, centering, and normalization are time-consuming when working with data extracted from lipidomics. So I decided that I should learn some coding to speed up my work. I am a bit old now, closer to retirement than I realized, but if all the kids are learning how to code, at least in R, I should be doing the same. In addition, Shanaka always pushed me and the young students in our lab towards coding. This, however, is easier said than done, and I confess I ended up procrastinating on that goal.
Over time, relying on others to help me with data mining, or to understand my vision of how I wanted the data analyzed, led to wasted time and frustration. Shanaka eventually suggested: “Daniel, why don’t you try using KNIME to create a statistical workflow; that way you won’t need to learn how to code.” It looked too good to be true, but I accepted the suggestion and started playing with the KNIME nodes. I still had to rely on the coding expertise of some students to create R snippets for some convoluted parts of my workflow. However, I noticed that over time new nodes were offered on the KNIME Hub, and with great help from the KNIME Forum I gained more independence to build my workflow.
(Aline) It is interesting that you started with KNIME in collaboration with other researchers. Do you think that KNIME made research collaboration easier for you? If so, in what ways?
(Daniel) The process of building a workflow that is not only usable by me, but also understandable and practical for anyone else, is not an easy task. Although the main body of the workflow came from my attempts to recreate in KNIME what was in my mind, asking for suggestions from colleagues and associates helped a great deal towards a comprehensive solution for dealing with small sample sizes. Before launching a new iteration of the workflow to improve the analysis, I would send it to colleagues and ask them to try it with their own data, or with my dataset, and to give input on how it worked, as well as to report problems, errors, and suggestions for improvement. So I take credit for the final product, but without the feedback from other researchers in our group I would not be satisfied with the solutions I created with KNIME.
(Aline) Interesting, thanks! You frequently mention how small sample sizes are challenging for research. How can one get statistical validity for their results if there are not enough observations? We were wondering if you tried other KNIME functionalities to tackle this problem. And how did you find out about the SMOTE node?
(Daniel) A small sample size means not enough power to reach statistical significance, even if there is a clear trend towards biochemical or physiological differences among the groups we’re studying. This was sometimes frustrating after analyzing our data. A power analysis is good practice before planning your experiments, and drawing on data from pilot studies or published results helps you plan a sample size large enough to reach the power needed to detect statistical significance in your study.
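For readers who have never run one, a normal-approximation version of such a power calculation fits in a few lines of Python. This is a sketch only: the formula is the standard two-group approximation, and the effect size, alpha, and power values below are illustrative defaults, not numbers from Daniel’s study.

```python
import math
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.8):
    """Normal-approximation sample size for comparing two group means:
    n per group ~= 2 * ((z_{1-alpha/2} + z_{power}) / d)^2,
    where d is the standardized effect size (Cohen's d)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_power = z.inv_cdf(power)          # quantile for the desired power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

print(n_per_group(1.0))  # large effect: 16 per group
print(n_per_group(0.5))  # medium effect: 63 per group
```

The point of the example is the one Daniel makes: for subtle (medium or small) effects, the required sample size quickly exceeds what a typical clinical pilot cohort provides.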
However, this is more complicated when structuring an exploratory analysis with multivariate data. We were dealing with hundreds of variables from three databases, without a previous hypothesis. My approach to dealing with the small sample rested on the basic assumption that the cohort was representative. The PCOS group of the sample population was selected under an Institutional Review Board (IRB)-approved protocol, with the aim of representing women with obesity diagnosed with PCOS, compared with a control group of women with obesity but no PCOS diagnosis (the non-PCOS group). The main study was published in 2016.
Our exploratory analysis was a post-hoc analysis of the lipidome and metabolome of the cohort plasma. Assuming that the small group we had at hand was representative of the PCOS population across women with obesity, we proposed that the SMOTE node found on the KNIME Hub was a perfect solution to the sample size issue. Its description on the KNIME Hub states:
This node oversamples the input data (i.e. adds artificial rows) to enrich the training data. The applied technique is called SMOTE (Synthetic Minority Over-sampling Technique) by Chawla et al. It creates synthetic rows by extrapolating between a real object of a given class (in the above example "active") and one of its nearest neighbors (of the same class). It then picks a point along the line between these two objects and determines the attributes (cell values) of the new object based on this randomly chosen point.
We opted to oversample both groups, since each was composed of only 15 study participants. Oversampling four times increased the sample to 60 observations per group. The synthetic sample has a similar distribution profile and similar z-scores, thus providing a representative approximation of the variability within each group and a better chance of revealing any statistically significant difference. Also, the binary groups of the PCOS study (yes if the woman has PCOS; no otherwise) allowed for the use of classification models, such as Random Forest and Naive Bayes. Moreover, we could use validation models with logistic regression and Area Under the Curve estimates. The synthetic sample size also allowed us to split the groups into 80% for training and 20% for testing, which would otherwise have been impossible. The result of the exploration showed that a Random Forest approach selected a model with eight variables, rendering an accuracy of 98%.
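To make the quoted description concrete, here is a minimal, stdlib-only Python sketch of the interpolation step SMOTE performs. Everything here is illustrative: the `smote` function, the neighbor count `k`, and the toy Gaussian data are assumptions for demonstration, not the KNIME node’s actual implementation or the study’s data.

```python
import random

def smote(rows, k=3, multiplier=4, seed=0):
    """Minimal SMOTE sketch: for every real row, create (multiplier - 1)
    synthetic rows by picking a random point on the line segment between
    the row and one of its k nearest same-class neighbors."""
    rng = random.Random(seed)
    out = list(rows)
    for i, row in enumerate(rows):
        # indices of the k nearest neighbors by squared Euclidean distance
        neighbors = sorted(
            (j for j in range(len(rows)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(row, rows[j])),
        )[:k]
        for _ in range(multiplier - 1):
            nb = rows[rng.choice(neighbors)]
            t = rng.random()  # random position along the segment row -> nb
            out.append(tuple(a + t * (b - a) for a, b in zip(row, nb)))
    return out

# 15 observations per group, as in the study, oversampled 4x -> 60 rows
rng = random.Random(1)
real = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(15)]
oversampled = smote(real)
print(len(oversampled))  # 60
```

In the actual workflow, the classification steps (Random Forest, the 80/20 train/test split) would then run on the oversampled table, exactly as Daniel describes.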
We are finishing the manuscript for publication, and we are confident that our approach will be accepted by a peer-reviewed journal. Moreover, we will make the workflow public for other researchers to use, so that they can find solutions for similar small-sample-size studies.
(Aline) We are so happy to hear that this approach has been leading to satisfactory results! When we first talked, I remember that you mentioned being frustrated about the idea of coding. We were wondering if you gave coding a shot at some point, and if so how the experience compares with using a low-code tool such as KNIME.
(Daniel) I indeed tried to learn R, although my personal experience is that complex scripts are time-consuming to learn and difficult to make concise. I came close to using scripts available online, but never got past the surface of linking the several different steps that were necessary to solve some of my questions. Although I must say that I resisted the idea of learning how to code, my life as a researcher became so much easier and more exciting once I found how easy it was to work with KNIME nodes. I sincerely think that anyone with a basic statistical understanding can use KNIME as a tool to sidestep the coding process. Actually, setting hyperparameters during node configuration is the closest to coding the KNIME experience gets me. The possibilities for solving the different problems my line of work presents are immeasurable.
(Aline) Fantastic. Another thing we talked about last time was your interest in statistics, even though you are not a statistician. Have you ever used other KNIME statistical nodes in your research, or are there any specific nodes that you are interested in exploring?
(Daniel) The need to use multivariate statistical analysis was a result of my curiosity about the possibilities of finding hidden information in the huge amount of data we were acquiring in the lab. We asked some statisticians to look at our data and propose some approaches for data mining, but noticed that there was resistance to this approach. They usually ended up applying the hypothesis-driven approach of group comparison with ANOVA and t-tests, or their non-parametric counterparts. Unfortunately, this approach was hampered by the multiple pairwise comparisons required across hundreds of variables, and by the lack of power for comparison.
I decided to learn a bit more about data mining without any prejudice, and I started to find solutions for the analysis. Together with Shanaka, I have published around 20 papers, and more are getting ready to be published this year. As for my discovery of KNIME nodes, I am now working on adding an unsupervised side-workflow to the supervised one I used for the PCOS study, which will give me the chance not only to explore the differences, but also to identify possible outliers and study their peculiar pathophysiological profiles. For that I am learning how to use the Cluster Analysis node. It doesn’t get more exciting than that, right?
(Aline) Right. Unsupervised methods are excellent for finding hidden structure in data, which can sometimes even lead to new research questions. Now, a final question: here at KNIME we are always looking for ways to deliver better solutions. For that reason, we would like to know if there is anything you would like to see improving, or if there is any functionality that is currently missing in your opinion.
(Daniel) I learned to use KNIME very quickly two years ago, but there is still so much I need to learn. The use of machine learning and AI training to identify patterns in data generated by our lab is a challenge that we are starting to tackle. I spent several months out of the lab for personal reasons, and only now am I back to research work. I came back to KNIME and found that several improvements had been added. The node repository looks more structured and offers a long list of new nodes that I need to explore, since they did not exist two years ago.
I still find it confusing to know when to use a Component, as I use a lot of Metanodes, as you can see in my workflow. I wonder whether, as we construct our workflows, the system could offer suggestions on when a Component or a Metanode would simplify our work. I have seen some of my colleagues’ workflows, and they would definitely benefit from some tips to make sense of the labyrinth they are creating. If something like this already exists, it’s not clear to me and I would like to learn about it. I am starting to follow the online training, and I hope I will be able to work more efficiently and give better feedback in the future.
(Aline) Thank you so much for your time, Daniel! Your story of quickly learning KNIME to tackle an important problem is very inspiring!
(Daniel) It was my pleasure to participate in the conversation. Thank you for the opportunity.
And that is it for today, folks! If you enjoyed our interview with Daniel, stay tuned for our article on his research work!