The importance of community in data science

As first published in Data Science Central.

Assembling predictive analytics workflows benefits from help and reviews: on processes and algorithms from data science colleagues; on the IT infrastructure to deploy, manage, and monitor AI-based solutions from IT professionals; on dashboards and reporting features to communicate the final results from data visualization experts; and on automation features for workflow execution from system administrators.

Data scientists benefit from a community of experts

The need for a community of experts to support the work of a data scientist has given rise to a number of forums and blogs where help can be sought online. This is not surprising: data science techniques and tools are constantly evolving, and for the most part only online resources can keep pace. Of course, you can still draw on traditional publications, like books and journals. However, they are better suited to explaining fundamental concepts than to answering simple questions on the fly.

Whatever the topic, you’ll usually find a forum where you can post your question and wait for an answer. If you have trouble training a model, head over to Kaggle Forum or Data Science Reddit. If you are coding a particular function in Python or R, you can turn to Stack Overflow for help. In most cases, there will be no need to post a question at all, because someone else is likely to have had the same or a similar query, and the answer will be there waiting for you.

Sometimes, though, for complex topics, threads on a forum might not be enough to get the answer you seek. In these cases, a blog post may provide a full and detailed explanation of that brand-new data science practice. On Medium, you can find many well-known authors freely sharing their knowledge and experience without constraints imposed by the platform owner. If you prefer blogs with moderated content, check out online magazines such as Data Science Central, KDnuggets or the KNIME Blog.

There are also a number of platforms where you can easily share your work with others. The most popular example is probably GitHub, where lots of code and open source tools are shared and constantly updated by many data scientists and developers.

Despite all of these examples, inspiring data science communities do not need to be online; you can often connect with other experts offline as well. For instance, you could join free events in your city via Meetup or go to conferences like ODSC or Strata, which take place on different continents several times each year.

I am sure there are many more examples of data science communities that could be mentioned, but now that we have seen some of them, can you tell what a data scientist actually looks for on all those different platforms?

To answer this question, we will explore four basic needs that data scientists have in accomplishing their daily work.

1. Data science examples to learn from

Data scientists are constantly updating their skill set, and for that they need algorithm explanations, advice on techniques, hints on best practices and, most of all, recommendations about the process to follow. What we learn in schools and courses is often the standard data analytics process. However, in real life, many unexpected situations arise, and we need to figure out how best to solve them. This is where help and advice from the community become precious.

Junior data scientists rely on the community even more to learn. The community is where they hope to find exercises, example datasets, and prepackaged solutions to practice with. There are a number of community hubs where junior data scientists can learn more about algorithms and best practices through courses on site, online, or a combination of the two: starting with the dataset repository at UC Irvine, continuing with the datasets and knowledge-award competitions on Kaggle, through to educational online platforms such as Coursera or Udemy. There, junior data scientists can find a variety of datasets, problems and ready-to-use solutions.

However, blind trust in the community is often cited as a problem of the modern web-connected world. Such examples and training exercises must carry some degree of trustworthiness, either from a moderated community, where the moderator is responsible for the quality of the material, or from some kind of review system fueled by the community members themselves. In the latter, the members of the community evaluate and rate the quality of the training material offered, example by example. Junior data scientists can therefore rely on the previous experience of other data scientists and start from the highest-rated workflow to learn new skills. If the forum or dataset repository is not moderated, a review system is necessary for orientation.

2. Blueprints to jump-start the next project

Example workflows and scripts, however, are not only for junior data scientists. Seasoned data scientists need them too! More precisely, seasoned data scientists need blueprint workflows or scripts that they can quickly adapt to their new project. Building everything from scratch for each new project is quite expensive in terms of time and resources. Relying on a repository of similar, adaptable prototypes speeds up the proof-of-concept (PoC) phase as well as the implementation of the early prototype.

Like junior data scientists, seasoned data scientists use the data science community to download, discuss and review blueprint applications. Again, ratings and reviews by the community provide a measure of the quality of each blueprint.

3. Feedback on solutions from the community

It is not true that users are only interested in a free ride, meaning, in this case, free solutions. Users have a genuine wish to contribute back to the community with material from their own work. Often, users are more than willing to share and discuss their scripts and workflows with other users in the community. Uploading a solution, and the discussion that can ensue, has the additional benefit of revealing bugs or improving the data flow, making it more efficient. One mind, as brilliant as it may be, can only achieve so much. Many minds working together can go much farther!

This concept reflects the open source approach of many data science projects in recent years: Jupyter Notebook, Apache Spark, Apache Hadoop, KNIME, TensorFlow, Scikit-learn and more. Most of those projects developed even faster and more successfully precisely because they leveraged the help of community members by providing free and open access to their code.

Modern data scientists need an easy way to upload and share their example workflows and projects, as well as an option to easily download, rate and discuss those already published online. When you offer an easy way for users to share their work, you’d be surprised by the number of contributions you receive from community users. If we are talking about code, GitHub is a good example.

4. A space for discussions

As we pointed out, the main advantage for a data scientist of uploading their own examples to a public repository, besides the pride and self-fulfillment of being a generous and active member of the community, lies in the corrections and improvements suggested by fellow data scientists.

Assembling a prototype solution to a problem might take a relatively short time. Improving that solution to make it faster and more scalable, and to gain those few additional percentage points of accuracy, might take longer. More research, study of best practices, and comparison with other people’s work is usually involved, and that takes time, with the risk of missing a few important works in the field.

Therefore, data scientists need an easy way to hold discussions with other experts within the community, which can significantly shorten the time needed for solution improvement and optimization. A community environment for exchanging opinions and discussing solutions would serve the purpose. This could take place online on websites like the KNIME Forum or offline at free local meetup events.

These are the four important social features that data scientists rely on to build and improve their data science projects.

A data science platform for the community

Data scientists could definitely use a project repository interfaced with a social platform to learn the basics of data science, jump-start the work for their current project, discuss best practices and improvements, and last but not least, contribute back to the community with their knowledge and experience.

Project implementation is often tied to a specific tool. Wouldn’t it be great if every data science tool could offer such a community platform?