
Why a data practice is not a software engineering department

April 18, 2024
Data strategy

A big part of “digital transformation” is, or at least should be, professionalizing data activities. Even the help of GenAI won’t make the need for data expertise go away – on the contrary, more than ever, data workers will need to focus on the complex (and hopefully interesting) parts of working with data. The ultimate goal remains the same: enable everyone in the organization to make sense of data in a flexible yet reliable and trustworthy way, while at the same time controlling which data and tools can be used – which matters even more nowadays, given the temptation to send data into the cloud to get help from someone else’s AI. And then there is, of course, the need for governance of the (many) critical parts: Can we audit what happened to the data, why it happened, and which data and analyses were actually run – even years later?

Many companies are approaching this as a software challenge and are essentially building a software engineering practice that focuses on their data activities. However, even though strong similarities exist, there is one key difference that makes this harder than it looks: data work is extremely fluid across teams, resources, and time, and that fluidity requires different environments and setups.

Fluid roles

Most importantly, there is fluidity across roles.

In a software practice, we have fundamentally one group of people that builds tools for the rest of the company (or for its customers). This group consists of various roles, such as frontend and backend developers, UX designers, or security engineers. They can only succeed if they work closely and interactively with the end users, which is one reason why agile development processes are so successful. Here, the software practice group builds software that the end users, well: use.

In a data practice, we have different roles as well: data engineers, visualization experts, ML and AI scientists, data analysts. They often form a core data team that provides access to data and all sorts of tools and builds ready-for-consumption data applications. This team also needs to work closely with the actual users. Typically these are business analysts who provide domain knowledge, or simply users who want to gain insights from the available data. But the real power emerges when the core data team enables everybody else in the organization to derive new insights or even build new applications themselves. This may be as simple as creating a new view on a collection of data sources or automating repetitive data aggregation tasks; it can also be as complex as trying out a couple of different predictive models. Done right, this builds an active “data community” across the entire organization and gradually upskills everybody to do a bit of data engineering, visualization, and analytics themselves – to the degree they need to.

Fluid skills

Secondly, there is also fluidity across skills and expertise. The core data team therefore needs to provide not only ready-to-use tools for one user group, as is typical in software development, but also tools that allow people from very different backgrounds to go about their data work and make sense of data as they see fit, in a way that matches their skill levels. People who come from a spreadsheet background are masters of data wrangling, others are statistics or visualization wizards, and some really are deep into the science part of data science and optimize predictive models. But hardly anyone knows it all – although we would like all of them to enhance their skills gradually over time. Enabling these different types of workers to work and grow together will make the entire organization “data ready”.

Fluid data and tools

And finally, there is fluidity when it comes to data and tools. We cannot build access control into a closed, final application; instead, we need to provide varied degrees of access to building blocks our data community can build upon. And those building blocks need to be updated regularly, since the types of data and tools that are available change continuously. This leads to even more complex governance issues: we cannot establish well-crafted, built-in boundaries once and for all, but need to adjust them as the landscape underneath changes. We need to provide flexible access to updated data aggregations, enable the use of new technologies, or quickly reroute ongoing activities to a new data storage provider or a new AI vendor – all without disrupting the ongoing activities of our community.
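To make the idea of an access-controlled, auditable building block concrete, here is a minimal sketch in Python. All names (the class, roles, and the sample aggregate) are invented for illustration and do not correspond to any actual product API; the point is only that a shared building block can gate access by role while recording every attempt for a later audit.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical sketch: a reusable "building block" that gates access to a
# pre-approved data aggregate by role and writes an audit trail for every
# request, granted or not.
@dataclass
class AuditedDataAccess:
    name: str             # e.g. "monthly_sales_aggregate"
    allowed_roles: set    # roles permitted to read this aggregate
    audit_log: list = field(default_factory=list)

    def read(self, user: str, role: str):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "role": role,
            "resource": self.name,
            "granted": role in self.allowed_roles,
        }
        self.audit_log.append(entry)  # every attempt is recorded
        if not entry["granted"]:
            raise PermissionError(f"{role!r} may not read {self.name}")
        return self._load_aggregate()

    def _load_aggregate(self):
        # In a real block this would query an anonymized, approved view;
        # here we return canned sample data.
        return [{"region": "EMEA", "revenue": 1250000}]

block = AuditedDataAccess("monthly_sales_aggregate", {"analyst", "data_engineer"})
rows = block.read(user="alice", role="analyst")
```

Because the community only ever touches such blocks, rerouting to a new storage provider means changing `_load_aggregate` in one place, while the audit trail and access rules stay intact.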

So what *is* the difference?

In a nutshell: It’s almost as if a software department were trying to turn all of their users into part-time assistant developers, some helping out with a bit of the frontend work, some doing backend development or UX design – and a few touching all of those parts here and there. And yet in software, we don’t want to enable users to make modifications to the program themselves. That would be a governance and reliability nightmare. This is in sharp contrast to the modern, data-driven organization, where we do want to enable everyone to make changes to the analytical workflows or even develop some themselves.

So what do we need to keep that fluid setup of roles, skills, data, and tools under control?

From the data community’s perspective, it needs to be one environment that everybody can use and that is flexible enough to accommodate changing resources and requirements. It needs to be an environment that allows new members of the community to get started easily and intuitively, but that at the same time provides the power for more advanced users to build complex data processes.

For our core data team, that same environment needs to allow the building of pre-packaged applications, but more importantly, it also needs to support the distribution of pre-defined building blocks for controlled data and tool access (especially when those reside outside of our organizational boundaries) and of best practices for all levels of users, and to enforce self-documenting data processes so we can also audit what was done, and how, at a later stage.

Put differently, the core data team needs an operating system for all data activities. The core data team owns the setup and the boundaries and provides some of the standard applications on top of it. They still provide the deep expertise in data engineering, visualization, and analytics, and they govern and potentially also audit how deep the community can dive into those aspects. The data engineers may provide ready-to-use building blocks to the data community that give access to well-defined and anonymized data aggregates. They also ensure that all access through such building blocks can be properly audited. The visualization experts provide prepackaged visualizations aligning with corporate best practices. The AI specialists provide pre-configured packages to access the approved AI services using the company’s cloud account. And the compliance engineer on the data team provides a whitelist of modules for each department to ensure only approved tools are being used. The community can collaborate freely within those boundaries, using the same environment across diverse expertise and skill levels.
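The per-department whitelist described above could, in its simplest form, look like the following sketch. The department names, module names, and the `check_workflow` helper are all hypothetical, chosen only to show the shape of such a check: compare the modules a workflow uses against what a department is approved for, and surface any violations.

```python
# Hypothetical sketch: per-department module whitelisting. All department
# and module names are invented for illustration.
DEPARTMENT_WHITELIST = {
    "marketing": {"db_reader", "chart_builder", "approved_llm_client"},
    "finance": {"db_reader", "spreadsheet_import"},
}

def check_workflow(department: str, modules_used: list) -> list:
    """Return the modules in a workflow that are NOT approved for a department.

    An unknown department has an empty whitelist, so every module it uses
    is flagged.
    """
    allowed = DEPARTMENT_WHITELIST.get(department, set())
    return sorted(m for m in modules_used if m not in allowed)

# The LLM client is approved for marketing, but not for finance:
violations = check_workflow("finance", ["db_reader", "approved_llm_client"])
```

A real deployment would of course pull the whitelist from a governed, versioned source rather than a hard-coded dictionary, but the enforcement point stays this simple: one check, run wherever workflows are executed.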

This environment essentially provides the “lingua franca” for everybody to collaborate, share, and help everyone become data literate together.