Do you remember the Iron Chef battles?
It was a televised series of cook-offs in which famous chefs rolled up their sleeves to compete in making the perfect dish. Based on a set theme, this involved using all their experience, creativity, and imagination to transform sometimes questionable ingredients into the ultimate meal.
Hey, isn’t that just like data transformation? Or data blending, or data manipulation, or ETL, or whatever new name is trending now? In this new blog series requested by popular vote, we will ask two data chefs to use all their knowledge and creativity to compete in extracting a given data set's most useful “flavors” via reductions, aggregations, measures, KPIs, and coordinate transformations. Delicious!
Want to find out how to prepare the ingredients for a delicious data dish by aggregating financial transactions, filtering out uninformative features or extracting the essence of the customer journey? Follow us here and send us your own ideas for the “Data Chef Battles” at firstname.lastname@example.org.
Ingredient Theme: Energy Consumption Time Series. Behavioral Measures over Time and Seasonality Index from Auto-Correlation.
Data Chefs: Haruto and Momoka
Ingredient Theme: Energy Consumption Time Series
Let’s talk today about electricity and its consumption. One of the hardest problems in the energy industry is matching supply and demand. On the one hand, over-production of energy can be a waste of resources; on the other hand, under-production can leave people without the basic commodities of modern life. The prediction of the electrical energy demand at each point in time is therefore a very important chapter in data analytics.
For this reason, a couple of years ago energy companies started to monitor the electricity consumption of each household, store, or other entity, by means of smart meters. A pilot project was launched in 2009 by the Irish Commission for Energy Regulation (CER).
The Smart Metering Electricity Customer Behaviour Trials (CBTs) took place during 2009 and 2010 with over 5,000 Irish homes and businesses participating. The purpose of the trials was to assess the impact on consumers’ electricity consumption, in order to inform the cost-benefit analysis for a national rollout. Electric Ireland residential and business customers and Bord Gáis Energy business customers who participated in the trials, had an electricity smart meter installed in their homes or on their premises and agreed to take part in research to help establish how smart metering can help shape energy usage behaviors across a variety of demographics, lifestyles, and home sizes. The trials produced positive results. The reports are available from CER (Commission for Energy Regulation) along with further information on the Smart Metering Project. In order to get a copy of the data set, fill out this request form and email it to ISSDA.
The data set is just a very long time series: one column covers the smart meter ID, one column the time, and one column the amount of electricity used in the previous 30 minutes. The time is expressed in number of minutes from 01.01.2009 : 00.00 and has to be transformed back to one of the classic date/time formats, like for example dd.MM.yyyy : HH.mm. The original sampling rate, at which the used energy is measured, is every 30 minutes.
The first data transformations, common to all data chefs, involve the date/time conversion and the extraction of year, month, day of month, day of week, hour, and minute from the raw date.
Topic. Energy Consumption Time Series
Challenge. From time series to behavioral measures and seasonality
Methods. Aggregations at multiple levels, Correlation
Data Manipulation Nodes. GroupBy, Pivoting, Linear Correlation, Lag Column
What can we do, in general, with a time series? Often the final goal is prediction of future values, based on current and past values. Just how much past? Also, a time series can follow very different shapes. Has the shape any meaning? Can we summarize the time series evolution, by describing the electricity related habits of the household? Is there any seasonality that we can take into account? Is it possible to predict future values for groups of similar time series? In this case, how do we measure similarity across time series? Well, let’s start this challenge and let’s see what our data chefs have prepared for today’s data!
Data Chef Haruto: Behavior Measures over Time
Haruto has decided to remain in the time space and to analyze the electrical behavior of the energy consumers, as measured by their smart meters. In particular, he explored the energy consumption on weekends and business days, on each day of the week, on each hour of the day, and for different time frames during the day.
In order to do that, he calculated first the average energy consumption by week day, by hour, by day times, and by weekends vs. business days. Average values already show who is using the largest amount of energy. Then he transformed such average values into percentages, to understand when each entity uses how much energy.
In figure 1, the workflow upper branch - embedded in the “Usage Measures” labelled square – is from Data Chef Haruto.
Figure 1. Final Workflow 03_ETL_Energy_autocorr_stats. The upper part named "Usage Measures" describes the entity’s energy consumption behavior. The lower part labelled "Auto-correlation Matrix" calculates the auto-correlation matrix of the energy consumption time series for a selected meter ID. This workflow is available on the KNIME EXAMPLES Server under 02_ETL_Data_Manipulation/06_Date_and_Time_Manipulation/03_ETL_Energy_autocorr_stats02_ETL_Data_Manipulation/06_Date_and_Time_Manipulation/03_ETL_Energy_autocorr_stats*
The first 2 metanodes, named “Daily Values” and “Hourly Values”, respectively, calculate (Fig. 2):
- The average daily/hourly energy usage by meter ID (GroupBy node)
- The average energy usage by meter ID vs. day of week/hour of day (Pivoting node)
- The average energy usage during weekends vs. business days / day time frames (Rule Engine + Pivoting node)
After that a series of Math Formula nodes in the metanodes named “Intra-day segments (%)” and “Week Day (%)” put the average values into context, by reporting them as the percentage of energy used during intra-day segments and during week days.
Figure 2. Content of the “Daily Values” metanode to calculate the daily energy consumption by meter ID in average, in average per day of week, in average over week ends and business days
Data Chef Momoka: Auto-Correlation Matrix
Momoka decided to look for seasonality patterns and for that to check each time series auto-correlation matrix.
In figure 1, the workflow lower branch - embedded in the “Auto-correlation Matrix” labelled square – is the result of Data Chef Momoka’s work.
First the data is shaped as a pivoting table with average energy consumption of meter ID vs. date and hour. The metanode “Pivoting”, indeed, produces the energy consumption time series for all meter IDs, sampled every hour, and sorted by time. The subsequent metanode, named “Select Meter ID”, allows to select one time series only through its meter ID value.
In order to calculate the auto-correlation matrix, we need:
- normalized values for a meaningful comparison of the correlation indeces
- past values to calculate the correlation of the current sample with its past N samples
In metanode “Normalize & Lag”, time series values are then being normalized into [0,1] and N past samples are introduced. Normalization is achieved with a Normalizer node, while the N past samples are produced by a Lag Column node. The Lag Column node makes N copies of the selected column and shifts its values of 1, 2, …, N steps forward. If the column values were sorted by time, this means that we would attach the N past samples of the time series to the current one.
The auto-correlation matrix of the current samples with their past N samples is then calculated using a Linear Correlation node. The correlation matrix will show a few highly correlated columns, like for example x(t) and x(t-2). In particular, if the auto-correlation function shows local maxima at recurrent steps in the past, like at x(t) and x(t- i*24) with i= 1,2, .., this might be a sign of a seasonality pattern.
The “Find Seasonality” metanode searches for such local maxima in the correlation functions. It detects the smallest seasonality period as the position of the first local maximum of the correlation function’s first derivative (Fig. 3).
Figure 3. Content of “Find Seasonality” metanode, which finds the local maxima in the auto-correlation function through its first derivative values
The final workflow with the contributions of both data chefs can be admired in Figure 1 and can be found on the EXAMPLES server in: 02_ETL_Data_Manipulation/06_Date_and_Time_Manipulation/03_ETL_Energy_autocorr_stats02_ETL_Data_Manipulation/06_Date_and_Time_Manipulation/03_ETL_Energy_autocorr_stats*
Let’s now see the final results on a specific smart meter. The jury randomly chose meter ID 1038.
According to the behavioral features provided by data chef Haruto, the entity connected to meter ID 1038 uses 232 kW/day on average, more or less the same amount every day of the week, with not much difference between weekends and business days. Moving to the hour scale, meter ID 1038 uses ~10 kW/hour on average, most of it during the day and roughly equally distributed over morning and afternoon.
Indeed, the line plot provided by data chef Momoka for the energy usage time series of meter ID 1038 (Fig. 4) shows a cyclical trend day vs. night, where the kWs used during the day are definitely dominant. The plot also shows no difference of electricity usage across week days.
Figure 4. Line Plot of energy consumption time-series for meter ID 1038. Notice the day/night rhythm.
This cyclic trend justifies the auto-correlation based findings of data chef Momoka. The signal autocorrelation map (Fig. 5) shows a 24-hour cycle, with local maxima in the auto-correlation function at x(t) and x(t-24), x(t) and x(t-48), x(t) and x(t-72) and so on. The smallest seasonality period was calculated to be 24 hours.
Note. The stronger the cyclic behavior of the auto-correlation matrix, the more meaningful the seasonality pattern. In figure 5, the time-series seasonality is clearly visible through the cyclic trend of its auto-correlation matrix.
Figure 5. Auto-correlation matrix of energy consumption time-series for meter ID 1038. You can see the cyclic trend of the auto-correlation matrix and the auto-correlation local maxima at -24, -48 and so on.
Again, the final workflow with the contributions of both data chefs can be found on the EXAMPLES server in: 02_ETL_Data_Manipulation/06_Date_and_Time_Manipulation/03_ETL_Energy_autocorr_stats02_ETL_Data_Manipulation/06_Date_and_Time_Manipulation/03_ETL_Energy_autocorr_stats*
Note. The example workflow on the EXAMPLES server works only on a subset of the original dataset. This is because the original dataset must be obtained by filling the request form and emailing it to ISSDA. Therefore the auto-correlation map and in general all other numbers shown in this post will be different when produced by the example workflow on the reduced dataset!
We have reached the end of this competition. Congratulations to both our data chefs for wrangling such interesting features from the raw data ingredients! Oishii!
Coming next …
If you enjoyed this, please share it generously and let us know your ideas for future data preparations.
We’re looking forward to the next data chef battle. The theme ingredient there will be ClickStream data
* The link will open the workflow directly in KNIME Analytics Platform (requirements: Windows; KNIME Analytics Platform must be installed with the Installer version 3.2.0 or higher)