Section 3.5. Data Preparation for Time Series Analysis

Here we describe all of the processing operations and techniques that can help you with dates, times, signal processing, and general data preparation for time series analysis.

Lag Column Node: The Key to Time Series Analysis

The Lag Column node is a very simple and yet very powerful node, since it allows us to perform time series analysis.

Strictly speaking, all it does is make a copy of the selected input data columns and shifts the cells down an n number of steps. However, if the data cells are ordered in time, moving them two steps down is equivalent to moving them into the past or future, depending on the time order.

Using the Lag Column node you can create a vector of past samples for each time, t, in the time series, with predefined size and sampling frequency.

Exercise

  • Read dataset from file clustered hourly_values_all.csv
  • Replace the row ID values in the dataset with the values in the “row ID” column
  • Transform the “row ID” column into a Date&Time type, and append these values as a new column in the dataset
  • Sort the data in ascending order according to the column of type Date&Time
  • Filter the data to only include one of the cluster time series
  • Use the Lag Column node to build the vector of 10 past samples for each time, t
  • Plot the time series and the past 10 samples for each time, t

 

Solution
  1. Use the File Reader node to read the file clustered_hourly_values_all.csv.
  2. Use the RowID node to define the “row ID” column as the new row ID column. Make sure that “Remove selected column” is not selected in the configuration dialog.
  3. Connect the String to Date&Time node to the data with the new row IDs. Include the “row ID” column, select “Date&Time” as “New type” and write“ yyyy-MM-dd_HH” as “Date format” in the configuration dialog.
  4. Connect the Sorter node to the data with the newly created Date&Time values. In the configuration dialog, select this column of type Date&Time in the “Sort by:” section, and check “Ascending”.
  5. Use the Column Filter node to extract only one time series column from the sorted data, for example “Cluster_26”. Exclude all of the other columns.
  6. Use the Column Rename node to rename the selected time series column to “cluster”: double click the “cluster_26” column in the configuration dialog, select “Change”, and write the new name “cluster” in the field next to the check box.
  7. Connect the Lag Column node to the filtered and preprocessed data, and apply the following settings:
    • Column to lag = “cluster”
    • Lag = 10
    • Lag interval = 1
    • Select both “Skip initial incomplete rows” and “Skip last incomplete rows”
  8. Plot the data with the time series and columns containing past samples using the Line Plot (JavaScript) node. Select row IDs as x-axis column and all other columns (the time series column and columns containing past samples) as y-axis columns in the configuration dialog.

The solution workflow is available in KNIME EXAMPLES Server under 02_ETL_Data_Manipulation/06_Date_and_Time_Manipulation/05_Vector_of_Past_Samples