
KNIME Large Row Performance

Member for

5 years 5 months nxfxcom

Hello.

This is not the usual "how do I increase RAM" question. It is much more specific ;) We are loading CSV files with about 20 columns and 3 million rows (11 GB each). Any transformation, such as String Manipulation (to lower case), takes ages, and grouping to do calculations takes ages as well. My knime.ini is:

-startup
plugins/org.eclipse.equinox.launcher_1.3.200.v20160318-1642.jar
--launcher.library
plugins/org.eclipse.equinox.launcher.win32.win32.x86_64_1.1.400.v20160518-1444
--launcher.defaultAction
openFile
-vm
plugins/org.knime.binary.jre.win32.x86_64_1.8.0.152-01/jre/bin
-vmargs
-Dorg.knime.container.cellsinmemory=6000000
-server
-Dsun.java2d.d3d=false
-Dosgi.classloader.lock=classname
-XX:+UnlockDiagnosticVMOptions
-XX:+UnsyncloadClass
-Dsun.net.client.defaultReadTimeout=0
-XX:CompileCommand=exclude,javax/swing/text/GlyphView,getBreakSpot
-Xmx42G
-Dorg.eclipse.swt.browser.IEVersion=10001
-Dsun.awt.noerasebackground=true
-Dequinox.statechange.timeout=30000

I am running this on a 4-core Xeon 1535 v6 at 3.1 GHz, and as you can see in the attached images, the machine is not fully utilized. And yet a simple GroupBy (no math) takes 6 minutes, and a String Manipulation takes close to an hour. I have set it to keep everything in memory, but it is still no better.

What can I do? (I have to run about 60 of these files a day ;)

Thank you

Comments
Thu, 03/01/2018 - 01:53

Member for

5 years 5 months

nxfxcom

Also, I am using fast SSDs. Should I set up something like a scratch disk for KNIME?

Thu, 03/01/2018 - 03:04

Member for

3 years 2 months

RolandBurger

Hi,

One thing you can do is enable streaming execution, see here for how to set this up: https://www.knime.com/blog/streaming-data-in-knime

This will work well with a lot of transformation nodes, including String Manipulation. It will however not work with nodes that need to use information from the full table, e.g. GroupBy. If possible, you can structure your workflow in a way that connects as many streamable nodes as possible, and only do non-streamable tasks like grouping at the end.
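To illustrate why the order matters, here is a minimal sketch in plain Python (not KNIME code; the function names and the sample data are made up for illustration). A row-wise step like lower-casing can hand each row on as soon as it is read, while a grouping step can only report its result after it has seen the last row:

```python
import csv
import io
from collections import defaultdict


def lowercase_rows(rows, column):
    """Row-wise step: each row is processed on its own, so it can be streamed."""
    for row in rows:
        row = dict(row)  # copy so the input row is not modified
        row[column] = row[column].lower()
        yield row


def count_by(rows, column):
    """Grouping step: the counts are only final once every row has been seen."""
    counts = defaultdict(int)
    for row in rows:
        counts[row[column]] += 1
    return dict(counts)


# Hypothetical sample data standing in for one of the large CSV files.
sample = io.StringIO("city,amount\nBerlin,1\nBERLIN,2\nZurich,3\n")
reader = csv.DictReader(sample)

# Streamable step first, grouping last, as suggested above.
result = count_by(lowercase_rows(reader, "city"), "city")
print(result)
```

The lower-casing never holds more than one row at a time; only the grouping has to keep state across the whole table.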

Hope that speeds things up a bit!

Also, you have set -Dorg.knime.container.cellsinmemory to 6M. Your tables seem to be bigger than that (about 20 columns × 3 million rows = 60M cells), so increasing that value might help as well.
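Spelling out the arithmetic behind the 60M figure, as a rough sketch: it assumes the "about 20 columns and 3 million rows" from the original post and that the setting counts cells per table. Whether a table of that size actually fits into the 42 GB heap depends on the cell contents, so treat the printed line as a starting point to try, not a tested recommendation.

```python
# Figures taken from the original post (approximate).
columns = 20
rows = 3_000_000

# Value currently set in the posted knime.ini.
current_limit = 6_000_000

cells_per_table = columns * rows
exceeds_limit = cells_per_table > current_limit

# Candidate replacement for the existing cellsinmemory line in knime.ini.
ini_line = f"-Dorg.knime.container.cellsinmemory={cells_per_table}"

print(cells_per_table, exceeds_limit)
print(ini_line)
```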

Cheers,

Roland

Files
1.PNG (102.67 KB)
2.PNG (87.13 KB)