SDF Writer US-ASCII charset limit

Member for

8 years 6 months swebb

Hi

We've been having some issues writing references out to an SDF when the reference contains an accented character.

Looking into the SDF Writer node, the DefaultSDFWriter#openOutputWriter method specifies US-ASCII as the Charset. I can't see anything in the SDF specification (http://media.accelrys.com/downloads/ctfile-formats/ctfile-formats.zip) that limits which characters the data may contain, other than a maximum length in characters.

Making the following change:

 

        if (m_settings.fileName().endsWith(".gz")) {
            return new BufferedWriter(new OutputStreamWriter(new GZIPOutputStream(os), StandardCharsets.ISO_8859_1));
        } else {
            return new BufferedWriter(new OutputStreamWriter(os, StandardCharsets.ISO_8859_1));
        }

 

This appears to enable us to write out the references. Have I overlooked something, or would it be safe to make this change?
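To make the difference concrete, here is a minimal standalone sketch of what the two charsets do to an accented character. It is not the KNIME code: the class SdfCharsetDemo, the helper openWriter and the file name refs.sdf are made up for illustration, and only the writer construction mirrors the snippet above.

```java
import java.io.BufferedWriter;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class SdfCharsetDemo {
    // Hypothetical helper: same shape as the change above, but with the charset as a parameter.
    static Writer openWriter(OutputStream os, String fileName, Charset charset) throws IOException {
        OutputStream target = fileName.endsWith(".gz") ? new GZIPOutputStream(os) : os;
        return new BufferedWriter(new OutputStreamWriter(target, charset));
    }

    // Writes the text through the writer and returns the raw bytes that would land in the file.
    static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (Writer w = openWriter(buffer, "refs.sdf", charset)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        String ref = "M\u00fcller"; // "Müller", escaped so the source file encoding does not matter
        // OutputStreamWriter silently replaces characters the charset cannot encode.
        byte[] ascii = encode(ref, StandardCharsets.US_ASCII);
        System.out.println(new String(ascii, StandardCharsets.US_ASCII)); // prints "M?ller"
        byte[] latin1 = encode(ref, StandardCharsets.ISO_8859_1);
        System.out.println(Integer.toHexString(latin1[1] & 0xFF)); // prints "fc"
    }
}
```

Note that ISO-8859-1 only covers Latin-1, so characters such as Greek letters would still be replaced with '?' under this change.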

Cheers

Sam

Comments
Tue, 05/30/2017 - 12:57

Member for

8 years 6 months

swebb

One thing I overlooked: can the SDF Reader read it back in? Answer: nope.
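A rough sketch of why a round trip can fail when writer and reader disagree on the charset. This assumes the reader decodes the file as US-ASCII, which I have not verified in the SDF Reader source. The class SdfRoundTripDemo and its method are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class SdfRoundTripDemo {
    // Encodes with one charset and decodes with another, as a mismatched writer/reader pair would.
    static String roundTrip(String text, Charset writeCharset, Charset readCharset) {
        byte[] onDisk = text.getBytes(writeCharset);
        return new String(onDisk, readCharset);
    }

    public static void main(String[] args) {
        String ref = "M\u00fcller"; // "Müller"
        // Assumption: the reader uses US-ASCII. Byte 0xFC is not valid ASCII,
        // so it is decoded as the replacement character U+FFFD.
        String mismatched = roundTrip(ref, StandardCharsets.ISO_8859_1, StandardCharsets.US_ASCII);
        System.out.println(mismatched.equals("M\uFFFDller")); // prints "true"
        // With the same charset on both sides the text survives.
        String matched = roundTrip(ref, StandardCharsets.ISO_8859_1, StandardCharsets.ISO_8859_1);
        System.out.println(matched.equals(ref)); // prints "true"
    }
}
```

So changing only the writer is not enough: the reader would need the same charset for the data to come back intact.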

Tue, 05/30/2017 - 06:01

Member for

8 years 6 months

swebb

Hrm, maybe CT Files are expected to be ASCII?

Tue, 05/30/2017 - 10:04

Member for

12 years 11 months

thor

I also didn't find a reference to what charset is acceptable in SDF files. Therefore we decided to stick to ASCII (which probably was the only one around when SDF was invented...).

Wed, 12/06/2017 - 05:07

Member for

2 years 1 month

WildCation

This has been a serious problem for me, as SDFs I encounter in the wild can be ASCII, cp-1252, or UTF-8.

Is there any chance of this functionality being added to the SDF reader/writer? I use KNIME to run various filters to weed out data quality problems, and it's really not great if all the alphas, betas, primes, etc. end up getting scrambled in the process.
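For what it's worth, one common way to cope with mixed inputs is to try a strict UTF-8 decode first and fall back to cp-1252 if that fails. The sketch below is only an illustration of that idea, not anything KNIME does. The class SdfCharsetGuess is made up, and it assumes the JVM ships the windows-1252 charset. Plain ASCII files decode correctly either way, since ASCII is a subset of UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class SdfCharsetGuess {
    // Decodes as UTF-8 if the bytes are valid UTF-8, otherwise falls back to windows-1252.
    static String decode(byte[] data) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)      // fail instead of replacing
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data))
                    .toString();
        } catch (CharacterCodingException e) {
            // Assumption: anything that is not valid UTF-8 is cp-1252.
            return new String(data, Charset.forName("windows-1252"));
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "\u03b1-pinene".getBytes(StandardCharsets.UTF_8);   // alpha encoded as UTF-8
        byte[] cp1252 = {'M', (byte) 0xFC, 'l', 'l', 'e', 'r'};            // "Müller" in cp-1252
        System.out.println(decode(utf8).equals("\u03b1-pinene"));  // prints "true"
        System.out.println(decode(cp1252).equals("M\u00fcller"));  // prints "true"
    }
}
```

This is a heuristic: a cp-1252 file whose bytes happen to form valid UTF-8 would be misread, so an explicit charset option on the reader and writer nodes would still be the more reliable fix.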