Automating the parsing of 1000s of Safety Data Sheets to mitigate risk

Life Sciences (Pharma & Biotech), Production, Production Quality
Soluzioni Informatiche

A Safety Data Sheet (SDS) is a standardized document by which chemical manufacturers communicate a chemical’s hazard information to chemical handlers. It typically contains chemical properties, health and environmental hazards, protective measures, and safety precautions for storing, handling, and transporting chemicals.

Chemical handlers usually extract information from an SDS by reading the section of interest. This manual approach does not scale when the Health, Safety & Environment manager needs to gather information about every chemical used in the company in order to put an adequate risk management plan in place. This KNIME workflow makes it possible to automatically extract hazard information from thousands of SDSs.

Automating text mining with KNIME

SDSs from different sources, customers, and providers are gathered. The user uploads either a single PDF, a library of PDFs, or a folder containing PDFs to a KNIME workflow, together with an Excel file listing all the requested phrases to be updated. The workflow can be deployed on KNIME Server if more computational power is needed. Text mining nodes are applied to the output of the Tika Parser to extract all the sentences in each file. Each sentence is then analyzed with string or regex manipulation to find the Chemical Abstracts Service (CAS) number, the product name, and all risk phrases. A try/catch construct keeps the workflow running despite large variations in the input files. The results report the file name, the product name, all the CAS numbers retrieved from each document, and all the retrieved phrases, which are matched against the user-defined list.
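The phrase-matching and error-handling steps described above can be sketched outside KNIME as follows. This is a minimal Python illustration, not the workflow itself: it assumes the text of each PDF has already been extracted (the step the Tika Parser performs in the workflow), and the function names `split_sentences`, `match_phrases`, and `process_documents` are hypothetical.

```python
import re


def split_sentences(text):
    # Naive splitter (an assumption): break after ., ! or ? followed by
    # whitespace, and at line breaks. Real SDS text may need a better one.
    parts = re.split(r"(?<=[.!?])\s+|\n+", text)
    return [p.strip() for p in parts if p.strip()]


def normalize(s):
    # Lower-case and collapse runs of whitespace so matching is lenient.
    return " ".join(s.lower().split())


def match_phrases(sentences, wanted_phrases):
    # Return the requested phrases that occur in at least one sentence.
    normalized = [normalize(s) for s in sentences]
    return [
        phrase
        for phrase in wanted_phrases
        if any(normalize(phrase) in s for s in normalized)
    ]


def process_documents(docs, wanted_phrases):
    # docs maps a file name to its already extracted text.
    rows = []
    for name, text in docs.items():
        try:
            sentences = split_sentences(text)
            phrases = match_phrases(sentences, wanted_phrases)
            rows.append({"file": name, "phrases": phrases, "error": None})
        except Exception as exc:
            # One malformed file should not stop the whole batch;
            # record the problem and continue with the next document.
            rows.append({"file": name, "phrases": [], "error": str(exc)})
    return rows
```

A document whose text could not be extracted (represented here as `None`) produces a row with an error message instead of aborting the run, which mirrors the role the try/catch construct plays in the workflow.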

⇒ Download Workflow from KNIME Hub

Why KNIME

The open source KNIME Analytics Platform not only makes this task faster, but also reduces the risk of human error. The Tika Parser node retrieves meta information from each file, the try/catch construct prevents a single problematic file from stopping the workflow, and regex code in a Java Snippet node isolates CAS numbers from the PDFs.
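To illustrate how a regular expression can isolate CAS numbers, here is a minimal sketch. The workflow does this in a Java Snippet node whose code is not shown here, so the Python below, the pattern, and the function names are assumptions for illustration only. It relies on the published CAS format (two to seven digits, two digits, and one check digit, separated by hyphens) and on the CAS check-digit rule to discard look-alike strings.

```python
import re

# 2-7 digits, hyphen, 2 digits, hyphen, 1 check digit.
# The lookarounds stop the pattern from matching inside longer digit runs.
CAS_PATTERN = re.compile(r"(?<!\d)(\d{2,7})-(\d{2})-(\d)(?!\d)")


def is_valid_cas(candidate):
    # Check the format, then verify the check digit: weight the other digits
    # 1, 2, 3, ... from right to left, sum them, and take the result mod 10.
    m = CAS_PATTERN.fullmatch(candidate)
    if not m:
        return False
    digits = m.group(1) + m.group(2)
    total = sum(int(d) * w for w, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(m.group(3))


def extract_cas_numbers(text):
    # Return the distinct valid CAS numbers in order of first appearance.
    found = []
    for m in CAS_PATTERN.finditer(text):
        cas = m.group(0)
        if is_valid_cas(cas) and cas not in found:
            found.append(cas)
    return found
```

The check-digit test is a design choice rather than a requirement: it filters out strings that merely look like CAS numbers, such as part numbers that happen to follow the same digit grouping.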

⇒ Download this Innovation Note as a PDF
