The Challenge: Automatically Extract Hazard Information
A Safety Data Sheet (SDS) is a standardized document that chemical manufacturers use to communicate a chemical’s hazard information to the people who handle it. It typically covers chemical properties, health and environmental hazards, protective measures, and safety precautions for storing, handling, and transporting the chemical. Handlers usually extract information from an SDS by reading the section of interest, but this manual approach does not scale when a Health, Safety & Environment manager needs information about every chemical used in the company in order to put an adequate risk management plan in place. This KNIME workflow automatically extracts hazard information from thousands of SDSs.
The Solution: Text Mining Workflow in KNIME
SDSs from different sources, customers, and providers are gathered. The user uploads a single PDF, a library of PDFs, or a folder containing PDFs, together with an Excel file listing all the requested phrases to be updated, to a KNIME workflow, which can be deployed on KNIME Server if more computational power is needed. Text mining nodes are applied to the output of the Tika Parser to extract the sentences that make up each file. Each sentence is then analyzed with string or regex manipulation to find the Chemical Abstracts Service (CAS) number, the product name, and all risk phrases. A try/catch construct helps the workflow cope with large variations in the input files. The results report the file name, the product name, all CAS numbers retrieved from each document, and all retrieved phrases, matched against the user-defined list.
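The sentence-matching step described above can be illustrated outside KNIME with a minimal Python sketch. It is not the workflow itself. It assumes the text has already been extracted from each PDF, and it replaces the Excel file with a hard-coded list. The names REQUESTED_PHRASES, split_sentences, match_phrases, and process_documents are hypothetical, the sentence splitter is deliberately naive, and the example phrases are illustrative only.

```python
import re

# Hypothetical stand-in for the phrase list the user supplies in the Excel file.
REQUESTED_PHRASES = [
    "Highly flammable liquid and vapour",
    "Causes serious eye irritation",
]


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def normalize(s):
    """Collapse runs of whitespace (including line breaks) and lowercase."""
    return re.sub(r"\s+", " ", s).strip().lower()


def match_phrases(text, phrases):
    """Return the requested phrases found in at least one sentence of the text."""
    sentences = [normalize(s) for s in split_sentences(text)]
    return [p for p in phrases if any(normalize(p) in s for s in sentences)]


def process_documents(docs, phrases):
    """docs maps a file name to its extracted text (None if extraction failed).

    A file that cannot be processed is recorded in `failed` and skipped,
    loosely mirroring the role of the try/catch construct in the workflow.
    """
    results, failed = {}, []
    for name, text in docs.items():
        try:
            results[name] = match_phrases(text, phrases)
        except TypeError:
            # re.split raises TypeError when the text is not a string.
            failed.append(name)
    return results, failed
```

Calling process_documents with a dictionary of file names and texts returns the matched phrases per file, plus the list of files that could not be processed, so one unreadable document does not stop the whole batch.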
Why KNIME Software
The open source KNIME Analytics Platform not only makes this task faster but also reduces the risk of human error. The Tika Parser node retrieves meta information from each file, the try/catch construct keeps unexpected input from halting the workflow, and regex code in a Java Snippet isolates CAS numbers from the PDFs.
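To make the CAS number step concrete, here is a minimal sketch of regex-based extraction. The workflow does this in a Java Snippet whose exact expression is not given here, so the pattern below is an assumption, written in Python for brevity. The check-digit validation is an addition that the source does not describe. It relies on the published CAS rule that the last digit equals the weighted sum of the preceding digits (weights 1, 2, 3, ... counted from the right) modulo 10. The names CAS_PATTERN, is_valid_cas, and extract_cas_numbers are hypothetical.

```python
import re

# CAS Registry Number layout: 2-7 digits, 2 digits, 1 check digit, hyphen-separated.
# The lookarounds keep the pattern from matching inside longer digit runs or dates.
CAS_PATTERN = re.compile(r"(?<!\d)(\d{2,7})-(\d{2})-(\d)(?!\d)")


def is_valid_cas(first, second, check):
    """Apply the CAS check-digit rule to the three captured groups."""
    digits = first + second
    # Weight digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(int(d) * w for w, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(check)


def extract_cas_numbers(text):
    """Return the distinct, checksum-valid CAS numbers in order of appearance."""
    found = []
    for match in CAS_PATTERN.finditer(text):
        cas = match.group(0)
        if is_valid_cas(*match.groups()) and cas not in found:
            found.append(cas)
    return found
```

The checksum filter helps discard strings that merely look like CAS numbers, such as batch codes. Whether the original workflow applies a comparable filter is not stated.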