KNIME Extensions for Next Generation Sequencing (trusted extension)

 

Institute Pasteur - PF2

We present here nodes and workflows for processing next generation sequencing (NGS) results. Most of these nodes are not specific to NGS data and may be useful in other circumstances as well.

The workflows presented here come with sample data and are pre-executed. They are not only meant to show how to use a specific node (that is what the node help is for), but rather to solve a specific NGS-related problem.

Last but not least, there is no guarantee or promise whatsoever associated with any of the information here. We are very happy to discuss anything described on these pages and also welcome contributions from other KNIME NGS users/developers.

Kind regards,

Bernd (baj)

 

About the Nodes

Name of the node Description
IO
BEDGraphWriter Writes out BED files.
Bio Sequence Reader Reads in sequence information (RNA, DNA, Protein) from FASTA, GENBANK, UNIPROT, EMBL, INSDseq file formats. No annotation is read.
FQCReader Reads in a file with multiple tables. Works together with FQCRow2Table, Table2FQCrow, and FQCWriter.
FQCWriter Writes multiple tables to a file. Works together with FQCRow2Table, Table2FQCrow, and FQCReader.
FastQReader Reads a FastQ file into a table. One FastQ entry (i.e. 4 lines) is translated into one row. This node uses BioJava.
FastQWriter Writes FastQ data out to a file. This node uses BioJava.
GenbankAnnotReader Reads in just the annotation information from a genbank file. See example workflow below on how to convert genbank to GFF.
ROIReader Reads files in the regions of interest (ROI) format. See the example workflow and node annotation for further information.
SAMReader Reads SAM or BAM files.
ROI
PositionStr2Position (to be deprecated) Part of the ROI concept.
RegionOverlap Identifies regions that overlap. This node is usually used within a sub-workflow that divides the data set per chromosome. The first input is retained.
TOOLS
Bash Executes commands in bash or cmd.exe (see inline documentation).
CmdwInput Similar to the Bash node, except that it takes the input table and executes strings within that table.
CollectionLinePlot Way of showing SVG graphs in a table view. Uses numerical collections.
CountSorted Counts occurrences within a sorted column. It is faster than the ValueCounter and useful for counting reads from a FASTQ file as they are already sorted. It also uses a minimal amount of memory.
FQCRow2Table Each line coming from the FQCReader represents a table. This node converts such a table into a KNIME table. Works together with FQCWriter, Table2FQCrow, and FQCReader.
GetSequenceName Gets the name of a sequence object as a string.
IGVview Enables a link to IGV through the table view.
JoinSorted Creates a full outer join of two sorted tables.
NGSConcat Concatenates two tables with identical table specs.
Table2FQCrow Converts a whole table into a FQCrow to be written out using FQCWriter. Works together with FQCWriter, FQCRow2Table, and FQCReader.
TableSpecs Retrieves simple stats for a table and its columns. Included are column type, index, lower and upper bound (table 1) and the number of rows and columns (table 2). This is very similar to the KNIME nodes "Extract Table Dimension" and "Extract Table Spec".
Wait Does nothing other than synchronising executions. This can also be done using the Variable Ports of existing nodes.
DEPRECATED
GetRegions (deprecated) The concept of ROIs is now implemented in seqan (http://www.seqan.de/projects/ngs-roi/)
Seq2PosIncidents (deprecated) Part of the ROI concept.
OneString (deprecated) Superseded by KNIME node (TableCreator)
PileupCounts (deprecated) Part of the ROI concept.
GroupByLoopStart (deprecated) Superseded by KNIME version...
AdapterRemoval (deprecated) the algorithm is now implemented in seqan. (seqan.de, https://projets.pasteur.fr/projects/pf2workflows/repository/show/jagla/apps/clean_ngs)
AdapterRemovalAdv (deprecated) See above...
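
The CountSorted description above relies on the input being sorted: equal values sit next to each other, so a single pass with one running counter is enough and nothing but the current run has to be kept in memory. The following is a minimal Python sketch of that idea only. It is not the node's actual (Java) implementation, and the function name `count_sorted` is hypothetical.

```python
from itertools import groupby
from typing import Iterable, Iterator, Tuple


def count_sorted(values: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Yield (value, count) for each run of equal adjacent values.

    Assumes the input is already sorted (or at least grouped), so only
    the current run is held in memory at any time.
    """
    for value, run in groupby(values):
        yield value, sum(1 for _ in run)


if __name__ == "__main__":
    # Hypothetical, already sorted read sequences.
    reads = ["ACGT", "ACGT", "ACGT", "GGCA", "TTAG", "TTAG"]
    for sequence, count in count_sorted(reads):
        print(sequence, count)
```

If the input is not sorted, the same value is reported once per run, which is why a general-purpose value counter has to keep a table of all distinct values instead.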

Workflows

Name of the workflow Description
FastQ-stats Descriptive statistics of Illumina results in FASTQ format (usually before mapping).
Genbank-GFF conversion Example workflow showing how to convert genbank files into GFF format.
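
To give a rough idea of what a FastQ statistics workflow starts from, the sketch below mimics the "one FASTQ entry (4 lines) becomes one row" convention of the FastQReader in plain Python and derives one simple per-read statistic. It is an illustration only: it does not use BioJava or KNIME, it assumes well-formed four-line records with Phred+33 quality encoding, and the names `FastqRow`, `read_fastq` and `mean_quality` are hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRow(NamedTuple):
    name: str
    sequence: str
    quality: str


def read_fastq(lines: Iterable[str]) -> Iterator[FastqRow]:
    """Turn each group of four FASTQ lines into one row."""
    it = (line.rstrip("\r\n") for line in lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between records
        sequence = next(it, None)
        plus = next(it, None)
        quality = next(it, None)
        # quality is None whenever the record is truncated
        if quality is None or not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        yield FastqRow(header[1:], sequence, quality)


def mean_quality(row: FastqRow, offset: int = 33) -> float:
    """Mean Phred score of one read (assumes Phred+33 encoding)."""
    if not row.quality:
        return 0.0
    return sum(ord(c) - offset for c in row.quality) / len(row.quality)


if __name__ == "__main__":
    sample = ["@r1\n", "ACGT\n", "+\n", "IIII\n"]
    for row in read_fastq(sample):
        print(row.name, len(row.sequence), mean_quality(row))
```

Passing an open file handle instead of the `sample` list works the same way, since the parser only iterates over lines.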

Workflows showing the use of the nodes

Name of the workflow Description
FASTQReader Simply one node with data from NCBI/SRA (SRR001356, Illumina sequencing of Mouse brain transcript fragment library)
FASTQWriter Simple workflow that reads in a FastQ file, then reduces the sequence and quality string to the first position and writes out the result.
Count sorted Simple workflow that reads in a FastQ file, sorts the data by sequence, applies both the Value Counter and the CountSorted nodes, and then sorts by the counts.
GetRegions Simple workflow that uses SAMReader, Seq2PosIncidents, CountSorted, PositionStr2Position, and GetRegions.
RegionOverlapp Intersects annotation from the UCSC database with regions of interest.
Bash example Execute something (ls) on the command line
CmdwInput example Execute something (ls) on the command line
Roi workflow Workflow showing some of the features of the ROI (region of interest) concept in the context of next generation sequencing projects.
Advanced ROI workflow An advanced workflow that shows how to work with multiple miRNA experiments and how to display multiple samples.
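
The region overlap workflow listed above intersects annotation with regions of interest, and the RegionOverlap node is usually run per chromosome. As a rough sketch of that idea outside KNIME, the Python below groups the second input by chromosome and reports every overlapping pair. Several things are assumptions made for illustration and are not taken from the node: the `Region` type, the function names, the inclusive start/end coordinates, and the pair-wise output format.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Region(NamedTuple):
    chrom: str
    start: int  # inclusive (assumption)
    end: int    # inclusive (assumption)
    name: str


def overlaps(a: Region, b: Region) -> bool:
    """True if both regions are on the same chromosome and share a position."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def region_overlap(
    first: Iterable[Region], second: Iterable[Region]
) -> List[Tuple[Region, Region]]:
    """Return (first, second) pairs that overlap, in the order of `first`."""
    # Split the second input per chromosome so each region of the first
    # input is only compared against candidates on its own chromosome.
    by_chrom: Dict[str, List[Region]] = defaultdict(list)
    for region in second:
        by_chrom[region.chrom].append(region)

    pairs = []
    for a in first:
        for b in by_chrom.get(a.chrom, []):
            if overlaps(a, b):
                pairs.append((a, b))
    return pairs


if __name__ == "__main__":
    # Hypothetical regions of interest and annotation entries.
    rois = [Region("chr1", 100, 200, "roi1"), Region("chr2", 50, 60, "roi2")]
    genes = [Region("chr1", 150, 300, "geneA"), Region("chr2", 10, 49, "geneC")]
    for roi, gene in region_overlap(rois, genes):
        print(roi.name, "overlaps", gene.name)
```

The nested loop is quadratic per chromosome, which is fine for a sketch. For large annotation sets, sorting both inputs by start position and sweeping through them once would be the usual refinement.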

Source Code

The source code can be accessed at https://anonymous:knime@community.knime.org/svn/nodes4knime/trunk/org.pasteur.

License

The NGS nodes are released under GPLv2.