GetRegions and other NGS-specific nodes
This is a simple workflow that analyzes a SAM-formatted alignment file to determine regions of interest. This might be part of the analysis of an NGS experiment that studies short RNA molecules...
The workflow reads in a SAM file containing Illumina data from NCBI/SRA (SRR001356, Illumina sequencing of a mouse brain transcript fragment library), aligned to mm9 using bowtie.
We read in only the first 200,000 sequences. Then we filter out any sequence that did not align to the reference (mm9). We use the flag column to determine the strand.
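
The filtering and strand steps can be sketched outside KNIME as follows. This is a minimal Python sketch, not the workflow's actual node configuration: the function names and the `limit` parameter are hypothetical, and it assumes the standard SAM flag bits (0x4 for an unmapped read, 0x10 for a read on the reverse strand) and the standard tab-separated SAM column order.

```python
# SAM flag bits as defined in the SAM format specification.
FLAG_UNMAPPED = 0x4   # read did not align to the reference
FLAG_REVERSE = 0x10   # read aligned to the reverse strand


def is_aligned(flag: int) -> bool:
    """Return True if the unmapped bit is not set."""
    return (flag & FLAG_UNMAPPED) == 0


def strand(flag: int) -> str:
    """Return '-' for reverse-strand reads, '+' otherwise."""
    return "-" if flag & FLAG_REVERSE else "+"


def filter_aligned(sam_lines, limit=200_000):
    """Yield (name, chromosome, start, strand, sequence) for aligned reads.

    Only the first `limit` alignment records are considered (hypothetical
    parameter mirroring the 200,000-sequence cutoff); header lines that
    start with '@' are skipped and do not count toward the limit.
    """
    seen = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue
        if seen >= limit:
            break
        seen += 1
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if not is_aligned(flag):
            continue
        # Columns: QNAME, FLAG, RNAME, POS, ..., SEQ (index 9)
        yield fields[0], fields[2], int(fields[3]), strand(flag), fields[9]
```

Passing an open file handle, for example `filter_aligned(open("reads.sam"))` with a hypothetical file name, would stream the records without loading the whole file.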
Then we use the Seq2PosIncident node to translate a given read into several rows, one for each sequence position.
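
To illustrate the idea of expanding a read into one row per position, here is a small Python sketch. It is not the Seq2PosIncident implementation: the function name is hypothetical, and it assumes a read covers a contiguous stretch starting at its alignment position (gaps and indels are ignored) and that each position is encoded as the chromosome name and the position joined by "_".

```python
def read_to_positions(chromosome: str, start: int, length: int) -> list:
    """Return one position string per base covered by the read.

    Assumption: the read covers `length` consecutive reference positions
    beginning at `start`; CIGAR operations are not taken into account.
    """
    return [f"{chromosome}_{pos}" for pos in range(start, start + length)]
```

For a 4-base read aligned at position 100 of chr1, this yields the four keys chr1_100 through chr1_103.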
We sort by the sequence position string, which contains the chromosome name and the position delimited by "_".
We count how many reads were aligned to each position. This is equivalent to the pileup output from e.g. samtools.
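
The counting step amounts to a per-position tally. A minimal Python sketch, assuming reads are given as hypothetical (chromosome, start, length) tuples and cover contiguous positions:

```python
from collections import Counter


def coverage(reads) -> Counter:
    """Count how many reads cover each position.

    `reads` is an iterable of (chromosome, start, length) tuples.
    Keys of the result are "chromosome_position" strings.
    """
    counts = Counter()
    for chromosome, start, length in reads:
        for pos in range(start, start + length):
            counts[f"{chromosome}_{pos}"] += 1
    return counts
```

Two overlapping 3-base reads starting at chr1:100 and chr1:101 give a count of 2 at positions 101 and 102 and a count of 1 at positions 100 and 103.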
Since the rowId now holds the sequence position, we have to move it into a regular column.
We can now translate the string that holds the chromosome and the sequence position back into two columns using PostionStr2Position. This could also have been done using two simple Java snippets, but a single node is faster because it iterates over the table only once instead of twice.
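
Splitting the position string can be sketched in Python as shown here. The function name is hypothetical, and splitting on the last "_" is an assumption made so that chromosome names that themselves contain an underscore (such as chr1_random) stay intact.

```python
def split_position(key: str) -> tuple:
    """Split "chromosome_position" into (chromosome, position).

    Splits on the LAST underscore, so the chromosome name may itself
    contain underscores. The position is returned as an integer.
    """
    chromosome, pos = key.rsplit("_", 1)
    return chromosome, int(pos)
```

Having the position as an integer column also allows numeric sorting, whereas the combined string sorts lexicographically.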
We can now combine adjacent sequence positions into regions of interest (GetRegions).
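
Merging adjacent positions into regions can be sketched as a single pass over sorted data. This Python sketch is not the GetRegions implementation: the function name and the output layout are hypothetical, and it assumes the input is sorted by chromosome and then numerically by position, with each position occurring once.

```python
def get_regions(positions) -> list:
    """Merge runs of consecutive positions into regions.

    `positions` is an iterable of (chromosome, position) tuples, sorted by
    chromosome and then by position. Returns a list of
    (chromosome, start, end, length) tuples with inclusive coordinates.
    """
    regions = []
    current = None  # [chromosome, start, end] of the region being built
    for chromosome, pos in positions:
        if current and current[0] == chromosome and pos == current[2] + 1:
            current[2] = pos  # extend the open region by one position
        else:
            if current:
                regions.append((current[0], current[1], current[2],
                                current[2] - current[1] + 1))
            current = [chromosome, pos, pos]  # start a new region
    if current:
        regions.append((current[0], current[1], current[2],
                        current[2] - current[1] + 1))
    return regions
```

Sorting the result by the last tuple element in descending order puts the longest region first.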
We sort by the length of those regions, just to see if there is anything interesting, i.e. look at the longest region.
We then use a JavaSnippet to calculate the maximum coverage of a given region...
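
As a rough illustration of what a maximum-coverage calculation could look like, here is a Python sketch rather than the workflow's Java snippet. The function name and data layout are hypothetical: coverage is assumed to be a dictionary keyed by (chromosome, position), and a region is assumed to be a (chromosome, start, end) tuple with inclusive coordinates.

```python
def max_coverage(region, coverage) -> int:
    """Return the highest per-position read count within a region.

    `region` is (chromosome, start, end) with inclusive coordinates;
    `coverage` maps (chromosome, position) to a read count. Positions
    missing from `coverage` are treated as having zero coverage.
    """
    chromosome, start, end = region
    return max(coverage.get((chromosome, pos), 0)
               for pos in range(start, end + 1))
```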