Fastq stats

Descriptive statistics of Illumina results in fastq format.

(usually executed before mapping)

This workflow creates statistics on the quality string and sequence string in a FASTQ file:
  1. Sum of quality values per position;
  2. Mean quality value per position;
  3. Quality median per sequence position (not being activated as it takes some time to calculate);
  4. Detailed matrix with quality values and how often this quality value occurs per position;
  5. Distribution of sequence lengths (sequence length - count)
  6. count of all sequences
  7. unique sequence counts (in log scale): How often has a specific sequence been sequenced

as a single ZIP file
Figure 1 Shows the workflow in a KNIME environment. Numbers are added using GIMP.

1. FastQReader: Home-brewed node to read in FASTQ files. Its main parameter is the file name to be read in. "-option=52,FQR_fpname,${inputFile},String" modifies this paramter in the script that calls this workflow (FastQ_stats.sh). The file type is usually "Solexa" for the newer versions for the Illumina pipelines. Other options are "Illumina" (old Illumina format) and "Sanger" (I guess 454).

2. OneString: Home-brewed node to introduce a single value (String, Integer, Double) into a KNIME workflow. Here we specify the path and prefix of the output files (example: /pasteur/solexa2/solexa_travail/PF2/100618_HWI-EAS285/Gerald/L1_NI/s_1_no_adapter.)

3. MetaNode: Sub-workflow to set the specific filenames for the output files into variables. Those variables are being used in the CSV Writer nodes. In addition we also calculate the maximal length of the sequences and put this into a variable. This is needed to define a maximal vector length for the quality array. Based on the output file prefix this is a one-stop place to set all variable that are being used in the workflow.

4. Java Snippet: This Java snippet translates the quality string into an Array of integer values. The statistics node can then automatically generate statistics of the quality scores for each position in the sequence.

 int max =$${Dlength}$$.intValue();
 byte[] by;
 by = $QualityString$.getBytes();
 max = Math.min(max,$QualityString$.length());
 Integer[] in;
 in =new Integer[max];
 for(int i=0;i<max;i++){
   in[i] = new Integer(by[i]);
 return in;


5. Column Filter: In this part of the workflow we are only interested in the quality values therefore we remove anything else from the table.

6. Split Collection Column: This node creates one column for each quality position, actually for each element in the collection (array) that we previously created (see 5.)

7. Statistics: We are now able to calculate some stats for each sequence position. We get here more information than we actually want, but since it is already implemented why not use it...

8. Transpose: Now the only problem with the output of the statics node is that all the stats are per column. By transposing the table we get a fixed number of columns and as many rows as we have sequence positions (max. lengths of the read sequences).

9. RowID: (We probably don't need this column as the row id is also accessible as a variable) Here we just copy the rowID that is unique by definition (within KNIME) into a separate column called "id".

10. Java Snippet: The "Split Collection Column" node names the columns (that are rows by now) something like "Split Value #" where # represents the number of the array position. Since this is not very user friendly to read we just remove the "Split Value" string and are left with the sequence position. We store this in a separate column. Potentially we could replace the ID column to save space???

return $id$.substring(12);


11. Column Resorter: Here we just reorder the columns so that they are more intuitive to read. What this basically does it to move the position column from node (10) to the front.

12. 13. 14. Column Filter: select columns for output

15. Value Counter: The sequence length has been calculated in the Meta workflow. We are now only counting how often a specific length occurs.

16. Java Snippet: This is a workaround because I don't know how to access specs from a table. It est we are interested in the number of reads (entries) or rows. Therefore we are creating a column with only value, in this case "1" and then count how often this value occurs. Not very nice, but it works...

17. Value Counter: here we count the rows or better the "1" that were generated in node 16

(This value actually also is a result from node 19....

18. Value Counter: here we count the different sequences. This might be a limiting step if we run HiSeq data as there is an internal hash being used that might overflow. One workaround would be sort the table and then use the countSorted node. But as we haven't run into any problems for now we leave it as it...

19. Statistics: This will give us the distribution of the different sequences and how often they occur...

20. Java Snippet : To reduce the length of the output table we calculate the log of the count.

return Math.floor(Math.log($count$));

21. GroupBy: now we group by the absolute value of the log value

22. Java Snippet : and we add the range that corresponds to the values.

Double ex1 = Math.ceil(Math.exp($ln_count$));
Double ex2 = Math.floor(Math.exp($ln_count$+1));
DecimalFormat df1 = new DecimalFormat("####");
return df1.format(ex1) + "-" + df1.format(ex2);

23. CSV Writer: Write to output files. (overwrite is selected)