Workflow showing how to convert GenBank to GFF
GenBank files contain annotation information for sequence data and can also contain the sequences themselves. See sample for further information on the file format. One of the main features of the GenBank format is that it is supposed to be human-readable as well as automatically parsable. Given the complexity of biology and of the GenBank format, and given that humans are error-prone, it is no surprise that many GenBank files have inconsistent annotations that make it basically impossible to create a generalized parser. Thus, only a semi-automated process is a realistic goal. Because GenBank files can be very long, it is difficult to come up with a parser even for a specific purpose: looking for special cases is tedious and time-consuming, as scrolling through intermediate results is not easy with common tools such as the Unix command line or Perl. This is where KNIME comes into play. KNIME enables the user to quickly sort, scroll through, and verify intermediate results and thus get more reliable results when converting GenBank to GFF format.
The workflow described here can be a starting point for such a conversion.
Description of the workflow
It is suggested to understand each conversion step by verifying the intermediate results as well as the final results. Important steps in the workflow are highlighted with yellow annotations; red annotations mark things that rarely occur and are not covered in this workflow.
- We read in the Pestis genome (just because I was recently working with that one).
- We then concentrate on the features which can be easily converted into GFF.
- The location feature describes the full contig. We are not interested in that. Changing from exclude to include allows you to quickly verify what you are throwing away...
- Different features are separated by "/", thus we split the value string.
- We copy the row ID to a real column (something that we need later to combine the results again).
- In the meta workflow "reformat" the magic takes place. Let's have a look at its output and the rest of the workflow first... The output of that meta workflow is a table with a column for each feature type plus the region information. When viewing the table we can sort and thus check for missing values or other abnormalities.
- We just need to reformat the structure of that table to arrive at a GFF table. The simple Java snippet creates column 9 by concatenating what we would like to include. Here we can quickly copy and paste the columns we would like to have.
- Then we remove any column that is not needed.
- We also ensure the correct order of columns.
- We set the 2nd column to "genbankKNIMEconverter" because we want to let everyone know who converted the features.
- Finally we sort by contig name and start position. The table can now be exported using the tableWriter node.
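The last steps of the list above (building column 9, fixing the column order, setting column 2, and sorting) can be sketched outside KNIME as follows. This is only a minimal Python sketch under assumptions, not the workflow itself: the dictionary keys (contig, type, start, end, strand, locus_tag, product), the function names, and the two sample rows are hypothetical and made up for illustration.

```python
# Value used for GFF column 2, as described in the text.
SOURCE = "genbankKNIMEconverter"


def make_attributes(row, keys):
    """Build GFF column 9 by concatenating the chosen tag=value pairs."""
    return ";".join(f"{k}={row[k]}" for k in keys if row.get(k))


def to_gff_lines(rows, attribute_keys):
    """Sort by contig and numeric start, then emit 9 tab-separated columns."""
    ordered = sorted(rows, key=lambda r: (r["contig"], r["start"]))
    lines = []
    for r in ordered:
        columns = [
            r["contig"],
            SOURCE,
            r["type"],
            str(r["start"]),
            str(r["end"]),
            ".",  # score: no information available
            r["strand"],
            ".",  # phase: no information available
            make_attributes(r, attribute_keys),
        ]
        lines.append("\t".join(columns))
    return lines


# Hypothetical rows, standing in for the table after the "reformat" step.
rows = [
    {"contig": "contigA", "type": "gene", "start": 900, "end": 1200,
     "strand": "-", "locus_tag": "X2"},
    {"contig": "contigA", "type": "CDS", "start": 100, "end": 400,
     "strand": "+", "locus_tag": "X1", "product": "example protein"},
]

for line in to_gff_lines(rows, ["locus_tag", "product"]):
    print(line)
```

Changing the list passed as attribute_keys plays the same role as copying and pasting column names in the Java snippet: it decides what ends up in column 9.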
So far it looked relatively straightforward. Now let's have a closer look at the magic...
I split that part up into two...
We have to loop through all the value columns, parse and combine the results.
On the top we just keep the row ID to link to the individual features. The second row filters out just the value columns that have been created by the split node. We then loop through them and rename the individual columns to have the same name. This way the loop end node appends the results. Tag-value pairs are separated by "=" in GenBank. Thus we can create one column with the tags and one with the values. We are going to use this information in the Pivoting step (part two).
In the bottom row we collect the additional information (anything but the value columns).
But before using the Pivot node we clean up a bit by removing unnecessary columns and renaming some. After the pivoting we rename the columns to make them look nicer.
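The split on "/" and "=" followed by the pivot can be illustrated in plain Python. This is a rough sketch under assumptions, not what the KNIME nodes do internally: the function names, the row IDs, and the sample strings are hypothetical. It also assumes that no value contains a "/" and that each tag occurs at most once per feature; real GenBank files can violate both, which is exactly the kind of case worth checking in the intermediate tables.

```python
def split_qualifiers(value):
    """Split a raw feature string on "/" and each piece on the first "="."""
    pairs = []
    for piece in value.split("/"):
        piece = piece.strip()
        if not piece:
            continue  # skip the empty piece before the leading "/"
        tag, _sep, val = piece.partition("=")
        # Tags without "=" (simple flags) get an empty value.
        pairs.append((tag.strip(), val.strip().strip('"')))
    return pairs


def pivot(long_rows):
    """Turn (row_id, tag, value) triples into one dict of columns per row_id."""
    table = {}
    for row_id, tag, val in long_rows:
        # A repeated tag would overwrite the earlier value here.
        table.setdefault(row_id, {})[tag] = val
    return table


# Hypothetical input: row ID -> raw value string.
raw = {
    "Row0": '/gene="abc"/locus_tag="X1"',
    "Row1": '/locus_tag="X2"/pseudo',
}

# Long format: one (row_id, tag, value) triple per tag-value pair.
long_rows = [
    (row_id, tag, val)
    for row_id, value in raw.items()
    for tag, val in split_qualifiers(value)
]

# Wide format: one column per tag, keyed by the row ID.
wide = pivot(long_rows)
print(wide)
```

Keeping the row ID in every triple is what makes it possible to join the pivoted columns back to the additional information later.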
We then combine the features columns with the original additional information (bottom) and clean up again: Rename rows, set the strand column to either "+" or "-". (The case when this is not to be set is not handled). After we have saved the information on the strand we can clean up the regions and remove the "comp()" string from the region information.
The row splitter retains all rows that we can handle. The bottom output collects the cases where the beginning or end is not clearly defined.
We can then create the beginning/end columns by splitting the region cell and converting the parts to numbers. This conversion is needed so that sorting works properly later. We also append the columns dot and dot2 for which we don't have the information...
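The strand, row-splitting, and start/end steps can be sketched in Python as below. This is a minimal sketch with assumptions: it assumes the reverse strand is written as complement(start..end), with the wrapper name kept as a parameter in case the table uses a different string, and it treats anything that is not a plain start..end range (for example "<1..200") as a row for manual review. The function and variable names are hypothetical.

```python
import re

# A plain region: digits, two dots, digits, nothing else.
SIMPLE_REGION = re.compile(r"^(\d+)\.\.(\d+)$")


def parse_region(region, wrapper="complement"):
    """Return (start, end, strand), or None if the region cannot be handled."""
    strand = "+"
    prefix = wrapper + "("
    if region.startswith(prefix) and region.endswith(")"):
        strand = "-"
        region = region[len(prefix):-1]  # drop the wrapper, keep start..end
    match = SIMPLE_REGION.match(region)
    if match is None:
        return None  # unclear beginning or end: goes to the "bottom output"
    # Convert to integers so that later sorting is numeric, not textual.
    return int(match.group(1)), int(match.group(2)), strand


handled, unhandled = [], []
for region in ["100..400", "complement(900..1200)", "<1..200"]:
    parsed = parse_region(region)
    (handled if parsed else unhandled).append((region, parsed))

print(handled)
print(unhandled)

# Why the numeric conversion matters for sorting:
print(sorted(["900", "1200", "100"]))  # sorted as text
print(sorted([900, 1200, 100]))        # sorted as numbers
```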
Please send any comments, suggestions, blames etc to bernd (dot) jagla (at) pasteur (dot) fr.