Code
SeekDeep qluster
The main purpose of qluster is to take de-multiplex data/single sample data and to create final haplotypes with relative abundances by clustering reads by collapsing on specific errors. The program requires at most two options: input sequence files either in fastq(--fastq
), fasta/qual(--fasta
,--qual
), or fastq format (--fasta
) and a parameters files (--par
) that determine the extent of the clustering of the reads. Several default parameter files are provided with the source code of SeekDeep.
Just typing the name of the program will give a help message on running the program
Did not find a recognizable read in option
Options include: --fasta,--fastagz,--fastq,--fastqgz
Command line arguments
OptionNum Flag Option
Need to have --par see SeekDeep qluster --help for more details
...
...
Also calling -help
will do the same
Also all flags in SeekDeep are case insensitive and so all the following would have the same results
At the core of qluster is iteratively clustering collections of raw reads, which from here on will be called clusters, based on user supplied error profiles. The errors that are consider in this profile are 1base indels, 2base indels, >2base indels, low quality errors, high quality errors, and low kmer frequency errors. See future method for more extensive explanation of these errors than the short description that follows now. The low and high quality errors are based on per base quality scores of mismatching bases along with flanking quality scores. Low kmer errors are based on the occurrence of kmers centered on the mismatch. On each iteration a different error profile is given to allow for different amount of error to cluster reads together.
Input reads are read in and initial clusters are created based on a simple sequence comparison creating clusters of unique sequences. Clusters are then sorted by size and then the iterative clustering process is perform. Each iteration starts by taking the clusters at the very bottom of the list and comparing it to the most abundant clusters. Comparisons are done by doing a global alignment and counting and the categorizing the errors into the categories mentioned above. This error profile is then compared to the allowable error profile for that iteration. If the error profile passes the current profile it is then checked against the next clusters in line for any other possible matches and is then added to cluster it matches best. At the end of each iteration a consensus sequence is calculated for each clusters based on majority rules basis, clusters are then resorted by their new read numbers and the new iteration is started.
#Format of parameters files The parameters file will determine what errors to cluster on and how many iterations to do. The set up of the file is that every line is another iteration. Each line should contain 8 numbers. The order of what they mean is as follows
Each number is separated by a colon and the first row can be a title line if it starts with a ‘s’ below is an example of a parameters file
stopCheck:smallCutoff:1baseIndel:2baseIndel:>2baseIndel:HQMismatches:LQMismatches:LKMismatches
100:3:1:0:0:0:0:0
100:3:2:0:0:0:0:1
100:3:3:0:0:0:1:1
100:3:4:0:0:0:2:1
100:0:1:0:0:0:0:0
100:0:2:0:0:0:0:1
100:0:3:0:0:0:1:1
100:0:4:0:0:0:2:1
This file indicates to do eight iterations. On the first iteration only the top 100 clusters would be checked. Clusters of size 3 will be compared against (this means that clusters of size 3 or less will still be compared to larger clusters but other clusters will not be compared to clusters of 3 or less which means they can’t be the seeds for new clusters). And clusters will only be collapsed together if they only differ by 1 one base indel. On the next iteration the number of top clusters to compare against and the cluster size that can form seeds will be the same but now clusters that differ by 2 one base indels and by 1 low kmer frequency errors. On the fourth iteration clusters of any size can now form seeds for new clusters and we are back to allowing only 1 one base indels.
Though this is not enforced in anyway this is common practice in using qluster, first allow only large clusters to form seeds slowly allowing more errors between clusters so that first solid clusters can be form. Once the number of errors willing to be allowed is reached go back to the a small amount of error and allow small clusters to form seeds so not low frequency haplotypes aren’t found.
Also alternative to specifying a specific number of top clusters to check a percentage can be given instead. See example below.
stopCheck:smallCutoff:1baseIndel:2baseIndel:>2baseIndel:HQMismatches:LQMismatches:LKMismatches
10%:3:1:0:0:0:0:0
100:3:2:0:0:0:0:1
100:3:3:0:0:0:1:1
100:3:4:0:0:0:2:1
10%:0:1:0:0:0:0:0
100:0:2:0:0:0:0:1
100:0:3:0:0:0:1:1
100:0:4:0:0:0:2:1
This means that in the first iteration the top 10% of reads will be checked, so if there were 2000 clusters, the top 200 would be checked. And the these formats can be mixed so the here the next iteration only the top 100 clusters would be checked no matter how many clusters there were.
qluster can also offer the traditional OTU percent identity clusters that is often employed by programs to cluster targeted amplicon clustering. The parameters file is similar to the previous example but there is only one column for the errors that is now percent identity.
stopCheck:smallCutoff:id
100:3:.97
100:3:.97
100:0:.97
100:0:.97
This means the program will perform four iteration while allowing clusters to collapse into each other if they differ by less than 3%. The first two columns mean the same as explained above.
The first parameter/method describe was created out of a need to be more precise than the non-specific OTU percent identity clustering method. In our work with Plasmodium (Malaria) we had a need to study haplotypes that differed by only one base pair but we still had to contend with sequencing and PCR errors that confound a typical targeted amplicon sequencing approach. Thus this method of collapsing only on specific types of errors was born, it allows us to perform what is essentially a percent identity clustering but be very specific where that percent identity is coming from. This allows us to collapse only small indels (which for our work in protein coding sequence are very unlikely as they would cause a frame shift) and that plague sequencing technologies like 454 and Ion Torrent and to collapse only on base mismatches that come from bases with low quality scores (something that is provided by all technologies) and on low kmer frequency mismatches which are often PCR error while preventing high quality errors from collapsing which has allowed us to find haplotypes that only differ by one base mismatch.
An output directory will be created for all output files of qluster. The default name is the name of the input file plus the word qluster plus the current date and time when qluster was run. The name can be changed using the -dout
flag.
As stated above the only thing qluster needs to run is an input file and a parameter file
Input to qluster can be fastq, fastq/qual, or just fasta (though then all mismatches will be high quality mismatches and advantage of quality scores is lost)
With the set up of the parameters file being completely at the whim of the user and what to collapse on can be somewhat arbitrary (though just picking an OTU cut off is also arbitrary) it is somewhat challenging picking the “correct” parameters. We have analyzed several control known mixture datasets and through testing out several parameters we have found the following parameters file to work out best (all clusters above .1% were expected clusters and all expected clusters were found) for 454 data, this file is provided with the SeekDeep source code in a folder called SeekDeepParametersFile and is called 454_it_lkmer2
. You also use these parameters by not supplying the --par
flag and using the --ionTorrent
flag which will set this automatically
stopCheck:smallCutoff:1baseIndel:2baseIndel:>2baseIndel:HQMismatches:LQMismatches:LKMismatches
100:3:1:0:0:0:0:1
100:3:2:0:0:0:0:1
100:3:3:0:0:0:1:1
100:3:4:0:0:0:2:1
100:3:5:0:0:0:3:1
100:3:6:0:0:0:4:1
100:3:7:1:0:0:5:1
100:3:7:2:0:0:5:1
100:3:7:3:0:0:5:2
100:3:7:4:0:0:5:2
100:3:7:5:0:0:5:2
100:0:1:0:0:0:0:1
100:0:2:0:0:0:0:1
100:0:3:0:0:0:1:1
100:0:4:0:0:0:2:1
100:0:5:0:0:0:3:1
100:0:6:0:0:0:4:1
100:0:7:1:0:0:5:1
100:0:7:2:0:0:5:1
100:0:7:3:0:0:5:2
100:0:7:4:0:0:5:2
100:0:7:5:0:0:5:2
And the following file, 454_it_lkmer2_largeHpr
, to work well with Ion Torrent data.
stopCheck:smallCutoff:1baseIndel:2baseIndel:>2baseIndel:HQMismatches:LQMismatches:LKMismatches
100:3:1:0:0:0:0:1
100:3:2:0:0:0:0:1
100:3:3:0:0:0:1:1
100:3:4:0:.99:0:2:1
100:3:5:0:.99:0:3:1
100:3:6:0:.99:0:4:1
100:3:7:1:.99:0:5:1
100:3:7:2:.99:0:5:1
100:3:7:3:.99:0:5:1
100:3:7:4:.99:0:5:2
100:3:7:5:.99:0:5:2
100:0:1:0:0:0:0:1
100:0:2:0:0:0:0:1
100:0:3:0:0:0:1:1
100:0:4:0:.99:0:2:1
100:0:5:0:.99:0:3:1
100:0:6:0:.99:0:4:1
100:0:7:1:.99:0:5:1
100:0:7:2:.99:0:5:1
100:0:7:3:.99:0:5:1
100:0:7:4:.99:0:5:2
100:0:7:5:.99:0:5:2
Also if you just use the --ionTorrent
flag it will automatically use these parameters
Also IonTorrent comes with a slew of problems but among them is how the quality scores are calculated that causes some trouble for qluster so there are 3 additional flags to turn on to get the best performance out of qluster for IonTorrent Data
SeekDeep qluster --fastq example.fastq --par path/to/SeekDeepCode/SeekDeepParametersFiles/454_it_lkmer2 --qualTrim 3 --adjustHomopolyerRuns --useCompPerCutOff
#or to turn on all of them just use -ionTorrent
SeekDeep qluster --fastq example.fastq --par path/to/SeekDeepCode/SeekDeepParametersFiles/454_it_lkmer2 --ionTorrent
10^(-qual/10)
, this comes out to be 80% and 63% chance of error)SeekDeep by default weighs indels in homopolymer runs differently than other indels (this is because the majority of data that SeekDeep has been used on has been 454 and Ion Torrent data, FYI. this behavior can be turned off by using the --noHomopolymerWeighting
flag). See method paper for detail description of how this weighting is done but essentially by setting the large base indel to less than 1 this allows for clusters that differ by indels >2 bases but are comprised completely of just 1 base inside of a homopolymer run to collapse.
As above for the 454 and Ion Torrent dataset we have found by experiment a parameters that works best for Illumina data and put it in SeekDeepParametersFiles folder as well called illumina_lkmer2
and also as mentioned above SeekDeep by default weighs indels found in homopolymer differently and since this isn’t a problem in Illumina data it should be turned off, or you can do this with the --illumina
flag
stopCheck:smallCutoff:1baseIndel:2baseIndel:>2baseIndel:HQMismatches:LQMismatches:LKMismatches
100:3:1:0:0:0:0:1
100:3:2:0:0:0:0:1
100:3:2:0:0:0:1:1
100:3:2:0:0:0:2:1
100:3:2:0:0:0:3:1
100:3:2:0:0:0:4:1
100:3:2:0:0:0:5:1
100:3:2:0:0:0:6:2
100:3:2:0:0:0:7:2
100:3:2:0:0:0:8:2
100:3:2:0:0:0:8:2
100:0:1:0:0:0:0:1
100:0:2:0:0:0:0:1
100:0:2:0:0:0:1:1
100:0:2:0:0:0:2:1
100:0:2:0:0:0:3:1
100:0:2:0:0:0:4:1
100:0:2:0:0:0:5:1
100:0:2:0:0:0:6:2
100:0:2:0:0:0:7:2
100:0:2:0:0:0:8:2
100:0:2:0:0:0:8:2
If you just use the --illumina
flag this will just use these parameters
You can supply a parameters files with what otu to clusters at, allow to first cluster more fine (.99%) and then in latter iterations allow .97% or you can use the flag --otu
to cluster at a specific otu for several iterations.
You can allow high quality differences in the supplied parameters file or in conjunction with the --454
, --ionTorrent
, or --illumina
flags
#allow low quality differences and 1 high quality difference
SeekDeep qluster --fastq example.fastq --illumina --hq 1
#allow low quality differences and indel differences common in 454 or ionTorrent and 1 high quality difference
#454
SeekDeep qluster --fastq example.fastq --454 --hq 1
#ion torrent
SeekDeep qluster --fastq example.fastq --ionTorrent --hq 1
To determine if a mismatch is a low quality mismatch the qualities of the mismatching bases are examined along with the flanking base qualities. The quality of the mismatching bases are compared to what is called a primary quality (default 20) threshold and the qualities of the surrounding bases (default number of flanking bases is 2) are compared to what is called a secondary quality(default 15). To change the thresholds the flag -qualThres
is used by giving two numbers separated by a comma (eg to do a primary qual of 20 and a secondary of 15 use 20,15). See methods paper for full details on this reasoning.
To change the number of flanking bases used, use the flag -qualThesWindow
The global alignments done by qluster are actually semi-global alignments where gap scoring can be applied differently for gaps appearing at the ends of sequences. Since the input data for qluster is targeted amplicon sequence the default alignment parameters are 5 for gap opening at the front and in the middle of the sequence with a penalty of 1 for extending gaps and zero gap penalty for putting gaps at the end of the sequence since a lot of times this type of data has fragmented ends but intact fronts. This can be changed in several ways and there are four flags that can be used, --gapRight
, --gap
(gaps in the middle), --gapLeft
, and --gapAll
and the input is two integers separated by a comma, the first number is the opening penalty and the second is gap extension penalty. (eg 5,1 5 open and 1 extend)
#set gaps at the end of the sequence to 5 open 1 extend
SeekDeep qluster --fastq example.fastq --par par.txt --gapRight 5,1
#make gaps at the beginning and end of sequences have no penalty
SeekDeep qluster --fastq example.fastq --par par.txt --gapRight 0,0 --gapLeft 0,0
#make gaps everywhere 5 open, 1 extend
SeekDeep qluster --fastq example.fastq --par par.txt --gapAll 5,1
To change the default directory name use the -dout
flag. SeekDeep will never overwrite a directory if it already exists and will fail and quit if it tries to create a directory that exists.
The dout option also understand the key work TODAY to mean to insert the current date and time there instead though this means a output directory name can never have TODAY all in caps in it
The default cut off for a mismatch to be considered low frequency is 1. To modify this number use the --runCutOff
flag, this can take either a specific number or a percentage
SeekDeep qluster --fastq example.fastq --par par.txt --runCutOff 5
#if the kmer with the mismatching base in the middle is found in only 5 reads count as low frequency
SeekDeep qluster --fastq example.fastq --par par.txt --runCutOff .2%
#if the kmer with the mismatching base in the middle is found in only .2% of reads count as low frequency
Also a back up frequency can be given when giving a percentage of, so if there 1000 input sequences and --runCutOff
was given .01%,1
the cut off would default to 1 since .1% of 1000 would be 0
The most expensive part of qluster is the alignments it has to do to compare the sequences, these alignments can be cached in a directory in a somewhat compressed way if qluster had to be run again, for example if you decided to change parameters to collapse one, the re-run would be much faster. Caching is turn on by using the --alnInfoDir
flag and giving it a directory to cache the alignments in
Alignments are dependent on gap scoring so if gap scoring changes new alignments have to be cached
SeekDeep qluster --fastq example.fastq --par par1.txt --alnInfoDir alnCache --gapAll 5,1
#the second time around would require caching of new alignments
SeekDeep qluster --fastq example.fastq --par par2.txt --alnInfoDir alnCache --gapAll 7,1
#but now either of the below commands would be really fast
SeekDeep qluster --fastq example.fastq --par par2.txt --alnInfoDir alnCache --gapAll 5,1
SeekDeep qluster --fastq example.fastq --par par1.txt --alnInfoDir alnCache --gapAll 7,1
qluster marks final clusters for any clusters that look suspiciously like chimeric sequence (see methods paper for details on how this is done in detail). To turn on this behavior off use the --noMarkChimeras
flag. To control what can be marked as chimeric use the --parFreqs
flag. This set the frequency cut off the parent sequences have to be for the cluster to be marked as chimeric, this defaults to 2 which would mean the parent sequences would have to be at least twice as much as the possible child chimera to be marked
The output of the final clustered haplotypes file (output.fastq) can also be directed to another directory, with is used in conjunction with SeekDeep processClusters
to organizing the input to that command. The directory can determined by using the flag --additionalOut
flag to give it a file where the first column is the associated MID name and the second is the directory to output to. See a full tutorial on SeekDeep pipeline for more details
By using the --leaveOutSinglets
flag the analysis will be done with leaving out all singlet reads. This can be useful for extreme read coverage where the number of singlets reads can be enormous but isn’t really needed.
By using the --fastClustering
flag pairwise comparisons that differ in their nucleotide composition will be skipped, the amount of difference is set by --nucCutOff
, can range from 0 to 1, defaults to 0.05 meaning a different between nucleotide differences over 0.05 will skip.
By default an iteration will run once, you can make it so an iteration will run until there is no more collapsing, this has the potential to significantly increase run time without much pay off. This behavior is turned on by using --converge
.