qluster’s Purpose

The main purpose of qluster is to take de-multiplexed data/single-sample data and create final haplotypes with relative abundances by clustering reads, collapsing on specific errors. The program requires only two options: input sequence files in either fastq (--fastq), fasta/qual (--fasta, --qual), or fasta format (--fasta), and a parameters file (--par) that determines the extent of the clustering of the reads. Several default parameter files are provided with the source code of SeekDeep.

Getting usage command line

Just typing the name of the program will give a help message on running the program

Code

SeekDeep qluster

Did not find a recognizable read in option
Options include: --fasta,--fastagz,--fastq,--fastqgz
Command line arguments
OptionNum   Flag    Option  
Need to have --par see SeekDeep qluster --help for more details
...
...

Calling --help will do the same

Code

SeekDeep qluster --help

Also all flags in SeekDeep are case insensitive and so all the following would have the same results

Code

SeekDeep qluster --help 
SeekDeep qluster --HELP
SeekDeep qluster --HeLP
SeekDeep qluster --HeLp
SeekDeep QLUSTER --HeLP
SeekDeep qluster --HeLp
SeekDeep qlUstEr --HeLp

qluster Method Overview

Core Idea

At the core of qluster is the iterative clustering of collections of raw reads, which from here on will be called clusters, based on user-supplied error profiles. The errors that are considered in this profile are 1-base indels, 2-base indels, >2-base indels, low quality errors, high quality errors, and low kmer frequency errors. See the methods paper for a more extensive explanation of these errors than the short description that follows. The low and high quality errors are based on the per-base quality scores of mismatching bases along with flanking quality scores. Low kmer errors are based on the occurrence of kmers centered on the mismatch. On each iteration a different error profile is given to allow a different amount of error when clustering reads together.

Method flow

Input reads are read in and initial clusters are created based on a simple sequence comparison, creating clusters of unique sequences. Clusters are then sorted by size and the iterative clustering process is performed. Each iteration starts by taking the cluster at the very bottom of the list and comparing it to the most abundant clusters. Comparisons are done by doing a global alignment and then counting and categorizing the errors into the categories mentioned above. This error profile is then compared to the allowable error profile for that iteration. If the error profile passes the current profile, the cluster is then checked against the next clusters in line for any other possible matches and is added to the cluster it matches best. At the end of each iteration a consensus sequence is calculated for each cluster on a majority-rules basis, clusters are re-sorted by their new read numbers, and the next iteration is started.
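
The first step of this flow (collapsing identical reads into clusters and sorting them by size) can be sketched in a few lines of Python. This is only an illustration of the idea, not SeekDeep's implementation; the function name `initial_clusters` and the alphabetical tie-break are assumptions made for the example.

```python
from collections import Counter


def initial_clusters(reads):
    """Collapse identical reads into clusters and sort them by size.

    Returns a list of (sequence, read_count) tuples, most abundant first.
    Ties are broken alphabetically so the order is deterministic
    (the tie-break is an assumption of this sketch).
    """
    counts = Counter(reads)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


reads = ["ACGT", "ACGT", "ACGA", "ACGT", "TTGA", "ACGA"]
for sequence, read_count in initial_clusters(reads):
    print(sequence, read_count)
```

The iterative step would then walk this list from the bottom up, comparing each small cluster against the most abundant ones.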

Format of parameters files

The parameters file determines what errors to cluster on and how many iterations to do. The file is set up so that every line is another iteration. Each line should contain 8 numbers, whose meanings, in order, are as follows:

StopAfter - only check against this many of the top clusters
SizeCutOff - don’t check against clusters of this size or smaller
1baseIndels - The number of one base indels to allow
2baseIndels - The number of two base indels to allow
>2baseIndels - The number of >two base indels to allow
HQMismatches - The number of high quality mismatches to allow
LQMismatches - The number of low quality mismatches to allow
LKMismatches - The number of low kmer frequency mismatches to allow

Each number is separated by a colon, and the first row can be a title line if it starts with an ‘s’. Below is an example of a parameters file.

stopCheck:smallCutoff:1baseIndel:2baseIndel:>2baseIndel:HQMismatches:LQMismatches:LKMismatches
100:3:1:0:0:0:0:0
100:3:2:0:0:0:0:1
100:3:3:0:0:0:1:1
100:3:4:0:0:0:2:1
100:0:1:0:0:0:0:0
100:0:2:0:0:0:0:1
100:0:3:0:0:0:1:1
100:0:4:0:0:0:2:1

This file indicates to do eight iterations. On the first iteration only the top 100 clusters would be checked. Clusters of size 3 or less will not be compared against (this means that clusters of size 3 or less will still be compared to larger clusters, but other clusters will not be compared to clusters of 3 or less, which means they can’t be the seeds for new clusters). Clusters will only be collapsed together if they differ by only 1 one-base indel. On the next iteration the number of top clusters to compare against and the cluster size that can form seeds will be the same, but now clusters that differ by up to 2 one-base indels and 1 low kmer frequency error can collapse. On the fifth iteration clusters of any size can now form seeds for new clusters and we are back to allowing only 1 one-base indel.
Though this is not enforced in any way, this is common practice in using qluster: first allow only large clusters to form seeds, slowly allowing more errors between clusters so that solid clusters can be formed first. Once the maximum number of errors you are willing to allow is reached, go back to a small amount of error and allow small clusters to form seeds so that low-frequency haplotypes are not missed.
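
To make the file layout concrete, here is a minimal Python sketch that reads parameter lines like the ones above into one dictionary per iteration. It is not how SeekDeep itself parses the file; the function name `parse_parameters` and the choice to keep `stopCheck` as text and read the error columns as floats are assumptions made for the example.

```python
FIELDS = ["stopCheck", "smallCutoff", "1baseIndel", "2baseIndel",
          ">2baseIndel", "HQMismatches", "LQMismatches", "LKMismatches"]


def parse_parameters(text):
    """Return one dict per iteration from colon-separated parameter lines."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    # The first row may be a title line if it starts with an 's'.
    if lines and lines[0].startswith("s"):
        lines = lines[1:]
    iterations = []
    for line in lines:
        values = line.split(":")
        if len(values) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} values, got {len(values)}: {line!r}")
        # stopCheck stays a string because it may be a count or a percentage.
        row = {"stopCheck": values[0], "smallCutoff": int(values[1])}
        for name, value in zip(FIELDS[2:], values[2:]):
            row[name] = float(value)
        iterations.append(row)
    return iterations


example = """stopCheck:smallCutoff:1baseIndel:2baseIndel:>2baseIndel:HQMismatches:LQMismatches:LKMismatches
100:3:1:0:0:0:0:0
100:3:2:0:0:0:0:1
100:0:1:0:0:0:0:0
"""
for number, row in enumerate(parse_parameters(example), start=1):
    print(number, row["smallCutoff"], row["1baseIndel"], row["LKMismatches"])
```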

As an alternative to specifying a specific number of top clusters to check, a percentage can be given instead. See the example below.

stopCheck:smallCutoff:1baseIndel:2baseIndel:>2baseIndel:HQMismatches:LQMismatches:LKMismatches
10%:3:1:0:0:0:0:0
100:3:2:0:0:0:0:1
100:3:3:0:0:0:1:1
100:3:4:0:0:0:2:1
10%:0:1:0:0:0:0:0
100:0:2:0:0:0:0:1
100:0:3:0:0:0:1:1
100:0:4:0:0:0:2:1

This means that in the first iteration the top 10% of clusters will be checked, so if there were 2000 clusters, the top 200 would be checked. These formats can be mixed, so here in the next iteration only the top 100 clusters would be checked no matter how many clusters there were.
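
The arithmetic can be sketched as a small Python helper that turns a stopCheck entry into a number of clusters. The function name `resolve_stop_check` is hypothetical, and rounding the percentage down is an assumption of this sketch rather than documented SeekDeep behavior.

```python
def resolve_stop_check(stop_check, cluster_count):
    """Turn a stopCheck entry such as '100' or '10%' into a cluster count."""
    if stop_check.endswith("%"):
        percent = float(stop_check[:-1])
        # Assumption: a fractional result is rounded down.
        return int(cluster_count * percent / 100)
    return int(stop_check)


print(resolve_stop_check("10%", 2000))
print(resolve_stop_check("100", 2000))
```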

OTU percent id clustering

qluster can also perform the traditional OTU percent identity clustering that is often employed by programs that cluster targeted amplicon data. The parameters file is similar to the previous example, but there is only one column for the errors, which is now percent identity.

stopCheck:smallCutoff:id
100:3:.97
100:3:.97
100:0:.97
100:0:.97

This means the program will perform four iterations while allowing clusters to collapse into each other if they differ by less than 3%. The first two columns mean the same as explained above.
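
As a rough illustration of what a percent identity threshold means, the Python sketch below scores two already-aligned sequences. How qluster actually computes identity is not described here, so the formula (matching columns divided by alignment length, gaps counted as differences), the inclusive comparison against the threshold, and the function names are all assumptions of this example.

```python
def percent_identity(aligned_a, aligned_b):
    """Fraction of alignment columns where both sequences carry the same base."""
    if len(aligned_a) != len(aligned_b) or not aligned_a:
        raise ValueError("aligned sequences must be non-empty and of equal length")
    matches = sum(1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-")
    return matches / len(aligned_a)


def can_collapse(aligned_a, aligned_b, min_identity=0.97):
    """True if the two aligned sequences meet the identity threshold."""
    return percent_identity(aligned_a, aligned_b) >= min_identity


reference = "A" * 100
two_mismatches = "A" * 98 + "CC"
four_mismatches = "A" * 96 + "CCCC"
print(can_collapse(reference, two_mismatches))
print(can_collapse(reference, four_mismatches))
```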

The reasoning behind qluster

The first parameter/method described was created out of a need to be more precise than the non-specific OTU percent identity clustering method. In our work with Plasmodium (malaria) we needed to study haplotypes that differed by only one base pair, but we still had to contend with the sequencing and PCR errors that confound a typical targeted amplicon sequencing approach. Thus this method of collapsing only on specific types of errors was born. It allows us to perform what is essentially a percent identity clustering while being very specific about where that percent identity comes from. We can collapse only on small indels, which plague sequencing technologies like 454 and Ion Torrent (and which, for our work in protein coding sequence, are very unlikely to be real as they would cause a frame shift), on base mismatches that come from bases with low quality scores (something that is provided by all technologies), and on low kmer frequency mismatches, which are often PCR errors. At the same time, high quality errors are prevented from collapsing, which has allowed us to find haplotypes that differ by only one base mismatch.

Output files

An output directory will be created for all output files of qluster. The default name is the name of the input file plus the word qluster plus the current date and time when qluster was run. The name can be changed using the --dout flag.

output.fastq - The final consensus sequences for the clusters with a suffix of _t[NUM] where [NUM] is the number of reads associated with that cluster
outputInfo.tab.txt - Information on the number of reads per cluster
runLog_qluster.txt - Contains a time stamp for date and time command was run along with the location where the command was run from and what the command was, also contains total run time
clusters - A directory with a file for each final cluster containing the reads that contributed to making that cluster
internalSnpInfo - A directory with a file for each final cluster with frequency numbers for SNPs, relative to the final consensus sequence, among the reads that created that consensus. These files can be used to see if there was any over-collapsing
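
Because each consensus name in output.fastq ends in _t[NUM], relative abundances can be recovered from the names alone. The Python sketch below shows one way to do that; the cluster names used are made up for the example, and the helper names are hypothetical.

```python
import re


def read_count_from_name(name):
    """Extract NUM from a cluster name that ends in _t[NUM]."""
    match = re.search(r"_t(\d+)$", name)
    if match is None:
        raise ValueError(f"no _t[NUM] suffix in {name!r}")
    return int(match.group(1))


def relative_abundances(names):
    """Map each cluster name to its fraction of all clustered reads."""
    counts = {name: read_count_from_name(name) for name in names}
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


# Hypothetical cluster names, for illustration only.
for name, fraction in relative_abundances(["hapA_t750", "hapB_t250"]).items():
    print(name, fraction)
```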

Examples

General Usage

As stated above the only thing qluster needs to run is an input file and a parameter file

Code

SeekDeep qluster --fastq example.fastq --par par

Input Formats

Input to qluster can be fastq, fasta/qual, or just fasta (though then all mismatches will be high quality mismatches and the advantage of quality scores is lost)

Fastq

Code

SeekDeep qluster --fastq example.fastq --par par.txt

Fasta/Qual

Code

SeekDeep qluster -fasta example.fasta -qual example.fasta.qual -par par.txt
#or if the file is named as above flag -stub can be used
SeekDeep qluster --stub example --par par.txt

Fasta

Code

SeekDeep qluster --fasta example.fasta --par par.txt

454 and Ion Torrent

Because the set up of the parameters file is completely at the whim of the user, and what to collapse on can be somewhat arbitrary (though just picking an OTU cut off is also arbitrary), picking the “correct” parameters is somewhat challenging. We have analyzed several control datasets of known mixtures, and through testing several parameter sets we have found the following parameters file to work best for 454 data (all clusters above .1% were expected clusters and all expected clusters were found). This file is provided with the SeekDeep source code in a folder called SeekDeepParametersFiles and is called 454_it_lkmer2. You can also use these parameters by not supplying the --par flag and using the --454 flag, which will set this automatically

Code

cat 454_it_lkmer2

stopCheck:smallCutoff:1baseIndel:2baseIndel:>2baseIndel:HQMismatches:LQMismatches:LKMismatches
100:3:1:0:0:0:0:1
100:3:2:0:0:0:0:1
100:3:3:0:0:0:1:1
100:3:4:0:0:0:2:1
100:3:5:0:0:0:3:1
100:3:6:0:0:0:4:1
100:3:7:1:0:0:5:1
100:3:7:2:0:0:5:1
100:3:7:3:0:0:5:2
100:3:7:4:0:0:5:2
100:3:7:5:0:0:5:2
100:0:1:0:0:0:0:1
100:0:2:0:0:0:0:1
100:0:3:0:0:0:1:1
100:0:4:0:0:0:2:1
100:0:5:0:0:0:3:1
100:0:6:0:0:0:4:1
100:0:7:1:0:0:5:1
100:0:7:2:0:0:5:1
100:0:7:3:0:0:5:2
100:0:7:4:0:0:5:2
100:0:7:5:0:0:5:2

Code

SeekDeep qluster --fastq example.fastq --par path/to/SeekDeepCode/SeekDeepParametersFiles/454_it_lkmer2

We have found the following file, 454_it_lkmer2_largeHpr, to work well with Ion Torrent data.

Code

cat 454_it_lkmer2_largeHpr

stopCheck:smallCutoff:1baseIndel:2baseIndel:>2baseIndel:HQMismatches:LQMismatches:LKMismatches
100:3:1:0:0:0:0:1
100:3:2:0:0:0:0:1
100:3:3:0:0:0:1:1
100:3:4:0:.99:0:2:1
100:3:5:0:.99:0:3:1
100:3:6:0:.99:0:4:1
100:3:7:1:.99:0:5:1
100:3:7:2:.99:0:5:1
100:3:7:3:.99:0:5:1
100:3:7:4:.99:0:5:2
100:3:7:5:.99:0:5:2
100:0:1:0:0:0:0:1
100:0:2:0:0:0:0:1
100:0:3:0:0:0:1:1
100:0:4:0:.99:0:2:1
100:0:5:0:.99:0:3:1
100:0:6:0:.99:0:4:1
100:0:7:1:.99:0:5:1
100:0:7:2:.99:0:5:1
100:0:7:3:.99:0:5:1
100:0:7:4:.99:0:5:2
100:0:7:5:.99:0:5:2

Code

SeekDeep qluster --fastq example.fastq --par path/to/SeekDeepCode/SeekDeepParametersFiles/454_it_lkmer2_largeHpr

Also if you just use the --ionTorrent flag it will automatically use these parameters

Code

SeekDeep qluster --fastq example.fastq --ionTorrent

Ion Torrent data also comes with a slew of problems; among them is how the quality scores are calculated, which causes some trouble for qluster. There are 3 additional flags to turn on to get the best performance out of qluster for Ion Torrent data

Code

SeekDeep qluster --fastq example.fastq --par path/to/SeekDeepCode/SeekDeepParametersFiles/454_it_lkmer2 --qualTrim 3 --adjustHomopolyerRuns --useCompPerCutOff
#or to turn on all of them just use -ionTorrent
SeekDeep qluster --fastq example.fastq --par path/to/SeekDeepCode/SeekDeepParametersFiles/454_it_lkmer2 --ionTorrent

--qualTrim - Trims out bases that have a quality less than the supplied value (so 3 means bases with a quality of 2 or 1). These are mostly bases at the end of long homopolymer stretches and are often errors (if you convert qualities 1 and 2 into error rates with 10^(-qual/10), they come out to roughly 79% and 63% chance of error)
--adjustHomopolyerRuns - Ion Torrent qualities decrease along a homopolymer stretch, sometimes quite drastically (they can drop down to 4 or 3 near the end), which interferes with the categorizing of errors. This flag takes the quality scores across a homopolymer run and sets them all to the run's average quality
--useCompPerCutOff - A new, somewhat experimental flag. In test data, artifacts appeared in control datasets where an erroneous read occurred at high frequency but was made up only of reads coming from one direction (Ion Torrent gives reads in both directions). This flag throws out clusters that are made up only of reads coming from one direction
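
The quality arithmetic behind the first two flags can be sketched in Python. The error rate formula is the standard Phred conversion quoted above; the homopolymer averaging is only an illustration of the idea, and rounding the average to a whole number is an assumption of this sketch, not a description of SeekDeep's code. Both function names are hypothetical.

```python
def error_probability(quality):
    """Convert a Phred quality score into a probability of error."""
    return 10 ** (-quality / 10)


def average_homopolymer_qualities(sequence, qualities):
    """Set every quality inside a homopolymer run to the run's mean quality."""
    adjusted = list(qualities)
    start = 0
    while start < len(sequence):
        end = start + 1
        # Extend the run while the same base repeats.
        while end < len(sequence) and sequence[end] == sequence[start]:
            end += 1
        if end - start > 1:
            # Assumption: the mean is rounded to the nearest integer.
            mean = round(sum(qualities[start:end]) / (end - start))
            adjusted[start:end] = [mean] * (end - start)
        start = end
    return adjusted


print(round(error_probability(1), 2), round(error_probability(2), 2))
print(average_homopolymer_qualities("ACCCCG", [30, 30, 28, 12, 6, 30]))
```

In the example, the four C qualities (30, 28, 12, 6) are replaced by their mean of 19, so the low scores at the end of the run no longer look like isolated low quality bases.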

Indels in homopolymers

SeekDeep by default weighs indels in homopolymer runs differently than other indels, because the majority of data that SeekDeep has been used on has been 454 and Ion Torrent data (this behavior can be turned off by using the --noHomopolymerWeighting flag). See the methods paper for a detailed description of how this weighting is done, but essentially, setting the large (>2 base) indel value to less than 1 allows clusters to collapse when they differ by indels of more than 2 bases that consist entirely of a single base inside a homopolymer run.

Illumina

As with the 454 and Ion Torrent datasets above, we have found by experiment a parameters file that works best for Illumina data and put it in the SeekDeepParametersFiles folder as well, under the name illumina_lkmer2. As mentioned above, SeekDeep by default weighs indels found in homopolymers differently; since this isn’t a problem in Illumina data, the weighting should be turned off, or you can do this with the --illumina flag

Code

cat illumina_lkmer2

stopCheck:smallCutoff:1baseIndel:2baseIndel:>2baseIndel:HQMismatches:LQMismatches:LKMismatches
100:3:1:0:0:0:0:1
100:3:2:0:0:0:0:1
100:3:2:0:0:0:1:1
100:3:2:0:0:0:2:1
100:3:2:0:0:0:3:1
100:3:2:0:0:0:4:1
100:3:2:0:0:0:5:1
100:3:2:0:0:0:6:2
100:3:2:0:0:0:7:2
100:3:2:0:0:0:8:2
100:3:2:0:0:0:8:2
100:0:1:0:0:0:0:1
100:0:2:0:0:0:0:1
100:0:2:0:0:0:1:1
100:0:2:0:0:0:2:1
100:0:2:0:0:0:3:1
100:0:2:0:0:0:4:1
100:0:2:0:0:0:5:1
100:0:2:0:0:0:6:2
100:0:2:0:0:0:7:2
100:0:2:0:0:0:8:2
100:0:2:0:0:0:8:2

Code

SeekDeep qluster --fastq example.fastq --par path/to/SeekDeepCode/SeekDeepParametersFiles/illumina_lkmer2 --noHomopolymerWeighting

If you just use the --illumina flag this will just use these parameters

Code

SeekDeep qluster --fastq example.fastq --illumina

OTU Clustering

You can supply a parameters file with the OTU level to cluster at, which allows you to first cluster more finely (.99) and then allow .97 in later iterations, or you can use the --otu flag to cluster at a specific OTU level for several iterations.

Code

#97% otu clustering 
SeekDeep qluster --fastq example.fastq --otu .97
#99% otu clustering 
SeekDeep qluster --fastq example.fastq --otu .99

Allowing high quality differences

You can allow high quality differences in the supplied parameters file, or by using the --hq flag in conjunction with the --454, --ionTorrent, or --illumina flags

Code

#allow low quality differences and 1 high quality difference 
SeekDeep qluster --fastq example.fastq --illumina --hq 1
#allow low quality differences and indel differences common in 454 or ionTorrent and 1 high quality difference
#454
SeekDeep qluster --fastq example.fastq --454 --hq 1 
#ion torrent
SeekDeep qluster --fastq example.fastq --ionTorrent --hq 1

Quality to categorize errors

To determine if a mismatch is a low quality mismatch, the qualities of the mismatching bases are examined along with the flanking base qualities. The qualities of the mismatching bases are compared to what is called a primary quality threshold (default 20), and the qualities of the surrounding bases (default number of flanking bases is 2) are compared to what is called a secondary quality threshold (default 15). To change the thresholds, use the --qualThres flag and give two numbers separated by a comma (e.g. for a primary quality of 20 and a secondary of 15 use 20,15). See the methods paper for full details on this reasoning.

Code

#raise the threshold meaning more errors will be counted as low quality mismatches
SeekDeep qluster --fastq example.fastq --par par.txt --qualThres 25,20

To change the number of flanking bases used, use the flag --qualThresWindow

Code

#shrink the window to just the previous base and the next base next to the mismatch 
SeekDeep qluster --fastq example.fastq --par par.txt --qualThresWindow 1
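
The way the two thresholds and the window interact can be sketched in Python. This is a simplified reading of the description above, not SeekDeep's implementation: it looks at the qualities of a single read, and it assumes a mismatch counts as low quality when the mismatching base falls strictly below the primary threshold or any flanking base falls strictly below the secondary threshold. The function name is hypothetical.

```python
def is_low_quality_mismatch(qualities, position, primary=20, secondary=15, window=2):
    """Classify the mismatch at `position` from one read's quality scores."""
    # The mismatching base itself is held to the primary threshold.
    if qualities[position] < primary:
        return True
    # Up to `window` bases on each side are held to the secondary threshold.
    start = max(0, position - window)
    end = min(len(qualities), position + window + 1)
    flanking = qualities[start:position] + qualities[position + 1:end]
    return any(quality < secondary for quality in flanking)


qualities = [14, 30, 30, 30, 30]
print(is_low_quality_mismatch(qualities, 2))            # flank of 2 reaches the 14
print(is_low_quality_mismatch(qualities, 2, window=1))  # flank of 1 does not
```

Under these assumptions, raising the thresholds classifies more mismatches as low quality, and shrinking the window makes fewer neighboring bases able to trigger that classification.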

Gap scoring

The global alignments done by qluster are actually semi-global alignments, where gap scoring can be applied differently for gaps appearing at the ends of sequences. Since the input data for qluster is targeted amplicon sequence, the default alignment parameters are a gap opening penalty of 5 at the front and in the middle of the sequence with a penalty of 1 for extending gaps, and zero gap penalty for gaps at the end of the sequence, since this type of data often has fragmented ends but intact fronts. This can be changed with four flags: --gapRight (gaps at the end), --gap (gaps in the middle), --gapLeft (gaps at the front), and --gapAll. The input is two integers separated by a comma; the first number is the opening penalty and the second is the gap extension penalty (e.g. 5,1 means 5 to open and 1 to extend).

Code

#set gaps at the end of the sequence to 5 open 1 extend 
SeekDeep qluster --fastq example.fastq --par par.txt --gapRight 5,1
#make gaps at the beginning and end of sequences have no penalty
SeekDeep qluster --fastq example.fastq --par par.txt --gapRight 0,0 --gapLeft 0,0
#make gaps everywhere 5 open, 1 extend
SeekDeep qluster --fastq example.fastq --par par.txt --gapAll 5,1

Additional options

Changing out directory name

To change the default directory name use the -dout flag. SeekDeep will never overwrite a directory if it already exists and will fail and quit if it tries to create a directory that exists.

Code

SeekDeep qluster --fastq example.fastq --par par.txt --dout clusteringDir

The --dout option also understands the keyword TODAY to mean that the current date and time should be inserted there, though this means an output directory name can never have TODAY in all caps in it

Code

SeekDeep qluster --fastq example.fastq --par par.txt --dout clusteringDir_TODAY

Kmer Frequency Cut off level

The default cut off for a mismatch to be considered low frequency is 1. To modify this number use the --runCutOff flag, this can take either a specific number or a percentage

Code

SeekDeep qluster --fastq example.fastq --par par.txt --runCutOff 5 
#if the kmer with the mismatching base in the middle is found in only 5 reads count as low frequency

SeekDeep qluster --fastq example.fastq --par par.txt --runCutOff .2%
#if the kmer with the mismatching base in the middle is found in only .2% of reads count as low frequency

A back up frequency can also be given when giving a percentage. For example, if there were 1000 input sequences and --runCutOff was given .01%,1, the cut off would default to 1 since .01% of 1000 would round down to 0

Code

SeekDeep qluster --fastq example.fastq --par par.txt --runCutOff .01%,1
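
The three accepted forms (a count, a percentage, and a percentage with a back up) can be illustrated with a short Python sketch. The function name `resolve_run_cutoff` is hypothetical, and both the rounding down and the rule that the back up is used only when the percentage works out to 0 are assumptions based on the example above.

```python
def resolve_run_cutoff(value, read_count):
    """Turn '5', '.2%' or '.01%,1' into an absolute kmer occurrence cut off."""
    percent_part, _, backup_part = value.partition(",")
    if not percent_part.endswith("%"):
        return int(percent_part)
    # Assumption: the percentage of reads is rounded down.
    cutoff = int(read_count * float(percent_part[:-1]) / 100)
    if cutoff == 0 and backup_part:
        return int(backup_part)
    return cutoff


print(resolve_run_cutoff("5", 1000))
print(resolve_run_cutoff("1%", 1000))
print(resolve_run_cutoff(".01%,1", 1000))
```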

Caching alignments

The most expensive part of qluster is the alignments it has to do to compare the sequences. These alignments can be cached in a directory in a somewhat compressed way, so that if qluster has to be run again, for example if you decide to change the parameters to collapse on, the re-run will be much faster. Caching is turned on by using the --alnInfoDir flag and giving it a directory to cache the alignments in

Code

SeekDeep qluster --fastq example.fastq --par par1.txt --alnInfoDir alnCache
#this second time around would run in a fraction of the time it took the first one
SeekDeep qluster --fastq example.fastq --par par2.txt --alnInfoDir alnCache

Alignments are dependent on gap scoring so if gap scoring changes new alignments have to be cached

Code

SeekDeep qluster --fastq example.fastq --par par1.txt --alnInfoDir alnCache --gapAll 5,1
#the second time around would require caching of new alignments
SeekDeep qluster --fastq example.fastq --par par2.txt --alnInfoDir alnCache --gapAll 7,1
#but now either of the below commands would be really fast
SeekDeep qluster --fastq example.fastq --par par2.txt --alnInfoDir alnCache --gapAll 5,1
SeekDeep qluster --fastq example.fastq --par par1.txt --alnInfoDir alnCache --gapAll 7,1

Marking possible chimeric sequence

qluster marks any final clusters that look suspiciously like chimeric sequence (see the methods paper for details on how this is done). To turn this behavior off use the --noMarkChimeras flag. To control what can be marked as chimeric use the --parFreqs flag. This sets the frequency multiplier for the parent sequences: it defaults to 2, which means the parent sequences have to be at least twice as abundant as the possible child chimera for the cluster to be marked

Code

SeekDeep qluster --fastq example.fastq --par par.txt --noMarkChimeras 
#increase multiplier to 5 times as much
SeekDeep qluster --fastq example.fastq --par par.txt --parFreqs 5
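
The abundance requirement that --parFreqs controls can be written as a one-line check. The Python sketch below covers only that frequency condition, not how candidate parents are identified from the sequences; the function name and the read counts are made up for the example.

```python
def parents_abundant_enough(child_reads, parent_reads, par_freqs=2):
    """True if every parent has at least par_freqs times the child's reads."""
    return all(count >= par_freqs * child_reads for count in parent_reads)


# A possible chimera with 10 reads and two candidate parents.
print(parents_abundant_enough(10, [25, 40]))               # default multiplier of 2
print(parents_abundant_enough(10, [25, 40], par_freqs=5))  # stricter multiplier
```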

Additional alternative directory output

The final clustered haplotypes file (output.fastq) can also be written to another directory, which is used in conjunction with SeekDeep processClusters to organize the input to that command. The directory is determined by giving the --additionalOut flag a file where the first column is the associated MID name and the second is the directory to output to. See the full tutorial on the SeekDeep pipeline for more details

Code

SeekDeep qluster --fastq example.fastq --par par.txt --additionalOut popClustering/locationByIndex/1.tab.txt

Parameters for extra speed up

Run clustering without singlets

By using the --leaveOutSinglets flag the analysis will be done leaving out all singlet reads. This can be useful for extreme read coverage, where the number of singlet reads can be enormous but they aren’t really needed.

Code

SeekDeep qluster --fastq example.fastq --par par.txt --leaveOutSinglets

Skipping compare of dissimilar sequences

By using the --fastClustering flag, pairwise comparisons between sequences that differ in their nucleotide composition will be skipped. The amount of difference is set by --nucCutOff, which can range from 0 to 1 and defaults to 0.05, meaning comparisons with a nucleotide composition difference over 0.05 will be skipped.

Code

SeekDeep qluster --fastq example.fastq --par par.txt --fastClustering

Code

SeekDeep qluster --fastq example.fastq --par par.txt --fastClustering --nucCutOff 0.1
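
To show the kind of pre-filter this describes, the Python sketch below compares the base composition of two sequences before any alignment would be attempted. The exact measure SeekDeep uses is not stated here, so the metric (half the summed absolute difference in A, C, G, T fractions, which ranges from 0 to 1) and the function names are assumptions made for illustration.

```python
from collections import Counter


def composition(sequence):
    """Fraction of the sequence made up of each of A, C, G and T."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    counts = Counter(sequence)
    return {base: counts[base] / len(sequence) for base in "ACGT"}


def composition_difference(seq_a, seq_b):
    """Assumed metric: half the summed absolute difference in base fractions."""
    comp_a, comp_b = composition(seq_a), composition(seq_b)
    return sum(abs(comp_a[base] - comp_b[base]) for base in "ACGT") / 2


def should_skip(seq_a, seq_b, nuc_cutoff=0.05):
    """True if the pair is too different in composition to bother aligning."""
    return composition_difference(seq_a, seq_b) > nuc_cutoff


reference = "A" * 50 + "C" * 50
print(should_skip(reference, "A" * 48 + "C" * 52))
print(should_skip(reference, "A" * 42 + "C" * 58))
print(should_skip(reference, "A" * 42 + "C" * 58, nuc_cutoff=0.1))
```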

Converging on Iteration

By default an iteration will run once. You can make an iteration run until there is no more collapsing, though this has the potential to significantly increase run time without much pay off. This behavior is turned on by using --converge.

Code

SeekDeep qluster --fastq example.fastq --par par.txt --converge

:::{.callout-note} # qluster's Purpose The main purpose of qluster is to take de-multiplex data/single sample data and to create final haplotypes with relative abundances by clustering reads by collapsing on specific errors. The program requires at most two options: input sequence files either in fastq(`--fastq`), fasta/qual(`--fasta`,`--qual`), or fastq format (`--fasta`) and a parameters files (`--par`) that determine the extent of the clustering of the reads. Several default parameter files are provided with the source code of SeekDeep. ::: # Getting usage command line Just typing the name of the program will give a help message on running the program ```{r, engine='bash',comment="",eval=FALSE} SeekDeep qluster ``` ```{r, engine='bash',comment="",highlight=TRUE, echo=FALSE} SeekDeep qluster | gsed -r "s/\x1B\[([0-9]{1,2}(;[0-9]{1,2})?)?[m|K]//g" | head -15 echo ... echo ... ``` Also calling `-help` will do the same ```{r, engine='bash',comment="",eval=FALSE} SeekDeep qluster --help ``` Also all flags in SeekDeep are case insensitive and so all the following would have the same results ```{r, engine='bash',comment="",eval=FALSE} SeekDeep qluster --help SeekDeep qluster --HELP SeekDeep qluster --HeLP SeekDeep qluster --HeLp SeekDeep QLUSTER --HeLP SeekDeep qluster --HeLp SeekDeep qlUstEr --HeLp ``` ```{r, engine='bash',comment="",eval=FALSE, echo=FALSE} SeekDeep qluster --getFlags ``` ```{r, engine='bash',comment="",highlight=TRUE, eval=FALSE, echo=FALSE} SeekDeep qluster --getFlags | gsed -r "s/\x1B\[([0-9]{1,3}((;[0-9]{1,3})*)?)?[m|K]//g" | head -20 echo ... echo ... ``` # qluster Method Overview ## Core Idea At the core of qluster is iteratively clustering collections of raw reads, which from here on will be called clusters, based on user supplied error profiles. The errors that are consider in this profile are 1base indels, 2base indels, >2base indels, low quality errors, high quality errors, and low kmer frequency errors. 
See future method for more extensive explanation of these errors than the short description that follows now. The low and high quality errors are based on per base quality scores of mismatching bases along with flanking quality scores. Low kmer errors are based on the occurrence of kmers centered on the mismatch. On each iteration a different error profile is given to allow for different amount of error to cluster reads together. ## Method flow Input reads are read in and initial clusters are created based on a simple sequence comparison creating clusters of unique sequences. Clusters are then sorted by size and then the iterative clustering process is perform. Each iteration starts by taking the clusters at the very bottom of the list and comparing it to the most abundant clusters. Comparisons are done by doing a global alignment and counting and the categorizing the errors into the categories mentioned above. This error profile is then compared to the allowable error profile for that iteration. If the error profile passes the current profile it is then checked against the next clusters in line for any other possible matches and is then added to cluster it matches best. At the end of each iteration a consensus sequence is calculated for each clusters based on majority rules basis, clusters are then resorted by their new read numbers and the new iteration is started. #Format of parameters files The parameters file will determine what errors to cluster on and how many iterations to do. The set up of the file is that every line is another iteration. Each line should contain 8 numbers. 
The order of what they mean is as follows * StopAfter - only check the top of this many reads * SizeCutOff - don't check against clusters of this size * 1baseIndels - The number of one base indels to allow * 2baseIndels - The number of two base indels to allow * \>2baseIndels - The number of >two base indels to allow * HQMismatches - The number of high quality mismatches to allow * LQMismatches - The number of low quality mismatches to allow * LKMismatches - The number of low kmer frequency mismatches to allow Each number is separated by a colon and the first row can be a title line if it starts with a 's' below is an example of a parameters file ```{r, engine='bash',comment="", echo=FALSE} cat ../extraFiles/example_parameters_file ``` This file indicates to do eight iterations. On the first iteration only the top 100 clusters would be checked. Clusters of size 3 will be compared against (this means that clusters of size 3 or less will still be compared to larger clusters but other clusters will not be compared to clusters of 3 or less which means they can't be the seeds for new clusters). And clusters will only be collapsed together if they only differ by 1 one base indel. On the next iteration the number of top clusters to compare against and the cluster size that can form seeds will be the same but now clusters that differ by 2 one base indels and by 1 low kmer frequency errors. On the fourth iteration clusters of any size can now form seeds for new clusters and we are back to allowing only 1 one base indels. Though this is not enforced in anyway this is common practice in using qluster, first allow only large clusters to form seeds slowly allowing more errors between clusters so that first solid clusters can be form. Once the number of errors willing to be allowed is reached go back to the a small amount of error and allow small clusters to form seeds so not low frequency haplotypes aren't found. 
Also alternative to specifying a specific number of top clusters to check a percentage can be given instead. See example below. ```{r, engine='bash',comment="", echo=FALSE} cat ../extraFiles/example_parameters_file_percent ``` This means that in the first iteration the top 10% of reads will be checked, so if there were 2000 clusters, the top 200 would be checked. And the these formats can be mixed so the here the next iteration only the top 100 clusters would be checked no matter how many clusters there were. ## OTU percent id clustering qluster can also offer the traditional OTU percent identity clusters that is often employed by programs to cluster targeted amplicon clustering. The parameters file is similar to the previous example but there is only one column for the errors that is now percent identity. ```{r, engine='bash',comment="", echo=FALSE} cat ../extraFiles/example_parameters_otu_file ``` This means the program will perform four iteration while allowing clusters to collapse into each other if they differ by less than 3%. The first two columns mean the same as explained above. # The reasoning behind qluster The first parameter/method describe was created out of a need to be more precise than the non-specific OTU percent identity clustering method. In our work with Plasmodium (Malaria) we had a need to study haplotypes that differed by only one base pair but we still had to contend with sequencing and PCR errors that confound a typical targeted amplicon sequencing approach. Thus this method of collapsing only on specific types of errors was born, it allows us to perform what is essentially a percent identity clustering but be very specific where that percent identity is coming from. 
This allows us to collapse only small indels (which for our work in protein coding sequence are very unlikely as they would cause a frame shift) and that plague sequencing technologies like 454 and Ion Torrent and to collapse only on base mismatches that come from bases with low quality scores (something that is provided by all technologies) and on low kmer frequency mismatches which are often PCR error while preventing high quality errors from collapsing which has allowed us to find haplotypes that only differ by one base mismatch. # Output files An output directory will be created for all output files of qluster. The default name is the name of the input file plus the word qluster plus the current date and time when qluster was run. The name can be changed using the `-dout` flag. * **output.fastq** - The final consensus sequences for the clusters with a suffix of _t[NUM] where [NUM] is the number of reads associated with that cluster * **outputInfo.tab.txt** - Information on the number of reads per cluster * **runLog_qluster.txt** - Contains a time stamp for date and time command was run along with the location where the command was run from and what the command was, also contains total run time * **clusters** - A directory with a file for each final cluster containing the reads that contributed to making that cluster * **internalSnpInfo** - A directory with a file for each final clusters with frequency numbers for snps to the final consensus sequence of the reads that created that consensus. 
This file can be used to check whether there was any over-collapsing.

# Examples

## General Usage

As stated above, the only things qluster needs to run are an input file and a parameters file.

```{r, engine='bash',comment="", eval=FALSE}
SeekDeep qluster --fastq example.fastq --par par
```

### Input Formats

Input to qluster can be fastq, fasta/qual, or just fasta (though with fasta alone all mismatches will be treated as high quality mismatches and the advantage of quality scores is lost).

#### Fastq

```{r, engine='bash',comment="", eval=FALSE}
SeekDeep qluster --fastq example.fastq --par par.txt
```

#### Fasta/Qual

```{r, engine='bash',comment="", eval=FALSE}
SeekDeep qluster --fasta example.fasta --qual example.fasta.qual --par par.txt
#or if the files are named as above, the --stub flag can be used
SeekDeep qluster --stub example --par par.txt
```

#### Fasta

```{r, engine='bash',comment="", eval=FALSE}
SeekDeep qluster --fasta example.fasta --par par.txt
```

## 454 and Ion torrent

Because the setup of the parameters file is entirely up to the user, and what to collapse on can be somewhat arbitrary (though just picking an OTU cut off is also arbitrary), picking the "correct" parameters is somewhat challenging. We have analyzed several control datasets of known mixtures, and after testing several sets of parameters we found the following parameters file to work best for 454 data (all clusters above .1% were expected clusters and all expected clusters were found). This file is provided with the SeekDeep source code in a folder called SeekDeepParametersFiles and is called `454_it_lkmer2`.
You can also use these parameters by omitting the `--par` flag and using the `--ionTorrent` flag, which sets them automatically.

```{r, engine='bash',comment="", echo=TRUE, eval=FALSE}
cat 454_it_lkmer2
```

```{r, engine='bash',comment="", echo=FALSE}
cat ../extraFiles/454_it_lkmer2
```

```{r, engine='bash',comment="", eval=FALSE}
SeekDeep qluster --fastq example.fastq --par path/to/SeekDeepCode/SeekDeepParametersFiles/454_it_lkmer2
```

We have found the following file, `454_it_lkmer2_largeHpr`, to work well with Ion Torrent data.

```{r, engine='bash',comment="", echo=TRUE, eval=FALSE}
cat 454_it_lkmer2_largeHpr
```

```{r, engine='bash',comment="", echo=FALSE}
cat ../extraFiles/454_it_lkmer2_largeHpr
```

```{r, engine='bash',comment="", eval=FALSE}
SeekDeep qluster --fastq example.fastq --par path/to/SeekDeepCode/SeekDeepParametersFiles/454_it_lkmer2
```

If you just use the `--ionTorrent` flag, these parameters are used automatically.

```{r, engine='bash',comment="", eval=FALSE}
SeekDeep qluster --fastq example.fastq --ionTorrent
```

Ion Torrent data comes with a slew of problems, among them the way quality scores are calculated, which causes some trouble for qluster. There are 3 additional flags to turn on to get the best performance out of qluster for Ion Torrent data.

```{r, engine='bash',comment="", eval=FALSE}
SeekDeep qluster --fastq example.fastq --par path/to/SeekDeepCode/SeekDeepParametersFiles/454_it_lkmer2 --qualTrim 3 --adjustHomopolyerRuns --useCompPerCutOff
#or to turn on all of them just use --ionTorrent
SeekDeep qluster --fastq example.fastq --par path/to/SeekDeepCode/SeekDeepParametersFiles/454_it_lkmer2 --ionTorrent
```

* **--qualTrim** - Trims out bases that have a quality less than the supplied value (so 3 means bases with a qual of 2 or 1). These are mostly bases at the end of long homopolymer stretches and are often errors (if you convert quals 1 and 2 into their error rates, `10^(-qual/10)`, this comes out to about a 79% and 63% chance of error).
* **--adjustHomopolyerRuns** - Ion Torrent quality scores decrease along a homopolymer stretch, sometimes quite drastically (they can drop down to 4 or 3 near the end), which interferes with the categorizing of errors. This flag takes the quality scores across a homopolymer run and sets them all to the average quality of the run.
* **--useCompPerCutOff** - A new, somewhat experimental flag. In test data, artifacts appeared in control datasets where an erroneous read occurred at high frequency but was comprised only of reads coming from one direction (Ion Torrent gives reads in both directions). This flag throws out clusters that are comprised only of reads coming from one direction.

### Indels in homopolymers

SeekDeep by default weighs indels in homopolymer runs differently than other indels, because the majority of data that SeekDeep has been used on has been 454 and Ion Torrent data. This behavior can be turned off with the `--noHomopolymerWeighting` flag. See the methods paper for a detailed description of how this weighting is done, but essentially, setting the large indel (>2 bases) allowance to a value less than 1 allows clusters to collapse when they differ by an indel of more than 2 bases that is comprised entirely of a single base inside a homopolymer run.
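
The error rates quoted for `--qualTrim` above follow from the Phred relation `10^(-qual/10)`. The short shell sketch below evaluates that formula for a few quality values; the function name `phred_error_pct` is made up for illustration and is not part of SeekDeep.

```bash
# Convert a Phred quality score to a percent chance of error: 100 * 10^(-qual/10).
# phred_error_pct is a hypothetical helper, not a SeekDeep command.
phred_error_pct() {
  awk -v q="$1" 'BEGIN { printf "%.0f\n", 100 * 10 ^ (-q / 10) }'
}

# Print the chance of error for the lowest qualities and for a typical good base.
for q in 1 2 3 20; do
  echo "Q$q: $(phred_error_pct "$q")% chance of error"
done
```

Qualities 1 and 2 are more likely wrong than right, which is why trimming everything below 3 removes mostly erroneous bases.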
## Illumina

As with the 454 and Ion Torrent datasets above, we have found by experiment a parameters file that works best for Illumina data and placed it in the SeekDeepParametersFiles folder as `illumina_lkmer2`. As mentioned above, SeekDeep by default weighs indels found in homopolymers differently. Since this isn't a problem in Illumina data, the weighting should be turned off with `--noHomopolymerWeighting`, or you can do this with the `--illumina` flag.

```{r, engine='bash',comment="", echo=TRUE, eval=FALSE}
cat illumina_lkmer2
```

```{r, engine='bash',comment="", echo=FALSE}
cat ../extraFiles/illumina_lkmer2
```

```{r, engine='bash',comment="", eval=FALSE}
SeekDeep qluster --fastq example.fastq --par path/to/SeekDeepCode/SeekDeepParametersFiles/illumina_lkmer2 --noHomopolymerWeighting
```

If you just use the `--illumina` flag, these parameters are used automatically.

```{r, engine='bash',comment="", eval=FALSE}
SeekDeep qluster --fastq example.fastq --illumina
```

## Otu Clustering

You can supply a parameters file that specifies which OTU identity to cluster at, which allows you to first cluster more finely (.99, i.e. 99% identity) and then allow .97 in later iterations. Alternatively, you can use the `--otu` flag to cluster at one specific OTU identity for several iterations.
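
To make the identity cut off concrete, here is a minimal shell sketch of a percent identity check between two sequences. It assumes the sequences are already aligned and of equal length, and that identity is simply matching positions divided by length; how qluster computes identity internally may differ. The helper name `percent_identity` is hypothetical.

```bash
# percent_identity SEQ1 SEQ2: fraction of positions that match.
# Hypothetical helper; assumes equal-length, pre-aligned sequences.
percent_identity() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    n = length(a)
    matches = 0
    for (i = 1; i <= n; i++)
      if (substr(a, i, 1) == substr(b, i, 1)) matches++
    printf "%.3f\n", matches / n
  }'
}

# Two 40-base sequences that differ only at the last position.
unit=ACGTACGTAC
seq_a="$unit$unit$unit$unit"
seq_b="$unit$unit$unit${unit%C}A"

identity=$(percent_identity "$seq_a" "$seq_b")
# Collapse when the sequences differ by less than 3%, i.e. identity above .97.
if awk -v id="$identity" -v cut=0.97 'BEGIN { exit !(id > cut) }'; then
  echo "identity $identity: would collapse at .97"
else
  echo "identity $identity: would stay separate at .97"
fi
```

With one mismatch in 40 bases the pair passes a .97 cut off but would fail a .99 cut off, which is the difference between the finer and coarser iterations described above.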
```{r, engine='bash',comment="", eval=FALSE}
#97% otu clustering
SeekDeep qluster --fastq example.fastq --otu .97
#99% otu clustering
SeekDeep qluster --fastq example.fastq --otu .99
```

## Allowing high quality differences

You can allow high quality differences in the supplied parameters file, or with the `--hq` flag in conjunction with the `--454`, `--ionTorrent`, or `--illumina` flags.

```{r, engine='bash',comment="", eval=FALSE}
#allow low quality differences and 1 high quality difference
SeekDeep qluster --fastq example.fastq --illumina --hq 1
#allow low quality differences and indel differences common in 454 or ionTorrent and 1 high quality difference
#454
SeekDeep qluster --fastq example.fastq --454 --hq 1
#ion torrent
SeekDeep qluster --fastq example.fastq --ionTorrent --hq 1
```

## Quality to categorize errors

To determine whether a mismatch is a low quality mismatch, the qualities of the mismatching bases are examined along with the flanking base qualities. The qualities of the mismatching bases are compared to what is called the primary quality threshold (default 20), and the qualities of the surrounding bases (the default number of flanking bases is 2) are compared to what is called the secondary quality threshold (default 15). To change the thresholds, use the `--qualThres` flag with two numbers separated by a comma (e.g. for a primary qual of 20 and a secondary of 15 use `20,15`). See the methods paper for full details on this reasoning.
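
The comparison described above can be sketched in a few lines of shell. This is an illustration under stated assumptions, not qluster's implementation: it checks a single read, and it assumes a mismatch counts as high quality only when the mismatching base meets the primary threshold and every flanking base meets the secondary threshold. The helper name `classify_mismatch` is hypothetical.

```bash
# classify_mismatch PRIMARY SECONDARY BASE_QUAL FLANK_QUAL...
# Hypothetical helper (not a SeekDeep command). Assumed rule: a mismatch is
# "high" quality only if the mismatching base meets the primary threshold and
# every flanking base meets the secondary threshold; otherwise it is "low".
classify_mismatch() {
  local primary=$1 secondary=$2 base_qual=$3
  shift 3
  if [ "$base_qual" -lt "$primary" ]; then
    echo low
    return
  fi
  local q
  for q in "$@"; do
    if [ "$q" -lt "$secondary" ]; then
      echo low
      return
    fi
  done
  echo high
}

# Defaults from the text: primary 20, secondary 15, two flanking bases per side.
classify_mismatch 20 15 30 25 22 18 16   # every threshold is met
classify_mismatch 20 15 30 25 12 18 16   # one flanking base is below 15
classify_mismatch 25 20 22 25 22 21 20   # base is below a raised primary of 25
```

Under this assumed rule, raising either threshold can only move mismatches from the high quality category to the low quality one.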
```{r, engine='bash',comment="", eval=FALSE}
#raise the thresholds, meaning more errors will be counted as low quality mismatches
SeekDeep qluster --fastq example.fastq --par par.txt --qualThres 25,20
```

To change the number of flanking bases used, use the `--qualThresWindow` flag.

```{r, engine='bash',comment="", eval=FALSE}
#shrink the window to just the base before and the base after the mismatch
SeekDeep qluster --fastq example.fastq --par par.txt --qualThresWindow 1
```

## Gap scoring

The global alignments done by qluster are actually semi-global alignments, where gap scoring can be applied differently for gaps appearing at the ends of sequences. Since the input data for qluster is targeted amplicon sequence, the default alignment parameters are a penalty of 5 for opening a gap at the front or in the middle of the sequence, a penalty of 1 for extending a gap, and zero penalty for gaps at the end of the sequence, since this type of data often has fragmented ends but intact fronts. This can be changed with four flags: `--gapRight`, `--gap` (gaps in the middle), `--gapLeft`, and `--gapAll`. The input is two integers separated by a comma, where the first number is the gap opening penalty and the second is the gap extension penalty (e.g. `5,1` means 5 to open and 1 to extend).

```{r engine='bash', eval=FALSE}
#set gaps at the end of the sequence to 5 open 1 extend
SeekDeep qluster --fastq example.fastq --par par.txt --gapRight 5,1
#make gaps at the beginning and end of sequences have no penalty
SeekDeep qluster --fastq example.fastq --par par.txt --gapRight 0,0 --gapLeft 0,0
#make gaps everywhere 5 open, 1 extend
SeekDeep qluster --fastq example.fastq --par par.txt --gapAll 5,1
```

## Additional options

### Changing out directory name

To change the default directory name use the `--dout` flag. SeekDeep will never overwrite a directory that already exists; it will fail and quit if it tries to create a directory that exists.
```{r engine='bash', eval=FALSE}
SeekDeep qluster --fastq example.fastq --par par.txt --dout clusteringDir
```

The `--dout` option also understands the keyword TODAY to mean that the current date and time should be inserted there, though this means an output directory name can never contain TODAY in all caps.

```{r engine='bash', eval=FALSE}
SeekDeep qluster --fastq example.fastq --par par.txt --dout clusteringDir_TODAY
```

### Kmer Frequency Cut off level

The default cut off for a mismatch to be considered low frequency is 1. To modify this number use the `--runCutOff` flag, which can take either a specific number or a percentage.

```{r engine='bash', eval=FALSE}
#if the kmer with the mismatching base in the middle is found in only 5 reads count as low frequency
SeekDeep qluster --fastq example.fastq --par par.txt --runCutOff 5
#if the kmer with the mismatching base in the middle is found in only .2% of reads count as low frequency
SeekDeep qluster --fastq example.fastq --par par.txt --runCutOff .2%
```

A backup frequency can also be given alongside a percentage. For example, if there were 1000 input sequences and `--runCutOff` was given `.01%,1`, the cut off would default to 1, since .01% of 1000 is 0.1, which would otherwise give a cut off of 0.

```{r engine='bash', eval=FALSE}
SeekDeep qluster --fastq example.fastq --par par.txt --runCutOff .01%,1
```

### Caching alignments

The most expensive part of qluster is the alignments it has to do to compare the sequences. These alignments can be cached in a directory in a somewhat compressed form, so that if qluster has to be run again, for example because you decided to change the parameters to collapse on, the re-run will be much faster.
Caching is turned on by using the `--alnInfoDir` flag and giving it a directory to cache the alignments in.

```{r engine='bash', eval=FALSE}
SeekDeep qluster --fastq example.fastq --par par1.txt --alnInfoDir alnCache
#this second time around would run in a fraction of the time it took the first one
SeekDeep qluster --fastq example.fastq --par par2.txt --alnInfoDir alnCache
```

Alignments are dependent on gap scoring, so if the gap scoring changes, new alignments have to be cached.

```{r engine='bash', eval=FALSE}
SeekDeep qluster --fastq example.fastq --par par1.txt --alnInfoDir alnCache --gapAll 5,1
#the second time around would require caching of new alignments
SeekDeep qluster --fastq example.fastq --par par2.txt --alnInfoDir alnCache --gapAll 7,1
#but now either of the below commands would be really fast
SeekDeep qluster --fastq example.fastq --par par2.txt --alnInfoDir alnCache --gapAll 5,1
SeekDeep qluster --fastq example.fastq --par par1.txt --alnInfoDir alnCache --gapAll 7,1
```

### Marking possible chimeric sequence

qluster marks any final clusters that look suspiciously like chimeric sequence (see the methods paper for details on how this is done). To turn this behavior off use the `--noMarkChimeras` flag. To control what can be marked as chimeric use the `--parFreqs` flag.
This sets the frequency multiplier that the parent sequences must reach for the cluster to be marked as chimeric. It defaults to 2, which means the parent sequences have to be at least twice as abundant as the possible child chimera for it to be marked.

```{r engine='bash', eval=FALSE}
SeekDeep qluster --fastq example.fastq --par par.txt --noMarkChimeras
#increase multiplier to 5 times as much
SeekDeep qluster --fastq example.fastq --par par.txt --parFreqs 5
```

### Additional alternative directory output

The final clustered haplotypes file (output.fastq) can also be written to another directory, which is used in conjunction with `SeekDeep processClusters` to organize the input to that command. The directory is determined by giving the `--additionalOut` flag a file where the first column is the associated MID name and the second is the directory to output to. See the full tutorial on the SeekDeep pipeline for more details.

```{r engine='bash', eval=FALSE}
SeekDeep qluster --fastq example.fastq --par par.txt --additionalOut popClustering/locationByIndex/1.tab.txt
```

### Parameters for extra speed up

#### Run clustering without singlets

With the `--leaveOutSinglets` flag the analysis is done with all singlet reads left out. This can be useful for extreme read coverage, where the number of singlet reads can be enormous and they are not really needed.

```{r engine='bash', eval=FALSE}
SeekDeep qluster --fastq example.fastq --par par.txt --leaveOutSinglets
```

#### Skipping compare of dissimilar sequences

With the `--fastClustering` flag, pairwise comparisons between sequences that differ in their nucleotide composition are skipped. The amount of difference allowed is set by `--nucCutOff`, which can range from 0 to 1 and defaults to 0.05, meaning that comparisons whose nucleotide composition differs by more than 0.05 are skipped.
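
As a rough illustration of what a nucleotide composition difference could look like, the sketch below sums the absolute differences in A, C, G, and T fractions between two sequences and compares the result to a cut off. This definition is an assumption made for the example; the text above does not spell out how qluster measures the difference. The helper name `comp_diff` is hypothetical.

```bash
# comp_diff SEQ1 SEQ2: summed absolute difference in A/C/G/T fractions.
# Hypothetical helper; the exact measure used by qluster is not stated here.
comp_diff() {
  awk -v a="$1" -v b="$2" '
    # Fraction of seq made up of the given base (gsub returns the match count).
    function frac(seq, base,    tmp) {
      tmp = seq
      return gsub(base, "", tmp) / length(seq)
    }
    BEGIN {
      n = split("A C G T", bases, " ")
      for (i = 1; i <= n; i++) {
        d = frac(a, bases[i]) - frac(b, bases[i])
        total += (d < 0 ? -d : d)
      }
      printf "%.3f\n", total
    }'
}

cutoff=0.05
diff=$(comp_diff ACGTACGTAC ACGTACGTAA)
# Skip the expensive alignment when the compositions are too different.
if awk -v d="$diff" -v c="$cutoff" 'BEGIN { exit !(d > c) }'; then
  echo "skip comparison (difference $diff > $cutoff)"
else
  echo "compare (difference $diff <= $cutoff)"
fi
```

A check like this is cheap relative to an alignment, which is where the speed up comes from; the trade-off is that a cut off set too low could skip pairs that would otherwise have collapsed.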
```{r engine='bash', eval=FALSE}
SeekDeep qluster --fastq example.fastq --par par.txt --fastClustering
```

```{r engine='bash', eval=FALSE}
SeekDeep qluster --fastq example.fastq --par par.txt --fastClustering --nucCutOff 0.1
```

## Converging on Iteration

By default each iteration runs once. You can instead make an iteration repeat until there is no more collapsing, though this has the potential to significantly increase run time without much pay off. This behavior is turned on by using `--converge`.

```{r engine='bash', eval=FALSE}
SeekDeep qluster --fastq example.fastq --par par.txt --converge
```