processClusters’s objectives

processClusters takes care of several of the final steps of the SeekDeep pipeline.

replicate comparison - If multiple replicates are available per sample it is able to take advantage of this and take only clusters that appear in all replicates
final results filtering - A filter is applied to the final results where low frequency clusters and singlet clusters are removed. This step is applied to get rid of artifact sequence that couldn’t be handled by previous extraction and clustering steps and removal of clusters marked as chimeric.
population comparison - Final population comparison of all sample final results is done to analyze haplotypes across samples to get haplotypes of the whole population
variant and translation - Comparison to reference genome to call variants against reference and to translate final haplotypes

Getting usage command line

Just typing the name of the program will give a help message on running the program

Code

SeekDeep processClusters

SeekDeep processClusters
Ran from inside directory tree set up such that currentDir/SampleDir/RunDir/seqFiles
                                 Required Options
                                    Clustering
                                  --par ParametersFileName; required; default=; ([38;5;202mstd::string)
                              Reading Sequence Input
                                --fasta Input sequence filename, only need 1, fasta text file; required; default=; ([38;5;202mboost::filesystem::path)
                              --fastagz Input sequence filename, only need 1, fasta gzipped file; required; default=; ([38;5;202mboost::filesystem::path)
                                --fastq Input sequence filename, only need 1, fastq text file; required; default=; ([38;5;202mboost::filesystem::path)
                              --fastqgz Input sequence filename, only need 1, fastq gzipped file; required; default=; ([38;5;202mboost::filesystem::path)
                                 Optional Options
                                Additional Output
                                --extra Extra Output; default=false; ([38;5;202mbool)
               --writeExcludedOriginals Write out the excluded Originals; default=false; ([38;5;202mbool)
                                    Alignment
...
...

Also calling -help will do the same

Code

SeekDeep processClusters --help

Also all flags in SeekDeep are case insensitive and so all the following would have the same results

Code

SeekDeep processClusters --help 
SeekDeep processClusters --HELP
SeekDeep processClusters --HeLP
SeekDeep processClusters --HeLp
SeekDeep PROCESSCLUSTERS --HeLP
SeekDeep processclusters --HeLp
SeekDeep ProcessClusters --HeLp

Input

The input to processClusters is done by running it in a directory with several results files for each sample in subdirectories below the top directory. An example directory tree below

../extraFiles/sampleClusteringDir
├── Sample01
│   ├── MID01
│   │   └── output.fastq
│   └── MID02
│       └── output.fastq
├── Sample02
│   ├── MID03
│   │   └── output.fastq
│   └── MID04
│       └── output.fastq
├── Sample03
│   ├── MID05
│   │   └── output.fastq
│   └── MID06
│       └── output.fastq
└── Sample04
    ├── MID07
    │   └── output.fastq
    └── MID08
        └── output.fastq

Each output file needs to be named the same (here output.fastq) and the sequences in the output should be named so that they have a suffix indicating how many reads are associated with each cluster. Suffix should be _[NUM] or _t[NUM] where [NUM] is the number of reads associated with each cluster.

Output files

An output directory will be created to store the output of processClusters, the name of which can be changed by using the -dout flag. In this directory will be several files and directories.

selectedClustersInfo.tab.txt - This is a very table containing all information of replicate comparison, final samples haplotypes, and any population information if population clustering has been done
final - This directory will contain a sequence file for each sample that was in the clustering and will be the final haplotypes for the analysis. Each sequence in the sequence file will have a suffix in the format _f[NUM] where [NUM] is a relative abundance in the sample so _f1.0 would be 100% of the sample and _f.50 would be a haplotype that took up about 50% of the sample.
sampRepAgreementInfo - This folder contains several files that report how well replicates agree including a file containing graphs to show this If population clustering was enabled
population - This directory will be create if population clustering is enable and will contain information about the population of haplotypes found in the samples
population/populationCluster.tab.txt - This will contain information about how many samples the population haplotypes appear in.
population/PopUID.fastq - This will contain the sequences for all the haplotypes found in the population, they will have have a suffix with _f[NUM] where [NUM] is the average relative abundance it was found at. For example if the haplotype was found in 3 different sample at relative abundances of .5, .6, and .7 the suffix here would be _f0.6 If groupings info was supplied
groups - This directory will be created if grouping information is supplied, see below for details.

Column explaination for selectedClustersInfo.tab.txt

All the data from the analysis done by processClusters is located in this file. It is a very large file and can be overwhelming at first but all data is conveniently located in one place. How the table is organized is each row represents a haplotype from a sample. Therefore each row will report stats on the haplotype that will have to do with how it was found in the replicates, what the final stats of the haplotype after replicate comparison was, and if population analysis is being done stats on how the haplotype appears in the population. At the same time general stats for each sample and replicate are reported and therefore several columns will have redundant information because there will be several haplotypes for each replicate and sample. Also to define some terms to make sense of the column heading. Each sample will have a one or more replicates, these are normally PCR replicates. Reads from these replicates will first be clustered separately and then compared to each other to make the final data for the sample. The results from this comparison will results in what will be called clusters and each cluster will have a relative abundance in the sample totaling up to 1. After clusters are finalized in the samples they are then the final clusters from each samples are compared to each other to create population data incorporating all samples. This population analysis will result will what will be called haplotypes. So while in the sample these sequences are referred to as clusters and in the population as haplotype to distinguish when talking about each one. Each column has a prefix to indicate what kind of information it being reported on. s_ indicates data about the sample, p_ indicates data about the whole population h_ indicates data about the haplotype, c_ indicates data about the cluster while in the sample and RXX indicates information in the replicates where XX is the replicate number.

s_Sample - The name of the sample the haplotype appears in
h_popUID - The population id given to haplotype. These are named with PopUID.XX where XX is the population rank number and this number is determined by the number of samples the haplotype appears in
p_TotalPopulationSampCnt - This the the total number of samples in the population analysis
Haplotype Info
h_PopFrac - This is the sum of the relative abundances the haplotype was found in divided by the total number of samples in the population.
h_SumOfAllFracs - This is the sum of all the relative abundances the haplotype was found at
h_AvgFracFoundAt - This is the average relative abundance the haplotype was found at
h_ReadFrac - This is the total number of reads that are associated with the haplotype across all samples divided by the total number of reads in the the whole experiment
h_SampCnt - This is the number of samples the haplotype was found in.
h_SampFrac - This is the number of samples the haplotype was found in divided by the number of samples in the experiment
h_ReadCnt - This is the total number of reads associated with this haplotype
h_ClusterCnt - This is the total number of clusters that went into making this population haplotype, this is normally the number of samples the haplotype was found in because otherwise that would mean a sample contributed more than one cluster to the haplotype.
h_clusterNames - This is the names of the contributing clusters
Sample Info
s_Name - The name of the sample the haplotype appears in
s_ReadCntTotUsed - This is the number of reads used out of this sample given with percentage of the total number of reads given to processClusters, the other percentage would be what was filtered off (low frequency and chimeras)
s_InputClusterCnt - The total number of clusters at the beginning of the analysis for this sample
s_FinalClusterCnt - The total number of clusters after analysis
Cluster Info
c_clusterID - The id given to the cluster in the sample, the lower the number the higher the relative abundance in the sample
c_AveragedFrac - The averaged relative abundance between all replicates used in the analysis (this basically an un-weighted average of the abundances since it wouldn’t depend on read counts)
c_ReadFrac - The number of reads associated with this cluster divided by the total number of reads found in all replicates, if there is only one replicate this number will be the same as c_AveragedFrac.
c_ReadCnt - The total number of reads associated with this cluster summed over all replicates
c_RepCnt - The number of reps this cluster was found in
c_Consensus - The consensus sequence for this cluster
c_InputCnt - The number of contributing replicate clusters to this sample cluster
c_ChiReadCnt - The number of reads for the replicate clusters that are marked chimeric
c_ChiClusCnt - The number of replicate clusters that are marked as chimeric
c_ChiRepCnt - The number of replicated this current sample was marked as chimeric
c_InputNames - The names of the replicate clusters that contributed to making this sample cluster
RXX_name - The number of the replicate
RXX_totalClusterCntExcluded - The total number of replicate clusters that were excluded from this replicate
RXX_totalCntExcluded - The total number of reads that were excluded from the replicate
RXX_totalFracExcluded - The fraction of reads excluded from this replicate
RXX_clusterCntChiExcluded - The number of clusters that were marked chimeric in this replicate
RXX_cntChiExcluded - The total read count marked as chimeric in this replicate
RXX_fracChiExcluded - The fraction of the replicate that was excluded due to chimeric formation
RXX_MapFrac - The relative abundance of this replicate cluster in this replicate
RXX_ReadCnt - The total number of reads this replicate cluster has
RXX_ClusCnt - The total number of clusters in this replicate
RXX_totalReadCnt - The total number of reads in this replicate
c_bestExpected - A string with the name of the closest reference sequence see below to see how to compare against expected references

Column explaination for populationCluster.tab.txt

h_popUID - The population id given to haplotype. These are named with PopUID.XX where XX is the population rank number and this number is determined by the number of samples the haplotype appears in
p_TotalInputReadCnt - The total number of reads in the whole experiment/population
p_TotalInputClusterCnt - The total number of clusters in the whole experiment/population
p_TotalPopulationSampCnt - The total number of samples in the whole experiment/population
p_TotalHaplotypes - The total number of final haplotypes in the whole experiment/population
p_meanCoi - The mean complexity of infection (COI) in the whole experiment/population
p_medianCoi - The median complexity of infection (COI) in the whole experiment/population
p_minCoi - The minimum complexity of infection (COI) in the whole experiment/population
p_maxCoi - The max complexity of infection (COI) in the whole experiment/population
h_PopFrac - This is the sum of the relative abundances the haplotype was found in divided by the total number of samples in the population.
h_SumOfAllFracs - This is the sum of all the relative abundances the haplotype was found at
h_AvgFracFoundAt - This is the average relative abundance the haplotype was found at
h_ReadFrac - This is the total number of reads that are associated with the haplotype across all samples divided by the total number of reads in the the whole experiment
h_SampCnt - This is the number of samples the haplotype was found in.
h_SampFrac - This is the number of samples the haplotype was found in divided by the number of samples in the experiment
h_ReadCnt - This is the total number of reads associated with this haplotype
h_ClusterCnt - This is the total number of clusters that went into making this population haplotype, this is normally the number of samples the haplotype was found in because otherwise that would mean a sample contributed more than one cluster to the haplotype.
h_clusterNames - This is the names of the contributing clusters
h_Consensus - The consensus sequence of all contributing clusters to this haplotype
h_Protein - An estimated protein, will only be correct if the reading frame starts at 0
c_bestExpected - A string with the name of the closest reference sequence see below to see how to compare against expected references

Examples

Running

processClusters needs only two options to run, name of the results files given by --fastq or --fasta in the directory tree (see above) and a parameters file --par see qluster usage for explanation of parameters file. Alternately to the --par you can use either --noErrors to allow for now errors to collapse between comparisons or --strictErrors to allow for just a few low quality mismatches.

Input results file format

Input reads can be fasta or fastq

Fastq

Code

SeekDeep processClusters --fastq output.fastq --par pars.txt

Fasta

Code

SeekDeep processClusters --fasta output.fasta --par pars.txt

Pars

Just do a simple match allowing no differences

Code

SeekDeep processClusters --fastq output.fastq --noErrors

Allowing only 1 high quality mismatch

Code

SeekDeep processClusters --fastq output.fastq --noErrors --hq 1

Allowing a few low quality mismtaches and some 1 base indels (good for 454 and Ion Torrent)

Code

SeekDeep processClusters --fastq output.fastq --strictErrors

Allowing a few low quality mismtaches and but no indels (illumina)

Code

SeekDeep processClusters --fastq output.fastq --strictErrors --illumina

Allowing a few low quality mismtaches and but no indels (illumina) and 1 high quality mismatch

Code

SeekDeep processClusters --fastq output.fastq --strictErrors --illumina --hq 1

Different Population Parameters

If you want to give different parameters to do the population clustering with just use the ’-popPar` flag

Code

SeekDeep processClusters --fastq output.fastq --par pars.txt  --popPar popPars.txt

Code

SeekDeep processClusters --fastq output.fastq --strictErrors --pop-noErrors

Supplying grouping info

To output additional files of final results organized by groups, supply a grouping tab delimited file where the first column has sample names and each additional column is named for a group with sub-groups that each sample belongs to. See below for an example file

Code

cat exampleGroups.tab.txt

samples   age    location
Sample01  child  loc1
Sample02  child  loc2
Sample03  adult  loc1
Sample04  adult  loc2

Just give this file name to the --groups flag

Code

SeekDeep processClusters --fastq output.fastq --par pars.txt  --groups exampleGroups.tab.txt

This will create a directory called groups in the processClusters output directory. This will contain a series of sub directories of results files split by group and sub-group with a file called sampFile.tab.txt (set up like selectedClusters.tab.txt) and a file called popFile.tab.txt (set up like populationCluster.tab.txt) but will only contain information about that subgroup, the population UIDs will be the same so they can be linked to the main results data. This is still a new feature and will be developed further in future release when analysis between groups will automatically be done but for now the most interesting thing to look at would be the COI stats in each popFile.tab.txt between sub-groups in each group.

Supplying expected reference sequences

To supply a reference file to compare sequences to (this is helpful when you have controls with expected mixture done along with your experiment) use the --ref or --refFastq flag. The --ref flag is for when the sequences are in a fasta file and --refFastq is for when the seq file is a fastq file.

Code

SeekDeep processClusters --fastq output.fastq --par pars.txt  --ref references.fasta
#or fastq
SeekDeep processClusters --fastq output.fastq --par pars.txt  --refFastq references.fastq

Supplying experiment name

By default the final haplotypes are given the PopUID.XX but you can change the PopUID to be the name of your experiment

Code

SeekDeep processClusters --fastq output.fastq --par pars.txt  --experimentName myExperiment

Giving a relative abundance cut off

By default a relative abundance cut off of 0.005 (0.5%) is done to clear out low abundant clusters are most likely due to current technologies and PCR limitation and shouldn’t be done in the experiment. If you’re data is especially noisy or you want to be more stringent in your cut off use the --fracCutOff flag, for example the filter off all clusters at 1% and lower use the below example

Code

SeekDeep processClusters --fastq output.fastq --par pars.txt  --fracCutOff 0.01

Removing singletons

By default all singletons clusters are thrown out before analysis is done per replicate. To change this behavior use the --clusterCutOff flag

Code

#to keep singletons 
SeekDeep processClusters --fastq output.fastq --par pars.txt  --clusterCutOff 0
#to remove clusters of size 2 and less 
SeekDeep processClusters --fastq output.fastq --par pars.txt  --clusterCutOff 2

Marking Chimeras and checking Chimeras

By default all sample clusters that are made up of mostly clusters with the CHI_ flag are removed from analysis. To keep chimeras use the --keepChimera flag. Also marking clusters can be done with SeekDeep qluster but can also be done by processClusters by using the same flags as qluster to make them, --markChimeras and --parFreqs see SeekDeep qluster for an explanation of these flags.

Code

#keep chimeras 
SeekDeep processClusters --fastq output.fastq --par pars.txt  --keepChimeras
#mark chimeras when they have parents that are greater than 3x in frequency  
SeekDeep processClusters --fastq output.fastq --par pars.txt  --makrChimeras --parFreqs 3

Also marking all sequences that appear as chimeric can be removing low level recombinants. To try to prevent some of this from happening you can use the --investigateChimeras flag which will take clusters marked as chimeric and see if they are ever one of the two major clusters in an other sample. The reasoning behind this is that chimeric clusters need two parents and if a cluster appears as one of the two major clusters in a sample it’s not possible to have two parents that are greater than it is.

Code

SeekDeep processClusters --fastq output.fastq --par pars.txt  --investigateChimeras

Extra cleanup

SeekDeep processClusters also offers the clean up of low frequencies one-off haplotypes at the end of clustering. This can be especially helpful for when there are no replicates to help correct for PCR noise and there might be several haplotypes that are at low frequency and only 1 different from a major higher in frequency haplotypes (e.g. major haplotype at 95% and a haplotype 1 base different at 1%). This can be a good alternative to simply collapsing all one-off haplotypes that would collapse haplotypes both at high frequency like 50% and 50%.

Code

SeekDeep processClusters --fastq output.fastq --strictErrors --pop-noErrors --collapseLowFreqOneOffs

The different in frequencies is set by default so that the higher frequency haplotype has to be found at 30x the frequency of the lower frequency haplotypes (this number was arrived at by empirical means by looking at various control mixtures). This number can be changed using --lowFreqMultiplier

Code

SeekDeep processClusters --fastq output.fastq --strictErrors --pop-noErrors --collapseLowFreqOneOffs --lowFreqMultiplier 40

Variation calling and translation

SeekDeep processClusters also offers the ability to compare the final results to a reference genome(–genomeFnp) and associated annotation file(–gffFnp) to create a VCF as well as translate final haplotypes based on annotation to also create a protein VCF as well. You can optionally supply a list of known drug associated mutations(–knownAminoAcidChangesFnp) to report the counts of these locations regardless if they are variant or not, an example of this type of file can be found here, https://seekdeep.brown.edu/data/plasmodiumData/genomes/pf/info/pf_drug_resistant_aaPositions.tsv, should have a column of geneID that can be found in –gffFnp and 1-based position of the amino acid of interest.

Code

SeekDeep processClusters --strictErrors --dout analysis --fastqgz output.fastq.gz --overWriteDir --illumina --removeOneSampOnlyOneOffHaps --excludeCommonlyLowFreqHaplotypes --excludeLowFreqOneOffs --rescueExcludedOneOffLowFreqHaplotypes --genomeFnp /tank/data/genomes/plasmodium/genomes/pf/genomes/Pf3D7.fasta --gffFnp /tank/data/genomes/plasmodium/genomes/pf/info/gff/Pf3D7.gff --knownAminoAcidChangesFnp /tank/data/genomes/plasmodium/genomes/pf/info/pf_drug_resistant_aaPositions.tsv

The various variant calling results will be found in the output directory (above analysis) in a folder called variantCalling

:::{.callout-note} # processClusters's objectives processClusters takes care of several of the final steps of the SeekDeep pipeline. * **replicate comparison** - If multiple replicates are available per sample it is able to take advantage of this and take only clusters that appear in all replicates * **final results filtering** - A filter is applied to the final results where low frequency clusters and singlet clusters are removed. This step is applied to get rid of artifact sequence that couldn't be handled by previous extraction and clustering steps and removal of clusters marked as chimeric. * **population comparison** - Final population comparison of all sample final results is done to analyze haplotypes across samples to get haplotypes of the whole population * **variant and translation** - Comparison to reference genome to call variants against reference and to translate final haplotypes ::: # Getting usage command line Just typing the name of the program will give a help message on running the program ```{r, engine='bash',comment="",eval=FALSE} SeekDeep processClusters ``` ```{r, engine='bash',comment="",highlight=TRUE, echo=FALSE} SeekDeep processClusters | gsed -r "s/\x1B\[([0-9]{1,2}(;[0-9]{1,2})?)?[m|K]//g" | head -15 echo ... echo ... ``` Also calling `-help` will do the same ```{r, engine='bash',comment="",eval=FALSE} SeekDeep processClusters --help ``` Also all flags in SeekDeep are case insensitive and so all the following would have the same results ```{r, engine='bash',comment="",eval=FALSE} SeekDeep processClusters --help SeekDeep processClusters --HELP SeekDeep processClusters --HeLP SeekDeep processClusters --HeLp SeekDeep PROCESSCLUSTERS --HeLP SeekDeep processclusters --HeLp SeekDeep ProcessClusters --HeLp ``` ```{r, engine='bash',comment="",eval=FALSE, echo=FALSE} SeekDeep processClusters --getFlags ``` ```{r, engine='bash',comment="",highlight=TRUE, eval=FALSE, echo=FALSE} SeekDeep processClusters --getFlags | gsed -r "s/\x1B\[([0-9]{1,3}((;[0-9]{1,3})*)?)?[m|K]//g" | head -20 echo ... echo ... ``` # Input The input to processClusters is done by running it in a directory with several results files for each sample in subdirectories below the top directory. An example directory tree below ```{r, engine='bash',comment="",highlight=TRUE, eval=T, echo=FALSE} tree -n ../extraFiles/sampleClusteringDir | head -21 ``` Each output file needs to be named the same (here output.fastq) and the sequences in the output should be named so that they have a suffix indicating how many reads are associated with each cluster. Suffix should be _[NUM] or _t[NUM] where [NUM] is the number of reads associated with each cluster. # Output files An output directory will be created to store the output of processClusters, the name of which can be changed by using the `-dout` flag. In this directory will be several files and directories. * **selectedClustersInfo.tab.txt** - This is a very table containing all information of replicate comparison, final samples haplotypes, and any population information if population clustering has been done * **final** - This directory will contain a sequence file for each sample that was in the clustering and will be the final haplotypes for the analysis. Each sequence in the sequence file will have a suffix in the format _f[NUM] where [NUM] is a relative abundance in the sample so _f1.0 would be 100% of the sample and _f.50 would be a haplotype that took up about 50% of the sample. * **sampRepAgreementInfo** - This folder contains several files that report how well replicates agree including a file containing graphs to show this If population clustering was enabled * **population** - This directory will be create if population clustering is enable and will contain information about the population of haplotypes found in the samples * **population/populationCluster.tab.txt** - This will contain information about how many samples the population haplotypes appear in. * **population/PopUID.fastq** - This will contain the sequences for all the haplotypes found in the population, they will have have a suffix with _f[NUM] where [NUM] is the average relative abundance it was found at. For example if the haplotype was found in 3 different sample at relative abundances of .5, .6, and .7 the suffix here would be _f0.6 If groupings info was supplied * **groups** - This directory will be created if grouping information is supplied, see [below](#supplying-grouping-info) for details. ## Column explaination for selectedClustersInfo.tab.txt All the data from the analysis done by processClusters is located in this file. It is a very large file and can be overwhelming at first but all data is conveniently located in one place. How the table is organized is each row represents a haplotype from a sample. Therefore each row will report stats on the haplotype that will have to do with how it was found in the replicates, what the final stats of the haplotype after replicate comparison was, and if population analysis is being done stats on how the haplotype appears in the population. At the same time general stats for each sample and replicate are reported and therefore several columns will have redundant information because there will be several haplotypes for each replicate and sample. Also to define some terms to make sense of the column heading. Each sample will have a one or more replicates, these are normally PCR replicates. Reads from these replicates will first be clustered separately and then compared to each other to make the final data for the sample. The results from this comparison will results in what will be called clusters and each cluster will have a relative abundance in the sample totaling up to 1. After clusters are finalized in the samples they are then the final clusters from each samples are compared to each other to create population data incorporating all samples. This population analysis will result will what will be called haplotypes. So while in the sample these sequences are referred to as clusters and in the population as haplotype to distinguish when talking about each one. Each column has a prefix to indicate what kind of information it being reported on. s_ indicates data about the sample, p_ indicates data about the whole population h_ indicates data about the haplotype, c_ indicates data about the cluster while in the sample and RXX indicates information in the replicates where XX is the replicate number. * **s_Sample** - The name of the sample the haplotype appears in * **h_popUID** - The population id given to haplotype. These are named with PopUID.XX where XX is the population rank number and this number is determined by the number of samples the haplotype appears in * **p_TotalPopulationSampCnt** - This the the total number of samples in the population analysis ***Haplotype Info*** * **h_PopFrac** - This is the sum of the relative abundances the haplotype was found in divided by the total number of samples in the population. * **h_SumOfAllFracs** - This is the sum of all the relative abundances the haplotype was found at * **h_AvgFracFoundAt** - This is the average relative abundance the haplotype was found at * **h_ReadFrac** - This is the total number of reads that are associated with the haplotype across all samples divided by the total number of reads in the the whole experiment * **h_SampCnt** - This is the number of samples the haplotype was found in. * **h_SampFrac** - This is the number of samples the haplotype was found in divided by the number of samples in the experiment * **h_ReadCnt** - This is the total number of reads associated with this haplotype * **h_ClusterCnt** - This is the total number of clusters that went into making this population haplotype, this is normally the number of samples the haplotype was found in because otherwise that would mean a sample contributed more than one cluster to the haplotype. * **h_clusterNames** - This is the names of the contributing clusters ***Sample Info*** * **s_Name** - The name of the sample the haplotype appears in * **s_ReadCntTotUsed** - This is the number of reads used out of this sample given with percentage of the total number of reads given to processClusters, the other percentage would be what was filtered off (low frequency and chimeras) * **s_InputClusterCnt** - The total number of clusters at the beginning of the analysis for this sample * **s_FinalClusterCnt** - The total number of clusters after analysis ***Cluster Info*** * **c_clusterID** - The id given to the cluster in the sample, the lower the number the higher the relative abundance in the sample * **c_AveragedFrac** - The averaged relative abundance between all replicates used in the analysis (this basically an un-weighted average of the abundances since it wouldn't depend on read counts) * **c_ReadFrac** - The number of reads associated with this cluster divided by the total number of reads found in all replicates, if there is only one replicate this number will be the same as c_AveragedFrac. * **c_ReadCnt** - The total number of reads associated with this cluster summed over all replicates * **c_RepCnt** - The number of reps this cluster was found in * **c_Consensus** - The consensus sequence for this cluster * **c_InputCnt** - The number of contributing replicate clusters to this sample cluster * **c_ChiReadCnt** - The number of reads for the replicate clusters that are marked chimeric * **c_ChiClusCnt** - The number of replicate clusters that are marked as chimeric * **c_ChiRepCnt** - The number of replicated this current sample was marked as chimeric * **c_InputNames** - The names of the replicate clusters that contributed to making this sample cluster * **RXX_name** - The number of the replicate * **RXX_totalClusterCntExcluded** - The total number of replicate clusters that were excluded from this replicate * **RXX_totalCntExcluded** - The total number of reads that were excluded from the replicate * **RXX_totalFracExcluded** - The fraction of reads excluded from this replicate * **RXX_clusterCntChiExcluded** - The number of clusters that were marked chimeric in this replicate * **RXX_cntChiExcluded** - The total read count marked as chimeric in this replicate * **RXX_fracChiExcluded** - The fraction of the replicate that was excluded due to chimeric formation * **RXX_MapFrac** - The relative abundance of this replicate cluster in this replicate * **RXX_ReadCnt** - The total number of reads this replicate cluster has * **RXX_ClusCnt** - The total number of clusters in this replicate * **RXX_totalReadCnt** - The total number of reads in this replicate * **c_bestExpected** - A string with the name of the closest reference sequence see [below](#supplying-expected-reference-sequences) to see how to compare against expected references ## Column explaination for populationCluster.tab.txt * **h_popUID** - The population id given to haplotype. These are named with PopUID.XX where XX is the population rank number and this number is determined by the number of samples the haplotype appears in * **p_TotalInputReadCnt** - The total number of reads in the whole experiment/population * **p_TotalInputClusterCnt** - The total number of clusters in the whole experiment/population * **p_TotalPopulationSampCnt** - The total number of samples in the whole experiment/population * **p_TotalHaplotypes** - The total number of final haplotypes in the whole experiment/population * **p_meanCoi** - The mean complexity of infection (COI) in the whole experiment/population * **p_medianCoi** - The median complexity of infection (COI) in the whole experiment/population * **p_minCoi** - The minimum complexity of infection (COI) in the whole experiment/population * **p_maxCoi** - The max complexity of infection (COI) in the whole experiment/population * **h_PopFrac** - This is the sum of the relative abundances the haplotype was found in divided by the total number of samples in the population. * **h_SumOfAllFracs** - This is the sum of all the relative abundances the haplotype was found at * **h_AvgFracFoundAt** - This is the average relative abundance the haplotype was found at * **h_ReadFrac** - This is the total number of reads that are associated with the haplotype across all samples divided by the total number of reads in the the whole experiment * **h_SampCnt** - This is the number of samples the haplotype was found in. * **h_SampFrac** - This is the number of samples the haplotype was found in divided by the number of samples in the experiment * **h_ReadCnt** - This is the total number of reads associated with this haplotype * **h_ClusterCnt** - This is the total number of clusters that went into making this population haplotype, this is normally the number of samples the haplotype was found in because otherwise that would mean a sample contributed more than one cluster to the haplotype. * **h_clusterNames** - This is the names of the contributing clusters * **h_Consensus** - The consensus sequence of all contributing clusters to this haplotype * **h_Protein** - An estimated protein, will only be correct if the reading frame starts at 0 * **c_bestExpected** - A string with the name of the closest reference sequence see [below](#supplying-expected-reference-sequences) to see how to compare against expected references # Examples ## Running processClusters needs only two options to run, name of the results files given by `--fastq` or `--fasta` in the directory tree (see [above](#input)) and a parameters file `--par` see [qluster usage](qluster_usage.html##format-of-paramters-files) for explanation of parameters file. Alternately to the `--par` you can use either `--noErrors` to allow for now errors to collapse between comparisons or `--strictErrors` to allow for just a few low quality mismatches. ## Input results file format Input reads can be fasta or fastq ### Fastq ```{r, engine='bash',comment="",highlight=TRUE, eval=F, echo=T} SeekDeep processClusters --fastq output.fastq --par pars.txt ``` ### Fasta ```{r, engine='bash',comment="",highlight=TRUE, eval=F, echo=T} SeekDeep processClusters --fasta output.fasta --par pars.txt ``` ## Pars ### Just do a simple match allowing no differences ```{r, engine='bash',comment="",highlight=TRUE, eval=F, echo=T} SeekDeep processClusters --fastq output.fastq --noErrors ``` ### Allowing only 1 high quality mismatch ```{r, engine='bash',comment="",highlight=TRUE, eval=F, echo=T} SeekDeep processClusters --fastq output.fastq --noErrors --hq 1 ``` ### Allowing a few low quality mismtaches and some 1 base indels (good for 454 and Ion Torrent) ```{r, engine='bash',comment="",highlight=TRUE, eval=F, echo=T} SeekDeep processClusters --fastq output.fastq --strictErrors ``` ### Allowing a few low quality mismtaches and but no indels (illumina) ```{r, engine='bash',comment="",highlight=TRUE, eval=F, echo=T} SeekDeep processClusters --fastq output.fastq --strictErrors --illumina ``` ### Allowing a few low quality mismtaches and but no indels (illumina) and 1 high quality mismatch ```{r, engine='bash',comment="",highlight=TRUE, eval=F, echo=T} SeekDeep processClusters --fastq output.fastq --strictErrors --illumina --hq 1 ``` ## Different Population Parameters If you want to give different parameters to do the population clustering with just use the '-popPar` flag ```{r, engine='bash',comment="",highlight=TRUE, eval=F, echo=T} SeekDeep processClusters --fastq output.fastq --par pars.txt --popPar popPars.txt ``` ```{r, engine='bash',comment="",highlight=TRUE, eval=F, echo=T} SeekDeep processClusters --fastq output.fastq --strictErrors --pop-noErrors ``` ## Supplying grouping info To output additional files of final results organized by groups, supply a grouping tab delimited file where the first column has sample names and each additional column is named for a group with sub-groups that each sample belongs to. See below for an example file ```{r, engine='bash',comment="",highlight=TRUE, eval=F, echo=T} cat exampleGroups.tab.txt ``` ```{r, engine='bash',comment="",highlight=TRUE, eval=T, echo=F} cat ../extraFiles/exampleGroups.tab.txt | column -t ``` Just give this file name to the `--groups` flag ```{r, engine='bash',comment="",highlight=TRUE, eval=F, echo=T} SeekDeep processClusters --fastq output.fastq --par pars.txt --groups exampleGroups.tab.txt ``` This will create a directory called **groups** in the processClusters output directory. This will contain a series of sub directories of results files split by group and sub-group with a file called sampFile.tab.txt (set up like [selectedClusters.tab.txt](#column-explaination-for-selectedclustersinfo.tab.txt)) and a file called popFile.tab.txt (set up like [populationCluster.tab.txt](#column-explaination-forcluster.tab.txt)) but will only contain information about that subgroup, the population UIDs will be the same so they can be linked to the main results data. This is still a new feature and will be developed further in future release when analysis between groups will automatically be done but for now the most interesting thing to look at would be the COI stats in each popFile.tab.txt between sub-groups in each group. ## Supplying expected reference sequences To supply a reference file to compare sequences to (this is helpful when you have controls with expected mixture done along with your experiment) use the `--ref` or `--refFastq` flag. The `--ref` flag is for when the sequences are in a fasta file and `--refFastq` is for when the seq file is a fastq file. ```{r, engine='bash',comment="",highlight=TRUE, eval=F, echo=T} SeekDeep processClusters --fastq output.fastq --par pars.txt --ref references.fasta #or fastq SeekDeep processClusters --fastq output.fastq --par pars.txt --refFastq references.fastq ``` ## Supplying experiment name By default the final haplotypes are given the PopUID.XX but you can change the PopUID to be the name of your experiment ```{r, engine='bash',comment="",highlight=TRUE, eval=F, echo=T} SeekDeep processClusters --fastq output.fastq --par pars.txt --experimentName myExperiment ``` ## Giving a relative abundance cut off By default a relative abundance cut off of 0.005 (0.5%) is done to clear out low abundant clusters are most likely due to current technologies and PCR limitation and shouldn't be done in the experiment. If you're data is especially noisy or you want to be more stringent in your cut off use the `--fracCutOff` flag, for example the filter off all clusters at 1% and lower use the below example ```{r, engine='bash',comment="",highlight=TRUE, eval=F, echo=T} SeekDeep processClusters --fastq output.fastq --par pars.txt --fracCutOff 0.01 ``` ## Removing singletons By default all singletons clusters are thrown out before analysis is done per replicate. To change this behavior use the `--clusterCutOff` flag ```{r, engine='bash',comment="",highlight=TRUE, eval=F, echo=T} #to keep singletons SeekDeep processClusters --fastq output.fastq --par pars.txt --clusterCutOff 0 #to remove clusters of size 2 and less SeekDeep processClusters --fastq output.fastq --par pars.txt --clusterCutOff 2 ``` ## Marking Chimeras and checking Chimeras By default all sample clusters that are made up of mostly clusters with the CHI_ flag are removed from analysis. To keep chimeras use the `--keepChimera` flag. Also marking clusters can be done with [SeekDeep qluster](qluster_usage.html#marking-possible-chimeric-sequence) but can also be done by processClusters by using the same flags as qluster to make them, `--markChimeras` and `--parFreqs` see [SeekDeep qluster](qluster_usage.html#marking-possible-chimeric-sequence) for an explanation of these flags. ```{r, engine='bash',comment="",highlight=TRUE, eval=F, echo=T} #keep chimeras SeekDeep processClusters --fastq output.fastq --par pars.txt --keepChimeras #mark chimeras when they have parents that are greater than 3x in frequency SeekDeep processClusters --fastq output.fastq --par pars.txt --makrChimeras --parFreqs 3 ``` Also marking all sequences that appear as chimeric can be removing low level recombinants. To try to prevent some of this from happening you can use the `--investigateChimeras` flag which will take clusters marked as chimeric and see if they are ever one of the two major clusters in an other sample. The reasoning behind this is that chimeric clusters need two parents and if a cluster appears as one of the two major clusters in a sample it's not possible to have two parents that are greater than it is. ```{r, engine='bash',comment="",highlight=TRUE, eval=F, echo=T} SeekDeep processClusters --fastq output.fastq --par pars.txt --investigateChimeras ``` ## Extra cleanup `SeekDeep processClusters` also offers the clean up of low frequencies one-off haplotypes at the end of clustering. This can be especially helpful for when there are no replicates to help correct for PCR noise and there might be several haplotypes that are at low frequency and only 1 different from a major higher in frequency haplotypes (e.g. major haplotype at 95% and a haplotype 1 base different at 1%). This can be a good alternative to simply collapsing all one-off haplotypes that would collapse haplotypes both at high frequency like 50% and 50%. ```{r, engine='bash',comment="",highlight=TRUE, eval=F, echo=T} SeekDeep processClusters --fastq output.fastq --strictErrors --pop-noErrors --collapseLowFreqOneOffs ``` The different in frequencies is set by default so that the higher frequency haplotype has to be found at 30x the frequency of the lower frequency haplotypes (this number was arrived at by empirical means by looking at various control mixtures). This number can be changed using `--lowFreqMultiplier` ```{r, engine='bash',comment="",highlight=TRUE, eval=F, echo=T} SeekDeep processClusters --fastq output.fastq --strictErrors --pop-noErrors --collapseLowFreqOneOffs --lowFreqMultiplier 40 ``` ## Variation calling and translation `SeekDeep processClusters` also offers the ability to compare the final results to a reference genome(**--genomeFnp**) and associated annotation file(**--gffFnp**) to create a VCF as well as translate final haplotypes based on annotation to also create a protein VCF as well. You can optionally supply a list of known drug associated mutations(**--knownAminoAcidChangesFnp**) to report the counts of these locations regardless if they are variant or not, an example of this type of file can be found here, <https://seekdeep.brown.edu/data/plasmodiumData/genomes/pf/info/pf_drug_resistant_aaPositions.tsv>, should have a column of geneID that can be found in **--gffFnp** and 1-based position of the amino acid of interest. ```{r, engine='bash',comment="",highlight=TRUE, eval=F, echo=T} SeekDeep processClusters --strictErrors --dout analysis --fastqgz output.fastq.gz --overWriteDir --illumina --removeOneSampOnlyOneOffHaps --excludeCommonlyLowFreqHaplotypes --excludeLowFreqOneOffs --rescueExcludedOneOffLowFreqHaplotypes --genomeFnp /tank/data/genomes/plasmodium/genomes/pf/genomes/Pf3D7.fasta --gffFnp /tank/data/genomes/plasmodium/genomes/pf/info/gff/Pf3D7.gff --knownAminoAcidChangesFnp /tank/data/genomes/plasmodium/genomes/pf/info/pf_drug_resistant_aaPositions.tsv ``` The various variant calling results will be found in the output directory (above **analysis**) in a folder called **variantCalling**