Code
SeekDeep processClusters
processClusters takes care of several of the final steps of the SeekDeep pipeline.
Just typing the name of the program will give a help message on running the program
SeekDeep processClusters
Ran from inside directory tree set up such that currentDir/SampleDir/RunDir/seqFiles
Required Options
Clustering
--par ParametersFileName; required; default=; ([38;5;202mstd::string)
Reading Sequence Input
--fasta Input sequence filename, only need 1, fasta text file; required; default=; ([38;5;202mboost::filesystem::path)
--fastagz Input sequence filename, only need 1, fasta gzipped file; required; default=; ([38;5;202mboost::filesystem::path)
--fastq Input sequence filename, only need 1, fastq text file; required; default=; ([38;5;202mboost::filesystem::path)
--fastqgz Input sequence filename, only need 1, fastq gzipped file; required; default=; ([38;5;202mboost::filesystem::path)
Optional Options
Additional Output
--extra Extra Output; default=false; ([38;5;202mbool)
--writeExcludedOriginals Write out the excluded Originals; default=false; ([38;5;202mbool)
Alignment
...
...
Also calling -help
will do the same
Also all flags in SeekDeep are case insensitive and so all the following would have the same results
The input to processClusters is done by running it in a directory with several results files for each sample in subdirectories below the top directory. An example directory tree below
../extraFiles/sampleClusteringDir
├── Sample01
│ ├── MID01
│ │ └── output.fastq
│ └── MID02
│ └── output.fastq
├── Sample02
│ ├── MID03
│ │ └── output.fastq
│ └── MID04
│ └── output.fastq
├── Sample03
│ ├── MID05
│ │ └── output.fastq
│ └── MID06
│ └── output.fastq
└── Sample04
├── MID07
│ └── output.fastq
└── MID08
└── output.fastq
Each output file needs to be named the same (here output.fastq) and the sequences in the output should be named so that they have a suffix indicating how many reads are associated with each cluster. Suffix should be _[NUM] or _t[NUM] where [NUM] is the number of reads associated with each cluster.
An output directory will be created to store the output of processClusters, the name of which can be changed by using the -dout
flag. In this directory will be several files and directories.
All the data from the analysis done by processClusters is located in this file. It is a very large file and can be overwhelming at first but all data is conveniently located in one place. How the table is organized is each row represents a haplotype from a sample. Therefore each row will report stats on the haplotype that will have to do with how it was found in the replicates, what the final stats of the haplotype after replicate comparison was, and if population analysis is being done stats on how the haplotype appears in the population. At the same time general stats for each sample and replicate are reported and therefore several columns will have redundant information because there will be several haplotypes for each replicate and sample. Also to define some terms to make sense of the column heading. Each sample will have a one or more replicates, these are normally PCR replicates. Reads from these replicates will first be clustered separately and then compared to each other to make the final data for the sample. The results from this comparison will results in what will be called clusters and each cluster will have a relative abundance in the sample totaling up to 1. After clusters are finalized in the samples they are then the final clusters from each samples are compared to each other to create population data incorporating all samples. This population analysis will result will what will be called haplotypes. So while in the sample these sequences are referred to as clusters and in the population as haplotype to distinguish when talking about each one. Each column has a prefix to indicate what kind of information it being reported on. s_ indicates data about the sample, p_ indicates data about the whole population h_ indicates data about the haplotype, c_ indicates data about the cluster while in the sample and RXX indicates information in the replicates where XX is the replicate number.
processClusters needs only two options to run, name of the results files given by --fastq
or --fasta
in the directory tree (see above) and a parameters file --par
see qluster usage for explanation of parameters file. Alternately to the --par
you can use either --noErrors
to allow for now errors to collapse between comparisons or --strictErrors
to allow for just a few low quality mismatches.
Input reads can be fasta or fastq
If you want to give different parameters to do the population clustering with just use the ’-popPar` flag
To output additional files of final results organized by groups, supply a grouping tab delimited file where the first column has sample names and each additional column is named for a group with sub-groups that each sample belongs to. See below for an example file
samples age location
Sample01 child loc1
Sample02 child loc2
Sample03 adult loc1
Sample04 adult loc2
Just give this file name to the --groups
flag
This will create a directory called groups in the processClusters output directory. This will contain a series of sub directories of results files split by group and sub-group with a file called sampFile.tab.txt (set up like selectedClusters.tab.txt) and a file called popFile.tab.txt (set up like populationCluster.tab.txt) but will only contain information about that subgroup, the population UIDs will be the same so they can be linked to the main results data. This is still a new feature and will be developed further in future release when analysis between groups will automatically be done but for now the most interesting thing to look at would be the COI stats in each popFile.tab.txt between sub-groups in each group.
To supply a reference file to compare sequences to (this is helpful when you have controls with expected mixture done along with your experiment) use the --ref
or --refFastq
flag. The --ref
flag is for when the sequences are in a fasta file and --refFastq
is for when the seq file is a fastq file.
By default the final haplotypes are given the PopUID.XX but you can change the PopUID to be the name of your experiment
By default a relative abundance cut off of 0.005 (0.5%) is done to clear out low abundant clusters are most likely due to current technologies and PCR limitation and shouldn’t be done in the experiment. If you’re data is especially noisy or you want to be more stringent in your cut off use the --fracCutOff
flag, for example the filter off all clusters at 1% and lower use the below example
By default all singletons clusters are thrown out before analysis is done per replicate. To change this behavior use the --clusterCutOff
flag
By default all sample clusters that are made up of mostly clusters with the CHI_ flag are removed from analysis. To keep chimeras use the --keepChimera
flag. Also marking clusters can be done with SeekDeep qluster but can also be done by processClusters by using the same flags as qluster to make them, --markChimeras
and --parFreqs
see SeekDeep qluster for an explanation of these flags.
Also marking all sequences that appear as chimeric can be removing low level recombinants. To try to prevent some of this from happening you can use the --investigateChimeras
flag which will take clusters marked as chimeric and see if they are ever one of the two major clusters in an other sample. The reasoning behind this is that chimeric clusters need two parents and if a cluster appears as one of the two major clusters in a sample it’s not possible to have two parents that are greater than it is.
SeekDeep processClusters
also offers the clean up of low frequencies one-off haplotypes at the end of clustering. This can be especially helpful for when there are no replicates to help correct for PCR noise and there might be several haplotypes that are at low frequency and only 1 different from a major higher in frequency haplotypes (e.g. major haplotype at 95% and a haplotype 1 base different at 1%). This can be a good alternative to simply collapsing all one-off haplotypes that would collapse haplotypes both at high frequency like 50% and 50%.
The different in frequencies is set by default so that the higher frequency haplotype has to be found at 30x the frequency of the lower frequency haplotypes (this number was arrived at by empirical means by looking at various control mixtures). This number can be changed using --lowFreqMultiplier
SeekDeep processClusters
also offers the ability to compare the final results to a reference genome(–genomeFnp) and associated annotation file(–gffFnp) to create a VCF as well as translate final haplotypes based on annotation to also create a protein VCF as well. You can optionally supply a list of known drug associated mutations(–knownAminoAcidChangesFnp) to report the counts of these locations regardless if they are variant or not, an example of this type of file can be found here, https://seekdeep.brown.edu/data/plasmodiumData/genomes/pf/info/pf_drug_resistant_aaPositions.tsv, should have a column of geneID that can be found in –gffFnp and 1-based position of the amino acid of interest.
SeekDeep processClusters --strictErrors --dout analysis --fastqgz output.fastq.gz --overWriteDir --illumina --removeOneSampOnlyOneOffHaps --excludeCommonlyLowFreqHaplotypes --excludeLowFreqOneOffs --rescueExcludedOneOffLowFreqHaplotypes --genomeFnp /tank/data/genomes/plasmodium/genomes/pf/genomes/Pf3D7.fasta --gffFnp /tank/data/genomes/plasmodium/genomes/pf/info/gff/Pf3D7.gff --knownAminoAcidChangesFnp /tank/data/genomes/plasmodium/genomes/pf/info/pf_drug_resistant_aaPositions.tsv
The various variant calling results will be found in the output directory (above analysis) in a folder called variantCalling