SeekDeep Variant Calling
Variant Calling
Variant calling and protein variant calling can be done during the processCluster step and can be done with input a results file with the minimum of 4 or 5 columns containing the following data:
- sample name
- target name
- target population unique identifier
- read count
An additional column can be supplied that has the sequence for the target or a directory with sequences for each target can be supplied. The directory should be set up so that each target from the target column has a file named [TARGET].fasta
And a genome and gff file will be needed which will be used to call the variants and the gff annotation will be used to translate and annotate the protein variants as well.
SeekDeep variantCallOnSeqAndProtein
The program within SeekDeep is called variantCallOnSeqAndProtein
. An example command is below. The data for the example below can be downloaded using the buttons below. This program requires that bowtie2
(to align to genome) and samtools
(for sorting bams) be installed. If a microhaplotype doesn’t map it’s excluded from the analysis and if a haplotype’s translation ends up being managled (either from read shift mutation or bad haplotype call, it will be excluded from the protein variant calling, these unmappable and untranslatable will be marked out). The translations are done by mapping to the reference and trim off the introns from the alignment, because of this if haplotypes don’t map well or if there is variation of intron haplotypes might not be able to be translated in this manner.
And the genome and gff used for this commands can be downloaded using the following
Code
wget http://seekdeep.brown.edu/data/plasmodiumData/pf.tar.gz
tar -zxvf pf.tar.gz
Code
= readr::read_tsv("pf_drug_resistant_aaPositions.tsv")
pf_drug_resistant_aaPositions create_dt(pf_drug_resistant_aaPositions)
Code
SeekDeep variantCallOnSeqAndProtein --resultsFnp allSelectedClustersInfo_drugAndTopDiv.tsv.gz --genomeFnp pf/genomes/Pf3D7.fasta --gffFnp pf/info/gff/Pf3D7.gff --knownAminoAcidChangesFnp /tank/data/genomes/plasmodium/genomes/pf/info/pf_drug_resistant_aaPositions.tsv --dout drugAndTopDiv_varCalls --overWriteDir --metaFnp meta.tsv --numThreads 14 --getPairwiseComps
Please see SeekDeep variantCallOnSeqAndProtein --help
to see all flags, will explain several of them here
Flags
Required
- --resultsFnp - A table with at least 4 columns that supply: sample, target, popUIDForTarget, readCount, and also can have a column for sequence
- Flags for setting which columns apply which of the data input above, the defaults are expected columns from a SeekDeep output
- --sampleColName - Column Name; default=s_Sample
- --targetNameColName - target Name Column Name, the column name in the table which indicates the different targets;default=p_name
- --popHapIdColName - a population level naming across samples for haplotypes for this target;default=h_popUID
- --popHapSeqColName - population Haplotype Sequence Column Name, the seq to call variants on;default=h_Consensus
- --withinSampleReadCntColName - within Sample Read Cnt Col Column Name;default=c_ReadCnt
- Flags for setting which columns apply which of the data input above, the defaults are expected columns from a SeekDeep output
- --genome - the path to a genome in FASTA format
- --gff - the path to a gff file that annotates the genome fasta file
Optional
- --knownAminoAcidChangesFnp - a tab delimited (tsv) file with know amino acid changes, should have at least 2 columns, 1 named either ID or transcriptID and the other named AAPosition (case insenitivte), additional columns can be present and optionally added into outpus as additional meta
- --metaFnp - a file that supplies meta data, should have at least two columns, a column with the name sample, the names of which match the samples in the --resultsFnp input file, this meta will be added to majority of outputs so it variants can be linked to output
- --getPairwiseComps - compare microhaplotypes between each other
- --numThreads - number of threads to add
- --dout - the name of the output directory
- --overWriteDir - over write the ouput directory if it already exists
Output
The full output can be downloaded here:
There will be a directory created for each target in the input and this will have a breakdown about the comparison of sequences within the target, diversity, and the variants. There will also be a directory called reports
that combines the data from all targets.
An example of the layout of the results of a target analysis
Code
tree -n drugAndTopDiv_varCalls/Pf07-0403499-0403683
drugAndTopDiv_varCalls/Pf07-0403499-0403683
├── inputSeqs.fasta.gz
└── variantCalling
├── divMeasures.tab.txt
├── uniqueSeqs.fasta.gz
├── uniqueSeqs_labIsolateNames.tab.txt.gz
├── uniqueSeqs_meta.tab.txt.gz
├── uniqueSeqs_names.tab.txt.gz
└── variantCalls
├── PF3D7_0709000.1-protein.vcf
├── PF3D7_0709000.1-protein_aminoAcidKnownMutations.tab.txt.gz
├── PF3D7_0709000.1-protein_aminoAcidVariable.tab.txt.gz
├── PF3D7_0709000.1-protein_aminoAcidsAll.tab.txt.gz
├── PF3D7_0709000.1-protein_variableRegion.bed
├── PF3D7_0709000.1-translatedSeqsAATyped.tab.txt.gz
├── PF3D7_0709000.1-uniqueTranslatedSeqs.fasta.gz
├── PF3D7_0709000.1-uniqueTranslatedSeqs_labIsolateNames.tab.txt.gz
├── PF3D7_0709000.1-uniqueTranslatedSeqs_meta.tab.txt.gz
├── PF3D7_0709000.1-uniqueTranslatedSeqs_names.tab.txt.gz
├── Pf3D7_07_v3-SNPs.tab.txt.gz
├── Pf3D7_07_v3-allBases.tab.txt.gz
├── Pf3D7_07_v3-chromosome_variableRegion.bed
├── Pf3D7_07_v3-genomic.vcf
├── Pf3D7_07_v3-knownAA_SNPs.tab.txt.gz
├── geneInfos
│ ├── gene_PF3D7_0709000.1.bed
│ ├── gene_PF3D7_0709000.1_basePositions.tab.txt
│ ├── gene_PF3D7_0709000.1_cDNA.fasta
│ ├── gene_PF3D7_0709000.1_exonIntronPositions.bed
│ ├── gene_PF3D7_0709000.1_exonIntronPositions.tab.txt
│ ├── gene_PF3D7_0709000.1_gDNA.fasta
│ └── gene_PF3D7_0709000.1_protein.fasta
├── inputSeqs.bed
├── seqSNPTyped.tab.txt.gz
├── seqsAATyped.tab.txt.gz
├── seqsTranslationFiltered.txt
├── seqsUnableToBeMapped.txt
├── translatedDivMeasures.tab.txt
├── translatedInput.bed
├── translatedInput.fasta.gz
├── variantsPerSeqAln.tab.txt.gz
└── variantsPerTranslatedSeq.tab.txt.gz
4 directories, 38 files
The reports directory
Code
tree -n drugAndTopDiv_varCalls/reports
drugAndTopDiv_varCalls/reports
├── allDiversityMeasures.tsv.gz
├── allGenomicVariantCalls.vcf
├── allProteinVariantCalls.vcf
├── allTranslatedDivMeasures.tsv.gz
├── genomicLocsForKnownAAChanges.bed
├── hapCountsPerSamplePerTarget.tsv.gz
├── knownAAChangesGenomicVariantCalls.vcf
├── knownAAChangesProteinVariantCalls.vcf
├── meta.tsv
├── readCountsPerSamplePerTarget.tsv.gz
├── targetsIntersectingWithGenesInfo.tsv
├── transcriptLocsForKnownAAChanges.bed
└── unmmappedUntranslatableHapCounts.tsv.gz
1 directory, 13 files
VCF outputs:
- allGenomicVariantCalls.vcf - All genomic variants called
- allProteinVariantCalls.vcf - All protein variants called, chromosomes are protein transcripts
- knownAAChangesGenomicVariantCalls.vcf - If positions supplied, will subset to just those positions of the genomic locations that cover those
- knownAAChangesProteinVariantCalls.vcf - If positions supplied, will subset to just those positions
Locations:
- transcriptLocsForKnownAAChanges.bed - The transcripts bed file of known mutations if supplied, will also contain meta on what genomic locations this covers
- genomicLocsForKnownAAChanges.bed - The genomic locations that overlap the known mutations if supplied
- targetsIntersectingWithGenesInfo.tsv - A list of the genes intersected with what targets (to see what specific amino acids are covered go to drugAndTopDiv_varCalls/[TARGET]/variantCalling/variantCalls/translatedInput.bed)
Counts:
- hapCountsPerSamplePerTarget.tsv.gz - A target by sample table of the number of haplotypes per target
- readCountsPerSamplePerTarget.tsv.gz - A target by sample table of the read depth per target
- unmmappedUntranslatableHapCounts.tsv.gz - A count of the number of haplotypes per target that couldn’t be mapped and of ones that could be mapped but not translated
Diversity measures:
- allDiversityMeasures.tsv.gz - Several diversity measures calculated for each target
- allTranslatedDivMeasures.tsv.gz - The same diversity measures as above but for the translated sequences
Column names explantation
- id - the name of the target
- totalHaplotypes - the total number of targets (this is different than number of samples since samples could have multiple unique haplotypes)
- uniqueHaplotypes - the total number of unique microhaplotypes
- singlets - the number of microhaplotypes that are found only once
- doublets - the number of microhaplotypes that are found at least twice
- expShannonEntropy - exp(ShannonEntropy)
- ShannonEntropyE - [ShannonEntropyE](https://www.sciencedirect.com/topics/engineering/shannon-entropy#:~:text=The%20Shannon%20entropy%20S%20(%20x,new%20value%20in%20the%20process.)
- effectiveNumOfAlleles - effective Num Of Alleles, would have to gather at least this number of haplotypes from the same population to get similar diversity measures as seen here
- SimpsonIndex - Simpson Index of diversity
- he - expected heterozygosity (the chance that if you drew two microhaplotypes at random from the population that they are different)
- ExpP3 - the chance that if you drew three microhaplotypes at random from the population that they are three different microhaplotypes
- ExpP4 - the chance that if you drew four microhaplotypes at random from the population that they are four different microhaplotypes
- ExpP5 - the chance that if you drew five microhaplotypes at random from the population that they are five different microhaplotypes
- lengthPolymorphism - whether there is significant length polymorphism in the microhaplotypes
if --getPairwiseComps flag is on
- avgPercentID - the average percent identity between pairwise comparisons between microhaplotypes
- avgNumOfDiffs - the average number of differences between pairwise comparisons between microhaplotypes
- NucleotideDiversity - NucleotideDiversity
- simpleAvalance -
- completeAvalance -
- nSegratingSites - the number of sites that differ between seqs
- TajimaD - the Tajima’s D for these microhaplotypes
- TajimaDPVal - the p-value for Tajima’s D calc
Output files tables from this analysis
allDiversityMeasures.tsv.gz
Code
create_dt(readr::read_tsv("drugAndTopDiv_varCalls/reports/allDiversityMeasures.tsv.gz"))
allTranslatedDivMeasures.tsv.gz
Code
create_dt(readr::read_tsv("drugAndTopDiv_varCalls/reports/allTranslatedDivMeasures.tsv.gz"))
genomicLocsForKnownAAChanges.bed
Code
create_dt(readr::read_tsv("drugAndTopDiv_varCalls/reports/transcriptLocsForKnownAAChanges.bed"))
transcriptLocsForKnownAAChanges.bed
Code
create_dt(readr::read_tsv("drugAndTopDiv_varCalls/reports/transcriptLocsForKnownAAChanges.bed"))
targetsIntersectingWithGenesInfo.tsv
Code
create_dt(readr::read_tsv("drugAndTopDiv_varCalls/reports/targetsIntersectingWithGenesInfo.tsv"))
unmmappedUntranslatableHapCounts.tsv.gz
Code
create_dt(readr::read_tsv("drugAndTopDiv_varCalls/reports/unmmappedUntranslatableHapCounts.tsv.gz"))