sampleToMixture = readr::read_tsv("benchmarking/samplesToMixFnp.tsv")
create_dt(sampleToMixture)
When running targeted amplicon analysis, it is helpful to use control mixtures with known expected sequences and frequencies in order to test different programs, settings, lab experiment designs, etc. For this purpose, several utilities were added to SeekDeep to help evaluate performance on several common metrics, which are explained below.
The three utilities, which differ only slightly in input, are:
SeekDeep benchmarkTarAmpControlMixtures - Benchmark a single amplicon target
SeekDeep benchmarkMultiTarAmpControlMixtures - Benchmark several amplicon targets at once
SeekDeep benchmarkControlMixturesOnProcessedClustersDir - Benchmark on the output of the SeekDeep processClusters results dir, which can be either single or multiple target
The utilities require a file that specifies which samples are control mixtures and which mixtures they contain, and a separate file that lists which strains are in each mixture and at what relative abundance.
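As a rough illustration of how the two input tables relate, the sketch below checks that every mixture named in a sample-to-mixture table is defined in a mixture setup table, and that each mixture's relative abundances sum to 1. The column names `sample`, `MixName`, and `relative_abundance` are assumptions made for this example (only the `strain` column is named in this document), as are the helper name `check_mixture_tables` and the sum-to-1 check itself; adjust them to match your actual files.

```python
import csv
import io
from collections import defaultdict


def read_tsv(text):
    """Parse tab-separated text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def check_mixture_tables(sample_rows, setup_rows, tol=1e-6):
    """Return a list of human-readable problems found in the two tables.

    Column names (sample, MixName, strain, relative_abundance) are
    hypothetical; rename them to match the real tables.
    """
    # Sum the relative abundance of all strains within each mixture.
    totals = defaultdict(float)
    for row in setup_rows:
        totals[row["MixName"]] += float(row["relative_abundance"])

    problems = []
    for mix, total in sorted(totals.items()):
        if abs(total - 1.0) > tol:
            problems.append(f"mixture {mix}: abundances sum to {total:g}, not 1")
    # Every sample must point at a mixture that is actually defined.
    for row in sample_rows:
        if row["MixName"] not in totals:
            problems.append(
                f"sample {row['sample']}: mixture {row['MixName']} is not defined"
            )
    return problems


# Toy tables, invented for illustration only.
sample_to_mixture = read_tsv("sample\tMixName\nctrl1\tmixA\nctrl2\tmixC\n")
mixture_setup = read_tsv(
    "MixName\tstrain\trelative_abundance\n"
    "mixA\t3D7\t0.75\n"
    "mixA\tDd2\t0.25\n"
    "mixB\t3D7\t0.5\n"
    "mixB\tDd2\t0.4\n"
)

for problem in check_mixture_tables(sample_to_mixture, mixture_setup):
    print(problem)
```

Running a check like this before benchmarking can catch typos in mixture names early, before they show up as missing or mis-scored samples.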
--skipMissingSamples
then you can add this flag to fill in the table with just 0’s for all metrics
--sampleToMixture
--newColumnElement will provide the actual values, e.g. --newColumnName Program,Technology --newColumnElement SeekDeep,Illumina, etc.
SeekDeep benchmarkTarAmpControlMixtures
This is for benchmarking a single amplicon target analysis; see below.
SeekDeep
outputs:
--popSeqsFnp
is provided
--popSeqsFnp, fasta record names should match --popHapIdColName strain
column in the --mixtureSetUp table. If a strain is completely missing a target, a stand-in fasta record can be added with its sequence set to all Ns; this indicates that the strain can never be detected in mixtures containing it, since there is no sequence to detect.
Examples
SeekDeep benchmarkTarAmpControlMixtures --resultsFnp analysis/selectedClustersInfo.tab.txt.gz --expectedSeqsFnp ../refSeqsTrimmed/t9_split.fasta --name t9 --sampleToMixture ../misc/samplesToMixFnp.tab.txt --mixtureSetUp ../misc/mixSetUpFnp.tab.txt --dout benchmarkLabControlsAnalysis --overWriteDir
SeekDeep benchmarkTarAmpControlMixtures --resultsFnp analysis/selectedClustersInfo.tab.txt.gz --expectedSeqsFnp ../refSeqsTrimmed/t9_split.fasta --name t9 --sampleToMixture ../misc/samplesToMixFnp.tab.txt --mixtureSetUp ../misc/mixSetUpFnp.tab.txt --dout benchmarkLabControlsAnalysis --overWriteDir --skipMissingSamples --fillInMissingSamples --newColumnElement SeekDeep --newColumnName Program --metaFnp ../misc/paragon_sampleMeta.tab.txt
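The stand-in record described above, for a strain that lacks a target entirely, can be generated with a few lines of Python. This is only a sketch: the helper name `all_n_record`, the strain name, and the record length are placeholders, and this document does not say what length the all-N sequence should have, so the length used here is an arbitrary choice.

```python
def all_n_record(strain, length):
    """Return a fasta record named `strain` whose sequence is `length` Ns."""
    if length <= 0:
        raise ValueError("length must be positive")
    return f">{strain}\n{'N' * length}\n"


# Hypothetical strain name and length, for illustration only.
record = all_n_record("strainMissingTarget", 200)
print(record, end="")
```

The output can be appended to the expected-sequences fasta, making sure the record name matches the strain name used in the mixture setup table.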
SeekDeep benchmarkMultiTarAmpControlMixtures
Very similar to SeekDeep benchmarkTarAmpControlMixtures; the only differences are that there is an additional required column to indicate the target, and that the expected/observed sequence inputs are directories, each of which should contain a fasta file named TARGET_NAME.fasta.
SeekDeep
outputs:
--popSeqsDirFnp
is provided
--popSeqsDirFnp, fasta files should be named TARGET_NAME.fasta and fasta record names should match --popHapIdColName strain
column in the --mixtureSetUp table. If a strain is completely missing a target, a stand-in fasta record can be added with its sequence set to all Ns; this indicates that the strain can never be detected in mixtures containing it, since there is no sequence to detect.
Different mixture expectations can also be set up per target, for when the mixture setup is more complex than the simple lab strains above. This is done by adding the --targetNameColName column to both the --sampleToMixture table and the --mixtureSetUp table.
Examples
SeekDeep benchmarkMultiTarAmpControlMixtures --resultsFnp allSelectedClustersInfo.tsv.gz --expectedSeqsDirFnp refSeqsTrimmed --sampleToMixture ../misc/samplesToMixFnp.tab.txt --mixtureSetUp ../misc/mixSetUpFnp.tab.txt --dout benchmarkLabControlsAnalysis --overWriteDir
SeekDeep benchmarkMultiTarAmpControlMixtures --resultsFnp allSelectedClustersInfo.tsv.gz --expectedSeqsDirFnp refSeqsTrimmed --sampleToMixture ../misc/samplesToMixFnp.tab.txt --mixtureSetUp ../misc/mixSetUpFnp.tab.txt --dout benchmarkLabControlsAnalysis --overWriteDir --skipMissingSamples --fillInMissingSamples
SeekDeep benchmarkMultiTarAmpControlMixtures --resultsFnp allSelectedClustersInfo.tsv.gz --expectedSeqsDirFnp refSeqsTrimmed --sampleToMixture ../misc/samplesToMixFnp.tab.txt --mixtureSetUp ../misc/mixSetUpFnp.tab.txt --dout benchmarkLabControlsAnalysis --overWriteDir --skipMissingSamples --fillInMissingSamples --newColumnElement SeekDeep --newColumnName Program
All three programs will have the same output.
Output files:
columns explained below
--skipMissingSamples this will have the samples that were missing
--sampleToMixture input, copied over to help with downstream analysis
--mixtureSetUp input for the mixtures found within --sampleToMixture
Summary of performance for each sample per target.
(recoveredExpectedHaps + falseHaps)
This breaks down the performance per haplotype within each sample.
A breakdown of all the haplotypes that didn’t match expected sequences, compared to the expected sequences (there is a comparison against each expected sequence, so each haplotype has multiple rows).
(mismatches + oneBaseIndels + twoBaseIndels + largeIndels)
Similar to falseHaplotypesComparedToExpected.tsv, but instead of comparing to the expected haplotypes it compares to the rest of the haplotypes within the same sample. This can be helpful to see whether there are many haplotypes that are one-off from the major haplotype within the sample.
(mismatches + oneBaseIndels + twoBaseIndels + largeIndels)
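To make the parenthesized sums above concrete, here is a minimal Python sketch of how such counts could be derived for one sample. It is not SeekDeep's implementation: the function names and the `total` key are hypothetical, haplotypes are matched to expected sequences by exact string identity, and differences are counted position by position on equal-length sequences, so indels are not handled.

```python
def summarize_sample(observed, expected):
    """Count expected haplotypes recovered and false haplotypes called.

    `observed` and `expected` are collections of sequence strings.
    """
    observed, expected = set(observed), set(expected)
    recovered = len(observed & expected)
    false_haps = len(observed - expected)
    return {
        "recoveredExpectedHaps": recovered,
        "falseHaps": false_haps,
        # Hypothetical key name for the sum of the two counts above.
        "total": recovered + false_haps,
    }


def mismatch_count(a, b):
    """Number of differing positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sketch only handles equal-length sequences")
    return sum(1 for x, y in zip(a, b) if x != y)


def one_off_from_major(freqs):
    """Return haplotypes exactly one mismatch away from the major haplotype.

    `freqs` maps sequence -> within-sample frequency; haplotypes whose
    length differs from the major haplotype are skipped.
    """
    major = max(freqs, key=freqs.get)
    return [
        seq for seq in freqs
        if seq != major
        and len(seq) == len(major)
        and mismatch_count(seq, major) == 1
    ]


# Toy data, invented for illustration.
expected = ["ACGTACGT", "ACGTTCGT"]
observed = {"ACGTACGT": 0.90, "ACGAACGT": 0.06, "TCGTACGA": 0.04}

print(summarize_sample(observed, expected))
print(one_off_from_major(observed))
```

A real comparison would align the sequences first so that indels can be counted separately from mismatches, as the output columns listed above do.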