---
title: "Running Haplotype reconstruction on initial windows"
---
```{r setup, echo=FALSE, message=FALSE}
source("../common.R")
```
# HRP2 HPR3 chr11 regions
## Run PathWeaver
Below will run PathWeaver[ @Hathaway_undated-st ] on the initial windows of interest to gather the local haplotypes and will also create sequence coverage for each location. PathWeaver requires a bam file, the reference file that the bam file was aligned against and then a bed file with the regions of interest.
```{bash, eval = F}
#| filename: runRegionBamTrimCmds.txt
PREP:mkdir -p {DIRNAME}
CMD:if [ ! -f {DIRNAME}/{SAMPLE}_{DIRNAME}/final/coiCounts.tab.txt ]; then PathWeaver BamExtractPathwaysFromRegion --bamExtractTrimToRegion --bed {BEDFNP} --genomeDir /tank/data/genomes/plasmodium/genomes/pf/genomes/ --primaryGenome Pf3D7 --bam bams/{SAMPLE}.sorted.bam --dout {DIRNAME}/{SAMPLE}_{DIRNAME} --overWriteDir --numThreads {NCPUS}; fi;
{NCPUS}:4
```
```{bash, eval = F}
nohup elucidator runMultipleCommands --cmdFile runRegionBamTrimCmds.txt --additionalFields "{SAMPLE}:allSamples.txt;{DIRNAME}:finalHRPII_HRPIII_windows_withTuned;{BEDFNP}:\"/tank/data/plasmodium/falciparum/beds/hrps/redesign_2020_11_22/finalHRPII_HRPIII_windows_withTuned.bed\"" --numThreads 12 &
```
### Post processing
PathWeaver first works on each sample individually. Post processing will then compare called haplotypes for each region across samples to analyze haplotype on a population level, do some additional post processing correction and filtering. This step will also create population level IDs and summary of the final called haplotypes as well compile the coverage info for each region within each sample into 1 file.
```{bash, eval = F}
cd finalHRPII_HRPIII_windows_withTuned/
nohup PathWeaver runProcessClustersOnRecon --alnInfoDir alnCache --overWriteDir --dout popClustering --pat _finalHRPII_HRPIII_windows_withTuned --numThreads 5 --groupingsFile /tank/data/plasmodium/falciparum/pfdata/metadata/meta.tab.txt &
cd popClustering
mkdir splitBeds
```
## Post Processing final Haplotypes
Running analysis on the called haplotypes to determine variation within each window to do population genetics analysis like expected heterozygostiy, tajima's d as well as giving the ability to type the windows based on the actual variation within each window.
The main program will be `elucidator createSharedSubSegmentsFromRefSeqs`
```{r}
finalHRPII_HRPIII_windows_withTuned = readr:: read_tsv ("windows/finalHRPII_HRPIII_windows_withTuned.bed" , col_names = F) %>%
mutate (genomicID = paste0 (X1, "-" , X2, "-" , X3))
subsegmentAnalysisCmdsFnps = tibble ()
if (! dir.exists ("windows/splitBeds" )){
dir.create ("windows/splitBeds" )
}
if (! dir.exists ("cmds" )){
dir.create ("cmds" )
}
for (row in 1 : nrow (finalHRPII_HRPIII_windows_withTuned)){
regionName = finalHRPII_HRPIII_windows_withTuned$ X4[row]
genomicID = finalHRPII_HRPIII_windows_withTuned$ genomicID[row]
cmd = paste0 ("cd " , regionName, "/population/ && elucidator createSharedSubSegmentsFromRefSeqs --genome /tank/data/genomes/plasmodium/genomes/pf/genomes/Pf3D7.fasta --refBedFnp ../../splitBeds/" , genomicID, ".bed --refBedID " , regionName, " --fastagz popSeqsWithMetaWtihSampleName.fasta.gz --dout " , genomicID, "_subSegments_corrected40 --correctionOccurenceCutOff 40 --lowFreqCutOff 40 --overWriteDir --lenFilter --lenFiltMultiplier 0.85 --kSimFilter --uniqueHapCountCutOff 2" )
SharedLocsFnp = paste0 (regionName, "/population/" , genomicID, "_subSegments_corrected40/subRegionInfo/0_ref_sharedLocs_genomic.bed" )
VarLocsFnp= paste0 (regionName, "/population/" , genomicID, "_subSegments_corrected40/subRegionInfo/0_ref_variable_expanded_genomic.bed" )
DivMeasuresPerVarRegionFnp= paste0 (regionName, "/population/" , genomicID, "_subSegments_corrected40/subRegionInfo/divMeasuresPerVarRegion.tab.txt" )
subsegmentAnalysisCmdsFnps = bind_rows (subsegmentAnalysisCmdsFnps,
tibble (
regionName = regionName,
genomicID = genomicID,
cmd = cmd,
SharedLocsFnp = SharedLocsFnp,
VarLocsFnp = VarLocsFnp,
DivMeasuresPerVarRegionFnp = DivMeasuresPerVarRegionFnp
))
write_tsv (finalHRPII_HRPIII_windows_withTuned[row,], file = paste0 ("windows/splitBeds/" , genomicID, ".bed" ), col_names = F)
}
cat (subsegmentAnalysisCmdsFnps$ cmd, sep = " \n " , file = "cmds/subsegmentAnalysisCmds.txt" )
cat (subsegmentAnalysisCmdsFnps$ SharedLocsFnp, sep = " \n " , file = "cmds/allSharedLocsFiles.txt" )
cat (subsegmentAnalysisCmdsFnps$ VarLocsFnp, sep = " \n " , file = "cmds/allVarLocsFiles.txt" )
cat (subsegmentAnalysisCmdsFnps$ DivMeasuresPerVarRegionFnp, sep = " \n " , file = "cmds/allDivMeasuresPerVarRegionFiles.txt" )
```
### Compiling the final calls
Gather together all the final result files into one and then analyze the variation windows. Below will get all the variation sub windows within each larger region. For regions that don't have variation we will grab the conserved region in order to ensure a region kept for each of the larger windows.
```{bash, eval = F}
nohup elucidator runMultipleCommands --cmdFile subsegmentAnalysisCmds.txt --numThreads 20 --raw &
elucidator rBind --files allSharedLocsFiles.txt --delim tab --out allConservedRegions.bed --overWrite --skipNonExistFiles
elucidator rBind --files allVarLocsFiles.txt --delim tab --out allVarRegions.bed --overWrite --skipNonExistFiles
elucidator rBind --files allDivMeasuresPerVarRegionFiles.txt --delim tab --out allDivMeasuresPerVarRegion.tab.txt --overWrite --header --skipNonExistFiles
elucidator getNonOverlappingBedRegions --bed /tank/data/plasmodium/falciparum/beds/hrps/redesign_2020_11_22/finalHRPII_HRPIII_windows_withTuned_withGeneInfo.bed --intersectWithBed allVarRegions.bed --overWrite --out finalHRPII_HRPIII_windows_withTuned_withGeneInfo_noVarRegionsDetected.bed
elucidator getOverlappingBedRegions --bed allConservedRegions.bed --intersectWithBed finalHRPII_HRPIII_windows_withTuned_withGeneInfo_noVarRegionsDetected.bed --out finalHRPII_HRPIII_windows_withTuned_withGeneInfo_noVarRegionsDetected_conservedRegions.bed --overWrite
cat allVarRegions.bed finalHRPII_HRPIII_windows_withTuned_withGeneInfo_noVarRegionsDetected_conservedRegions.bed | elucidator bedCoordSort --bed STDIN | cut -f1-6 > raw_finalHRPII_HRPIII_windows_withTuned_combinedVarConservedRegions.bed
elucidator bedFilterRegionsCompletelyInOther --bed raw_finalHRPII_HRPIII_windows_withTuned_combinedVarConservedRegions.bed --intersectWithBed raw_finalHRPII_HRPIII_windows_withTuned_combinedVarConservedRegions.bed | elucidator bedGetIntersectingGenesInGff --gff /tank/data/genomes/plasmodium/genomes/pf/info/gff/Pf3D7.gff --extraAttributes description --bed STDIN --overWrite --out raw_finalHRPII_HRPIII_windows_withTuned_combinedVarConservedRegions_withGeneInfo.bed
```
Create genomes without chromosome 11 and chromosome 13 to get proper variant calls for those locations for the duplicated region
```{bash, eval = F}
cd /tank/data/genomes/plasmodium/genomes
mkdir -p pf_chromsRemoved/genomes
mkdir -p pf_chromsRemoved/info/gff
cd pf_chromsRemoved/genomes/
elucidator printNames --fasta /tank/data/genomes/plasmodium/genomes/pf/genomes/Pf3D7.fasta | egrep Pf3D7_11_v3 > ../info/Pf3D7_11_v3_name.txt
elucidator printNames --fasta /tank/data/genomes/plasmodium/genomes/pf/genomes/Pf3D7.fasta | egrep Pf3D7_13_v3 > ../info/Pf3D7_13_v3_name.txt
elucidator extractByName --fasta /tank/data/genomes/plasmodium/genomes/pf/genomes/Pf3D7.fasta --excluding --names ../info/Pf3D7_11_v3_name.txt --out Pf3D7_noChr11.fasta --overWrite
elucidator extractByName --fasta /tank/data/genomes/plasmodium/genomes/pf/genomes/Pf3D7.fasta --excluding --names ../info/Pf3D7_13_v3_name.txt --out Pf3D7_noChr13.fasta --overWrite
elucidator bioIndexGenomes --genomeDir ./ --numThreads 2
```
```{bash, eval = F}
cd /tank/data/plasmodium/falciparum/pfdata/finalHRPII_HRPIII_windows_withTuned/popClustering
## non-duplicated area
rm -f runSimpleCallVariantCmds.txt && mkdir -p reports/varCalls && mkdir -p reports/alnCaches && for x in `ls -d Pf3D7_* | egrep -v for`; do echo elucidator callVariantsAgainstRefGenome --fastagz ${x}/population/popSeqsWithMetaWtihSampleName.fasta.gz --genome /tank/data/genomes/plasmodium/genomes/pf/genomes/Pf3D7.2bit --ignoreSubFields site:LabCross,site:LabControl,site:LabContaminated --overWriteDir --lowVariantCutOff 0.0099 --occurrenceCutOff 10 --dout reports/varCalls/vars_${x} --alnInfoDir reports/alnCaches/alnCache_${x} --gff /tank/data/genomes/plasmodium/genomes/pf/info/gff/Pf3D7.gff --knownAminoAcidChangesFnp /tank/data/genomes/plasmodium/genomes/pf/info/pf_drug_resistant_aaPositions.tsv --getPairwiseComps --identifier ${x} --metaFieldsToCalcPopDiffs region,country,secondaryRegion >> runSimpleCallVariantCmds.txt; done;
## duplicated area against 11
for x in `ls -d Pf3D7_* | egrep for`; do echo elucidator callVariantsAgainstRefGenome --fastagz ${x}/population/popSeqsWithMetaWtihSampleName.fasta.gz --genome /tank/data/genomes/plasmodium/genomes/pf_chromsRemoved/genomes/Pf3D7_noChr13.2bit --ignoreSubFields site:LabCross,site:LabControl,site:LabContaminated --overWriteDir --lowVariantCutOff 0.0099 --occurrenceCutOff 10 --dout reports/varCalls/vars_${x}_against11 --alnInfoDir reports/alnCaches/alnCache_${x} --gff /tank/data/genomes/plasmodium/genomes/pf/info/gff/Pf3D7.gff --knownAminoAcidChangesFnp /tank/data/genomes/plasmodium/genomes/pf/info/pf_drug_resistant_aaPositions.tsv --getPairwiseComps --identifier ${x} --metaFieldsToCalcPopDiffs region,country,secondaryRegion >> runSimpleCallVariantCmds.txt; done;
## duplicated area against 13
for x in `ls -d Pf3D7_* | egrep for`; do echo elucidator callVariantsAgainstRefGenome --fastagz ${x}/population/popSeqsWithMetaWtihSampleName.fasta.gz --genome /tank/data/genomes/plasmodium/genomes/pf_chromsRemoved/genomes/Pf3D7_noChr11.2bit --ignoreSubFields site:LabCross,site:LabControl,site:LabContaminated --overWriteDir --lowVariantCutOff 0.0099 --occurrenceCutOff 10 --dout reports/varCalls/vars_${x}_against13 --alnInfoDir reports/alnCaches/alnCache_${x} --gff /tank/data/genomes/plasmodium/genomes/pf/info/gff/Pf3D7.gff --knownAminoAcidChangesFnp /tank/data/genomes/plasmodium/genomes/pf/info/pf_drug_resistant_aaPositions.tsv --getPairwiseComps --identifier ${x} --metaFieldsToCalcPopDiffs region,country,secondaryRegion >> runSimpleCallVariantCmds.txt; done;
nohup elucidator runMultipleCommands --cmdFile runSimpleCallVariantCmds.txt --numThreads 22 --raw &
```
```{bash, eval = F}
cd reports/varCalls
elucidator rBind --contains divMeasures.tab.txt --doesNotContains perMetaFields --delim tab --header --recursive --overWrite --out allDivMeasures.tab.txt
elucidator rBind --contains region_divMeasures.tab.txt --delim tab --header --recursive --overWrite --out region_allDivMeasures.tab.txt
elucidator rBind --contains country_divMeasures.tab.txt --delim tab --header --recursive --overWrite --out country_allDivMeasures.tab.txt
elucidator rBind --contains secondaryRegion_divMeasures.tab.txt --delim tab --header --recursive --overWrite --out secondaryRegion_allDivMeasures.tab.txt
elucidator tableExtractColumns --file allDivMeasures.tab.txt --columns id,avgPercentID,avgNumOfDiffs,nSegratingSites,TajimaD,TajimaDPVal --delim tab --header --overWrite --out seqPairwiseMeasures.tab.txt
cat vars_Pf*/variantCalls/*-chromosome_variableRegion.bed > allVariableRegions.bed
rm -f combinedVariantCalls.vcf && for x in vars_Pf3D7_*/variantCalls/*-genomic.vcf; do cat ${x} | egrep -v "^#" >> combinedVariantCalls.vcf ; done;
bedtools sort -i combinedVariantCalls.vcf > sorted_combinedVariantCalls.vcf
```
```{r}
variants = readr:: read_tsv (paste0 ("sorted_combinedVariantCalls.vcf" ),
col_names = F)
variants_mod = variants%>%
mutate (differentCall = row_number ()) %>%
#filter(grepl(",", X5) | nchar(X4) > 1) %>%
mutate (X8 = strsplit (X8, split = ";" )) %>%
unnest (X8) %>%
separate (X8, into = c ("variable" , "value" ), sep = "=" ) %>%
spread (variable, value)
variants_mod = variants_mod %>%
mutate (X5 = strsplit (X5, split = "," ),
AC = strsplit (AC, split = "," ),
AF = strsplit (AF, split = "," ))
variants_mod_unnest = variants_mod %>%
unnest (X5, AC, AF) %>%
mutate (ID = paste0 (X1, "-" , X2, "-" , X4, "-" , X5)) %>%
group_by (ID) %>%
mutate (IDnumber = row_number ()) %>%
filter (DP == max (DP)) %>%
group_by (ID) %>%
mutate (IDnumber = row_number ()) %>%
filter (1 == IDnumber) %>%
mutate (AC = as.numeric (AC),
AF = as.numeric (AF),
DP = as.numeric (DP),
NS = as.numeric (NS))
variants_mod_unnest_multi = variants_mod_unnest %>%
group_by (ID) %>%
filter (max (IDnumber) > 1 )
variants_mod_unnest = variants_mod_unnest %>%
mutate (type = case_when (nchar (X4) > 1 ~ "deletion" ,
nchar (X5) > 1 ~ "insertion" ,
(nchar (X5 == 1 ) & nchar (X4) == 1 ) ~ "SNP" ,
T ~ "other" ))
variants_mod_unnest_mod = variants_mod_unnest %>%
mutate (X2 = ifelse ("SNP" == type, X2, X2 + 1 )) %>%
mutate (X2 = X2 - 1 ) %>%
mutate (end = ifelse ("deletion" == type, X2 + nchar (X4) - 1 , X2 + 1 ) ) %>%
dplyr:: rename (start = X2,
` #chrom ` = X1) %>%
mutate (len = end - start,
strand = "+" )
variants_mod_unnest_out = variants_mod_unnest_mod %>%
group_by () %>%
select (` #chrom ` , start, end, ID, len, strand, type, AC, AF, DP, NS)
variants_mod_unnest_out_filt = variants_mod_unnest_out %>%
filter (1 - AF >= 0.0099 ) %>%
filter (DP > 100 )
write_tsv (variants_mod_unnest_out, paste0 ("sorted_combinedVariantCalls.bed" ))
write_tsv (variants_mod_unnest_out_filt %>%
filter ("SNP" == type), paste0 ("snps_sorted_combinedVariantCalls.bed" ))
variants_mod_unnest_out_filt_SNPs = variants_mod_unnest_out_filt %>%
filter ("SNP" == type) %>%
ungroup () %>%
select (` #chrom ` , start, end) %>%
mutate (name = paste0 (` #chrom ` ,"-" , start, "-" , end)) %>%
mutate (len = 1 , strand = "+" )
write_tsv (variants_mod_unnest_out_filt_SNPs, "snps_locs.bed" )
```
```{bash}
elucidator bedGetIntersectingGenesInGff --gff /tank/data/genomes/plasmodium/genomes/pf/info/gff/Pf3D7.gff --extraAttributes description --overWrite --bed snps_locs.bed --out snps_locs_withGeneInfo.bed
```
```{r}
#| results: asis
#| echo: false
cat (createDownloadLink ("snps_locs_withGeneInfo.bed" ))
```
```{r}
snps_locs_withGeneInfo = readr:: read_tsv ("snps_locs_withGeneInfo.bed" , col_names = F)
write_tsv (snps_locs_withGeneInfo, col_names = F, file = "windows/Supplemental Table - HRPII HRPIII chr11 SNPs in non-paralogous windows.bed" )
```
## Final variable windows
Process the compiled windows to rename and create a new file to call haplotype with again now on these subwindows
```{r}
subWindows = readr:: read_tsv ("windows/raw_finalHRPII_HRPIII_windows_withTuned_combinedVarConservedRegions_withGeneInfo.bed" , col_names = F)
subWindows = subWindows %>%
mutate (superWindow = gsub ("__.*" , "" , X4)) %>%
mutate (suffix = gsub (".*__" , "" , X4)) %>%
left_join (subsegmentAnalysisCmdsFnps %>%
select (genomicID, regionName) %>%
mutate (genomicID = paste0 (genomicID, "-for" )) %>%
rename (superWindow = genomicID)) %>%
mutate (renameRegion = paste0 (regionName, "__" , suffix))
subWindows_out = subWindows %>%
mutate (X4 = renameRegion) %>%
select (1 : 7 ) %>%
mutate (X4 = gsub (" \\ ." , "-" , X4))
write_tsv (subWindows_out, "windows/finalHRPII_HRPIII_windows_withTuned_combinedVarConservedRegions.bed" , col_names = F)
write_tsv (subWindows_out, "windows/Supplemental Table - HRPII HRPIII chr11 variation windows in non-paralogous windows.bed" , col_names = F)
write_tsv (finalHRPII_HRPIII_windows_withTuned , "windows/Supplemental Table - HRPII HRPIII chr11 non-paralogous windows.bed" , col_names = F)
```
```{r}
#| results: asis
#| echo: false
cat (createDownloadLink ("windows/finalHRPII_HRPIII_windows_withTuned_combinedVarConservedRegions.bed" ))
```
## Run PathWeaver on variation sub windows
Like above, call PathWeaver again to analyze the sub-windows including population analysis.
```{bash, eval = F}
nohup elucidator runMultipleCommands --cmdFile runRegionBamTrimCmds.txt --additionalFields "{SAMPLE}:allSamples.txt;{DIRNAME}:finalHRPII_HRPIII_windows_withTunedSubWindows;{BEDFNP}:\"/tank/data/plasmodium/falciparum/beds/hrps/redesign_2020_11_22/finalHRPII_HRPIII_windows_withTuned_combinedVarConservedRegions.bed\"" --numThreads 12 &
cd finalHRPII_HRPIII_windows_withTunedSubWindows
nohup PathWeaver runProcessClustersOnRecon --alnInfoDir alnCache --overWriteDir --dout popClustering --pat _finalHRPII_HRPIII_windows_withTunedSubWindows --numThreads 5 --groupingsFile /tank/data/plasmodium/falciparum/pfdata/metadata/meta.tab.txt &
cd popClustering/reports
nohup elucidator tableExtractColumns --columns s_Sample,p_name,h_popUID,c_AveragedFrac --file allSelectedClustersInfo.tab.txt.gz --delim tab --header --overWrite --out slim_allSelectedClustersInfo.tab.txt.gz &
nohup elucidator doPairwiseComparisonOnHapsSharingDev --tableFnp slim_allSelectedClustersInfo.tab.txt.gz --sampleCol s_Sample --targetNameCol p_name --popIDCol h_popUID --relAbundCol c_AveragedFrac --numThreads 20 --dout pairwiseComps --verbose --overWriteDir --metaFnp //tank/data/plasmodium/falciparum/pfdata/metadata/meta.tab.txt --metaFieldsToCalcPopDiffs country,region,secondaryRegion &
```
# MDR1 region
See [ this page ](../DeletionPatternAnalysis/deletion_patterns/process_chr05_hrp3_deletion_pfmdr1_dup.qmd) for analysis of this region