Analysis Surrounding HRP2 and HRP3 deletions
  • Home
  • Final Manuscript
  • Window Analysis
    • Windows
    • Running Haplotype Reconstruction on Windows
    • Genomic Locations Of Final Windows

    • Window analysis by coverage
    • Processing Coverage Initial Windows
    • Processing Coverage on Sub Windows

    • Window analysis of deletion patterns
    • Telomere healing
    • Processing Samples with HRP2 Deletions TARE1
    • Processing Samples with Chr11 Deletions TARE1
    • Processing Samples for chr13 TARE1 presence
    • pfmdr1 duplication
    • Processing Samples with pfhrp3 13-5++ deletion pattern

    • Final Coverage Windows
    • Processing Coverage on Sub Windows - final

    • Window analysis by sequence/variation
    • Plotting haplotype variation within regions

    • Analysis by SNP variant analysis
    • Calling variants and Estimating COI
    • Plotting BiAllelic Variant Plots
  • HB3/SD01 Longreads analysis
    • Set up
    • Creating Hybrid genomes

    • Spanning Raw Reads analysis
    • Processing Spanning Reads
    • SD01 spanning specific

    • HB3
    • Processing chr11 and chr13
    • Final Process Assembly

    • SD01
    • Running SD01 assemblies
    • Processing SD01 assemblies

    • Both
    • Illumina against HB3/SD01 Assemblies
    • Comparison To 3D7 Simplified View
  • rRNA Segmental Duplications

    • Chr11/13 Duplicated Region
    • Characterizing Duplicated Region
  • Related Genomic Regions Vis
    • Analysis
    • Finding shared regions genome wide
    • Mapping out surrounding Genes on Assembled Strains

    • Misc
    • Plotting HRPs Tandem Repeats
    • Info on all rRNA
  • Comparing to related Plasmodiums
    • Comparing to all 6 Plasmodium Laverania
    • Comparing to all 6 Plasmodium Laverania Gene Arrangements chr05,07,08,11,13
    • Comparing to HRP2/3 falciparum sequences
  • References
    • Getting Raw Data References
    • References
    • R Session and Commandline tools info

Contents

  • HRP2 HPR3 chr11 regions
    • Run PathWeaver
      • Post processing
    • Post Processing final Haplotypes
      • Compiling the final calls
    • Final variable windows
    • Run PathWeaver on variation sub windows
  • MDR1 region

Running Haplotype reconstruction on initial windows

  • Show All Code
  • Hide All Code

  • View Source

HRP2 HPR3 chr11 regions

Run PathWeaver

Below will run PathWeaver(Hathaway, n.d.) on the initial windows of interest to gather the local haplotypes and will also create sequence coverage for each location. PathWeaver requires a bam file, the reference file that the bam file was aligned against and then a bed file with the regions of interest.

Code
runRegionBamTrimCmds.txt
PREP:mkdir -p {DIRNAME}
CMD:if [ ! -f {DIRNAME}/{SAMPLE}_{DIRNAME}/final/coiCounts.tab.txt ]; then PathWeaver BamExtractPathwaysFromRegion --bamExtractTrimToRegion --bed {BEDFNP} --genomeDir /tank/data/genomes/plasmodium/genomes/pf/genomes/ --primaryGenome Pf3D7 --bam bams/{SAMPLE}.sorted.bam --dout {DIRNAME}/{SAMPLE}_{DIRNAME} --overWriteDir --numThreads {NCPUS}; fi;
{NCPUS}:4
Code
nohup elucidator runMultipleCommands --cmdFile runRegionBamTrimCmds.txt --additionalFields  "{SAMPLE}:allSamples.txt;{DIRNAME}:finalHRPII_HRPIII_windows_withTuned;{BEDFNP}:\"/tank/data/plasmodium/falciparum/beds/hrps/redesign_2020_11_22/finalHRPII_HRPIII_windows_withTuned.bed\""  --numThreads 12 &

Post processing

PathWeaver first works on each sample individually. Post processing will then compare called haplotypes for each region across samples to analyze haplotype on a population level, do some additional post processing correction and filtering. This step will also create population level IDs and summary of the final called haplotypes as well compile the coverage info for each region within each sample into 1 file.

Code
cd finalHRPII_HRPIII_windows_withTuned/
nohup PathWeaver runProcessClustersOnRecon --alnInfoDir alnCache --overWriteDir --dout popClustering --pat _finalHRPII_HRPIII_windows_withTuned --numThreads 5 --groupingsFile /tank/data/plasmodium/falciparum/pfdata/metadata/meta.tab.txt &

cd popClustering 

mkdir splitBeds

Post Processing final Haplotypes

Running analysis on the called haplotypes to determine variation within each window to do population genetics analysis like expected heterozygostiy, tajima’s d as well as giving the ability to type the windows based on the actual variation within each window.

The main program will be elucidator createSharedSubSegmentsFromRefSeqs

Code
finalHRPII_HRPIII_windows_withTuned = readr::read_tsv("windows/finalHRPII_HRPIII_windows_withTuned.bed", col_names = F) %>% 
  mutate(genomicID = paste0(X1, "-", X2, "-", X3))
subsegmentAnalysisCmdsFnps = tibble()
if(!dir.exists("windows/splitBeds")){
  dir.create("windows/splitBeds")
}
if(!dir.exists("cmds")){
  dir.create("cmds")
}
for(row in 1:nrow(finalHRPII_HRPIII_windows_withTuned)){
  regionName = finalHRPII_HRPIII_windows_withTuned$X4[row]
  genomicID = finalHRPII_HRPIII_windows_withTuned$genomicID[row]
  cmd = paste0("cd ", regionName, "/population/ && elucidator createSharedSubSegmentsFromRefSeqs --genome /tank/data/genomes/plasmodium/genomes/pf/genomes/Pf3D7.fasta --refBedFnp ../../splitBeds/", genomicID, ".bed --refBedID ", regionName, " --fastagz popSeqsWithMetaWtihSampleName.fasta.gz --dout ", genomicID, "_subSegments_corrected40 --correctionOccurenceCutOff 40 --lowFreqCutOff 40 --overWriteDir --lenFilter --lenFiltMultiplier 0.85 --kSimFilter --uniqueHapCountCutOff 2")
  SharedLocsFnp = paste0(regionName, "/population/", genomicID, "_subSegments_corrected40/subRegionInfo/0_ref_sharedLocs_genomic.bed")
  VarLocsFnp= paste0(regionName, "/population/", genomicID, "_subSegments_corrected40/subRegionInfo/0_ref_variable_expanded_genomic.bed")
  DivMeasuresPerVarRegionFnp= paste0(regionName, "/population/", genomicID, "_subSegments_corrected40/subRegionInfo/divMeasuresPerVarRegion.tab.txt")
  subsegmentAnalysisCmdsFnps = bind_rows(subsegmentAnalysisCmdsFnps,
                                         tibble(
                                           regionName = regionName, 
                                           genomicID = genomicID, 
                                           cmd = cmd, 
                                           SharedLocsFnp = SharedLocsFnp, 
                                           VarLocsFnp = VarLocsFnp, 
                                           DivMeasuresPerVarRegionFnp = DivMeasuresPerVarRegionFnp
                                         ))
  write_tsv(finalHRPII_HRPIII_windows_withTuned[row,], file = paste0("windows/splitBeds/", genomicID, ".bed"), col_names = F)
}
cat(subsegmentAnalysisCmdsFnps$cmd, sep = "\n", file = "cmds/subsegmentAnalysisCmds.txt")
cat(subsegmentAnalysisCmdsFnps$SharedLocsFnp, sep = "\n", file = "cmds/allSharedLocsFiles.txt")
cat(subsegmentAnalysisCmdsFnps$VarLocsFnp, sep = "\n", file = "cmds/allVarLocsFiles.txt")
cat(subsegmentAnalysisCmdsFnps$DivMeasuresPerVarRegionFnp, sep = "\n", file = "cmds/allDivMeasuresPerVarRegionFiles.txt")

Compiling the final calls

Gather together all the final result files into one and then analyze the variation windows. Below will get all the variation sub windows within each larger region. For regions that don’t have variation we will grab the conserved region in order to ensure a region kept for each of the larger windows.

Code
nohup elucidator runMultipleCommands --cmdFile subsegmentAnalysisCmds.txt --numThreads 20 --raw & 
elucidator rBind --files allSharedLocsFiles.txt --delim tab --out allConservedRegions.bed --overWrite  --skipNonExistFiles
elucidator rBind --files allVarLocsFiles.txt --delim tab --out allVarRegions.bed --overWrite  --skipNonExistFiles
elucidator rBind --files allDivMeasuresPerVarRegionFiles.txt --delim tab --out allDivMeasuresPerVarRegion.tab.txt --overWrite --header --skipNonExistFiles


elucidator getNonOverlappingBedRegions --bed /tank/data/plasmodium/falciparum/beds/hrps/redesign_2020_11_22/finalHRPII_HRPIII_windows_withTuned_withGeneInfo.bed --intersectWithBed allVarRegions.bed --overWrite --out finalHRPII_HRPIII_windows_withTuned_withGeneInfo_noVarRegionsDetected.bed

elucidator getOverlappingBedRegions --bed allConservedRegions.bed --intersectWithBed finalHRPII_HRPIII_windows_withTuned_withGeneInfo_noVarRegionsDetected.bed --out finalHRPII_HRPIII_windows_withTuned_withGeneInfo_noVarRegionsDetected_conservedRegions.bed --overWrite

cat allVarRegions.bed finalHRPII_HRPIII_windows_withTuned_withGeneInfo_noVarRegionsDetected_conservedRegions.bed | elucidator bedCoordSort --bed STDIN | cut -f1-6 > raw_finalHRPII_HRPIII_windows_withTuned_combinedVarConservedRegions.bed

elucidator bedFilterRegionsCompletelyInOther --bed raw_finalHRPII_HRPIII_windows_withTuned_combinedVarConservedRegions.bed --intersectWithBed raw_finalHRPII_HRPIII_windows_withTuned_combinedVarConservedRegions.bed | elucidator bedGetIntersectingGenesInGff --gff /tank/data/genomes/plasmodium/genomes/pf/info/gff/Pf3D7.gff --extraAttributes description --bed STDIN --overWrite --out raw_finalHRPII_HRPIII_windows_withTuned_combinedVarConservedRegions_withGeneInfo.bed

Create genomes without chromosome 11 and chromosome 13 to get proper variant calls for those locations for the duplicated region

Code
cd /tank/data/genomes/plasmodium/genomes

mkdir -p pf_chromsRemoved/genomes
mkdir -p pf_chromsRemoved/info/gff
cd pf_chromsRemoved/genomes/


elucidator printNames --fasta /tank/data/genomes/plasmodium/genomes/pf/genomes/Pf3D7.fasta | egrep Pf3D7_11_v3 > ../info/Pf3D7_11_v3_name.txt 
elucidator printNames --fasta /tank/data/genomes/plasmodium/genomes/pf/genomes/Pf3D7.fasta | egrep Pf3D7_13_v3 > ../info/Pf3D7_13_v3_name.txt 

elucidator extractByName --fasta /tank/data/genomes/plasmodium/genomes/pf/genomes/Pf3D7.fasta --excluding --names ../info/Pf3D7_11_v3_name.txt  --out Pf3D7_noChr11.fasta --overWrite
elucidator extractByName --fasta /tank/data/genomes/plasmodium/genomes/pf/genomes/Pf3D7.fasta --excluding --names ../info/Pf3D7_13_v3_name.txt  --out Pf3D7_noChr13.fasta --overWrite

elucidator bioIndexGenomes --genomeDir ./ --numThreads 2
Code
cd /tank/data/plasmodium/falciparum/pfdata/finalHRPII_HRPIII_windows_withTuned/popClustering

## non-duplicated area  
rm -f runSimpleCallVariantCmds.txt && mkdir -p reports/varCalls && mkdir -p reports/alnCaches && for x in `ls -d Pf3D7_* | egrep -v for`; do echo elucidator callVariantsAgainstRefGenome --fastagz ${x}/population/popSeqsWithMetaWtihSampleName.fasta.gz --genome /tank/data/genomes/plasmodium/genomes/pf/genomes/Pf3D7.2bit  --ignoreSubFields site:LabCross,site:LabControl,site:LabContaminated --overWriteDir --lowVariantCutOff 0.0099 --occurrenceCutOff 10 --dout reports/varCalls/vars_${x} --alnInfoDir reports/alnCaches/alnCache_${x}  --gff /tank/data/genomes/plasmodium/genomes/pf/info/gff/Pf3D7.gff --knownAminoAcidChangesFnp /tank/data/genomes/plasmodium/genomes/pf/info/pf_drug_resistant_aaPositions.tsv --getPairwiseComps --identifier ${x} --metaFieldsToCalcPopDiffs region,country,secondaryRegion >> runSimpleCallVariantCmds.txt; done;

## duplicated area against 11 
for x in `ls -d Pf3D7_* | egrep for`; do echo elucidator callVariantsAgainstRefGenome --fastagz ${x}/population/popSeqsWithMetaWtihSampleName.fasta.gz --genome /tank/data/genomes/plasmodium/genomes/pf_chromsRemoved/genomes/Pf3D7_noChr13.2bit  --ignoreSubFields site:LabCross,site:LabControl,site:LabContaminated --overWriteDir --lowVariantCutOff 0.0099 --occurrenceCutOff 10 --dout reports/varCalls/vars_${x}_against11 --alnInfoDir reports/alnCaches/alnCache_${x}  --gff /tank/data/genomes/plasmodium/genomes/pf/info/gff/Pf3D7.gff --knownAminoAcidChangesFnp /tank/data/genomes/plasmodium/genomes/pf/info/pf_drug_resistant_aaPositions.tsv --getPairwiseComps --identifier ${x} --metaFieldsToCalcPopDiffs region,country,secondaryRegion >> runSimpleCallVariantCmds.txt; done;

## duplicated area against 13 
for x in `ls -d Pf3D7_* | egrep for`; do echo elucidator callVariantsAgainstRefGenome --fastagz ${x}/population/popSeqsWithMetaWtihSampleName.fasta.gz --genome /tank/data/genomes/plasmodium/genomes/pf_chromsRemoved/genomes/Pf3D7_noChr11.2bit  --ignoreSubFields site:LabCross,site:LabControl,site:LabContaminated --overWriteDir --lowVariantCutOff 0.0099 --occurrenceCutOff 10 --dout reports/varCalls/vars_${x}_against13 --alnInfoDir reports/alnCaches/alnCache_${x}  --gff /tank/data/genomes/plasmodium/genomes/pf/info/gff/Pf3D7.gff --knownAminoAcidChangesFnp /tank/data/genomes/plasmodium/genomes/pf/info/pf_drug_resistant_aaPositions.tsv --getPairwiseComps --identifier ${x} --metaFieldsToCalcPopDiffs region,country,secondaryRegion >> runSimpleCallVariantCmds.txt; done;

nohup elucidator runMultipleCommands --cmdFile runSimpleCallVariantCmds.txt --numThreads 22 --raw  &
Code
cd reports/varCalls 

elucidator rBind --contains divMeasures.tab.txt --doesNotContains perMetaFields --delim tab --header --recursive --overWrite --out allDivMeasures.tab.txt
elucidator rBind --contains region_divMeasures.tab.txt  --delim tab --header --recursive --overWrite --out region_allDivMeasures.tab.txt
elucidator rBind --contains country_divMeasures.tab.txt  --delim tab --header --recursive --overWrite --out country_allDivMeasures.tab.txt
elucidator rBind --contains secondaryRegion_divMeasures.tab.txt  --delim tab --header --recursive --overWrite --out secondaryRegion_allDivMeasures.tab.txt

elucidator tableExtractColumns --file allDivMeasures.tab.txt --columns id,avgPercentID,avgNumOfDiffs,nSegratingSites,TajimaD,TajimaDPVal  --delim tab --header --overWrite --out seqPairwiseMeasures.tab.txt
cat vars_Pf*/variantCalls/*-chromosome_variableRegion.bed > allVariableRegions.bed

rm -f combinedVariantCalls.vcf && for x in vars_Pf3D7_*/variantCalls/*-genomic.vcf; do cat ${x} | egrep -v "^#" >> combinedVariantCalls.vcf ; done;

bedtools sort -i combinedVariantCalls.vcf > sorted_combinedVariantCalls.vcf
Code
variants = readr::read_tsv(paste0("sorted_combinedVariantCalls.vcf"),
                           col_names = F)
variants_mod = variants%>% 
  mutate(differentCall = row_number()) %>% 
  #filter(grepl(",", X5) | nchar(X4) > 1) %>% 
  mutate(X8 = strsplit(X8, split = ";")) %>% 
  unnest(X8) %>% 
  separate(X8, into = c("variable", "value"), sep = "=") %>% 
  spread(variable, value)


variants_mod = variants_mod %>% 
  mutate(X5 = strsplit(X5, split = ","), 
         AC = strsplit(AC, split = ","), 
         AF = strsplit(AF, split = ","))

variants_mod_unnest = variants_mod %>% 
  unnest(X5, AC, AF) %>% 
  mutate(ID = paste0(X1, "-", X2, "-", X4, "-", X5)) %>% 
  group_by(ID) %>% 
  mutate(IDnumber = row_number()) %>% 
  filter(DP == max(DP)) %>% 
  group_by(ID) %>% 
  mutate(IDnumber = row_number()) %>% 
  filter(1 == IDnumber) %>% 
  mutate(AC = as.numeric(AC), 
         AF = as.numeric(AF), 
         DP = as.numeric(DP), 
         NS = as.numeric(NS))

variants_mod_unnest_multi = variants_mod_unnest %>% 
  group_by(ID) %>% 
  filter(max(IDnumber) > 1)

variants_mod_unnest = variants_mod_unnest %>% 
  mutate(type = case_when(nchar(X4) > 1 ~ "deletion", 
                          nchar(X5) > 1 ~ "insertion", 
                          (nchar(X5 == 1) & nchar(X4) == 1) ~ "SNP",
                          T ~ "other"))

variants_mod_unnest_mod = variants_mod_unnest %>% 
  mutate(X2 = ifelse("SNP" == type, X2, X2 + 1)) %>% 
  mutate(X2 = X2 -1) %>% 
  mutate(end = ifelse("deletion" == type, X2 + nchar(X4) - 1, X2 + 1) ) %>% 
  dplyr::rename(start = X2, 
         `#chrom` = X1) %>% 
  mutate(len = end - start, 
         strand = "+")

variants_mod_unnest_out = variants_mod_unnest_mod %>% 
  group_by() %>% 
  select(`#chrom`, start, end, ID, len, strand, type, AC, AF, DP, NS)

variants_mod_unnest_out_filt = variants_mod_unnest_out %>% 
  filter(1 - AF >= 0.0099) %>% 
  filter(DP > 100)

write_tsv(variants_mod_unnest_out, paste0("sorted_combinedVariantCalls.bed"))

write_tsv(variants_mod_unnest_out_filt %>% 
            filter("SNP" == type), paste0("snps_sorted_combinedVariantCalls.bed"))


variants_mod_unnest_out_filt_SNPs = variants_mod_unnest_out_filt %>% 
  filter("SNP" == type) %>% 
  ungroup() %>% 
  select(`#chrom`, start, end) %>% 
  mutate(name = paste0(`#chrom`,"-", start, "-", end)) %>% 
  mutate(len = 1, strand = "+")
write_tsv(variants_mod_unnest_out_filt_SNPs, "snps_locs.bed")
Code
elucidator bedGetIntersectingGenesInGff --gff /tank/data/genomes/plasmodium/genomes/pf/info/gff/Pf3D7.gff --extraAttributes description --overWrite --bed snps_locs.bed --out snps_locs_withGeneInfo.bed

snps_locs_withGeneInfo.bed

Code
snps_locs_withGeneInfo = readr::read_tsv("snps_locs_withGeneInfo.bed", col_names = F)
write_tsv(snps_locs_withGeneInfo, col_names = F, file = "windows/Supplemental Table - HRPII HRPIII chr11 SNPs in non-paralogous windows.bed")

Final variable windows

Process the compiled windows to rename and create a new file to call haplotype with again now on these subwindows

Code
subWindows = readr::read_tsv("windows/raw_finalHRPII_HRPIII_windows_withTuned_combinedVarConservedRegions_withGeneInfo.bed", col_names = F)
                                      
subWindows = subWindows %>% 
  mutate(superWindow = gsub("__.*", "", X4)) %>% 
  mutate(suffix = gsub(".*__", "", X4)) %>% 
  left_join(subsegmentAnalysisCmdsFnps %>% 
              select(genomicID, regionName) %>% 
              mutate(genomicID = paste0(genomicID, "-for")) %>% 
              rename(superWindow = genomicID)) %>% 
  mutate(renameRegion = paste0(regionName, "__", suffix))

subWindows_out = subWindows %>% 
  mutate(X4 = renameRegion) %>% 
  select(1:7) %>% 
  mutate(X4 = gsub("\\.", "-", X4)) 

write_tsv(subWindows_out, "windows/finalHRPII_HRPIII_windows_withTuned_combinedVarConservedRegions.bed", col_names = F)
write_tsv(subWindows_out, "windows/Supplemental Table - HRPII HRPIII chr11 variation windows in non-paralogous windows.bed", col_names = F)
write_tsv(finalHRPII_HRPIII_windows_withTuned , "windows/Supplemental Table - HRPII HRPIII chr11 non-paralogous windows.bed", col_names = F)

finalHRPII_HRPIII_windows_withTuned_combinedVarConservedRegions.bed

Run PathWeaver on variation sub windows

Like above, call PathWeaver again to analyze the sub-windows including population analysis.

Code

nohup elucidator runMultipleCommands --cmdFile runRegionBamTrimCmds.txt --additionalFields  "{SAMPLE}:allSamples.txt;{DIRNAME}:finalHRPII_HRPIII_windows_withTunedSubWindows;{BEDFNP}:\"/tank/data/plasmodium/falciparum/beds/hrps/redesign_2020_11_22/finalHRPII_HRPIII_windows_withTuned_combinedVarConservedRegions.bed\""  --numThreads 12 &

cd finalHRPII_HRPIII_windows_withTunedSubWindows

nohup PathWeaver runProcessClustersOnRecon --alnInfoDir alnCache --overWriteDir --dout popClustering --pat _finalHRPII_HRPIII_windows_withTunedSubWindows --numThreads 5 --groupingsFile /tank/data/plasmodium/falciparum/pfdata/metadata/meta.tab.txt &

cd popClustering/reports

nohup elucidator tableExtractColumns --columns s_Sample,p_name,h_popUID,c_AveragedFrac --file allSelectedClustersInfo.tab.txt.gz --delim tab --header  --overWrite --out slim_allSelectedClustersInfo.tab.txt.gz &

nohup elucidator doPairwiseComparisonOnHapsSharingDev --tableFnp slim_allSelectedClustersInfo.tab.txt.gz --sampleCol s_Sample --targetNameCol p_name --popIDCol h_popUID --relAbundCol c_AveragedFrac --numThreads 20 --dout pairwiseComps --verbose --overWriteDir --metaFnp //tank/data/plasmodium/falciparum/pfdata/metadata/meta.tab.txt --metaFieldsToCalcPopDiffs country,region,secondaryRegion & 

MDR1 region

See this page for analysis of this region

References

Hathaway, Nick. n.d. “PathWeaver: A Suite of Tools for Local Haplotype Reconstruction.” Github.
Source Code
---
title: "Running Haplotype reconstruction on initial windows"
---

```{r setup, echo=FALSE, message=FALSE}
source("../common.R")
```

# HRP2 HPR3 chr11 regions 

## Run PathWeaver 

Below will run PathWeaver[@Hathaway_undated-st] on the initial windows of interest to gather the local haplotypes and will also create sequence coverage for each location. PathWeaver requires a bam file, the reference file that the bam file was aligned against and then a bed file with the regions of interest.  


```{bash, eval = F}
#| filename: runRegionBamTrimCmds.txt
PREP:mkdir -p {DIRNAME}
CMD:if [ ! -f {DIRNAME}/{SAMPLE}_{DIRNAME}/final/coiCounts.tab.txt ]; then PathWeaver BamExtractPathwaysFromRegion --bamExtractTrimToRegion --bed {BEDFNP} --genomeDir /tank/data/genomes/plasmodium/genomes/pf/genomes/ --primaryGenome Pf3D7 --bam bams/{SAMPLE}.sorted.bam --dout {DIRNAME}/{SAMPLE}_{DIRNAME} --overWriteDir --numThreads {NCPUS}; fi;
{NCPUS}:4
```

```{bash, eval = F}
nohup elucidator runMultipleCommands --cmdFile runRegionBamTrimCmds.txt --additionalFields  "{SAMPLE}:allSamples.txt;{DIRNAME}:finalHRPII_HRPIII_windows_withTuned;{BEDFNP}:\"/tank/data/plasmodium/falciparum/beds/hrps/redesign_2020_11_22/finalHRPII_HRPIII_windows_withTuned.bed\""  --numThreads 12 &

```

### Post processing  

PathWeaver first works on each sample individually. Post processing will then compare called haplotypes for each region across samples to analyze haplotype on a population level, do some additional post processing correction and filtering. This step will also create population level IDs and summary of the final called haplotypes as well compile the coverage info for each region within each sample into 1 file. 


```{bash, eval = F}
cd finalHRPII_HRPIII_windows_withTuned/
nohup PathWeaver runProcessClustersOnRecon --alnInfoDir alnCache --overWriteDir --dout popClustering --pat _finalHRPII_HRPIII_windows_withTuned --numThreads 5 --groupingsFile /tank/data/plasmodium/falciparum/pfdata/metadata/meta.tab.txt &

cd popClustering 

mkdir splitBeds

```

## Post Processing final Haplotypes  

Running analysis on the called haplotypes to determine variation within each window to do population genetics analysis like expected heterozygostiy, tajima's d as well as giving the ability to type the windows based on the actual variation within each window.  

The main program will be `elucidator createSharedSubSegmentsFromRefSeqs`   

```{r}
finalHRPII_HRPIII_windows_withTuned = readr::read_tsv("windows/finalHRPII_HRPIII_windows_withTuned.bed", col_names = F) %>% 
  mutate(genomicID = paste0(X1, "-", X2, "-", X3))
subsegmentAnalysisCmdsFnps = tibble()
if(!dir.exists("windows/splitBeds")){
  dir.create("windows/splitBeds")
}
if(!dir.exists("cmds")){
  dir.create("cmds")
}
for(row in 1:nrow(finalHRPII_HRPIII_windows_withTuned)){
  regionName = finalHRPII_HRPIII_windows_withTuned$X4[row]
  genomicID = finalHRPII_HRPIII_windows_withTuned$genomicID[row]
  cmd = paste0("cd ", regionName, "/population/ && elucidator createSharedSubSegmentsFromRefSeqs --genome /tank/data/genomes/plasmodium/genomes/pf/genomes/Pf3D7.fasta --refBedFnp ../../splitBeds/", genomicID, ".bed --refBedID ", regionName, " --fastagz popSeqsWithMetaWtihSampleName.fasta.gz --dout ", genomicID, "_subSegments_corrected40 --correctionOccurenceCutOff 40 --lowFreqCutOff 40 --overWriteDir --lenFilter --lenFiltMultiplier 0.85 --kSimFilter --uniqueHapCountCutOff 2")
  SharedLocsFnp = paste0(regionName, "/population/", genomicID, "_subSegments_corrected40/subRegionInfo/0_ref_sharedLocs_genomic.bed")
  VarLocsFnp= paste0(regionName, "/population/", genomicID, "_subSegments_corrected40/subRegionInfo/0_ref_variable_expanded_genomic.bed")
  DivMeasuresPerVarRegionFnp= paste0(regionName, "/population/", genomicID, "_subSegments_corrected40/subRegionInfo/divMeasuresPerVarRegion.tab.txt")
  subsegmentAnalysisCmdsFnps = bind_rows(subsegmentAnalysisCmdsFnps,
                                         tibble(
                                           regionName = regionName, 
                                           genomicID = genomicID, 
                                           cmd = cmd, 
                                           SharedLocsFnp = SharedLocsFnp, 
                                           VarLocsFnp = VarLocsFnp, 
                                           DivMeasuresPerVarRegionFnp = DivMeasuresPerVarRegionFnp
                                         ))
  write_tsv(finalHRPII_HRPIII_windows_withTuned[row,], file = paste0("windows/splitBeds/", genomicID, ".bed"), col_names = F)
}
cat(subsegmentAnalysisCmdsFnps$cmd, sep = "\n", file = "cmds/subsegmentAnalysisCmds.txt")
cat(subsegmentAnalysisCmdsFnps$SharedLocsFnp, sep = "\n", file = "cmds/allSharedLocsFiles.txt")
cat(subsegmentAnalysisCmdsFnps$VarLocsFnp, sep = "\n", file = "cmds/allVarLocsFiles.txt")
cat(subsegmentAnalysisCmdsFnps$DivMeasuresPerVarRegionFnp, sep = "\n", file = "cmds/allDivMeasuresPerVarRegionFiles.txt")


```


### Compiling the final calls  

Gather together all the final result files into one and then analyze the variation windows. Below will get all the variation sub windows within each larger region. For regions that don't have variation we will grab the conserved region in order to ensure a region kept for each of the larger windows.  

```{bash, eval = F}
nohup elucidator runMultipleCommands --cmdFile subsegmentAnalysisCmds.txt --numThreads 20 --raw & 
elucidator rBind --files allSharedLocsFiles.txt --delim tab --out allConservedRegions.bed --overWrite  --skipNonExistFiles
elucidator rBind --files allVarLocsFiles.txt --delim tab --out allVarRegions.bed --overWrite  --skipNonExistFiles
elucidator rBind --files allDivMeasuresPerVarRegionFiles.txt --delim tab --out allDivMeasuresPerVarRegion.tab.txt --overWrite --header --skipNonExistFiles


elucidator getNonOverlappingBedRegions --bed /tank/data/plasmodium/falciparum/beds/hrps/redesign_2020_11_22/finalHRPII_HRPIII_windows_withTuned_withGeneInfo.bed --intersectWithBed allVarRegions.bed --overWrite --out finalHRPII_HRPIII_windows_withTuned_withGeneInfo_noVarRegionsDetected.bed

elucidator getOverlappingBedRegions --bed allConservedRegions.bed --intersectWithBed finalHRPII_HRPIII_windows_withTuned_withGeneInfo_noVarRegionsDetected.bed --out finalHRPII_HRPIII_windows_withTuned_withGeneInfo_noVarRegionsDetected_conservedRegions.bed --overWrite

cat allVarRegions.bed finalHRPII_HRPIII_windows_withTuned_withGeneInfo_noVarRegionsDetected_conservedRegions.bed | elucidator bedCoordSort --bed STDIN | cut -f1-6 > raw_finalHRPII_HRPIII_windows_withTuned_combinedVarConservedRegions.bed

elucidator bedFilterRegionsCompletelyInOther --bed raw_finalHRPII_HRPIII_windows_withTuned_combinedVarConservedRegions.bed --intersectWithBed raw_finalHRPII_HRPIII_windows_withTuned_combinedVarConservedRegions.bed | elucidator bedGetIntersectingGenesInGff --gff /tank/data/genomes/plasmodium/genomes/pf/info/gff/Pf3D7.gff --extraAttributes description --bed STDIN --overWrite --out raw_finalHRPII_HRPIII_windows_withTuned_combinedVarConservedRegions_withGeneInfo.bed

```


Create genomes without chromosome 11 and chromosome 13 to get proper variant calls for those locations for the duplicated region 
```{bash, eval = F}
cd /tank/data/genomes/plasmodium/genomes

mkdir -p pf_chromsRemoved/genomes
mkdir -p pf_chromsRemoved/info/gff
cd pf_chromsRemoved/genomes/


elucidator printNames --fasta /tank/data/genomes/plasmodium/genomes/pf/genomes/Pf3D7.fasta | egrep Pf3D7_11_v3 > ../info/Pf3D7_11_v3_name.txt 
elucidator printNames --fasta /tank/data/genomes/plasmodium/genomes/pf/genomes/Pf3D7.fasta | egrep Pf3D7_13_v3 > ../info/Pf3D7_13_v3_name.txt 

elucidator extractByName --fasta /tank/data/genomes/plasmodium/genomes/pf/genomes/Pf3D7.fasta --excluding --names ../info/Pf3D7_11_v3_name.txt  --out Pf3D7_noChr11.fasta --overWrite
elucidator extractByName --fasta /tank/data/genomes/plasmodium/genomes/pf/genomes/Pf3D7.fasta --excluding --names ../info/Pf3D7_13_v3_name.txt  --out Pf3D7_noChr13.fasta --overWrite

elucidator bioIndexGenomes --genomeDir ./ --numThreads 2

```

```{bash, eval = F}
cd /tank/data/plasmodium/falciparum/pfdata/finalHRPII_HRPIII_windows_withTuned/popClustering

## non-duplicated area  
rm -f runSimpleCallVariantCmds.txt && mkdir -p reports/varCalls && mkdir -p reports/alnCaches && for x in `ls -d Pf3D7_* | egrep -v for`; do echo elucidator callVariantsAgainstRefGenome --fastagz ${x}/population/popSeqsWithMetaWtihSampleName.fasta.gz --genome /tank/data/genomes/plasmodium/genomes/pf/genomes/Pf3D7.2bit  --ignoreSubFields site:LabCross,site:LabControl,site:LabContaminated --overWriteDir --lowVariantCutOff 0.0099 --occurrenceCutOff 10 --dout reports/varCalls/vars_${x} --alnInfoDir reports/alnCaches/alnCache_${x}  --gff /tank/data/genomes/plasmodium/genomes/pf/info/gff/Pf3D7.gff --knownAminoAcidChangesFnp /tank/data/genomes/plasmodium/genomes/pf/info/pf_drug_resistant_aaPositions.tsv --getPairwiseComps --identifier ${x} --metaFieldsToCalcPopDiffs region,country,secondaryRegion >> runSimpleCallVariantCmds.txt; done;

## duplicated area against 11 
for x in `ls -d Pf3D7_* | egrep for`; do echo elucidator callVariantsAgainstRefGenome --fastagz ${x}/population/popSeqsWithMetaWtihSampleName.fasta.gz --genome /tank/data/genomes/plasmodium/genomes/pf_chromsRemoved/genomes/Pf3D7_noChr13.2bit  --ignoreSubFields site:LabCross,site:LabControl,site:LabContaminated --overWriteDir --lowVariantCutOff 0.0099 --occurrenceCutOff 10 --dout reports/varCalls/vars_${x}_against11 --alnInfoDir reports/alnCaches/alnCache_${x}  --gff /tank/data/genomes/plasmodium/genomes/pf/info/gff/Pf3D7.gff --knownAminoAcidChangesFnp /tank/data/genomes/plasmodium/genomes/pf/info/pf_drug_resistant_aaPositions.tsv --getPairwiseComps --identifier ${x} --metaFieldsToCalcPopDiffs region,country,secondaryRegion >> runSimpleCallVariantCmds.txt; done;

## duplicated area against 13 
for x in `ls -d Pf3D7_* | egrep for`; do echo elucidator callVariantsAgainstRefGenome --fastagz ${x}/population/popSeqsWithMetaWtihSampleName.fasta.gz --genome /tank/data/genomes/plasmodium/genomes/pf_chromsRemoved/genomes/Pf3D7_noChr11.2bit  --ignoreSubFields site:LabCross,site:LabControl,site:LabContaminated --overWriteDir --lowVariantCutOff 0.0099 --occurrenceCutOff 10 --dout reports/varCalls/vars_${x}_against13 --alnInfoDir reports/alnCaches/alnCache_${x}  --gff /tank/data/genomes/plasmodium/genomes/pf/info/gff/Pf3D7.gff --knownAminoAcidChangesFnp /tank/data/genomes/plasmodium/genomes/pf/info/pf_drug_resistant_aaPositions.tsv --getPairwiseComps --identifier ${x} --metaFieldsToCalcPopDiffs region,country,secondaryRegion >> runSimpleCallVariantCmds.txt; done;

nohup elucidator runMultipleCommands --cmdFile runSimpleCallVariantCmds.txt --numThreads 22 --raw  &

```

```{bash, eval = F}
cd reports/varCalls 

elucidator rBind --contains divMeasures.tab.txt --doesNotContains perMetaFields --delim tab --header --recursive --overWrite --out allDivMeasures.tab.txt
elucidator rBind --contains region_divMeasures.tab.txt  --delim tab --header --recursive --overWrite --out region_allDivMeasures.tab.txt
elucidator rBind --contains country_divMeasures.tab.txt  --delim tab --header --recursive --overWrite --out country_allDivMeasures.tab.txt
elucidator rBind --contains secondaryRegion_divMeasures.tab.txt  --delim tab --header --recursive --overWrite --out secondaryRegion_allDivMeasures.tab.txt

elucidator tableExtractColumns --file allDivMeasures.tab.txt --columns id,avgPercentID,avgNumOfDiffs,nSegratingSites,TajimaD,TajimaDPVal  --delim tab --header --overWrite --out seqPairwiseMeasures.tab.txt
cat vars_Pf*/variantCalls/*-chromosome_variableRegion.bed > allVariableRegions.bed

rm -f combinedVariantCalls.vcf && for x in vars_Pf3D7_*/variantCalls/*-genomic.vcf; do cat ${x} | egrep -v "^#" >> combinedVariantCalls.vcf ; done;

bedtools sort -i combinedVariantCalls.vcf > sorted_combinedVariantCalls.vcf


```


```{r}
variants = readr::read_tsv(paste0("sorted_combinedVariantCalls.vcf"),
                           col_names = F)
variants_mod = variants%>% 
  mutate(differentCall = row_number()) %>% 
  #filter(grepl(",", X5) | nchar(X4) > 1) %>% 
  mutate(X8 = strsplit(X8, split = ";")) %>% 
  unnest(X8) %>% 
  separate(X8, into = c("variable", "value"), sep = "=") %>% 
  spread(variable, value)


variants_mod = variants_mod %>% 
  mutate(X5 = strsplit(X5, split = ","), 
         AC = strsplit(AC, split = ","), 
         AF = strsplit(AF, split = ","))

variants_mod_unnest = variants_mod %>% 
  unnest(X5, AC, AF) %>% 
  mutate(ID = paste0(X1, "-", X2, "-", X4, "-", X5)) %>% 
  group_by(ID) %>% 
  mutate(IDnumber = row_number()) %>% 
  filter(DP == max(DP)) %>% 
  group_by(ID) %>% 
  mutate(IDnumber = row_number()) %>% 
  filter(1 == IDnumber) %>% 
  mutate(AC = as.numeric(AC), 
         AF = as.numeric(AF), 
         DP = as.numeric(DP), 
         NS = as.numeric(NS))

variants_mod_unnest_multi = variants_mod_unnest %>% 
  group_by(ID) %>% 
  filter(max(IDnumber) > 1)

variants_mod_unnest = variants_mod_unnest %>% 
  mutate(type = case_when(nchar(X4) > 1 ~ "deletion", 
                          nchar(X5) > 1 ~ "insertion", 
                          (nchar(X5 == 1) & nchar(X4) == 1) ~ "SNP",
                          T ~ "other"))

variants_mod_unnest_mod = variants_mod_unnest %>% 
  mutate(X2 = ifelse("SNP" == type, X2, X2 + 1)) %>% 
  mutate(X2 = X2 -1) %>% 
  mutate(end = ifelse("deletion" == type, X2 + nchar(X4) - 1, X2 + 1) ) %>% 
  dplyr::rename(start = X2, 
         `#chrom` = X1) %>% 
  mutate(len = end - start, 
         strand = "+")

variants_mod_unnest_out = variants_mod_unnest_mod %>% 
  group_by() %>% 
  select(`#chrom`, start, end, ID, len, strand, type, AC, AF, DP, NS)

variants_mod_unnest_out_filt = variants_mod_unnest_out %>% 
  filter(1 - AF >= 0.0099) %>% 
  filter(DP > 100)

write_tsv(variants_mod_unnest_out, paste0("sorted_combinedVariantCalls.bed"))

write_tsv(variants_mod_unnest_out_filt %>% 
            filter("SNP" == type), paste0("snps_sorted_combinedVariantCalls.bed"))


variants_mod_unnest_out_filt_SNPs = variants_mod_unnest_out_filt %>% 
  filter("SNP" == type) %>% 
  ungroup() %>% 
  select(`#chrom`, start, end) %>% 
  mutate(name = paste0(`#chrom`,"-", start, "-", end)) %>% 
  mutate(len = 1, strand = "+")
write_tsv(variants_mod_unnest_out_filt_SNPs, "snps_locs.bed")

```

```{bash}
elucidator bedGetIntersectingGenesInGff --gff /tank/data/genomes/plasmodium/genomes/pf/info/gff/Pf3D7.gff --extraAttributes description --overWrite --bed snps_locs.bed --out snps_locs_withGeneInfo.bed
```


```{r}
#| results: asis
#| echo: false

cat(createDownloadLink("snps_locs_withGeneInfo.bed"))
```

```{r}
snps_locs_withGeneInfo = readr::read_tsv("snps_locs_withGeneInfo.bed", col_names = F)
write_tsv(snps_locs_withGeneInfo, col_names = F, file = "windows/Supplemental Table - HRPII HRPIII chr11 SNPs in non-paralogous windows.bed")
```


## Final variable windows  

Process the compiled windows to rename and create a new file to call haplotype with again now on these subwindows  
```{r}
subWindows = readr::read_tsv("windows/raw_finalHRPII_HRPIII_windows_withTuned_combinedVarConservedRegions_withGeneInfo.bed", col_names = F)
                                      
subWindows = subWindows %>% 
  mutate(superWindow = gsub("__.*", "", X4)) %>% 
  mutate(suffix = gsub(".*__", "", X4)) %>% 
  left_join(subsegmentAnalysisCmdsFnps %>% 
              select(genomicID, regionName) %>% 
              mutate(genomicID = paste0(genomicID, "-for")) %>% 
              rename(superWindow = genomicID)) %>% 
  mutate(renameRegion = paste0(regionName, "__", suffix))

subWindows_out = subWindows %>% 
  mutate(X4 = renameRegion) %>% 
  select(1:7) %>% 
  mutate(X4 = gsub("\\.", "-", X4)) 

write_tsv(subWindows_out, "windows/finalHRPII_HRPIII_windows_withTuned_combinedVarConservedRegions.bed", col_names = F)
write_tsv(subWindows_out, "windows/Supplemental Table - HRPII HRPIII chr11 variation windows in non-paralogous windows.bed", col_names = F)
write_tsv(finalHRPII_HRPIII_windows_withTuned , "windows/Supplemental Table - HRPII HRPIII chr11 non-paralogous windows.bed", col_names = F)

```


```{r}
#| results: asis
#| echo: false

cat(createDownloadLink("windows/finalHRPII_HRPIII_windows_withTuned_combinedVarConservedRegions.bed"))
```




## Run PathWeaver on variation sub windows  

Like above, call PathWeaver again to analyze the sub-windows including population analysis.  

```{bash, eval = F}

nohup elucidator runMultipleCommands --cmdFile runRegionBamTrimCmds.txt --additionalFields  "{SAMPLE}:allSamples.txt;{DIRNAME}:finalHRPII_HRPIII_windows_withTunedSubWindows;{BEDFNP}:\"/tank/data/plasmodium/falciparum/beds/hrps/redesign_2020_11_22/finalHRPII_HRPIII_windows_withTuned_combinedVarConservedRegions.bed\""  --numThreads 12 &

cd finalHRPII_HRPIII_windows_withTunedSubWindows

nohup PathWeaver runProcessClustersOnRecon --alnInfoDir alnCache --overWriteDir --dout popClustering --pat _finalHRPII_HRPIII_windows_withTunedSubWindows --numThreads 5 --groupingsFile /tank/data/plasmodium/falciparum/pfdata/metadata/meta.tab.txt &

cd popClustering/reports

nohup elucidator tableExtractColumns --columns s_Sample,p_name,h_popUID,c_AveragedFrac --file allSelectedClustersInfo.tab.txt.gz --delim tab --header  --overWrite --out slim_allSelectedClustersInfo.tab.txt.gz &

nohup elucidator doPairwiseComparisonOnHapsSharingDev --tableFnp slim_allSelectedClustersInfo.tab.txt.gz --sampleCol s_Sample --targetNameCol p_name --popIDCol h_popUID --relAbundCol c_AveragedFrac --numThreads 20 --dout pairwiseComps --verbose --overWriteDir --metaFnp //tank/data/plasmodium/falciparum/pfdata/metadata/meta.tab.txt --metaFieldsToCalcPopDiffs country,region,secondaryRegion & 

```


# MDR1 region 

See [this page](../DeletionPatternAnalysis/deletion_patterns/process_chr05_hrp3_deletion_pfmdr1_dup.qmd) for analysis of this region