Running Haplotype reconstruction on initial windows

HRP2 HPR3 chr11 regions

Run PathWeaver

Below will run PathWeaver(Hathaway, n.d.) on the initial windows of interest to gather the local haplotypes and will also create sequence coverage for each location. PathWeaver requires a bam file, the reference file that the bam file was aligned against and then a bed file with the regions of interest.

Code

runRegionBamTrimCmds.txt

PREP:mkdir -p {DIRNAME}
CMD:if [ ! -f {DIRNAME}/{SAMPLE}_{DIRNAME}/final/coiCounts.tab.txt ]; then PathWeaver BamExtractPathwaysFromRegion --bamExtractTrimToRegion --bed {BEDFNP} --genomeDir /tank/data/genomes/plasmodium/genomes/pf/genomes/ --primaryGenome Pf3D7 --bam bams/{SAMPLE}.sorted.bam --dout {DIRNAME}/{SAMPLE}_{DIRNAME} --overWriteDir --numThreads {NCPUS}; fi;
{NCPUS}:4

Code

nohup elucidator runMultipleCommands --cmdFile runRegionBamTrimCmds.txt --additionalFields  "{SAMPLE}:allSamples.txt;{DIRNAME}:finalHRPII_HRPIII_windows_withTuned;{BEDFNP}:\"/tank/data/plasmodium/falciparum/beds/hrps/redesign_2020_11_22/finalHRPII_HRPIII_windows_withTuned.bed\""  --numThreads 12 &

Post processing

PathWeaver first works on each sample individually. Post processing will then compare called haplotypes for each region across samples to analyze haplotype on a population level, do some additional post processing correction and filtering. This step will also create population level IDs and summary of the final called haplotypes as well compile the coverage info for each region within each sample into 1 file.

Code

cd finalHRPII_HRPIII_windows_withTuned/
nohup PathWeaver runProcessClustersOnRecon --alnInfoDir alnCache --overWriteDir --dout popClustering --pat _finalHRPII_HRPIII_windows_withTuned --numThreads 5 --groupingsFile /tank/data/plasmodium/falciparum/pfdata/metadata/meta.tab.txt &

cd popClustering 

mkdir splitBeds

Post Processing final Haplotypes

Running analysis on the called haplotypes to determine variation within each window to do population genetics analysis like expected heterozygostiy, tajima’s d as well as giving the ability to type the windows based on the actual variation within each window.

The main program will be elucidator createSharedSubSegmentsFromRefSeqs

Code

finalHRPII_HRPIII_windows_withTuned = readr::read_tsv("windows/finalHRPII_HRPIII_windows_withTuned.bed", col_names = F) %>% 
  mutate(genomicID = paste0(X1, "-", X2, "-", X3))
subsegmentAnalysisCmdsFnps = tibble()
if(!dir.exists("windows/splitBeds")){
  dir.create("windows/splitBeds")
}
if(!dir.exists("cmds")){
  dir.create("cmds")
}
for(row in 1:nrow(finalHRPII_HRPIII_windows_withTuned)){
  regionName = finalHRPII_HRPIII_windows_withTuned$X4[row]
  genomicID = finalHRPII_HRPIII_windows_withTuned$genomicID[row]
  cmd = paste0("cd ", regionName, "/population/ && elucidator createSharedSubSegmentsFromRefSeqs --genome /tank/data/genomes/plasmodium/genomes/pf/genomes/Pf3D7.fasta --refBedFnp ../../splitBeds/", genomicID, ".bed --refBedID ", regionName, " --fastagz popSeqsWithMetaWtihSampleName.fasta.gz --dout ", genomicID, "_subSegments_corrected40 --correctionOccurenceCutOff 40 --lowFreqCutOff 40 --overWriteDir --lenFilter --lenFiltMultiplier 0.85 --kSimFilter --uniqueHapCountCutOff 2")
  SharedLocsFnp = paste0(regionName, "/population/", genomicID, "_subSegments_corrected40/subRegionInfo/0_ref_sharedLocs_genomic.bed")
  VarLocsFnp= paste0(regionName, "/population/", genomicID, "_subSegments_corrected40/subRegionInfo/0_ref_variable_expanded_genomic.bed")
  DivMeasuresPerVarRegionFnp= paste0(regionName, "/population/", genomicID, "_subSegments_corrected40/subRegionInfo/divMeasuresPerVarRegion.tab.txt")
  subsegmentAnalysisCmdsFnps = bind_rows(subsegmentAnalysisCmdsFnps,
                                         tibble(
                                           regionName = regionName, 
                                           genomicID = genomicID, 
                                           cmd = cmd, 
                                           SharedLocsFnp = SharedLocsFnp, 
                                           VarLocsFnp = VarLocsFnp, 
                                           DivMeasuresPerVarRegionFnp = DivMeasuresPerVarRegionFnp
                                         ))
  write_tsv(finalHRPII_HRPIII_windows_withTuned[row,], file = paste0("windows/splitBeds/", genomicID, ".bed"), col_names = F)
}
cat(subsegmentAnalysisCmdsFnps$cmd, sep = "\n", file = "cmds/subsegmentAnalysisCmds.txt")
cat(subsegmentAnalysisCmdsFnps$SharedLocsFnp, sep = "\n", file = "cmds/allSharedLocsFiles.txt")
cat(subsegmentAnalysisCmdsFnps$VarLocsFnp, sep = "\n", file = "cmds/allVarLocsFiles.txt")
cat(subsegmentAnalysisCmdsFnps$DivMeasuresPerVarRegionFnp, sep = "\n", file = "cmds/allDivMeasuresPerVarRegionFiles.txt")

Compiling the final calls

Gather together all the final result files into one and then analyze the variation windows. Below will get all the variation sub windows within each larger region. For regions that don’t have variation we will grab the conserved region in order to ensure a region kept for each of the larger windows.

Code

nohup elucidator runMultipleCommands --cmdFile subsegmentAnalysisCmds.txt --numThreads 20 --raw & 
elucidator rBind --files allSharedLocsFiles.txt --delim tab --out allConservedRegions.bed --overWrite  --skipNonExistFiles
elucidator rBind --files allVarLocsFiles.txt --delim tab --out allVarRegions.bed --overWrite  --skipNonExistFiles
elucidator rBind --files allDivMeasuresPerVarRegionFiles.txt --delim tab --out allDivMeasuresPerVarRegion.tab.txt --overWrite --header --skipNonExistFiles


elucidator getNonOverlappingBedRegions --bed /tank/data/plasmodium/falciparum/beds/hrps/redesign_2020_11_22/finalHRPII_HRPIII_windows_withTuned_withGeneInfo.bed --intersectWithBed allVarRegions.bed --overWrite --out finalHRPII_HRPIII_windows_withTuned_withGeneInfo_noVarRegionsDetected.bed

elucidator getOverlappingBedRegions --bed allConservedRegions.bed --intersectWithBed finalHRPII_HRPIII_windows_withTuned_withGeneInfo_noVarRegionsDetected.bed --out finalHRPII_HRPIII_windows_withTuned_withGeneInfo_noVarRegionsDetected_conservedRegions.bed --overWrite

cat allVarRegions.bed finalHRPII_HRPIII_windows_withTuned_withGeneInfo_noVarRegionsDetected_conservedRegions.bed | elucidator bedCoordSort --bed STDIN | cut -f1-6 > raw_finalHRPII_HRPIII_windows_withTuned_combinedVarConservedRegions.bed

elucidator bedFilterRegionsCompletelyInOther --bed raw_finalHRPII_HRPIII_windows_withTuned_combinedVarConservedRegions.bed --intersectWithBed raw_finalHRPII_HRPIII_windows_withTuned_combinedVarConservedRegions.bed | elucidator bedGetIntersectingGenesInGff --gff /tank/data/genomes/plasmodium/genomes/pf/info/gff/Pf3D7.gff --extraAttributes description --bed STDIN --overWrite --out raw_finalHRPII_HRPIII_windows_withTuned_combinedVarConservedRegions_withGeneInfo.bed

Create genomes without chromosome 11 and chromosome 13 to get proper variant calls for those locations for the duplicated region

Code

cd /tank/data/genomes/plasmodium/genomes

mkdir -p pf_chromsRemoved/genomes
mkdir -p pf_chromsRemoved/info/gff
cd pf_chromsRemoved/genomes/


elucidator printNames --fasta /tank/data/genomes/plasmodium/genomes/pf/genomes/Pf3D7.fasta | egrep Pf3D7_11_v3 > ../info/Pf3D7_11_v3_name.txt 
elucidator printNames --fasta /tank/data/genomes/plasmodium/genomes/pf/genomes/Pf3D7.fasta | egrep Pf3D7_13_v3 > ../info/Pf3D7_13_v3_name.txt 

elucidator extractByName --fasta /tank/data/genomes/plasmodium/genomes/pf/genomes/Pf3D7.fasta --excluding --names ../info/Pf3D7_11_v3_name.txt  --out Pf3D7_noChr11.fasta --overWrite
elucidator extractByName --fasta /tank/data/genomes/plasmodium/genomes/pf/genomes/Pf3D7.fasta --excluding --names ../info/Pf3D7_13_v3_name.txt  --out Pf3D7_noChr13.fasta --overWrite

elucidator bioIndexGenomes --genomeDir ./ --numThreads 2

Code

cd /tank/data/plasmodium/falciparum/pfdata/finalHRPII_HRPIII_windows_withTuned/popClustering

## non-duplicated area  
rm -f runSimpleCallVariantCmds.txt && mkdir -p reports/varCalls && mkdir -p reports/alnCaches && for x in `ls -d Pf3D7_* | egrep -v for`; do echo elucidator callVariantsAgainstRefGenome --fastagz ${x}/population/popSeqsWithMetaWtihSampleName.fasta.gz --genome /tank/data/genomes/plasmodium/genomes/pf/genomes/Pf3D7.2bit  --ignoreSubFields site:LabCross,site:LabControl,site:LabContaminated --overWriteDir --lowVariantCutOff 0.0099 --occurrenceCutOff 10 --dout reports/varCalls/vars_${x} --alnInfoDir reports/alnCaches/alnCache_${x}  --gff /tank/data/genomes/plasmodium/genomes/pf/info/gff/Pf3D7.gff --knownAminoAcidChangesFnp /tank/data/genomes/plasmodium/genomes/pf/info/pf_drug_resistant_aaPositions.tsv --getPairwiseComps --identifier ${x} --metaFieldsToCalcPopDiffs region,country,secondaryRegion >> runSimpleCallVariantCmds.txt; done;

## duplicated area against 11 
for x in `ls -d Pf3D7_* | egrep for`; do echo elucidator callVariantsAgainstRefGenome --fastagz ${x}/population/popSeqsWithMetaWtihSampleName.fasta.gz --genome /tank/data/genomes/plasmodium/genomes/pf_chromsRemoved/genomes/Pf3D7_noChr13.2bit  --ignoreSubFields site:LabCross,site:LabControl,site:LabContaminated --overWriteDir --lowVariantCutOff 0.0099 --occurrenceCutOff 10 --dout reports/varCalls/vars_${x}_against11 --alnInfoDir reports/alnCaches/alnCache_${x}  --gff /tank/data/genomes/plasmodium/genomes/pf/info/gff/Pf3D7.gff --knownAminoAcidChangesFnp /tank/data/genomes/plasmodium/genomes/pf/info/pf_drug_resistant_aaPositions.tsv --getPairwiseComps --identifier ${x} --metaFieldsToCalcPopDiffs region,country,secondaryRegion >> runSimpleCallVariantCmds.txt; done;

## duplicated area against 13 
for x in `ls -d Pf3D7_* | egrep for`; do echo elucidator callVariantsAgainstRefGenome --fastagz ${x}/population/popSeqsWithMetaWtihSampleName.fasta.gz --genome /tank/data/genomes/plasmodium/genomes/pf_chromsRemoved/genomes/Pf3D7_noChr11.2bit  --ignoreSubFields site:LabCross,site:LabControl,site:LabContaminated --overWriteDir --lowVariantCutOff 0.0099 --occurrenceCutOff 10 --dout reports/varCalls/vars_${x}_against13 --alnInfoDir reports/alnCaches/alnCache_${x}  --gff /tank/data/genomes/plasmodium/genomes/pf/info/gff/Pf3D7.gff --knownAminoAcidChangesFnp /tank/data/genomes/plasmodium/genomes/pf/info/pf_drug_resistant_aaPositions.tsv --getPairwiseComps --identifier ${x} --metaFieldsToCalcPopDiffs region,country,secondaryRegion >> runSimpleCallVariantCmds.txt; done;

nohup elucidator runMultipleCommands --cmdFile runSimpleCallVariantCmds.txt --numThreads 22 --raw  &

Code

cd reports/varCalls 

elucidator rBind --contains divMeasures.tab.txt --doesNotContains perMetaFields --delim tab --header --recursive --overWrite --out allDivMeasures.tab.txt
elucidator rBind --contains region_divMeasures.tab.txt  --delim tab --header --recursive --overWrite --out region_allDivMeasures.tab.txt
elucidator rBind --contains country_divMeasures.tab.txt  --delim tab --header --recursive --overWrite --out country_allDivMeasures.tab.txt
elucidator rBind --contains secondaryRegion_divMeasures.tab.txt  --delim tab --header --recursive --overWrite --out secondaryRegion_allDivMeasures.tab.txt

elucidator tableExtractColumns --file allDivMeasures.tab.txt --columns id,avgPercentID,avgNumOfDiffs,nSegratingSites,TajimaD,TajimaDPVal  --delim tab --header --overWrite --out seqPairwiseMeasures.tab.txt
cat vars_Pf*/variantCalls/*-chromosome_variableRegion.bed > allVariableRegions.bed

rm -f combinedVariantCalls.vcf && for x in vars_Pf3D7_*/variantCalls/*-genomic.vcf; do cat ${x} | egrep -v "^#" >> combinedVariantCalls.vcf ; done;

bedtools sort -i combinedVariantCalls.vcf > sorted_combinedVariantCalls.vcf

Code

variants = readr::read_tsv(paste0("sorted_combinedVariantCalls.vcf"),
                           col_names = F)
variants_mod = variants%>% 
  mutate(differentCall = row_number()) %>% 
  #filter(grepl(",", X5) | nchar(X4) > 1) %>% 
  mutate(X8 = strsplit(X8, split = ";")) %>% 
  unnest(X8) %>% 
  separate(X8, into = c("variable", "value"), sep = "=") %>% 
  spread(variable, value)


variants_mod = variants_mod %>% 
  mutate(X5 = strsplit(X5, split = ","), 
         AC = strsplit(AC, split = ","), 
         AF = strsplit(AF, split = ","))

variants_mod_unnest = variants_mod %>% 
  unnest(X5, AC, AF) %>% 
  mutate(ID = paste0(X1, "-", X2, "-", X4, "-", X5)) %>% 
  group_by(ID) %>% 
  mutate(IDnumber = row_number()) %>% 
  filter(DP == max(DP)) %>% 
  group_by(ID) %>% 
  mutate(IDnumber = row_number()) %>% 
  filter(1 == IDnumber) %>% 
  mutate(AC = as.numeric(AC), 
         AF = as.numeric(AF), 
         DP = as.numeric(DP), 
         NS = as.numeric(NS))

variants_mod_unnest_multi = variants_mod_unnest %>% 
  group_by(ID) %>% 
  filter(max(IDnumber) > 1)

variants_mod_unnest = variants_mod_unnest %>% 
  mutate(type = case_when(nchar(X4) > 1 ~ "deletion", 
                          nchar(X5) > 1 ~ "insertion", 
                          (nchar(X5 == 1) & nchar(X4) == 1) ~ "SNP",
                          T ~ "other"))

variants_mod_unnest_mod = variants_mod_unnest %>% 
  mutate(X2 = ifelse("SNP" == type, X2, X2 + 1)) %>% 
  mutate(X2 = X2 -1) %>% 
  mutate(end = ifelse("deletion" == type, X2 + nchar(X4) - 1, X2 + 1) ) %>% 
  dplyr::rename(start = X2, 
         `#chrom` = X1) %>% 
  mutate(len = end - start, 
         strand = "+")

variants_mod_unnest_out = variants_mod_unnest_mod %>% 
  group_by() %>% 
  select(`#chrom`, start, end, ID, len, strand, type, AC, AF, DP, NS)

variants_mod_unnest_out_filt = variants_mod_unnest_out %>% 
  filter(1 - AF >= 0.0099) %>% 
  filter(DP > 100)

write_tsv(variants_mod_unnest_out, paste0("sorted_combinedVariantCalls.bed"))

write_tsv(variants_mod_unnest_out_filt %>% 
            filter("SNP" == type), paste0("snps_sorted_combinedVariantCalls.bed"))


variants_mod_unnest_out_filt_SNPs = variants_mod_unnest_out_filt %>% 
  filter("SNP" == type) %>% 
  ungroup() %>% 
  select(`#chrom`, start, end) %>% 
  mutate(name = paste0(`#chrom`,"-", start, "-", end)) %>% 
  mutate(len = 1, strand = "+")
write_tsv(variants_mod_unnest_out_filt_SNPs, "snps_locs.bed")

Code

elucidator bedGetIntersectingGenesInGff --gff /tank/data/genomes/plasmodium/genomes/pf/info/gff/Pf3D7.gff --extraAttributes description --overWrite --bed snps_locs.bed --out snps_locs_withGeneInfo.bed

snps_locs_withGeneInfo.bed

Code

snps_locs_withGeneInfo = readr::read_tsv("snps_locs_withGeneInfo.bed", col_names = F)
write_tsv(snps_locs_withGeneInfo, col_names = F, file = "windows/Supplemental Table - HRPII HRPIII chr11 SNPs in non-paralogous windows.bed")

Final variable windows

Process the compiled windows to rename and create a new file to call haplotype with again now on these subwindows

Code

subWindows = readr::read_tsv("windows/raw_finalHRPII_HRPIII_windows_withTuned_combinedVarConservedRegions_withGeneInfo.bed", col_names = F)
                                      
subWindows = subWindows %>% 
  mutate(superWindow = gsub("__.*", "", X4)) %>% 
  mutate(suffix = gsub(".*__", "", X4)) %>% 
  left_join(subsegmentAnalysisCmdsFnps %>% 
              select(genomicID, regionName) %>% 
              mutate(genomicID = paste0(genomicID, "-for")) %>% 
              rename(superWindow = genomicID)) %>% 
  mutate(renameRegion = paste0(regionName, "__", suffix))

subWindows_out = subWindows %>% 
  mutate(X4 = renameRegion) %>% 
  select(1:7) %>% 
  mutate(X4 = gsub("\\.", "-", X4)) 

write_tsv(subWindows_out, "windows/finalHRPII_HRPIII_windows_withTuned_combinedVarConservedRegions.bed", col_names = F)
write_tsv(subWindows_out, "windows/Supplemental Table - HRPII HRPIII chr11 variation windows in non-paralogous windows.bed", col_names = F)
write_tsv(finalHRPII_HRPIII_windows_withTuned , "windows/Supplemental Table - HRPII HRPIII chr11 non-paralogous windows.bed", col_names = F)

finalHRPII_HRPIII_windows_withTuned_combinedVarConservedRegions.bed

Run PathWeaver on variation sub windows

Like above, call PathWeaver again to analyze the sub-windows including population analysis.

Code


nohup elucidator runMultipleCommands --cmdFile runRegionBamTrimCmds.txt --additionalFields  "{SAMPLE}:allSamples.txt;{DIRNAME}:finalHRPII_HRPIII_windows_withTunedSubWindows;{BEDFNP}:\"/tank/data/plasmodium/falciparum/beds/hrps/redesign_2020_11_22/finalHRPII_HRPIII_windows_withTuned_combinedVarConservedRegions.bed\""  --numThreads 12 &

cd finalHRPII_HRPIII_windows_withTunedSubWindows

nohup PathWeaver runProcessClustersOnRecon --alnInfoDir alnCache --overWriteDir --dout popClustering --pat _finalHRPII_HRPIII_windows_withTunedSubWindows --numThreads 5 --groupingsFile /tank/data/plasmodium/falciparum/pfdata/metadata/meta.tab.txt &

cd popClustering/reports

nohup elucidator tableExtractColumns --columns s_Sample,p_name,h_popUID,c_AveragedFrac --file allSelectedClustersInfo.tab.txt.gz --delim tab --header  --overWrite --out slim_allSelectedClustersInfo.tab.txt.gz &

nohup elucidator doPairwiseComparisonOnHapsSharingDev --tableFnp slim_allSelectedClustersInfo.tab.txt.gz --sampleCol s_Sample --targetNameCol p_name --popIDCol h_popUID --relAbundCol c_AveragedFrac --numThreads 20 --dout pairwiseComps --verbose --overWriteDir --metaFnp //tank/data/plasmodium/falciparum/pfdata/metadata/meta.tab.txt --metaFieldsToCalcPopDiffs country,region,secondaryRegion &

MDR1 region

See this page for analysis of this region

References

Hathaway, Nick. n.d. “PathWeaver: A Suite of Tools for Local Haplotype Reconstruction.” Github.

--- title: "Running Haplotype reconstruction on initial windows" --- ```{r setup, echo=FALSE, message=FALSE} source("../common.R") ``` # HRP2 HPR3 chr11 regions ## Run PathWeaver Below will run PathWeaver[@Hathaway_undated-st] on the initial windows of interest to gather the local haplotypes and will also create sequence coverage for each location. PathWeaver requires a bam file, the reference file that the bam file was aligned against and then a bed file with the regions of interest. ```{bash, eval = F} #| filename: runRegionBamTrimCmds.txt PREP:mkdir -p {DIRNAME} CMD:if [ ! -f {DIRNAME}/{SAMPLE}_{DIRNAME}/final/coiCounts.tab.txt ]; then PathWeaver BamExtractPathwaysFromRegion --bamExtractTrimToRegion --bed {BEDFNP} --genomeDir /tank/data/genomes/plasmodium/genomes/pf/genomes/ --primaryGenome Pf3D7 --bam bams/{SAMPLE}.sorted.bam --dout {DIRNAME}/{SAMPLE}_{DIRNAME} --overWriteDir --numThreads {NCPUS}; fi; {NCPUS}:4 ``` ```{bash, eval = F} nohup elucidator runMultipleCommands --cmdFile runRegionBamTrimCmds.txt --additionalFields "{SAMPLE}:allSamples.txt;{DIRNAME}:finalHRPII_HRPIII_windows_withTuned;{BEDFNP}:\"/tank/data/plasmodium/falciparum/beds/hrps/redesign_2020_11_22/finalHRPII_HRPIII_windows_withTuned.bed\"" --numThreads 12 & ``` ### Post processing PathWeaver first works on each sample individually. Post processing will then compare called haplotypes for each region across samples to analyze haplotype on a population level, do some additional post processing correction and filtering. This step will also create population level IDs and summary of the final called haplotypes as well compile the coverage info for each region within each sample into 1 file. ```{bash, eval = F} cd finalHRPII_HRPIII_windows_withTuned/ nohup PathWeaver runProcessClustersOnRecon --alnInfoDir alnCache --overWriteDir --dout popClustering --pat _finalHRPII_HRPIII_windows_withTuned --numThreads 5 --groupingsFile /tank/data/plasmodium/falciparum/pfdata/metadata/meta.tab.txt & cd popClustering mkdir splitBeds ``` ## Post Processing final Haplotypes Running analysis on the called haplotypes to determine variation within each window to do population genetics analysis like expected heterozygostiy, tajima's d as well as giving the ability to type the windows based on the actual variation within each window. The main program will be `elucidator createSharedSubSegmentsFromRefSeqs` ```{r} finalHRPII_HRPIII_windows_withTuned = readr::read_tsv("windows/finalHRPII_HRPIII_windows_withTuned.bed", col_names = F) %>% mutate(genomicID = paste0(X1, "-", X2, "-", X3)) subsegmentAnalysisCmdsFnps = tibble() if(!dir.exists("windows/splitBeds")){ dir.create("windows/splitBeds") } if(!dir.exists("cmds")){ dir.create("cmds") } for(row in 1:nrow(finalHRPII_HRPIII_windows_withTuned)){ regionName = finalHRPII_HRPIII_windows_withTuned$X4[row] genomicID = finalHRPII_HRPIII_windows_withTuned$genomicID[row] cmd = paste0("cd ", regionName, "/population/ && elucidator createSharedSubSegmentsFromRefSeqs --genome /tank/data/genomes/plasmodium/genomes/pf/genomes/Pf3D7.fasta --refBedFnp ../../splitBeds/", genomicID, ".bed --refBedID ", regionName, " --fastagz popSeqsWithMetaWtihSampleName.fasta.gz --dout ", genomicID, "_subSegments_corrected40 --correctionOccurenceCutOff 40 --lowFreqCutOff 40 --overWriteDir --lenFilter --lenFiltMultiplier 0.85 --kSimFilter --uniqueHapCountCutOff 2") SharedLocsFnp = paste0(regionName, "/population/", genomicID, "_subSegments_corrected40/subRegionInfo/0_ref_sharedLocs_genomic.bed") VarLocsFnp= paste0(regionName, "/population/", genomicID, "_subSegments_corrected40/subRegionInfo/0_ref_variable_expanded_genomic.bed") DivMeasuresPerVarRegionFnp= paste0(regionName, "/population/", genomicID, "_subSegments_corrected40/subRegionInfo/divMeasuresPerVarRegion.tab.txt") subsegmentAnalysisCmdsFnps = bind_rows(subsegmentAnalysisCmdsFnps, tibble( regionName = regionName, genomicID = genomicID, cmd = cmd, SharedLocsFnp = SharedLocsFnp, VarLocsFnp = VarLocsFnp, DivMeasuresPerVarRegionFnp = DivMeasuresPerVarRegionFnp )) write_tsv(finalHRPII_HRPIII_windows_withTuned[row,], file = paste0("windows/splitBeds/", genomicID, ".bed"), col_names = F) } cat(subsegmentAnalysisCmdsFnps$cmd, sep = "\n", file = "cmds/subsegmentAnalysisCmds.txt") cat(subsegmentAnalysisCmdsFnps$SharedLocsFnp, sep = "\n", file = "cmds/allSharedLocsFiles.txt") cat(subsegmentAnalysisCmdsFnps$VarLocsFnp, sep = "\n", file = "cmds/allVarLocsFiles.txt") cat(subsegmentAnalysisCmdsFnps$DivMeasuresPerVarRegionFnp, sep = "\n", file = "cmds/allDivMeasuresPerVarRegionFiles.txt") ``` ### Compiling the final calls Gather together all the final result files into one and then analyze the variation windows. Below will get all the variation sub windows within each larger region. For regions that don't have variation we will grab the conserved region in order to ensure a region kept for each of the larger windows. ```{bash, eval = F} nohup elucidator runMultipleCommands --cmdFile subsegmentAnalysisCmds.txt --numThreads 20 --raw & elucidator rBind --files allSharedLocsFiles.txt --delim tab --out allConservedRegions.bed --overWrite --skipNonExistFiles elucidator rBind --files allVarLocsFiles.txt --delim tab --out allVarRegions.bed --overWrite --skipNonExistFiles elucidator rBind --files allDivMeasuresPerVarRegionFiles.txt --delim tab --out allDivMeasuresPerVarRegion.tab.txt --overWrite --header --skipNonExistFiles elucidator getNonOverlappingBedRegions --bed /tank/data/plasmodium/falciparum/beds/hrps/redesign_2020_11_22/finalHRPII_HRPIII_windows_withTuned_withGeneInfo.bed --intersectWithBed allVarRegions.bed --overWrite --out finalHRPII_HRPIII_windows_withTuned_withGeneInfo_noVarRegionsDetected.bed elucidator getOverlappingBedRegions --bed allConservedRegions.bed --intersectWithBed finalHRPII_HRPIII_windows_withTuned_withGeneInfo_noVarRegionsDetected.bed --out finalHRPII_HRPIII_windows_withTuned_withGeneInfo_noVarRegionsDetected_conservedRegions.bed --overWrite cat allVarRegions.bed finalHRPII_HRPIII_windows_withTuned_withGeneInfo_noVarRegionsDetected_conservedRegions.bed | elucidator bedCoordSort --bed STDIN | cut -f1-6 > raw_finalHRPII_HRPIII_windows_withTuned_combinedVarConservedRegions.bed elucidator bedFilterRegionsCompletelyInOther --bed raw_finalHRPII_HRPIII_windows_withTuned_combinedVarConservedRegions.bed --intersectWithBed raw_finalHRPII_HRPIII_windows_withTuned_combinedVarConservedRegions.bed | elucidator bedGetIntersectingGenesInGff --gff /tank/data/genomes/plasmodium/genomes/pf/info/gff/Pf3D7.gff --extraAttributes description --bed STDIN --overWrite --out raw_finalHRPII_HRPIII_windows_withTuned_combinedVarConservedRegions_withGeneInfo.bed ``` Create genomes without chromosome 11 and chromosome 13 to get proper variant calls for those locations for the duplicated region ```{bash, eval = F} cd /tank/data/genomes/plasmodium/genomes mkdir -p pf_chromsRemoved/genomes mkdir -p pf_chromsRemoved/info/gff cd pf_chromsRemoved/genomes/ elucidator printNames --fasta /tank/data/genomes/plasmodium/genomes/pf/genomes/Pf3D7.fasta | egrep Pf3D7_11_v3 > ../info/Pf3D7_11_v3_name.txt elucidator printNames --fasta /tank/data/genomes/plasmodium/genomes/pf/genomes/Pf3D7.fasta | egrep Pf3D7_13_v3 > ../info/Pf3D7_13_v3_name.txt elucidator extractByName --fasta /tank/data/genomes/plasmodium/genomes/pf/genomes/Pf3D7.fasta --excluding --names ../info/Pf3D7_11_v3_name.txt --out Pf3D7_noChr11.fasta --overWrite elucidator extractByName --fasta /tank/data/genomes/plasmodium/genomes/pf/genomes/Pf3D7.fasta --excluding --names ../info/Pf3D7_13_v3_name.txt --out Pf3D7_noChr13.fasta --overWrite elucidator bioIndexGenomes --genomeDir ./ --numThreads 2 ``` ```{bash, eval = F} cd /tank/data/plasmodium/falciparum/pfdata/finalHRPII_HRPIII_windows_withTuned/popClustering ## non-duplicated area rm -f runSimpleCallVariantCmds.txt && mkdir -p reports/varCalls && mkdir -p reports/alnCaches && for x in `ls -d Pf3D7_* | egrep -v for`; do echo elucidator callVariantsAgainstRefGenome --fastagz ${x}/population/popSeqsWithMetaWtihSampleName.fasta.gz --genome /tank/data/genomes/plasmodium/genomes/pf/genomes/Pf3D7.2bit --ignoreSubFields site:LabCross,site:LabControl,site:LabContaminated --overWriteDir --lowVariantCutOff 0.0099 --occurrenceCutOff 10 --dout reports/varCalls/vars_${x} --alnInfoDir reports/alnCaches/alnCache_${x} --gff /tank/data/genomes/plasmodium/genomes/pf/info/gff/Pf3D7.gff --knownAminoAcidChangesFnp /tank/data/genomes/plasmodium/genomes/pf/info/pf_drug_resistant_aaPositions.tsv --getPairwiseComps --identifier ${x} --metaFieldsToCalcPopDiffs region,country,secondaryRegion >> runSimpleCallVariantCmds.txt; done; ## duplicated area against 11 for x in `ls -d Pf3D7_* | egrep for`; do echo elucidator callVariantsAgainstRefGenome --fastagz ${x}/population/popSeqsWithMetaWtihSampleName.fasta.gz --genome /tank/data/genomes/plasmodium/genomes/pf_chromsRemoved/genomes/Pf3D7_noChr13.2bit --ignoreSubFields site:LabCross,site:LabControl,site:LabContaminated --overWriteDir --lowVariantCutOff 0.0099 --occurrenceCutOff 10 --dout reports/varCalls/vars_${x}_against11 --alnInfoDir reports/alnCaches/alnCache_${x} --gff /tank/data/genomes/plasmodium/genomes/pf/info/gff/Pf3D7.gff --knownAminoAcidChangesFnp /tank/data/genomes/plasmodium/genomes/pf/info/pf_drug_resistant_aaPositions.tsv --getPairwiseComps --identifier ${x} --metaFieldsToCalcPopDiffs region,country,secondaryRegion >> runSimpleCallVariantCmds.txt; done; ## duplicated area against 13 for x in `ls -d Pf3D7_* | egrep for`; do echo elucidator callVariantsAgainstRefGenome --fastagz ${x}/population/popSeqsWithMetaWtihSampleName.fasta.gz --genome /tank/data/genomes/plasmodium/genomes/pf_chromsRemoved/genomes/Pf3D7_noChr11.2bit --ignoreSubFields site:LabCross,site:LabControl,site:LabContaminated --overWriteDir --lowVariantCutOff 0.0099 --occurrenceCutOff 10 --dout reports/varCalls/vars_${x}_against13 --alnInfoDir reports/alnCaches/alnCache_${x} --gff /tank/data/genomes/plasmodium/genomes/pf/info/gff/Pf3D7.gff --knownAminoAcidChangesFnp /tank/data/genomes/plasmodium/genomes/pf/info/pf_drug_resistant_aaPositions.tsv --getPairwiseComps --identifier ${x} --metaFieldsToCalcPopDiffs region,country,secondaryRegion >> runSimpleCallVariantCmds.txt; done; nohup elucidator runMultipleCommands --cmdFile runSimpleCallVariantCmds.txt --numThreads 22 --raw & ``` ```{bash, eval = F} cd reports/varCalls elucidator rBind --contains divMeasures.tab.txt --doesNotContains perMetaFields --delim tab --header --recursive --overWrite --out allDivMeasures.tab.txt elucidator rBind --contains region_divMeasures.tab.txt --delim tab --header --recursive --overWrite --out region_allDivMeasures.tab.txt elucidator rBind --contains country_divMeasures.tab.txt --delim tab --header --recursive --overWrite --out country_allDivMeasures.tab.txt elucidator rBind --contains secondaryRegion_divMeasures.tab.txt --delim tab --header --recursive --overWrite --out secondaryRegion_allDivMeasures.tab.txt elucidator tableExtractColumns --file allDivMeasures.tab.txt --columns id,avgPercentID,avgNumOfDiffs,nSegratingSites,TajimaD,TajimaDPVal --delim tab --header --overWrite --out seqPairwiseMeasures.tab.txt cat vars_Pf*/variantCalls/*-chromosome_variableRegion.bed > allVariableRegions.bed rm -f combinedVariantCalls.vcf && for x in vars_Pf3D7_*/variantCalls/*-genomic.vcf; do cat ${x} | egrep -v "^#" >> combinedVariantCalls.vcf ; done; bedtools sort -i combinedVariantCalls.vcf > sorted_combinedVariantCalls.vcf ``` ```{r} variants = readr::read_tsv(paste0("sorted_combinedVariantCalls.vcf"), col_names = F) variants_mod = variants%>% mutate(differentCall = row_number()) %>% #filter(grepl(",", X5) | nchar(X4) > 1) %>% mutate(X8 = strsplit(X8, split = ";")) %>% unnest(X8) %>% separate(X8, into = c("variable", "value"), sep = "=") %>% spread(variable, value) variants_mod = variants_mod %>% mutate(X5 = strsplit(X5, split = ","), AC = strsplit(AC, split = ","), AF = strsplit(AF, split = ",")) variants_mod_unnest = variants_mod %>% unnest(X5, AC, AF) %>% mutate(ID = paste0(X1, "-", X2, "-", X4, "-", X5)) %>% group_by(ID) %>% mutate(IDnumber = row_number()) %>% filter(DP == max(DP)) %>% group_by(ID) %>% mutate(IDnumber = row_number()) %>% filter(1 == IDnumber) %>% mutate(AC = as.numeric(AC), AF = as.numeric(AF), DP = as.numeric(DP), NS = as.numeric(NS)) variants_mod_unnest_multi = variants_mod_unnest %>% group_by(ID) %>% filter(max(IDnumber) > 1) variants_mod_unnest = variants_mod_unnest %>% mutate(type = case_when(nchar(X4) > 1 ~ "deletion", nchar(X5) > 1 ~ "insertion", (nchar(X5 == 1) & nchar(X4) == 1) ~ "SNP", T ~ "other")) variants_mod_unnest_mod = variants_mod_unnest %>% mutate(X2 = ifelse("SNP" == type, X2, X2 + 1)) %>% mutate(X2 = X2 -1) %>% mutate(end = ifelse("deletion" == type, X2 + nchar(X4) - 1, X2 + 1) ) %>% dplyr::rename(start = X2, `#chrom` = X1) %>% mutate(len = end - start, strand = "+") variants_mod_unnest_out = variants_mod_unnest_mod %>% group_by() %>% select(`#chrom`, start, end, ID, len, strand, type, AC, AF, DP, NS) variants_mod_unnest_out_filt = variants_mod_unnest_out %>% filter(1 - AF >= 0.0099) %>% filter(DP > 100) write_tsv(variants_mod_unnest_out, paste0("sorted_combinedVariantCalls.bed")) write_tsv(variants_mod_unnest_out_filt %>% filter("SNP" == type), paste0("snps_sorted_combinedVariantCalls.bed")) variants_mod_unnest_out_filt_SNPs = variants_mod_unnest_out_filt %>% filter("SNP" == type) %>% ungroup() %>% select(`#chrom`, start, end) %>% mutate(name = paste0(`#chrom`,"-", start, "-", end)) %>% mutate(len = 1, strand = "+") write_tsv(variants_mod_unnest_out_filt_SNPs, "snps_locs.bed") ``` ```{bash} elucidator bedGetIntersectingGenesInGff --gff /tank/data/genomes/plasmodium/genomes/pf/info/gff/Pf3D7.gff --extraAttributes description --overWrite --bed snps_locs.bed --out snps_locs_withGeneInfo.bed ``` ```{r} #| results: asis #| echo: false cat(createDownloadLink("snps_locs_withGeneInfo.bed")) ``` ```{r} snps_locs_withGeneInfo = readr::read_tsv("snps_locs_withGeneInfo.bed", col_names = F) write_tsv(snps_locs_withGeneInfo, col_names = F, file = "windows/Supplemental Table - HRPII HRPIII chr11 SNPs in non-paralogous windows.bed") ``` ## Final variable windows Process the compiled windows to rename and create a new file to call haplotype with again now on these subwindows ```{r} subWindows = readr::read_tsv("windows/raw_finalHRPII_HRPIII_windows_withTuned_combinedVarConservedRegions_withGeneInfo.bed", col_names = F) subWindows = subWindows %>% mutate(superWindow = gsub("__.*", "", X4)) %>% mutate(suffix = gsub(".*__", "", X4)) %>% left_join(subsegmentAnalysisCmdsFnps %>% select(genomicID, regionName) %>% mutate(genomicID = paste0(genomicID, "-for")) %>% rename(superWindow = genomicID)) %>% mutate(renameRegion = paste0(regionName, "__", suffix)) subWindows_out = subWindows %>% mutate(X4 = renameRegion) %>% select(1:7) %>% mutate(X4 = gsub("\\.", "-", X4)) write_tsv(subWindows_out, "windows/finalHRPII_HRPIII_windows_withTuned_combinedVarConservedRegions.bed", col_names = F) write_tsv(subWindows_out, "windows/Supplemental Table - HRPII HRPIII chr11 variation windows in non-paralogous windows.bed", col_names = F) write_tsv(finalHRPII_HRPIII_windows_withTuned , "windows/Supplemental Table - HRPII HRPIII chr11 non-paralogous windows.bed", col_names = F) ``` ```{r} #| results: asis #| echo: false cat(createDownloadLink("windows/finalHRPII_HRPIII_windows_withTuned_combinedVarConservedRegions.bed")) ``` ## Run PathWeaver on variation sub windows Like above, call PathWeaver again to analyze the sub-windows including population analysis. ```{bash, eval = F} nohup elucidator runMultipleCommands --cmdFile runRegionBamTrimCmds.txt --additionalFields "{SAMPLE}:allSamples.txt;{DIRNAME}:finalHRPII_HRPIII_windows_withTunedSubWindows;{BEDFNP}:\"/tank/data/plasmodium/falciparum/beds/hrps/redesign_2020_11_22/finalHRPII_HRPIII_windows_withTuned_combinedVarConservedRegions.bed\"" --numThreads 12 & cd finalHRPII_HRPIII_windows_withTunedSubWindows nohup PathWeaver runProcessClustersOnRecon --alnInfoDir alnCache --overWriteDir --dout popClustering --pat _finalHRPII_HRPIII_windows_withTunedSubWindows --numThreads 5 --groupingsFile /tank/data/plasmodium/falciparum/pfdata/metadata/meta.tab.txt & cd popClustering/reports nohup elucidator tableExtractColumns --columns s_Sample,p_name,h_popUID,c_AveragedFrac --file allSelectedClustersInfo.tab.txt.gz --delim tab --header --overWrite --out slim_allSelectedClustersInfo.tab.txt.gz & nohup elucidator doPairwiseComparisonOnHapsSharingDev --tableFnp slim_allSelectedClustersInfo.tab.txt.gz --sampleCol s_Sample --targetNameCol p_name --popIDCol h_popUID --relAbundCol c_AveragedFrac --numThreads 20 --dout pairwiseComps --verbose --overWriteDir --metaFnp //tank/data/plasmodium/falciparum/pfdata/metadata/meta.tab.txt --metaFieldsToCalcPopDiffs country,region,secondaryRegion & ``` # MDR1 region See [this page](../DeletionPatternAnalysis/deletion_patterns/process_chr05_hrp3_deletion_pfmdr1_dup.qmd) for analysis of this region