Analysis Surrounding HRP2 and HRP3 deletions
  • Home
  • Final Manuscript
  • Window Analysis
    • Windows
    • Running Haplotype Reconstruction on Windows
    • Genomic Locations Of Final Windows

    • Window analysis by coverage
    • Processing Coverage Initial Windows
    • Processing Coverage on Sub Windows

    • Window analysis of deletion patterns
    • Telomere healing
    • Processing Samples with HRP2 Deletions TARE1
    • Processing Samples with Chr11 Deletions TARE1
    • Processing Samples for chr13 TARE1 presence
    • pfmdr1 duplication
    • Processing Samples with pfhrp3 13-5++ deletion pattern

    • Final Coverage Windows
    • Processing Coverage on Sub Windows - final

    • Window analysis by sequence/variation
    • Plotting haplotype variation within regions

    • Analysis by SNP variant analysis
    • Calling variants and Estimating COI
    • Plotting BiAllelic Variant Plots
  • HB3/SD01 Longreads analysis
    • Set up
    • Creating Hybrid genomes

    • Spanning Raw Reads analysis
    • Processing Spanning Reads
    • SD01 spanning specific

    • HB3
    • Processing chr11 and chr13
    • Final Process Assembly

    • SD01
    • Running SD01 assemblies
    • Processing SD01 assemblies

    • Both
    • Illumina against HB3/SD01 Assemblies
    • Comparison To 3D7 Simplified View
  • rRNA Segmental Duplications

    • Chr11/13 Duplicated Region
    • Characterizing Duplicated Region
  • Related Genomic Regions Vis
    • Analysis
    • Finding shared regions genome wide
    • Mapping out surrounding Genes on Assembled Strains

    • Misc
    • Plotting HRPs Tandem Repeats
    • Info on all rRNA
  • Comparing to related Plasmodiums
    • Comparing to all 6 Plasmodium Laverania
    • Comparing to all 6 Plasmodium Laverania Gene Arrangements chr05,07,08,11,13
    • Comparing to HRP2/3 falciparum sequences
  • References
    • Getting Raw Data References
    • References
    • R Session and Commandline tools info

Contents

  • Plotting intersected long reads
    • Nanopore
    • Pacbio
  • Extracting and typing reads
    • Extracting long reads
    • Renaming
    • Typed intersected reads
  • Processing on pure hybrid
    • Nanopore
    • Pacbio
    • Extracting and typing reads
    • Typing
    • Linkage between variants
    • Plotting by strongest chromosome association

SD01 specific spanning reads analysis

  • Show All Code
  • Hide All Code

  • View Source

Getting specific spanning reads for the variability in the shared region for SD01. The read and overall sample quality of SD01 is poor (fragile DNA) so assembly and long read depth is hard to achieve. This plus the fact that basically from this shared region to the end chromosomes 11 and hybrid 13-11 are perfectly copied (a region close to 200kb) it’s hard to assembly at baseline. For this reason to help support the claim that the hybrid genome 13 exist we can link the same amount of variation seen within the shared region between 11 and 13 to try to link variation within this region to reads that span chromosome 13 into this region. If variation specific to only to chromosome 13 reads are also contained within reads that then span from within the shared region into chromosome 11 sequence (either on the hybrid 13-11 or into 11 since from the shared region to the end of the chromosome this sequence is identical within the 3D7 hybrid genome so reads should be able to multi-map to both these regions).

These regions below have variation within the strain SD01 within the 15.3kb shared region between 11 and 13.

Code
finalHrpSubwindows_regions_withSD01MultiInShared = readr::read_tsv("../../../../Windows_Surrounding_HRP2_3_deletions/windowAnalysis/finalHrpSubwindows_regions_withSD01MultiInShared.bed", col_names = F)
create_dt(finalHrpSubwindows_regions_withSD01MultiInShared)

Determine the locations within hybrid genome in order to pull the long reads from the alignments to the hybrid (core 3D7 genome plus hybrid chromosomes chr11-chr13, chr13-chr11) genome

Code
cd ../../../../Windows_Surrounding_HRP2_3_deletions/windowAnalysis/
elucidator getFastaWithBed  --twoBit /tank/data/genomes/plasmodium/genomes/pf/genomes/Pf3D7.2bit --overWrite --bed finalHrpSubwindows_regions_withSD01MultiInShared.bed --out finalHrpSubwindows_regions_withSD01MultiInShared.fasta 

elucidator determineRegionLastz --identity 100 --coverage 100 --fasta finalHrpSubwindows_regions_withSD01MultiInShared.fasta --individual --genome /tank/data/genomes/plasmodium/genomes/pf_hybrid/genomes/Pf3D7_plus_11-13_13-11_hybrid.fasta | sort | uniq > raw_finalHrpSubwindows_regions_withSD01MultiInShared_inHybrid.bed 
# have to do this cause LASTZ bugs out and doesn't do the matching of Pf3D7_11_v3 1924680 1924740 Pf3D7_11_v3-1924664-1924740-for__var-0 60 + for whatever goodness forsaken reason 

cut -f1-6 finalHrpSubwindows_regions_withSD01MultiInShared.bed raw_finalHrpSubwindows_regions_withSD01MultiInShared_inHybrid.bed | sort | uniq > finalHrpSubwindows_regions_withSD01MultiInShared_inHybrid.bed

elucidator getOverlappingBedRegions --bed finalHrpSubwindows_regions_withSD01MultiInShared_inHybrid.bed --intersectWithBed ../../sharedBetween11_and_13/investigatingChrom11Chrom13/shared_11_13_region_onPureHybrid.bed  | cut -f1-6 > finalHrpSubwindows_regions_withSD01MultiInShared_inPureHybrid.bed

Get the nanopore/pacbio reads that intersect this region

Code
cd /tank/projects/plasmodium/falciparum/hrp/hrp3_deletion/nanopore 
elucidator bamToBed --bam hybridAlnsMinimap2/PfSD01PermethION.sorted.bam | elucidator getOverlappingBedRegions --bed STDIN --intersectWithBed /tank/projects/plasmodium/falciparum/hrp/hrp3_deletion/nanopore/finalHrpSubwindows_regions_withSD01MultiInShared_inHybrid.bed --overWrite --out PfSD01PermethION_intersecting_withSD01MultiInShared.bed 

cd /tank/data/plasmodium/falciparum/pfpubdata/malariagen_pacbio_assemblies 
elucidator bamToBed --bam hybridAlnsMinimap2/PfSD01.sorted.bam | elucidator getOverlappingBedRegions --bed STDIN --intersectWithBed /tank/projects/plasmodium/falciparum/hrp/hrp3_deletion/nanopore/finalHrpSubwindows_regions_withSD01MultiInShared_inHybrid.bed --overWrite --out PfSD01Pacbio_intersecting_withSD01MultiInShared.bed
Code
homologousRegionWithHybrid = readr::read_tsv("../../../../sharedBetween11_and_13/investigatingChrom11Chrom13/combined_shared_11_13_region.bed", col_names = F)
multiRegion = readr::read_tsv("../../../../Windows_Surrounding_HRP2_3_deletions/windowAnalysis/finalHrpSubwindows_regions_withSD01MultiInShared_inHybrid.bed", col_names = F)

Plotting intersected long reads

Shared region is shown in red, the regions within wich SD01 has variation are shown in blue

Code
PfSD01PermethION = readr::read_tsv("PfSD01PermethION_intersecting_withSD01MultiInShared.bed", col_names = F) %>% 
  group_by(X1) %>% 
  mutate(rowid = row_number()) %>% 
  group_by()
PfSD01Pacbio = readr::read_tsv("PfSD01Pacbio_intersecting_withSD01MultiInShared.bed", col_names = F) %>% 
  group_by(X1) %>% 
  mutate(rowid = row_number()) %>% 
  group_by()

PfSD01PermethION_filt = PfSD01PermethION %>%
  left_join(
    homologousRegionWithHybrid %>%
      rename(sharedStart = X2,
             sharedStop = X3) %>%
      select(X1, starts_with("shared"))
  ) %>%
  filter((X2 < sharedStart & X3 > sharedStart)
         |
           (X2 < sharedStop & X3 > sharedStop)) %>%
  group_by(X1) %>%
  mutate(rowid = row_number()) %>%
  group_by()

PfSD01Pacbio_filt = PfSD01Pacbio %>%
  left_join(
    homologousRegionWithHybrid %>%
      rename(sharedStart = X2,
             sharedStop = X3) %>%
      select(X1, starts_with("shared"))
  ) %>%
  filter((X2 < sharedStart & X3 > sharedStart)
         |
           (X2 < sharedStop & X3 > sharedStop)) %>%
  group_by(X1) %>%
  mutate(rowid = row_number()) %>%
  group_by()

Nanopore

Code
ggplot() + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = rowid, 
                ymax = rowid + 1), 
            data = PfSD01PermethION_filt)  + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#2271B2",
            data = homologousRegionWithHybrid) + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#F748A5",
            data = multiRegion) + 
  sofonias_theme + 
  facet_wrap(~X1, scales = "free")

Pacbio

Code
ggplot() + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = rowid, 
                ymax = rowid + 1), 
            data = PfSD01Pacbio_filt)  + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#2271B2",
            data = homologousRegionWithHybrid) + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#F748A5",
            data = multiRegion)+ 
  sofonias_theme + 
  facet_wrap(~X1, scales = "free")

Extracting and typing reads

Extracting long reads

Code

cd /tank/projects/plasmodium/falciparum/hrp/hrp3_deletion/nanopore
cd PfSD01_withSD01MultiInShared_extractions

elucidator BamGetSpanningReadsForRegionLongReads --bam /tank/projects/plasmodium/falciparum/hrp/hrp3_deletion/nanopore/hybridAlnsMinimap2/PfSD01PermethION.sorted.bam --bed /tank/projects/plasmodium/falciparum/hrp/hrp3_deletion/nanopore/finalHrpSubwindows_regions_withSD01MultiInShared_inHybrid.bed --trimToRegion --overWriteDir --dout PfSD01_nano_finalHrpSubwindows_regions_withSD01MultiInShared_inHybrid_regions --rename

elucidator BamGetSpanningReadsForRegionLongReads --bam /tank/data/plasmodium/falciparum/pfpubdata/malariagen_pacbio_assemblies/hybridAlnsMinimap2/PfSD01.sorted.bam --bed /tank/projects/plasmodium/falciparum/hrp/hrp3_deletion/nanopore/finalHrpSubwindows_regions_withSD01MultiInShared_inHybrid.bed --trimToRegion --overWriteDir --dout PfSD01_pacbio_finalHrpSubwindows_regions_withSD01MultiInShared_inHybrid_regions --rename

Add in illumina reads to help with error correcting

Code
elucidator BamGetSpanningReadsForRegionLongReads --bam /tank/data/plasmodium/falciparum/pfdata/bams/PfSD01.sorted.bam --bed /tank/projects/plasmodium/falciparum/hrp/hrp3_deletion/nanopore/finalHrpSubwindows_regions_withSD01MultiInShared.bed --trimToRegion --overWriteDir --dout PfSD01_illumina_finalHrpSubwindows_regions_withSD01MultiInShared_regions --rename

mkdir PfSD01_combined_finalHrpSubwindows_regions_withSD01MultiInShared_regions

for x in PfSD01_illumina_finalHrpSubwindows_regions_withSD01MultiInShared_regions/Pf*gz; do elucidator prependNames --fastqgz ${x} --overWrite --prefix illumina_; done; 
for x in PfSD01_nano_finalHrpSubwindows_regions_withSD01MultiInShared_inHybrid_regions/Pf*gz; do elucidator prependNames --fastqgz ${x} --overWrite --prefix nano_; done; 
for x in PfSD01_pacbio_finalHrpSubwindows_regions_withSD01MultiInShared_inHybrid_regions/Pf*gz; do elucidator prependNames --fastqgz ${x} --overWrite --prefix pacbio_; done; 

Concatenate reads and add the extraction info

Code
for x in PfSD01_illumina_finalHrpSubwindows_regions_withSD01MultiInShared_regions/prepended_Pf3D7*gz; do cat ${x} PfSD01_nano_finalHrpSubwindows_regions_withSD01MultiInShared_inHybrid_regions/$(basename ${x}) PfSD01_pacbio_finalHrpSubwindows_regions_withSD01MultiInShared_inHybrid_regions/$(basename ${x}) > PfSD01_combined_finalHrpSubwindows_regions_withSD01MultiInShared_regions/$(basename ${x} | sed 's/prepended_//g'); done;

cat PfSD01_nano_finalHrpSubwindows_regions_withSD01MultiInShared_inHybrid_regions/prepended_Pf3D7_11_v3-1927700-1928008-for__var-0.fastq.gz  PfSD01_pacbio_finalHrpSubwindows_regions_withSD01MultiInShared_inHybrid_regions/prepended_Pf3D7_11_v3-1927700-1928008-for__var-0.fastq.gz  > PfSD01_combined_finalHrpSubwindows_regions_withSD01MultiInShared_regions/Pf3D7_11_v3-1927700-1928008-for__var-0.fastq.gz

rsync -raPh PfSD01_illumina_finalHrpSubwindows_regions_withSD01MultiInShared_regions/nameKey.tab.txt PfSD01_combined_finalHrpSubwindows_regions_withSD01MultiInShared_regions/illumina_extraction_nameKey.tab.txt
rsync -raPh PfSD01_nano_finalHrpSubwindows_regions_withSD01MultiInShared_inHybrid_regions/nameKey.tab.txt PfSD01_combined_finalHrpSubwindows_regions_withSD01MultiInShared_regions/nano_extraction_nameKey.tab.txt
rsync -raPh PfSD01_pacbio_finalHrpSubwindows_regions_withSD01MultiInShared_inHybrid_regions/nameKey.tab.txt PfSD01_combined_finalHrpSubwindows_regions_withSD01MultiInShared_regions/pacbio_extraction_nameKey.tab.txt

Cluster the reads using clustering meant for error prone long reads elucidatorlab kmerClusteringRate to help type the reads.

Code
cd PfSD01_combined_finalHrpSubwindows_regions_withSD01MultiInShared_regions 
for x in Pf3D7_11_v3-19*gz; do elucidatorlab kmerClusteringRate --qualThres 7,5 --cutOff 0.1 --idCutOff 0.80 --numThreads 15 --lqMismatches 2 --oneBaseIndel 5 --twoBaseIndel 2 --map --recalcConsensus --checkChimeras --overWriteDir --writeInitialClusters --sizeCutOff 10  --fastq  ${x} --dout clustering_${x%%.fastq.gz} --postCollapseForceOneComp; done;
elucidator rBind --contains clusterNames.tab.txt --recursive --delim tab --header --overWrite --out allClusterNames.tab.txt

Renaming

Type the reads based on their clustering

Code
allClusterNames = readr::read_tsv("allClusterNames.tab.txt")

allClusterNames_filt = allClusterNames %>%
  filter(!grepl("_t0\\.[0-9]*", readName)) %>%
  filter("Pf3D7_11_v3-1927700-1928008-for__var-0.fastq.2_t12" != clusterName) %>% 
  mutate(region = gsub("\\.fastq.*", "", clusterName)) %>%
  mutate(clusterID = gsub(".*\\.fastq\\.", "", clusterName)) %>%
  mutate(clusterID = gsub("_t[0-9\\.]*", "", clusterID)) %>% 
  mutate(regionID = paste0(region, ".", clusterID)) 

illuminaExtraction = readr::read_tsv("illumina_extraction_nameKey.tab.txt") %>%
  mutate(newName = paste0("illumina_", newName))
pacbioExtraction = readr::read_tsv("pacbio_extraction_nameKey.tab.txt") %>%
  mutate(newName = paste0("pacbio_", newName))
nanoExtraction = readr::read_tsv("nano_extraction_nameKey.tab.txt") %>%
  mutate(newName = paste0("nano_", newName))

bothExtraction = bind_rows(pacbioExtraction,
                           nanoExtraction)


bothExtraction_filt = bothExtraction %>% 
  filter(newName %in% allClusterNames_filt$readName) %>% 
  mutate(genomeID = paste0(`#chrom`, "-", start, "-", end)) %>% 
  left_join(allClusterNames_filt %>% 
              rename(newName = readName) %>% 
              select(newName, region, regionID)) %>% 
  select(-newName) %>% 
  spread(region, regionID) %>% 
  replace_na(
    list(
    `Pf3D7_11_v3-1919677-1919761-for__var-0` = "notTypeable",
    `Pf3D7_11_v3-1921054-1921153-for__var-0` = "notTypeable",
    `Pf3D7_11_v3-1924664-1924740-for__var-0` = "notTypeable", 
    `Pf3D7_11_v3-1927700-1928008-for__var-0` = "notTypeable")
  )

illuminaExtraction_filt = illuminaExtraction %>% 
  filter(newName %in% allClusterNames_filt$readName) %>% 
  mutate(genomeID = paste0(`#chrom`, "-", start, "-", end)) %>% 
  left_join(allClusterNames_filt %>% 
              rename(newName = readName) %>% 
              select(newName, region, regionID)) %>% 
  select(-newName) %>% 
  spread(region, regionID) %>% 
  replace_na(
    list(
    `Pf3D7_11_v3-1919677-1919761-for__var-0` = "notTypeable",
    `Pf3D7_11_v3-1921054-1921153-for__var-0` = "notTypeable",
    `Pf3D7_11_v3-1924664-1924740-for__var-0` = "notTypeable", 
    `Pf3D7_11_v3-1927700-1928008-for__var-0` = "notTypeable")
  )

bothExtraction_filt_counts = bothExtraction_filt %>% 
  group_by(`Pf3D7_11_v3-1919677-1919761-for__var-0`, 
          `Pf3D7_11_v3-1921054-1921153-for__var-0`, 
          `Pf3D7_11_v3-1924664-1924740-for__var-0`, 
          `Pf3D7_11_v3-1927700-1928008-for__var-0`) %>% 
  count()

create_dt(bothExtraction_filt_counts)

Typed intersected reads

Typing the reads by the variants within each of the 4 regions, the current region typing the read by is colored pink in the shared region below. If a read didn’t cover the region or there was too much error to confidently type it then it’s colored as “untypeable”.

Code
outBound = 5000

bothExtraction_filt_mod = bothExtraction_filt %>% 
  group_by(`#chrom`) %>% 
  arrange(`#chrom`, start, end )%>% 
  mutate(rowid = row_number()) %>% 
  mutate(X1 = `#chrom`) %>%
  left_join(
    homologousRegionWithHybrid %>%
      rename(sharedStart = X2,
             sharedStop = X3) %>%
      select(X1, starts_with("shared"))
  ) %>%
  filter((start < sharedStart & end > sharedStart)
         |
           (start < sharedStop & end > sharedStop)) %>%
  group_by(X1) %>%
  mutate(rowid = row_number()) %>%
  group_by() %>% 
  mutate(startDiff = sharedStart - start, 
         endDiff = sharedStop - end) %>% 
  mutate(start = ifelse(start < sharedStart - outBound, sharedStart - outBound, start), 
         end =   ifelse(end > sharedStop + outBound, sharedStop + outBound, end))

illuminaExtraction_filt_mod = illuminaExtraction_filt %>% 
  group_by(`#chrom`) %>% 
  arrange(`#chrom`, start, end )%>% 
  mutate(rowid = row_number()) %>% 
  mutate(X1 = `#chrom`) %>%
  left_join(
    homologousRegionWithHybrid %>%
      rename(sharedStart = X2,
             sharedStop = X3) %>%
      select(X1, starts_with("shared"))
  ) 
Code
ggplot() + 
  geom_rect(aes(xmin = start, xmax = end, 
                ymin = rowid, 
                ymax = rowid + 1, 
                fill = `Pf3D7_11_v3-1919677-1919761-for__var-0`), 
            data = bothExtraction_filt_mod)  + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#2271B2",
            data = homologousRegionWithHybrid) + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#F748A5",
            data = multiRegion) + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#F0E442",
            data = multiRegion %>% 
              filter(X4 == "Pf3D7_11_v3-1919677-1919761-for__var-0"))  + 
  sofonias_theme + 
  facet_wrap(~X1, scales = "free") + 
  scale_fill_manual(values = c("#3DB7E9", "#D55E00", "#9F0162"),
    guide = guide_legend(
      ncol = 1
    )
  ) + 
  labs(title = "Pf3D7_11_v3-1919677-1919761-for__var-0")

Code
ggplot() + 
  geom_rect(aes(xmin = start, xmax = end, 
                ymin = rowid, 
                ymax = rowid + 1, 
                fill = `Pf3D7_11_v3-1921054-1921153-for__var-0`), 
            data = bothExtraction_filt_mod)  + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#2271B2",
            data = homologousRegionWithHybrid) + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#F748A5",
            data = multiRegion) + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#F0E442",
            data = multiRegion %>% 
              filter(X4 == "Pf3D7_11_v3-1921054-1921153-for__var-0"))  + 
  sofonias_theme + 
  facet_wrap(~X1, scales = "free") + 
  scale_fill_manual(values = c("#3DB7E9", "#D55E00", "#9F0162"),
    guide = guide_legend(
      ncol = 1
    )
  ) + 
  labs(title = "Pf3D7_11_v3-1921054-1921153-for__var-0")

Code
ggplot() + 
  geom_rect(aes(xmin = start, xmax = end, 
                ymin = rowid, 
                ymax = rowid + 1, 
                fill = `Pf3D7_11_v3-1924664-1924740-for__var-0`), 
            data = bothExtraction_filt_mod)  + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#2271B2",
            data = homologousRegionWithHybrid) + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#F748A5",
            data = multiRegion)+ 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#F0E442",
            data = multiRegion %>% 
              filter(X4 == "Pf3D7_11_v3-1924664-1924740-for__var-0")) + 
  sofonias_theme + 
  facet_wrap(~X1, scales = "free") + 
  scale_fill_manual(values = c("#3DB7E9", "#D55E00", "#9F0162"),
    guide = guide_legend(
      ncol = 1
    )
  ) + 
  labs(title = "Pf3D7_11_v3-1924664-1924740-for__var-0")

Code
ggplot() + 
  geom_rect(aes(xmin = start, xmax = end, 
                ymin = rowid, 
                ymax = rowid + 1, 
                fill = `Pf3D7_11_v3-1927700-1928008-for__var-0`), 
            data = bothExtraction_filt_mod)  + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#2271B2",
            data = homologousRegionWithHybrid) + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#F748A5",
            data = multiRegion)+ 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#F0E442",
            data = multiRegion %>% 
              filter(X4 == "Pf3D7_11_v3-1927700-1928008-for__var-0")) + 
  sofonias_theme + 
  facet_wrap(~X1, scales = "free") + 
  scale_fill_manual(values = c("#3DB7E9", "#D55E00", "#9F0162"),
    guide = guide_legend(
      ncol = 1
    )
  ) + 
  labs(title = "Pf3D7_11_v3-1927700-1928008-for__var-0")

Processing on pure hybrid

The same exact work flow as above but except hybrid genome was modified to include only the expected regular chromosome 11 and the hybrid 13-11 (which is the expected hybrid genome for SD01). This way it is easier to deal with less multi mappers when so much genome was duplicated to map to.

Code
cd /tank/projects/plasmodium/falciparum/hrp/hrp3_deletion/nanopore
mkdir purehybridAlnsMinimap2
minimap2 -x map-ont -t 40 -a /tank/data/genomes/plasmodium/genomes/pf_hybrid/genomes/Pf3D7_pure_11-13_13-11_hybrid.fasta rawFastq/PfSD01PermethION.fastq.gz | samtools sort --threads 40 -o purehybridAlnsMinimap2/PfSD01PermethION.sorted.bam 
samtools index purehybridAlnsMinimap2/PfSD01PermethION.sorted.bam 

cd /tank/data/plasmodium/falciparum/pfpubdata/malariagen_pacbio_assemblies 
mkdir purehybridAlnsMinimap2
minimap2 -x map-hifi -t 48 -a /tank/data/genomes/plasmodium/genomes/pf_hybrid/genomes/Pf3D7_pure_11-13_13-11_hybrid.fasta rawFastq/ERR1036246.fastq.gz   | samtools sort --threads 48 -o purehybridAlnsMinimap2/PfSD01.sorted.bam && samtools index purehybridAlnsMinimap2/PfSD01.sorted.bam 
Code
cd /tank/projects/plasmodium/falciparum/hrp/hrp3_deletion/nanopore 
elucidator bamToBed --bam purehybridAlnsMinimap2/PfSD01PermethION.sorted.bam | elucidator getOverlappingBedRegions --bed STDIN --intersectWithBed /tank/projects/plasmodium/falciparum/hrp/hrp3_deletion/nanopore/finalHrpSubwindows_regions_withSD01MultiInShared_inPureHybrid.bed --overWrite --out PfSD01PermethION_intersecting_withSD01MultiInShared_pureHybrid.bed 

cd /tank/data/plasmodium/falciparum/pfpubdata/malariagen_pacbio_assemblies 
elucidator bamToBed --bam purehybridAlnsMinimap2/PfSD01.sorted.bam | elucidator getOverlappingBedRegions --bed STDIN --intersectWithBed /tank/projects/plasmodium/falciparum/hrp/hrp3_deletion/nanopore/finalHrpSubwindows_regions_withSD01MultiInShared_inPureHybrid.bed --overWrite --out PfSD01Pacbio_intersecting_withSD01MultiInShared_pureHybrid.bed
Code
homologousRegionWithPureHybrid = readr::read_tsv("../../../../sharedBetween11_and_13/investigatingChrom11Chrom13/shared_11_13_region_onPureHybrid.bed", col_names = F, skip = 1)
multiRegionPureHybrid = readr::read_tsv("../../../../Windows_Surrounding_HRP2_3_deletions/windowAnalysis/finalHrpSubwindows_regions_withSD01MultiInShared_inPureHybrid.bed", col_names = F)
Code
PfSD01PermethION = readr::read_tsv("againstPureHybrid/PfSD01PermethION_intersecting_withSD01MultiInShared_pureHybrid.bed", col_names = F) %>% 
  group_by(X1) %>% 
  mutate(rowid = row_number()) %>% 
  group_by()
PfSD01Pacbio = readr::read_tsv("againstPureHybrid/PfSD01Pacbio_intersecting_withSD01MultiInShared_pureHybrid.bed", col_names = F) %>% 
  group_by(X1) %>% 
  mutate(rowid = row_number()) %>% 
  group_by()

PfSD01PermethION_filt = PfSD01PermethION %>%
  left_join(
    homologousRegionWithHybrid %>%
      rename(sharedStart = X2,
             sharedStop = X3) %>%
      select(X1, starts_with("shared"))
  ) %>%
  filter((X2 < sharedStart & X3 > sharedStart)
         |
           (X2 < sharedStop & X3 > sharedStop)) %>%
  group_by(X1) %>%
  mutate(rowid = row_number()) %>%
  group_by()

PfSD01Pacbio_filt = PfSD01Pacbio %>%
  left_join(
    homologousRegionWithHybrid %>%
      rename(sharedStart = X2,
             sharedStop = X3) %>%
      select(X1, starts_with("shared"))
  ) %>%
  filter((X2 < sharedStart & X3 > sharedStart)
         |
           (X2 < sharedStop & X3 > sharedStop)) %>%
  group_by(X1) %>%
  mutate(rowid = row_number()) %>%
  group_by()

Nanopore

Code
ggplot() + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = rowid, 
                ymax = rowid + 1), 
            data = PfSD01PermethION_filt)  + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#2271B2",
            data = homologousRegionWithPureHybrid) + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#F748A5",
            data = multiRegionPureHybrid) + 
  sofonias_theme + 
  facet_wrap(~X1, scales = "free")

Pacbio

Code
ggplot() + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = rowid, 
                ymax = rowid + 1), 
            data = PfSD01Pacbio_filt)  + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#2271B2",
            data = homologousRegionWithPureHybrid) + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#F748A5",
            data = multiRegionPureHybrid)+ 
  sofonias_theme + 
  facet_wrap(~X1, scales = "free")

Extracting and typing reads

Code

cd /tank/projects/plasmodium/falciparum/hrp/hrp3_deletion/nanopore
mkdir PfSD01_withSD01MultiInShared_extractions_pureHybrid
cd PfSD01_withSD01MultiInShared_extractions_pureHybrid

elucidator BamGetSpanningReadsForRegionLongReads --bam /tank/projects/plasmodium/falciparum/hrp/hrp3_deletion/nanopore/purehybridAlnsMinimap2/PfSD01PermethION.sorted.bam --bed /tank/projects/plasmodium/falciparum/hrp/hrp3_deletion/nanopore/finalHrpSubwindows_regions_withSD01MultiInShared_inPureHybrid.bed --trimToRegion --overWriteDir --dout PfSD01_nano_finalHrpSubwindows_regions_withSD01MultiInShared_inPureHybrid_regions --rename

elucidator BamGetSpanningReadsForRegionLongReads --bam /tank/data/plasmodium/falciparum/pfpubdata/malariagen_pacbio_assemblies/purehybridAlnsMinimap2/PfSD01.sorted.bam --bed /tank/projects/plasmodium/falciparum/hrp/hrp3_deletion/nanopore/finalHrpSubwindows_regions_withSD01MultiInShared_inPureHybrid.bed --trimToRegion --overWriteDir --dout PfSD01_pacbio_finalHrpSubwindows_regions_withSD01MultiInShared_inPureHybrid_regions --rename

elucidator BamGetSpanningReadsForRegionLongReads --bam /tank/data/plasmodium/falciparum/pfdata/bams/PfSD01.sorted.bam --bed /tank/projects/plasmodium/falciparum/hrp/hrp3_deletion/nanopore/finalHrpSubwindows_regions_withSD01MultiInShared.bed --trimToRegion --overWriteDir --dout PfSD01_illumina_finalHrpSubwindows_regions_withSD01MultiInShared_regions --rename

mkdir PfSD01_combined_finalHrpSubwindows_regions_withSD01MultiInShared_inPureHybrid_regions

for x in PfSD01_illumina_finalHrpSubwindows_regions_withSD01MultiInShared_regions/Pf*gz; do elucidator prependNames --fastqgz ${x} --overWrite --prefix illumina_; done; 
for x in PfSD01_nano_finalHrpSubwindows_regions_withSD01MultiInShared_inPureHybrid_regions/Pf*gz; do elucidator prependNames --fastqgz ${x} --overWrite --prefix nano_; done; 
for x in PfSD01_pacbio_finalHrpSubwindows_regions_withSD01MultiInShared_inPureHybrid_regions/Pf*gz; do elucidator prependNames --fastqgz ${x} --overWrite --prefix pacbio_; done; 

for x in PfSD01_illumina_finalHrpSubwindows_regions_withSD01MultiInShared_regions/prepended_Pf3D7*gz; do cat ${x} PfSD01_nano_finalHrpSubwindows_regions_withSD01MultiInShared_inPureHybrid_regions/$(basename ${x}) PfSD01_pacbio_finalHrpSubwindows_regions_withSD01MultiInShared_inPureHybrid_regions/$(basename ${x}) > PfSD01_combined_finalHrpSubwindows_regions_withSD01MultiInShared_inPureHybrid_regions/$(basename ${x} | sed 's/prepended_//g'); done;

cat PfSD01_nano_finalHrpSubwindows_regions_withSD01MultiInShared_inPureHybrid_regions/prepended_Pf3D7_11_v3-1927700-1928008-for__var-0.fastq.gz  PfSD01_pacbio_finalHrpSubwindows_regions_withSD01MultiInShared_inPureHybrid_regions/prepended_Pf3D7_11_v3-1927700-1928008-for__var-0.fastq.gz  > PfSD01_combined_finalHrpSubwindows_regions_withSD01MultiInShared_inPureHybrid_regions/Pf3D7_11_v3-1927700-1928008-for__var-0.fastq.gz

rsync -raPh PfSD01_illumina_finalHrpSubwindows_regions_withSD01MultiInShared_regions/nameKey.tab.txt PfSD01_combined_finalHrpSubwindows_regions_withSD01MultiInShared_inPureHybrid_regions/illumina_extraction_nameKey.tab.txt
rsync -raPh PfSD01_nano_finalHrpSubwindows_regions_withSD01MultiInShared_inPureHybrid_regions/nameKey.tab.txt PfSD01_combined_finalHrpSubwindows_regions_withSD01MultiInShared_inPureHybrid_regions/nano_extraction_nameKey.tab.txt
rsync -raPh PfSD01_pacbio_finalHrpSubwindows_regions_withSD01MultiInShared_inPureHybrid_regions/nameKey.tab.txt PfSD01_combined_finalHrpSubwindows_regions_withSD01MultiInShared_inPureHybrid_regions/pacbio_extraction_nameKey.tab.txt

cd PfSD01_combined_finalHrpSubwindows_regions_withSD01MultiInShared_inPureHybrid_regions 
for x in Pf3D7_11_v3-19*gz; do elucidatorlab kmerClusteringRate --qualThres 7,5 --cutOff 0.1 --idCutOff 0.80 --numThreads 15 --lqMismatches 2 --oneBaseIndel 5 --twoBaseIndel 2 --map --recalcConsensus --checkChimeras --overWriteDir --writeInitialClusters --sizeCutOff 10  --fastq  ${x} --dout clustering_${x%%.fastq.gz} --postCollapseForceOneComp; done;
elucidator rBind --contains clusterNames.tab.txt --recursive --delim tab --header --overWrite --out allClusterNames.tab.txt

Typing

Code
allClusterNames = readr::read_tsv("againstPureHybrid/allClusterNames.tab.txt")

allClusterNames_filt = allClusterNames %>%
  filter(!grepl("_t0\\.[0-9]*", readName)) %>%
  filter("Pf3D7_11_v3-1927700-1928008-for__var-0.fastq.2_t12" != clusterName) %>% 
  mutate(region = gsub("\\.fastq.*", "", clusterName)) %>%
  mutate(clusterID = gsub(".*\\.fastq\\.", "", clusterName)) %>%
  mutate(clusterID = gsub("_t[0-9\\.]*", "", clusterID)) %>% 
  mutate(regionID = paste0(region, ".", clusterID)) 

illuminaExtraction = readr::read_tsv("againstPureHybrid/illumina_extraction_nameKey.tab.txt") %>%
  mutate(newName = paste0("illumina_", newName))
pacbioExtraction = readr::read_tsv("againstPureHybrid/pacbio_extraction_nameKey.tab.txt") %>%
  mutate(newName = paste0("pacbio_", newName))
nanoExtraction = readr::read_tsv("againstPureHybrid/nano_extraction_nameKey.tab.txt") %>%
  mutate(newName = paste0("nano_", newName))

bothExtraction = bind_rows(pacbioExtraction,
                           nanoExtraction)


bothExtraction_filt = bothExtraction %>% 
  filter(newName %in% allClusterNames_filt$readName) %>% 
  mutate(genomeID = paste0(`#chrom`, "-", start, "-", end)) %>% 
  left_join(allClusterNames_filt %>% 
              rename(newName = readName) %>% 
              select(newName, region, regionID)) %>% 
  select(-newName) %>% 
  spread(region, regionID) %>% 
  replace_na(
    list(
    `Pf3D7_11_v3-1919677-1919761-for__var-0` = "notTypeable",
    `Pf3D7_11_v3-1921054-1921153-for__var-0` = "notTypeable",
    `Pf3D7_11_v3-1924664-1924740-for__var-0` = "notTypeable", 
    `Pf3D7_11_v3-1927700-1928008-for__var-0` = "notTypeable")
  )


illuminaExtraction_filt = illuminaExtraction %>% 
  filter(newName %in% allClusterNames_filt$readName) %>% 
  mutate(genomeID = paste0(`#chrom`, "-", start, "-", end)) %>% 
  left_join(allClusterNames_filt %>% 
              rename(newName = readName) %>% 
              select(newName, region, regionID)) %>% 
  select(-newName) %>% 
  spread(region, regionID) %>% 
  replace_na(
    list(
    `Pf3D7_11_v3-1919677-1919761-for__var-0` = "notTypeable",
    `Pf3D7_11_v3-1921054-1921153-for__var-0` = "notTypeable",
    `Pf3D7_11_v3-1924664-1924740-for__var-0` = "notTypeable", 
    `Pf3D7_11_v3-1927700-1928008-for__var-0` = "notTypeable")
  )

bothExtraction_filt_counts = bothExtraction_filt %>% 
  group_by(`Pf3D7_11_v3-1919677-1919761-for__var-0`, 
          `Pf3D7_11_v3-1921054-1921153-for__var-0`, 
          `Pf3D7_11_v3-1924664-1924740-for__var-0`, 
          `Pf3D7_11_v3-1927700-1928008-for__var-0`) %>% 
  count()

create_dt(bothExtraction_filt_counts)
Code
outBound = 5000

bothExtraction_filt_mod = bothExtraction_filt %>% 
  group_by(`#chrom`) %>% 
  arrange(`#chrom`, start, end )%>% 
  mutate(rowid = row_number()) %>% 
  mutate(X1 = `#chrom`) %>%
  left_join(
    homologousRegionWithPureHybrid %>%
      rename(sharedStart = X2,
             sharedStop = X3) %>%
      select(X1, starts_with("shared"))
  ) %>%
  filter((start < sharedStart & end > sharedStart)
         |
           (start < sharedStop & end > sharedStop)) %>%
  group_by(X1) %>%
  mutate(rowid = row_number()) %>%
  group_by() %>% 
  mutate(startDiff = sharedStart - start, 
         endDiff = sharedStop - end) %>% 
  mutate(start = ifelse(start < sharedStart - outBound, sharedStart - outBound, start), 
         end =   ifelse(end > sharedStop + outBound, sharedStop + outBound, end))


illuminaExtraction_filt_mod = illuminaExtraction_filt %>% 
  group_by(`#chrom`) %>% 
  arrange(`#chrom`, start, end )%>% 
  mutate(rowid = row_number()) %>% 
  mutate(X1 = `#chrom`) %>%
  left_join(
    homologousRegionWithPureHybrid %>%
      rename(sharedStart = X2,
             sharedStop = X3) %>%
      select(X1, starts_with("shared"))
  )

ggplot() + 
  geom_rect(aes(xmin = start, xmax = end, 
                ymin = rowid, 
                ymax = rowid + 1, 
                fill = `Pf3D7_11_v3-1919677-1919761-for__var-0`), 
            data = bothExtraction_filt_mod)  + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#2271B2",
            data = homologousRegionWithPureHybrid) + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#F748A5",
            data = multiRegionPureHybrid) + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#F0E442",
            data = multiRegionPureHybrid %>% 
              filter(X4 == "Pf3D7_11_v3-1919677-1919761-for__var-0"))  + 
  sofonias_theme + 
  facet_wrap(~X1, scales = "free") + 
  scale_fill_manual(values = c("#3DB7E9", "#D55E00", "#9F0162"),
    guide = guide_legend(
      ncol = 1
    )
  ) + 
  labs(title = "Pf3D7_11_v3-1919677-1919761-for__var-0")

Code
ggplot() + 
  geom_rect(aes(xmin = start, xmax = end, 
                ymin = rowid, 
                ymax = rowid + 1, 
                fill = `Pf3D7_11_v3-1921054-1921153-for__var-0`), 
            data = bothExtraction_filt_mod)  + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#2271B2",
            data = homologousRegionWithPureHybrid) + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#F748A5",
            data = multiRegionPureHybrid) + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#F0E442",
            data = multiRegionPureHybrid %>% 
              filter(X4 == "Pf3D7_11_v3-1921054-1921153-for__var-0"))  + 
  sofonias_theme + 
  facet_wrap(~X1, scales = "free") + 
  scale_fill_manual(values = c("#3DB7E9", "#D55E00", "#9F0162"),
    guide = guide_legend(
      ncol = 1
    )
  ) + 
  labs(title = "Pf3D7_11_v3-1921054-1921153-for__var-0")

Code
ggplot() + 
  geom_rect(aes(xmin = start, xmax = end, 
                ymin = rowid, 
                ymax = rowid + 1, 
                fill = `Pf3D7_11_v3-1924664-1924740-for__var-0`), 
            data = bothExtraction_filt_mod)  + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#2271B2",
            data = homologousRegionWithPureHybrid) + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#F748A5",
            data = multiRegionPureHybrid)+ 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#F0E442",
            data = multiRegionPureHybrid %>% 
              filter(X4 == "Pf3D7_11_v3-1924664-1924740-for__var-0")) + 
  sofonias_theme + 
  facet_wrap(~X1, scales = "free") + 
  scale_fill_manual(values = c("#3DB7E9", "#D55E00", "#9F0162"),
    guide = guide_legend(
      ncol = 1
    )
  ) + 
  labs(title = "Pf3D7_11_v3-1924664-1924740-for__var-0")

Code
ggplot() + 
  geom_rect(aes(xmin = start, xmax = end, 
                ymin = rowid, 
                ymax = rowid + 1, 
                fill = `Pf3D7_11_v3-1927700-1928008-for__var-0`), 
            data = bothExtraction_filt_mod)  + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#2271B2",
            data = homologousRegionWithPureHybrid) + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#F748A5",
            data = multiRegionPureHybrid)+ 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#F0E442",
            data = multiRegionPureHybrid %>% 
              filter(X4 == "Pf3D7_11_v3-1927700-1928008-for__var-0")) + 
  sofonias_theme + 
  facet_wrap(~X1, scales = "free") + 
  scale_fill_manual(values = c("#3DB7E9", "#D55E00", "#9F0162"),
    guide = guide_legend(
      ncol = 1
    )
  ) + 
  labs(title = "Pf3D7_11_v3-1927700-1928008-for__var-0")

Linkage between variants

Looking at the linkage between variants by linking each variant within the long reads to each subsequent variant and then linkng the chromosome to the first variant. It appears that all the variants 0.1 are correlated to chromosome 11 and all variants 0.0 are correlated to chromosome 13

Code
# Load package
library(networkD3)

bothExtraction_filt_counts_gat = bothExtraction_filt_counts %>% 
  gather(region, haplotype, 1:4)

bothExtraction_filt_counts_gat_hapCounts = bothExtraction_filt_counts_gat %>% 
  filter(haplotype != "notTypeable") %>% 
  group_by(haplotype) %>% 
  summarise(n = sum(n)) %>% 
  mutate(haplotypeCoded = row_number())

bothExtraction_filt_counts_hapLinkCounts = tibble()
# for(col1 in 1:3) {
#   for (col2 in (col1 + 1):4) {
#    bothExtraction_filt_counts_gat_subSelection = bothExtraction_filt_counts %>% 
#      group_by() %>% 
#      select(col1, col2, n)
#    colnames(bothExtraction_filt_counts_gat_subSelection)[1:2] = c("hap1", "hap2")
#    bothExtraction_filt_counts_gat_subSelection_filt = bothExtraction_filt_counts_gat_subSelection %>% 
#      filter(hap1 != "notTypeable", 
#             hap2 != "notTypeable") %>% 
#      group_by(hap1, hap2) %>% 
#      summarise(n = sum(n))
#    bothExtraction_filt_counts_hapLinkCounts = bind_rows(
#      bothExtraction_filt_counts_hapLinkCounts, 
#      bothExtraction_filt_counts_gat_subSelection_filt
#    )
#   }
# }

for(col1 in 1:3) {
  col2 = col1 + 1
   bothExtraction_filt_counts_gat_subSelection = bothExtraction_filt_counts %>% 
     group_by() %>% 
     select(col1, col2, n)
   colnames(bothExtraction_filt_counts_gat_subSelection)[1:2] = c("hap1", "hap2")
   bothExtraction_filt_counts_gat_subSelection_filt = bothExtraction_filt_counts_gat_subSelection %>% 
     filter(hap1 != "notTypeable", 
            hap2 != "notTypeable") %>% 
     group_by(hap1, hap2) %>% 
     summarise(n = sum(n))
   bothExtraction_filt_counts_hapLinkCounts = bind_rows(
     bothExtraction_filt_counts_hapLinkCounts, 
     bothExtraction_filt_counts_gat_subSelection_filt
   )
}

bothExtraction_filt_chromosomeLinkageInfo =  bothExtraction_filt %>% 
   select(`#chrom`, `Pf3D7_11_v3-1919677-1919761-for__var-0`) %>% 
   filter(`Pf3D7_11_v3-1919677-1919761-for__var-0` != "notTypeable") %>% 
  group_by(`#chrom`, `Pf3D7_11_v3-1919677-1919761-for__var-0`) %>% 
  count() 

bothExtraction_filt_chromosomeLinkageInfo_addToHapCounts = bothExtraction_filt_chromosomeLinkageInfo %>% 
  rename(haplotype = `#chrom` ) %>% 
  group_by(haplotype) %>% 
  summarise(n = sum(n))

bothExtraction_filt_counts_gat_hapCounts = bothExtraction_filt_counts_gat_hapCounts %>% 
  bind_rows(bothExtraction_filt_chromosomeLinkageInfo_addToHapCounts)


#zero based position source/target nodes 
bothExtraction_filt_counts_hapLinkCounts = bothExtraction_filt_counts_hapLinkCounts %>% 
  mutate(source = match(hap1, bothExtraction_filt_counts_gat_hapCounts$haplotype) -1) %>% 
  mutate(target = match(hap2, bothExtraction_filt_counts_gat_hapCounts$haplotype) -1)

bothExtraction_filt_chromosomeLinkageInfo = bothExtraction_filt_chromosomeLinkageInfo  %>% 
  rename(hap1 = `#chrom`, 
         hap2 = `Pf3D7_11_v3-1919677-1919761-for__var-0`) %>% 
  mutate(source = match(hap1, bothExtraction_filt_counts_gat_hapCounts$haplotype) -1) %>% 
  mutate(target = match(hap2, bothExtraction_filt_counts_gat_hapCounts$haplotype) -1)
bothExtraction_filt_counts_hapLinkCounts = bothExtraction_filt_counts_hapLinkCounts %>% 
  bind_rows(bothExtraction_filt_chromosomeLinkageInfo)
# adding in chromosome linkage to front 


bothExtraction_filt_counts_hapLinkCounts = as.data.frame(bothExtraction_filt_counts_hapLinkCounts)
bothExtraction_filt_counts_gat_hapCounts = as.data.frame(bothExtraction_filt_counts_gat_hapCounts)

# Thus we can plot it
p <- sankeyNetwork(
  Links = bothExtraction_filt_counts_hapLinkCounts,
  Nodes = bothExtraction_filt_counts_gat_hapCounts,
  Source = "source",
  Target = "target",
  Value = "n",
  NodeID = "haplotype",
  units = "reads",
  fontSize = 12,
  nodeWidth = 30
)

There is some cross linkage between the variants but this is low level and could either represent typing error or even possibly low level of continued recombination between these regions.

Code
p
Code
bothExtraction_filt_mod_sel = bothExtraction_filt_mod %>% 
  select(starts_with("Pf3D7"), `#chrom`, rowid) %>% 
  gather(region, hap, 1:4) %>% 
  mutate(hapVar = gsub(".*\\.", "", hap)) %>% 
  filter(hapVar != "notTypeable") %>% 
  group_by(`#chrom`, rowid,hapVar) %>% 
  count() %>% 
  group_by(`#chrom`, rowid) %>% 
  filter(n == max(n)) 

bothExtraction_filt_mod_sel_counts = bothExtraction_filt_mod_sel%>% 
  group_by(`#chrom`, rowid) %>% 
  mutate(varCounts = n()) %>% 
  mutate(typed = case_when(varCounts > 1 ~ "indeterminate", 
                           hapVar == 1 ~ "chr11", 
                           hapVar == 0 ~ "chr13")) 

bothExtraction_filt_mod_sel_counts_type = bothExtraction_filt_mod_sel_counts %>% 
  select(`#chrom`, rowid, typed) %>% 
  group_by() %>% 
  unique()
  
bothExtraction_filt_mod_typed = bothExtraction_filt_mod %>% 
  left_join(bothExtraction_filt_mod_sel_counts_type)


illuminaExtraction_filt_mod_sel = illuminaExtraction_filt_mod %>% 
  select(starts_with("Pf3D7"), `#chrom`, rowid) %>% 
  gather(region, hap, 1:3) %>% 
  mutate(hapVar = gsub(".*\\.", "", hap)) %>% 
  filter(hapVar != "notTypeable") %>% 
  group_by(`#chrom`, rowid,hapVar) %>% 
  count() %>% 
  group_by(`#chrom`, rowid) %>% 
  filter(n == max(n)) 

illuminaExtraction_filt_mod_sel_counts = illuminaExtraction_filt_mod_sel%>% 
  group_by(`#chrom`, rowid) %>% 
  mutate(varCounts = n()) %>% 
  mutate(typed = case_when(varCounts > 1 ~ "indeterminate", 
                           hapVar == 1 ~ "chr11", 
                           hapVar == 0 ~ "chr13")) 

illuminaExtraction_filt_mod_sel_counts_type = illuminaExtraction_filt_mod_sel_counts %>% 
  select(`#chrom`, rowid, typed) %>% 
  group_by() %>% 
  unique()
  
illuminaExtraction_filt_mod_typed = illuminaExtraction_filt_mod %>% 
  left_join(illuminaExtraction_filt_mod_sel_counts_type)

Plotting by strongest chromosome association

Colored by the chromosome associated variation (chr11 vs chr13). If a read has multiple loci and there is a tie between the different chromosomes then it is colored indeterminate.

Code
chrVarPlot = ggplot() + 
  geom_rect(aes(xmin = start, xmax = end, 
                ymin = rowid, 
                ymax = rowid + 1, 
                fill = typed),
            color = "black",
            data = bothExtraction_filt_mod_typed %>% 
              mutate(typed = factor(typed, levels = c("indeterminate", "chr13", "chr11"))))  + 
  scale_fill_manual(values = c("#3DB7E9", "#D55E00", "#9F0162"),
    name= "Associated Chromosome\nVariation",
    guide = guide_legend(
      ncol = 1
    ) ) + 
  ggnewscale::new_scale_fill() + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0, 
                fill = "Duplicated Region"),
            color = "black",
            #fill = "#2271B2",
            data = homologousRegionWithPureHybrid) + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0, 
                fill = "SD01 Variant Sites"),
            #color = "black",
            #fill = "#F748A5",
            data = multiRegionPureHybrid) + 
  scale_fill_manual("Genomic", values = c("Duplicated Region" = "#2271B2", "SD01 Variant Sites" = "#F748A5")) + 
  guides(fill = guide_legend(ncol = 1)) + 
  sofonias_theme + 
  theme(axis.text.x = element_text(size=12, angle = -45, vjust = 0.5, hjust = 0)) + 
  facet_wrap(~X1, scales = "free") 
  
print(chrVarPlot)

Code
pdf("SD01_variation_associated_reads_plot.pdf", width = 8, height = 6, useDingbats = F)
print(chrVarPlot)
dev.off()
quartz_off_screen 
                2 
Code
# pacbio and nanopore  
bothExtraction_filt_mod_typedchr11 = bothExtraction_filt_mod_typed %>% 
  filter(typed == "chr11")
write_tsv(bothExtraction_filt_mod_typedchr11, "nanoPacbio_typed11.tab.txt")

bothExtraction_filt_mod_typedchr13 = bothExtraction_filt_mod_typed %>% 
  filter(typed == "chr13")
write_tsv(bothExtraction_filt_mod_typedchr13, "nanoPacbio_typed13.tab.txt")

bothExtraction_filt_mod_typedchr13 = bothExtraction_filt_mod_typed %>% 
  filter(typed == "chr13")
write_tsv(bothExtraction_filt_mod_typedchr13, "nanoPacbio_typed13.tab.txt")



# illumina 
illuminaExtraction_filt_mod_typedchr11 = illuminaExtraction_filt_mod_typed %>% 
  filter(typed == "chr11")
write_tsv(illuminaExtraction_filt_mod_typedchr11, "illumina_typed11.tab.txt")

illuminaExtraction_filt_mod_typedchr13 = illuminaExtraction_filt_mod_typed %>% 
  filter(typed == "chr13")
write_tsv(illuminaExtraction_filt_mod_typedchr13, "illumina_typed13.tab.txt")

illuminaExtraction_filt_mod_typedchr13 = illuminaExtraction_filt_mod_typed %>% 
  filter(typed == "chr13")
write_tsv(illuminaExtraction_filt_mod_typedchr13, "illumina_typed13.tab.txt")
Code
cd /tank/projects/plasmodium/falciparum/hrp/hrp3_deletion/nanopore/extractingFromRawFastqs

elucidator printCol --file nanoPacbio_typed11.tab.txt --delim tab --header --columnName oldName --unique --sort --out chr11_names.txt --overWrite
elucidator printCol --file nanoPacbio_typed13.tab.txt --delim tab --header --columnName oldName --unique --sort --out chr13_names.txt --overWrite

nohup elucidator extractByName --fastqgz ../pacbioReads/combined.fastq.gz --names chr11_names.txt --out pacbio_chr11_combined.fastq.gz --overWrite --trimAtWhiteSpace  &
nohup elucidator extractByName --fastqgz ../pacbioReads/combined.fastq.gz --names chr13_names.txt --out pacbio_chr13_combined.fastq.gz --overWrite --trimAtWhiteSpace  &
nohup elucidator extractByName --fastqgz ../pacbioReads/combined.fastq.gz --names chr11_names.txt --out pacbio_allButChr11_combined.fastq.gz --overWrite --trimAtWhiteSpace --excluding &
nohup elucidator extractByName --fastqgz ../pacbioReads/combined.fastq.gz --names chr13_names.txt --out pacbio_allButChr13_combined.fastq.gz --overWrite --trimAtWhiteSpace --excluding &


nohup elucidator extractByName --fastqgz ../rawFastq/PfSD01PermethION.fastq.gz --names chr11_names.txt --out nano_chr11_PfSD01PermethION.fastq.gz --overWrite --trimAtWhiteSpace  &
nohup elucidator extractByName --fastqgz ../rawFastq/PfSD01PermethION.fastq.gz --names chr13_names.txt --out nano_chr13_PfSD01PermethION.fastq.gz --overWrite --trimAtWhiteSpace  &
nohup elucidator extractByName --fastqgz ../rawFastq/PfSD01PermethION.fastq.gz --names chr11_names.txt --out nano_allButChr11_PfSD01PermethION.fastq.gz --overWrite --trimAtWhiteSpace --excluding &
nohup elucidator extractByName --fastqgz ../rawFastq/PfSD01PermethION.fastq.gz --names chr13_names.txt --out nano_allButChr13_PfSD01PermethION.fastq.gz --overWrite --trimAtWhiteSpace --excluding &

cat nano_allButChr11_PfSD01PermethION.fastq.gz pacbio_chr13_combined.fastq.gz > ../combinedRawFastqs/PfSD01PermethION_withPacbio_forchr13.fastq.gz
cat nano_allButChr13_PfSD01PermethION.fastq.gz pacbio_chr11_combined.fastq.gz > ../combinedRawFastqs/PfSD01PermethION_withPacbio_forchr11.fastq.gz
Source Code
---
title: "SD01 specific spanning reads analysis"
---

```{r setup, echo=FALSE, message=FALSE}
source("../../../common.R")
```

Getting specific spanning reads for the variability in the shared region for SD01. The read and overall sample quality of SD01 is poor (fragile DNA) so assembly and long read depth is hard to achieve. This plus the fact that basically from this shared region to the end chromosomes 11 and hybrid 13-11 are perfectly copied (a region close to 200kb) it's hard to assembly at baseline. For this reason to help support the claim that the hybrid genome 13 exist we can link the same amount of variation seen within the shared region between 11 and 13 to try to link variation within this region to reads that span chromosome 13 into this region. If variation specific to only to chromosome 13 reads are also contained within reads that then span from within the shared region into chromosome 11 sequence (either on the hybrid 13-11 or into 11 since from the shared region to the end of the chromosome this sequence is identical within the 3D7 hybrid genome so reads should be able to multi-map to both these regions). 

These regions below have variation within the strain SD01 within the 15.3kb shared region between 11 and 13. 

```{r}
finalHrpSubwindows_regions_withSD01MultiInShared = readr::read_tsv("../../../../Windows_Surrounding_HRP2_3_deletions/windowAnalysis/finalHrpSubwindows_regions_withSD01MultiInShared.bed", col_names = F)
create_dt(finalHrpSubwindows_regions_withSD01MultiInShared)
```

Determine the locations within hybrid genome in order to pull the long reads from the alignments to the hybrid (core 3D7 genome plus hybrid chromosomes chr11-chr13, chr13-chr11) genome

```{bash, eval = F}
cd ../../../../Windows_Surrounding_HRP2_3_deletions/windowAnalysis/
elucidator getFastaWithBed  --twoBit /tank/data/genomes/plasmodium/genomes/pf/genomes/Pf3D7.2bit --overWrite --bed finalHrpSubwindows_regions_withSD01MultiInShared.bed --out finalHrpSubwindows_regions_withSD01MultiInShared.fasta 

elucidator determineRegionLastz --identity 100 --coverage 100 --fasta finalHrpSubwindows_regions_withSD01MultiInShared.fasta --individual --genome /tank/data/genomes/plasmodium/genomes/pf_hybrid/genomes/Pf3D7_plus_11-13_13-11_hybrid.fasta | sort | uniq > raw_finalHrpSubwindows_regions_withSD01MultiInShared_inHybrid.bed 
# have to do this cause LASTZ bugs out and doesn't do the matching of Pf3D7_11_v3 1924680 1924740 Pf3D7_11_v3-1924664-1924740-for__var-0 60 + for whatever goodness forsaken reason 

cut -f1-6 finalHrpSubwindows_regions_withSD01MultiInShared.bed raw_finalHrpSubwindows_regions_withSD01MultiInShared_inHybrid.bed | sort | uniq > finalHrpSubwindows_regions_withSD01MultiInShared_inHybrid.bed

elucidator getOverlappingBedRegions --bed finalHrpSubwindows_regions_withSD01MultiInShared_inHybrid.bed --intersectWithBed ../../sharedBetween11_and_13/investigatingChrom11Chrom13/shared_11_13_region_onPureHybrid.bed  | cut -f1-6 > finalHrpSubwindows_regions_withSD01MultiInShared_inPureHybrid.bed

```

Get the nanopore/pacbio reads that intersect this region  
```{bash, eval = F}
cd /tank/projects/plasmodium/falciparum/hrp/hrp3_deletion/nanopore 
elucidator bamToBed --bam hybridAlnsMinimap2/PfSD01PermethION.sorted.bam | elucidator getOverlappingBedRegions --bed STDIN --intersectWithBed /tank/projects/plasmodium/falciparum/hrp/hrp3_deletion/nanopore/finalHrpSubwindows_regions_withSD01MultiInShared_inHybrid.bed --overWrite --out PfSD01PermethION_intersecting_withSD01MultiInShared.bed 

cd /tank/data/plasmodium/falciparum/pfpubdata/malariagen_pacbio_assemblies 
elucidator bamToBed --bam hybridAlnsMinimap2/PfSD01.sorted.bam | elucidator getOverlappingBedRegions --bed STDIN --intersectWithBed /tank/projects/plasmodium/falciparum/hrp/hrp3_deletion/nanopore/finalHrpSubwindows_regions_withSD01MultiInShared_inHybrid.bed --overWrite --out PfSD01Pacbio_intersecting_withSD01MultiInShared.bed

```

```{r}
homologousRegionWithHybrid = readr::read_tsv("../../../../sharedBetween11_and_13/investigatingChrom11Chrom13/combined_shared_11_13_region.bed", col_names = F)
multiRegion = readr::read_tsv("../../../../Windows_Surrounding_HRP2_3_deletions/windowAnalysis/finalHrpSubwindows_regions_withSD01MultiInShared_inHybrid.bed", col_names = F)
```

## Plotting intersected long reads  

Shared region is shown in red, the regions within wich SD01 has variation are shown in blue  

```{r}
PfSD01PermethION = readr::read_tsv("PfSD01PermethION_intersecting_withSD01MultiInShared.bed", col_names = F) %>% 
  group_by(X1) %>% 
  mutate(rowid = row_number()) %>% 
  group_by()
PfSD01Pacbio = readr::read_tsv("PfSD01Pacbio_intersecting_withSD01MultiInShared.bed", col_names = F) %>% 
  group_by(X1) %>% 
  mutate(rowid = row_number()) %>% 
  group_by()

PfSD01PermethION_filt = PfSD01PermethION %>%
  left_join(
    homologousRegionWithHybrid %>%
      rename(sharedStart = X2,
             sharedStop = X3) %>%
      select(X1, starts_with("shared"))
  ) %>%
  filter((X2 < sharedStart & X3 > sharedStart)
         |
           (X2 < sharedStop & X3 > sharedStop)) %>%
  group_by(X1) %>%
  mutate(rowid = row_number()) %>%
  group_by()

PfSD01Pacbio_filt = PfSD01Pacbio %>%
  left_join(
    homologousRegionWithHybrid %>%
      rename(sharedStart = X2,
             sharedStop = X3) %>%
      select(X1, starts_with("shared"))
  ) %>%
  filter((X2 < sharedStart & X3 > sharedStart)
         |
           (X2 < sharedStop & X3 > sharedStop)) %>%
  group_by(X1) %>%
  mutate(rowid = row_number()) %>%
  group_by()

```

### Nanopore  

```{r}
#| fig-column: screen-inset-shaded


ggplot() + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = rowid, 
                ymax = rowid + 1), 
            data = PfSD01PermethION_filt)  + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#2271B2",
            data = homologousRegionWithHybrid) + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#F748A5",
            data = multiRegion) + 
  sofonias_theme + 
  facet_wrap(~X1, scales = "free")

```

### Pacbio  

```{r}
#| fig-column: screen-inset-shaded

ggplot() + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = rowid, 
                ymax = rowid + 1), 
            data = PfSD01Pacbio_filt)  + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#2271B2",
            data = homologousRegionWithHybrid) + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#F748A5",
            data = multiRegion)+ 
  sofonias_theme + 
  facet_wrap(~X1, scales = "free")
```

# Extracting and typing reads  

## Extracting long reads  
```{bash, eval = F}

cd /tank/projects/plasmodium/falciparum/hrp/hrp3_deletion/nanopore
cd PfSD01_withSD01MultiInShared_extractions

elucidator BamGetSpanningReadsForRegionLongReads --bam /tank/projects/plasmodium/falciparum/hrp/hrp3_deletion/nanopore/hybridAlnsMinimap2/PfSD01PermethION.sorted.bam --bed /tank/projects/plasmodium/falciparum/hrp/hrp3_deletion/nanopore/finalHrpSubwindows_regions_withSD01MultiInShared_inHybrid.bed --trimToRegion --overWriteDir --dout PfSD01_nano_finalHrpSubwindows_regions_withSD01MultiInShared_inHybrid_regions --rename

elucidator BamGetSpanningReadsForRegionLongReads --bam /tank/data/plasmodium/falciparum/pfpubdata/malariagen_pacbio_assemblies/hybridAlnsMinimap2/PfSD01.sorted.bam --bed /tank/projects/plasmodium/falciparum/hrp/hrp3_deletion/nanopore/finalHrpSubwindows_regions_withSD01MultiInShared_inHybrid.bed --trimToRegion --overWriteDir --dout PfSD01_pacbio_finalHrpSubwindows_regions_withSD01MultiInShared_inHybrid_regions --rename

```

Add in illumina reads to help with error correcting  

```{bash, eval = F}
elucidator BamGetSpanningReadsForRegionLongReads --bam /tank/data/plasmodium/falciparum/pfdata/bams/PfSD01.sorted.bam --bed /tank/projects/plasmodium/falciparum/hrp/hrp3_deletion/nanopore/finalHrpSubwindows_regions_withSD01MultiInShared.bed --trimToRegion --overWriteDir --dout PfSD01_illumina_finalHrpSubwindows_regions_withSD01MultiInShared_regions --rename

mkdir PfSD01_combined_finalHrpSubwindows_regions_withSD01MultiInShared_regions

for x in PfSD01_illumina_finalHrpSubwindows_regions_withSD01MultiInShared_regions/Pf*gz; do elucidator prependNames --fastqgz ${x} --overWrite --prefix illumina_; done; 
for x in PfSD01_nano_finalHrpSubwindows_regions_withSD01MultiInShared_inHybrid_regions/Pf*gz; do elucidator prependNames --fastqgz ${x} --overWrite --prefix nano_; done; 
for x in PfSD01_pacbio_finalHrpSubwindows_regions_withSD01MultiInShared_inHybrid_regions/Pf*gz; do elucidator prependNames --fastqgz ${x} --overWrite --prefix pacbio_; done; 

```

Concatenate reads and add the extraction info  
```{bash, eval = F}
for x in PfSD01_illumina_finalHrpSubwindows_regions_withSD01MultiInShared_regions/prepended_Pf3D7*gz; do cat ${x} PfSD01_nano_finalHrpSubwindows_regions_withSD01MultiInShared_inHybrid_regions/$(basename ${x}) PfSD01_pacbio_finalHrpSubwindows_regions_withSD01MultiInShared_inHybrid_regions/$(basename ${x}) > PfSD01_combined_finalHrpSubwindows_regions_withSD01MultiInShared_regions/$(basename ${x} | sed 's/prepended_//g'); done;

cat PfSD01_nano_finalHrpSubwindows_regions_withSD01MultiInShared_inHybrid_regions/prepended_Pf3D7_11_v3-1927700-1928008-for__var-0.fastq.gz  PfSD01_pacbio_finalHrpSubwindows_regions_withSD01MultiInShared_inHybrid_regions/prepended_Pf3D7_11_v3-1927700-1928008-for__var-0.fastq.gz  > PfSD01_combined_finalHrpSubwindows_regions_withSD01MultiInShared_regions/Pf3D7_11_v3-1927700-1928008-for__var-0.fastq.gz

rsync -raPh PfSD01_illumina_finalHrpSubwindows_regions_withSD01MultiInShared_regions/nameKey.tab.txt PfSD01_combined_finalHrpSubwindows_regions_withSD01MultiInShared_regions/illumina_extraction_nameKey.tab.txt
rsync -raPh PfSD01_nano_finalHrpSubwindows_regions_withSD01MultiInShared_inHybrid_regions/nameKey.tab.txt PfSD01_combined_finalHrpSubwindows_regions_withSD01MultiInShared_regions/nano_extraction_nameKey.tab.txt
rsync -raPh PfSD01_pacbio_finalHrpSubwindows_regions_withSD01MultiInShared_inHybrid_regions/nameKey.tab.txt PfSD01_combined_finalHrpSubwindows_regions_withSD01MultiInShared_regions/pacbio_extraction_nameKey.tab.txt

```

Cluster the reads using clustering meant for error prone long reads `elucidatorlab kmerClusteringRate` to help type the reads.  

```{bash, eval = F}
cd PfSD01_combined_finalHrpSubwindows_regions_withSD01MultiInShared_regions 
for x in Pf3D7_11_v3-19*gz; do elucidatorlab kmerClusteringRate --qualThres 7,5 --cutOff 0.1 --idCutOff 0.80 --numThreads 15 --lqMismatches 2 --oneBaseIndel 5 --twoBaseIndel 2 --map --recalcConsensus --checkChimeras --overWriteDir --writeInitialClusters --sizeCutOff 10  --fastq  ${x} --dout clustering_${x%%.fastq.gz} --postCollapseForceOneComp; done;
elucidator rBind --contains clusterNames.tab.txt --recursive --delim tab --header --overWrite --out allClusterNames.tab.txt

```

## Renaming  

Type the reads based on their clustering  

```{r}
allClusterNames = readr::read_tsv("allClusterNames.tab.txt")

allClusterNames_filt = allClusterNames %>%
  filter(!grepl("_t0\\.[0-9]*", readName)) %>%
  filter("Pf3D7_11_v3-1927700-1928008-for__var-0.fastq.2_t12" != clusterName) %>% 
  mutate(region = gsub("\\.fastq.*", "", clusterName)) %>%
  mutate(clusterID = gsub(".*\\.fastq\\.", "", clusterName)) %>%
  mutate(clusterID = gsub("_t[0-9\\.]*", "", clusterID)) %>% 
  mutate(regionID = paste0(region, ".", clusterID)) 

illuminaExtraction = readr::read_tsv("illumina_extraction_nameKey.tab.txt") %>%
  mutate(newName = paste0("illumina_", newName))
pacbioExtraction = readr::read_tsv("pacbio_extraction_nameKey.tab.txt") %>%
  mutate(newName = paste0("pacbio_", newName))
nanoExtraction = readr::read_tsv("nano_extraction_nameKey.tab.txt") %>%
  mutate(newName = paste0("nano_", newName))

bothExtraction = bind_rows(pacbioExtraction,
                           nanoExtraction)


bothExtraction_filt = bothExtraction %>% 
  filter(newName %in% allClusterNames_filt$readName) %>% 
  mutate(genomeID = paste0(`#chrom`, "-", start, "-", end)) %>% 
  left_join(allClusterNames_filt %>% 
              rename(newName = readName) %>% 
              select(newName, region, regionID)) %>% 
  select(-newName) %>% 
  spread(region, regionID) %>% 
  replace_na(
    list(
    `Pf3D7_11_v3-1919677-1919761-for__var-0` = "notTypeable",
    `Pf3D7_11_v3-1921054-1921153-for__var-0` = "notTypeable",
    `Pf3D7_11_v3-1924664-1924740-for__var-0` = "notTypeable", 
    `Pf3D7_11_v3-1927700-1928008-for__var-0` = "notTypeable")
  )

illuminaExtraction_filt = illuminaExtraction %>% 
  filter(newName %in% allClusterNames_filt$readName) %>% 
  mutate(genomeID = paste0(`#chrom`, "-", start, "-", end)) %>% 
  left_join(allClusterNames_filt %>% 
              rename(newName = readName) %>% 
              select(newName, region, regionID)) %>% 
  select(-newName) %>% 
  spread(region, regionID) %>% 
  replace_na(
    list(
    `Pf3D7_11_v3-1919677-1919761-for__var-0` = "notTypeable",
    `Pf3D7_11_v3-1921054-1921153-for__var-0` = "notTypeable",
    `Pf3D7_11_v3-1924664-1924740-for__var-0` = "notTypeable", 
    `Pf3D7_11_v3-1927700-1928008-for__var-0` = "notTypeable")
  )

bothExtraction_filt_counts = bothExtraction_filt %>% 
  group_by(`Pf3D7_11_v3-1919677-1919761-for__var-0`, 
          `Pf3D7_11_v3-1921054-1921153-for__var-0`, 
          `Pf3D7_11_v3-1924664-1924740-for__var-0`, 
          `Pf3D7_11_v3-1927700-1928008-for__var-0`) %>% 
  count()

create_dt(bothExtraction_filt_counts)
```


## Typed intersected reads  


Typing the reads by the variants within each of the 4 regions, the current region typing the read by is colored pink in the shared region below. If a read didn't cover the region or there was too much error to confidently type it then it's colored as "untypeable". 
```{r}
#| fig-column: screen-inset-shaded
#| fig-width: 30
#| fig-height: 15

outBound = 5000

bothExtraction_filt_mod = bothExtraction_filt %>% 
  group_by(`#chrom`) %>% 
  arrange(`#chrom`, start, end )%>% 
  mutate(rowid = row_number()) %>% 
  mutate(X1 = `#chrom`) %>%
  left_join(
    homologousRegionWithHybrid %>%
      rename(sharedStart = X2,
             sharedStop = X3) %>%
      select(X1, starts_with("shared"))
  ) %>%
  filter((start < sharedStart & end > sharedStart)
         |
           (start < sharedStop & end > sharedStop)) %>%
  group_by(X1) %>%
  mutate(rowid = row_number()) %>%
  group_by() %>% 
  mutate(startDiff = sharedStart - start, 
         endDiff = sharedStop - end) %>% 
  mutate(start = ifelse(start < sharedStart - outBound, sharedStart - outBound, start), 
         end =   ifelse(end > sharedStop + outBound, sharedStop + outBound, end))

illuminaExtraction_filt_mod = illuminaExtraction_filt %>% 
  group_by(`#chrom`) %>% 
  arrange(`#chrom`, start, end )%>% 
  mutate(rowid = row_number()) %>% 
  mutate(X1 = `#chrom`) %>%
  left_join(
    homologousRegionWithHybrid %>%
      rename(sharedStart = X2,
             sharedStop = X3) %>%
      select(X1, starts_with("shared"))
  ) 

```

```{r}
#| fig-column: screen-inset-shaded
#| fig-width: 30
#| fig-height: 15


ggplot() + 
  geom_rect(aes(xmin = start, xmax = end, 
                ymin = rowid, 
                ymax = rowid + 1, 
                fill = `Pf3D7_11_v3-1919677-1919761-for__var-0`), 
            data = bothExtraction_filt_mod)  + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#2271B2",
            data = homologousRegionWithHybrid) + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#F748A5",
            data = multiRegion) + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#F0E442",
            data = multiRegion %>% 
              filter(X4 == "Pf3D7_11_v3-1919677-1919761-for__var-0"))  + 
  sofonias_theme + 
  facet_wrap(~X1, scales = "free") + 
  scale_fill_manual(values = c("#3DB7E9", "#D55E00", "#9F0162"),
    guide = guide_legend(
      ncol = 1
    )
  ) + 
  labs(title = "Pf3D7_11_v3-1919677-1919761-for__var-0")

ggplot() + 
  geom_rect(aes(xmin = start, xmax = end, 
                ymin = rowid, 
                ymax = rowid + 1, 
                fill = `Pf3D7_11_v3-1921054-1921153-for__var-0`), 
            data = bothExtraction_filt_mod)  + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#2271B2",
            data = homologousRegionWithHybrid) + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#F748A5",
            data = multiRegion) + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#F0E442",
            data = multiRegion %>% 
              filter(X4 == "Pf3D7_11_v3-1921054-1921153-for__var-0"))  + 
  sofonias_theme + 
  facet_wrap(~X1, scales = "free") + 
  scale_fill_manual(values = c("#3DB7E9", "#D55E00", "#9F0162"),
    guide = guide_legend(
      ncol = 1
    )
  ) + 
  labs(title = "Pf3D7_11_v3-1921054-1921153-for__var-0")


ggplot() + 
  geom_rect(aes(xmin = start, xmax = end, 
                ymin = rowid, 
                ymax = rowid + 1, 
                fill = `Pf3D7_11_v3-1924664-1924740-for__var-0`), 
            data = bothExtraction_filt_mod)  + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#2271B2",
            data = homologousRegionWithHybrid) + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#F748A5",
            data = multiRegion)+ 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#F0E442",
            data = multiRegion %>% 
              filter(X4 == "Pf3D7_11_v3-1924664-1924740-for__var-0")) + 
  sofonias_theme + 
  facet_wrap(~X1, scales = "free") + 
  scale_fill_manual(values = c("#3DB7E9", "#D55E00", "#9F0162"),
    guide = guide_legend(
      ncol = 1
    )
  ) + 
  labs(title = "Pf3D7_11_v3-1924664-1924740-for__var-0")

ggplot() + 
  geom_rect(aes(xmin = start, xmax = end, 
                ymin = rowid, 
                ymax = rowid + 1, 
                fill = `Pf3D7_11_v3-1927700-1928008-for__var-0`), 
            data = bothExtraction_filt_mod)  + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#2271B2",
            data = homologousRegionWithHybrid) + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#F748A5",
            data = multiRegion)+ 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#F0E442",
            data = multiRegion %>% 
              filter(X4 == "Pf3D7_11_v3-1927700-1928008-for__var-0")) + 
  sofonias_theme + 
  facet_wrap(~X1, scales = "free") + 
  scale_fill_manual(values = c("#3DB7E9", "#D55E00", "#9F0162"),
    guide = guide_legend(
      ncol = 1
    )
  ) + 
  labs(title = "Pf3D7_11_v3-1927700-1928008-for__var-0")


```



# Processing on pure hybrid 

The same exact work flow as above but except hybrid genome was modified to include only the expected regular chromosome 11 and the hybrid 13-11 (which is the expected hybrid genome for SD01). This way it is easier to deal with less multi mappers when so much genome was duplicated to map to. 

```{bash, eval = F}
cd /tank/projects/plasmodium/falciparum/hrp/hrp3_deletion/nanopore
mkdir purehybridAlnsMinimap2
minimap2 -x map-ont -t 40 -a /tank/data/genomes/plasmodium/genomes/pf_hybrid/genomes/Pf3D7_pure_11-13_13-11_hybrid.fasta rawFastq/PfSD01PermethION.fastq.gz | samtools sort --threads 40 -o purehybridAlnsMinimap2/PfSD01PermethION.sorted.bam 
samtools index purehybridAlnsMinimap2/PfSD01PermethION.sorted.bam 

cd /tank/data/plasmodium/falciparum/pfpubdata/malariagen_pacbio_assemblies 
mkdir purehybridAlnsMinimap2
minimap2 -x map-hifi -t 48 -a /tank/data/genomes/plasmodium/genomes/pf_hybrid/genomes/Pf3D7_pure_11-13_13-11_hybrid.fasta rawFastq/ERR1036246.fastq.gz   | samtools sort --threads 48 -o purehybridAlnsMinimap2/PfSD01.sorted.bam && samtools index purehybridAlnsMinimap2/PfSD01.sorted.bam 

```



```{bash, eval = F}
cd /tank/projects/plasmodium/falciparum/hrp/hrp3_deletion/nanopore 
elucidator bamToBed --bam purehybridAlnsMinimap2/PfSD01PermethION.sorted.bam | elucidator getOverlappingBedRegions --bed STDIN --intersectWithBed /tank/projects/plasmodium/falciparum/hrp/hrp3_deletion/nanopore/finalHrpSubwindows_regions_withSD01MultiInShared_inPureHybrid.bed --overWrite --out PfSD01PermethION_intersecting_withSD01MultiInShared_pureHybrid.bed 

cd /tank/data/plasmodium/falciparum/pfpubdata/malariagen_pacbio_assemblies 
elucidator bamToBed --bam purehybridAlnsMinimap2/PfSD01.sorted.bam | elucidator getOverlappingBedRegions --bed STDIN --intersectWithBed /tank/projects/plasmodium/falciparum/hrp/hrp3_deletion/nanopore/finalHrpSubwindows_regions_withSD01MultiInShared_inPureHybrid.bed --overWrite --out PfSD01Pacbio_intersecting_withSD01MultiInShared_pureHybrid.bed

```

```{r}
homologousRegionWithPureHybrid = readr::read_tsv("../../../../sharedBetween11_and_13/investigatingChrom11Chrom13/shared_11_13_region_onPureHybrid.bed", col_names = F, skip = 1)
multiRegionPureHybrid = readr::read_tsv("../../../../Windows_Surrounding_HRP2_3_deletions/windowAnalysis/finalHrpSubwindows_regions_withSD01MultiInShared_inPureHybrid.bed", col_names = F)
```

```{r}
PfSD01PermethION = readr::read_tsv("againstPureHybrid/PfSD01PermethION_intersecting_withSD01MultiInShared_pureHybrid.bed", col_names = F) %>% 
  group_by(X1) %>% 
  mutate(rowid = row_number()) %>% 
  group_by()
PfSD01Pacbio = readr::read_tsv("againstPureHybrid/PfSD01Pacbio_intersecting_withSD01MultiInShared_pureHybrid.bed", col_names = F) %>% 
  group_by(X1) %>% 
  mutate(rowid = row_number()) %>% 
  group_by()

PfSD01PermethION_filt = PfSD01PermethION %>%
  left_join(
    homologousRegionWithHybrid %>%
      rename(sharedStart = X2,
             sharedStop = X3) %>%
      select(X1, starts_with("shared"))
  ) %>%
  filter((X2 < sharedStart & X3 > sharedStart)
         |
           (X2 < sharedStop & X3 > sharedStop)) %>%
  group_by(X1) %>%
  mutate(rowid = row_number()) %>%
  group_by()

PfSD01Pacbio_filt = PfSD01Pacbio %>%
  left_join(
    homologousRegionWithHybrid %>%
      rename(sharedStart = X2,
             sharedStop = X3) %>%
      select(X1, starts_with("shared"))
  ) %>%
  filter((X2 < sharedStart & X3 > sharedStart)
         |
           (X2 < sharedStop & X3 > sharedStop)) %>%
  group_by(X1) %>%
  mutate(rowid = row_number()) %>%
  group_by()

```

### Nanopore  

```{r}
#| fig-column: screen-inset-shaded

ggplot() + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = rowid, 
                ymax = rowid + 1), 
            data = PfSD01PermethION_filt)  + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#2271B2",
            data = homologousRegionWithPureHybrid) + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#F748A5",
            data = multiRegionPureHybrid) + 
  sofonias_theme + 
  facet_wrap(~X1, scales = "free")
```

### Pacbio  

```{r}
#| fig-column: screen-inset-shaded

ggplot() + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = rowid, 
                ymax = rowid + 1), 
            data = PfSD01Pacbio_filt)  + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#2271B2",
            data = homologousRegionWithPureHybrid) + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#F748A5",
            data = multiRegionPureHybrid)+ 
  sofonias_theme + 
  facet_wrap(~X1, scales = "free")
```

## Extracting and typing reads  

```{bash, eval = F}

cd /tank/projects/plasmodium/falciparum/hrp/hrp3_deletion/nanopore
mkdir PfSD01_withSD01MultiInShared_extractions_pureHybrid
cd PfSD01_withSD01MultiInShared_extractions_pureHybrid

elucidator BamGetSpanningReadsForRegionLongReads --bam /tank/projects/plasmodium/falciparum/hrp/hrp3_deletion/nanopore/purehybridAlnsMinimap2/PfSD01PermethION.sorted.bam --bed /tank/projects/plasmodium/falciparum/hrp/hrp3_deletion/nanopore/finalHrpSubwindows_regions_withSD01MultiInShared_inPureHybrid.bed --trimToRegion --overWriteDir --dout PfSD01_nano_finalHrpSubwindows_regions_withSD01MultiInShared_inPureHybrid_regions --rename

elucidator BamGetSpanningReadsForRegionLongReads --bam /tank/data/plasmodium/falciparum/pfpubdata/malariagen_pacbio_assemblies/purehybridAlnsMinimap2/PfSD01.sorted.bam --bed /tank/projects/plasmodium/falciparum/hrp/hrp3_deletion/nanopore/finalHrpSubwindows_regions_withSD01MultiInShared_inPureHybrid.bed --trimToRegion --overWriteDir --dout PfSD01_pacbio_finalHrpSubwindows_regions_withSD01MultiInShared_inPureHybrid_regions --rename

elucidator BamGetSpanningReadsForRegionLongReads --bam /tank/data/plasmodium/falciparum/pfdata/bams/PfSD01.sorted.bam --bed /tank/projects/plasmodium/falciparum/hrp/hrp3_deletion/nanopore/finalHrpSubwindows_regions_withSD01MultiInShared.bed --trimToRegion --overWriteDir --dout PfSD01_illumina_finalHrpSubwindows_regions_withSD01MultiInShared_regions --rename

mkdir PfSD01_combined_finalHrpSubwindows_regions_withSD01MultiInShared_inPureHybrid_regions

for x in PfSD01_illumina_finalHrpSubwindows_regions_withSD01MultiInShared_regions/Pf*gz; do elucidator prependNames --fastqgz ${x} --overWrite --prefix illumina_; done; 
for x in PfSD01_nano_finalHrpSubwindows_regions_withSD01MultiInShared_inPureHybrid_regions/Pf*gz; do elucidator prependNames --fastqgz ${x} --overWrite --prefix nano_; done; 
for x in PfSD01_pacbio_finalHrpSubwindows_regions_withSD01MultiInShared_inPureHybrid_regions/Pf*gz; do elucidator prependNames --fastqgz ${x} --overWrite --prefix pacbio_; done; 

for x in PfSD01_illumina_finalHrpSubwindows_regions_withSD01MultiInShared_regions/prepended_Pf3D7*gz; do cat ${x} PfSD01_nano_finalHrpSubwindows_regions_withSD01MultiInShared_inPureHybrid_regions/$(basename ${x}) PfSD01_pacbio_finalHrpSubwindows_regions_withSD01MultiInShared_inPureHybrid_regions/$(basename ${x}) > PfSD01_combined_finalHrpSubwindows_regions_withSD01MultiInShared_inPureHybrid_regions/$(basename ${x} | sed 's/prepended_//g'); done;

cat PfSD01_nano_finalHrpSubwindows_regions_withSD01MultiInShared_inPureHybrid_regions/prepended_Pf3D7_11_v3-1927700-1928008-for__var-0.fastq.gz  PfSD01_pacbio_finalHrpSubwindows_regions_withSD01MultiInShared_inPureHybrid_regions/prepended_Pf3D7_11_v3-1927700-1928008-for__var-0.fastq.gz  > PfSD01_combined_finalHrpSubwindows_regions_withSD01MultiInShared_inPureHybrid_regions/Pf3D7_11_v3-1927700-1928008-for__var-0.fastq.gz

rsync -raPh PfSD01_illumina_finalHrpSubwindows_regions_withSD01MultiInShared_regions/nameKey.tab.txt PfSD01_combined_finalHrpSubwindows_regions_withSD01MultiInShared_inPureHybrid_regions/illumina_extraction_nameKey.tab.txt
rsync -raPh PfSD01_nano_finalHrpSubwindows_regions_withSD01MultiInShared_inPureHybrid_regions/nameKey.tab.txt PfSD01_combined_finalHrpSubwindows_regions_withSD01MultiInShared_inPureHybrid_regions/nano_extraction_nameKey.tab.txt
rsync -raPh PfSD01_pacbio_finalHrpSubwindows_regions_withSD01MultiInShared_inPureHybrid_regions/nameKey.tab.txt PfSD01_combined_finalHrpSubwindows_regions_withSD01MultiInShared_inPureHybrid_regions/pacbio_extraction_nameKey.tab.txt

cd PfSD01_combined_finalHrpSubwindows_regions_withSD01MultiInShared_inPureHybrid_regions 
for x in Pf3D7_11_v3-19*gz; do elucidatorlab kmerClusteringRate --qualThres 7,5 --cutOff 0.1 --idCutOff 0.80 --numThreads 15 --lqMismatches 2 --oneBaseIndel 5 --twoBaseIndel 2 --map --recalcConsensus --checkChimeras --overWriteDir --writeInitialClusters --sizeCutOff 10  --fastq  ${x} --dout clustering_${x%%.fastq.gz} --postCollapseForceOneComp; done;
elucidator rBind --contains clusterNames.tab.txt --recursive --delim tab --header --overWrite --out allClusterNames.tab.txt

```

## Typing    
```{r}
allClusterNames = readr::read_tsv("againstPureHybrid/allClusterNames.tab.txt")

allClusterNames_filt = allClusterNames %>%
  filter(!grepl("_t0\\.[0-9]*", readName)) %>%
  filter("Pf3D7_11_v3-1927700-1928008-for__var-0.fastq.2_t12" != clusterName) %>% 
  mutate(region = gsub("\\.fastq.*", "", clusterName)) %>%
  mutate(clusterID = gsub(".*\\.fastq\\.", "", clusterName)) %>%
  mutate(clusterID = gsub("_t[0-9\\.]*", "", clusterID)) %>% 
  mutate(regionID = paste0(region, ".", clusterID)) 

illuminaExtraction = readr::read_tsv("againstPureHybrid/illumina_extraction_nameKey.tab.txt") %>%
  mutate(newName = paste0("illumina_", newName))
pacbioExtraction = readr::read_tsv("againstPureHybrid/pacbio_extraction_nameKey.tab.txt") %>%
  mutate(newName = paste0("pacbio_", newName))
nanoExtraction = readr::read_tsv("againstPureHybrid/nano_extraction_nameKey.tab.txt") %>%
  mutate(newName = paste0("nano_", newName))

bothExtraction = bind_rows(pacbioExtraction,
                           nanoExtraction)


bothExtraction_filt = bothExtraction %>% 
  filter(newName %in% allClusterNames_filt$readName) %>% 
  mutate(genomeID = paste0(`#chrom`, "-", start, "-", end)) %>% 
  left_join(allClusterNames_filt %>% 
              rename(newName = readName) %>% 
              select(newName, region, regionID)) %>% 
  select(-newName) %>% 
  spread(region, regionID) %>% 
  replace_na(
    list(
    `Pf3D7_11_v3-1919677-1919761-for__var-0` = "notTypeable",
    `Pf3D7_11_v3-1921054-1921153-for__var-0` = "notTypeable",
    `Pf3D7_11_v3-1924664-1924740-for__var-0` = "notTypeable", 
    `Pf3D7_11_v3-1927700-1928008-for__var-0` = "notTypeable")
  )


illuminaExtraction_filt = illuminaExtraction %>% 
  filter(newName %in% allClusterNames_filt$readName) %>% 
  mutate(genomeID = paste0(`#chrom`, "-", start, "-", end)) %>% 
  left_join(allClusterNames_filt %>% 
              rename(newName = readName) %>% 
              select(newName, region, regionID)) %>% 
  select(-newName) %>% 
  spread(region, regionID) %>% 
  replace_na(
    list(
    `Pf3D7_11_v3-1919677-1919761-for__var-0` = "notTypeable",
    `Pf3D7_11_v3-1921054-1921153-for__var-0` = "notTypeable",
    `Pf3D7_11_v3-1924664-1924740-for__var-0` = "notTypeable", 
    `Pf3D7_11_v3-1927700-1928008-for__var-0` = "notTypeable")
  )

bothExtraction_filt_counts = bothExtraction_filt %>% 
  group_by(`Pf3D7_11_v3-1919677-1919761-for__var-0`, 
          `Pf3D7_11_v3-1921054-1921153-for__var-0`, 
          `Pf3D7_11_v3-1924664-1924740-for__var-0`, 
          `Pf3D7_11_v3-1927700-1928008-for__var-0`) %>% 
  count()

create_dt(bothExtraction_filt_counts)
```



```{r}
#| fig-column: screen-inset-shaded
#| fig-width: 30
#| fig-height: 15

outBound = 5000

bothExtraction_filt_mod = bothExtraction_filt %>% 
  group_by(`#chrom`) %>% 
  arrange(`#chrom`, start, end )%>% 
  mutate(rowid = row_number()) %>% 
  mutate(X1 = `#chrom`) %>%
  left_join(
    homologousRegionWithPureHybrid %>%
      rename(sharedStart = X2,
             sharedStop = X3) %>%
      select(X1, starts_with("shared"))
  ) %>%
  filter((start < sharedStart & end > sharedStart)
         |
           (start < sharedStop & end > sharedStop)) %>%
  group_by(X1) %>%
  mutate(rowid = row_number()) %>%
  group_by() %>% 
  mutate(startDiff = sharedStart - start, 
         endDiff = sharedStop - end) %>% 
  mutate(start = ifelse(start < sharedStart - outBound, sharedStart - outBound, start), 
         end =   ifelse(end > sharedStop + outBound, sharedStop + outBound, end))


illuminaExtraction_filt_mod = illuminaExtraction_filt %>% 
  group_by(`#chrom`) %>% 
  arrange(`#chrom`, start, end )%>% 
  mutate(rowid = row_number()) %>% 
  mutate(X1 = `#chrom`) %>%
  left_join(
    homologousRegionWithPureHybrid %>%
      rename(sharedStart = X2,
             sharedStop = X3) %>%
      select(X1, starts_with("shared"))
  )

ggplot() + 
  geom_rect(aes(xmin = start, xmax = end, 
                ymin = rowid, 
                ymax = rowid + 1, 
                fill = `Pf3D7_11_v3-1919677-1919761-for__var-0`), 
            data = bothExtraction_filt_mod)  + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#2271B2",
            data = homologousRegionWithPureHybrid) + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#F748A5",
            data = multiRegionPureHybrid) + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#F0E442",
            data = multiRegionPureHybrid %>% 
              filter(X4 == "Pf3D7_11_v3-1919677-1919761-for__var-0"))  + 
  sofonias_theme + 
  facet_wrap(~X1, scales = "free") + 
  scale_fill_manual(values = c("#3DB7E9", "#D55E00", "#9F0162"),
    guide = guide_legend(
      ncol = 1
    )
  ) + 
  labs(title = "Pf3D7_11_v3-1919677-1919761-for__var-0")

ggplot() + 
  geom_rect(aes(xmin = start, xmax = end, 
                ymin = rowid, 
                ymax = rowid + 1, 
                fill = `Pf3D7_11_v3-1921054-1921153-for__var-0`), 
            data = bothExtraction_filt_mod)  + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#2271B2",
            data = homologousRegionWithPureHybrid) + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#F748A5",
            data = multiRegionPureHybrid) + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#F0E442",
            data = multiRegionPureHybrid %>% 
              filter(X4 == "Pf3D7_11_v3-1921054-1921153-for__var-0"))  + 
  sofonias_theme + 
  facet_wrap(~X1, scales = "free") + 
  scale_fill_manual(values = c("#3DB7E9", "#D55E00", "#9F0162"),
    guide = guide_legend(
      ncol = 1
    )
  ) + 
  labs(title = "Pf3D7_11_v3-1921054-1921153-for__var-0")


ggplot() + 
  geom_rect(aes(xmin = start, xmax = end, 
                ymin = rowid, 
                ymax = rowid + 1, 
                fill = `Pf3D7_11_v3-1924664-1924740-for__var-0`), 
            data = bothExtraction_filt_mod)  + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#2271B2",
            data = homologousRegionWithPureHybrid) + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#F748A5",
            data = multiRegionPureHybrid)+ 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#F0E442",
            data = multiRegionPureHybrid %>% 
              filter(X4 == "Pf3D7_11_v3-1924664-1924740-for__var-0")) + 
  sofonias_theme + 
  facet_wrap(~X1, scales = "free") + 
  scale_fill_manual(values = c("#3DB7E9", "#D55E00", "#9F0162"),
    guide = guide_legend(
      ncol = 1
    )
  ) + 
  labs(title = "Pf3D7_11_v3-1924664-1924740-for__var-0")

ggplot() + 
  geom_rect(aes(xmin = start, xmax = end, 
                ymin = rowid, 
                ymax = rowid + 1, 
                fill = `Pf3D7_11_v3-1927700-1928008-for__var-0`), 
            data = bothExtraction_filt_mod)  + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#2271B2",
            data = homologousRegionWithPureHybrid) + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#F748A5",
            data = multiRegionPureHybrid)+ 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0),
            fill = "#F0E442",
            data = multiRegionPureHybrid %>% 
              filter(X4 == "Pf3D7_11_v3-1927700-1928008-for__var-0")) + 
  sofonias_theme + 
  facet_wrap(~X1, scales = "free") + 
  scale_fill_manual(values = c("#3DB7E9", "#D55E00", "#9F0162"),
    guide = guide_legend(
      ncol = 1
    )
  ) + 
  labs(title = "Pf3D7_11_v3-1927700-1928008-for__var-0")

```


## Linkage between variants  

Looking at the linkage between variants by linking each variant within the long reads to each subsequent variant and then linkng the chromosome to the first variant. It appears that all the variants 0.1 are correlated to chromosome 11 and all variants 0.0 are correlated to chromosome 13

```{r}
# Load package
library(networkD3)

bothExtraction_filt_counts_gat = bothExtraction_filt_counts %>% 
  gather(region, haplotype, 1:4)

bothExtraction_filt_counts_gat_hapCounts = bothExtraction_filt_counts_gat %>% 
  filter(haplotype != "notTypeable") %>% 
  group_by(haplotype) %>% 
  summarise(n = sum(n)) %>% 
  mutate(haplotypeCoded = row_number())

bothExtraction_filt_counts_hapLinkCounts = tibble()
# for(col1 in 1:3) {
#   for (col2 in (col1 + 1):4) {
#    bothExtraction_filt_counts_gat_subSelection = bothExtraction_filt_counts %>% 
#      group_by() %>% 
#      select(col1, col2, n)
#    colnames(bothExtraction_filt_counts_gat_subSelection)[1:2] = c("hap1", "hap2")
#    bothExtraction_filt_counts_gat_subSelection_filt = bothExtraction_filt_counts_gat_subSelection %>% 
#      filter(hap1 != "notTypeable", 
#             hap2 != "notTypeable") %>% 
#      group_by(hap1, hap2) %>% 
#      summarise(n = sum(n))
#    bothExtraction_filt_counts_hapLinkCounts = bind_rows(
#      bothExtraction_filt_counts_hapLinkCounts, 
#      bothExtraction_filt_counts_gat_subSelection_filt
#    )
#   }
# }

for(col1 in 1:3) {
  col2 = col1 + 1
   bothExtraction_filt_counts_gat_subSelection = bothExtraction_filt_counts %>% 
     group_by() %>% 
     select(col1, col2, n)
   colnames(bothExtraction_filt_counts_gat_subSelection)[1:2] = c("hap1", "hap2")
   bothExtraction_filt_counts_gat_subSelection_filt = bothExtraction_filt_counts_gat_subSelection %>% 
     filter(hap1 != "notTypeable", 
            hap2 != "notTypeable") %>% 
     group_by(hap1, hap2) %>% 
     summarise(n = sum(n))
   bothExtraction_filt_counts_hapLinkCounts = bind_rows(
     bothExtraction_filt_counts_hapLinkCounts, 
     bothExtraction_filt_counts_gat_subSelection_filt
   )
}

bothExtraction_filt_chromosomeLinkageInfo =  bothExtraction_filt %>% 
   select(`#chrom`, `Pf3D7_11_v3-1919677-1919761-for__var-0`) %>% 
   filter(`Pf3D7_11_v3-1919677-1919761-for__var-0` != "notTypeable") %>% 
  group_by(`#chrom`, `Pf3D7_11_v3-1919677-1919761-for__var-0`) %>% 
  count() 

bothExtraction_filt_chromosomeLinkageInfo_addToHapCounts = bothExtraction_filt_chromosomeLinkageInfo %>% 
  rename(haplotype = `#chrom` ) %>% 
  group_by(haplotype) %>% 
  summarise(n = sum(n))

bothExtraction_filt_counts_gat_hapCounts = bothExtraction_filt_counts_gat_hapCounts %>% 
  bind_rows(bothExtraction_filt_chromosomeLinkageInfo_addToHapCounts)


#zero based position source/target nodes 
bothExtraction_filt_counts_hapLinkCounts = bothExtraction_filt_counts_hapLinkCounts %>% 
  mutate(source = match(hap1, bothExtraction_filt_counts_gat_hapCounts$haplotype) -1) %>% 
  mutate(target = match(hap2, bothExtraction_filt_counts_gat_hapCounts$haplotype) -1)

bothExtraction_filt_chromosomeLinkageInfo = bothExtraction_filt_chromosomeLinkageInfo  %>% 
  rename(hap1 = `#chrom`, 
         hap2 = `Pf3D7_11_v3-1919677-1919761-for__var-0`) %>% 
  mutate(source = match(hap1, bothExtraction_filt_counts_gat_hapCounts$haplotype) -1) %>% 
  mutate(target = match(hap2, bothExtraction_filt_counts_gat_hapCounts$haplotype) -1)
bothExtraction_filt_counts_hapLinkCounts = bothExtraction_filt_counts_hapLinkCounts %>% 
  bind_rows(bothExtraction_filt_chromosomeLinkageInfo)
# adding in chromosome linkage to front 


bothExtraction_filt_counts_hapLinkCounts = as.data.frame(bothExtraction_filt_counts_hapLinkCounts)
bothExtraction_filt_counts_gat_hapCounts = as.data.frame(bothExtraction_filt_counts_gat_hapCounts)

# Thus we can plot it
p <- sankeyNetwork(
  Links = bothExtraction_filt_counts_hapLinkCounts,
  Nodes = bothExtraction_filt_counts_gat_hapCounts,
  Source = "source",
  Target = "target",
  Value = "n",
  NodeID = "haplotype",
  units = "reads",
  fontSize = 12,
  nodeWidth = 30
)

```


There is some cross linkage between the variants but this is low level and could either represent typing error or even possibly low level of continued recombination between these regions.  

```{r}
#| fig-column: screen-inset-shaded
#| column: screen-inset-shaded
p
```

```{r}
bothExtraction_filt_mod_sel = bothExtraction_filt_mod %>% 
  select(starts_with("Pf3D7"), `#chrom`, rowid) %>% 
  gather(region, hap, 1:4) %>% 
  mutate(hapVar = gsub(".*\\.", "", hap)) %>% 
  filter(hapVar != "notTypeable") %>% 
  group_by(`#chrom`, rowid,hapVar) %>% 
  count() %>% 
  group_by(`#chrom`, rowid) %>% 
  filter(n == max(n)) 

bothExtraction_filt_mod_sel_counts = bothExtraction_filt_mod_sel%>% 
  group_by(`#chrom`, rowid) %>% 
  mutate(varCounts = n()) %>% 
  mutate(typed = case_when(varCounts > 1 ~ "indeterminate", 
                           hapVar == 1 ~ "chr11", 
                           hapVar == 0 ~ "chr13")) 

bothExtraction_filt_mod_sel_counts_type = bothExtraction_filt_mod_sel_counts %>% 
  select(`#chrom`, rowid, typed) %>% 
  group_by() %>% 
  unique()
  
bothExtraction_filt_mod_typed = bothExtraction_filt_mod %>% 
  left_join(bothExtraction_filt_mod_sel_counts_type)


illuminaExtraction_filt_mod_sel = illuminaExtraction_filt_mod %>% 
  select(starts_with("Pf3D7"), `#chrom`, rowid) %>% 
  gather(region, hap, 1:3) %>% 
  mutate(hapVar = gsub(".*\\.", "", hap)) %>% 
  filter(hapVar != "notTypeable") %>% 
  group_by(`#chrom`, rowid,hapVar) %>% 
  count() %>% 
  group_by(`#chrom`, rowid) %>% 
  filter(n == max(n)) 

illuminaExtraction_filt_mod_sel_counts = illuminaExtraction_filt_mod_sel%>% 
  group_by(`#chrom`, rowid) %>% 
  mutate(varCounts = n()) %>% 
  mutate(typed = case_when(varCounts > 1 ~ "indeterminate", 
                           hapVar == 1 ~ "chr11", 
                           hapVar == 0 ~ "chr13")) 

illuminaExtraction_filt_mod_sel_counts_type = illuminaExtraction_filt_mod_sel_counts %>% 
  select(`#chrom`, rowid, typed) %>% 
  group_by() %>% 
  unique()
  
illuminaExtraction_filt_mod_typed = illuminaExtraction_filt_mod %>% 
  left_join(illuminaExtraction_filt_mod_sel_counts_type)


```

## Plotting by strongest chromosome association  

Colored by the chromosome associated variation (chr11 vs chr13). If a read has multiple loci and there is a tie between the different chromosomes then it is colored indeterminate. 

```{r}
#| fig-column: screen-inset-shaded
chrVarPlot = ggplot() + 
  geom_rect(aes(xmin = start, xmax = end, 
                ymin = rowid, 
                ymax = rowid + 1, 
                fill = typed),
            color = "black",
            data = bothExtraction_filt_mod_typed %>% 
              mutate(typed = factor(typed, levels = c("indeterminate", "chr13", "chr11"))))  + 
  scale_fill_manual(values = c("#3DB7E9", "#D55E00", "#9F0162"),
    name= "Associated Chromosome\nVariation",
    guide = guide_legend(
      ncol = 1
    ) ) + 
  ggnewscale::new_scale_fill() + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0, 
                fill = "Duplicated Region"),
            color = "black",
            #fill = "#2271B2",
            data = homologousRegionWithPureHybrid) + 
  geom_rect(aes(xmin = X2, xmax = X3, 
                ymin = -10, 
                ymax = 0, 
                fill = "SD01 Variant Sites"),
            #color = "black",
            #fill = "#F748A5",
            data = multiRegionPureHybrid) + 
  scale_fill_manual("Genomic", values = c("Duplicated Region" = "#2271B2", "SD01 Variant Sites" = "#F748A5")) + 
  guides(fill = guide_legend(ncol = 1)) + 
  sofonias_theme + 
  theme(axis.text.x = element_text(size=12, angle = -45, vjust = 0.5, hjust = 0)) + 
  facet_wrap(~X1, scales = "free") 
  
print(chrVarPlot)
```

```{r}
pdf("SD01_variation_associated_reads_plot.pdf", width = 8, height = 6, useDingbats = F)
print(chrVarPlot)
dev.off()
```

```{r}
# pacbio and nanopore  
bothExtraction_filt_mod_typedchr11 = bothExtraction_filt_mod_typed %>% 
  filter(typed == "chr11")
write_tsv(bothExtraction_filt_mod_typedchr11, "nanoPacbio_typed11.tab.txt")

bothExtraction_filt_mod_typedchr13 = bothExtraction_filt_mod_typed %>% 
  filter(typed == "chr13")
write_tsv(bothExtraction_filt_mod_typedchr13, "nanoPacbio_typed13.tab.txt")

bothExtraction_filt_mod_typedchr13 = bothExtraction_filt_mod_typed %>% 
  filter(typed == "chr13")
write_tsv(bothExtraction_filt_mod_typedchr13, "nanoPacbio_typed13.tab.txt")



# illumina 
illuminaExtraction_filt_mod_typedchr11 = illuminaExtraction_filt_mod_typed %>% 
  filter(typed == "chr11")
write_tsv(illuminaExtraction_filt_mod_typedchr11, "illumina_typed11.tab.txt")

illuminaExtraction_filt_mod_typedchr13 = illuminaExtraction_filt_mod_typed %>% 
  filter(typed == "chr13")
write_tsv(illuminaExtraction_filt_mod_typedchr13, "illumina_typed13.tab.txt")

illuminaExtraction_filt_mod_typedchr13 = illuminaExtraction_filt_mod_typed %>% 
  filter(typed == "chr13")
write_tsv(illuminaExtraction_filt_mod_typedchr13, "illumina_typed13.tab.txt")


```

```{bash, eval = F}
cd /tank/projects/plasmodium/falciparum/hrp/hrp3_deletion/nanopore/extractingFromRawFastqs

elucidator printCol --file nanoPacbio_typed11.tab.txt --delim tab --header --columnName oldName --unique --sort --out chr11_names.txt --overWrite
elucidator printCol --file nanoPacbio_typed13.tab.txt --delim tab --header --columnName oldName --unique --sort --out chr13_names.txt --overWrite

nohup elucidator extractByName --fastqgz ../pacbioReads/combined.fastq.gz --names chr11_names.txt --out pacbio_chr11_combined.fastq.gz --overWrite --trimAtWhiteSpace  &
nohup elucidator extractByName --fastqgz ../pacbioReads/combined.fastq.gz --names chr13_names.txt --out pacbio_chr13_combined.fastq.gz --overWrite --trimAtWhiteSpace  &
nohup elucidator extractByName --fastqgz ../pacbioReads/combined.fastq.gz --names chr11_names.txt --out pacbio_allButChr11_combined.fastq.gz --overWrite --trimAtWhiteSpace --excluding &
nohup elucidator extractByName --fastqgz ../pacbioReads/combined.fastq.gz --names chr13_names.txt --out pacbio_allButChr13_combined.fastq.gz --overWrite --trimAtWhiteSpace --excluding &


nohup elucidator extractByName --fastqgz ../rawFastq/PfSD01PermethION.fastq.gz --names chr11_names.txt --out nano_chr11_PfSD01PermethION.fastq.gz --overWrite --trimAtWhiteSpace  &
nohup elucidator extractByName --fastqgz ../rawFastq/PfSD01PermethION.fastq.gz --names chr13_names.txt --out nano_chr13_PfSD01PermethION.fastq.gz --overWrite --trimAtWhiteSpace  &
nohup elucidator extractByName --fastqgz ../rawFastq/PfSD01PermethION.fastq.gz --names chr11_names.txt --out nano_allButChr11_PfSD01PermethION.fastq.gz --overWrite --trimAtWhiteSpace --excluding &
nohup elucidator extractByName --fastqgz ../rawFastq/PfSD01PermethION.fastq.gz --names chr13_names.txt --out nano_allButChr13_PfSD01PermethION.fastq.gz --overWrite --trimAtWhiteSpace --excluding &

cat nano_allButChr11_PfSD01PermethION.fastq.gz pacbio_chr13_combined.fastq.gz > ../combinedRawFastqs/PfSD01PermethION_withPacbio_forchr13.fastq.gz
cat nano_allButChr13_PfSD01PermethION.fastq.gz pacbio_chr11_combined.fastq.gz > ../combinedRawFastqs/PfSD01PermethION_withPacbio_forchr11.fastq.gz


```