SeekDeep genTargetInfoFromGenomes

SeekDeep genTargetInfoFromGenomes automatically generates several of the files used by the SeekDeep pipeline, mostly in the extraction step. It is also useful for checking that primers match against several reference genomes.

SeekDeep comes equipped with a helper function (SeekDeep genTargetInfoFromGenomes) that provides some of the files used by the program. Using it is not required, since all of these files can be created by hand, but it makes creating them easier. SeekDeep genTargetInfoFromGenomes needs bowtie2 and samtools to be installed and requires a directory of genomes containing at least one genome fasta file. It takes the primer ID file required by SeekDeep extractor (column 1 target, column 2 forward_primer (5'-3'), column 3 reverse_primer (5'-3')) and extracts from the genomes the sequences the primers are expected to produce.
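
As a rough illustration of the three-column layout described above, the following sketch writes a tab-delimited ID file with a single target and checks its column count. The header names simply mirror the column descriptions above, and the file name, the target name exampleTarget, and both primer sequences are placeholders invented for this example; check the SeekDeep extractor documentation for the exact header it expects.

```shell
# Write a hypothetical three-column, tab-delimited primer ID file.
# Header names, target name, and primer sequences are illustrative assumptions,
# not real primers.
ids_file="example_ids.tab.txt"
{
  printf 'target\tforward_primer\treverse_primer\n'
  printf 'exampleTarget\tACGTACGTACGTACGTACGT\tTGCATGCATGCATGCATGCA\n'
} > "$ids_file"

# Sanity check: count lines that do not have exactly three tab-separated columns.
bad_lines=$(awk -F'\t' 'NF != 3 { n++ } END { print n + 0 }' "$ids_file")
echo "lines without exactly 3 columns: $bad_lines"
```

A stray space in place of a tab is an easy mistake to make when writing this file by hand, which is why the check counts columns.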

Plasmodium Example

Here is an example run against the Plasmodium falciparum 3D7 reference genome using primers for several P. falciparum targets

Code

wget http://seekdeep.brown.edu/data/plasmodiumData/pf.tar.gz 
tar -zxvf pf.tar.gz
wget http://seekdeep.brown.edu/data/SeekDeepTutorialData/ver2_6_0/CamThaiGhanaDRC_2011_2013_drugRes/ids.tab.txt
SeekDeep genTargetInfoFromGenomes --gffDir pf/info/gff --genomeDir pf/genomes/ --primers ids.tab.txt --numThreads 7 --pairedEndLength 250 --dout extractedRefSeqs

Required options

--genomeDir - A directory of genome files, each of which must end in .fasta (the .fasta extension is used to determine which files in this directory to treat as genomes); genomes are automatically indexed with bowtie2 if they aren't already
--primers - The ID file containing the primers used
--pairedEndLength - The read length of the paired-end sequencing (e.g. 250 for 2x250)
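
Before a run it can help to confirm the prerequisites mentioned above: that bowtie2 and samtools are on the PATH and that the genome directory holds at least one .fasta file. The helpers below are a minimal sketch for that purpose; the function names are hypothetical and are not part of SeekDeep.

```shell
# Pre-flight helpers; the function names are hypothetical and not part of SeekDeep.

# Print the name of every listed tool that cannot be found on PATH.
missing_tools() {
  for _tool in "$@"; do
    command -v "$_tool" >/dev/null 2>&1 || printf '%s\n' "$_tool"
  done
}

# Count the regular files ending in .fasta directly inside a directory.
count_fasta_genomes() {
  find "$1" -maxdepth 1 -type f -name '*.fasta' | wc -l | tr -d ' '
}
```

For the Plasmodium example, running missing_tools bowtie2 samtools should print nothing, and count_fasta_genomes pf/genomes should report at least 1.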

Optional options

--gffDir - A directory of gff files used to add gene annotation information
--numThreads - The number of threads/CPUs to use in parallel
--dout - The name of the output directory
--lenCutOffSizeExpand - By default the min and max length cut-offs are set to 20 below the min and 20 above the max length of the extracted sequences. For a target with large length variation, such as msp1 here, this can be increased (e.g. to +/- 50), though the same problem can also be solved by using more reference genomes than just Pf3D7.
--sizeLimit - Since by default this program is used for short amplicon analysis, there is a default size limit of 1000 for the extraction length, meaning possible extractions longer than 1000 bp will not be extracted. This flag changes that limit.
--longRangeAmplicon - Removes the need to supply --pairedEndLength and increases the size limit to 10,000 from the default 1,000.
--useBlast - Uses blast to find primers, which tends to be a faster search, especially when allowing errors in primers
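
To make the default described for --lenCutOffSizeExpand concrete, here is a small sketch of the arithmetic: take the shortest and longest sequence in a fasta file and widen that range by a fixed number of bases. It illustrates the rule as described above and is not SeekDeep's own code; the function name len_cutoffs and its output layout are made up for this example.

```shell
# len_cutoffs FASTA [EXPAND]
# Prints "min_cutoff max_cutoff", i.e. (shortest - EXPAND) and (longest + EXPAND).
# Hypothetical helper for illustration; EXPAND defaults to 20.
# Handles sequences wrapped over several lines.
len_cutoffs() {
  awk -v expand="${2:-20}" '
    # Fold the record just finished into the running min and max.
    function record_done() {
      if (!have) return
      if (n == 0 || len < min) min = len
      if (n == 0 || len > max) max = len
      n++
    }
    /^>/ { record_done(); have = 1; len = 0; next }
    { gsub(/[ \t\r]/, ""); len += length($0) }
    END {
      record_done()
      if (n == 0) exit 1   # no sequences found
      print min - expand, max + expand
    }
  ' "$1"
}
```

Calling len_cutoffs on a target's extracted fasta with 50 as the second argument mirrors the +/- 50 widening mentioned for msp1.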

Output

All of the output will be in one directory. Within this directory there will be a directory for each target given in the ID file that contains several useful files including:

extractionCounts.tab.txt - A file giving how many times each primer hit and the number of extractions per genome
forwardPrimer.fasta - The forward primer used; if it contained degenerate bases, this file will contain every possible primer
reversePrimer.fasta - The reverse primer used; if it contained degenerate bases, this file will contain every possible primer
genomeLocations - The genomic locations of the extractions
[TARGET_NAME].fasta - A fasta file of the extractions
[TARGET_NAME]_primersRemoved.fasta - A fasta file of the extractions with the primers removed
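
The primer fasta files above list every possible primer when the input contains degenerate bases. The sketch below shows what such an expansion looks like using the standard IUPAC nucleotide codes; it only illustrates the idea and is not how SeekDeep itself produces these files. The function name expand_primer is hypothetical.

```shell
# expand_primer PRIMER
# Prints every non-degenerate sequence matching an IUPAC-coded primer, one per line.
# Hypothetical helper for illustration only.
expand_primer() {
  printf '%s\n' "$1" | awk '
    BEGIN {
      # Standard IUPAC nucleotide codes and the bases each one stands for.
      code["A"] = "A";   code["C"] = "C";   code["G"] = "G";   code["T"] = "T"
      code["R"] = "AG";  code["Y"] = "CT";  code["S"] = "CG";  code["W"] = "AT"
      code["K"] = "GT";  code["M"] = "AC";  code["B"] = "CGT"; code["D"] = "AGT"
      code["H"] = "ACT"; code["V"] = "ACG"; code["N"] = "ACGT"
    }
    {
      n = 1; out[1] = ""
      for (i = 1; i <= length($0); i++) {
        b = toupper(substr($0, i, 1))
        opts = (b in code) ? code[b] : b   # unknown characters pass through unchanged
        m = 0
        # Extend every partial sequence with each base allowed at this position.
        for (j = 1; j <= n; j++)
          for (k = 1; k <= length(opts); k++)
            grown[++m] = out[j] substr(opts, k, 1)
        for (j = 1; j <= m; j++) out[j] = grown[j]
        n = m
      }
      for (j = 1; j <= n; j++) print out[j]
    }'
}
```

The number of sequences multiplies with each degenerate position, so a primer with several N bases expands into a large file.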

There will also be a directory named forSeekDeep, which will contain two files and a directory useful for SeekDeep runs:

lenCutOffs.txt - A file giving estimated length cut-offs based on the max and min lengths of the extracted reference sequences
overlapStatuses.txt - A file indicating the overlap possible given the paired-end sequencing length and the length of the reference sequences; it is required by SeekDeep extractorPairedEnd, see SeekDeep extractor and Illumina Paired Info Page
refSeqs - A directory of the extracted reference sequences with a fasta file for each target; this directory can be given to SeekDeep extractor to help filter out artifacts/contamination
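
As a back-of-the-envelope check on why the paired-end length matters for overlap, two full-length reads of length L from opposite ends of an amplicon of length A share 2L - A bases when that value is positive. The sketch below computes this under simplifying assumptions (no quality trimming, both reads full length); it is not the logic SeekDeep uses to write overlapStatuses.txt, and the function name expected_overlap is hypothetical.

```shell
# expected_overlap READ_LEN AMPLICON_LEN
# Prints the number of bases two full-length paired reads would share,
# or 0 if the reads are too short to meet. Hypothetical helper for illustration.
expected_overlap() {
  _overlap=$(( 2 * $1 - $2 ))
  # Reads that do not reach each other have no overlap.
  if [ "$_overlap" -lt 0 ]; then _overlap=0; fi
  # The overlap cannot exceed the amplicon itself (reads longer than the amplicon).
  if [ "$_overlap" -gt "$2" ]; then _overlap=$2; fi
  echo "$_overlap"
}
```

For example, with 2x250 sequencing a 400 bp reference sequence leaves 100 overlapping bases, while a 600 bp one leaves none.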

Microbiome Example

Code

wget http://seekdeep.brown.edu/data/microbiomeData/ncbi_genomes_renamed.tar.gz 
tar -zxvf ncbi_genomes_renamed.tar.gz
wget http://seekdeep.brown.edu/data/microbiomeData/16s_variableRegions_targets.id.txt 
SeekDeep genTargetInfoFromGenomes --genomeDir ncbi_genomes_renamed --primers 16s_variableRegions_targets.id.txt --numThreads 7 --pairedEndLength 300 --dout extractedRefSeqs
