SeekDeep genTargetInfoFromGenomes automatically generates several of the files used by the SeekDeep pipeline, mostly in the extraction step. It is also useful for checking that primers match against several reference genomes.
SeekDeep genTargetInfoFromGenomes is a helper function. Using it is not required, since all of these files can be created by hand, but it eases their creation. It needs bowtie2 and samtools to be installed and requires a directory of genomes (at least 1 genome fasta file). It takes the primer ID file required by SeekDeep extractor (column 1 target, column 2 forward_primer (5'-3'), column 3 reverse_primer (5'-3')) and extracts from the genomes the sequences expected with those primers.
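The three-column primer ID file described above can be written by hand or generated. The following is a minimal Python sketch, not part of SeekDeep itself. The target names and primer sequences are hypothetical placeholders, and the tab delimiter and header row are assumptions based on the column description above; check them against an ID file that is known to work with your SeekDeep version.

```python
import csv
import io

# Hypothetical targets and placeholder primer sequences, for illustration only.
primers = [
    ("targetA", "ACGTACGTACGTACGTAC", "TGCATGCATGCATGCATG"),
    ("targetB", "GGATCCGATTACAGGCAT", "CCTAGGAATTCGTTAGCA"),
]


def write_primer_ids(rows, handle):
    """Write a tab-delimited table: target, forward_primer, reverse_primer."""
    # Assumption: tab-delimited with a header row naming the three columns.
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["target", "forward_primer", "reverse_primer"])
    for target, forward, reverse in rows:
        # Both primers are given 5'-3', as the column description requires.
        writer.writerow([target, forward.upper(), reverse.upper()])


# Write to an in-memory buffer; pass an open file handle to write to disk instead.
buffer = io.StringIO()
write_primer_ids(primers, buffer)
id_table = buffer.getvalue()
print(id_table, end="")
```

To create a file on disk, call write_primer_ids with a handle opened with open("ids.tab.txt", "w", newline=""), where ids.tab.txt is likewise a hypothetical file name.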
Plasmodium Example
Here is an example running on the Plasmodium falciparum 3D7 reference genome for several P. falciparum
--genomeDir - A directory of genome files, each of which must end in .fasta (the .fasta extension is used to determine which files in this directory to treat as genomes); the genomes will be automatically bowtie2 indexed if they aren’t already
--primers - The ID file containing the primers used
--pairedEndLength - The read length of the input sequencing for paired-end analysis (e.g. 250 for 2x250)
Optional options
--gffDir - a directory of gff files to add gene annotation information
--numThreads - The number of threads/CPUs to use in parallel
--dout - The name of the output directory
--lenCutOffSizeExpand - By default the max and min length cut-offs are set to +/- 20 of the max and min sizes of the extracted sequences. Since msp1 here has large length variation, this has been increased to +/- 50, though this could also be solved by using more reference genomes than just Pf3D7.
--sizeLimit - Since by default this program is used for short amplicon analysis, there is a default size limit for the extraction length (1000), meaning possible extractions longer than 1000 bp will not be extracted. This limit can be changed with this flag.
--longRangeAmplicon - This allows skipping --pairedEndLength and increases the size limit to 10,000 from the default 1,000.
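As an illustration of how the options above fit together, here is a small Python sketch that assembles an argument list from them. It uses only flags named in this section. The paths and values are hypothetical, and treating --longRangeAmplicon as a switch without a value is an assumption. The sketch only prints the command; it does not run SeekDeep.

```python
import shlex


def build_command(genome_dir, primers, paired_end_length=None, gff_dir=None,
                  num_threads=None, dout=None, long_range=False):
    """Assemble an argument list for SeekDeep genTargetInfoFromGenomes."""
    cmd = ["SeekDeep", "genTargetInfoFromGenomes",
           "--genomeDir", genome_dir,
           "--primers", primers]
    if long_range:
        # Assumption: --longRangeAmplicon is a switch that takes no value.
        cmd.append("--longRangeAmplicon")
    elif paired_end_length is not None:
        cmd += ["--pairedEndLength", str(paired_end_length)]
    else:
        raise ValueError("give paired_end_length or set long_range=True")
    # Optional flags are only added when a value was supplied.
    optional = (("--gffDir", gff_dir),
                ("--numThreads", num_threads),
                ("--dout", dout))
    for flag, value in optional:
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


# Hypothetical paths and values, for illustration only.
cmd = build_command("genomes", "ids.tab.txt", paired_end_length=250,
                    num_threads=4, dout="extractedRefSeqs")
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) once SeekDeep, bowtie2 and samtools are installed and the input paths exist.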
Output
All of the output will be in one directory. Within this directory there will be a directory for each target given in the ID file that contains several useful files including:
extractionCounts.tab.txt - A file giving how many times each primer hit and the number of extractions per genome
forwardPrimer.fasta - The forward primer used; if it contained degenerate bases, this file will contain every possible primer
reversePrimer.fasta - The reverse primer used; if it contained degenerate bases, this file will contain every possible primer
genomeLocations - The genomic locations of the extractions
[TARGET_NAME].fasta - A fasta file of the extractions
[TARGET_NAME]_primersRemoved.fasta - A fasta file of the extractions with the primers removed
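The forwardPrimer.fasta and reversePrimer.fasta entries above mention that a primer with degenerate bases is written out as every possible primer. A minimal Python sketch of that expansion, using the standard IUPAC nucleotide codes, might look like the following. It is not SeekDeep's own implementation, and the example primer and the record names in the printed fasta are hypothetical.

```python
from itertools import product

# Standard IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_degenerate(primer):
    """Return every concrete sequence a degenerate primer can represent."""
    choices = [IUPAC[base] for base in primer.upper()]
    return ["".join(combo) for combo in product(*choices)]


# Hypothetical primer with two degenerate positions (R and Y).
variants = expand_degenerate("ACRTY")
for index, seq in enumerate(variants):
    # Record names are made up for this illustration.
    print(f">primer.{index}")
    print(seq)
```

The number of expanded primers is the product of the number of choices at each position, so a primer with many N positions expands quickly.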
There will also be a directory named forSeekDeep which will contain three files useful for SeekDeep runs:
lenCutOffs.txt - A file giving estimated length cut-offs, derived from the max and min lengths of the extracted reference sequences
overlapStatuses.txt - A file indicating the overlap possible given the paired-end sequencing length and the length of the reference sequences; required by SeekDeep extractorPairedEnd, see SeekDeep extractor and Illumina Paired Info Page
refSeqs - A directory of the extracted reference sequences, with one fasta file per target; this directory can be given to SeekDeep extractor to help filter out artifacts/contamination
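The arithmetic behind the length cut-offs and the overlap check can be sketched in a few lines of Python. This is only an illustration of the rules described in this section (min and max reference lengths widened by 20 by default, and whether two reads of the given length can overlap across an amplicon). It is not SeekDeep's code and does not reproduce the format of lenCutOffs.txt or overlapStatuses.txt. The reference lengths and the min_overlap parameter are hypothetical.

```python
def length_cutoffs(ref_lengths, expand=20):
    """Return (min, max) length cut-offs: the reference extremes widened by expand."""
    return min(ref_lengths) - expand, max(ref_lengths) + expand


def reads_can_overlap(amplicon_length, paired_end_length, min_overlap=1):
    """True if two reads of paired_end_length can overlap across the amplicon.

    min_overlap is a hypothetical parameter for this sketch: the smallest
    number of shared bases that counts as an overlap.
    """
    return 2 * paired_end_length - amplicon_length >= min_overlap


# Hypothetical lengths of the sequences extracted for one target.
ref_lengths = [312, 318, 330]

default_cutoffs = length_cutoffs(ref_lengths)          # default +/- 20
wide_cutoffs = length_cutoffs(ref_lengths, expand=50)  # widened, as done for msp1
print(default_cutoffs, wide_cutoffs)

# With 2x250 sequencing, every hypothetical reference above can be spanned.
print(all(reads_can_overlap(length, 250) for length in ref_lengths))
```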