extractor’s objectives

The main purpose of extractor is to take raw data and demultiplex by sample barcodes if present and by primer pairs. It at most requires two arguments: an input file which can be fastq (--fastq) or fasta (--fasta) and an ID file (--id) that supplies primer pairs and barcodes if present. It will also do some quality control as well for length and quality scores.

Paired End Extraction

There is a separate extractor command for extracting paired end data and is simply called SeekDeep extractorPairedEnd and shares most of the command line options with regular single end SeekDeep extractor

Getting usage command line

Calling -help will give a help message on running the program

Code

SeekDeep extractor --help 
SeekDeep extractorPairedEnd --help

Also all flags in SeekDeep are case insensitive and so all the following would have the same results

Code

SeekDeep extractor --help 
SeekDeep extractor --HELP
SeekDeep extractor --HeLP
SeekDeep extractor --HeLp
SeekDeep extrActor --HeLP
SeekDeep ExtraCtor --HeLp
SeekDeep EXTRACTOR --HeLp

Format of id file

The id file is a tab delimited file that contains both primer pair info and optional barcode information
Primer information starts with a header line that must start with either gene or target and have two additional columns for forward and reverse primers, the line can start with either gene or target.

target  forward reverse
PFAMA1  CAGGGAAATGTCCAGTATT CTTGAACATAAAGTCAATTC
PFCSP   ACAATCAAGGTAATGGACAAGG  TTTTCAATATCATTTYCATAATCTAATT

Primers

The primers can contain ambigious bases (Y,R,etc.) and should be all upper case

Forward Primer

Forward primer should be in the 5`-3` direction.

Reverse Primer

The reverse primer should in in the 5`-3` direction as well.

MIDs

If data is multiplexed the MID name and corresponding barcode information comes after the primer information. Barcode information needs to start with a header id barcode which the id is the important name for parsing. Mids should be in the direction they would be found in if reads are found in the direction of the forward primer. It is always assumed that the mid and primer set up is MID-PRIMER. If this is not the case for your data contact Nick Hathaway (nicholas.hathaway@umassmed.edu) to see if he can add options for accommodating your data structure.

target  forward reverse
PFAMA1  CAGGGAAATGTCCAGTATT CTTGAACATAAAGTCAATTC
PFCSP   ACAATCAAGGTAATGGACAAGG  TTTTCAATATCATTTYCATAATCTAATT
id  barcode
MID01   ACGAGTGCGT
MID02   ACGCTCGACA
MID03   AGACGCACTC
MID04   AGCACTGTAG
MID05   ATCAGACACG
MID06   ATATCGCGAG
MID07   CGTGTCTCTA
MID08   CTCGCGTGTC
MID10   TCTCTATGCG
MID11   TGATACGTCT
MID13   CATAGTAGTG
MID14   CGAGAGATAC
MID15   ATACGACGTA
MID16   TCACGTACTA
MID17   CGTCTAGTAC
MID18   TCTACGTAGC

Output files

A directory will be created for the output of extractor, this defaults to the name of the file plus the work extractor and then the current date and time of the when the command was run. This can be changed with the --dout flag. In this directory will be the sequences and several files reporting on how the extraction when.

extractionProfile.tab.txt - Reports stats on extraction per mid and primer pair
extractionStats.tab.txt - Reports stats on the extraction overall
runLog_extractor.txt - Contains the date and where the command was run on and what it was and total run time
filteredOff - A directory containing the reads that didn’t meet filtering criteria or match the input primers/mids

Example of extractionProfile.tab.txt

cat: extraFiles/exampleExtraction/extractionProfile.tab.txt: No such file or directory

Example of extractionStats.tab.txt

cat: extraFiles/exampleExtraction/extractionStats.tab.txt: No such file or directory

The output files will be named by the using the name given in the gene column and the name in id column if the data is multiplexed.

Code

ls ./

PFAMA1MID01.fastq
PFAMA1MID02.fastq
PFAMA1MID03.fastq
PFAMA1MID04.fastq
PFAMA1MID05.fastq
PFAMA1MID06.fastq
PFAMA1MID07.fastq
PFAMA1MID08.fastq
extractionProfile.tab.txt
extractionStats.tab.txt
parametersUsed.txt
runLog_extractor.txt

Examples

Required files

SeekDeep extractor and SeekDeep extractorPairedEnd require an input sequence file and an identifier file.

Different input formats

Fastq Input

Code

SeekDeep extractor --fastq example.fastq --id ids.txt
SeekDeep extractor --fastqgz example.fastq.gz --id ids.txt

Fasta Input

Code

SeekDeep extractor --fasta example.fasta --id ids.txt

Fasta and Qual Input

Code

SeekDeep extractor --fasta example.fasta --qual example.fasta.qual --id ids.txt
#or
SeekDeep extractor --stub example --id ids.txt

Paired End

For paired end the only options are fastq or fastqgz and both mates have to be indicated. This also assumes that R1 sequences are in the 5`-3` direction of the top strand, and R2 are in the 5`-3` direction of the bottom strand.

Code

SeekDeep extractorPairedEnd --fastq1 example_R1.fastq --fastq2 example_R2.fastq --id ids.txt
SeekDeep extractorPairedEnd --fastq1gz example_R1.fastq.gz --fastq2gz example_R2.fastq.gz --id ids.txt

Multiplexed and multiple primer pairs

Multiplex

Multiplexing simply requires having MID/barcodes in id file, a header line is required that starts with id to indicate to SeekDeep extractor that barcodes are starting.

Code

cat id_file.tab.txt

target  forwardPrimer   reversePrimer
PFAMA1  CCATCAGGGAAATGTCCAGT    TTTCCTGCATGTCTTGAACA
id  barcode
MID01   ACGAGTGCGT
MID02   ACGCTCGACA
MID03   AGACGCACTC
MID04   AGCACTGTAG
MID05   ATCAGACACG
MID06   ATATCGCGAG
MID07   CGTGTCTCTA
MID08   CTCGCGTGTC

Code

SeekDeep extractor --fastq example.fastq --id id_file.tab.txt --multiplex

Multiple primer pairs

Multiple primer pairs is done simply by adding another primer info line, this can also be combined with multiplexing

Code

cat multipleGenePairs.id.txt

gene    forwardPrimer   reversePrimer
PFAMA1  CCATCAGGGAAATGTCCAGT    TTTCCTGCATGTCTTGAACA
PFMSP1  AACTAGAAGCTTTAGAAGATGCA ACATATGATTGGTTAAATCAAAG
id  barcode
MID01   ACGAGTGCGT
MID02   ACGCTCGACA
MID03   AGACGCACTC
MID04   AGCACTGTAG
MID05   ATCAGACACG
MID06   ATATCGCGAG
MID07   CGTGTCTCTA
MID08   CTCGCGTGTC

Code

SeekDeep extractor --fasta example.fasta --id multipleGenePairs.id.txt

Paired End

For paired reads an additional file is required that indicates if there is overlap between the mates. There are three possible overlap scenarios: noOverLap, r1EndsInR2 (normal overlap goal), and r1BeginsInR2 (happens with read through). If there is overlap SeekDeep extractorPairedEnd will stitch together the mates and if there is no overlap it will extract and have a _R1 and _R2 files for the target. The different overlap scenarios are required as there are many ways to generate artifacts in PCR (unspecific amplification, primer dimers, etc) that could lead to the wrong overlap and therefore reads that overlap in the way that they are expected to are given in the final output files and the rest assumed to be artifact. See Illumina Paired Info Page for a diagram on how sequences overlap.

The overlap status of the targets is given by a file with two columns, target and status. The order of columns does not matter and the case of the status column does not matter, the target column does. The name given in the target column must match with the names given in the id file.

Code

cat overlapStatuses.txt

target  status
PFAMA1  R1BeginsInR2
PFCSP   R1BeginsInR2

Code

SeekDeep extractor --fastq1 example_R1.fastq --fastq2 example_R2.fastq --id multipleGenePairs.id.txt --overlapStatusFnp overlapStatuses.txt

Filtering parameters

Length

Reads are filtered on expected length by choosing a minimum length (--minLen) and a maximum length (--maxLen), these default to be within 20% of the median input read lengths

Minimum Length

Code

SeekDeep extractor --fastq example.fastq --id ids.txt --minLen 150

Maximum Length

Code

SeekDeep extractor --fastq example.fastq --id ids.txt --minLen 150 --maxLen 300

You can also supply a file with len cutoffs which is especially useful for when you have different expected lengths for different targets, the file needs to contain at least three columns with one column named target which contain the name of the target region which must match the target name in the --id file and two columns minlen and maxlen which contain the min and max length for the target.
This is supplied with the --lenCutOffs flags.

Code

SeekDeep extractor --fastq example.fastq --id ids.txt --lenCutOffs lengthCutOffsForTargets.tab.txt

Code

cat lengthCutOffsForTargets.tab.txt

target  minlen  maxlen
PFAMA1  200 240
PFCSP   254 294

Quality

When quality values are supplies they can be used to filter reads as well, there are two options for quality filtering. The default is to use the sliding quality window threshold method, this is good for Ion Torrent, 454, and PacBio. The second quality option is good for Illumina

Siding window average threshold

By default a sliding window of size 50 and stepping by every 5 bases and checking to see if the average quality of those 50 bases are above a threshold of 25. The default values can be changed by using the --qualWindow flag and supplying three values separated by commas which are the window size, window step, and window threshold. So the below example would use a window size of 50, a step of 5, and a average threshold of 20

Code

SeekDeep extractor --fastq example.fastq --id ids.txt --qualWindow 50,5,20

Also the default behavior is to throw out any read where this ever fails and this can be changed to trimming the read at the failed window instead using the -qualWindowTrim flag. The minimum length filter is applied after this step so if the trim falls below the min len it will still be filtered out

Code

SeekDeep extractor --fastq example.fastq --id ids.txt --qualWindow 50,5,20 --qualWindowTrim

Quality fraction above a threshold

A second option for filtering using quality scores is using the distribution of a reads quality scores. This is done by choosing a quality score cut off and checking to see if a certain fraction of quality scores are above a fraction cut off. This is done by using the -qualCheck and the qualCheckCutOff flags. So the below example would throw out any reads where less than 75% of their quality scores was above 25

Code

SeekDeep extractor --fastq example.fastq --id ids.txt --qualCheck 25 --qualCheckCutOff .75

Dry run quality testing

Since extraction might take some time, it might be nice to see how many reads you would loose with different setting, so a command was added to SeekDeep to do a dry run to output the percentage of reads you would loose if certain quality parameters were used, this is called SeekDeep dryRunQualityFiltering and the arguments are similar to the arguments above.

Checking quality window

Code

SeekDeep dryRunQualityFiltering --fastq example.fastq  --qualWindow 50,5,25

void njhseq::SeqInput::openInLockFree(): Error file: [31m[1mMultiplex_IT_Tutorial_Materials/IonTorrent1.fastq[0m doesn't exist

Code

SeekDeep dryRunQualityFiltering --fastq example.fastq  --qualWindow 50,5,20

void njhseq::SeqInput::openInLockFree(): Error file: [31m[1mMultiplex_IT_Tutorial_Materials/IonTorrent1.fastq[0m doesn't exist

Code

SeekDeep dryRunQualityFiltering --fastq example.fastq  --qualWindow 50,5,18

void njhseq::SeqInput::openInLockFree(): Error file: [31m[1mMultiplex_IT_Tutorial_Materials/IonTorrent1.fastq[0m doesn't exist

checking qualcheck

Code

SeekDeep extractor --fastq example.fastq --id ids.txt --qualCheck 25 --qualCheckCutOff .75

void njhseq::SeqInput::openInLockFree(): Error file: [31m[1mMultiplex_IT_Tutorial_Materials/IonTorrent1.fastq[0m doesn't exist

Code

SeekDeep extractor --fastq example.fastq --id ids.txt --qualCheck 20 --qualCheckCutOff .75

void njhseq::SeqInput::openInLockFree(): Error file: [31m[1mMultiplex_IT_Tutorial_Materials/IonTorrent1.fastq[0m doesn't exist

Primers

There are various filtering parameters that can be applied to the presence of primers. This includes percent of primer found, the number of mismatches to the primer, and not searching for the primer at all. See SeekDeep extractor -help for more details and defaults

Reverse Primer

Looking for primers can be turned off and simple filtering can be done, this requires the flag --noPrimers and that the ID file has just one primer line so that a target name can be given to the out files but the forward and reverse primer columns are ignore.

Code

SeekDeep extractor --fastq example.fastq --id ids.txt --noPrimers

Additional Options

Looking in the reverse complement direction

Depending on the library set up reads are can be found in two directions. Reads found in the reverse complement direction will be reverse complemented and their name will be marked with _Comp so all sequences are in the same direction. There are separate flags for searching MIDs in both directions (–checkRevComplementForMids) and for searching for primers in the both directions (–checkRevComplementForPrimers). And depending on library prep again you might have to combine the two.

Code

SeekDeep extractor --fasta example.fasta --id ids.txt  --checkRevComplementForMids

Code

SeekDeep extractor --fasta example.fasta --id ids.txt  --checkRevComplementForPrimers

Code

SeekDeep extractor --fasta example.fasta --id ids.txt  --checkRevComplementForMids --checkRevComplementForPrimers

Looking for barcodes/primers at various locations

Barcodes and primers aren’t always located at the very beginning of the sequences and the search region has to be expanded which can be done using the --midWithinStart flag for mids and --primerWithinStart, the below example would check for the barcodes and primers for start sites within the first 10 bases of the start of the sequence (the primer/MID just has to start within this many bases and not be contained completely with this region, e.g. the primer could be found at base 10 but extend to base 30 and would be caught with this flag). When data is multiplexed the primer within start parameter is the number of bases within the found MID and not the start of the sequence.

Code

SeekDeep extractor --fastq example.fastq --id ids.txt --midWithinStart 10

Code

SeekDeep extractor --fastq example.fastq --id ids.txt --primerWithinStart 10

And again, depending on library prep you might need to combine these two flags. For instance, if you had say a variable number of bases of up to 4 in front of your MIDs and then an additional variable number of bases in between your MIDs and primers (lets say up to 10).

Code

SeekDeep extractor --fastq example.fastq --id ids.txt --midWithinStart 4 --primerWithinStart 10

It is advised to keep these numbers low as a high number will greatly increase the search space and could lead to many false positives.

Barcodes

Dual barcodes

Dual barcoding schemes can be done by adding another column to the barcode portion of the input ID file. The first column is assumed to be associated with the forward target primer and the second column is assumed to be associated with the

Code

SeekDeep extractor --fastq example.fastq --id ids.txt

Allowing errors in barcodes

You can allow mismatches in barcodes using --barcodeErrors

Code

SeekDeep extractor --fastq example.fastq --id ids.txt --barcodeErrors 2

Renaming of sequence ids

The names given to the sequences are sometimes annoying so adding the --rename flag will rename the sequence with what primer/mid they were found with

Code

SeekDeep extractor --fastq example.fastq --id ids.txt  --rename

A file will be created called renameKey.tab.txt which contains the name conversion.

Code

cat renameKey.tab.txt

originalName    newName
JUW96:00012:00129   PFAMA1MID01.00
JUW96:00013:00133   PFAMA1MID01.01
JUW96:00017:00147   PFAMA1MID01.02_Comp
JUW96:00018:00120   PFAMA1MID01.03
JUW96:00020:00136   PFAMA1MID01.04_Comp
JUW96:00029:00127   PFAMA1MID01.05
JUW96:00032:00129   PFAMA1MID01.06
JUW96:00033:00147   PFAMA1MID01.07_Comp
JUW96:00035:00137   PFAMA1MID01.08_Comp

Changing out directory name

To change the default directory name use the --dout flag. SeekDeep will never overwrite a directory if it already exists and will fail and quit if it tries to create a directory that exists.

Code

SeekDeep extractor --fastq example.fastq --id ids.txt --dout extractionDir

The dout option also understand the key work TODAY to mean to insert the current date and time there instead though this means a output directory name can never have TODAY all in caps in it

Code

SeekDeep extractor --fastq example.fastq --id ids.txt --dout extractionDir_TODAY

:::{.callout-note} # extractor's objectives The main purpose of extractor is to take raw data and demultiplex by sample barcodes if present and by primer pairs. It at most requires two arguments: an input file which can be fastq (`--fastq`) or fasta (`--fasta`) and an ID file (`--id`) that supplies primer pairs and barcodes if present. It will also do some quality control as well for length and quality scores. ::: # Paired End Extraction There is a separate extractor command for extracting paired end data and is simply called `SeekDeep extractorPairedEnd` and shares most of the command line options with regular single end `SeekDeep extractor` # Getting usage command line Calling `-help` will give a help message on running the program ```{r, engine='bash',comment="",eval=FALSE} SeekDeep extractor --help SeekDeep extractorPairedEnd --help ``` Also all flags in SeekDeep are case insensitive and so all the following would have the same results ```{r, engine='bash',comment="",eval=FALSE} SeekDeep extractor --help SeekDeep extractor --HELP SeekDeep extractor --HeLP SeekDeep extractor --HeLp SeekDeep extrActor --HeLP SeekDeep ExtraCtor --HeLp SeekDeep EXTRACTOR --HeLp ``` ```{r, engine='bash',comment="",eval=FALSE, echo=FALSE} SeekDeep extractor --getFlags ``` ```{r, engine='bash',comment="",highlight=TRUE, eval=FALSE, echo=FALSE} SeekDeep extractor --getFlags | gsed -r "s/\x1B\[([0-9]{1,3}((;[0-9]{1,3})*)?)?[m|K]//g" | head -20 echo ... echo ... ``` # Format of id file The id file is a tab delimited file that contains both primer pair info and optional barcode information Primer information starts with a header line that must start with either `gene` or `target` and have two additional columns for forward and reverse primers, the line can start with either gene or target. ```{r,engine='bash',comment="", echo=FALSE} head -3 ../extraFiles/pfama1_csp_ids.tab.txt ``` ## Primers The primers can contain [ambigious bases](http://www.dnabaser.com/articles/IUPAC%20ambiguity%20codes.html) (Y,R,etc.) and should be all upper case ### Forward Primer Forward primer should be in the 5\`-3\` direction. ### Reverse Primer The reverse primer should in in the 5\`-3\` direction as well. ## MIDs If data is multiplexed the MID name and corresponding barcode information comes after the primer information. Barcode information needs to start with a header `id barcode` which the `id` is the important name for parsing. Mids should be in the direction they would be found in if reads are found in the direction of the forward primer. It is always assumed that the mid and primer set up is MID-PRIMER. If this is not the case for your data contact Nick Hathaway (nicholas.hathaway@umassmed.edu) to see if he can add options for accommodating your data structure. ```{r,engine='bash',comment="", echo=FALSE} cat ../extraFiles/pfama1_csp_ids.tab.txt ``` # Output files A directory will be created for the output of extractor, this defaults to the name of the file plus the work extractor and then the current date and time of the when the command was run. This can be changed with the `--dout` flag. In this directory will be the sequences and several files reporting on how the extraction when. * **extractionProfile.tab.txt** - Reports stats on extraction per mid and primer pair * **extractionStats.tab.txt** - Reports stats on the extraction overall * **runLog_extractor.txt** - Contains the date and where the command was run on and what it was and total run time * **filteredOff** - A directory containing the reads that didn't meet filtering criteria or match the input primers/mids Example of extractionProfile.tab.txt ```{r engine='bash', echo=FALSE} cat extraFiles/exampleExtraction/extractionProfile.tab.txt | column -t ``` Example of extractionStats.tab.txt ```{r engine='bash', echo=FALSE} cat extraFiles/exampleExtraction/extractionStats.tab.txt | column -t ``` The output files will be named by the using the name given in the gene column and the name in id column if the data is multiplexed. ```{r engine='bash', echo=TRUE, eval=FALSE} ls ./ ``` ```{r engine='bash', echo=FALSE} ls ../extraFiles/exampleExtraction/ ``` # Examples ## Required files `SeekDeep extractor` and `SeekDeep extractorPairedEnd` require an input sequence file and an identifier file. ### Different input formats #### Fastq Input ```{r engine='bash', eval=FALSE} SeekDeep extractor --fastq example.fastq --id ids.txt SeekDeep extractor --fastqgz example.fastq.gz --id ids.txt ``` #### Fasta Input ```{r engine='bash', eval=FALSE} SeekDeep extractor --fasta example.fasta --id ids.txt ``` #### Fasta and Qual Input ```{r engine='bash', eval=FALSE} SeekDeep extractor --fasta example.fasta --qual example.fasta.qual --id ids.txt #or SeekDeep extractor --stub example --id ids.txt ``` ### Paired End For paired end the only options are fastq or fastqgz and both mates have to be indicated. This also assumes that R1 sequences are in the 5\`-3\` direction of the top strand, and R2 are in the 5\`-3\` direction of the bottom strand. ```{r engine='bash', eval=FALSE} SeekDeep extractorPairedEnd --fastq1 example_R1.fastq --fastq2 example_R2.fastq --id ids.txt SeekDeep extractorPairedEnd --fastq1gz example_R1.fastq.gz --fastq2gz example_R2.fastq.gz --id ids.txt ``` ### Multiplexed and multiple primer pairs #### Multiplex Multiplexing simply requires having MID/barcodes in id file, a header line is required that starts with `id` to indicate to `SeekDeep extractor` that barcodes are starting. ```{r engine='bash', eval=F} cat id_file.tab.txt ``` ```{r engine='bash', eval=T, echo=FALSE} cat ../extraFiles/id_file.tab.txt ``` ```{r engine='bash', eval=FALSE} SeekDeep extractor --fastq example.fastq --id id_file.tab.txt --multiplex ``` #### Multiple primer pairs Multiple primer pairs is done simply by adding another primer info line, this can also be combined with multiplexing ```{r engine='bash', eval=F} cat multipleGenePairs.id.txt ``` ```{r engine='bash', eval=T, echo=FALSE} cat ../extraFiles/multipleGenePairs.id.txt ``` ```{r engine='bash', eval=FALSE} SeekDeep extractor --fasta example.fasta --id multipleGenePairs.id.txt ``` ### Paired End For paired reads an additional file is required that indicates if there is overlap between the mates. There are three possible overlap scenarios: noOverLap, r1EndsInR2 (normal overlap goal), and r1BeginsInR2 (happens with read through). If there is overlap `SeekDeep extractorPairedEnd` will stitch together the mates and if there is no overlap it will extract and have a _R1 and _R2 files for the target. The different overlap scenarios are required as there are many ways to generate artifacts in PCR (unspecific amplification, primer dimers, etc) that could lead to the wrong overlap and therefore reads that overlap in the way that they are expected to are given in the final output files and the rest assumed to be artifact. See [Illumina Paired Info Page](illumina_paired_info.html) for a diagram on how sequences overlap. The overlap status of the targets is given by a file with two columns, `target` and `status`. The order of columns does not matter and the case of the status column does not matter, the target column does. The name given in the target column must match with the names given in the id file. ```{bash, eval = F} cat overlapStatuses.txt ``` ```{bash, echo = F} cat ../extraFiles/refSeqs/forSeekDeep/overlapStatuses.txt ``` ```{bash, eval = F} SeekDeep extractor --fastq1 example_R1.fastq --fastq2 example_R2.fastq --id multipleGenePairs.id.txt --overlapStatusFnp overlapStatuses.txt ``` ## Filtering parameters ### Length Reads are filtered on expected length by choosing a minimum length (`--minLen`) and a maximum length (`--maxLen`), these default to be within 20% of the median input read lengths #### Minimum Length ```{r engine='bash', eval=FALSE} SeekDeep extractor --fastq example.fastq --id ids.txt --minLen 150 ``` #### Maximum Length ```{r engine='bash', eval=FALSE} SeekDeep extractor --fastq example.fastq --id ids.txt --minLen 150 --maxLen 300 ``` You can also supply a file with len cutoffs which is especially useful for when you have different expected lengths for different targets, the file needs to contain at least three columns with one column named target which contain the name of the target region which must match the target name in the `--id` file and two columns `minlen` and `maxlen` which contain the min and max length for the target. This is supplied with the `--lenCutOffs` flags. ```{r engine='bash', eval=FALSE} SeekDeep extractor --fastq example.fastq --id ids.txt --lenCutOffs lengthCutOffsForTargets.tab.txt ``` ```{bash, eval = F} cat lengthCutOffsForTargets.tab.txt ``` ```{bash, echo = F} cat ../extraFiles/refSeqs/forSeekDeep/lenCutOffs.txt ``` ### Quality When quality values are supplies they can be used to filter reads as well, there are two options for quality filtering. The default is to use the sliding quality window threshold method, this is good for Ion Torrent, 454, and PacBio. The second quality option is good for Illumina #### Siding window average threshold By default a sliding window of size 50 and stepping by every 5 bases and checking to see if the average quality of those 50 bases are above a threshold of 25. The default values can be changed by using the `--qualWindow` flag and supplying three values separated by commas which are the window size, window step, and window threshold. So the below example would use a window size of 50, a step of 5, and a average threshold of 20 ```{r engine='bash', eval=FALSE} SeekDeep extractor --fastq example.fastq --id ids.txt --qualWindow 50,5,20 ``` Also the default behavior is to throw out any read where this ever fails and this can be changed to trimming the read at the failed window instead using the `-qualWindowTrim` flag. The minimum length filter is applied after this step so if the trim falls below the min len it will still be filtered out ```{r engine='bash', eval=FALSE} SeekDeep extractor --fastq example.fastq --id ids.txt --qualWindow 50,5,20 --qualWindowTrim ``` #### Quality fraction above a threshold A second option for filtering using quality scores is using the distribution of a reads quality scores. This is done by choosing a quality score cut off and checking to see if a certain fraction of quality scores are above a fraction cut off. This is done by using the `-qualCheck` and the `qualCheckCutOff` flags. So the below example would throw out any reads where less than 75% of their quality scores was above 25 ```{r engine='bash', eval=FALSE} SeekDeep extractor --fastq example.fastq --id ids.txt --qualCheck 25 --qualCheckCutOff .75 ``` ### Dry run quality testing Since extraction might take some time, it might be nice to see how many reads you would loose with different setting, so a command was added to SeekDeep to do a dry run to output the percentage of reads you would loose if certain quality parameters were used, this is called **SeekDeep dryRunQualityFiltering** and the arguments are similar to the arguments above. #### Checking quality window ```{r engine='bash', eval=FALSE} SeekDeep dryRunQualityFiltering --fastq example.fastq --qualWindow 50,5,25 ``` ```{r engine='bash', eval=T, echo=FALSE} SeekDeep dryRunQualityFiltering --fastq Multiplex_IT_Tutorial_Materials/IonTorrent1.fastq --qualWindow 50,5,25 | tail -13 | gsed -r "s/\x1B\[([0-9]{1,3}((;[0-9]{1,3})*)?)?[m|K]//g" ``` ```{r engine='bash', eval=FALSE} SeekDeep dryRunQualityFiltering --fastq example.fastq --qualWindow 50,5,20 ``` ```{r engine='bash', eval=T, echo=FALSE} SeekDeep dryRunQualityFiltering --fastq Multiplex_IT_Tutorial_Materials/IonTorrent1.fastq --qualWindow 50,5,20 | tail -13 | gsed -r "s/\x1B\[([0-9]{1,3}((;[0-9]{1,3})*)?)?[m|K]//g" ``` ```{r engine='bash', eval=FALSE} SeekDeep dryRunQualityFiltering --fastq example.fastq --qualWindow 50,5,18 ``` ```{r engine='bash', eval=T, echo=FALSE} SeekDeep dryRunQualityFiltering --fastq Multiplex_IT_Tutorial_Materials/IonTorrent1.fastq --qualWindow 50,5,18 | tail -13 |gsed -r "s/\x1B\[([0-9]{1,3}((;[0-9]{1,3})*)?)?[m|K]//g" ``` #### checking qualcheck ```{r engine='bash', eval=FALSE} SeekDeep extractor --fastq example.fastq --id ids.txt --qualCheck 25 --qualCheckCutOff .75 ``` ```{r engine='bash', eval=T, echo=FALSE} SeekDeep dryRunQualityFiltering --fastq Multiplex_IT_Tutorial_Materials/IonTorrent1.fastq --qualCheck 25 --qualCheckCutOff .75 | tail -13 |gsed -r "s/\x1B\[([0-9]{1,3}((;[0-9]{1,3})*)?)?[m|K]//g" ``` ```{r engine='bash', eval=FALSE} SeekDeep extractor --fastq example.fastq --id ids.txt --qualCheck 20 --qualCheckCutOff .75 ``` ```{r engine='bash', eval=T, echo=FALSE} SeekDeep dryRunQualityFiltering --fastq Multiplex_IT_Tutorial_Materials/IonTorrent1.fastq --qualCheck 20 --qualCheckCutOff .75 | tail -13 |gsed -r "s/\x1B\[([0-9]{1,3}((;[0-9]{1,3})*)?)?[m|K]//g" ``` ### Primers There are various filtering parameters that can be applied to the presence of primers. This includes percent of primer found, the number of mismatches to the primer, and not searching for the primer at all. See SeekDeep extractor -help for more details and defaults #### Reverse Primer Looking for primers can be turned off and simple filtering can be done, this requires the flag `--noPrimers` and that the ID file has just one primer line so that a target name can be given to the out files but the forward and reverse primer columns are ignore. ```{r engine='bash', eval=FALSE} SeekDeep extractor --fastq example.fastq --id ids.txt --noPrimers ``` ## Additional Options ### Looking in the reverse complement direction Depending on the library set up reads are can be found in two directions. Reads found in the reverse complement direction will be reverse complemented and their name will be marked with `_Comp` so all sequences are in the same direction. There are separate flags for searching MIDs in both directions (--checkRevComplementForMids) and for searching for primers in the both directions (--checkRevComplementForPrimers). And depending on library prep again you might have to combine the two. ```{r engine='bash', eval=FALSE} SeekDeep extractor --fasta example.fasta --id ids.txt --checkRevComplementForMids ``` ```{r engine='bash', eval=FALSE} SeekDeep extractor --fasta example.fasta --id ids.txt --checkRevComplementForPrimers ``` ```{r engine='bash', eval=FALSE} SeekDeep extractor --fasta example.fasta --id ids.txt --checkRevComplementForMids --checkRevComplementForPrimers ``` ### Looking for barcodes/primers at various locations Barcodes and primers aren't always located at the very beginning of the sequences and the search region has to be expanded which can be done using the `--midWithinStart` flag for mids and `--primerWithinStart`, the below example would check for the barcodes and primers for start sites within the first 10 bases of the start of the sequence (the primer/MID just has to start within this many bases and not be contained completely with this region, e.g. the primer could be found at base 10 but extend to base 30 and would be caught with this flag). When data is multiplexed the primer within start parameter is the number of bases within the found MID and not the start of the sequence. ```{r engine='bash', eval=FALSE} SeekDeep extractor --fastq example.fastq --id ids.txt --midWithinStart 10 ``` ```{r engine='bash', eval=FALSE} SeekDeep extractor --fastq example.fastq --id ids.txt --primerWithinStart 10 ``` And again, depending on library prep you might need to combine these two flags. For instance, if you had say a variable number of bases of up to 4 in front of your MIDs and then an additional variable number of bases in between your MIDs and primers (lets say up to 10). ```{r engine='bash', eval=FALSE} SeekDeep extractor --fastq example.fastq --id ids.txt --midWithinStart 4 --primerWithinStart 10 ``` It is advised to keep these numbers low as a high number will greatly increase the search space and could lead to many false positives. ## Barcodes ### Dual barcodes Dual barcoding schemes can be done by adding another column to the barcode portion of the input ID file. The first column is assumed to be associated with the forward target primer and the second column is assumed to be associated with the ![Supply specific length cut offs](../images/dual_barcoding_scheme.jpg) ```{r engine='bash', eval=FALSE} SeekDeep extractor --fastq example.fastq --id ids.txt ``` ### Allowing errors in barcodes You can allow mismatches in barcodes using `--barcodeErrors` ```{r engine='bash', eval=FALSE} SeekDeep extractor --fastq example.fastq --id ids.txt --barcodeErrors 2 ``` ### Renaming of sequence ids The names given to the sequences are sometimes annoying so adding the `--rename` flag will rename the sequence with what primer/mid they were found with ```{r engine='bash', eval=FALSE} SeekDeep extractor --fastq example.fastq --id ids.txt --rename ``` A file will be created called `renameKey.tab.txt` which contains the name conversion. ```{r engine='bash', eval=FALSE} cat renameKey.tab.txt ``` ```{r engine='bash', eval=T, echo=FALSE} cat ../extraFiles/exampleRenameFile.txt ``` ### Changing out directory name To change the default directory name use the `--dout` flag. SeekDeep will never overwrite a directory if it already exists and will fail and quit if it tries to create a directory that exists. ```{r engine='bash', eval=FALSE} SeekDeep extractor --fastq example.fastq --id ids.txt --dout extractionDir ``` The dout option also understand the key work TODAY to mean to insert the current date and time there instead though this means a output directory name can never have TODAY all in caps in it ```{r engine='bash', eval=FALSE} SeekDeep extractor --fastq example.fastq --id ids.txt --dout extractionDir_TODAY ```