SeekDeep makeSampleDirectories
The intent of makeSampleDirectories is to set up the directory tree needed by SeekDeep processClusters and to make it easier to put output files into that tree. SeekDeep makeSampleDirectories needs two arguments, given by the --file and --dout flags. The --file flag supplies the setup file (explained below) and the --dout flag names the directory that will be created (it will not overwrite an existing directory). It has mostly been developed for use with multiplexed data.
Typing just the name of the program prints a help message on how to run it:
makeSampleDirectoires
Set up a directory tree for processClusters
Commands, order not necessary, flags are case insensitive
Required commands
--file [option], name of the file of sample names to read in
--dout [option], name of the main directory to create
File should be tab delimited and a few examples are below
File should have at least three columns
Where first column is the name of the index or sff file used, second column is
the sample names, and all following columns are the MIDs for that samples
samples
Example with two replicates and two separate master indexes
1 090-00 MID01 MID02
1 090-24 MID03 MID04
1 090-48 MID05 MID06
...
...
Calling -help will do the same.
All flags in SeekDeep are case insensitive, so, for example, --dout and --DOUT would give the same result.
The setup file contains at least three columns. The first column is an identifier for the sequence file that contains the sample's sequence data. The second column is the name of the sample. The third column is the name of the MID for that sample, and each additional column is the MID of another replicate for that sample in that file. Any line that starts with a # or is blank will be ignored.
Below would be an example for a sequencing experiment that included 20 samples, where each sample had 2 PCR replicates and the sequencing was done on two lanes or cells (e.g., two Illumina lanes or two different Ion Torrent runs).
Below would be an example for a sequencing experiment that included 20 samples, where each sample had only 1 PCR replicate and the sequencing was done on two lanes or cells (e.g., two Illumina lanes or two different Ion Torrent runs).
Now you might have an experiment like the one with dual replicates, but with only one replicate for a couple of samples, perhaps due to poor amplification or some other reason. This can be handled simply by mixing the two formats above.
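As a rough illustration of the format described above, the sketch below writes a small, hypothetical setup file that mixes two-replicate and single-replicate samples across two input files, then checks it with awk. The sample names and the check itself are assumptions made for illustration; they are not part of SeekDeep.

```shell
# Hypothetical setup file; all names here are illustrative only.
workdir=$(mktemp -d)
setup="$workdir/exampleSampleNames.tab.txt"

# Columns: input file identifier, sample name, then one MID per replicate.
{
  printf '# lines starting with # are ignored\n'
  printf '1\tSample01\tMID01\tMID02\n'
  printf '1\tSample02\tMID03\n'
  printf '\n'
  printf '2\tSample03\tMID01\tMID02\n'
} > "$setup"

# Print input file, sample, and replicate count for each usable line,
# and flag any line with fewer than three tab-separated columns.
report=$(awk -F'\t' '
  /^#/ || $0 == "" { next }
  NF < 3 { printf "line %d: expected at least 3 columns, found %d\n", NR, NF; bad = 1; next }
  { printf "%s\t%s\t%d\n", $1, $2, NF - 2 }
  END { exit bad + 0 }
' "$setup")
status=$?
printf '%s\n' "$report"
```

A non-zero value in status would indicate that at least one line had too few columns.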
The output directory will be the one given by the --dout flag. This will never overwrite an existing directory. The --dout flag also interprets the word TODAY, in all caps, as a request to insert the current date and time in its place.
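To make the TODAY behaviour concrete, here is a minimal shell sketch of the kind of substitution described. It only mimics the idea outside of SeekDeep: the timestamp format that SeekDeep actually inserts is not stated here, so the format below is an assumption, as is the directory name.

```shell
# Mimic replacing TODAY in a --dout value with the current date and time.
# The timestamp format is an assumption, not SeekDeep's documented format.
dout="analysis_TODAY"
stamp=$(date +%Y-%m-%d_%H.%M.%S)
resolved=$(printf '%s' "$dout" | sed "s/TODAY/$stamp/g")
echo "$resolved"
```

Because the name changes with the time of the run, this makes it easy to keep repeated runs in separate directories.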
A master directory will be built using the name given by --dout. In this directory will be the directory tree needed by SeekDeep processClusters: a top directory containing one directory for each sample in the analysis, with each sample directory containing all replicate directories for that sample.
Another directory, called locationByIndex, will contain one file per input sequence file, intended for use with SeekDeep qluster to help direct its output into this directory tree. This is done by giving these files to qluster with the --additionalOut flag (see the documentation for qluster for details on this flag). The idea is that each input sequence file contains several MIDs, and once the reads have been extracted by SeekDeep extractor they will be in separate files. The above example had two input sequence files. When running qluster on the files extracted from the first input file, you give the location file for that input file.
Here are the files from the above example
Each file contains two columns, the first is the MID name and the second is the location where the output of qluster should go for the clustering of that MID.
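The two-column layout can be illustrated with a small lookup. The sketch below writes a hypothetical location file and finds where the clustering output for one MID would go. The paths and sample names are invented for illustration, and the files actually produced by makeSampleDirectories may differ.

```shell
# Hypothetical location file: MID name, then output location (tab separated).
# The paths are assumptions, not real makeSampleDirectories output.
workdir=$(mktemp -d)
loc="$workdir/1.tab.txt"
{
  printf 'MID01\tfilesForProcessClusters/Sample01/MID01\n'
  printf 'MID02\tfilesForProcessClusters/Sample01/MID02\n'
  printf 'MID03\tfilesForProcessClusters/Sample02/MID03\n'
} > "$loc"

# Look up the output location recorded for a given MID.
mid="MID03"
dest=$(awk -F'\t' -v mid="$mid" '$1 == mid { print $2 }' "$loc")
echo "$mid -> $dest"
```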
Now, when running qluster on the results from the extraction of each input file, give the appropriate location file. The extraction here is done with the following id file:
gene forward reverse
PFAMA1 CAGGGAAATGTCCAGTATT CTTGAACATAAAGTCAATTC
id barcode
MID01 ACGAGTGCGT
MID02 ACGCTCGACA
MID03 AGACGCACTC
MID04 AGCACTGTAG
MID05 ATCAGACACG
MID06 ATATCGCGAG
MID07 CGTGTCTCTA
...
...
SeekDeep makeSampleDirectories --file exampleSampleNames.tab.txt --dout filesForProcessClusters
SeekDeep extractor --fastq input1.fastq --id ids.txt --dout extraction1
cd extraction1
SeekDeep qluster --fastq PFAMA1MID01.fastq --par pars.tab.txt --additionalOut ../filesForProcessClusters/locationByIndex/1.tab.txt
...
cd ..
SeekDeep extractor --fastq input2.fastq --id ids.txt --dout extraction2
cd extraction2
SeekDeep qluster --fastq PFAMA1MID01.fastq --par pars.tab.txt --additionalOut ../filesForProcessClusters/locationByIndex/2.tab.txt
...
# After running qluster on all the extracted files, everything is already in place for processClusters
cd ../filesForProcessClusters
SeekDeep processClusters --fastq output.fastq --par pars.txt