Usage: gnumap [options] <file_to_parse>
-g, --genome=STRING Genome .fa file(s)
-o, --output=STRING Output file
-a, --align_score=DOUBLE Limit for sequence alignment (default: 10)
-p, --percent Use percentage when determining alignment score
-q, --read_quality=DOUBLE Read quality cutoff: won't align reads if they are
below this cutoff value
(default=0.0)
-v, --verbose=INT Verbose (default=0)
-c, --num_proc=INT Number of processors to run on
-m, --mer_size=INT Mer size (default=10)
-B, --buffer=INT Buffer size
-T, --max_match=INT Maximum number of matches for a given
sequence (default: 1000)
-h, --max_hash=INT Maximum values to store in the hash
(default: none)
-S, --subst_file=STRING Position-Weight Matrix file for DNA
Substitutions
-G, --gap_penalty=DOUBLE Gap Penalty for Substitution Matrix
-M, --max_gap=INT Maximum Number of Gaps to use in Alignment
-A, --adaptor=STRING Adaptor sequence to remove from sequences
-0, --print_full Print locations for the entire sequence, not
just for the beginning.
--print_all_sam Include all possible SAM records in output
-b, --bs_seq Flag to turn on the C to T conversion, used in
bisulfite sequence analysis
--b2 Flag to turn on bisulfite sequencing, searching the reverse
strand of -b
-d, --a_to_g Flag that allows for A to G conversion
--fast Perform a fast alignment (at some reduction
of accuracy)
-s, --gen_skip=INT Number of bases to skip when the genome is aligned
(default: none)
--bin_size=INT The resolution for GNUMAP (default: 8)
-j, --jump=INT The number of bases to jump in the sequence hashing
(default: mer_size / 2)
-k, --num_seed=INT The total number of seed hashes that must match to a
location before it is considered for alignment
(default: 2)
--snp Turn on SNP mapping (will output a .sgrex file)
--snp_pval=DOUBLE P-Value cutoff for calling SNPs
(default: 0.001)
--snp_monop Flag that turns on monoploid SNP calling
(default: diploid SNP calling)
--illumina Defines the fastq file as Illumina file (otherwise
does nothing)
--up_strand Will only search the positive strand of the genome
for matching location (will not look for reverse
compliment match to the genome.
--down_strand Will only search the negaitve strand (opposite of
--up_strand command)
For MPI usage:
--MPI_largemem If the run requires a large amount of memory, this
flag will spread it accross several nodes.
(default: not included)
For reading and writing to a binary file:
--save=FILENAME Save the genome out to a file
--read=FILENAME Read the genome in from a file
Help options:
-?, --help Show this help message
-g "$(ls genome/*.fa)"and the entire *.fa contents of the folder genome will be used.
<QNAME> <FLAG> <RNAME> <POS> <MAPQ> <CIGAR> <MRNM> <MPOS> <ISIZE> <SEQ> <QUAL> <OPT>
<QNAME> will be used if it is a fastq or fasta string--otherwise
the query name will be "seq" followed by the sequence number in the file.<FLAG> that are important are 0x0010 (where 0 represents an alignment
to the forward strand and 1 represents reverse) or 0x0200, which means the match was too poor to align<RNAME> is the chromosome mapped to<POS> is the position on the chromosome<MAPQ> is the mapping quality, where MAPQ = 10 * log_10(1-p(x)) where p(x) is GNUMAP's
posterior probability of mapping to that specific location<CIGAR> is the alignment differences, where I=Insertion, D=Deletion and M=Mismatch or Match
with the genomic strand<MRNM> will be '=' for all sequences (it's the mate-pair name)<MPOS> is ignored. Always '0'<ISIZE> is ignored. Always '0'<SEQ> is the sequence<QUAL> is the Phred-based quality of the sequences<OPT> is an optional field of the format TAG:VTYPE:VALUE.
XA:f:<ALIGN_SCORE> is the raw alignment scoreXP:f:<POST_PROB> is the posterior probability scoreX0:i:<SIM_MATCHES> is the number of similar matches
-a: Minimum Alignment Score
The -a option defines the minimum aligment score that will be accepted for
mapped reads. The default is a raw score of 10.
-p: Percentage
The -p option indicates that the score given in the -a option should be used as
a percentage instead of a raw score.
-q: Read Quality Cutoff
The -q option determines at which level the quality control is turned on.
The cutoff score is determined by the alignment score the probabilistic sequence
gets with itself. The default score of zero is about as stringent as it can get;
using a higher score will prevent poorer sequences from being aligned.
-v: Verbosity
Different levels of verbosity can be used. However, for most purposes, level one
is sufficient, ie:
-v 1Using this will slow down the program slightly.
--c: Number of threads
gnumap supports multi-threading to increase speed. To use gnumap with 8
processors, use
-c 8If the -v 1 parameter is also used, the completion percentage will not be entirely accurate.
--m: Mer length
gnumap uses a hash table to store the genome. The -m option controls the length
of these hashes (creating an m-mer hash). Using a longer mer-size, ie:
-m 14will increase the sequence alignment, but could potentially decrease the accuracy. As a general rule, as the mer size increases, so does the memory requirement. The current default setting is 9.
--B: Buffer Size
The buffer size determines how much of the genome and sequence files are read in
at a time. Eventually, the entire genome will be stored in memory, but if it is
necessary to read in a smaller portion at a time, this option can be set lower.
Default setting: 10M
--T: Maximum number of matches
When a sequence from the _int.txt file is matched, all the matches above the
align_score threshhold are kept. The -M option will determine how many matches
are reported and scored. If a given sequence has too many matches (whether its
base qualities are too low, or it comes from a heavily-duplicated region), the
sequence will not be used. The current default is 1000, but can be changed, ie:
-T 100000
--h: Maximum Hash Size
For the human genome, there are over 3 million occurances of a 9-mer sequence of
A's. By decreasing the hash size, the mers with an occurance greater than the
threshhold will not be used. This will speed up the program, as there are fewer
mers to compute a match to for each sequence. Currently, this option is not
used, but can be set using
-h 10000
-S: Position-Weight Matrix file
The position-weight matrix used by the algorithm as default is a simple matrix with
(-1) as the mismatch score and (+1) as the match score. If a different scoring
system is desired, it can be included in a file and passed in as a parameter. The
substitution matrix should be in the format:
| A | C | G | T | |
| A | m | mm | mm | mm |
| C | mm | m | mm | mm |
| G | mm | mm | m | mm |
| T | mm | mm | mm | m |
| N | mm | mm | mm | mm |
-G: Gap Penalty
The gap penalty used in the probabilistic Needleman-Wunsch algorithm.
-M: Maximum Number of Gaps
When initializing the Needleman-Wunsch DP-matrix, a boundary using the maximum number
of gaps is used to not allow an alignment with a large number of gaps.
-A: Adaptor String
Many of the Solexa/Illumina reads also contain portions of the adaptor sequence.
Specifying the -A option will remove this adaptor sequence from the end of each read.
Currenly, only exact matches are used.
-0: Print Full
Often, especially when sequences are of different lengths because of adaptor trimming,
it is useful to have base-pair resolution, not only identifying the beginning of each sequence,
but showing length and coverage of each sequence.
--print_all_sam: Print All SAM Records
This flag will print a SAM record for any possible match that GNUMAP finds. For sequences with
multiple "good" hits, this flag will print significantly more records than without it.
Since the conversion to MAPQ looses a lot of data, look for the XP flag on the end of the SAM
line for the posterior probability score of each record.
[ C' -> methylated C ] [ T* -> bisulfite-treated C ] [ A* -> bisulftie-treaded C on reverse-compliment strand ] Watson >> A C'G T T C G C T T G A G >> Crick << T G C'A A G C G A A C T C << ---Denaturation, Bisulfite Treatment, PRC Amplification--- Watson BSW >> A C'G T T T*G T*T T G A G >> BSWR << T G C A A A*C A*A A C T C << Crick BSC << T G C'A A G T*G A A T*T T* << BSCR >> A C G T T C A*C T T A*A A* >>(Reference: Xi, Yuanxin et al, "BSMAP: whole genome bisulfite sequence MAPping program." BMC Bioinformatics, vol. 10, no. 1, p. 1471-2105.)
Thus, a complete Bisulfite analysis should include two runs, one each with -b and --b2:
-b: Bisulfite Sequencing
Using bisulfite sequencing analysis, each unmethylated 'c' nucleotide would appear as
a 't' in the read. However, methylated c's would not be changed, thus identifying which
locations are methylated in the genome.
GNUMAP is capable of dealing with these reads, mapping each possibly methylated read
to the genome. The final output is in the form of .gmp files.
With the -b flag, only BSW and BSC reads are mapped. Including the --up_strand or
--down_strand flags will only map the BSW or BSC reads respectively.
--b2: Bisulfite Sequence Mapping (reverse strand mapping)
To be able to map the other two reads (BSWR and BSCR), use the --b2 flag. Using the
--up_strand or --down_strand flags will only map BSWR or BSCR, respectively.
-d: A to G Sequencing
In similar manner to the Bisulfite Sequencing process appearing above, the -b flag can be used to
allow for substitutios from a genomic a to a genomic g.
--fast: Fast option
The fast option increases the speed of the algorithm at the possible loss of missing some sequencs. Among
other things, including --fast will increase the mer size to 14. If an error occurs and the program
is not capable of creating more memory, decreasing the mer size will allow it to continue.
-s, --gen_skip: Bases to Skip while hashing
When hashing the genome, instead of creating a hash at every base, the defined number
of bases are skipped. This creates a smaller hash in total without losing on accuracy.
In order to skip every other base, use:
-s 1
--bin_size: Number of bases in each bin
When printing the sgr file, this is the number of bases to print. Remember this is a
genome-file specific parameter, ie: The bin size of the genome file that is written to
memory will need to be the genome size of every run.
To use a bin size of 1 (the sgr output file will have base-pair specificty), use:
--bin_size=1
-j, --jump: The number of bases to jump
When matching the read, GNUMAP will jump this number of bases when determining hashes.
A lower jump value might give a better result, and a jump value of too high will not
give very desirable results. Default: MER_SIZE / 2
In order to specify a jump size of 1 (take every possible hash from the sequence), use:
-j 1
-k, --num_seed: The number of seed hashes that must match
When matching sequences, k number of hashes must match to a given location before the
location will be considered for alignment. Significant speedup without loss in accuracy,
and tradeoffs between -j and -k can provide optimal performance. Using -k 1
is the exact same as previous versions. Default: 2
In order to require that three hash k-mers align to the read before being considered for
alignment, use:
-k 3
--snp: Turns on SNP calling flag
There are two results from the SNP flag. The first is a sgrex file, which is an
extension of the sgr file. It also includes a rudimentary SNP calling program. The
second benefit of the SNP flag is that the flags -0 (full print at each base) and
--bin_size=1 are set. This will take a little bit more memory, and should not be
used with little memory machines (if the genome is large).
--snp_pval: Minimum p-value for making SNP calls
GNUMAP uses a p-value score for calling SNPs in the final .sgrex file. If the p-value
is not low enough, it will not label this position as a SNP.
--snp_monop: Monopoid SNP flag
If it is not possible to call diploid SNPs (or if it is not desired to call
diploid SNPs), use this flag.
--illumina: Defines fastq as Illumina format
The only difference between Illumina fastq and normal fastq is the ASCII
adjustment. Normal fastq will adjust the probability by 33 to obtain an ASCII
character, whereas Illumiina uses 64.
--up_strand: Only uses the positive strand for alignment
When searching for matches to a specific sequence, this flag will only allow GNUMAP
to search the positive strand of the genome (instead of also looking at the
reverse compilment negative strand).
--down_strand: Only uses the negative strand for alignment
This flag does the opposite of the --up_strand command and only searches the
negative strand.
--save: Binary file to save Genome
In order to save the genome file to a binary file on disk, include this flag.
--save=<file_to_save>
--read: Binary file to read Genome
Read a previously stored genome back into memory, using the same parameters (ie, mer
size, maximum hash locations, etc) as were included in the writing of the file.
--read=<file_to_read>
<file_to_parse>
The last parameter is the sequence file created by Illumina.
In the latest version of gnumap, any version (_prb.txt, _int.txt, fastq or fasta) can be used.
The _int.txt file looks like this (four numbers identifying the lane,
followed by numbers identifying the intensity of each nucleotide):
8 1 889 119 205.2 -712.3 11023.4 9163.0 9373.4 10373.5 470.0 1059.7 9777.0 12029.2 330.4 819.1...and the _prb.txt file looks like this (four numbers for each nucleotide, separated by tabs):
8 -21 -21 -21 -21 -21 -21 8 8 -21 -21 -21 -21 -21 -21 8 -21 8 -21 -21 8 -21 -21 -21 -21 8 -21 -21...If the _seq.txt, or even the fastq file is used, there will not be as much information obtained; only the likelihood of the one base will be included, losing the likelihood of all the other bases.