EMBOSS: matcher
Matcher is based on Bill Pearson's 'lalign' application, version 2.0u4 (Feb. 1996).
Lalign uses code developed by X. Huang and W. Miller (Adv. Appl. Math. (1991) 12:337-357) for the "sim" program.
Matcher will report a specified number of alignments between the two sequences showing the actual local alignments.
% matcher sw:hba_human sw:hbb_human
Output file [hba_human.matcher]:
43.4% identity in 145 HBA_HUMAN overlap; score: 264
Mandatory qualifiers:
[-sequencea] sequence Sequence USA
[-sequenceb] sequence Sequence USA
[-outfile] outfile Output file name
Optional qualifiers:
-datafile matrix Matrix file
-alternatives integer This sets the number of alternative matches
output. By default only the highest scoring
alignment is shown. A value of 2 gives you
other reasonable alignments. In some cases,
for example multidomain proteins or cDNA and
genomic DNA comparisons, there may be other
interesting and significant alignments.
-gappenalty integer The gap penalty is the score taken away when
a gap is created. The best value depends on
the choice of comparison matrix. The
default value of 14 assumes you are using
the EBLOSUM62 matrix for protein sequences,
or a value of 16 and the EDNAMAT matrix for
nucleotide sequences.
-gaplength integer The gap length, or gap extension, penalty is
added to the standard gap penalty for each
base or residue in the gap. This is how long
gaps are penalized. Usually you will expect
a few long gaps rather than many short
gaps, so the gap extension penalty should be
lower than the gap penalty. An exception is
where one or both sequences are single
reads with possible sequencing errors in
which case you would expect many single base
gaps. You can get this result by setting
the gap penalty to zero (or very low) and
using the gap extension penalty to control
gap scoring.
-markx integer This sets the alternate display of matches
and mismatches in alignments.
-markx=0 uses ':','.',' ', for identities,
conservative replacements, and
non-conservative replacements, respectively.
-markx=1 uses ' ','x', and 'X'.
-markx=2 does not show the second sequence,
but uses the second alignment line to
display matches with a '.' for identity, or
with the mismatched residue for mismatches.
-markx=3 outputs a title line with the
percentage identity and score and then
outputs the gapped sequences in multiple
FASTA format.
-markx=4 outputs only the title line with
the percentage identity and score.
-markx=5,6,7,8 and 9 are the same as
-markx=1
-markx=10 outputs a parseable output.
-length integer number of residues per line
Advanced qualifiers: (none)
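As a worked illustration of how -gappenalty and -gaplength combine, the sketch below takes the descriptions above literally: the gap penalty is charged once when a gap is created, and the gap length penalty is added for each residue in the gap. The helper name `gap_cost` is hypothetical and not part of EMBOSS; the default arguments are the documented protein values (14 and 4).

```python
def gap_cost(length, gappenalty=14, gaplength=4):
    """Penalty for a single gap spanning `length` residues (hypothetical helper).

    Assumes cost = gappenalty + gaplength * length, read directly from the
    qualifier descriptions; it is not taken from the matcher source code.
    """
    if length < 1:
        raise ValueError("a gap must span at least one residue")
    return gappenalty + gaplength * length


# One gap of five residues versus five separate single-residue gaps.
one_long_gap = gap_cost(5)                            # 14 + 4*5 = 34
five_short_gaps = sum(gap_cost(1) for _ in range(5))  # 5 * (14 + 4) = 90

# With the gap penalty set to zero, only the per-residue term remains,
# so many short gaps cost the same as one long gap.
long_no_open = gap_cost(5, gappenalty=0)                             # 20
short_no_open = sum(gap_cost(1, gappenalty=0) for _ in range(5))     # 20
```

Under this reading the defaults favour a few long gaps, while a zero gap penalty removes that preference, which is the behaviour described for single reads with sequencing errors.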
| Mandatory qualifiers | | Allowed values | Default |
|---|---|---|---|
| [-sequencea] (Parameter 1) | Sequence USA | Readable sequence | Required |
| [-sequenceb] (Parameter 2) | Sequence USA | Readable sequence | Required |
| [-outfile] (Parameter 3) | Output file name | Output file | <sequence>.matcher |
| Optional qualifiers | | Allowed values | Default |
| -datafile | Matrix file | Comparison matrix file in EMBOSS data path | EBLOSUM62 for protein, EDNAMAT for DNA |
| -alternatives | This sets the number of alternative matches output. By default only the highest scoring alignment is shown. A value of 2 gives you other reasonable alignments. In some cases, for example multidomain proteins or cDNA and genomic DNA comparisons, there may be other interesting and significant alignments. | Integer 1 or more | 1 |
| -gappenalty | The gap penalty is the score taken away when a gap is created. The best value depends on the choice of comparison matrix. The default value of 14 assumes you are using the EBLOSUM62 matrix for protein sequences, or a value of 16 and the EDNAMAT matrix for nucleotide sequences. | Positive integer | 14 for protein, 16 for nucleic |
| -gaplength | The gap length, or gap extension, penalty is added to the standard gap penalty for each base or residue in the gap. This is how long gaps are penalized. Usually you will expect a few long gaps rather than many short gaps, so the gap extension penalty should be lower than the gap penalty. An exception is where one or both sequences are single reads with possible sequencing errors in which case you would expect many single base gaps. You can get this result by setting the gap penalty to zero (or very low) and using the gap extension penalty to control gap scoring. | Positive integer | 4 for any sequence |
| -markx | This sets the alternate display of matches and mismatches in alignments. -markx=0 uses ':','.',' ', for identities, conservative replacements, and non-conservative replacements, respectively. -markx=1 uses ' ','x', and 'X'. -markx=2 does not show the second sequence, but uses the second alignment line to display matches with a '.' for identity, or with the mismatched residue for mismatches. -markx=3 outputs a title line with the percentage identity and score and then outputs the gapped sequences in multiple FASTA format. -markx=4 outputs only the title line with the percentage identity and score. -markx=5,6,7,8 and 9 are the same as -markx=1. -markx=10 outputs a parseable output. | Integer up to 10 | 0 |
| -length | number of residues per line | Integer from 1 to 200 | 60 |
| Advanced qualifiers | | Allowed values | Default |
| (none) | | | |
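The qualifiers in the table can be assembled into a command line programmatically. The following is a minimal sketch, not an official EMBOSS interface: the function name `build_matcher_command` is hypothetical, the range checks mirror the "Allowed values" column, and the lower bound of 0 for -markx is an assumption (the table only says "Integer up to 10").

```python
def build_matcher_command(seq_a, seq_b, outfile, *, alternatives=1,
                          gappenalty=None, gaplength=None, markx=0, length=60):
    """Return an argument list for matcher (hypothetical helper).

    Gap qualifiers are only added when given, so matcher can pick its own
    defaults for protein or nucleotide input.
    """
    if alternatives < 1:
        raise ValueError("-alternatives must be 1 or more")
    if not 0 <= markx <= 10:  # lower bound of 0 is an assumption
        raise ValueError("-markx must be between 0 and 10")
    if not 1 <= length <= 200:
        raise ValueError("-length must be between 1 and 200")

    cmd = ["matcher",
           "-sequencea", seq_a,
           "-sequenceb", seq_b,
           "-outfile", outfile,
           "-alternatives", str(alternatives),
           "-markx", str(markx),
           "-length", str(length)]
    if gappenalty is not None:
        cmd += ["-gappenalty", str(gappenalty)]
    if gaplength is not None:
        cmd += ["-gaplength", str(gaplength)]
    return cmd


cmd = build_matcher_command("sw:hba_human", "sw:hbb_human", "stdout", markx=10)
```

On a machine with EMBOSS installed, the resulting list could be passed to `subprocess.run(cmd, check=True)`; the sketch itself only builds the arguments and does not run anything.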
There are several ways to change the output format using -markx.
Here is the output for the example:
10 20 30 40 50
HBA_HU LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLSH-----GSAQV
:.: .:. : : :::: .. : :.::: :... .: :. .: : ::: :. .:
HBB_HU LTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV
10 20 30 40 50 60
60 70 80 90 100 110
HBA_HU KGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPA
:.::::: :.....::.:.. .....::.::. ::.::: ::.::.. :. .:: :.
HBB_HU KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK
70 80 90 100 110 120
120 130 140
HBA_HU EFTPAVHASLDKFLASVSTVLTSKY
:::: :.:. .: .:.:...:. ::
HBB_HU EFTPPVQAAYQKVVAGVANALAHKY
130 140
Here are the example outputs using other values of -markx:
% matcher sw:hba_human sw:hbb_human stdout -markx 1
Finds the best local alignments between two sequences
43.4% identity in 145 HBA_HUMAN overlap; score: 264
10 20 30 40 50
HBA_HU LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLSH-----GSAQV
x Xx xX X X xxX X x X xxxXx X xXx XX X xXx
HBB_HU LTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV
10 20 30 40 50 60
60 70 80 90 100 110
HBA_HU KGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPA
x XX xxxxx x xxXxxxxx x xX x X x xxX xXx X xXX
HBB_HU KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK
70 80 90 100 110 120
120 130 140
HBA_HU EFTPAVHASLDKFLASVSTVLTSKY
X x xXx Xx x xxx xX
HBB_HU EFTPPVQAAYQKVVAGVANALAHKY
130 140
% matcher sw:hba_human sw:hbb_human stdout -markx 2
Finds the best local alignments between two sequences
43.4% identity in 145 HBA_HUMAN overlap; score: 264
10 20 30 40 50
HBA_HU LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLSH-----GSAQV
HBB_HU .T.EE.SA.T.L....--NVD.V.G...G.LLVVY.W.QRF.ES.G...TPDAVM.NPK.
60 70 80 90 100 110
HBA_HU KGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPA
HBB_HU .A.....LG.FSDGL..L.NLKGTFAT..E..CD..H...E..R..GNV.VCV..H.FGK
120 130 140
HBA_HU EFTPAVHASLDKFLASVSTVLTSKY
HBB_HU ....P.Q.AYQ.VV.G.ANA.AH..
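The -markx 2 display can be reproduced from the two gapped sequences. The sketch below is an illustration of the rule stated for -markx=2 ('.' for an identity, otherwise the residue or gap character from the second sequence); the function name `markx2_line` is hypothetical and this is not the code matcher itself uses.

```python
def markx2_line(first, second):
    """Build the second display line of a -markx 2 style alignment.

    `first` and `second` are gapped, equal-length alignment rows.
    Identical columns become '.', every other column shows the character
    from `second` (a residue or '-').
    """
    if len(first) != len(second):
        raise ValueError("aligned rows must have the same length")
    return "".join("." if a == b else b for a, b in zip(first, second))


# First row of the example alignment above.
hba = "LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLSH-----GSAQV"
hbb = "LTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV"
print(markx2_line(hba, hbb))
```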
% matcher sw:hba_human sw:hbb_human stdout -markx 3
Finds the best local alignments between two sequences
43.4% identity in 145 HBA_HUMAN overlap; score: 264
>HBA_HUMAN ..
LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLSH
-----GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDP
VNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKY
>HBB_HUMAN ..
LTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLST
PDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDP
ENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKY
% matcher sw:hba_human sw:hbb_human stdout -markx 4
Finds the best local alignments between two sequences
43.4% identity in 145 HBA_HUMAN overlap; score: 264
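The percentage identity in the title line can be cross-checked against the gapped sequences from the -markx 3 output. The sketch below assumes identity is the number of identical columns divided by the total number of alignment columns (gap columns included); that definition is an inference from the example figures rather than a statement from the matcher source, and the name `percent_identity` is hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Return (identical columns, overlap length, percent identity).

    Assumes both rows are gapped to the same length and that gap columns
    count towards the overlap but never as identities.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned rows must have the same length")
    identical = sum(1 for a, b in zip(aligned_a, aligned_b)
                    if a == b and a != "-")
    overlap = len(aligned_a)
    return identical, overlap, 100.0 * identical / overlap


# Gapped sequences copied from the -markx 3 example output.
hba_aligned = ("LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLSH"
               "-----GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDP"
               "VNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKY")
hbb_aligned = ("LTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLST"
               "PDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDP"
               "ENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKY")

identical, overlap, pct = percent_identity(hba_aligned, hbb_aligned)
print(f"{pct:.1f}% identity in {overlap} overlap ({identical} identical columns)")
```

For the example pair this counts 63 identical columns out of 145, which rounds to the 43.4% shown in the title line.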
% matcher sw:hba_human sw:hbb_human stdout -markx 10
Finds the best local alignments between two sequences
>>#1
; sw_score: 264
; sw_ident: 0.434
; sw_overlap: 145
>HBA_HUMAN ..
; sq_len: -115
; al_start: 2
; al_stop: 140
; al_display_start: 2
LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLSH
-----GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDP
VNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKY
>HBB_HUMAN ..
; sq_len: 5
; al_start: 3
; al_stop: 145
; al_display_start: 3
LTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLST
PDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDP
ENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKY
Note that the parseable output starts each alignment record with ">>" while each aligned sequence record starts with ">".
All parameters produced will be of the form: ; xx_yyyyy
In this version, the xx prefixes are:
sw - Smith-Waterman scores
sq - sequence length, type
al - alignment start, stop, display_offset
All of the output parameters correspond to values that are presented in other output formats, with the exception of the "al_" parameters.
al_start gives the location of the alignment start in the original sequence
al_stop gives the location of the end of the alignment in the original sequence
al_display_start gives the location of the first displayed amino acid residue in the original sequence.
The -markx=10 alignments are the same as those produced in the other modes. If the beginning of the first sequence aligns with the 10th residue of the second sequence, then the first sequence will be padded with ten leading "-" to produce the alignment. The leading "-" are a formatting convenience only; they are not considered in the numbering system for al_display_start, al_start, or al_stop.
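The record structure described above (">>" for an alignment, ">" for a sequence, "; xx_yyyyy: value" for parameters) lends itself to a small line-based reader. The following is a minimal sketch, assuming each record header, parameter, and sequence chunk sits on its own line; the function name `parse_markx10` and the dictionary layout are hypothetical choices, not an EMBOSS API.

```python
def parse_markx10(text):
    """Parse -markx 10 style text into a list of alignment dictionaries.

    Each alignment has "id", "params" and "sequences"; each sequence has
    "name", "params" and "residues". All values are kept as strings.
    """
    alignments = []
    alignment = None  # current ">>" record
    sequence = None   # current ">" record within the alignment
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">>"):
            alignment = {"id": line[2:].strip(), "params": {}, "sequences": []}
            alignments.append(alignment)
            sequence = None
        elif alignment is None:
            continue  # banner text before the first alignment record
        elif line.startswith(">"):
            parts = line[1:].split()
            sequence = {"name": parts[0] if parts else "",
                        "params": {}, "residues": ""}
            alignment["sequences"].append(sequence)
        elif line.startswith(";"):
            key, _, value = line[1:].partition(":")
            target = sequence if sequence is not None else alignment
            target["params"][key.strip()] = value.strip()
        elif sequence is not None:
            sequence["residues"] += line  # gapped residues, possibly wrapped
    return alignments
```

Parameters that appear before the first ">" line are attached to the alignment (the sw_ values), and those after a ">" line to that sequence (the sq_ and al_ values). Values are left as strings so that the caller decides how to interpret them.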
EMBOSS data files are distributed with the application and stored in the standard EMBOSS data directory, which is defined by the EMBOSS environment variable EMBOSS_DATA.
Users can provide their own data files in their own directories. Project specific files can be put in the current directory, or for tidier directory listings in a subdirectory called ".embossdata". Files for all EMBOSS runs can be put in the user's home directory, or again in a subdirectory called ".embossdata".
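The directories are searched in the following order:
. (your current directory)
.embossdata (under your current directory)
~/ (your home directory)
~/.embossdata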
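As a rough illustration of this kind of lookup, the sketch below walks a list of candidate directories and returns the first one that holds the requested data file. The order used (current directory, its ".embossdata" subdirectory, home directory, its ".embossdata" subdirectory, then the standard data directory as a fallback) is an assumption based on the description above, and `find_data_file` is a hypothetical name rather than an EMBOSS function.

```python
from pathlib import Path


def find_data_file(name, cwd, home, emboss_data=None):
    """Return the first existing path for `name`, or None (hypothetical helper).

    `cwd` and `home` are passed in explicitly so the search can be tested;
    `emboss_data` stands in for the directory named by EMBOSS_DATA.
    """
    cwd, home = Path(cwd), Path(home)
    candidates = [cwd, cwd / ".embossdata", home, home / ".embossdata"]
    if emboss_data is not None:
        candidates.append(Path(emboss_data))  # assumed fallback location
    for directory in candidates:
        path = directory / name
        if path.is_file():
            return path
    return None
```

A call such as `find_data_file("EBLOSUM62", Path.cwd(), Path.home())` would then prefer a project-specific matrix over one kept in the home directory.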
| Program name | Description |
|---|---|
| water | Smith-Waterman local alignment |
This application was modified for inclusion in EMBOSS by Ian Longden (il@sanger.ac.uk) Informatics Division, The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK.
Completed 11th May 1999. Last modified 19th July 1999.