Thomas Junier
Swiss Institute of Bioinformatics
April 3, 2001
Metamotifs are a tool for describing the arrangement of features along sequences. This document describes the metamotif search engine, mmsearch, a Python program that retrieves from a database all sequences that match a given metamotif.
The arrangement of features along a sequence, such as domains along a protein or a DNA stretch, is often more characteristic than the presence of any single feature. For example, protein kinase domains occur in a large number of proteins; so do sterile alpha domains; but only ephrin receptors have a protein kinase domain followed by a sterile alpha domain. In the DNA world, promoters exhibit similar properties. Searching for arrangements may hence enhance the selectivity of searches without lowering their sensitivity, or vice versa.
The mmsearch program works with arrangement descriptions called metamotifs (because they are, in a sense, motifs of motifs). Metamotifs look somewhat like character-oriented regular expressions (often called "patterns" in sequence analysis, see for example the PROSITE patterns), but instead of describing arrangements of characters in a text, they describe arrangements of features in a sequence. The syntax of metamotifs (see section 4) is deliberately similar to that of usual regexps (e.g., those of Perl), but due to specialization to a biological problem, there are significant differences. The mmsearch engine also works differently from a pure regular expression engine, because it performs some tasks (like comparing numbers extracted from its input string) that are beyond pattern matching.
mmsearch does not in itself look for the occurrence of motifs in the sequences (a.k.a. match data): this task is left to specialized predictors like profiles, HMMs, patterns, and the like. The match data is supplied to mmsearch on standard input, in a tab-separated format (several formats are possible). This may seem a drawback, but in fact it allows mmsearch to freely mix match data of any origin, including, for example, database annotations.
The call syntax of mmsearch is:
$ mmsearch [-hOXv] [-i <input_format>] [-o <output_format>] [-n <name>] <metamotif>
The match data are read on stdin. The options are:
## sw:VAV_HUMAN 617-<prf:SH3#1-617 660-prf:SH3>#1-660 671-<prf:SH2#2-671 765-prf:SH2>#2-765 782-<prf:SH3#3-782 842-prf:SH3>#3-842
sw:VAV_HUMAN 601 660 METAMOTIF - - -
sw:VAV_HUMAN 782 842 METAMOTIF - - -
This option allows the output of a metamotif search to be fed back to mmsearch for another metamotif search, so that rather complex queries can be built in a single pipeline (or even more complex ones using scripts). See A for examples of this.
sw:VAV_HUMAN 601 658 pfam:SH3 1 57 10.423
sw:VAV_HUMAN 617 660 prf:SH3 20 -1 11.796
sw:VAV_HUMAN 782 842 prf:SH3 1 -1 17.560
sw:VAV_HUMAN 785 840 pfam:SH3 1 57 18.215
Obviously, this option is most useful in conjunction with option pff.
sw:VAV_HUMAN 602 616 SPACER 1 -1 -
sw:VAV_HUMAN 618 657 SPACER 1 -1 -
sw:VAV_HUMAN 659 659 SPACER 1 -1 -
...
This information can be handy when you suspect that an uncharacterized but conserved region frequently occurs between, say, two motifs A and B. The spacer sequences could be extracted, aligned, and made into a profile. Again, this isn't very useful without the det option.
## sw:VAV_HUMAN 601-<pfam:SH3#3-601 617-<prf:SH3#1-617 658-pfam:SH3>#3-658
660-prf:SH3>#1-660 782-<prf:SH3#2-782 785-<pfam:SH3#4-785 840-pfam:SH3>#4-840
842-prf:SH3>#2-842
sw:VAV_HUMAN 601 660 METAMOTIF - - -
sw:VAV_HUMAN 601 658 pfam:SH3 1 57 10.423
sw:VAV_HUMAN 617 660 prf:SH3 20 -1 11.796
#
sw:VAV_HUMAN 782 842 METAMOTIF - - -
sw:VAV_HUMAN 782 842 prf:SH3 1 -1 17.560
sw:VAV_HUMAN 785 840 pfam:SH3 1 57 18.215
#
Here's the pseudocode for a metamotif search:
1: get match data about relevant sequences and features -> list
2: scan metamotif regexp -> tokens
3: parse tokens -> automaton
4: for each sequence in list:
5:     convert sequence to string representation
6:     search for regexp over string using automaton
Here's what each of these steps does in more detail:
sw:VAV_HUMAN 402 504 prf:PH_DOMAIN 1 -1 10.759
In this particular format (PFF, see B.2 for details), the first four fields are sequence ID, start of match, end of match, and motif ID. These are the fields mmsearch needs; the others are not used. Other popular formats, like GFF (see B.1), are also supported. Whatever the origin and format of the data, they are supplied to mmsearch on standard input.
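As an illustration of this step, here is a minimal Python sketch of reading such records and grouping them by sequence. It is not mmsearch's own code: the names `Match` and `read_matches` are hypothetical, and skipping lines that start with '#' is an assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Match(NamedTuple):
    """One feature match on a sequence (hypothetical helper type)."""
    start: int
    end: int
    feature: str


def read_matches(lines: Iterable[str]) -> Dict[str, List[Match]]:
    """Group tab-separated match records by sequence ID.

    Only the first four fields (sequence ID, start, end, motif ID)
    are used; any further fields are ignored.
    """
    by_sequence: Dict[str, List[Match]] = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # assumption: blank and '#' lines carry no match data
        seq_id, start, end, feature = line.split("\t")[:4]
        by_sequence[seq_id].append(Match(int(start), int(end), feature))
    return dict(by_sequence)


example = [
    "sw:VAV_HUMAN\t402\t504\tprf:PH_DOMAIN\t1\t-1\t10.759",
    "sw:VAV_HUMAN\t617\t660\tprf:SH3\t20\t-1\t11.796",
]
matches = read_matches(example)
print(matches["sw:VAV_HUMAN"])
```

In a real pipeline the lines would come from `sys.stdin` rather than from a list.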
Metamotif searching is (partially) a pattern matching problem,
so the match data pertaining to a sequence must first be
converted into some form of string. Here's an example of such
a string:
181-<SH2#1-181 256-SH2>#1-256    (1)
The feature end data appear in order of position. This may lead to ambiguities when two or more features start (resp. end) at the same position. In this case, the longer feature appears first (resp. last) in the string. All in all, expression 1 states that the sequence contains the start ('<') of domain SH2 # 1 at position 181, and the end ('>') of domain SH2 # 1 (i.e., the same SH2 domain) at position 256.
Note: To speed things up, the whole names of motifs are not used;
instead, they are converted to one-letter symbols, e.g. SH2 -> a, etc.
This representation is also used in mmsearch's native output (option nat, see 2).
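The conversion can be sketched in Python as follows. This is only an illustration under stated assumptions: the token layout is modelled on the native output line shown in section 2, features are numbered in input order, and a start is placed before an end that falls on the same position. The function name `to_string` is hypothetical.

```python
from typing import List, Tuple


def to_string(matches: List[Tuple[int, int, str]]) -> str:
    """Convert (start, end, name) matches into a string of feature ends.

    Ends are ordered by position. On ties, the longer feature comes
    first among starts and last among ends.
    """
    ends = []
    for number, (start, end, name) in enumerate(matches, start=1):
        length = end - start + 1
        # Sort key: position, then starts (0) before ends (1),
        # then longer-first for starts and longer-last for ends.
        ends.append(((start, 0, -length), f"{start}-<{name}#{number}-{start}"))
        ends.append(((end, 1, length), f"{end}-{name}>#{number}-{end}"))
    ends.sort(key=lambda item: item[0])
    return " ".join(token for _, token in ends)


print(to_string([(181, 256, "SH2")]))
# 181-<SH2#1-181 256-SH2>#1-256
```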
The engine consists of the following Python files:
This section describes all the elements of the metamotif syntax.
A feature is denoted simply by its name, which may include alphanumeric characters (case is significant) as well as ':' and '_'.
SH2
PKINASE
etc.
When two features (or feature ends) directly follow one another, separate them with a separator, ' = ' (space, 'equal', space). The spaces are in fact optional, but I feel they enhance readability.
SH2 = SH3    SH2 followed by SH3
See also note 1 in section 5.
It is sometimes necessary to specify feature ends rather than whole features, for example when dealing with overlaps or inclusions. The start of a feature is indicated by a '<' preceding the feature, its end by a '>' following the feature:
<SH2    start of SH2
SH2>    end of SH2
In fact, mmsearch only deals with feature ends. A "whole" feature, i.e. one representing the total extent of the feature, is silently converted to two corresponding ends, e.g. if you say
SH2
the program will convert it to
<SH2 = SH2>
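A minimal sketch of this rewriting in Python (the function name `expand_feature` is hypothetical, and the real program may do this at a different stage):

```python
def expand_feature(token: str) -> str:
    """Rewrite a "whole" feature as its two ends; leave feature ends alone."""
    # Feature names cannot contain '<' or '>', so either character
    # means the token is already a feature end.
    if "<" in token or ">" in token:
        return token
    return f"<{token} = {token}>"


print(expand_feature("SH2"))   # <SH2 = SH2>
print(expand_feature("<SH2"))  # <SH2
```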
When the number of residues that separate feature ends (or features) is important, specify the range of acceptable values with a spacer: 'm,n'.
53EXO_N_DOMAIN = 4,11 = 53EXO_I_DOMAIN
This means "from 4 to 11 residues between the Exo-N and the Exo-I domains", and is typical of eubacterial DNA polymerases. Spacers are separated from feature ends by a separator (4.2). They can be open-ended, e.g. ',20' means "at most 20 residues" while '600,' means "at least 600". Spacers must always be preceded and followed by a feature or feature end.
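To make the spacer semantics concrete, here is a small Python sketch. It assumes that both bounds are inclusive and that the gap is counted as the residues strictly between two feature ends; the names `parse_spacer` and `gap_ok` are hypothetical.

```python
from typing import Optional, Tuple

Spacer = Tuple[Optional[int], Optional[int]]


def parse_spacer(text: str) -> Spacer:
    """Parse 'm,n', ',n' or 'm,' into (low, high); None means unbounded."""
    low, comma, high = text.partition(",")
    low, high = low.strip(), high.strip()
    if not comma or not (low or high):
        raise ValueError(f"not a spacer: {text!r}")
    return (int(low) if low else None, int(high) if high else None)


def gap_ok(gap: int, spacer: Spacer) -> bool:
    """Check a gap length against a spacer (bounds assumed inclusive)."""
    low, high = spacer
    return (low is None or gap >= low) and (high is None or gap <= high)


# Gap between a feature ending at 120 and the next one starting at 128.
gap = 128 - 120 - 1
print(gap_ok(gap, parse_spacer("4,11")))   # True
print(gap_ok(gap, parse_spacer("600,")))   # False
```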
It can be requested that a metamotif occur at the beginning or end of the string. This is done with the usual regexp characters, '^' (start) and '$' (end). Anchoring at the start of the string can speed up the search appreciably, because if the pattern does not match immediately, mmsearch does not try to match at other positions in the sequence. See also note 2 in section 5.
Anchors may sometimes 'correct' unexpected (but nevertheless correct) behaviour. For example, say you wish to find this arrangement:
FNIII = FNIII = PKINASE
Running this will yield all manner of proteins with at least two fibronectin type-III domains followed by a protein kinase:
In this case, what the user wanted was probably only sequence #2. What has gone wrong? Nothing, in fact. It's just that in cases #1 and #3, the match does not start at the beginning of the sequence (the match is indicated by a dashed line). To specify that the match must begin at the start of the sequence, say
^FNIII = FNIII = PKINASE
The '$' anchor works much in the same way, but restricts
the match to the end of the sequence. Thus, if you wished to look
for sequences that contain exactly the above pattern, nothing before,
nothing after, you would say:
^FNIII = FNIII = PKINASE$
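The effect of the anchors can be imitated with ordinary Python regular expressions. This is only an analogy, since mmsearch is not a pure regexp engine; the one-letter encoding and the four sequences below are hypothetical.

```python
import re

# Hypothetical one-letter encoding: F = FNIII, K = PKINASE, I = another domain.
sequences = {
    "seq1": "IFFK",  # another domain precedes the arrangement
    "seq2": "FFK",   # exactly the arrangement
    "seq3": "FFFK",  # three FNIII domains before the kinase
    "seq4": "FFKI",  # another domain follows the arrangement
}


def hits(pattern: str) -> list:
    """Return the names of the sequences in which the pattern is found."""
    return [name for name, s in sequences.items() if re.search(pattern, s)]


print(hits("FFK"))    # ['seq1', 'seq2', 'seq3', 'seq4']
print(hits("^FFK"))   # ['seq2', 'seq4']
print(hits("^FFK$"))  # ['seq2']
```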
When there are several mutually exclusive possibilities (or 'branches'), separate them by an alternative: '(b1|...|bn)'. A branch can consist of an arbitrarily long list of features; spacers are allowed except at the beginning or end of a branch. Alternatives can be nested, i.e. a branch can contain an alternative. (In terms of grammar, a branch must be a FEATURE_LIST; see section 5.) Examples:
(SH2|SH3)    either SH2 or SH3
There may be any number of branches, but only one of the branches may match. If you need to express the possibility of multiple branches matching simultaneously (i.e., overlapping), use an equivalence class (4.8).
It is possible to look for a variable number of occurrences of some arrangement of motifs (again, a FEATURE_LIST in grammatical terms), which we call a range. Delimit the list with parentheses and specify the maximum and minimum values between braces ('{}') just after the closing parenthesis: '(...){m,n}'. Here's a possible characterization of the nerve growth factor receptor family:
(TNFR_NGFR_2){1,4} = DEATH_DOMAIN
What this stands for is "one to four (inclusive) tumor necrosis or nerve growth factor receptor domain(s) (TNFR_NGFR), then a death domain". Ranges can be open-ended, e.g., (XY){2,} means at least two XYs, and (XY){,3} means at most 3 XYs.
Ranges of the form {,0} can be used to indicate a motif that must not occur at this position. Here is an expression for a class of receptor protein kinases:
FURIN_LIKE = PKINASE
Such proteins fall into two categories: insulin receptors and related proteins, and ERB-like oncogenes. A discriminating feature is the presence, in the former group, of at least one fibronectin type-III domain (FN3) between the Furin-like and the protein kinase domains. To select the oncogenes, i.e. those that don't have any FN3 domain at this position, use this expression:
FURIN_LIKE = (FN3){,0} = PKINASE
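Python's own regular expressions accept the same open-ended range notation, so the behaviour of ranges, including the '{,0}' idiom, can be tried out on a toy encoding. The one-letter symbols below are hypothetical, and the comparison with mmsearch is an analogy only.

```python
import re

# Hypothetical one-letter encoding: T = TNFR_NGFR_2, D = DEATH_DOMAIN,
# U = FURIN_LIKE, N = FN3, K = PKINASE.
ngfr_like = re.compile(r"(?:T){1,4}D")  # one to four T, then D
no_fn3 = re.compile(r"U(?:N){,0}K")     # U directly followed by K, no N between

print(bool(ngfr_like.search("TTD")))  # True
print(bool(ngfr_like.search("D")))    # False: at least one T is required
print(bool(no_fn3.search("UK")))      # True
print(bool(no_fn3.search("UNNK")))    # False: FN3 domains lie in between
```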
Two different predictors of the same motif (say, Pfam's and PROSITE's versions of SH2) do not always completely agree: there may be small to medium discrepancies in the start and stop positions, for example. When several motifs can occur at the same position, or at least with some overlap, specify them with an equivalence class, [b1|...|bn]:
[PROSITE_SH2|PFAM_SH2]
This reads "SH2 from PROSITE, or Pfam, or both - in which case they must overlap". The branches of an equivalence class must be FEATURE_LISTs (see the grammar, section 5). There may be any number of branches.
This variant lets the user specify that all branches must match. The matched substring is a contiguous region which has at least one match of each branch. For example, to see where a gene on the minus strand overlaps a gene on the plus strand, you may say:
[!PLUS_STRAND_GENE|MINUS_STRAND_GENE]
The '!' ensures that only regions with matches of both features will be reported. With an ordinary equivalence class, you'd get a report of all genes, because a match of a single branch is enough for a match of the equivalence class.
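The idea behind the '!' variant can be sketched on plain intervals. This is not how mmsearch implements it; the function name, the coordinates, and the choice to report the combined extent of each overlapping pair are all assumptions made for illustration.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), both inclusive


def overlapping_regions(plus: List[Interval], minus: List[Interval]) -> List[Interval]:
    """Return one contiguous region per overlapping (plus, minus) pair."""
    regions = []
    for p_start, p_end in plus:
        for m_start, m_end in minus:
            if p_start <= m_end and m_start <= p_end:  # the two intervals overlap
                regions.append((min(p_start, m_start), max(p_end, m_end)))
    return sorted(regions)


plus_genes = [(100, 500), (2000, 2500)]
minus_genes = [(400, 900), (3000, 3500)]
print(overlapping_regions(plus_genes, minus_genes))  # [(100, 900)]
```

Genes that overlap nothing on the other strand, such as (2000, 2500) here, are not reported, which is the difference from an ordinary equivalence class.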
Representing a sequence's features as a string has a potential problem, namely when two features start (or end) at the same position. This would be the case, for example, when two predictors of the same feature are in agreement (this is far from being always the case, but it happens). Suppose a sequence has an SH3 domain, identified both by a PROSITE profile and a Pfam HMM, starting on residue 53. The string representation could be
...53-<PROSITE_SH3#1-53 53-<PFAM_SH3#2 ...
or
...53-<PFAM_SH3#1-53 53-<PROSITE_SH3#2 ...
In this case, a simple metamotif like
^PROSITE_SH3
will match only in the first case. The workaround is to use inclusive
'or's, like this:
^[PROSITE_SH3|PFAM_SH3]
Sometimes it is necessary to identify feature ends, i.e. to know which start corresponds to which stop. Consider the disulfide bridges in this diagram.
Both correspond to the arrangement,
<SS = <SS = SS> = SS> = SS
where SS is a disulfide bridge. Now case A corresponds to the EGF domain (and a few others). However, the above expression cannot distinguish case A from case B and is thus not suitable for finding EGF domains. It must be modified to
<SS#1 = <SS#2 = SS>#1 = SS>#2 = SS
where the '#1's and '#2's identify individual disulfide bridges.
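Identifiers behave much like backreferences in ordinary regexps, which gives a quick way to see why they are needed. The sketch below uses Python's `re` module on two simplified, hypothetical strings (positions and the final whole SS are left out): one where the two bridges cross, and one where the second is nested in the first.

```python
import re

crossing = "<SS#1 <SS#2 SS>#1 SS>#2"  # first bridge closes before the second
nested = "<SS#1 <SS#2 SS>#2 SS>#1"    # second bridge closes before the first

# Without identifiers: any two starts followed by any two ends.
unlabelled = re.compile(r"<SS#\d+ <SS#\d+ SS>#\d+ SS>#\d+")
# With identifiers: the first end must belong to the first start (\1),
# and the second end to the second start (\2).
labelled = re.compile(r"<SS#(\d+) <SS#(\d+) SS>#\1 SS>#\2")

print(bool(unlabelled.fullmatch(crossing)), bool(unlabelled.fullmatch(nested)))  # True True
print(bool(labelled.fullmatch(crossing)), bool(labelled.fullmatch(nested)))      # True False
```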
Here's the metamotif grammar:
METAMOTIF     ::= (SEQUENCE|FORK)+
SEQUENCE      ::= (FEATURE_LIST|GROUP)+
FORK          ::= L_BRACKET BANG? FEATURE_LIST { PIPE FEATURE_LIST }* R_BRACKET
GROUP         ::= L_PAREN SEQUENCE { PIPE SEQUENCE }* R_PAREN { RANGE }
FEATURE_LIST  ::= FEATURE_BLOCK { SPACER FEATURE_BLOCK }*
FEATURE_BLOCK ::= FEATURE_END+
SPACER        ::= INTEGER COMMA INTEGER
                | COMMA INTEGER
                | INTEGER COMMA
FEATURE_END   ::= FEATURE_START | FEATURE_STOP
FEATURE_START ::= L_A_BRACKET LETTER { HASH ( LETTER | DIGIT ) }?
FEATURE_STOP  ::= LETTER R_A_BRACKET { HASH ( LETTER | DIGIT ) }?
INTEGER       ::= DIGIT+
LETTER        ::= ['A'-'Z''a'-'z']
DIGIT         ::= ['0'-'9']
L_PAREN       ::= '('
R_PAREN       ::= ')'
L_BRACE       ::= '{'
R_BRACE       ::= '}'
L_BRACKET     ::= '['
R_BRACKET     ::= ']'
L_A_BRACKET   ::= '<'
R_A_BRACKET   ::= '>'
COMMA         ::= ','
PIPE          ::= '|'
HASH          ::= '#'
BANG          ::= '!'
Note 1: The grammar has no notion of separators. In fact, separators are ignored after converting the feature names to a 1-letter representation (their role is precisely to allow this conversion), and they are not part of the automaton.
Note 2: The grammar has no notion of anchors ('^' and '$'). These characters do not cause different automata to be constructed (this used to be the case in older versions); they simply cause the automaton to behave differently (e.g., by aborting early in the case of '^').
Note 3: The grammar has no notion of "whole" features, because any such names are converted to the equivalent two-ends form (see 4.3).
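The scanning step (metamotif -> tokens) that precedes this grammar can be sketched with Python's `re` module. This is a hypothetical tokenizer, not the one shipped with mmsearch: the token names are my own, it works on full feature names (before the one-letter conversion), and it simply drops separators, in the spirit of note 1.

```python
import re

NAME = r"[A-Za-z0-9_:]+"
TOKEN = re.compile(rf"""
    (?P<SPACER>\d*,\d*)             # 4,11  ,20  600,
  | (?P<RANGE>\{{\d*,\d*\}})        # {{1,4}}  {{2,}}  {{,0}}
  | (?P<START><{NAME}(?:\#\w+)?)    # <SH2  <SS#1
  | (?P<STOP>{NAME}>(?:\#\w+)?)     # SH2>  SS>#1
  | (?P<WHOLE>{NAME})               # SH2
  | (?P<PUNCT>[()\[\]|!^$])         # grouping, branches, anchors
  | (?P<SEP>\s*=\s*)                # separator, dropped below
  | (?P<WS>\s+)                     # stray whitespace, dropped below
""", re.VERBOSE)


def tokenize(metamotif: str) -> list:
    """Split a metamotif into (kind, text) pairs, without separators."""
    tokens, pos = [], 0
    while pos < len(metamotif):
        m = TOKEN.match(metamotif, pos)
        if m is None:
            raise ValueError(f"unexpected character at {pos}: {metamotif[pos]!r}")
        if m.lastgroup not in ("SEP", "WS"):
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


print(tokenize("<gene#1 = ,10000 = gene>#1"))
# [('START', '<gene#1'), ('SPACER', ',10000'), ('STOP', 'gene>#1')]
```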
It is assumed that match data are available, either in a database or by running a search program on the fly, and that they are passed to mmsearch on standard input.
Eubacterial DNA polymerases:
mmsearch '53EXO_N_DOMAIN = 5,10 = 53EXO_I_DOMAIN = 600, = C_TERM'
Inclusion: sequences that have XPG_1 embedded in 53EXO_N:
mmsearch '<53EXO_N_DOMAIN = XPG_1 = 53EXO_N_DOMAIN>'
Superposition: PROTEIN_KINASE_DOMAIN or PKINASE or both:
mmsearch '[PKINASE|PROTEIN_KINASE_DOMAIN]'
Alternative:
mmsearch '(IG|FN3) = PKINASE'
Repetition:
mmsearch '(IG|FN3){2,4}'
Long repetition:
mmsearch '(EGF){30,}'
Tyr PK embedded into PK (Pfam or Prosite) - some are found outside PK domains!
mmsearch '[<PROTEIN_KINASE_DOM|<PKINASE] = PROTEIN_KINASE_TYR = [PROTEIN_KINASE_DOM>|PKINASE>]'
Gene with at least 10 exons:
mmsearch '<gene = (exon){10,} = gene>'
Gene less than 10 kb long:
mmsearch '<gene#1 = ,10000 = gene>#1'
Gene with at least 10 exons and less than 10 kb long:
mmsearch -o pff -n 10_ex_gene '<gene = (exon){10,} = gene>' |
mmsearch '<10_ex_gene#1 = ,10000 = 10_ex_gene>#1'
See also the Hits examples page.
GFF (General Feature Format) was originally proposed by Richard Durbin and
David Haussler. It is a format for describing features in DNA
sequences. A full description is available from
http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml. There is one
record per line, and each record pertains to one feature in one sequence. The fields
are:
<seqname> <source> <feature> <start> <end> <score> <strand> <frame> [attributes]
And here's an example, taken from the above URL:
SEQ1 EMBL exon 103 172 . + 0
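As a sketch of how such a record maps onto the four fields mmsearch needs (sequence ID, start, end, feature), here is a small Python function. The name `gff_to_match` is hypothetical, and how mmsearch itself uses the source field is not covered here.

```python
from typing import Tuple


def gff_to_match(line: str) -> Tuple[str, int, int, str]:
    """Extract (seqname, start, end, feature) from one tab-separated GFF record."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError(f"expected at least 8 fields, got {len(fields)}")
    seqname, _source, feature, start, end = fields[:5]
    return seqname, int(start), int(end), feature


print(gff_to_match("SEQ1\tEMBL\texon\t103\t172\t.\t+\t0"))
# ('SEQ1', 103, 172, 'exon')
```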
PFF (Protein Feature Format) is a derivative of GFF, specialized for protein
features. It is also able to represent partial matches of a profile or
HMM. Like GFF, PFF has one record per line, each record pertaining
to one feature in one sequence. The fields are:
<sequence> <seq_begin> <seq_end> <feature> <ft_begin> <ft_end> <score>
The <seq_begin> and <seq_end> fields are
the positions of the match in the sequence. The <ft_begin> and
<ft_end> are the positions of the match along the model (from
the beginning and end, respectively). For full matches, these are 1 and -1, but
when matches are partial, these may differ. If the first five positions of the
model are missing in the match, say, then <ft_begin> will be
6. If the last five are missing, then <ft_end> will be -6.
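The arithmetic can be written down in a few lines of Python. The function name is hypothetical. Note that some records shown earlier (e.g. pfam:SH3 1 57) carry a positive <ft_end>; without the model length nothing can be said about the end in that case, so the sketch returns None for it, which is my own convention.

```python
from typing import Optional, Tuple


def missing_model_positions(ft_begin: int, ft_end: int) -> Tuple[int, Optional[int]]:
    """Return how many model positions are missing at the start and at the end."""
    missing_at_start = ft_begin - 1
    # A negative ft_end counts from the end of the model: -1 is the last position.
    missing_at_end = -ft_end - 1 if ft_end < 0 else None
    return missing_at_start, missing_at_end


print(missing_model_positions(1, -1))   # (0, 0): full match
print(missing_model_positions(6, -6))   # (5, 5): five positions missing at each end
print(missing_model_positions(20, -1))  # (19, 0)
```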
The translation was initiated by Thomas Junier on 2001-04-03