Copyright (C) 1993 1994 The Regents of the University of California
Note: The Ultimate Parser C++ library is still in test release. You will be performing a valuable service if you report any bugs you encounter.
In addition to the coders mentioned below, the following people all participated in the numerous discussions and design sessions leading to the Ultimate Parser library. The areas to which they primarily contributed are listed below.
The Ultimate Parser C++ library is designed to support the development of new machine learning techniques for biosequence analysis. Its guiding principles include:
The following conventions are adapted from the GNU C++ library manual.
istream
and ostream, for AT&T C++ compatibility. Multi-word class
names capitalize each word, with no underscore separation.
#pragma once facility
is also used to avoid re-inclusion.
_Srep struct, which
is used only by the String and SubString classes.)
set_File_exception_handler().
These classes are used for representing bases, alphabets, and sequences.
Base class
The Base class is the primary data representation of biosequence
characters, whether they be nucleotides, amino acids, or an alternative.
The motivating factor behind the Base class is to enable largely
alphabet-independent data manipulation by relying on a standard data
format across all alphabets. This representation enables the
efficient implementation of multi-alphabet routines while also
providing an interface that supports alphabet-specific operations.
See section The Alphabet Classes. Across all alphabets, the null_char() is an
input and output null character, and the bad_char() is an output
illegal character for that alphabet.
The Base class has no constructors. This is to speed pass by
value and to enable the placement of bases in registers.
Bases are initialized and assigned as in the following examples:
Base base;
Base for a given alphabet, and
should be set to the null base specifically if desired.
Base base = Base::null();
Base x; x.set_int (n);
Base x = base_int (n);
Base x, y; x = y;
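The effect of omitting constructors can be illustrated with a minimal sketch. MiniBase and its members are hypothetical stand-ins written for this example, not the library's declarations; the null code 63 is taken from the representation described later in this manual.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-in for Base: one byte, no user-declared constructors.
struct MiniBase {
    unsigned char code;                        // raw integer code

    // Null base (code 63 is an assumption borrowed from the manual's text).
    static MiniBase null() { return base_int(63); }

    // Build a base from an integer without needing a constructor.
    static MiniBase base_int(int n) {
        MiniBase b;
        b.set_int(n);
        return b;
    }

    void set_int(int n) {
        assert(n >= 0 && n < 256);             // must fit in 8 bits
        code = static_cast<unsigned char>(n);
    }

    int raw_int() const { return code; }
};

// With no constructors the type stays trivial, so a compiler is free to
// pass it in a register and copy it like a plain char.
static_assert(std::is_trivial<MiniBase>::value, "MiniBase must stay trivial");
static_assert(sizeof(MiniBase) == 1, "MiniBase should occupy one byte");
```

As with the real class, a default-declared MiniBase is uninitialized until set_int() or an assignment gives it a value.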
Alph Class
Bases, such as nucleotides, often have chemical variants which are often
ignored in the development of analysis software. The Ultimate Parser
enables the consideration of variant bases by including variant
structures in the base class. Many functions of the Base and
Alphabet class have a parameter from the global enum
VarEnum {NO_VARS, VARS}, whose options specify, respectively, that
character variants should be ignored (all treated as the primary
character) or not ignored (treated as different characters). Routines
should default, in the absence of a VarEnum parameter, to
NO_VARS, using the canon() function, below, to ignore
variant characters.
Bases may also be indeterminate. In the amino acid case, for example, biologists use the letter `B' to represent either `N' or `D', and the letter `X' to represent any of the 20 amino acids. Translating a wildcard is impossible without alphabet information, but identifying one is possible.
The null character can be regarded as a special, non-matching wildcard.
Checking for the null character, performed by default, can be turned off
for routines that take a NullEnum argument of NULL_NULL_FALSE
or NULL_NULL_TRUE. It is generally advised that checking
for null bases be enabled; however, for reasons of efficiency this
option may be turned off from time to time.
Matching a wildcard against another wildcard has two flavors according
to which function is called. In wc_match functions
both match tables are searched for the other character, and
in the wc_subset functions, only the first base's match table is
checked, allowing, for example, the assertion that any character is a
subset of the complete wild card, while the complete wild card is only a
subset of itself and other complete wildcards.
The null base is not a subset of any other base, including the
null base. All other bases are subsets of themselves.
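The difference between the two flavors can be sketched by modelling each base as a bitmask of the normal characters it matches. This is only an illustration: the masks, the constant names, and the free functions below are assumptions made for the example and are not the library's match tables or signatures.

```cpp
#include <cassert>

// One bit per normal nucleotide; wildcards are unions of those bits.
enum : unsigned { A = 1, G = 2, C = 4, T = 8 };
const unsigned R   = A | G;          // wildcard matching A or G
const unsigned M   = A | C;          // wildcard matching A or C
const unsigned N   = A | G | C | T;  // complete wildcard
const unsigned NUL = 0;              // null base: matches nothing

// True if `sub` matches no character outside those matched by `super`.
// The null base is never a subset and has no subsets.
inline bool wc_subset(unsigned super, unsigned sub) {
    if (super == NUL || sub == NUL) return false;
    return (sub & ~super) == 0;
}

// Symmetric flavor: succeed if either base is contained in the other.
inline bool wc_match(unsigned x, unsigned y) {
    return wc_subset(x, y) || wc_subset(y, x);
}
```

In this model wc_subset(N, A) holds while wc_subset(A, N) does not, and R and M do not match each other even though both include A, because neither contains the other.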
The following informational and conversion functions involving variant and wildcard bases are available for members of the Base class.
Base: int is_normal (void)
Base: int is_wild (void)
Base: int is_variant (void)
Base: int is_null (void)
Base: Base null (void)
Base.
Base: char null_char (void)
Base.
Base: int is_null_char (int char)
Base: char bad_char (void)
Base.
Base: Base canon (void)
Base that is the canonical form of *this. If x
is not a variant character, x.canon() == x.
In the current implementation, each Base is represented by
an 8-bit character. The most significant 2 bits are used to represent
up to 3 variants of the primary character. The canon() function
simply masks out these bits. The null base is always represented by the
integer 63 (i.e., the lower 6 bits all set), and the numbers between 20
and 62, inclusive, are definable wildcards that require an
Alphabet (see section The Alphabet Classes) for translation.
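The masking described above can be sketched on plain integers. The helper names are hypothetical, and the layout (primary code in the low 6 bits, variant number in the high 2 bits, 63 for null) is an assumption drawn from this paragraph rather than from the library source.

```cpp
#include <cassert>

const int kPrimaryMask = 0x3F;  // lower 6 bits hold the primary code
const int kNullCode    = 63;    // lower 6 bits all set

// Strip the variant bits, leaving the canonical (primary) code.
inline int canon(int raw) { return raw & kPrimaryMask; }

// Variant number stored in the top 2 bits; 0 means the primary form.
inline int variant_of(int raw) { return (raw >> 6) & 0x3; }

inline bool is_variant(int raw) { return variant_of(raw) != 0; }
inline bool is_null(int raw)    { return canon(raw) == kNullCode; }
```

For instance, the raw code 0x45 (binary 01000101) would be variant 1 of primary code 5 under this layout.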
Base limits (void)
Base class' underlying representation.
Base: int raw_int (void)
Alphabet::max_num_var_base(), which is expected to remain at
least as high as 256. Sparse conversion, which provides indices dependent on
the alphabet length, requires knowledge of the alphabet. See section The Alphabet Classes.
Base: void set_int (const int i)
raw_int.
Base: Base base_int (const int i)
raw_int.
The Alphabet class (see section The Alphabet Classes) has access to private
Base members for direct integer cast, construction, and
assignment.
Bases can be matched to check for equality. The equality operator,
==, is not provided (it is declared as a private class
member so that any use produces a compilation error). All equality operations on
bases must
explicitly specify whether or not wildcard matching is desired.
The routines are:
Base: int no_wc_match (const Base base2, const NullEnum nullopt = NULL_NULL_FALSE, const VarEnum varopt = NO_VARS)
Base: static int no_wc_match (const Base base1, const Base base2, const NullEnum nullopt = NULL_NULL_FALSE, const VarEnum varopt = NO_VARS)
*this or
base1, 0
otherwise. Parameter nullopt, either NULL_NULL_FALSE or
NULL_NULL_TRUE, determines the result of matching two null
characters. Parameter varopt controls
use of variants -- set to VARS to treat character variants as
unique characters. Wild cards will match themselves but no other
characters. See section Variants and Wildcards.
Base: int wc_match (const Base base, const Alphabet *alphabet, const VarEnum varopt = NO_VARS)
*this, 0 otherwise. Null
characters, a type of wildcard, are always checked.
Two wild cards will match if either is included in the other's match
table. Parameter varopt indicates whether or not variants should be
used -- set to VARS to treat character variants as unique
characters.
See section Variants and Wildcards.
Base: int wc_subset (const Base base, const Alphabet *alphabet, const VarEnum varopt = NO_VARS)
*this associated with
alphabet, return 1 if base is a member of *this's
wildcard table, 0 otherwise. That is, return whether or not
*this matches either the same or more characters than base.
Null characters have no subsets and are not subsets of any
other character. Parameter varopt indicates whether or not
variants should be used -- set to VARS to treat character
variants as unique characters.
See section Variants and Wildcards.
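Of the routines above, no_wc_match is the simplest to picture: variants are folded together unless VARS is given, and a null-versus-null comparison is decided by nullopt. The following is a sketch on raw integer codes, not the library's member function; the 6-bit mask and the null code 63 are assumptions taken from the representation described earlier.

```cpp
#include <cassert>

enum VarEnum  { NO_VARS, VARS };
enum NullEnum { NULL_NULL_FALSE, NULL_NULL_TRUE };

const int kPrimaryMask = 0x3F;  // assumed: low 6 bits hold the primary code
const int kNullCode    = 63;    // assumed code of the null base

// Compare two raw base codes without consulting any wildcard table.
inline int no_wc_match(int a, int b,
                       NullEnum nullopt = NULL_NULL_FALSE,
                       VarEnum varopt = NO_VARS) {
    if (varopt == NO_VARS) {             // fold variants onto the primary form
        a &= kPrimaryMask;
        b &= kPrimaryMask;
    }
    bool a_null = (a & kPrimaryMask) == kNullCode;
    bool b_null = (b & kPrimaryMask) == kNullCode;
    if (a_null && b_null)                // null vs. null is decided by nullopt
        return nullopt == NULL_NULL_TRUE;
    return a == b;                       // wildcards match only themselves
}
```

A null base compared with a non-null base never matches here, because their codes differ regardless of nullopt.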
The alphabet classes contain information on the interpretation of a
Base (see section The Base class). Each member of the Alphabet class hierarchy
is expected to have at most one instantiation (this is checked at
runtime), a static member of the Alph class (see section Alph Class). The
alphabet class is implemented with its descendents in mind, so that most
functions are not virtual.
Alphabet Class
All alphabets support a variety of functions.
Alphabet: String& name (void)
Alphabet: Alphabet* id (void)
The following three information functions are virtual to allow extensibility of the alphabet class. They are the only virtual functions in the alphabet class.
Alphabet: int is_nucleic (void)
Alphabet: int is_rnucleic (void)
Alphabet: int is_amino (void)
Alphabet: char to_char (const Base base, const VarEnum varopt = NO_VARS)
Alphabet: Base to_base (const char ch)
Base class). If ch is not a valid character for the alphabet,
a null base is returned.
Alphabet: Base valid_or_null (const Base base)
Alphabet: Base null (void)
Alphabet: int is_valid (const Base b)
Alphabet: int wc_match (const Base base1, const Base base2, const VarEnum varopt = NO_VARS)
VARS.
Alphabet: int wc_subset (const Base base1, const Base base2, const VarEnum varopt = NO_VARS)
base2 is a (possibly improper) subset of
base1. That is, whether or not base1 matches the same or
more characters than base2.
Null bases have no subsets and are not included in any
subset. To match variant forms, set varopt to VARS.
Alphabet: static int no_wc_match (const Base base1, const Base base2, const NullEnum nullopt = NULL_NULL_FALSE, const VarEnum varopt = NO_VARS)
base2, 0
otherwise. Parameter nullopt, either NULL_NULL_FALSE or
NULL_NULL_TRUE, determines result when both bases are null
characters, while varopt controls use of variants -- set to
VARS to treat character variants as unique characters. Wild
cards will match themselves but no other characters. This is simply
another way of accessing the Base::no_wc_match()
function. See section Matching.
Alphabet: int index (const Base base)
max_index(), defined below. If data is known to
contain no wildcards (see section Variants and Wildcards), the programmer
may wish to simply perform an integer cast on base rather than
calling this function. If it is known that no null characters are
included in the data, the index will range between 0 and
norm_length()+wc_length(). This function is most useful for
nucleotide alphabets, as integer casts of base for an amino acid
alphabet are already reasonably compact.
Alphabet: Base unindex (const int index)
Alphabet: const Base* matches (const Base base, const VarEnum v = NO_VARS)
Base::null()-terminated list of all non-variant,
non-wildcard bases that match base (0 if base is not valid,
1 if base is a non-wildcard, more if base is a wildcard.)
It is slightly more efficient to check base.is_wild() explicitly
rather than relying on the return of a singleton set.
Alphabet: int num_matches (const Base base)
Alphabet: const String& abbrev (const Base base)
Alphabet: const String& full_name (const Base base)
Alphabet: int norm_length (void)
Alphabet::norm_length() are returned from the integer
type conversion of a normal character (see section Indexing).
Alphabet: int wc_length (void)
Alphabet: int norm_wc_length (void)
Alphabet: int max_num_base (void)
Alphabet: int max_num_var_base (void)
Alphabet: int first_char (void)
Alphabet: int last_char (void)
Alphabet: int first_wc (void)
first_char() and norm_length().
Alphabet: int last_wc (void)
Alphabet: int first_var (void)
Alphabet: int last_var (void)
The ability to easily create efficient alphabet-independent procedures
has been the guiding feature of the Alphabet (see section The Alphabet Classes) and Base
(see section The Base class) implementations. Not only must a uniform interface to
the biosequence (or alternate domain) alphabets be provided, but the
system must allow alphabets of different types to coexist within one
program. Thus, compile-time switches on alphabets were quickly ruled
out. For efficiency, many operations, such as comparing without
wildcards and assembling counts of base occurrences, can be completed
without reference (or without inner-loop reference) to an alphabet. The
structure of the base class also ensures that, for functions that
require alphabet information, efficiency is preserved for the common
case. Thus, for example, the index of a normal character or comparison
of two normal characters is performed without referencing the alphabet.
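Counting base occurrences is a simple instance of an operation that needs no alphabet. In the sketch below the raw codes are used directly as array indices; the function name is hypothetical and the bound of 256 is an assumption based on the 8-bit representation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Tally how often each raw base code occurs. No alphabet is consulted:
// the codes themselves serve as indices into the count table.
inline std::vector<long> count_codes(const unsigned char* codes, std::size_t n) {
    assert(codes != nullptr || n == 0);
    std::vector<long> counts(256, 0);    // one slot per possible 8-bit code
    for (std::size_t i = 0; i < n; ++i)
        ++counts[codes[i]];
    return counts;
}
```

An alphabet would only be needed afterwards, for example to print each count next to its character.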
The current implementation is geared to nucleotides and amino acids. The 64-element codon alphabet, for example, would not fit well in the current underlying implementation because of the Base class's current upper limit of 20 normal characters. Codons could, of course, be represented as variants on the amino acids, though this would require a radically different index function for the codons to compress the range to 64 elements. Thus, in future revisions, index may have to become a virtual function.
The alphabet class has several protected member functions to aid the creation of new alphabets. These functions are not needed for general programming.
Alphabet Alphabet (const String& name, const String& chars = "", const int case_sensitive = 0)
Alphabet constructor. It requires a name and a
(possibly empty) list of the normal (non-wildcard) chars in the
alphabet. The case of characters is ignored
unless case_sensitive is non-zero. This constructor is typically
used without any chars, as it does not allow the naming of characters.
Alphabet virtual ~Alphabet (void)
Alphabet void add_normal_char (const char c, const String& s_name = "", const String& l_name = "")
Alphabet. All normal characters must be added before any
wildcards. (There is no inherent reason for this restriction: it helps
ensure that everything a wildcard references is already in place.)
The short (s_name) and long (l_name) annotation strings
may be used to describe the new character.
Alphabet void add_alias (const String& newchar, const String& alias)
to_char(to_base(newchar)) will be equal to alias.
Alphabet void add_wild_card (char wildcard, const String& matches, const String& s_name = "", const String& l_name = "")
Alphabet void add_all_match (char wildcard, const String& s_name = "", const String& l_name = "")
Alphabet, except the
null character.
The short (s_name) and long (l_name) annotation strings
may be used to describe the new wildcard.
Alphabet void reset_name (const String& name);
Alphabet. Useful means of avoiding
name propagation in constructors.
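The ordering rule for building an alphabet (all normal characters first, then wildcards that refer to them) can be sketched with a small stand-alone class. MiniAlphabet is hypothetical and far simpler than the protected interface above; in particular, throwing std::logic_error on misuse is an assumption of this sketch, not documented library behavior.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical miniature of an alphabet under construction.
class MiniAlphabet {
public:
    // Normal characters receive consecutive codes starting at 0.
    void add_normal_char(char c) {
        if (!wild_.empty())
            throw std::logic_error("normal characters must precede wildcards");
        int code = static_cast<int>(normal_.size());
        normal_.insert(std::make_pair(c, code));
    }

    // A wildcard may only reference characters that are already defined.
    void add_wild_card(char w, const std::string& matches) {
        for (std::string::size_type i = 0; i < matches.size(); ++i)
            if (!normal_.count(matches[i]) && !wild_.count(matches[i]))
                throw std::logic_error("wildcard references unknown character");
        wild_[w] = matches;
    }

    int norm_length() const { return static_cast<int>(normal_.size()); }
    int wc_length() const   { return static_cast<int>(wild_.size()); }

private:
    std::map<char, int> normal_;         // character -> code
    std::map<char, std::string> wild_;   // wildcard -> characters it matches
};
```

Adding a normal character after a wildcard is rejected, which keeps everything a wildcard references in place before the wildcard itself is defined.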
Several descendents of the alphabet class are implemented as part of the
library. Currently, these include the basic nucleotide and amino acid
alphabets. Users are encouraged to call Alph::describe()
(see section Member Functions) for an up-to-date description of all available
alphabets.
The nucleic acid alphabets are all descendents of the minimal (most
general) NucleicAlphabet. In addition to the features of
Alphabet, this class includes an enumerated type defining the
symbols A = 0, G = 1, C = 2, TU = 3, and several functions. The
functions are currently not virtual, though as alphabets are refined,
they may become virtual.
NucleicAlphabet: Base complement (const Base base)
describe()
NucleicAlphabet: int same_group (const Base base1, const Base base2)
NucleicAlphabet int is_complement (const Base base1, const Base base2)
Return 1 or 0 depending on whether or not the two bases are in the same
group (pyrimidine or purine) or are Watson-Crick complements of each other.
See section Specific Alphabets. same_group will return false if either
or both bases are the null base. is_complement checks base1
against the complement of base2, and thus will return 1 if both bases are
null and 0 if exactly one base is null.
NucleicAlphabet: int is_pyrimidine (const Base base)
NucleicAlphabet int is_purine (const Base base)
Return 1 or 0 depending on whether or not base is a pyrimidine or
purine.
See section Specific Alphabets.
The RNAAlphabet class inherits from NucleicAlphabet, and
additionally defines the symbolic constant U=TU, and asserts the
virtual function is_rnucleic().
The DNAAlphabet class inherits from NucleicAlphabet, and
additionally defines the symbolic constant T=TU.
The ExtDNAAlphabet class inherits from DNAAlphabet, and
introduces a large number of wildcards defined as symbolic constants.
The virtual functions above have been defined on these wildcards. The
complement of a wildcard includes the complements of every base
that wildcard matches. A wildcard is_pyrimidine or
is_purine only if it matches exactly the two characters of that group (i.e.,
A, G, and R are purines, while C, T, and Y are pyrimidines). Two bases
are in the same_group if they are both purines or they are both
pyrimidines. For reference, the characters are: K=GT, W=AT, Y=CT, M=AC,
R=AG, S=GC, V=AGCRMS, B=GCTSKY, D=AGTRWK, H=ACTMWY, and the wildcards N
and X match all characters.
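The wildcard rules in this paragraph can be sketched with one bit per nucleotide. The bit assignments and free functions below are illustrative assumptions and do not reflect how ExtDNAAlphabet stores its tables.

```cpp
#include <cassert>

// One bit per nucleotide; a wildcard is the union of the bits it matches.
enum : unsigned { bA = 1, bG = 2, bC = 4, bT = 8 };

// Complement every matched nucleotide: A<->T and G<->C.
inline unsigned complement(unsigned m) {
    unsigned out = 0;
    if (m & bA) out |= bT;
    if (m & bT) out |= bA;
    if (m & bG) out |= bC;
    if (m & bC) out |= bG;
    return out;
}

// A base is a purine only if everything it matches is A or G, and a
// pyrimidine only if everything it matches is C or T. Null (0) is neither.
inline bool is_purine(unsigned m)     { return m != 0 && (m & ~(bA | bG)) == 0; }
inline bool is_pyrimidine(unsigned m) { return m != 0 && (m & ~(bC | bT)) == 0; }

inline bool same_group(unsigned x, unsigned y) {
    return (is_purine(x) && is_purine(y)) ||
           (is_pyrimidine(x) && is_pyrimidine(y));
}
```

Under this model the complement of R (AG) is Y (CT), K (GT) maps to M (AC), and S (GC) and W (AT) are their own complements.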
The AminoAlphabet class inherits from Alphabet, asserts the
is_amino() virtual function, and defines the standard
single-letter symbolic constants of the 20 amino acids, referenced as,
for example, AminoAlphabet::W.
The ExtAminoAlphabet class inherits from AminoAlphabet,
and adds treatment of three wildcards. They are included as symbolic
constants B=20,Z=21,X=22, where B matches N and D, Z matches Q
and E, and X matches any amino acid.
Alph Class
The Alph class is a wrapper for alphabets. It contains as static members each of the instantiated alphabets. These are:
Alph Nucleic
NucleicAlphabet Alph::Nucleic is the most general nucleic
alphabet.
Alph RNA
RNAAlphabet Alph::RNA.
Alph DNA
DNAAlphabet Alph::DNA.
Alph ExtDNA
ExtDNAAlphabet Alph::ExtDNA.
Alph AA
AminoAlphabet Alph::AA. The amino acids without wildcards.
Alph ExtAA
ExtAminoAlphabet Alph::ExtAA. The amino acids with the standard
wildcards B, Z, and X.
The alphabet member function Alphabet::id() can be used to get
an identifier for each of these alphabets.
Alph describe (ostream& output)
Alph: void silent_convert (int val = 1)
cerr whenever Alphabet::to_base
is passed an inconvertible character for which the null character is
returned. If val is zero, error messages are produced. The
default is to produce the conversion error messages.
Alph: void set_default (const Alphabet& default)
Sequence class
members without an Alphabet type. See section Sequence Constructors.
Alph: const Alphabet * ret_default (void)
Sequence class
members without an Alphabet type. Note that use of the default
is syntactically different from other alphabets:
Alph::ret_default()->id() rather than Alph::RNA.id().
Possibly, the standard alphabet names should be changed to function
calls and pointers, but the user should not be using these much anyway,
accessing them instead from sequence's alphabet functions.
See section Sequence Constructors.
Alph: const Alphabet* name_to_alphabet (const char *name)
Alphabet corresponding to the character
string name, or NULL.
Alph: int num_alphabets (void)
Alph: const Alphabet* alphabet (int num)
NULL if
num is out of range.
Sequences are dynamically sized arrays of Base, with
reference-counting semantics similar to GNU Strings, and special I/O
routines which interact with common genetic database formats.
See section The Base class, section `The String Class' in Libg++ User's Guide,
and section ASN Sequence Streams.
Sequence Sequence ()
Sequence Sequence (int sz)
Sequence Sequence (int sz, const String &nm=nilSTR);
Sequence Sequence (const String &data, const String &nm=nilSTR);
Sequence Sequence (const char *data, const String &nm=nilSTR);
The following constructors return a subsequence which points to part of the base sequence (see section Reference Counting).
Sequence Sequence (const Sequence&, int offset, int size)
Sequence: Sequence SubSequence (const Sequence&, int offset, int size)
foo (SubSequence (s1, 408, 12));
Storage in newly allocated Sequence variables is uninitialized. It can be initialized with the scalar assignment operators:
Sequence void operator= (Base base)
Bases in the Sequence to base.
Sequence void operator= (char basechar)
Bases in the Sequence to the Base
corresponding to basechar. The conversion alphabet must be set.
See section Setting The Sequence Conversion Alphabet.
SeqRep maintains the statically allocated information for Sequence representations: the alphabet, SeqLabel, and Sequence pointer, along with the reference counts. The instantiation, assignment and destruction of SeqReps is provided through Sequence constructors, assignment functions, and destructors.
Sequence is a reference counted class. Operations which assign one sequence variable to another do not normally do any copying, but instead cause the array part of one sequence to point to the other and increment a reference count.
The copy constructor is invoked whenever one sequence is initialized with another, either in declarations such as
Sequence s = base_seq;
or in pass-by-value:
void foo (Sequence s)
{
...
}
Sequence Sequence (Sequence& seq)
If copying is desired instead, this can be done with an explicit call to the copy function:
Sequence: Sequence copy (const Sequence&)
Sample usage:
Sequence s = copy (base_seq);
foo (copy (base_seq));
Similarly, the assignment operator causes referencing of its argument:
Sequence: Sequence& operator = (Sequence& sq)
If copying is desired instead, the copy method can be called:
Sequence void copy (Sequence& seq)
The copy method differs slightly from the copy function: the copy function always allocates a new sequence and copies into it, while the copy method, when called on an object of the same size, will copy into already allocated storage. See section Automatic Resizing of Sequences.
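The contrast between referencing and copying can be sketched with a small handle class. MiniSeq is a hypothetical stand-in that uses std::shared_ptr for the reference count; the real Sequence keeps its own counts in SeqRep, so this shows the semantics only, not the implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for Sequence: handles share one buffer.
class MiniSeq {
public:
    explicit MiniSeq(const std::string& s)
        : data_(std::make_shared<std::vector<char> >(s.begin(), s.end())) {}

    // Copy construction and assignment are compiler-generated: they copy
    // the shared_ptr, so both handles refer to the same storage.

    char& operator[](std::size_t i) {
        assert(i < data_->size());
        return (*data_)[i];
    }
    std::size_t size() const { return data_->size(); }
    long references() const  { return data_.use_count(); }

    // Free-function copy: always allocates fresh storage.
    friend MiniSeq copy(const MiniSeq& src) {
        return MiniSeq(std::string(src.data_->begin(), src.data_->end()));
    }

private:
    std::shared_ptr<std::vector<char> > data_;
};
```

Writing through a handle made with `MiniSeq b = a;` changes what `a` sees, while a handle made with `copy (a)` can be modified without affecting `a`.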
Because reference counting allows any of the sequence variables which refer to a piece of storage to change that storage, the copy constructor does not allow initialization of another Sequence variable with a const Sequence variable.
The desired effect can be obtained with an explicit call to the copy (const Sequence&) function, i.e.:
const Sequence A;
...
Sequence sq = copy (A);
Similarly, the assignment operator (operator =) only allows assignment from non-const Sequences, but one can assign from a copy of a const Sequence.
It would be desirable to allow const Sequence variables to be initialized with other const Sequence variables, but the language does not allow the specification of constructors which differentiate between const and non-const variables.
Since having const variables that can be changed by other parts of the code is an undesirable feature, it was judged better to simply not allow initializations of other Sequence variables with const Sequences.
Individual bases in a Sequence can be accessed using the usual subscripting operator:
Sequence: Base & operator [] (int index)
For const Sequences, the subscripting operator is read-only:
Sequence: Base operator [] (int index) const
Sequence rather than a reference.
The size of a Sequence variable can be obtained with the
size () method:
Sequence: int size () const
Sequence.
(For subsequences, will only reflect subsequence size)
For subsequences, it is sometimes useful to know the offset
into the base Sequence. This can be obtained with the offset ()
method:
Sequence: int offset () const
Sequence from the base Sequence.
(Will be zero unless one of the subsequence constructors
was called by a parent.)
Sequence: int & operator== (const Sequence & c) const
Sequence: int & operator!= (const Sequence & c) const
Each sequence has its own alphabet pointer (see section The Alphabet Classes). The
Alph::set_default() function should be called before performing
any I/O operations or String conversions.
The current alphabet can be obtained with the alphabet function:
Sequence: const Alphabet *const alphabet ()
Sequence: Sequence& operator= (const String &string)
Sequence: Sequence& operator= (const char *string)
It would be desirable to also allow initialization
of a Sequence from a String by defining a Sequence
(String &) constructor, as in:
Sequence sq = "ACGT";
However, because the order of construction in different compilation units is undefined, there is no way to ensure that the alphabet is set before the constructor is called.
Sequence: void scanFrom (istream &is)
scanFrom.
Sequence: void printOn (ostream &o)
printOn.
Currently only one format is implemented:
RAW_ASCII
; or #.
The copy () function and input functions operator >> ()
and scanFrom () all automatically resize the Sequence
variable they are storing into.
Since any extra references to the resized Sequence variable will
still refer to the old storage, an error message is generated
if the references are greater than 1.
Currently the resizing algorithm resizes the Sequence
to be exactly equal to the new size, by allocating new storage
and copying into it. This will result in some overhead if many
sequences with similar but not equal sizes are read or copied
into the same variable.
The following functions are intended for fully debugged library code. They allow indexing into a sequence without range-checking. This will allow user code to turn range-checking on and off without affecting the speed of the library code.
Sequence: const Base elem (int index) const
Sequence: Base& elem (int index)
Sequence: Base* data ()
Sequence: const Base* data () const
Sample usage:
int s = sq.size (); // cache size in a local variable
// the compiler probably isn't
// smart enough to figure out that it
// is a loop invariant
Base *ptr = sq.data ();
for (int i = 0; i < s; i++)
foo (ptr[i]);
For backward compatibility, Sequence defines the following conversion operators, which allow a Sequence to be passed to a function which expects an array of Bases:
Sequence.
To invoke the conversion operator, just put a Sequence
where an array of Base is expected:
void foo (Base ar[], ...);
Sequence sq;
foo (sq,...); // conversion operator is invoked
Treating a Sequence as an array of Base is not
recommended for new code, because it does not allow for bounds checking
of subscripts, or for use of any of the other functions defined on
Sequence variables.
SeqList is a dynamically sized array of Sequence. Since
each Sequence in a SeqList has its storage allocated
separately, it acts more
like a list of Sequences than a 2D array with respect to efficiency
of accessing columns. See section Sequence
SeqList SeqList ()
SeqList variable with no storage allocated.
SeqList SeqList (int size)
SeqList of size. Each cell in the array
is automatically initialized to a zero-length (null) Sequence with a call to
the Sequence () constructor.
SeqList: void clear (void)
If one wishes to initialize the contents of a SeqList variable to something
other than null Sequences, the Sequence& assignment operator can be
used:
SeqList sqlst; sqlst = "------------------------------------";
The operator makes size () copies of the right
hand side so that each SeqList cell will refer to different
storage.
SeqList: SeqList & operator+= (Sequence & seq);
Unlike Sequences, SeqList just has the regular copy
semantics, i.e., the copy constructor and the assignment
operator make copies of their arguments. If you want to
pass around little square pieces of an alignment rather than
just pieces of individual Sequences, use an Alignment
(see section `Alignment Class' in To Appear in the Ultimate Manual).
SeqList SeqList (const SeqList& slist)
SeqList: SeqList& operator= (SeqList& slist)
SeqList: Sequence& operator [] (int index)
With this definition, a SeqList acts like a ragged 2D array with respect to indexing operations. For example:
SeqList sqlst;
Sequence sq;
sq = "ACGT";            // assumes the alphabet has been set
sqlst[0] = sq;          // calls SeqList subscript operator
cout << sqlst[0][2];    // calls SeqList subscript operator,
                        // then Sequence subscript operator
SeqList: int size ()
SeqList: int total_size ()
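The ragged-array behavior and the two size queries above can be pictured with standard containers. The typedef and function below are hypothetical stand-ins, and reading total_size as the sum of the lengths of all rows is an assumption, since the entry above does not define it.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for SeqList: each row owns separately allocated
// storage, so rows may differ in length (a "ragged" 2D array).
typedef std::vector<std::string> MiniSeqList;

// Sum of the lengths of all rows (assumed meaning of a total size).
inline std::size_t total_size(const MiniSeqList& rows) {
    std::size_t total = 0;
    for (std::size_t i = 0; i < rows.size(); ++i)
        total += rows[i].size();
    return total;
}
```

Indexing a row and then a position within it, as in `rows[0][2]`, mirrors the chained subscript operators shown in the example above.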
The SeqList I/O format is governed by the Sequence I/O format variables.
The Bases are converted according to the Sequence alphabet
(see section Setting The Sequence Conversion Alphabet), and the label of each individual
sequence is input according to the format set with
Sequence::set_format () ,
or the default, which is RAW_ASCII. See section Sequence I/O Formats.
SeqList from the istream, according to the
current Sequence alphabet and format section Sequence I/O Formats.
Automatically resizes the SeqList variable section SeqList Automatic Resizing.
scanFrom.
printOn.
A SeqList is automatically resized on input, or on copying
another SeqList with either the copy constructor or the
assignment operator.
The ASN sequence stream classes support sequence (see section Sequence)
input and output
using the NCBI's ASN data format. The BaseStream class has the
kernel interface to the NCBI software, the AsnStream is designed
for reading and writing user files in ASN format, while the
EntrezStream stream class provides access to the compressed Entrez
database as distributed on CD-ROM.
BaseStream Class
BaseStream is the base class for reading sequences using stream
functions from ASN and Entrez genetic sequence databases. This class
should not be instantiated per se but instead provides functionality
common to the derived AsnStream (see section The AsnStream Class) and
EntrezStream (see section The EntrezStream Class) classes.
Public functions are available to check the state of the stream, to read a
given sequence (out of a sequence-set), to read the description of the
sequence-set, and to read the title of the current sequence within
the sequence-set. The protected functions (called internally only) are
used by the derived classes to set up to read (loading information from
the current sequence) and to initialize the ASN data structures.
See section Sequence
Use the constructor as follows:
BaseStream ()
BaseStream: BaseStream& operator>> (Sequence &seq)
AsnStream or EntrezStream must have
been positioned at a particular sequence before using this operator. The
operator also advances to the next sequence within the sequence set. If there
are no more sequences within this set, it will set the noseqsetbit
(see section Flags and States) and return without attempting to transfer any
more information.
BaseStream: char* ExtractDescr (void);
BaseStream: char* ExtractTitle (int whichSeq)
BaseStream: void SetupToRead ()
The base stream class includes the following status bits:
BaseStream badbit
BaseStream noseqsetbit
BaseStream badtypebit
The following methods can be used to examine and modify the
BaseStream status bits.
BaseStream clear (int value=0)
BaseStream: int bad (void)
BaseStream: int good (void)
BaseStream: int badtype (void)
BaseStream: int noseqset (void)
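The status bits and their query methods can be sketched as an ordinary bit field. MiniState is a hypothetical class: the numeric bit values, the set() helper, and the reading of good() as "no bit set" are all assumptions of this sketch and are not taken from the BaseStream source.

```cpp
#include <cassert>

// Hypothetical status word modelled on the three bits named above.
class MiniState {
public:
    enum { badbit = 1, noseqsetbit = 2, badtypebit = 4 };  // assumed values

    MiniState() : state_(0) {}

    void clear(int value = 0) { state_ = value; }  // overwrite all bits
    void set(int bits)        { state_ |= bits; }  // helper, not in the text

    int bad() const      { return (state_ & badbit) != 0; }
    int badtype() const  { return (state_ & badtypebit) != 0; }
    int noseqset() const { return (state_ & noseqsetbit) != 0; }
    int good() const     { return state_ == 0; }   // assumed: no bit set

private:
    int state_;
};
```

A read loop would typically keep extracting sequences until noseqset() becomes true, then move to the next sequence set and call clear().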
AsnStream Class
AsnStream is a class (derived from BaseStream; see
section The BaseStream Class) for finding and
reading sequences using stream functions from ASN genetic sequence databases.
(ASN-format databases typically have `.asn' or `.aso'
extensions and should not be confused with Entrez/CDROM databases.)
Supported functionality includes
opening, building or reading or writing an index, finding a list of file
positions that satisfies a query on sequence descriptions, going to a file
position and going to the next sequence. Functionality inherited from
BaseStream allows AsnStream to get the next subsequence and read an alphabetic
(ASCII) sequence into a Sequence class instance (see section Sequence). The
intended use for an
AsnStream will be to create private databases of information probably taken
from the Entrez databases. Writing material to an AsnStream is not currently
supported.
Use the constructor as follows:
AsnStream (char* filename, int mode)
AsnStream using filename. badbit is set if
this file cannot be opened. Parameter mode is currently
binary|input only. Note: mode flags as with the usual stream classes
are enumerated members of the AsnStream class.
AsnStream: void GoNextSeqSet (void)
Seek () should be called before beginning to read using the >> operator, or
at any time when noseqset () is true, which could also happen when reaching
the end of a certain sequence set.
AsnStream: void FindFilePos (String& searchString, long *& locs)
Seek() to go to whichever
location; at that point the data from the sequence-set found could be read using
the usual functions from BaseStream such as the >> operator. Note that
this function ends up allocating memory for locs; the caller is
responsible for freeing that memory with 'delete'.
AsnStream: void BuildIndex ()
FindFilePos() function but it is not otherwise
necessary. Ordinarily after calling BuildIndex () one would call
WriteIndex () to write it out.
AsnStream: int ReadIndex (char * filename)
AsnStream: int WriteIndex (char * filename)
BuildIndex() before using this function.
AsnStream: SeqEntryPtr read ()
GoNextSeqSet()
and by BuildIndex().
EntrezStream Class
EntrezStream is a class derived from BaseStream for reading sequences
using stream functions from the Entrez/CDROM genetic sequence database.
Supported functionality includes finding a sequence set ID by description and
loading a sequence set given the sequence set ID. Functionality inherited
from BaseStream allows reading sequence from sequence sets and getting the
title and description.
See section Sequence
Use the constructor as follows:
EntrezStream ()
EntrezStream. failbit is set if Entrez access initialization
fails for some reason.
EntrezStream: void FindUIDSet (char * searchString, DocUid * uidsFound, Int2 recordType, Int2 fieldType, Int2 * beginError = NULL, Int2 * endError = NULL);
The syntax uses
&, |, and -, respectively, as the intersection, union, and set subtraction operators. Terms are usually followed by a field qualifier like [AUTH] (indicating author name). When terms contain embedded spaces or special characters, they must be enclosed in double quotes ("). Parentheses are used to override the standard precedence in the usual way. The [*] field qualifier is used to say "give me the union of this term over all available fields."
Here are some examples:
"Kay LE" [AUTH] - "Forman-Kay JD" [AUTH]
carcinoma [MESH] | oncogene [WORD]
Ordinarily the field type should be specified as -1, since you'll be specifying the field in the search string. Valid field types in the search string are:
WORD
MESH
KYWD
AUTH
JOUR
ORGN
ACCN
GENE
PROT
ECNO
Valid data types are TYP_ML (MEDLINE references), TYP_AA
(amino acid sequences), and TYP_NT (nucleotide sequences).
These are defined in `accentr.h'.
The return values (coming back through DocUid * uidsFound)
give the number of items found as the first item; 0 for none and -1 in
the case of a nonfatal error. With a nonfatal parse error,
beginError and endError will be set to the beginning and
ending of the error in the search string.
Ordinarily, one would use LoadSequence() after this to load the
sequence set specified by a particular DocUid. Then the base
stream functionality can be used to get the sequence set title, sequence
description, and extract the actual ASCII sequence.
EntrezStream: void LoadSequence (DocUid id)
DocUid id
(returned from FindUIDSet).
This is necessary before beginning to read using the >> operator.
This function should be called if EntrezStream::noseqsetbit is set.
Here's how to set up your configuration file to access Entrez or ASN
databases with Sequence-Streams. See section The AsnStream Class and section The EntrezStream Class.
On Unix your configuration file would be called `.ncbirc', and must
be in the same directory as the program using EntrezStream and/or
AsnStream. It looks much like the following (which is valid as
of March 1995):
[NCBI]
ROOT=/projects/compbio/entrez
ASNLOAD=/projects/compbio/entrez/asnload
DATA=/projects/compbio/entrez
The only section is entitled NCBI. The ROOT entry refers
to the path to the root for all the Entrez CD-ROM data. ASNLOAD
refers to the location for ASN.1 parse files (`*.l00'). The
DATA entry refers to the location for the `cdromdat.val'
file, which contains conversion specifications for sequences to whatever
alphabet is required (the alphabet being IUPACNA for Sequence Streams). Don't
confuse that alphabet with the Alphabet class (see section The Alphabet Classes),
by the way; this is an alphabet translation internal to NCBI. For right
now, you can copy `.ncbirc' from
`ultimate/lib/proto.ncbirc', and that should work.
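A quick way to check that a `.ncbirc' file has the expected entries is to parse it outside the library. The sketch below is not part of Ultimate Parser or the NCBI toolkit: parse_config and the Section and Config typedefs are hypothetical names, and the code assumes only the simple [SECTION] and KEY=VALUE layout shown above, with no comments or quoting.

```cpp
#include <cassert>
#include <istream>
#include <map>
#include <sstream>
#include <string>

typedef std::map<std::string, std::string> Section;
typedef std::map<std::string, Section> Config;

// Parse "[SECTION]" headers and "KEY=VALUE" lines.
// Lines that match neither form are ignored.
Config parse_config(std::istream& in) {
    Config cfg;
    std::string line, current;
    while (std::getline(in, line)) {
        // Tolerate files written with CR/LF line endings.
        if (!line.empty() && line[line.size() - 1] == '\r')
            line.erase(line.size() - 1);
        if (line.size() >= 2 && line[0] == '[' && line[line.size() - 1] == ']') {
            current = line.substr(1, line.size() - 2);
            continue;
        }
        std::string::size_type eq = line.find('=');
        if (eq != std::string::npos && !current.empty())
            cfg[current][line.substr(0, eq)] = line.substr(eq + 1);
    }
    return cfg;
}
```

A caller could open the file with an ifstream, pass it to parse_config, and confirm that the NCBI section contains ROOT, ASNLOAD, and DATA before constructing an EntrezStream or AsnStream.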
AlphabetTuple and BaseTuple Classes
The class AlphabetTuple is for creating tuples (cartesian products)
of objects from class Alphabet (see section The Alphabet Classes).
Class AlphabetTuple is intended for use with short fixed length tuples.
The class BaseTuple provides for tuples of class Base (see section The Base class),
the elements of the cartesian product of alphabets.
AlphabetTuple Class
The following methods are available for AlphabetTuple:
AlphabetTuple: AlphabetTuple (const Alphabet *a0)
AlphabetTuple class that creates a singleton tuple of the alphabet pointed
at by a0.
AlphabetTuple: AlphabetTuple (const Alphabet *a0, const Alphabet *a1)
AlphabetTuple class that creates an ordered pair of alphabets, with the
first element of the tuple pointed at by a0, and the second element pointed at by a1.
AlphabetTuple: AlphabetTuple (const Alphabet *a0, const Alphabet *a1, const Alphabet *a2)
AlphabetTuple class that creates an ordered triple of alphabets, with the
alphabets of the tuple pointed at by a0, a1, and a2.
AlphabetTuple: AlphabetTuple (int i, const Alphabet **a)
AlphabetTuple class that creates a tuple consisting of i alphabets
pointed to by elements of the array a.
AlphabetTuple: AlphabetTuple (const AlphabetTuple &a)
AlphabetTuple class.
AlphabetTuple: ~AlphabetTuple (void)
AlphabetTuple class.
AlphabetTuple: int num_alphabets (void) const
AlphabetTuple: int num_normal (void) const
BaseTuples that are elements of the cartesian product of the
alphabets represented by the AlphabetTuple.
AlphabetTuple: const Alphabet * operator [] (int i) const
AlphabetTuple: int same_as (const AlphabetTuple *other) const
AlphabetTuple pointed at by other is identical elementwise to
this AlphabetTuple. Return 0 otherwise.
AlphabetTuple: int index (const BaseTuple & bt) const
BaseTuple bt when considered as an element
of the AlphabetTuple. This is useful for indexing a linear array of BaseTuples.
This is the inverse of the function unindex described below.
AlphabetTuple: BaseTuple *unindex (int index) const
BaseTuple corresponding to the integer index. This is the inverse of the function
index described above.
AlphabetTuple: void print_unindex (ostream &out, int index) const
BaseTuple corresponding to the integer index.
AlphabetTuple: void print_command (ostream &out) const
AlphabetTuple.
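The relationship between index and unindex can be illustrated with a mixed-radix encoding. The following is a minimal sketch, not the library's implementation: it assumes each alphabet contributes a fixed number of normal characters, represents a tuple as a vector of integer positions, and uses the hypothetical names tuple_num_normal, tuple_index, and tuple_unindex. The ordering actually used by AlphabetTuple may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in: sizes[k] is the number of normal characters
// in the k-th alphabet of the tuple.
int tuple_num_normal(const std::vector<int>& sizes) {
    int n = 1;
    for (std::size_t k = 0; k < sizes.size(); ++k) n *= sizes[k];
    return n;
}

// Map a tuple of per-alphabet positions to a single integer.
// The first element acts as the most significant digit.
int tuple_index(const std::vector<int>& sizes, const std::vector<int>& pos) {
    assert(sizes.size() == pos.size());
    int idx = 0;
    for (std::size_t k = 0; k < sizes.size(); ++k) {
        assert(pos[k] >= 0 && pos[k] < sizes[k]);
        idx = idx * sizes[k] + pos[k];
    }
    return idx;
}

// Inverse of tuple_index: recover the per-alphabet positions.
std::vector<int> tuple_unindex(const std::vector<int>& sizes, int idx) {
    assert(idx >= 0 && idx < tuple_num_normal(sizes));
    std::vector<int> pos(sizes.size());
    for (std::size_t k = sizes.size(); k-- > 0;) {
        pos[k] = idx % sizes[k];
        idx /= sizes[k];
    }
    return pos;
}
```

With this layout, a table holding one entry per tuple needs tuple_num_normal(sizes) slots, which is the role num_normal plays when sizing a linear array of BaseTuples.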
For additional input and output, the following functions are available:
AlphabetTuple: AlphabetTuple * read_AlphabetTuple (istream &in) const
AlphabetTuple and return it.
The commands are of the form
Alphabet= <alphabet_name>
AlphabetPair= <alphabet_name> <alphabet_name>
AlphabetTriple= <alphabet_name> <alphabet_name> <alphabet_name>
AlphabetTuple= <number> <alphabet_name> ... <alphabet_name>
as would be output by print_command.
If the first word is not recognized, it is looked up as an alphabet name,
as if preceded by "Alphabet=".
AlphabetTuple: ostream & operator<< (ostream &out, const AlphabetTuple &a)
Alphabets of the AlphabetTuple a to stream out.
BaseTuple Class
The following methods are available for class BaseTuple:
BaseTuple: BaseTuple (const AlphabetTuple &a)
BaseTuple class. Argument a is the
AlphabetTuple that the newly constructed BaseTuple is a
member of.
BaseTuple: ~BaseTuple ()
BaseTuple class.
BaseTuple: Base & operator [] (int i)
BaseTuple.
BaseTuple: const Base operator [] (int i) const
BaseTuple.
The following functions are for input and output of BaseTuples to streams:
BaseTuple: ostream & operator<< (ostream &out, const BaseTuple &bt)
Bases of the BaseTuple bt to stream out.
BaseTuple: istream & operator>> (istream &in, BaseTuple &bt)
Bases of a BaseTuple bt from stream in.
Prob, ShortProb, LargeReal, ShortLargeReal classes
All four classes are found in `Prob.h'. They are implemented with macros rather than templates.
The Prob class is designed to allow convenient manipulation of
probability values. It usually stores probabilities in logarithmic
form, but the exact implementation is hidden from client code. To allow
renormalization within the class, values greater than unity are allowed.
The ShortProb class interface is identical to Prob, but may use a
smaller internal representation (float instead of double). Casting
operators are provided for changing from one form to the other, though
the ShortProb inherently has less precision.
The ProbBase and ShortProbBase classes have all the
functionality of Prob and ShortProb (indeed, they are
inherited by the latter), except for the constructors. Users may want
to use these variants for large arrays in which constructor calls could
be a significant part of execution time. Corresponding
LargeRealBase classes are not available at this time.
Should you want arbitrary range or negative numbers, look at
LargeReal.
Prob and ShortProb
Probs may be declared in several ways:
Prob P;
ProbBase, and the value of a ProbBase
declared this way is undefined rather than 0.0.
Prob P(Prob::Zero);
Prob P(Prob::One);
Prob P(Prob::Invalid);
Prob P(0.53);
Prob P1; Prob P2(P1);
In general, Probs behave like real numbers. They may be manipulated with the standard arithmetic and relational operators:
+ - * / += -= *= /= < > <= >= == !=
Probabilities and ordinary numeric values may not be freely intermixed.
The conversion from a double is private, so code that mixes them
implicitly will not compile. To use a number with a Prob, the from_double cast
must be used:
P = (from_double)0.67; // create a prob of .67
P = Q * (from_double)0.5;
If you have a double whose value represents the natural log of a probability, and you want to store that in a Prob, you must use the cast (from_log):
P = (from_log) (-0.28768); // the value of ln(.75), note the explicit minus sign!
Alternatives to these casting operations are included
among the member functions. See section Member Functions.
There are several methods and functions unique to Probs.
Prob: double ret_epsilon (void)
Prob: int valid (void)
Prob: int non_prob (void)
Prob is not a true probability: when either
it is not valid(), or it represents a number that is greater than
1. A 0 return indicates that the Prob represents a number on the
closed unit interval.
Prob: int is_zero (void)
Prob: int is_one (void)
Prob: double ret_double (void)
Prob: Prob& set_double (double dval)
Prob: double ret_log (void)
Prob: Prob& set_log (double logval)
Prob: Prob& set_const (ConstProb c)
Prob::Zero or Prob::One.
Prob: Prob operator~ (void)
Prob: double entropy (void)
- P * ln(P).
The return value is positive.
Prob: int approxEqual (Prob const & a, Prob const & b)
Prob: int approxEqual (from_double const & a, Prob const & b)
Prob: int approxEqual (Prob const & a, from_double const & b)
a and b are less than 1E-7
apart. This constant is the same for Prob and ShortProb,
about right for the latter, and likely too large for some applications
using the former. Casts to (from_double) are required to avoid
ambiguity with other approxEqual functions.
Prob: double log2 (Prob const & a)
Prob.
Prob: double log (Prob const & a)
Prob.
Prob: Prob pow (Prob const & base, double exponent)
Prob: Prob pow (Prob const & base, int exponent)
Prob, exponent cannot
be negative. If you give it a negative exponent, the return value
is an invalid Prob; valid() will be false.
Prob: double pow_double (Prob const & base, double exponent)
pow, but returns a double, so exponent can
be any value. Since C++ cannot overload on return type alone, this
function has to have a different name.
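Because a Prob is usually stored as a logarithm, several of the functions above reduce to simple arithmetic on the stored value. The following sketch assumes a natural-log representation held in a plain double. The names log_pow, entropy_term, and log2_of are hypothetical, not library functions, and the special cases for zero and invalid values are omitted.

```cpp
#include <cassert>
#include <cmath>

// In all three functions, logp is ln(P) for a probability P in (0, 1].

// ln(P^e): raising to a power is a multiplication in the log domain.
double log_pow(double logp, double exponent) {
    return exponent * logp;
}

// The entropy term -P * ln(P), which is non-negative for P in (0, 1].
double entropy_term(double logp) {
    return -std::exp(logp) * logp;
}

// log base 2 of P, obtained by rescaling the natural log.
double log2_of(double logp) {
    return logp / std::log(2.0);
}
```

A negative exponent passed to log_pow yields a positive log value, that is, a result greater than one. This is consistent with the text's statement that the Prob-returning pow rejects negative exponents while pow_double accepts any value.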
LargeReal and ShortLargeReal
The LargeReal and ShortLargeReal classes implement signed
numbers using Prob (or ShortProb) as the hidden internal
representation.
LargeReals may be declared in several ways:
LargeReal R;
LargeReal R(LargeRealSign::Pos);
LargeReal R(LargeRealSign::Neg, 10);
LargeReal R(-1, 10);
LargeReal R(-123.8);
LargeReal R(aProbClassThing);
LargeReal R(LargeReal::Zero);
LargeReal R(LargeReal::One);
LargeReal R(LargeReal::Unity);
LargeReal R(LargeReal::Invalid);
In general, LargeReals behave like signed real numbers. The
standard arithmetic and relational operators are defined.
As with Probs, there are several unique functions.
LargeReal: int is_zero (void)
Prob::is_zero()).
LargeReal: int valid (void)
Prob::valid()).
LargeReal: Prob ret_mag (void)
LargeReal: int is_nonneg (void)
LargeReal: int sgn (void)
LargeReal: int is_positive (void)
LargeReal: int bigger (LargeReal a, LargeReal b)
LargeReal: int approxEqual (LargeReal const & a, LargeReal const & b)
LargeReal: int approxEqual (double const & a, LargeReal const & b)
LargeReal: int approxEqual (LargeReal const & a, double const & b)
Prob::approxEqual(), or when the signs are
different and both individually are approximately equal to zero. Return
0 otherwise. Note that this function is not uniform about 0.
LargeReal: LargeReal pow (LargeReal base, double exponent)
LargeReal: LargeReal log (LargeReal base, double exponent)
The two print functions are print and rawprint.
The Prob and LargeReal implementations depend on IEEE754
32- and 64-bit floating point numbers. Since not all compilers or
systems define the same constants in their header files, the Prob
class relies on none of these, instead defining its own constants based
on the assumed standard format of single- and double-precision floating
point numbers.
One goal of the class is to provide the same range for Prob and
its single-precision version ShortProb to ensure error-free
conversion between the two formats. Thus, the 64-bit Prob class
only has more precision than the ShortProb class, not more range.
In the IEEE754 standard, double-precision numbers range to about 1.8E308,
while single-precision numbers range to about 3.4E38. For this reason, zero has
been chosen to be exp(-2e35) (Probs use the natural
logarithm), while the sentinel Prob::Invalid value is -2E37, and
any Prob whose log value is smaller than -1E37 is considered invalid. If the range
between the zero probability and the invalid probability were further
spread, it would be possible to semi-safely perform multiplication
(addition of the underlying log values) without checking for zero.
However, as they are defined here, the check for zero must occur, or
adding together the log values of one thousand zeros would result in a ShortProb
becoming invalid. If the infinity and NaN (Not A Number) checks were
guaranteed to be done in hardware, rather than software, relying on
their IEEE definitions would be another means of speeding the code.
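The need for the zero check can be seen in a small single-precision experiment. This is a sketch under stated assumptions: the two constants are the values quoted in the text, but the function names (is_valid, mul_unchecked, mul_checked, product_of_zeros) are hypothetical and the handling of invalid operands is omitted.

```cpp
#include <cassert>

// Constants quoted in the text; everything else here is illustrative.
const float kShortLogZero = -2e35f;   // log representation of zero
const float kInvalidBelow = -1e37f;   // log values below this are invalid

bool is_valid(float logp) { return logp >= kInvalidBelow; }

// Multiply by adding log values, with no special case for zero.
float mul_unchecked(float a, float b) { return a + b; }

// Multiply, but keep zero pinned at its sentinel value.
float mul_checked(float a, float b) {
    if (a <= kShortLogZero || b <= kShortLogZero) return kShortLogZero;
    return a + b;
}

// Multiply n zeros together using the given operation.
float product_of_zeros(int n, float (*mul)(float, float)) {
    float acc = kShortLogZero;
    for (int i = 1; i < n; ++i) acc = mul(acc, kShortLogZero);
    return acc;
}
```

With the unchecked version, the accumulated log value of a product of one thousand zeros drifts far below the invalid threshold, while the checked version stays at the zero sentinel.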
64-bit IEEE floating-point numbers are used for comparison and
operations which require exponentiation. The smallest IEEE denormal is
about exp(-744), or 1E-323.306. Not trusting denormals (for the
same reasons as not relying on infinity and NaN), however, provides a
smallest number of about exp(-708), or 1E-307.65. Two
probabilities are approxEqual if they differ by less than 1E-307
(ret_epsilon()), or if the difference in their log
representations is less than 708. Also, in the conversion to a log, any
double that is smaller than epsilon is converted to zero to prevent
underflow in the call to log(). In the pow_double(a,r)
routines, epsilon is similarly used to avoid calling exp() with
too small a number.
The addition of two Probs is another interesting problem. Here,
the numbers must be exponentiated, summed, and then reduced again to the
log domain. To reduce this operation to only one exponentiation,
the difference between the two log values is exponentiated.
Suppose P>Q, and both are log values. Then, we want to calculate:
log(exp(P)+exp(Q)), or
log(exp(P)*(1.0+exp(Q)/exp(P))), or
P+log(1.0+exp(Q-P)).
When the 64-bit representation is being used, each log value has 54 bits
of precision. Thus, if exp(Q-P) is less than 2**-54, the quantity
added to P will be zero. This motivates the private
_maxDiff() value of -log(2**-54), or 37.4. If the log
values of two Probs differ by more than this, addition of the
Probs simply results in the return of the larger value. For
ShortProbs, the threshold is -log(2**-24), or 16.6. This
is the only constant that varies between the Prob and
ShortProb classes.
Addition is further sped by a check for zero: if one argument is zero, the other argument is returned.
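The addition scheme described above can be sketched on plain doubles holding natural-log values. This is an illustration, not the Prob source: log_add, kMaxDiff, and kLogZero are hypothetical names, and the constants are simply the values quoted in the text for the 64-bit case.

```cpp
#include <cassert>
#include <cmath>

// Cutoff quoted in the text for the 64-bit case: 54 * ln(2), about 37.4.
const double kMaxDiff = 54.0 * std::log(2.0);

// Log representation of zero quoted in the text.
const double kLogZero = -2e35;

// Return log(exp(p) + exp(q)) using a single exponentiation.
double log_add(double p, double q) {
    if (p <= kLogZero) return q;       // zero + q is q
    if (q <= kLogZero) return p;       // p + zero is p
    if (p < q) {                       // make p the larger log value
        double t = p;
        p = q;
        q = t;
    }
    double diff = p - q;
    if (diff > kMaxDiff) return p;     // q is too small to change p
    return p + std::log(1.0 + std::exp(-diff));
}
```

For example, log_add(log(0.25), log(0.5)) is log(0.75) up to rounding. Only the difference of the two log values is ever exponentiated, so the intermediate result cannot overflow.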
Hist is the basic histogram class. It has functions to add and
delete samples, and get sample statistics such as the mean and variance.
SmoothHist is a virtual derived class. It defines all of the
functions in Hist, plus smoothing-related functions.
The derived classes of SmoothHist implement specific kernels.
Currently, RectHist, NormalHist, and LogNormalHist
are implemented.
Hist Hist (int low, int high)
Hist with bins from low to high. Bins
on the end are infinitely wide: all instances less than or equal to
low are put in low, and all instances greater than or equal to
high are put in high.
RectHist RectHist (int low, int high)
SmoothHist which uses a rectangular kernel.
NormalHist NormalHist (int low, int high)
SmoothHist which uses a normal kernel.
LogNormalHist LogNormalHist (int low, int high)
SmoothHist which uses a log normal kernel.
Because of systematic bias, this kernel is not recommended.
Hist add_instance (int instance)
Hist operator+= (int instance)
Hist sub_instance (int instance)
Hist operator-= (int instance)
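The treatment of the end bins can be pictured with a small raw-count histogram. This is a minimal sketch, not the Hist class itself: SimpleHist and its members are hypothetical, and it covers only adding and removing instances, with values outside the range clamped to the nearest end bin as described for the constructor.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the raw-count part of a histogram.
class SimpleHist {
public:
    SimpleHist(int low, int high)
        : low_(low), counts_(high - low + 1, 0), total_(0) {
        assert(low <= high);
    }
    void add_instance(int x) {
        ++counts_[bin(x)];
        ++total_;
    }
    void sub_instance(int x) {
        assert(counts_[bin(x)] > 0);   // cannot remove what was never added
        --counts_[bin(x)];
        --total_;
    }
    int count(int x) const { return counts_[bin(x)]; }
    int total() const { return total_; }

private:
    // Values outside [low, high] fall into the nearest end bin.
    int bin(int x) const {
        int high = low_ + static_cast<int>(counts_.size()) - 1;
        if (x < low_) x = low_;
        if (x > high) x = high;
        return x - low_;
    }
    int low_;
    std::vector<int> counts_;
    int total_;
};
```

With SimpleHist h(0, 4), adding the instances -10 and 0 both increments the bin for 0, and adding 99 increments the bin for 4.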
In all classes, the kernel is initially defined to be the identity function. (Note: this should probably be changed to some suitable fraction of the range.) Setting the kernel width will cause the kernel to be recalculated. You can either set the kernel explicitly, or the class will calculate a kernel width based on the variance of the data and the number of samples. The class will print an error message if the kernel width is set to zero.
Hist set_kernel_width (double sigma)
Hist set_kernel_width (void)
Hist smoothing (DYNAMIC)
Hist smoothing (LAZY)
prob(), count(),
or countVec() methods.
Hist smoothing (USER)
smooth() function.
For smoothed classes, the count information will not reflect
any updates to either the kernel or the counts
until smoothing is called explicitly with the smooth() method.
Hist smooth (void)
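As an illustration of what smoothing with a rectangular kernel can look like, the sketch below spreads each raw count evenly over neighboring bins. It is one possible scheme under stated assumptions, not the RectHist implementation: rect_smooth is a hypothetical function, the kernel width is given as an integer half-width in bins, and counts near the ends are clipped so that the total is preserved.

```cpp
#include <cassert>
#include <vector>

// Spread each raw count uniformly over the bins within half_width of
// its own bin. Bins past either end are dropped from the window, so
// the sum of the smoothed counts equals the sum of the raw counts.
std::vector<double> rect_smooth(const std::vector<double>& raw, int half_width) {
    assert(half_width >= 0);
    const int n = static_cast<int>(raw.size());
    std::vector<double> out(raw.size(), 0.0);
    for (int i = 0; i < n; ++i) {
        int lo = (i - half_width < 0) ? 0 : i - half_width;
        int hi = (i + half_width >= n) ? n - 1 : i + half_width;
        double share = raw[i] / (hi - lo + 1);
        for (int j = lo; j <= hi; ++j) out[j] += share;
    }
    return out;
}
```

A half-width of zero leaves the counts unchanged, which corresponds to the identity kernel that the classes start with.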
The following refer to smoothed counts if the class defines a kernel, otherwise they refer to raw counts.
Hist: int operator () (int bin) const
Hist: float counts (int bin) const
Hist: float prob (int bin) const
The entire count Vector can be retrieved as follows:
Hist: const floatVec countVec (int) const
Hist: float mean (void) const
Hist: float variance (void) const
For smoothed classes, the unsmoothed information can be retrieved by
appending raw_ to the previous functions:
Hist: float raw_counts (int bin) const
Hist: float raw_prob (int bin) const
Hist: const floatVec raw_countVec (int) const
Hist: virtual void printOn (ostream& ostr)
SmoothHist to print extra information.
Hist: virtual void dumpOn (ostream& ostr)
printOn().
Hist: virtual void plotOn (ostream& ostr)
Hist, smoothed counts for SmoothHist).
Hist: ostream& operator<< (ostream& ostr, Hist& histogram)
printOn().
Hist: virtual void scanFrom (istream& istr )
Hist, raw
counts for SmoothHist.
Hist: istream& operator>> (istream& istr, Hist& histogram)
scanFrom().
Hist: virtual void storeOn (ostream& ostr)
Hist: virtual void restoreFrom (istream& istrm)
storeOn.
Hist: static const char* classID (void)
Hist: virtual const char* type (void) const
Hist h;
if (h.type() == RectHist::classID()) { /* do something */ }
For input the following functions provide useful, uniform ways for dealing with comments in input files.
The following classes are useful for keeping track of objects that need to be named, with classes for hash tables of objects and command scripts.
ClassNameRegistry
Note that ClassNameRegistry is an obsolete feature
and should be replaced by
using NamedClass (see section NamedClass Class).
The class ClassNameRegistry is used for registering names of classes
in a hash table. Registering a class is intended to facilitate I/O and
most importantly, to enable a program to create an object of certain type
without knowing at compile time what it is. For example, a program could
create an object based on data in a text file by using the
ClassNameRegistry and the name the object is registered under.
A class should normally be registered using the following macro:
char * id, create_name, _init_name1, __init__name2)
char * string of the class name. create_name, _init_name1,
__init__name2 should be identifiers that are unique within the scope
of the file in which the call to RegisterClass is made.
create_name will be used as the name of a function to create a new
object of type name. _init_name1 and __init__name2 will
be used for structures necessary for automatic initialization and entry of
name into the ClassNameRegistry. These last two identifiers should
never need to be used by the programmer again. If a class has already been
registered under the same identifier, then the error is reported and the
program aborted.
A class needs to have a void constructor to be registered. In addition, a class must have the following 3 members defined before it can be correctly registered:
static const char * ID. RegisterClass sets this member to point
to its id parameter.
static const char * classID (void) { return ID; }.
This static member function should just return the value of the static member
ID.
virtual const char * type (void) { return ID; }.
This virtual member function should also just return the value of the static
member ID.
To see if a class has been set up correctly, the value of the virtual function
type should equal the value of the static function classID.
For example, suppose we have a class cl which we have registered under
"cl", and that x is an instance of class cl. Then the expression
x.type() == cl::classID()
should be true.
When a class has been properly registered, the value returned by the
static function classID should be the same as the value returned by
ClassNameRegistry::ID. In the previous example, the expression
ClassNameRegistry::ID("cl") == cl::classID()
should be true. These two expressions can be used as consistency checks when registering a class.
These methods are also defined:
ClassNameRegistry: static void add_class (const char * id, const CreateFcn creator)
ClassNameRegistry using the RegisterClass macro
described above.
ClassNameRegistry: static const char * ID (const char * id)
ClassNameRegistry: static void * create (const char * id)
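The idea behind creating an object from a registered name can be sketched with a table of creator functions. This is not the ClassNameRegistry source: MiniRegistry, Widget, and create_widget are hypothetical, a std::map stands in for the hash table, and a duplicate registration returns false here instead of aborting the program.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

typedef void* (*CreateFcn)();

// Hypothetical minimal registry: maps a class name to a creator function.
class MiniRegistry {
public:
    // Returns false if the name was already registered.
    static bool add_class(const std::string& id, CreateFcn creator) {
        return table().insert(std::make_pair(id, creator)).second;
    }
    // Returns a newly created object, or 0 if the name is unknown.
    static void* create(const std::string& id) {
        std::map<std::string, CreateFcn>::const_iterator it = table().find(id);
        return it == table().end() ? 0 : it->second();
    }

private:
    // Built on first use, so registration from static initializers is safe.
    static std::map<std::string, CreateFcn>& table() {
        static std::map<std::string, CreateFcn> t;
        return t;
    }
};

// An example class with the void constructor that registration requires.
struct Widget {
    int value;
    Widget() : value(7) {}
};

static void* create_widget() { return new Widget; }
```

After MiniRegistry::add_class("Widget", create_widget), a program that reads the string "Widget" from a text file can call MiniRegistry::create on it and cast the result, without naming the type at the call site.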
NamedObject Class
The class NamedObject should be used as a base class for objects that
have names. In addition to having a name, a NamedObject can also
be provided with a help string containing useful information about the
particular NamedObject. If a NamedObject is not provided with
a help string, then a default help string is provided which says
"No help available.".
Class NamedObject can be used in conjunction with class NameToPtr
(see section NameToPtr Class) when a lookup table of objects is needed.
The following methods are available for NamedObject
NamedObject: NamedObject (void)
NamedObject class.
NamedObject: NamedObject (const char * nm, const char * help)
NamedObject class. Argument nm is a char *
string containing the name of the NamedObject and help is a
string containing help information. The default value of help is
0. Note that the instance of NamedObject constructed owns a copy
of the string nm. However, the instance does not own a private copy
of the help string help.
NamedObject: ~NamedObject (void)
NamedObject class.
NamedObject: const char * name (void) const
char * string containing the name of the NamedObject.
NamedObject: void set_name (const char * x)
NamedObject to the name pointed to by x.
Note that the NamedObject owns a copy of the string x.
NamedObject: const char * help (void) const
char * string containing the help information for the
NamedObject. If the NamedObject has no help string, then a string
reading "No help available." is returned instead.
NamedObject: void set_help (const char * h)
NamedObject to the string pointed to by
h.
NamedObject: void read_name (istream & in)
Name = in the input stream.
NamedObject: void write_name (ostream & out) const
Name = and terminated with a newline.
IdObject Class
The class IdObject is the type used for giving
unique IDs to classes.
It is derived from class NamedObject (see section NamedObject Class).
In addition, it is needed for maintaining the `is_a' hierarchy. The `is_a'
hierarchy allows querying of an object to see what classes it is derived
from. Note that class IdObject is used only for these purposes of
registering classes derived from NamedClass (see section NamedClass Class).
The two types NamedClassFunction
and IdObjectFunction are defined in
IdObject. The type NamedClassFunction is a pointer to a function
of zero arguments returning a pointer to NamedClass
(see section NamedClass Class).
Type
IdObjectFunction is a function taking a pointer to an
IdObject as an argument with no return value.
They are declared as follows:
typedef NamedClass * (*NamedClassFunction) (void);
typedef void IdObjectFunction(IdObject *);
The following methods are available:
IdObject: IdObject (const char * name, NamedClassFunction createfn, IdObjectFunction *init_is_a, const char * help)
IdObject. The argument name is a string containing the name to be used in looking up the object. The argument createfn is
a function which should allocate a new object of the type associated with
name. The argument init_is_a is a pointer to a function which initializes
the `is_a' hierarchy for the type. Finally, help is a help string describing the class. The arguments createfn, init_is_a, and
help, all have a default value of 0.
IdObject: ~IdObject (void)
IdObject class.
IdObject: static IdObject * id (const char * name)
IdObject associated with name. If no object is
associated with name, then returns 0. The lookup of name
is not case sensitive.
IdObject: NamedClass * create (void) const
IdObject if there is such a type. Returns
0 otherwise.
IdObject: int is_a (IdObject * x)
IdObject is an x. Otherwise, returns 0.
IdObject: void add_is_a (IdObject * x)
IdObject.
IdObject: static void apply_all (IdObjectFunction * fnc)
IdObject
created.
IdObject: static const NameToPtr * lookup_table (void)
NameToPtr * that points to the lookup
table containing all objects of type IdObject.
NamedClass Class
The class NamedClass should be used as a base class for classes that
need to have run-time type determination. Run-time type determination is
convenient for I/O and is essential for programs that need to create objects
whose exact type may not be known at compile-time. Class NamedClass
also provides an `is_a' mechanism for expressing class relationships.
The class IdObject (see section IdObject Class) is used for representing
this hierarchy as well
as for providing unique IDs for classes.
The following members and functions are available for use by all derived classes:
NamedClass: int is_a (IdObject * x)
NamedClass: void write (ostream & out)
NamedClass: static NamedClass * read_new (istream & in)
NamedClass
and returns a pointer to a newly allocated object created based on the
input. Returns a 0 if there is an error during input.
NamedClass: ostream & operator << (ostream &out, const NamedClass nc)
NamedClass object nc on stream out.
Deriving a class from NamedClass
In order to work properly, a class derived from NamedClass must
have certain members in support of identification, I/O,
and the `is_a' hierarchy.
All classes derived from NamedClass should have the following
private member, which provides the unique IdObject for the class:
static IdObject ID;
All classes derived from NamedClass should have the following
public member functions:
NamedClass: static IdObject* classID (void)
ID described above.
NamedClass: virtual IdObject* type (void) const
ID described above.
Note that this function is a virtual function and is intended for dynamic
type determination of an object.
In addition, a class derived from NamedClass may need the
following member functions:
NamedClass: static void init_is_a (IdObject * self)
ID member of the derived class.
When a class derived from NamedClass does need the `is_a' hierarchy
and hence this member function, a pointer to this member function should be
passed as an argument on construction of the derived class's ID member.
NamedClass: virtual int read_knowing_type (istream & in)
NamedClass
(see section NamedClass commands).
Returns 1 if it has read the EndClassName =
command that terminates the text of a NamedClass
command script. Otherwise, returns 0.
Note that the static member function NamedClass::read_new(),
which calls this method for input, will read the
opening ClassName = command, and will also try to read the
EndClassName = command if this method returns a 0.
NamedClass: virtual void write_knowing_type (ostream & out)
NamedClass::write() for output. NamedClass::write() will
output the bracketing ClassName = and EndClassName =, while this
method is responsible for everything in between.
Finally, if objects of the derived class will need to be created in situations where the exact type may not be known at compile time, such as for input and output, the derived class will need a void constructor.
Consider the following virtual class virt derived from
NamedClass. Its declaration in a header file would be
class virt: public NamedClass
{
  private:
    static IdObject ID;
  public:
    static IdObject *classID(void) {return &virt::ID;}
    virtual IdObject *type(void) const {return &virt::ID;}
    // ... and whatever is needed for the class itself.
};
In its corresponding .cc file, we would need the initialization
IdObject virt::ID ("virt",0,0,"a demo pure virtual class");
Note that since this is a pure virtual class, the second and third arguments
to the constructor of ID are both 0.
If we derive a class deriv from virt, the header file could
look like
class deriv: public virt
{
  private:
    static IdObject ID;
    static void init_is_a(IdObject *self)
    { self->add_is_a(virt::classID());
    }
    int i;
    virtual int read_knowing_type(istream &in)
    { in >> i;
      return 0;
    }
    virtual void write_knowing_type(ostream &out) const
    {out << i << "\n";}
  public:
    static IdObject *classID(void) {return &deriv::ID;}
    virtual IdObject *type(void) const {return &deriv::ID;}
    deriv(void) {i=0;}
    // ... and whatever is needed for the class itself.
};
To initialize deriv the .cc file would need to have
static NamedClass *create_deriv(void) {return new deriv;}
IdObject deriv::ID ("deriv", create_deriv, deriv::init_is_a,
"a demo derived class\n");
Note that since deriv is derived from class virt, deriv
has a static member function init_is_a. Also, the static function
create_deriv is needed so that an object of class deriv can
be created dynamically.
NamedClass commands
The class NamedClass supports a script style format for the input
and output of NamedClass objects.
All commands should have the form
CommandName = arg1 arg2 arg3 . . .
That is, a command consists of the command name, followed by an `=', followed by any arguments of the command.
Since NamedClass is meant to be the basic root class from which to
derive other classes that need to be named, it supports just two
rudimentary commands, one for beginning the description of an object and another for
ending the description.
NamedClass command: ClassName = name_of_the_class
NamedClass command: EndClassName = name_of_the_class
ID member (see section Deriving a class from NamedClass).
An error is signaled if the class name arguments
for matching ClassName = and EndClassName = commands are
not the same.
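The bracketing rule can be sketched as a small checker over a command script. This is an illustration only, not the NamedClass reader: read_command and check_bracketing are hypothetical names, and the sketch assumes that every command between the brackets has the form Name = arg with exactly one whitespace-delimited argument.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Read one "CommandName = arg" triple. Returns false at end of input
// or if the middle token is not '='. Only a single argument is read.
bool read_command(std::istream& in, std::string& name, std::string& arg) {
    std::string eq;
    if (!(in >> name >> eq >> arg)) return false;
    return eq == "=";
}

// Check that a script opens with "ClassName = X" and that the first
// "EndClassName =" command names the same class X. Commands in
// between are read and skipped.
bool check_bracketing(std::istream& in, std::string& class_name) {
    std::string name, arg;
    if (!read_command(in, name, arg) || name != "ClassName") return false;
    class_name = arg;
    while (read_command(in, name, arg)) {
        if (name == "EndClassName") return arg == class_name;
    }
    return false;  // ran out of input before EndClassName
}
```

A script of the form "ClassName = deriv", some commands, then "EndClassName = deriv" passes the check, while one that ends with "EndClassName = virt" does not.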
NameToPtr Class
Class NameToPtr provides a hash table for mapping
the name of a thing to a pointer to a thing.
The things must be
derived from the class NamedObject (see section NamedObject Class).
The case sensitivity of lookups in the hash table can be controlled using
the method ignore_case described below. By default, lookups are
case-sensitive. The setting of ignore_case affects only lookups.
It does not matter what its setting is at the time a name is added to the
table.
For flexibility in lookups, insertions or deletions, special control
flags are allowed to handle situations in which a name may or may not
be already entered into a table. These are the ifold and ifnew
flags.
An ifnew flag is used to handle not finding a name in a table.
Allowable values for this flag are:
ZeroIfNew. Return 0, so that the calling program knows the name is new.
ErrorIfNew. Report an error and abort. The calling program expected
the name to already be in the table.
CreateIfNew. If the name is not already there, create a new object with the
name and put it there. The CreateIfNew option just adds the name to
the hash table with a zero pointer, since it doesn't know what type of object
to create.
An ifold flag is used to handle finding a name that is
already in the table. Allowable values for this flag are:
OKIfOld. Just return a pointer to the old object.
ErrorIfOld. There should not already be something with the same name in the
hash table. Report an error and abort.
ReplaceIfOld. Replace the old pointer with a new one. Note that the old
object is not destroyed.
The following methods are for lookups, insertions, and deletions:
NameToPtr: NamedObject * FindOldName (const char * name, OptionIfNew ifnew)
ErrorIfNew, then an error is reported
and the function aborts. If ifnew is ZeroIfNew, then a
0 pointer is returned. A value of CreateIfNew for ifnew
is not allowed. The default value of ifnew is ErrorIfNew.
NameToPtr: void AddName (NamedObject * object, OptionIfOld ifold)
ErrorIfOld then an error is reported and the function aborted.
If ifold is not ErrorIfOld, then object is always
added to the table, replacing any value previously having the same
name as object.
The default value of ifold is ReplaceIfOld.
NameToPtr: void DeleteName (const char * name, OptionIfNew ifnew)
ErrorIfNew,
then an error is reported
and the function aborted.
If name is not in the table and ifnew has value ZeroIfNew,
then the call to this method has no effect on the hash table.
A value of CreateIfNew for ifnew is not
allowed. The default value of ifnew is ErrorIfNew.
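The interplay of the ifnew and ifold flags with the three methods above can be sketched with a small standalone table. This is an illustration under stated assumptions, not the library's code: NamedThing and NameTable are hypothetical stand-ins for NamedObject and NameToPtr, a std::map replaces the hash table, exceptions replace "report an error and abort", and case-insensitive lookup is left out.

```cpp
#include <map>
#include <stdexcept>
#include <string>

enum OptionIfNew { ZeroIfNew, ErrorIfNew, CreateIfNew };
enum OptionIfOld { OKIfOld, ErrorIfOld, ReplaceIfOld };

// Hypothetical stand-in for NamedObject: anything that carries a name.
struct NamedThing {
    std::string name;
    explicit NamedThing(const std::string &n) : name(n) {}
};

// Simplified table following the flag semantics described above.
class NameTable {
    std::map<std::string, NamedThing *> table_;
public:
    NamedThing *FindOldName(const std::string &name,
                            OptionIfNew ifnew = ErrorIfNew) const {
        if (ifnew == CreateIfNew)
            throw std::invalid_argument("CreateIfNew is not allowed here");
        std::map<std::string, NamedThing *>::const_iterator it = table_.find(name);
        if (it != table_.end()) return it->second;
        if (ifnew == ErrorIfNew)
            throw std::runtime_error("name not found: " + name);
        return 0;  // ZeroIfNew: tell the caller the name is new
    }
    void AddName(NamedThing *object, OptionIfOld ifold = ReplaceIfOld) {
        if (ifold == ErrorIfOld && table_.count(object->name))
            throw std::runtime_error("name already present: " + object->name);
        table_[object->name] = object;  // an old object, if any, is not destroyed
    }
    void DeleteName(const std::string &name, OptionIfNew ifnew = ErrorIfNew) {
        if (ifnew == CreateIfNew)
            throw std::invalid_argument("CreateIfNew is not allowed here");
        if (table_.erase(name) == 0 && ifnew == ErrorIfNew)
            throw std::runtime_error("name not found: " + name);
    }
    int RetNumNames() const { return static_cast<int>(table_.size()); }
};
```

For example, FindOldName("x", ZeroIfNew) lets a caller probe for a name without treating absence as an error, while the default ErrorIfNew suits code that expects the name to exist.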
These methods are also available:
NameToPtr: void ignore_case (int ignore)
NameToPtr: NameToPtr NameToPtr (int size)
NameToPtr: void Rehash (int newsize)
NameToPtr: void ~NameToPtr (void)
NameToPtr: int RetNumNames (void) const
NameToPtr: int RetHashSize (void) const
NameToPtr: void ApplyAll (FunctionNameObj fun)
FunctionNameObj is a pointer to a function which accepts
a single parameter of type NamedObject *.
NameToPtr: void ApplyAll (FunctionNameConstObj fun)
FunctionNameConstObj is a pointer to a function which accepts
a single parameter of type const NamedObject *.
Command Class
Class Command should be used for defining script commands.
It is derived from class NamedObject (see section NamedObject Class).
It provides functions for defining script commands, as well as
reading and executing script files.
When defining a command, a function is associated with the command name. The function provides the definition for what the command is to do. The function must take three arguments of the following types in order:
istream. This argument is intended to be
the stream of the script
input. By reading from this stream, the function can obtain any additional
parameters or information that are needed to execute the associated command.
Command, type Command *. This is intended to
be a pointer to the command object that actually invoked this command action
function. This is mainly used for when the action function needs to know what
command called it.
ostream. The function can print the output
of the command to this stream.
The function must return an integer. A value of 0 should be returned
if the script should terminate execution, either due to an error or simply
because the end of the script has been reached. Otherwise, a value of
1 should be returned.
The type of a pointer to such a function is called CommandFunctionPtr
and is defined as
typedef int (*CommandFunctionPtr)(istream &, Command *, ostream &);
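As an illustration of the calling convention, the following standalone sketch defines two action functions with the signature above and a tiny driver loop. The Command struct here is a hypothetical stand-in with only a name and an action, and run_script, echo_action, and quit_action are invented for this example. The script syntax (a command name followed by one word) is also an assumption, and std:: qualification is used in place of the unqualified istream and ostream of the original.

```cpp
#include <iostream>
#include <map>
#include <sstream>
#include <string>

struct Command;  // hypothetical stand-in for the library's Command class
typedef int (*CommandFunctionPtr)(std::istream &, Command *, std::ostream &);

struct Command {
    std::string name;
    CommandFunctionPtr action;
};

// Reads one argument from the script and echoes it; returns 1 to continue.
int echo_action(std::istream &in, Command *self, std::ostream &out) {
    std::string word;
    if (!(in >> word)) return 0;  // no argument available: stop the script
    out << self->name << ": " << word << "\n";
    return 1;
}

// Always asks the script to terminate.
int quit_action(std::istream &, Command *, std::ostream &) { return 0; }

// Hypothetical driver: looks up each command name and calls its action
// until an action returns 0. Returns the number of commands executed,
// including the one that terminated the script.
int run_script(std::istream &in, std::ostream &log) {
    Command echo = {"Echo", echo_action};
    Command quit = {"Quit", quit_action};
    std::map<std::string, Command *> table;
    table[echo.name] = &echo;
    table[quit.name] = &quit;
    int executed = 0;
    std::string name;
    while (in >> name) {
        std::map<std::string, Command *>::iterator it = table.find(name);
        if (it == table.end()) break;  // unknown command: stop
        ++executed;
        if (it->second->action(in, it->second, log) == 0) break;
    }
    return executed;
}
```

Passing the Command pointer to the action lets one function serve several command names, which is the use the second argument is described as supporting.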
The following methods are available for Command:
Command Command (char *nm, CommandFunctionPtr c, const char *use)
Command class. Argument nm is the name of the
command, c is the function which defines the action the command performs,
and use is a help string for describing what the command does.
Both c and use have default values of 0.
Command Command (char *alias_nm, Command * c)
Command class. Argument alias_nm is a new
alternative name for the
command and c is a pointer to the Command * object of the
original command.
Command: int execute (istream &in, ostream &log) const
The following static members are also available:
Command: static Command * command (const char * nm)
Command object associated with name nm.
Command: static NameToPtr * command_table (const char * nm)
Command: static void remove_from_table (char *nm)
Command: static int read_command (istream &in, ostream &log)
0 is returned if the end of input is reached,
and a 1 is returned otherwise.
Command: static void read_script (istream &in, ostream &log, ostream *prompt)
0.
The file log2.h provides a fast double precision function to
compute the logarithm base 2 of a number.
The functions in LogGamma.h provide a fast way to compute the
log Gamma(x), the natural logarithm of the Gamma function, and its first
two derivatives.
The file Multinomial.h declares functions for computing the multinomial probability of a
sequence of counts. Both functions return probabilities in the form of class Prob (see section The Prob, ShortProb, LargeReal, ShortLargeReal classes).
Regularizer Class
The class Regularizer and its derived classes provide ways of estimating
a normalized probability distribution from a sample of observed counts.
Regularizers allow a prior distribution to be specified. Such a prior
distribution should incorporate information about what one expects the
distribution to be in the absence of any observed samples. The prior
distribution is especially important for avoiding overfitting when the
number of observed samples is small. Typically, as the observed sample
size grows, the prior distribution is given less weight by a good
regularizer.
The class Regularizer is an abstract class meant to provide
the basic interface for deriving regularizer classes.
The Regularizer class represents the set of elemental events of the
probability distribution as tuples of alphabets (see section AlphabetTuple and BaseTuple Classes).
The Regularizer class is a pure virtual class derived
from class NamedClass (see section NamedClass Class) and supports methods
required of a NamedClass. It is also derived from
class NamedObject (see section NamedObject Class) so that regularizers
can be named.
Regularizer methods
In addition to the methods from NamedClass and NamedObject,
Regularizers support the following methods:
Regularizer: Regularizer (void)
Regularizer.
Regularizer: Regularizer (const Alphabet *a, const char *nm)
Regularizer. Sets the alphabet tuple of the
regularizer to the singleton tuple containing the alphabet pointed
to by a, and sets
the name of the regularizer to the string pointed to by nm if nm is
not 0. By default, nm has value 0.
Regularizer: Regularizer (const AlphabetTuple *at, const char *nm)
Regularizer. Sets the alphabet tuple of the
regularizer to a copy of the tuple pointed to by at, and sets
the name of the regularizer to the string pointed to by nm, if nm is
not 0. By default, nm has value 0.
Regularizer: ~Regularizer (void)
Regularizer.
Regularizer: virtual Regularizer * copy (void) const
Regularizer.
Regularizer: int alphabet_size (void) const
Regularizer: const AlphabetTuple * alphabet_tuple (void) const
Regularizer: void set_alphabet_tuple (AlphabetTuple* at)
Regularizer: void set_alphabet (Alphabet *a)
Regularizer: void print_order (ostream &out) const
Regularizer: void read_order (istream &in)
Regularizer: const int * input_order (void) const
Regularizer: virtual void print_info (void) const
Regularizer: static Regularizer * read_new (istream &in, IdObject * required_type)
IdObject
of the type of regularizer that should be read in.
If the type of the regularizer read from stream in is different
from required_type, then an error message is printed and 0 is returned.
Regularizer: static Regularizer * read_new (const char *filename, IdObject * required_type)
IdObject
of the type of regularizer that should be read in.
If the type of the regularizer read from the file filename is different
from required_type, then an error message is printed and 0 is returned.
Regularizer: virtual void get_modified_counts (const float *TrainCounts, float *ModifiedCounts)
Regularizer: void get_probs (const float *TrainCounts, float *probs)
Regularizer: float encodingCostForColumnCounts (const float *RealProbs, const float *TrainCounts, float *EstProbs)
Regularizer: virtual void normalize (void)
Regularizer: int verify_partials1 (const float *TrainCounts, float tolerance)
Regularizer: int verify_partials2 (const float *TrainCounts, float tolerance)
Regularizer commands
Class Regularizer supports the NamedClass script command format
(see section NamedClass commands). In addition to the basic commands supported
by NamedClass, all regularizers have the following commands:
Regularizer command: Alphabet = alphabet_name
Alphabet
with name alphabet_name.
Regularizer command: AlphabetPair = name_1 name_2
Regularizer command: AlphabetTriple = name_1 name_2 name_3
Regularizer command: AlphabetTuple = n name_1 ... name_n
Regularizer command: Name = name
Regularizer command: Order = bt_1 bt_2 ... bt_n
Order =
command name as there are elements in the regularizer's alphabet tuple.
If this command does not appear in a regularizer script, then the default
order is the order in which the base tuples are ordered according to
the method index() from class AlphabetTuple (see section AlphabetTuple and BaseTuple Classes).
When this command is used, the regularizer's
alphabet tuple must already have been set.
Regularizer command: Comment = comments_to_end_of_line
Comment = command name and extend to the end of the line.
MLPReg Class
Class MLPReg implements a maximum likelihood regularizer with
a pseudocount prior distribution. In an MLPReg, for each
base tuple of the alphabet tuple, there is a corresponding
pseudocount. Given
a sample of observed counts, the modified (posterior) count for
a base tuple is
computed by adding the observed count for the base tuple to the
base tuple's corresponding pseudocount.
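The computation described above amounts to an element-wise sum followed, when probabilities are wanted, by normalization. A minimal standalone sketch follows. The function names add_pseudocounts and normalize_counts are hypothetical and are not the MLPReg interface, and both input vectors are assumed to have the same length and base tuple ordering.

```cpp
#include <cstddef>
#include <vector>

// Modified (posterior) counts: observed count plus pseudocount, per base tuple.
// Assumes both vectors have the same length and use the same ordering.
std::vector<float> add_pseudocounts(const std::vector<float> &train_counts,
                                    const std::vector<float> &pseudocounts) {
    std::vector<float> modified(train_counts.size());
    for (std::size_t i = 0; i < train_counts.size(); ++i)
        modified[i] = train_counts[i] + pseudocounts[i];
    return modified;
}

// Scale counts so that they sum to 1. Assumes the total is positive.
std::vector<float> normalize_counts(const std::vector<float> &counts) {
    double total = 0.0;
    for (std::size_t i = 0; i < counts.size(); ++i) total += counts[i];
    std::vector<float> probs(counts.size());
    for (std::size_t i = 0; i < counts.size(); ++i)
        probs[i] = static_cast<float>(counts[i] / total);
    return probs;
}
```

With observed counts 3, 1, 0, 0 and a pseudocount of 1 for each base tuple, the modified counts are 4, 2, 1, 1, so a base tuple that was never observed still receives a nonzero probability.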
MLPReg methods
In addition to the methods inherited from Regularizer
(see section Regularizer methods),
class MLPReg supports the following methods:
MLPReg: MLPReg (void)
MLPReg.
MLPReg: MLPReg (const Alphabet *a, istream &in, const char *name)
MLPReg. Sets the alphabet tuple of the regularizer
to the singleton tuple containing the alphabet pointed to by a, and
sets the name to the string pointed to by name. Other information on the
regularizer, such as the pseudocounts, is read from stream in.
MLPReg: MLPReg (const Alphabet *a, const float *ps, const char *name)
MLPReg. Sets the alphabet tuple of the regularizer
to the singleton tuple containing the alphabet pointed to by a, sets
the pseudocounts to the values in the array ps, and
sets the name to the string pointed to by name. The default value of
name is 0.
MLPReg: MLPReg (const AlphabetTuple *at, const float *ps, const char *name)
MLPReg. Sets the alphabet tuple of the regularizer
to a copy of the alphabet tuple pointed to by at, sets
the pseudocounts to the values in the array ps, and
sets the name to the string pointed to by name. The default value of
name is 0.
MLPReg: ~MLPReg (void)
MLPReg.
MLPReg: const float * pseudocounts (void) const
MLPReg: void set_pseudocounts (const float *ps)
MLPReg: void freeze_dist (void)
MLPReg: void unfreeze_dist (void)
MLPReg commands
Along with the commands supported by its parent class Regularizer
(see section Regularizer commands),
MLPReg has one additional command:
MLPReg to the numbers pc_1
... pc_n, with one number per base tuple of the regularizer's
alphabet tuple. The order in which pseudocounts are associated with
base tuples can be changed with the Order = command
(see section Regularizer commands).
MLZReg Class
Class MLZReg implements a maximum likelihood regularizer with
a uniform prior distribution. The posterior counts are computed from
observed counts by adding a positive number (the zero offset) to
each of the observed counts for base tuples. The number added to
the observed counts is the same for each base tuple. Thus,
an MLZReg is just like an MLPReg whose pseudocounts
are all the same.
MLZReg methods
In addition to the methods inherited from Regularizer
(see section Regularizer methods),
class MLZReg supports the following methods:
MLZReg: MLZReg (void)
MLZReg.
MLZReg: MLZReg (const Alphabet *a, istream &in, const char *name)
MLZReg. Sets the alphabet tuple of the regularizer
to the singleton tuple containing the alphabet pointed to by a, and
sets the name to the string pointed to by name. Other information for the
regularizer, such as the zero offset, is read from stream in.
MLZReg: MLZReg (const Alphabet *a, float *zofs, const char *name)
MLZReg. Sets the alphabet tuple of the regularizer
to the singleton tuple containing the alphabet pointed to by a, sets
the zero offset to the value zofs, and
sets the name to the string pointed to by name. The default value of
name is 0, and the default value of zofs is 0.0001.
MLZReg: MLZReg (const AlphabetTuple *at, float *zofs, const char *name)
MLZReg. Sets the alphabet tuple of the regularizer
to a copy of the alphabet tuple pointed to by at, sets
the zero offset to the value zofs, and
sets the name to the string pointed to by name. The default value of
name is 0, and the default value of zofs is 0.0001.
MLZReg: void set_zero_offset (float *zofs)
MLZReg: float zero_offset (void) const
MLZReg commands
In addition to the commands from Regularizer, MLZReg
has the following command:
MLZReg to zero_offset.
DirichletReg Class
Class DirichletReg implements regularizers that use Dirichlet
mixtures for prior distributions. As mixtures, class DirichletReg
requires parameters for each of the component distributions as well
as a mixture coefficient for each of the components. Since the components
are Dirichlet distributions, the parameters required of an
individual component are the
pseudocounts, one for each base tuple of the regularizer's alphabet tuple.
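To make the role of the mixture coefficients and component pseudocounts concrete, the sketch below computes the standard posterior mean estimate under a Dirichlet mixture prior: each component is weighted by its mixture coefficient times the likelihood it assigns to the observed counts, and the weighted per-component pseudocount estimates are summed. Whether DirichletReg computes exactly this quantity is not stated in this section, so treat it as an assumption. The names DirichletComponent, log_evidence, and mixture_posterior_probs are hypothetical, and all pseudocounts and mixture coefficients are assumed to be positive.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct DirichletComponent {
    double mixture_coeff;       // mixture coefficient of this component
    std::vector<double> alpha;  // pseudocounts, one per base tuple
};

// log P(counts | alpha), omitting the multinomial coefficient, which is the
// same for every component and cancels when the weights are normalized.
double log_evidence(const std::vector<double> &counts,
                    const std::vector<double> &alpha) {
    double n_sum = 0.0, a_sum = 0.0, result = 0.0;
    for (std::size_t i = 0; i < counts.size(); ++i) {
        n_sum += counts[i];
        a_sum += alpha[i];
        result += std::lgamma(counts[i] + alpha[i]) - std::lgamma(alpha[i]);
    }
    return result + std::lgamma(a_sum) - std::lgamma(n_sum + a_sum);
}

// Posterior mean estimate of the probabilities under the mixture prior.
std::vector<double> mixture_posterior_probs(
        const std::vector<double> &counts,
        const std::vector<DirichletComponent> &components) {
    const std::size_t k = components.size();
    std::vector<double> log_w(k);
    double max_log_w = 0.0;
    for (std::size_t c = 0; c < k; ++c) {
        log_w[c] = std::log(components[c].mixture_coeff) +
                   log_evidence(counts, components[c].alpha);
        if (c == 0 || log_w[c] > max_log_w) max_log_w = log_w[c];
    }
    // Subtract the maximum before exponentiating to avoid underflow.
    std::vector<double> w(k);
    double w_sum = 0.0;
    for (std::size_t c = 0; c < k; ++c) {
        w[c] = std::exp(log_w[c] - max_log_w);
        w_sum += w[c];
    }
    double n_sum = 0.0;
    for (std::size_t i = 0; i < counts.size(); ++i) n_sum += counts[i];
    std::vector<double> probs(counts.size(), 0.0);
    for (std::size_t c = 0; c < k; ++c) {
        const std::vector<double> &alpha = components[c].alpha;
        double a_sum = 0.0;
        for (std::size_t i = 0; i < alpha.size(); ++i) a_sum += alpha[i];
        for (std::size_t i = 0; i < counts.size(); ++i)
            probs[i] += (w[c] / w_sum) * (counts[i] + alpha[i]) / (n_sum + a_sum);
    }
    return probs;
}
```

With a single component the result reduces to the plain pseudocount estimate, which is consistent with the constructors that convert an MLPReg or MLZReg into a one-component mixture.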
DirichletReg methods
In addition to the methods inherited from Regularizer
(see section Regularizer methods),
class DirichletReg supports the following methods:
DirichletReg: DirichletReg (void)
DirichletReg.
DirichletReg: DirichletReg (const Alphabet *a, istream &in, const char *name)
DirichletReg. Sets the alphabet tuple of the regularizer
to the singleton tuple containing the alphabet pointed to by a, and
sets the name to the string pointed to by name. Other information on the
regularizer, such as the alphas and the mixture coefficients, is read from stream in.
DirichletReg: DirichletReg (const Alphabet *a, const char *name, int size)
DirichletReg. Sets the alphabet tuple of the regularizer
to the singleton tuple containing the alphabet pointed to by a, sets the name to the string pointed to by name, and sets the number of components of the Dirichlet
mixture to size. The default value of
name is 0, and the default value of size is 0.
DirichletReg: DirichletReg (const DirichletReg & dreg)
DirichletReg.
DirichletReg: DirichletReg (const MLPReg & mlpreg)
DirichletReg from class MLPReg.
The prior distribution of mlpreg is converted into the sole component of a
one-component Dirichlet mixture.
DirichletReg: DirichletReg (const MLZReg & mlzreg)
DirichletReg from class MLZReg.
The uniform prior distribution of mlzreg is converted into the sole component of a
one-component Dirichlet mixture.
DirichletReg: ~DirichletReg (void)
DirichletReg.
DirichletReg: DirichletReg * posterior_mixture (const float *TrainCounts)
DirichletReg representing the correct posterior
distribution using the current DirichletReg as the prior and TrainCounts
as an observed sample.
DirichletReg: void AddComponent (float MixCoeff, const float *comp)
DirichletReg: void print_ordered_component (ostream &out, int comp_num) const
DirichletReg: void get_moments (const float *TrainCounts, double *ex_prob, double *ex_prob2)
DirichletReg: int num_components (void) const
DirichletReg: void set_component (int comp_num, int lett, float z)
DirichletReg: void scale_component (int comp_num, float multiplier)
DirichletReg: void delete_component (int comp_num)
DirichletReg: void set_mixture (int comp_num, float mix_coeff)
DirichletReg: float mixture_coeff (int comp_num) const
DirichletReg: double sum_component (int comp_num) const
DirichletReg: const float * component (int comp_num) const
DirichletReg: const float component (int comp_num, int lett) const
DirichletReg: void freeze_components (void)
DirichletReg: void unfreeze_components (void)
DirichletReg: void freeze_mixture (void)
DirichletReg: void unfreeze_mixture (void)
DirichletReg: void component_probs (const float * TrainCounts, double & SumTrainCounts, double *comp_probs, double *log_sum)
DirichletReg: const double * component_probs (void) const
DirichletReg: double log_probability (const float *TrainCounts, float *deriv1, float *deriv2)
DirichletReg: Prob Probability (const float *TrainCounts)
DirichletReg: Prob UnorderedProbability (const float *TrainCounts)
DirichletReg commands
In addition to the commands supported by Regularizer
(see section Regularizer commands), DirichletReg
has commands for specifying the parameters of a Dirichlet mixture.
The following commands specify parameters global to a DirichletReg:
DirichletReg command: AlphaChar = alphabet_size
DirichletReg command: NumDistr = num_components
The following commands are for specifying a component of the DirichletReg.
DirichletReg command: Number = component_num
Number = command is encountered. The numbering
of components starts at 0. This command should be followed
immediately by the corresponding Mixture = and Alpha =
commands for the component numbered component_num. The
Number = commands should occur in ascending order by
component_num. If they do not, a warning message is printed.
DirichletReg command: Mixture = mixture_coeff
Number = command encountered)
to the value mixture_coeff.
DirichletReg command: Alpha = pc_sum pc_1 pc_2 ... pc_n
Order = (see section Regularizer commands).
Note that the first argument pc_sum should be the sum of all the
following pc_i.
DirichletReg command: FullUpdate = comment
DirichletReg command: QUpdate = comment
DirichletReg command: StructID = comment
GribskovReg Class
Class GribskovReg implements a regularizer using Gribskov's average
score method. The parameters of a GribskovReg are the elements of the
square alphabet_size()*alphabet_size() score matrix along with the
background probabilities of each of the base tuples of the alphabet tuple.
The modified count of the base tuple indexed by i is computed by
multiplying the background probability of base tuple i times
the exponential of the inner product of row i of the score matrix
and the vector of observed counts.
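The modified count formula above can be written out directly. The following standalone sketch is only an illustration: gribskov_modified_counts is a hypothetical name, the score matrix is assumed to be stored row-major with n*n entries, and scores are assumed to be natural-log values already, so any conversion implied by the log base setting is left out.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// modified[i] = background[i] * exp( sum_j score[i][j] * counts[j] )
// The score matrix is stored row-major with n*n entries (an assumption).
std::vector<double> gribskov_modified_counts(const std::vector<double> &score,
                                             const std::vector<double> &background,
                                             const std::vector<double> &counts) {
    const std::size_t n = counts.size();
    std::vector<double> modified(n);
    for (std::size_t i = 0; i < n; ++i) {
        double inner = 0.0;  // inner product of row i with the observed counts
        for (std::size_t j = 0; j < n; ++j)
            inner += score[i * n + j] * counts[j];
        modified[i] = background[i] * std::exp(inner);
    }
    return modified;
}
```

With an all-zero score matrix every inner product is 0, so the modified counts equal the background probabilities regardless of the sample.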
GribskovReg methods
In addition to the methods inherited from class Regularizer
(see section Regularizer methods),
class GribskovReg provides the following methods:
GribskovReg: GribskovReg (void)
GribskovReg.
GribskovReg: GribskovReg (const Alphabet *a, istream &in, const char *name, double l_base)
GribskovReg. Sets the alphabet tuple of the regularizer
to the singleton tuple containing alphabet a, and sets the name of the regularizer
to name.
Other information for constructing the regularizer, such as the score matrix and
background probabilities, is read from the stream in.
The argument l_base is the natural logarithm of the base
in which numbers from the input from stream in are to be interpreted.
GribskovReg: GribskovReg (const Alphabet *a, const char *name)
GribskovReg. Sets the alphabet tuple of the regularizer
to the singleton tuple containing alphabet a, and sets the name of the regularizer
to name.
The default value of name is 0.
GribskovReg: GribskovReg (const AlphabetTuple *at, const char *name)
GribskovReg. Sets the alphabet tuple of the regularizer
to a copy of the tuple at, and sets the name of the regularizer
to name.
The default value of name is 0.
GribskovReg: ~GribskovReg (void)
GribskovReg.
GribskovReg: double log_base (void) const
GribskovReg: void set_log_base (double l_base)
GribskovReg: float & element (int i, int j)
GribskovReg: float element (int i, int j) const
GribskovReg: float & background (int i)
GribskovReg: float background (int i) const
GribskovReg commands
In addition to the Regularizer commands
(see section Regularizer commands), GribskovReg
has the following commands:
GribskovReg command: Background = p_1 p_2 ... p_n
Regularizer command
Order = (see section Regularizer commands).
GribskovReg command: Scores = s_11 s_12 ... s_nn
Regularizer command
Order = (see section Regularizer commands).
GribskovReg command: LogBase = l_base
GribskovReg command: Base = base
SubstPseudoReg Class
The class SubstPseudoReg implements substitution matrices
as regularizers. In addition to the basic substitution matrix,
there are options for adding pseudocounts and scaled counts.
Addition of pseudocounts can improve performance when the
sample size is 0, while addition of scaled counts can
help when the sample size is very large.
The parameters for a SubstPseudoReg include the elements of
the substitution matrix, which has alphabet_size()*alphabet_size()
many entries. The entry at row i and column j of the matrix
should be the probability of base tuple i given a sample of
size 1 containing base tuple j.
If the option to add pseudocounts is used, then the parameters also include one pseudocount for each base tuple of the alphabet tuple.
When neither the pseudocount nor the scaled counts option is used, the modified count of the base tuple with index i is computed as the inner product of row i of the matrix with the vector of observed counts. If the pseudocounts option is used, then the pseudocount corresponding to base tuple i is added to the above inner product to obtain the modified count. When the scaled counts option is used, the observed count from the sample for base tuple i is scaled (multiplied) by the total size of the observed sample and added to the above inner product to get the modified count.
SubstPseudoReg methods
In addition to the methods inherited from class Regularizer
(see section Regularizer methods),
class SubstPseudoReg supports the following methods:
SubstPseudoReg: SubstPseudoReg (void)
SubstPseudoReg.
SubstPseudoReg: SubstPseudoReg (const Alphabet *a, istream &in, const char *name)
SubstPseudoReg.
Sets the alphabet tuple of the regularizer
to the singleton tuple containing alphabet a, and sets the name of the regularizer
to name.
Other information for constructing the regularizer,
such as the substitution matrix, are read from the stream in.
SubstPseudoReg: SubstPseudoReg (const Alphabet *a, const char *name)
SubstPseudoReg.
Sets the alphabet tuple of the regularizer
to the singleton tuple containing alphabet a,
and sets the name of the regularizer
to name.
The default value of name is 0.
SubstPseudoReg: SubstPseudoReg (const AlphabetTuple *at, const char *name)
SubstPseudoReg.
Sets the alphabet tuple of the regularizer
to a copy of the tuple at, and sets the name of the regularizer
to name.
The default value of name is 0.
SubstPseudoReg: ~SubstPseudoReg (void)
SubstPseudoReg.
SubstPseudoReg: int num_columns (void) const
SubstPseudoReg: int use_scaled_counts (void) const
SubstPseudoReg: void use_scaled_counts (int i)
SubstPseudoReg: int use_pseudocounts (void) const
SubstPseudoReg: void use_pseusocounts (int i)
SubstPseudoReg: void freeze_columns (void)
SubstPseudoReg: void unfreeze_columns (void)
SubstPseudoReg: void freeze_pseudocounts (void)
SubstPseudoReg: void unfreeze_pseudocounts (void)
SubstPseudoReg: float element (int row, int col) const
SubstPseudoReg: void set_element (int row, int col, float val)
SubstPseudoReg: float min_element (int row, int col) const
SubstPseudoReg: float sum_col (int col) const
SubstPseudoReg commands
In addition to the commands supported by Regularizer
(see section Regularizer commands),
SubstPseudoReg has the following commands:
SubstPseudoReg command: Order = bt_1 bt_2 ... bt_n option_word
Regularizer (see section Regularizer commands). In
addition an option word should follow the base tuples arguments.
The argument option_word should be one of the following:
SubstPseudoReg command: Subst = sm_11 sm_22 ... sm_nn
SubstPseudoReg command: Subst = sm_11 sm_22 ... sm_1n pc_1 ... sm_n1 sm_n2 ... sm_nn pc_n
FeatureReg Class
Class FeatureReg uses feature partitions as a regularizer.
In the feature partitioning method, the alphabet of the regularizer
is divided up into disjoint sets. Such a partitioning is referred
to as a feature alphabet. The feature alphabet is then given a
zero offset distribution. In a regularizer, many feature alphabets
may be used to obtain the posterior counts. The posterior count
for a base tuple is computed by multiplying the posterior
counts of the base
tuple from each of the separate feature alphabets.
The posterior count
for a base tuple from each feature alphabet is in turn computed
by adding the zero offset of the feature alphabet to
the sum of the observed counts of all the base tuples that
are in the same set of the partition as the target base tuple
whose posterior count we want to compute.
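The two-level computation above (a per-partition reduced count, then a product over partitions) can be sketched as follows. The struct FeatureAlphabet and the functions reduce_counts and feature_posterior_counts are hypothetical names for this illustration and do not reproduce the FeatureReg or FeaturePartition interfaces. Each partition is assumed to be given as a set index per base tuple.

```cpp
#include <cstddef>
#include <vector>

// One feature alphabet: a partition of the base tuples plus a zero offset.
struct FeatureAlphabet {
    std::vector<int> which_feature;  // set index (0-based) for each base tuple
    int num_features;                // number of disjoint sets
    double zero_offset;
};

// Per-partition posterior count for each base tuple: the zero offset plus
// the summed observed counts of all base tuples in the same set.
std::vector<double> reduce_counts(const FeatureAlphabet &fa,
                                  const std::vector<double> &counts) {
    std::vector<double> set_sum(fa.num_features, 0.0);
    for (std::size_t i = 0; i < counts.size(); ++i)
        set_sum[fa.which_feature[i]] += counts[i];
    std::vector<double> reduced(counts.size());
    for (std::size_t i = 0; i < counts.size(); ++i)
        reduced[i] = fa.zero_offset + set_sum[fa.which_feature[i]];
    return reduced;
}

// Overall posterior counts: the product of the per-partition counts.
std::vector<double> feature_posterior_counts(
        const std::vector<FeatureAlphabet> &alphabets,
        const std::vector<double> &counts) {
    std::vector<double> posterior(counts.size(), 1.0);
    for (std::size_t a = 0; a < alphabets.size(); ++a) {
        std::vector<double> reduced = reduce_counts(alphabets[a], counts);
        for (std::size_t i = 0; i < counts.size(); ++i)
            posterior[i] *= reduced[i];
    }
    return posterior;
}
```

For four base tuples with counts 2, 1, 0, 1, a partition {0,1},{2,3} with zero offset 1 gives per-partition counts 4, 4, 2, 2. A second partition {0,2},{1,3} with zero offset 0.5 gives 2.5 for every base tuple, so the products are 10, 10, 5, 5.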
FeatureReg methods
In addition to the methods inherited from Regularizer
(see section Regularizer methods),
class FeatureReg supports the following methods:
FeatureReg: FeatureReg (void)
FeatureReg.
FeatureReg: FeatureReg (const Alphabet *a, istream &in, const char *name)
FeatureReg.
Sets the alphabet tuple of the regularizer
to the singleton tuple containing alphabet a, and sets the name of the regularizer
to name.
Other information for constructing the regularizer,
such as the feature partitions, are read from the stream in.
FeatureReg: FeatureReg (const Alphabet *a, const char *name)
FeatureReg.
Sets the alphabet tuple of the regularizer
to the singleton tuple containing alphabet a,
and sets the name of the regularizer
to name.
The default value of name is 0.
FeatureReg: FeatureReg (const AlphabetTuple *at, const char *name)
FeatureReg.
Sets the alphabet tuple of the regularizer
to a copy of the tuple at, and sets the name of the regularizer
to name.
The default value of name is 0.
FeatureReg: ~FeatureReg (void)
FeatureReg.
FeatureReg: int num_alphs (void) const
FeatureReg: void set_zero_offset (int i, int z)
FeatureReg: float zero_offset (int i) const
FeatureReg: const FeaturePartition * partition (int i) const
FeaturePartition * that points to the
feature partition with the index i.
FeatureReg: void add_partition (FeaturePartition * fp, float z)
FeatureReg: FeaturePartition * pop_partition (int delete_this)
FeatureReg: void add_best_partition (const float * Summary, int min_features, int max_features)
float
with Summary[i*alphabet_size() + j] being the frequency of character
i, having seen a sample containing character j.
FeatureReg commands
In addition to the commands available from class Regularizer
(see section Regularizer commands),
FeatureReg supports the following command:
FeatureReg command: Partition = ( ds_1 , ds_2, ..., ds_n ) zero_offset
Partition = command is
Partition = ( D + E, F + R + H, N + Q, S + T, I + L + V,
F + W + Y, C, M, A + G, P ) 0.764163
In this example partition of amino acids, there are ten disjoint sets
in the partition, with a zero offset of 0.7641663.
For each FeatureReg, there must be as many Partition =
commands in its specification as the regularizer needs.
FeaturePartition class
The class FeaturePartition supports the implementation of
the class FeatureReg by providing a representation of
a feature partition.
It has the following methods:
FeaturePartition: FeaturePartition (const AlphabetTuple * at)
FeaturePartition. The argument at is
the alphabet tuple that is partitioned into disjoint sets by the
FeaturePartition.
FeaturePartition: FeaturePartition (const FeaturePartition & partition)
FeaturePartition.
FeaturePartition: ~FeaturePartition (void)
FeaturePartition.
FeaturePartition: int OK (void) const
FeaturePartition: const AlphabetTuple * alphabet_tuple (void) const
FeaturePartition: int which_feature (int i) const
FeaturePartition: int & which_feature (const BaseTuple bt)
FeaturePartition: void set_feature (int letter, int which)
FeaturePartition: int num_features (void) const
FeaturePartition: void ReduceCounts (const float * counts, float zero_offset, float * reduced) const
FeaturePartition: void print (ostream & out)
FeatureReg command Partition =
(see section FeatureReg commands).
FeaturePartition: void read (ostream & in)
FeatureReg command Partition =
(see section FeatureReg commands).
The following function can be used to optimize a FeatureReg:
There are three hash table classes: SimpleHashClass,
DictionaryClass, and UserDefinitions. All are based on a
safer derivative of the GNU string class. See section The StringListClass class, and
section `The String Class' in Libg++ User's Guide. The
DictionaryClass adds file reading to the basic hash table class,
while UserDefinitions is a hash table for global initialization
data read from an initialization file.
SimpleHashClass class
A SimpleHashClass class implements a simple hash table using
GNU string classes. The DictionaryClass is built on top
of this class.
The data it holds are pairs of GNU String objects, one part being the name to look up and the other being the value.
The supported functions are: looking up a name and returning its value if it exists; inserting a new name/value pair; obtaining the number of stored items; and clearing all the contents from the table.
SimpleHashClass: String lookup (String key)
SimpleHashClass: long insert (String key, String value)
SimpleHashClass: long listSize (void)
SimpleHashClass: void reset (void)
Strings that are stored, so use of this is a potential
memory leak.
You can dump the values to a stream. The format is name blank value newline.
SimpleHashClass: void print (ostream& file)
file.
One name/value pair per line. The name is separated from the
value by a single space.
DictionaryClass class
A DictionaryClass class implements a hash table with
a file reader. The functions are the same as SimpleHashClass
along with functions to read data from a stream or the
environment.
The data it holds are pairs of GNU String objects, one part being
the name to look up and the other being the value. It functions
like a hash table: you look up using the name, and the value is
returned.
See section The SimpleHashClass class.
The only functions supported are looking up a name and returning its value if it exists, inserting a new name/value pair, and obtaining the number of stored items.
DictionaryClass: String lookup (String key)
DictionaryClass: long insert (String key, String value)
DictionaryClass: long listSize (void)
You can dump the values to a stream. The format is name blank value newline.
DictionaryClass: void print (ostream& file)
UserDefinitions class
The UserDefinitions class holds the values that the user
of the program set before program execution.
It is intended that this be a unified way for programs to access
standard startup information.
It is intended that only one object of this class be created, and
that it be a global that is visible to everyone.
It should be a static global, so that it is created at startup time.
This class is a restricted version of the DictionaryClass.
The data it holds are pairs of GNU String objects, one part being
the name to look up and the other being the value. It functions
like a hash table: you look up using the name, and the value is
returned.
See section The DictionaryClass class.
What is loaded into this class is the environment as it existed at program start, and the contents of a startup file.
Being a static object, it is not permitted to delete it. In fact the only valid things to do with this class are to lookup values, insert values, find out how many values are stored, and print the contents to an ostream.
The UserDefinitions class should be initialized
once in the main function.
If you require the functionality, use a DictionaryClass.
The only functions supported are looking up a name and returning its value if it exists, inserting a new name/value pair, and obtaining the number of stored items.
UserDefinitions: String lookup (String key)
UserDefinitions: long insert (String key, String value)
UserDefinitions: long listSize (void)
You can dump the values to a stream. The format is name blank value newline.
UserDefinitions: void print (ostream& file)
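As an illustration of the "environment at program start" part of the description, the following sketch loads NAME=value entries into a single global table. It is only a model under stated assumptions: std::map and std::string stand in for the library classes, the names SketchDefinitions, load_environment and g_user_definitions are hypothetical, and the startup file is omitted because its format is not described in this section.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for the UserDefinitions storage.
typedef std::map<std::string, std::string> SketchDefinitions;

// One global instance, visible everywhere, to be filled once from main.
SketchDefinitions g_user_definitions;

// Load entries of the form NAME=value from a null-terminated array of
// C strings (the shape of a process environment block).
// Returns the number of items stored afterwards.
long load_environment(const char* const* envp, SketchDefinitions& defs) {
    for (; envp != 0 && *envp != 0; ++envp) {
        std::string entry(*envp);
        std::string::size_type eq = entry.find('=');
        if (eq == std::string::npos) continue;  // skip entries without '='
        defs[entry.substr(0, eq)] = entry.substr(eq + 1);
    }
    return static_cast<long>(defs.size());
}

// Usage example with a fake environment, so the result is predictable.
long user_definitions_example() {
    const char* fake_env[] = { "HOME=/u/demo", "ULTIMATE_DEBUG=1", "malformed", 0 };
    long count = load_environment(fake_env, g_user_definitions);
    assert(g_user_definitions["ULTIMATE_DEBUG"] == "1");
    return count;
}
```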
StringListClass class
A StringListClass is a simple wrapper for an
array of GNU String objects. The emphasis is on runtime safety,
not speed: it checks the indexes you give it.
It grows dynamically if it
needs to when you add a new string. It is 1-based, meaning
that the index of the first item is 1, not 0.
The SimpleHashClass is built on top
of this class.
The functions supported are: get an item at a given index; get the number of items in the list; add a new item at the end or at a given index; remove the item at a given index, or search for a given item and remove it; and search the list for a given item and return its index. The value used for NULL data and returns, called CURRENTNULL, can be set.
StringListClass: String operator [] (long index)
StringListClass: String getItem (long index)
StringListClass: void putAppend (String item)
listSize() + 1.
StringListClass: long listSize (void)
StringListClass: void putItemAtIndex (long index, String item)
StringListClass: void removeItemAtIndex (long index)
StringListClass: void removeItem (String item)
indexOfItem, and removeItemAtIndex.
StringListClass: long indexOfItem (String item)
StringListClass: void setCurrentNull (String item=GNU NULL)
getItem when indexes are out of range, and the value that
will be put into the list to overwrite a removed item.
StringListClass: String getCurrentNull (void)
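The 1-based indexing, the bounds checking and the role of CURRENTNULL can be sketched as follows. This is a hypothetical model (SketchStringList, built on std::vector and std::string), not the library source. It assumes that indexOfItem returns 0 when the item is absent and that a removed slot is overwritten with CURRENTNULL rather than closed up; the latter follows the wording above, the former is a guess.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical model of StringListClass behaviour.
class SketchStringList {
public:
    long listSize() const { return static_cast<long>(items_.size()); }
    // 1-based, bounds-checked read; out-of-range indexes yield CURRENTNULL.
    std::string getItem(long index) const {
        if (index < 1 || index > listSize()) return current_null_;
        return items_[static_cast<std::size_t>(index - 1)];
    }
    // The new item lands at index listSize() + 1 (as measured before the call).
    void putAppend(const std::string& item) { items_.push_back(item); }
    // Assumed: returns 0 when the item is not present.
    long indexOfItem(const std::string& item) const {
        for (long i = 1; i <= listSize(); ++i)
            if (getItem(i) == item) return i;
        return 0;
    }
    // The removed slot is overwritten with CURRENTNULL; bad indexes are ignored.
    void removeItemAtIndex(long index) {
        if (index < 1 || index > listSize()) return;
        items_[static_cast<std::size_t>(index - 1)] = current_null_;
    }
    void removeItem(const std::string& item) { removeItemAtIndex(indexOfItem(item)); }
    void setCurrentNull(const std::string& item) { current_null_ = item; }
    std::string getCurrentNull() const { return current_null_; }
private:
    std::vector<std::string> items_;
    std::string current_null_;
};

// Usage example: remove one item, then read inside and outside the list.
std::string string_list_example() {
    SketchStringList list;
    list.setCurrentNull("<null>");
    list.putAppend("seq1");
    list.putAppend("seq2");
    assert(list.indexOfItem("seq2") == 2);
    list.removeItem("seq1");
    return list.getItem(1) + "," + list.getItem(2) + "," + list.getItem(3);
}
```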
The Baskin Center's MasPar MP-2204 has 4096 32-bit SIMD processing elements, each with 64 Kbytes of local memory, a mesh interconnection network, a global router, and 128 Mbytes of global memory connected to the router that can be used for parallel independent file access.
Documentation of the MasPar ganesha can be found in
`/usr/maspar/doc', and an excellent tutorial on the DECmpp, another
name for the MasPar, can be found in `~rph/220/mppdoc'.
Many biosequence projects fit well on a linear array of processing elements. Unfortunately, the MasPar x-net is not perfectly suited for providing a chain of processing elements rather than a square mesh. The following routines, provided in `mp_linear.m' and `mp_linear.h' in the `ultimate/include' and `ultimate/maspar' directories, provide the necessary functionality.
The following routines shift data in all processing elements:
the active set is not obeyed. Also, if processing elements are grouped
for more memory or computation power, more efficient variants on these
routines using xnetpipe routines could speed operations by a
factor proportional to the group size.
xnet primitive) and shift
them in all processing elements one element to the east or
one element to the west, respectively. The shift treats the processor
array as linear, meaning that in the former case, the value in
processing element iproc is shifted to processing element
iproc+1, which may be on a different row. The final element in
the array, processing element nproc-1 is shifted to
processing element 0. For shifting quantities larger than 64
bits (long long), see below.
(iproc)
to (iproc+dist) with wraparound. The result is stored in the
plural block of memory pointed to by dest, which may be the
same as src. The routine is based on ss_xfetch, and could
easily be modified for plural pointers to plural data. If dist
is zero, a plural
bcopy is performed.
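The intended effect of a linear shift with wraparound can be modelled sequentially. The sketch below is ordinary C++ run on one processor, not MPL, and the function name linear_shift is hypothetical. It only shows where each value ends up when the array of nproc elements is treated as a ring.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential model: the value held by element iproc ends up in element
// (iproc + dist) mod nproc. A negative dist models a shift to the west.
std::vector<int> linear_shift(const std::vector<int>& src, long dist) {
    const long nproc = static_cast<long>(src.size());
    std::vector<int> dest(src.size());
    if (nproc == 0) return dest;
    for (long iproc = 0; iproc < nproc; ++iproc) {
        // Double modulo keeps the target non-negative when dist is negative.
        long target = ((iproc + dist) % nproc + nproc) % nproc;
        dest[static_cast<std::size_t>(target)] = src[static_cast<std::size_t>(iproc)];
    }
    return dest;
}

// Usage example: shift the values 0, 1, 2, 3 one element to the east.
std::vector<int> shift_example() {
    std::vector<int> values;
    for (int i = 0; i < 4; ++i) values.push_back(i);
    std::vector<int> shifted = linear_shift(values, 1);
    assert(shifted[0] == 3);  // the last element wrapped around to element 0
    return shifted;
}
```

On the real mesh, element iproc+1 may sit at the start of the next row, which is why the routines cannot be a plain one-step x-net move in every element.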
BlockIn, which perform a similar
function for rectangular subarrays of processing elements using the host
to array DMA channel. Data is copied from the block of memory starting
at from of length (size*npe). The first size
bytes are copied to the block of memory of length size starting at
to in processor number start, the second block of size
bytes is copied to processor (start+1) starting at to, and
so on until the final block of memory is copied to processing element
(start+npe-1). The active set is ignored. If data is
originating in a file, it may be faster to use parallel file access and
the IORAM.
BlockOut, which perform a similar
function for rectangular subarrays of processing elements using the host
to array DMA channel. Data is copied to the block of memory starting
at to of length (size*npe). The first size
bytes are copied from the block of memory of length size starting at
from in processor number start, the second block of size
bytes is copied from processor (start+1) starting at from, and
so on until the final block of memory is copied from processing element
(start+npe-1). The active set is ignored. If data is to be
sent to a file, it may be faster to use parallel file access and
the IORAM.
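The data layout described for the two copy directions can be modelled on a single processor. In the sketch below each processing element is simulated by a byte vector; the function names, the argument order and the PeMemory type are hypothetical and are not the signatures from `mp_linear.h'.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Each simulated processing element (PE) owns a private byte array.
typedef std::vector<std::vector<unsigned char> > PeMemory;

// Host to PEs: chunk k of the host buffer `from` is copied to offset `to`
// in the local memory of PE (start + k), for k = 0 .. npe-1.
void sketch_block_in(PeMemory& pe, std::size_t start, std::size_t npe,
                     std::size_t to, const unsigned char* from, std::size_t size) {
    if (size == 0) return;
    for (std::size_t k = 0; k < npe; ++k) {
        assert(start + k < pe.size() && to + size <= pe[start + k].size());
        std::memcpy(&pe[start + k][to], from + k * size, size);
    }
}

// PEs to host: the mirror image. Chunk k of the host buffer `to` receives
// `size` bytes from offset `from` in the local memory of PE (start + k).
void sketch_block_out(const PeMemory& pe, std::size_t start, std::size_t npe,
                      std::size_t from, unsigned char* to, std::size_t size) {
    if (size == 0) return;
    for (std::size_t k = 0; k < npe; ++k) {
        assert(start + k < pe.size() && from + size <= pe[start + k].size());
        std::memcpy(to + k * size, &pe[start + k][from], size);
    }
}

// Usage example: scatter six bytes over PEs 1..3, then gather them back.
bool block_round_trip_example() {
    PeMemory pe(4, std::vector<unsigned char>(8, 0));
    const unsigned char host_in[6] = { 'a', 'a', 'b', 'b', 'c', 'c' };
    unsigned char host_out[6] = { 0 };
    sketch_block_in(pe, 1, 3, 4, host_in, 2);
    sketch_block_out(pe, 1, 3, 4, host_out, 2);
    return std::memcmp(host_in, host_out, 6) == 0
        && pe[2][4] == 'b'   // the second chunk landed in PE start+1
        && pe[0][4] == 0;    // PE 0 is outside the range and untouched
}
```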
The MPL compiler does not support parallel C++, and is thus unable to
make use of the Ultimate Probability class (see section The Prob, ShortProb, LargeReal, ShortLargeReal classes). The
following routines in `mp_Prob.h' and `mp_Prob.m' in the
`include' and `maspar' directories support probabilities with
a hidden representation as log-probabilities stored in 32-bit integers.
Addition is performed using table lookup on a plural table of 7600 short
integers (using 1.5 kBytes of local PE memory on each PE). Future
versions may optionally implement this table in IORAM to save space in
the processor memory. This, of course, would be significantly slower.
The routines are all macros or inline function definitions, and
include singular (ACU) and plural (DPU) variants. The plural versions
are all defined as macros. The most involved group, the one for adding
probabilities using the lookup table, requires temporary `register'
arguments to be used for intermediate values. Make sure that these
arguments really are registers, not memory locations, or the routines
will grind to a halt.
The internal format of these probabilities may not be compatible with
those of the Prob class (see section The Prob, ShortProb, LargeReal, ShortLargeReal classes). Code that links these
routines to G++ (see section Linking G++ with MPL) will have to convert
between formats. The simplest way to do this is to use the appropriate
function to convert the probability to a double log-probability,
and then create the new probability using that value (see section Member Functions).
The smallest representable probability, Prob_val (Prob_zero()),
is exp(-20), or approximately 2.06E-9.
A call must be made to the Prob_init() function before using
singular or plural probabilities to compute the table used for adding
probabilities.
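Table-based addition of log-probabilities can be sketched as follows. For stored logs a >= b, log(exp(a) + exp(b)) = a + log(1 + exp(b - a)), so a table indexed by the integer difference a - b supplies the correction term. The sketch is plain C++ rather than MPL, all names are hypothetical, and the scale of 1000 units per natural-log unit is an assumption. With that scale the correction rounds to zero once the two values differ by a little over 7600 units, which is consistent with the table size quoted above, but the scale actually used by `mp_Prob.h' is not documented here.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

typedef int SketchProb;          // scaled natural log of a probability (assumed layout)
const double kScale = 1000.0;    // hypothetical: one unit is 0.001 in log space

std::vector<short> g_add_table;  // filled by sketch_prob_init()

// Build the correction table: entry d holds kScale * log(1 + exp(-d / kScale)),
// rounded to the nearest integer. Stop at the first entry that rounds to zero.
void sketch_prob_init() {
    g_add_table.clear();
    for (long d = 0; ; ++d) {
        double corr = kScale * std::log(1.0 + std::exp(-d / kScale));
        short entry = static_cast<short>(std::floor(corr + 0.5));
        if (entry == 0) break;
        g_add_table.push_back(entry);
    }
}

SketchProb sketch_prob_make(double p) {
    assert(p > 0.0);  // log(0) is not representable in this sketch
    return static_cast<SketchProb>(std::floor(kScale * std::log(p) + 0.5));
}

double sketch_prob_val(SketchProb p) { return std::exp(p / kScale); }

// Add two probabilities held in log form using one table lookup.
SketchProb sketch_prob_add(SketchProb a, SketchProb b) {
    assert(!g_add_table.empty());  // sketch_prob_init() must run first
    SketchProb hi = a > b ? a : b;
    SketchProb lo = a > b ? b : a;
    long diff = static_cast<long>(hi) - static_cast<long>(lo);
    if (diff >= static_cast<long>(g_add_table.size())) return hi;  // lo is negligible
    return hi + g_add_table[static_cast<std::size_t>(diff)];
}
```

The early return for large differences is what keeps the table finite: beyond its last entry the smaller probability cannot change the rounded sum.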
Note that MPL (and C in general) regards a typedef as an alias
for a type, not a new type. Therefore, unlike the C++ probability class
(see section The Prob, ShortProb, LargeReal, ShortLargeReal classes), statements like prob + 0.5 will not generate an
error, and will certainly not be the same as prob + Prob_make
(0.5).
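The pitfall can be reproduced in a few lines. The sketch assumes, purely for illustration, that the probability type is a typedef for int holding a natural log scaled by 1000; the names AliasProb, alias_make, alias_val and misleading_add are hypothetical.

```cpp
#include <cassert>
#include <cmath>

typedef int AliasProb;              // an alias for int, not a distinct type
const double kAliasScale = 1000.0;  // hypothetical scale factor

AliasProb alias_make(double p) {
    assert(p > 0.0);
    return static_cast<AliasProb>(std::floor(kAliasScale * std::log(p) + 0.5));
}

double alias_val(AliasProb p) { return std::exp(p / kAliasScale); }

// This compiles because AliasProb is just int. The 0.5 is added to the
// stored log value, and the result is truncated back to an integer.
AliasProb misleading_add(AliasProb prob) {
    return static_cast<AliasProb>(prob + 0.5);
}
```

For a probability of 0.5 the stored value is -693, and misleading_add returns -692, which still represents a probability of about 0.5 rather than the 1.0 that a true sum with 0.5 would give.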
Prob of value 0.
Prob with a very strong value of zero. In log-prob
terms, this function returns a value much larger than the log-prob of
the smallest representable number. Used in the protein HMM code for
initialization of boundary conditions and such.
Prob of value unity.
prob_val)
Prob corresponding to the 32-bit floating-point number
prob_val.
log_val)
Prob corresponding to the 32-bit floating-point
log-probability number log_val. This is based on the
natural logarithm. Natural logarithms are generally more
efficient as they are the primary source of logarithms (that is, logs in
other bases, such as 2, are computed from the natural logarithm).
prob)
prob as a 32-bit floating-point number. This function
call hides an exponentiation.
prob)
prob.
prob is zero according to the granularity of the
internal representation.
p1, Prob p2)
prob1,Prob prob2)
prob1 and prob2.
The plural functions are quite similar to the above singular functions,
with the exception that, to aid hand optimization of MPL code (perhaps
not as needed as originally, now that the -Omax compiler flag is
available), many variants of the Prob_add function (see section Singular Probability Functions) are provided.
Plural versions of the constant functions are not provided. This type conversion (a data broadcast) can be performed by the MPL compiler automatically.
plural Prob, but is defined separately in case future changes
are required.
prob_val)
p_Prob corresponding to the 32-bit plural
floating-point number
prob_val.
log_val)
p_Prob corresponding to the 32-bit plural floating-point
log-probability number log_val. This is based on the
natural logarithm.
prob)
prob as a 32-bit plural floating-point number. This function
call hides an exponentiation.
prob)
prob.
prob is zero according to the granularity of the
internal representation.
p1, p_Prob p2)
prob1,p_Prob prob2)
prob1 and prob2.
The following functions are defined as macros rather than inline functions, and all require temporary registers as arguments.
p1, p_Prob p2, p_Prob p3, p_Prob tmp1, p_Prob tmp2, p_Prob tmp3)
p1, p2, and p3,
given three temporary p_Prob registers. The three
probabilities, assumed to be in memory, are copied into registers before
computing on them.
p1, p_Prob p2, p_Prob tmp1, p_Prob tmp2)
p1 and p2,
given two temporary p_Prob registers. The first
probability should be in memory, the second in a register.
p1, p_Prob p2, p_Prob p3, p_Prob tmp1)
p1.
p1, p_Prob p2, p_Prob tmp1, p_Prob tmp2, p_Prob tmp3)
p1 += p2, with the probabilities initially residing in
memory. The three temporary registers must be p_Prob registers.
p1, p_Prob p2, p_Prob tmp1, p_Prob tmp2, p_Prob tmp3)
As above, except that the accumulator p1 is assumed to be a register.
As of this writing, G++ (see section `Top' in Gcc User's Guide) is not supported on the MasPar.
However, it is possible to link G++ object code with MPL object
code to create a program where the G++ part executes only on
the front end DECstation, and the MPL runs normally.
The key to this is the G++ compiler option -fno-gnu-linker
(see section `Code Gen Options' in Gcc User's Guide).
To create a program that runs on the MasPar, the final linker
has to be mpld. So the idea is to compile all the G++ code first,
link it all into one object file (using ld -r) and then
use mpld as the final link step to merge the MPL and final G++
object file into a program.
This works except for static class objects. There has to be special initialization code for these that occurs before main is called, and the destructors for them have to be called upon exit. Constructors are pretty easy, but the destructors are not.
The GNU G++ compiler can be instructed to output the code that will call
the static constructor and destructor code. This makes it possible to
use a different linker (in this case, mpld) but still have the static
object constructor and destructor code be operational. I believe that
main must be in your C++ code for this all to work.
The key step here is to create the code that knows what the static
objects are so they can be built. The program that does this
is called findconstructors. It is a pretty simple operation;
the standard GNU program collect2 does exactly the same
thing (I got the routines from there, but it tries to pretend to be
the whole linker, and I decided to do this explicitly).
This program uses the output of nm to find the static object
constructors and destructors, and then outputs a C source file that
has a table of these functions. Upon linking, the names get turned
into addresses to functions, and the startup (and exit)
code will use this table
to call the functions.
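The table-driven startup and exit sequence can be pictured with the following sketch. It is not the output of findconstructors: the table names, the null terminator and the stand-in constructor and destructor routines are all hypothetical, and the order in which the real startup code walks the tables is not documented here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

typedef void (*InitFunction)();

// Stand-ins for the compiler-generated routines that construct and
// destroy static objects. Each one records that it was called.
static std::string g_trace;
static void ctor_a() { g_trace += "A"; }
static void ctor_b() { g_trace += "B"; }
static void dtor_a() { g_trace += "a"; }
static void dtor_b() { g_trace += "b"; }

// Null-terminated tables of function addresses. In a generated file the
// entries would be symbol names resolved to addresses by the linker.
static InitFunction const ctor_table[] = { ctor_a, ctor_b, 0 };
static InitFunction const dtor_table[] = { dtor_b, dtor_a, 0 };

// Call every function in a table, in order.
static void run_table(const InitFunction* table) {
    for (; *table != 0; ++table) (*table)();
}

// Usage example: constructors, then the program body, then destructors.
std::string startup_exit_example() {
    g_trace.clear();
    run_table(ctor_table);   // startup code, before the body of main
    g_trace += "|main|";
    run_table(dtor_table);   // exit code
    assert(!g_trace.empty());
    return g_trace;
}
```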
# compile all your G++ code
g++ -fno-gnu-linker -c (*.C)
# prelink all the G++ .o files, MUST USE -r option
g++ -fno-gnu-linker -r -o allC++.o (*.o)
# now find the static objects
nm allC++.o | findconstructors > Constructors.c
# compile it as normal C, don't use G++!!
gcc -c Constructors.c
# now link it into the rest of the C++ code
gcc -r -o totalC++.o allC++.o Constructors.o
# now the C++ is done, link it in with the rest of your
# MPL code
mpld -o program (MPL code .o) totalC++.o
Some things that people have mentioned that they would like to see in the Ultimate Parser library, but for which there have not been any offers:
This document was generated on 28 October 1996 using the texi2html translator version 1.51.