User's Guide

Ultimate Parser Library

last updated September 20, 1996

for version 1.0


Copyright (C) 1993 1994 The Regents of the University of California

Note: The Ultimate Parser C++ library is still in test release. You will be performing a valuable service if you report any bugs you encounter.

Contributors to the Ultimate Parser C++ library

In addition to the codes mentioned below, the following people all participated in the numerous discussions and design sessions leading to the Ultimate Parser library. The areas to which they primarily contributed are listed below.

Ultimate Parser library aims, objectives, and limitations

The Ultimate Parser C++ library is designed to support the development of new machine learning techniques for biosequence analysis. Its guiding principles include:

C++ library stylistic conventions

The following conventions are adapted from the GNU C++ library manual.

These classes are used for representing bases, alphabets, and sequences.

The Base class

The Base class is the primary data representation of biosequence characters, whether they be nucleotides, amino acids, or an alternative. The motivating factor behind the Base class is to enable largely alphabet-independent data manipulation by relying on a standard data format across all alphabets. This design enables the efficient implementation of multi-alphabet routines while also providing an interface that supports alphabet-specific operations. See section The Alphabet Classes. Across all alphabets, the null_char() is an input and output null character, and the bad_char() is an output illegal character for that alphabet.

Constructors

The Base class has no constructors. This is to speed pass by value and to enable the placement of bases in registers. Bases are initialized and assigned as in the following examples:

Base base;
Declare base with an unspecified value. Note that the unspecified value could be a legal or illegal Base for a given alphabet, and should be set to the null base explicitly if desired.
Base base = Base::null();
Declaration of a null base that does not match any character.
Base x; x.set_int (n);
Set x to be base number n. Base numbers have little meaning without corresponding alphabet definitions (see section The Alphabet Classes). Most often, n will be the integer return value of an alphabet-specific function.
Base x = base_int (n);
As above, base x is set to the integer n.
Base x, y; x = y;
Simple Base assignment.

Function: ostream& operator<< (ostream & out, const Base base)
Print base on out using the current default alphabet Alph::ret_default(). See section Alph Class.

Function: istream& operator>> (istream & in, Base & base);
Read base from in using the current default alphabet Alph::ret_default(). See section Alph Class.

Variants and Wildcards

Bases, such as nucleotides, often have chemical variants that are frequently ignored in the development of analysis software. The Ultimate Parser enables the consideration of variant bases by including variant structures in the Base class. Many functions of the Base and Alphabet classes have a parameter from the global enum VarEnum {NO_VARS, VARS}, whose options specify, respectively, that character variants should be ignored (all treated as the primary character) or not ignored (treated as different characters). Routines should default, in the absence of a VarEnum parameter, to NO_VARS, using the canon() function, below, to ignore variant characters.

Bases may also be indeterminate. In the amino acid case, for example, biologists use the letter `B' to represent either `N' or `D', and the letter `X' to represent any of the 20 amino acids. Translation of wildcards is impossible without alphabet information; identification of wildcards, however, is possible.

The null character can be regarded as a special, non-matching wildcard. Checking for the null character, performed by default, can be turned off for routines that take a NullEnum argument of NULL_NULL_FALSE or NULL_NULL_TRUE. It is generally advised that checking for null bases be enabled; however, for reasons of efficiency this option may be turned off from time to time.

Matching a wildcard against another wildcard has two flavors according to which function is called. In wc_match functions both match tables are searched for the other character, and in the wc_subset functions, only the first base's match table is checked, allowing, for example, the assertion that any character is a subset of the complete wild card, while the complete wild card is only a subset of itself and other complete wildcards. The null base is not a subset of any other base, including the null base. All other bases are subsets of themselves.

The following informational and conversion functions involving variant and wildcard bases are available for members of the Base class.

Method on Base: int is_normal (void)
Returns 1 if a normal (non-variant, non-wildcard, non-null) base, 0 otherwise.

Method on Base: int is_wild (void)
Returns 1 if a wildcard, 0 otherwise.

Method on Base: int is_variant (void)
Returns 1 if a variant base, 0 otherwise.

Method on Base: int is_null (void)
Returns 1 if a null base, 0 otherwise.

Method on Base: Base null (void)
Return the null Base.

Method on Base: char null_char (void)
Return the ASCII character corresponding to the null Base.

Method on Base: int is_null_char (int char)
Return 1 if char is the null character, 0 otherwise.

Method on Base: char bad_char (void)
Return the ASCII character corresponding to an invalid Base.

Method on Base: Base canon (void)
Return a Base that is the canonical form of *this. If x is not a variant character, x.canon() == x.

Implementation

In the current implementation, each Base is represented by an 8-bit character. The most significant 2 bits are used to represent up to 3 variants of the primary character. The canon() function simply masks out these bits. The null base is always represented by the integer 63 (i.e., the lower 6 bits all set), and the numbers between 21 and 62, inclusive, are definable wildcards that require an Alphabet (see section The Alphabet Classes) for translation.
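The bit layout described above can be illustrated with a small stand-alone sketch. It is not the library's code: the names ToyBase, kCanonMask, kNullValue, toy_canon, toy_variant, and toy_is_null are hypothetical, and the only facts taken from the text are the 8-bit representation, the two high-order variant bits, and the null value 63.

```cpp
#include <cstdint>

// Hypothetical stand-in for the 8-bit Base representation described above.
using ToyBase = std::uint8_t;

constexpr ToyBase kCanonMask = 0x3F;  // low 6 bits: the primary character
constexpr ToyBase kNullValue = 63;    // low 6 bits all set: the null base

// Mask out the two high-order variant bits, as canon() is described to do.
constexpr ToyBase toy_canon(ToyBase b) {
  return static_cast<ToyBase>(b & kCanonMask);
}

// Extract the variant number held in the two high-order bits (0 = primary).
constexpr int toy_variant(ToyBase b) { return b >> 6; }

// Assumption: a base counts as null when its canonical part equals 63.
constexpr bool toy_is_null(ToyBase b) { return toy_canon(b) == kNullValue; }
```

Under these assumptions the raw value 0x42 is variant 1 of primary character 2, and toy_canon(0x42) yields 2.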

Method: Base limits (void)
Describe the Base class' underlying representation.

Indexing

Method on Base: int raw_int (void)
Bases are often used to index arrays and other data structures. For this reason, conversion to an integer is provided. Note that this is an uncompressed conversion, returning a number between 0 and Alphabet::max_num_var_base(), which is expected to remain at least as high as 256. Compressed conversion, which provides indices dependent on the alphabet length, requires knowledge of the alphabet. See section The Alphabet Classes.

Method on Base: void set_int (const int i)
Set a base to a specific integer value, the inverse of raw_int.

Friend to Base: Base base_int (const int i)
Return a base set to a specific integer value, the inverse of raw_int.

The Alphabet class (see section The Alphabet Classes) has access to private Base members for direct integer cast, construction, and assignment.

Matching

Bases can be matched to check for equality. The equality operator, ==, is not provided (it is implemented as a private class member to provide a compilation error message). All equality operations on bases must explicitly specify whether or not wildcard matching is desired. The routines are:

Method on Base: int no_wc_match (const Base base2, const NullEnum nullopt = NULL_NULL_FALSE, const VarEnum varopt = NO_VARS)
Method on Base: static int no_wc_match (const Base base1, const Base base2, const NullEnum nullopt = NULL_NULL_FALSE, const VarEnum varopt = NO_VARS)
Ignoring wildcards, return 1 if base2 matches *this (or base1, in the static form), 0 otherwise. Parameter nullopt, either NULL_NULL_FALSE or NULL_NULL_TRUE, determines the result of matching two null characters. Parameter varopt controls use of variants: set it to VARS to treat character variants as unique characters. Wildcards will match themselves but no other characters. See section Variants and Wildcards.
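The interaction of nullopt and varopt can be sketched on raw integer codes. This is a toy model rather than the library's implementation: the enum names come from the text, but the toy namespace, the raw uint8_t representation, and the rule that a null base never matches a non-null base are assumptions made for illustration.

```cpp
#include <cstdint>

namespace toy {

enum VarEnum { NO_VARS, VARS };                     // names taken from the text
enum NullEnum { NULL_NULL_FALSE, NULL_NULL_TRUE };  // names taken from the text

using Base = std::uint8_t;         // hypothetical raw representation
constexpr Base kCanonMask = 0x3F;  // low 6 bits hold the primary character
constexpr Base kNull = 63;         // the null base

inline Base canon(Base b) { return static_cast<Base>(b & kCanonMask); }

// Match two bases without consulting any wildcard table.
inline int no_wc_match(Base a, Base b,
                       NullEnum nullopt = NULL_NULL_FALSE,
                       VarEnum varopt = NO_VARS) {
  const bool a_null = canon(a) == kNull;
  const bool b_null = canon(b) == kNull;
  if (a_null && b_null) return nullopt == NULL_NULL_TRUE ? 1 : 0;
  if (a_null || b_null) return 0;             // assumption: null vs non-null fails
  if (varopt == VARS) return a == b ? 1 : 0;  // variants are distinct characters
  return canon(a) == canon(b) ? 1 : 0;        // variants collapse to the primary
}

}  // namespace toy
```

With the defaults, a character and one of its variants match; passing VARS makes them differ, and passing NULL_NULL_TRUE makes two null bases match.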

Method on Base: int wc_match (const Base base, const Alphabet *alphabet, const VarEnum varopt = NO_VARS)
Using the wildcard definitions associated with alphabet, return 1 if base matches *this, 0 otherwise. Null characters, a type of wildcard, are always checked. Two wild cards will match if either is included in the other's match table. Parameter varopt indicates whether or not variants should be used -- set to VARS to treat character variants as unique characters. See section Variants and Wildcards.

Method on Base: int wc_subset (const Base base, const Alphabet *alphabet, const VarEnum varopt = NO_VARS)
Using the wildcard definitions for *this associated with alphabet, return 1 if base is a member of *this's wildcard table, 0 otherwise. That is, return whether or not *this matches either the same or more characters than base. Null characters have no subsets and are not subsets of any other character. Parameter varopt indicates whether or not variants should be used: set it to VARS to treat character variants as unique characters. See section Variants and Wildcards.
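The difference between the symmetric wc_match and the one-directional wc_subset can be sketched with plain characters and a small match table. This is an illustrative model only: the table entries are copied from the ExtDNAAlphabet character list in section Nucleic Acid Alphabets, while the toy namespace, the use of char in place of Base, the '\0' null stand-in, and the helper in_table are assumptions.

```cpp
#include <map>
#include <string>

namespace toy {

constexpr char kNull = '\0';  // hypothetical stand-in for the null base

// Match tables for a few wildcards; N is written out as an explicit list.
inline const std::map<char, std::string>& tables() {
  static const std::map<char, std::string> t = {
      {'R', "AG"}, {'Y', "CT"}, {'K', "GT"}, {'M', "AC"},
      {'V', "AGCRMS"}, {'N', "AGCTRYKMSWVBDH"}};
  return t;
}

// True if `other` appears in the match table of `self`.
inline bool in_table(char self, char other) {
  const auto it = tables().find(self);
  return it != tables().end() && it->second.find(other) != std::string::npos;
}

// One-directional: is `other` a (possibly improper) subset of `self`?
inline int wc_subset(char self, char other) {
  if (self == kNull || other == kNull) return 0;  // null is never a subset
  return (self == other || in_table(self, other)) ? 1 : 0;
}

// Symmetric: either match table may contain the other character.
inline int wc_match(char a, char b) {
  return (wc_subset(a, b) || wc_subset(b, a)) ? 1 : 0;
}

}  // namespace toy
```

In this model A is a subset of N but N is not a subset of A, while wc_match succeeds in either order. R and Y do not match because neither appears in the other's table.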

The Alphabet Classes

The alphabet classes contain information on the interpretation of a Base (see section The Base class). Each member of the Alphabet class hierarchy is expected to have at most one instantiation (this is checked at runtime), a static member of the Alph class (see section Alph Class). The alphabet class is implemented with its descendants in mind, so that most functions are not virtual.

Generic Alphabet Class

All alphabets support a variety of functions.

General Alphabet Functions

Method on Alphabet: String& name (void)
Return the name of an alphabet.

Method on Alphabet: Alphabet* id (void)
Return the address of an alphabet object as its identifier.

The following three information functions are virtual to allow extensibility of the alphabet class. They are the only virtual functions in the alphabet class.

Method on Alphabet: int is_nucleic (void)
Return 1 if the alphabet is a nucleotide alphabet (a descendant of NucleicAlphabet; see section Specific Alphabets), 0 otherwise.

Method on Alphabet: int is_rnucleic (void)
Return 1 if the alphabet is a ribonucleic alphabet (a descendant of RNAAlphabet; see section Specific Alphabets), 0 otherwise.

Method on Alphabet: int is_amino (void)
Return 1 if the alphabet is an amino acid alphabet (a descendant of AminoAlphabet; see section Specific Alphabets), 0 otherwise.

Base Manipulation

Method on Alphabet: char to_char (const Base base, const VarEnum varopt = NO_VARS)
Use the alphabet information to convert base to an ASCII character, possibly using variants (see section Variants and Wildcards).

Method on Alphabet: Base to_base (const char ch)
Use the alphabet information to convert an ASCII character ch to a Base (see section The Base class). If ch is not a valid character for the alphabet, a null base is returned.

Method on Alphabet: Base valid_or_null (const Base base)
Check if base is valid according to the alphabet information. If it is valid, return base, otherwise return a null base.

Method on Alphabet: Base null (void)
Return a null base.

Method on Alphabet: int is_valid (const Base b)
Return a 1 if b is a valid base within the alphabet. Note that characters can be valid in several alphabets with a different meaning in each.

Method on Alphabet: int wc_match (const Base base1, const Base base2, const VarEnum varopt = NO_VARS)
Performs a symmetric wildcard match on two bases that does not take into account character variants. In the two-wildcard case both match tables are checked. To match variant forms, set varopt to VARS.

Method on Alphabet: int wc_subset (const Base base1, const Base base2, const VarEnum varopt = NO_VARS)
Return 1 if base2 is a (possibly improper) subset of base1, that is, if base1 matches the same or more characters than base2. Null bases have no subsets and are not included in any subset. To match variant forms, set varopt to VARS.

Method on Alphabet: static int no_wc_match (const Base base1, const Base base2, const NullEnum nullopt = NULL_NULL_FALSE, const VarEnum varopt = NO_VARS)
Ignoring wildcards, return 1 if base1 matches base2, 0 otherwise. Parameter nullopt, either NULL_NULL_FALSE or NULL_NULL_TRUE, determines the result when both bases are null characters, while varopt controls use of variants: set it to VARS to treat character variants as unique characters. Wildcards will match themselves but no other characters. This is simply another way of accessing the Base::no_wc_match() function. See section Matching.

Method on Alphabet: int index (const Base base)
Converts base to an index, possibly more compact than the Base to int cast. Indexes are always based on canonical form and range between 0 and max_index(), defined below. If data is known to contain no wildcards (see section Variants and Wildcards), the programmer may wish to simply perform an integer cast on base rather than calling this function. If it is known that no null characters are included in the data, the index will range between 0 and norm_length()+wc_length(). This function is most useful for nucleotide alphabets, as integer casts of base for an amino acid alphabet are already reasonably compact.

Method on Alphabet: Base unindex (const int index)
Reverses the process of the index function, above.
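One way such a compact mapping and its inverse could work is sketched below. The manual does not specify the actual mapping, so everything here is an assumption: the struct ToyIndexer, the contiguous wildcard block starting at first_wc, the placement of the null base in the last slot, and the use of -1 for unused codes.

```cpp
// Hypothetical compact indexing over a 6-bit canonical code space.
constexpr int kToyNull = 63;  // raw code of the null base, as given in the text

struct ToyIndexer {
  int norm_length;  // number of normal characters, raw codes 0..norm_length-1
  int first_wc;     // raw code of the first wildcard (assumed contiguous block)
  int wc_length;    // number of wildcards

  // Map a raw code to a dense index; -1 for codes that are not in use.
  int index(int raw) const {
    raw &= 0x3F;  // indexes are based on canonical form
    if (raw < norm_length) return raw;
    if (raw >= first_wc && raw < first_wc + wc_length)
      return norm_length + (raw - first_wc);
    if (raw == kToyNull) return norm_length + wc_length;  // null takes last slot
    return -1;
  }

  // Inverse of index(); -1 for out-of-range indexes.
  int unindex(int idx) const {
    if (idx < 0 || idx > norm_length + wc_length) return -1;
    if (idx < norm_length) return idx;
    if (idx < norm_length + wc_length) return first_wc + (idx - norm_length);
    return kToyNull;
  }
};
```

For a toy alphabet with 4 normal characters and 3 wildcards starting at raw code 20, the codes 0..3, 20..22, and 63 compress to the dense range 0..7, which is convenient for sizing count arrays.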

Method on Alphabet: const Base* matches (const Base base, const VarEnum v = NO_VARS)
Returns a Base::null()-terminated list of all non-variant, non-wildcard bases that match base (0 if base is not valid, 1 if base is a non-wildcard, more if base is a wildcard.) It is slightly more efficient to check base.is_wild() explicitly rather than relying on the return of a singleton set.

Method on Alphabet: int num_matches (const Base base)
Return the number of normal characters that match base. This will be 0 if base is not valid, 1 if base is a normal character, and some other number if base is a wildcard.
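A caller typically walks a sentinel-terminated list such as the one matches() is described as returning. The sketch below shows that traversal pattern on plain integers; kToyNull and toy_count_matches are hypothetical names, and the int array merely stands in for a Base::null()-terminated array of Base.

```cpp
#include <cstddef>

// Hypothetical sentinel standing in for Base::null().
constexpr int kToyNull = 63;

// Count the entries of a kToyNull-terminated array by scanning to the sentinel.
inline std::size_t toy_count_matches(const int* list) {
  std::size_t n = 0;
  while (list[n] != kToyNull) ++n;
  return n;
}
```

An array holding only the sentinel corresponds to an invalid base (no matches), one entry before the sentinel to a normal character, and several entries to a wildcard.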

Method on Alphabet: const String& abbrev (const Base base)
Return a possibly abbreviated name of the given base, such as `Ala' for the amino acid Alanine.

Method on Alphabet: const String& full_name (const Base base)
Return the full textual name of base.

Alphabet Lengths

Method on Alphabet: int norm_length (void)
The number of normal (not wildcard, variant, or null) characters in an alphabet. In the base representation, the integers 0...Alphabet::norm_length() are returned from the integer type conversion of a normal character (see section Indexing).

Method on Alphabet: int wc_length (void)
The number of wildcards in the alphabet, excluding the null character.

Method on Alphabet: int norm_wc_length (void)
The total number of normal and wildcard characters.

Method on Alphabet: int max_num_base (void)
Maximum number of characters possible in the alphabet, not including variants.

Method on Alphabet: int max_num_var_base (void)
Maximum number of characters in an alphabet including variants. This is most likely a sparse or uncompressed representation; many of the numbers from 0 to this value are not used.

Method on Alphabet: int first_char (void)
Index of first normal character. Returns 0.

Method on Alphabet: int last_char (void)
Index of last normal character.

Method on Alphabet: int first_wc (void)
Index of first wildcard, can be unequal to the sum of first_char() and norm_length().

Method on Alphabet: int last_wc (void)
Index of last wildcard.

Method on Alphabet: int first_var (void)
Index of first variant character.

Method on Alphabet: int last_var (void)
Index of last variant character.

Implementation

The ability to easily create efficient alphabet-independent procedures has been the guiding feature of the Alphabet (see section The Alphabet Classes) and Base (see section The Base class) implementations. Not only must a uniform interface to the biosequence (or alternate domain) alphabets be provided, but the system must allow alphabets of different types to coexist within one program. Thus, compile-time switches on alphabets were quickly ruled out. For efficiency, many operations, such as comparing without wildcards and assembling counts of base occurrences, can be completed without reference (or without inner-loop reference) to an alphabet. The structure of the Base class also ensures that, for functions that require alphabet information, efficiency is preserved for the common case. Thus, for example, the index of a normal character or comparison of two normal characters is performed without referencing the alphabet.

The current implementation is geared to nucleotides and amino acids. The 64-element codon alphabet, for example, would not fit well in the current underlying implementation because of the Base class's current upper limit of 20 normal characters. Codons could, of course, be represented as variants on the amino acids, though this would require a radically different index function for the codons to compress the range to 64 elements. Thus, in future revisions, index may have to become a virtual function.

Alphabet Creation

The alphabet class has several protected member functions to aid the creation of new alphabets. These functions are not needed for general programming.

Method: Alphabet Alphabet (const String& name, const String& chars = "", const int case_sensitive = 0)
The Alphabet constructor. It requires a name and a (possibly empty) list of the normal (non-wildcard) chars in the alphabet. Character case is ignored unless case_sensitive is non-zero. This constructor is typically used without any chars, as it does not allow the naming of characters.

Method: Alphabet virtual ~Alphabet (void)
Destructor.

Method: Alphabet void add_normal_char (const char c, const String& s_name = "", const String& l_name = "")
Add a normal (non-wildcard) character c to the Alphabet. All normal characters must be added before any wildcards. (There is no inherent reason for this restriction: it helps ensure that everything a wildcard references is already in place.) The short (s_name) and long (l_name) annotation strings may be used to describe the new character.

Method: Alphabet void add_alias (const String& newchar, const String& alias)
Add character-to-base translation for newchar that is identical to that of the existing character alias. When added, to_char(to_base(newchar)) will be equal to alias.

Method: Alphabet void add_wild_card (char wildcard, const String& matches, const String& s_name = "", const String& l_name = "")
Add a single wildcard that matches the characters provided. The wildcard will only match wildcards specified in matches. The short (s_name) and long (l_name) annotation strings may be used to describe the new wildcard.

Method: Alphabet void add_all_match (char wildcard, const String& s_name = "", const String& l_name = "")
Add a single wildcard that matches all current and future characters (normal and wildcard) in the Alphabet, except the null character. The short (s_name) and long (l_name) annotation strings may be used to describe the new wildcard.

Method: Alphabet void reset_name (const String& name);
Change the name of an Alphabet. This is a useful means of avoiding name propagation in constructors.
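The relationship between normal characters and aliases can be shown with a miniature alphabet. This is a sketch under stated assumptions, not the library's Alphabet class: ToyAlphabet, kToyNullCode, the '?' placeholder, and the single-character alias arguments are all hypothetical (the documented add_alias takes String arguments), and wildcards are omitted.

```cpp
#include <cctype>
#include <map>
#include <string>

constexpr int kToyNullCode = 63;  // hypothetical code for the null base

// Hypothetical miniature of the alphabet-building calls described above.
class ToyAlphabet {
 public:
  // Register a normal character; its code is its order of insertion.
  void add_normal_char(char c) {
    to_code_[fold(c)] = static_cast<int>(chars_.size());
    chars_.push_back(fold(c));
  }

  // Make `newchar` translate to the same code as the existing `alias`.
  // Returns false if `alias` is not yet in the alphabet.
  bool add_alias(char newchar, char alias) {
    const auto it = to_code_.find(fold(alias));
    if (it == to_code_.end()) return false;
    to_code_[fold(newchar)] = it->second;
    return true;
  }

  // Character to code; the null code when the character is unknown.
  int to_base(char c) const {
    const auto it = to_code_.find(fold(c));
    return it == to_code_.end() ? kToyNullCode : it->second;
  }

  // Code to character; '?' is an arbitrary placeholder for invalid codes.
  char to_char(int code) const {
    if (code < 0 || code >= static_cast<int>(chars_.size())) return '?';
    return chars_[static_cast<std::string::size_type>(code)];
  }

 private:
  // Case is ignored, mirroring the default case_sensitive = 0.
  static char fold(char c) {
    return static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
  }
  std::map<char, int> to_code_;
  std::string chars_;
};
```

After adding A, G, C, and T and then aliasing U to T, converting 'U' to a code and back yields 'T', which mirrors the stated property that to_char(to_base(newchar)) equals alias.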

Specific Alphabets

Several descendants of the alphabet class are implemented as part of the library. Currently, these include the basic nucleotide and amino acid alphabets. Users are encouraged to call Alph::describe() (see section Member Functions) for an up-to-date description of all available alphabets.

Nucleic Acid Alphabets

The nucleic acid alphabets are all descendants of the minimal (most general) NucleicAlphabet. In addition to the features of Alphabet, this class includes an enumerated type defining the symbols A = 0, G = 1, C = 2, TU = 3, and several functions. The functions are currently not virtual, though as alphabets are refined, they may become virtual.

Method on NucleicAlphabet: Base complement (const Base base)
Return the Watson-Crick complement of a base. All-matching wildcards and the null base return themselves, and other wildcards are alphabet specific (see describe()).

Method on NucleicAlphabet: int same_group (const Base base1, const Base base2)
Method on NucleicAlphabet: int is_complement (const Base base1, const Base base2)
Return 1 or 0 depending on whether or not the two bases are in the same group (pyrimidine or purine) or are Watson-Crick complements of each other, respectively. See section Specific Alphabets. same_group will return 0 if either or both bases are the null base. is_complement checks base1 against the complement of base2, and thus will return 1 if both bases are null and 0 if exactly one base is null.

Method on NucleicAlphabet: int is_pyrimidine (const Base base)
Method on NucleicAlphabet: int is_purine (const Base base)
Return 1 or 0 depending on whether or not base is a pyrimidine or a purine, respectively. See section Specific Alphabets.

The RNAAlphabet class inherits from NucleicAlphabet, and additionally defines the symbolic constant U=TU, and asserts the virtual function is_rnucleic().

The DNAAlphabet class inherits from NucleicAlphabet, and additionally defines the symbolic constant T=TU.

The ExtDNAAlphabet class inherits from DNAAlphabet, and introduces a large number of wildcards defined as symbolic constants. The virtual functions above have been defined on these wildcards. The complement of a wildcard includes the complements of every base that wildcard matches. A wildcard is_purine or is_pyrimidine only if it matches exactly the two characters of that group (i.e., A, G, and R are purines, while C, T, and Y are pyrimidines). Two bases are in the same_group if they are both purines or they are both pyrimidines. For reference, the characters are: K=GT, W=AT, Y=CT, M=AC, R=AG, S=GC, V=AGCRMS, B=GCTSKY, D=AGTRWK, H=ACTMWY, and the wildcards N and X match all characters.
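The rule that a wildcard's complement collects the complements of every base it matches can be sketched with 4-bit sets. This is an illustration, not the library's implementation: the bit positions follow the enumeration A = 0, G = 1, C = 2, TU = 3 given for NucleicAlphabet, while the toy namespace and the functions mask and complement are hypothetical.

```cpp
#include <string>

namespace toy {

// Bit positions follow the NucleicAlphabet enumeration quoted in the text.
enum Nuc { A = 0, G = 1, C = 2, TU = 3 };

// Build a 4-bit set from the normal characters a symbol matches, e.g. "AG".
inline unsigned mask(const std::string& matches) {
  unsigned m = 0;
  for (char ch : matches) {
    switch (ch) {
      case 'A': m |= 1u << A; break;
      case 'G': m |= 1u << G; break;
      case 'C': m |= 1u << C; break;
      case 'T': m |= 1u << TU; break;
      default: break;  // ignore anything outside A, G, C, T
    }
  }
  return m;
}

// Complement a set by complementing each member: A<->T and G<->C.
inline unsigned complement(unsigned m) {
  unsigned out = 0;
  if (m & (1u << A)) out |= 1u << TU;
  if (m & (1u << TU)) out |= 1u << A;
  if (m & (1u << G)) out |= 1u << C;
  if (m & (1u << C)) out |= 1u << G;
  return out;
}

}  // namespace toy
```

Applied to the normal characters listed above, the complement of R (AG) is Y (CT), K (GT) pairs with M (AC), V pairs with B, D pairs with H, and W, S, and the all-matching wildcards are their own complements.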

Amino Acid Alphabets

The AminoAlphabet class inherits from Alphabet, asserts the is_amino() virtual function, and defines the standard single-letter symbolic constants of the 20 amino acids, referenced as, for example, AminoAlphabet::W.

The ExtAminoAlphabet class inherits from AminoAlphabet, and adds treatment of three wildcards. They are included as symbolic constants B=20, Z=21, X=22, where B matches N and D, Z matches Q and E, and X matches any amino acid.

Alph Class

The Alph class is a wrapper for alphabets. It contains as static members each of the instantiated alphabets. These are:

Static Class Variable: Alph Nucleic
NucleicAlphabet Alph::Nucleic is the most general nucleic alphabet.

Static Class Variable: Alph RNA
RNAAlphabet Alph::RNA.

Static Class Variable: Alph DNA
DNAAlphabet Alph::DNA.

Static Class Variable: Alph ExtDNA
ExtDNAAlphabet Alph::ExtDNA.

Static Class Variable: Alph AA
AminoAlphabet Alph::AA. The amino acids without wildcards.

Static Class Variable: Alph ExtAA
ExtAminoAlphabet Alph::ExtAA. The amino acids with the standard wildcards B, Z, and X.

The alphabet member function Alphabet::id() can be used to get an identifier for each of these alphabets.

Member Functions

Method: Alph describe (ostream& output)
Print a description of the available alphabets on output.

Method on Alph: void silent_convert (int val = 1)
Called with no arguments, or with val unequal to zero, this will suppress error messages to cerr whenever Alphabet::to_base is passed an inconvertible character for which the null character is returned. If val is zero, error messages are produced. The default is to produce the conversion error messages.

Method on Alph: void set_default (const Alphabet& default)
Set the default alphabet used to create Sequence class members without an Alphabet type. See section Sequence Constructors.

Method on Alph: const Alphabet * ret_default (void)
Return the default alphabet used to create Sequence class members without an Alphabet type. Note that use of the default is syntactically different from other alphabets: Alph::ret_default()->id() rather than Alph::RNA.id(). Possibly, the standard alphabet names should be changed to function calls and pointers, but the user should not be using these much anyway, accessing them instead from sequence's alphabet functions. See section Sequence Constructors.

Method on Alph: const Alphabet* name_to_alphabet (const char *name)
Return a pointer to the Alphabet corresponding to the character string name, or NULL.

Method on Alph: int num_alphabets (void)
Return the number of alphabets available.

Method on Alph: const Alphabet* alphabet (int num)
Return a pointer to the numth alphabet or NULL if num is out of range.

Sequence

Sequences are dynamically sized arrays of Base, with reference-counting semantics similar to GNU Strings, and special I/O routines which interact with common genetic database formats. See section The Base class, section `The String Class' in Libg++ User's Guide, and section ASN Sequence Streams.

Sequence Constructors

Method: Sequence Sequence ()
Constructs a Sequence variable with no storage allocated.

Method: Sequence Sequence (int sz)
Constructs a Sequence variable with an allocation size sz.

Method: Sequence Sequence (int sz, const String &nm=nilSTR);
Constructs a Sequence variable with an allocation size sz, and with ID string nm.

Method: Sequence Sequence (const String &data, const String &nm=nilSTR);
Constructs a Sequence variable initialized with String data, and with ID string nm.

Method: Sequence Sequence (const char *data, const String &nm=nilSTR);
Constructs a Sequence variable initialized with char * data, and with ID string nm.

Subsequences

The following constructors return a subsequence which points to part of the base sequence (see section Reference Counting).

Method: Sequence Sequence (const Sequence&, int offset, int size)
Construct a subsequence which is offset into the referenced sequence by offset, and contains size Bases.

Method on Sequence: Sequence SubSequence (const Sequence&, int offset, int size)
Pseudo-constructor. Constructs and returns a subsequence using the above constructor. The name "SubSequence" may make for more readable code in some contexts. For example:

foo (SubSequence (s1, 408, 12));

Sequence Initialization

Storage in newly allocated Sequence variables is uninitialized. It can be initialized with the scalar assignment operators:

Method: Sequence void operator= (Base base)
Sets all Bases in the Sequence to base.

Method: Sequence void operator= (char basechar)
Sets all Bases in the Sequence to the Base corresponding to basechar. The conversion alphabet must be set. See section Setting The Sequence Conversion Alphabet.

The SeqRep Class

SeqRep maintains the statically allocated information for Sequence representations: the alphabet, SeqLabel, and Sequence pointer, along with the reference counts. The instantiation, assignment and destruction of SeqReps is provided through Sequence constructors, assignment functions, and destructors.

Reference Counting

Sequence is a reference counted class. Operations which assign one sequence variable to another do not normally do any copying, but instead cause the array part of one sequence to point to the other and increment a reference count.

The copy constructor is invoked whenever one sequence is initialized with another, either in declarations such as

Sequence s = base_seq;

or in pass-by-value:

void foo (Sequence s)
{
...
}

Method: Sequence Sequence (Sequence& seq)
The copy constructor for the sequence classes does not copy the sequence seq, but references it.

If copying is desired instead, this can be done with an explicit call to the copy function:

Method on Sequence: Sequence copy (const Sequence&)
Return a copy of the argument.

Sample usage:

Sequence s = copy (base_seq);
foo (copy (base_seq));

Similarly, the assignment operator causes referencing of its argument:

Method on Sequence: Sequence& operator = (Sequence& sq)

If copying is desired instead, the copy method can be called:

Method: Sequence void copy (Sequence& seq)
Copies seq. Storage is automatically resized.

The copy method is slightly different from the copy function, in that the copy function always allocates a new sequence and copies into it, while the copy method, when called on an object of the same size, will copy into already allocated storage. See section Automatic Resizing of Sequences.
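The sharing behaviour described in this section can be modelled with standard-library reference counting. The sketch below is not the Sequence implementation: ToySeq and toy_copy are hypothetical, characters stand in for Bases, std::shared_ptr supplies the reference count, and the subsequence constructor performs no range validation.

```cpp
#include <cstddef>
#include <memory>
#include <string>

// Hypothetical model of shared sequence storage.
class ToySeq {
 public:
  explicit ToySeq(const std::string& data)
      : store_(std::make_shared<std::string>(data)),
        offset_(0), size_(data.size()) {}

  // Subsequence: shares the parent's storage, with an offset and a size.
  ToySeq(const ToySeq& parent, std::size_t offset, std::size_t size)
      : store_(parent.store_), offset_(parent.offset_ + offset), size_(size) {}

  // The implicit copy constructor and assignment copy only the pointer,
  // so the new variable references the same characters.

  char& operator[](std::size_t i) { return (*store_)[offset_ + i]; }
  std::size_t size() const { return size_; }
  long references() const { return store_.use_count(); }

  // Deep copy: allocate fresh storage holding just this (sub)sequence.
  friend ToySeq toy_copy(const ToySeq& s) {
    return ToySeq(s.store_->substr(s.offset_, s.size_));
  }

 private:
  std::shared_ptr<std::string> store_;
  std::size_t offset_;
  std::size_t size_;
};
```

In this model, writing through a copy-constructed variable or a subsequence changes the original, whereas writing through the result of toy_copy does not. Note that the model also lets storage be modified through a variable initialized from a const ToySeq, which is the situation the restrictions in the next section are designed to prevent.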

Const and Reference Counting

Because reference counting allows any of the sequence variables which refer to a piece of storage to change that storage, the copy constructor does not allow initialization of another Sequence variable with a const Sequence variable.

The desired effect can be obtained with an explicit call to the copy (const Sequence&) function, i.e.:

const Sequence A;

...

Sequence sq = copy (A);

Similarly, the assignment operator (operator =) only allows assignment from non-const Sequences, but one can assign from a copy of a const Sequence.

Design Note:

It would be desirable to allow const Sequence variables to be initialized with other const Sequence variables, but the language does not allow the specification of constructors which differentiate between const and non-const variables.

Since having const variables that can be changed by other parts of the code is an undesirable feature, it was judged better to simply not allow initializations of other Sequence variables with const Sequences.

Sequence Subscripting

Individual bases in a Sequence can be accessed using the usual subscripting operator:

Method on Sequence: Base & operator [] (int index)
Returns a reference to the element at index in the Sequence (relative to the offset in subsequences). Array bounds are checked if enabled by a compile flag.

For const Sequences, the subscripting operator is read-only:

Method on Sequence: Base operator [] (int index) const
As with non-const Sequences, but returns a copy of the index element in the Sequence rather than a reference.

Sequence Sizes and Offsets

The size of a Sequence variable can be obtained with the size () method:

Method on Sequence: int size () const
Returns the number of Bases in the Sequence. (For subsequences, will only reflect subsequence size)

For subsequences, it is sometimes useful to know the offset into the base Sequence. This can be obtained with the offset () method:

Method on Sequence: int offset () const
Returns the offset of the Sequence from the base Sequence. (Will be zero unless one of the subsequence constructors was called by a parent.)

Sequence Comparisons

Method on Sequence: int & operator== (const Sequence & c) const
Checks if the data pointers and lengths of two Sequences are identical.

Method on Sequence: int & operator!= (const Sequence & c) const
Checks if data pointers of two Sequences differ.

Setting The Sequence Conversion Alphabet

Each sequence has its own alphabet pointer (see section The Alphabet Classes). The Alph::set_default() function should be called before any I/O operators or String conversions are used.

The current alphabet can be obtained with the alphabet function:

Method on Sequence: const Alphabet *const alphabet ()
Return a const pointer to the current conversion alphabet.

String Conversions

Method on Sequence: Sequence& operator= (const String &string)
Converts the characters in string to bases and puts them in the sequence. Automatically resizes the allocated space in the sequence.

Method: operator String ()
Convert a sequence to a String. Not implemented.

Method on Sequence: Sequence& operator= (const char *string)
Converts the characters in string to bases and puts them in the sequence.

Design Note

It would be desirable to also allow initialization of a Sequence from a String by defining a Sequence (String &) constructor, as in:

Sequence sq = "ACGT";

However, because the order of construction in different compilation units is undefined, there is no way to ensure that the alphabet is set before the constructor is called.

Sequence I/O

Method on Sequence: void scanFrom (istream &is)
Scans a sequence from the input stream, according to the current format (see section Sequence I/O Formats). Automatically resizes the allocated space in the sequence.

Function: istream& operator>> (istream &, Sequence &)
Call scanFrom.

Method on Sequence: void printOn (ostream &o)
Prints a sequence on the output stream as a sequence of characters, according to the current format. section Sequence I/O Formats.

Function: ostream& operator<< (ostream &, const Sequence &)
Calls printOn.

Sequence I/O Formats

Currently only one format is implemented:

RAW_ASCII
No label is input or output, the sequence is terminated by end-of-line, and comments may be preceded by ; or #.

Automatic Resizing of Sequences

The copy () function and the input functions operator >> () and scanFrom () all automatically resize the Sequence variable they are storing into.

Since any extra references to the resized Sequence variable will still refer to the old storage, an error message is generated if the reference count is greater than 1.

Design Note

Currently the resizing algorithm resizes the Sequence to be exactly equal to the new size, by allocating new storage and copying into it. This will result in some overhead if many sequences with similar but not equal sizes are read or copied into the same variable.

Speeding up Sequence Access in Library Code

The following functions are intended for fully debugged library code. They allow indexing into a sequence without range-checking. This will allow user code to turn range-checking on and off without affecting the speed of the library code.

Method on Sequence: const Base elem (int index) const
Method on Sequence: Base& elem (int index)
These functions access the Sequence element at index, but do not perform bounds checking. When many elements are accessed at once, it is slightly faster to access off of the data pointer (see below).
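The split between checked and unchecked access can be sketched as follows. This is only an illustration under stated assumptions: IntArray is a hypothetical stand-in (it is not the Sequence class and holds int rather than Base), and reporting a range error by throwing std::out_of_range is an assumption, since the way the library reports such errors is not described here.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for a Sequence-like container; not the library class.
class IntArray {
public:
    explicit IntArray(std::size_t n) : data_(n, 0) {}

    // Checked access, intended for user code.
    // Throwing std::out_of_range is an assumption made for this sketch.
    int& operator[](int index) {
        if (index < 0 || static_cast<std::size_t>(index) >= data_.size())
            throw std::out_of_range("IntArray index out of range");
        return data_[static_cast<std::size_t>(index)];
    }

    // Unchecked access, intended for fully debugged library code.
    int& elem(int index) { return data_[static_cast<std::size_t>(index)]; }

    int size() const { return static_cast<int>(data_.size()); }

private:
    std::vector<int> data_;
};

// Library-style loop that skips the per-element range check.
int sum_unchecked(IntArray& a) {
    int total = 0;
    const int n = a.size();
    for (int i = 0; i < n; ++i)
        total += a.elem(i);
    return total;
}
```

Because sum_unchecked never goes through operator[], its speed would not depend on whether range checking is enabled for user code.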

Method on Sequence: Base* data ()
Returns a pointer to the beginning of the array part of the Sequence (translated by the specified offset for subsequences).

Method on Sequence: const Base* data () const
As above, but returns a pointer to a const Base for const Sequences.

Sample usage:

int s = sq.size ();      // cache size in a local variable;
                        // the compiler probably isn't
                        // smart enough to figure out that it
                        // is a loop invariant

Base *ptr = sq.data ();  // raw data pointer: no range checking

for (int i = 0; i < s; i++)
    foo (ptr[i]);

Sequence Conversion to C-style arrays

For backward compatibility, Sequence defines the following conversion operators, which allow a Sequence to be passed to a function which expects an array of Bases:

Method: Sequence operator Base * ()
Method: Sequence operator const Base * ()
Returns a pointer to the data part of the Sequence.

To invoke the conversion operator, just put a Sequence where an array of Base is expected:

void foo (Base ar[], ...);
Sequence sq;

    foo (sq,...);    // conversion operator is invoked

Treating a Sequence as an array of Base is not recommended for new code, because it does not allow for bounds checking of subscripts, or for use of any of the other functions defined on Sequence variables.

SeqList: Sets of Sequences

SeqList is a dynamically sized array of Sequence. Since each Sequence in a SeqList has its storage allocated separately, it acts more like a list of Sequences than a 2D array with respect to efficiency of accessing columns. See section Sequence.

SeqList Constructors

Method: SeqList SeqList ()
Constructs a SeqList variable with no storage allocated.

Method: SeqList SeqList (int size)
Constructs a SeqList with size cells. Each cell in the array is automatically initialized to a zero-length (null) Sequence with a call to the Sequence () constructor.

SeqList Destructors

Method on SeqList: void clear (void)
Sets size to 0 and sets all Sequences in SeqList to nilSequence without deallocation.

SeqList Initialization

If one wishes to initialize the contents of a SeqList variable to something other than null Sequences, the Sequence& assignment operator can be used:

SeqList sqlst;

sqlst = "------------------------------------";

Design Note

The operator makes size () copies of the right hand side so that each SeqList cell will refer to different storage.

Appending to SeqList

Method on SeqList: SeqList & operator+= (Sequence & seq)
Appends Sequence to SeqList.

SeqList Copy Constructor and Copy Semantics

Unlike Sequences, SeqList has the regular copy semantics, i.e., the copy constructor and the assignment operator make copies of their arguments. If you want to pass around little square pieces of an alignment rather than just pieces of individual Sequences, use an Alignment (see section `Alignment Class' in To Appear in the Ultimate Manual).

Method: SeqList SeqList (const SeqList& slist)
Make a copy of slist.

Method on SeqList: SeqList& operator= (SeqList& slist)
Copy slist. Storage is automatically resized.

Subscripting and Other Info

Method on SeqList: Sequence& operator [] (int index)
Returns a reference to the Sequence at cell index. Bounds are checked.

With this definition, a SeqList acts like a ragged 2D array with respect to indexing operations. For example:

SeqList sqlst (1);         // one cell, so index 0 is within bounds
Sequence sq;

sq = "ACGT";               // assumes the alphabet has been set

sqlst[0] = sq;             // calls SeqList subscript operator
cout << sqlst[0][2];       // calls SeqList subscript operator,
                           // then Sequence subscript operator

Method on SeqList: int size ()
Report current size of SeqList.

Method on SeqList: int total_size ()
Report total size of all Sequences in SeqList.

Input and Output

The SeqList I/O format is governed by the Sequence I/O format variables. The Bases are converted according to the Sequence alphabet (see section Setting The Sequence Conversion Alphabet), and the label of each individual sequence is input according to the format set with Sequence::set_format (), or the default, which is RAW_ASCII. See section Sequence I/O Formats.

Method: SeqList void scanFrom (istream &is)
Reads in a SeqList from the istream, according to the current Sequence alphabet and format (see section Sequence I/O Formats). Automatically resizes the SeqList variable (see section SeqList Automatic Resizing).

Function: istream& operator>> (istream &, SeqList &)
Calls scanFrom.

Method: SeqList void printOn (ostream&)
Prints a SeqList onto the ostream, according to the current Sequence format (see section Sequence I/O Formats).

Function: ostream& operator<< (ostream &, const SeqList&)
Calls printOn.

SeqList Automatic Resizing

A SeqList is automatically resized on input, or on copying another SeqList with either the copy constructor or the assignment operator.

ASN Sequence Streams

The ASN sequence stream classes support sequence (see section Sequence) input and output using the NCBI's ASN data format. The BaseStream class has the kernel interface to the NCBI software; the AsnStream class is designed for reading and writing user files in ASN format, while the EntrezStream class provides access to the compressed Entrez database as distributed on CD-ROM.

The BaseStream Class

BaseStream is the base class for reading sequences using stream functions from ASN and Entrez genetic sequence databases. This class should not be instantiated itself; instead, it provides functionality common to the derived AsnStream (see section The AsnStream Class) and EntrezStream (see section The EntrezStream Class) classes. Public functions are available to check the state of the stream, to read a given sequence (out of a sequence-set), to read the description of the sequence-set, and to read the title of the current sequence within the sequence-set. The protected functions (called internally only) are used by the derived classes to set up to read (loading information from the current sequence) and to initialize the ASN data structures. See section Sequence.

Constructors

Use the constructor as follows:

BaseStream ()
This constructor does nothing and is protected, preventing instantiation of this class.

Public Methods

Method on BaseStream: BaseStream& operator>> (Sequence &seq)
This is the primitive extraction operator, which will pull an actual alphabetic (ASCII) sequence out of the BaseStream at its current reading location and use the Sequence class (see section Sequence) functions to store it in seq. The AsnStream or EntrezStream must have been positioned at a particular sequence before using this operator. The operator also advances to the next sequence within the sequence set. If there are no more sequences within this set, it will set the noseqsetbit (see section Flags and States) and return without attempting to transfer any more information.

Method on BaseStream: char* ExtractDescr (void)
Pulls out the description of the current sequence.

Method on BaseStream: char* ExtractTitle (int whichSeq)
Pulls out the name of the current sequence set. In sequence sets with just one sequence, usually the title of the sequence set and the description of the first sequence are the same.

Protected Methods

Method on BaseStream: void SetupToRead ()
Does the necessary housekeeping to create a list of sequences (seqlist) and convert them to alphabetic format.

Flags and States

The base stream class includes the following status bits:

Status Bit: BaseStream badbit
Indicates that a read from the ASN database has returned NULL as the next sequence set, usually resulting from reaching the end of a file.

Status Bit: BaseStream noseqsetbit
Indicates that the user has not asked to go to a sequence, or that user has read past the end of a sequence set and not asked for another one.

Status Bit: BaseStream badtypebit
An internal error, generated if the database does not have a Bioseq-Set in it.

The following methods can be used to examine and modify the BaseStream status bits.

Method: BaseStream clear (int value=0)
Set the state bits to a given value (0 by default).

Method on BaseStream: int bad (void)
Returns badbit, TRUE if bad, usually indicating the end-of-file.

Method on BaseStream: int good (void)
Returns TRUE if no state bits are set.

Method on BaseStream: int badtype (void)
Returns badtypebit, TRUE if bad type.

Method on BaseStream: int noseqset (void)
Returns noseqsetbit, TRUE if no sequence loaded.
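The relationship between the status bits and the query methods can be sketched as below. StateBits is a hypothetical class written for illustration only; the numeric bit values and the set () helper are assumptions, since the actual values and the way BaseStream sets its bits internally are not given here.

```cpp
// Hypothetical sketch of the state-bit bookkeeping; not the library class.
class StateBits {
public:
    // Bit values are assumptions chosen for illustration.
    enum { badbit = 1, noseqsetbit = 2, badtypebit = 4 };

    StateBits() : state_(0) {}

    // Replace all state bits with the given value (0 by default).
    void clear(int value = 0) { state_ = value; }

    // Hypothetical helper: turn on a single bit.
    void set(int bit) { state_ |= bit; }

    int good() const { return state_ == 0; }
    int bad() const { return (state_ & badbit) != 0; }
    int badtype() const { return (state_ & badtypebit) != 0; }
    int noseqset() const { return (state_ & noseqsetbit) != 0; }

private:
    int state_;
};
```

In this sketch good () is true only when every bit is clear, so a stream that has merely run out of sequences in the current set (noseqsetbit) is not good, yet is not bad either.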

The AsnStream Class

AsnStream is a class (derived from BaseStream; see section The BaseStream Class) for finding and reading sequences using stream functions from ASN genetic sequence databases. (ASN-format databases typically have `.asn' or `.aso' extensions and should not be confused with Entrez/CDROM databases.) Supported functionality includes opening a database; building, reading, or writing an index; finding a list of file positions that satisfies a query on sequence descriptions; going to a file position; and going to the next sequence. Functionality inherited from BaseStream allows AsnStream to get the next subsequence and read an alphabetic (ASCII) sequence into a Sequence class instance (see section Sequence). The intended use for an AsnStream will be to create private databases of information probably taken from the Entrez databases. Writing material to an AsnStream is not currently supported.

Constructors

Use the constructor as follows:

AsnStream (char* filename, int mode)
Opens an AsnStream using filename. badbit is set if this file cannot be opened. Parameter mode is currently binary|input only. Note: mode flags as with the usual stream classes are enumerated members of the AsnStream class.

Public Methods

Method on AsnStream: void GoNextSeqSet (void)
As the name implies, goes to the next set of sequences. Either this or Seek () should be called before beginning to read using the >> operator, or at any time when noseqset () is true, which could also happen when reaching the end of a certain sequence set.

Method on AsnStream: void FindFilePos (String& searchString, long *& locs)
This function finds the set of file positions that satisfy the query given in the description. The criterion is that the sequence descriptions found include all the words passed in searchString. All punctuation marks, except for '-', are regarded as whitespace, and the search is case-insensitive. The file positions found are returned in 'locs'. The first item in locs (locs[0]) will be the number of items found: 0 if no items were found, and -1 if there was an error, for example when a word passed in searchString was not in the index. The rest of the items in locs (locs[1] through locs[locs[0]]) will contain the file positions in the ASN database where the found sequence sets are. After calling this function, typically one would use Seek() to go to one of the locations; at that point the data from the sequence-set found could be read using the usual functions from BaseStream such as the >> operator. Note that this function ends up allocating memory for locs; the caller is responsible for freeing that memory with 'delete'.
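The layout of the locs array can be illustrated with a small helper. This is a sketch only: collect_positions is a hypothetical function, not part of AsnStream, and it works on any array laid out as described above (count in element 0, then that many file positions).

```cpp
#include <vector>

// Hypothetical helper, not part of the library.
// Interprets an array laid out as described for FindFilePos:
//   locs[0]               number of matches (0 for none, -1 for an error)
//   locs[1..locs[0]]      the file positions themselves
// Returns false on error; otherwise fills positions and returns true.
bool collect_positions(const long* locs, std::vector<long>& positions) {
    positions.clear();
    if (locs == nullptr || locs[0] < 0)
        return false;                    // error, e.g. a word not in the index
    for (long i = 1; i <= locs[0]; ++i)
        positions.push_back(locs[i]);
    return true;
}
```

Each collected position could then be handed to Seek () before reading with the >> operator.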

Method on AsnStream: void BuildIndex ()
This function builds an index and holds it in memory. Usually you would only call this function if an index file did not already exist. AsnStream must have an index to use the FindFilePos() function but it is not otherwise necessary. Ordinarily after calling BuildIndex () one would call WriteIndex () to write it out.

Method on AsnStream: int ReadIndex (char * filename)
Reads an index from the file called filename. If the file cannot be opened for any reason, this function will return 0 (FALSE); otherwise it will return -1 (TRUE).

Method on AsnStream: int WriteIndex (char * filename)
Writes the current index to the file called filename. If the file cannot be opened for writing for any reason, this function will return 0 (FALSE); otherwise it will return -1 (TRUE). One will probably want to call BuildIndex() before using this function.

Private Methods

Method on AsnStream: SeqEntryPtr read ()
Reads the next sequence-set in linear order. Called by GoNextSeqSet() and by BuildIndex().

The EntrezStream Class

EntrezStream is a class derived from BaseStream for reading sequences using stream functions from the Entrez/CDROM genetic sequence database. Supported functionality includes finding a sequence set ID by description and loading a sequence set given the sequence set ID. Functionality inherited from BaseStream allows reading sequences from sequence sets and getting the title and description. See section Sequence.

Constructors

Use the constructor as follows:

EntrezStream ()
Opens an EntrezStream. failbit is set if Entrez access initialization fails for some reason.

Public Methods

Method on EntrezStream: void FindUIDSet (char * searchString, DocUid * uidsFound, Int2 recordType, Int2 fieldType, Int2 * beginError = NULL, Int2 * endError = NULL)
Finds all the UID's (4-byte integers in practice) that satisfy the query specified by searchString. Here's NCBI's explanation of what the search string should look like:

The syntax uses &, |, and -, respectively, as the intersection, union, and set subtraction operators. Terms are usually followed by a field qualifier like [AUTH] (indicating author name). When terms contain embedded spaces or special characters, they must be enclosed in double quotes ("). Parentheses are used to override the standard precedence in the way that you would expect. The [*] field qualifier is used to say "give me the union of this term over all available fields."

Here are some examples:

  "Kay LE" [AUTH] - "Forman-Kay JD" [AUTH]
  carcinoma [MESH] | oncogene [WORD]

Ordinarily the field type should be specified as -1, since you'll be specifying the field in the search string. Valid field types in the search string are:

WORD
Text word
MESH
MeSH terms
KYWD
Keyword
AUTH
Author
JOUR
Journal
ORGN
Organism
ACCN
Accession numbers, locus names, patent ID's
GENE
Gene symbols
PROT
Protein names
ECNO
E.C. numbers

Valid data types are TYP_ML (MEDLINE references), TYP_AA (amino acid sequences), and TYP_NT (nucleotide sequences). These are defined in `accentr.h'.

The return values (coming back through DocUid * uidsFound) give the number of items found as the first item; 0 for none and -1 in the case of a nonfatal error. With a nonfatal parse error, beginError and endError will be set to the beginning and ending of the error in the search string.

Ordinarily, one would use LoadSequence() after this to load the sequence set specified by a particular DocUid. Then the base stream functionality can be used to get the sequence set title, sequence description, and extract the actual ASCII sequence.

Method on EntrezStream: void LoadSequence (DocUid id)
Loads the sequence-set specified by the DocUid id (returned from FindUIDSet). This is necessary before beginning to read using the >> operator. This function should be called if EntrezStream::noseqsetbit is set.

Configuration for Accessing Entrez/ASN databases

Here's how to set up your configuration file to access Entrez or ASN databases with Sequence-Streams (see section The AsnStream Class and section The EntrezStream Class).

On Unix your configuration file would be called `.ncbirc', and must be in the same directory as the program using EntrezStream and/or AsnStream. It looks much like the following (which is valid as of March 1995):

[NCBI]
ROOT=/projects/compbio/entrez
ASNLOAD=/projects/compbio/entrez/asnload
DATA=/projects/compbio/entrez

The only section is entitled NCBI. The ROOT entry refers to the path to the root for all the Entrez CD-ROM data. ASNLOAD refers to the location for ASN.1 parse files (`*.l00'). The DATA entry refers to the location for the `cdromdat.val' file, which contains conversion specifications for sequences to whatever alphabet (the alphabet being IUPACNA for Sequence Streams). Don't confuse that alphabet with the Alphabet class (see section The Alphabet Classes), by the way; this is an alphabet translation internal to NCBI. For right now, you can copy `.ncbirc' from `ultimate/lib/proto.ncbirc', and that should work.

AlphabetTuple and BaseTuple Classes

The class AlphabetTuple is for creating tuples (cartesian products) of objects from class Alphabet (see section The Alphabet Classes). Class AlphabetTuple is intended for use with short fixed-length tuples.

The class BaseTuple provides for tuples of class Base (see section The Base class), the elements of the cartesian product of alphabets.

AlphabetTuple Class

The following methods are available for AlphabetTuple:

Method on AlphabetTuple: AlphabetTuple (const Alphabet *a0)
Constructor for AlphabetTuple class that creates a singleton tuple of the alphabet pointed at by a0.

Method on AlphabetTuple: AlphabetTuple (const Alphabet *a0, const Alphabet *a1)
Constructor for AlphabetTuple class that creates an ordered pair of alphabets, with the first element of the tuple pointed at by a0, and the second element pointed at by a1.

Method on AlphabetTuple: AlphabetTuple (const Alphabet *a0, const Alphabet *a1, const Alphabet *a2)
Constructor for AlphabetTuple class that creates an ordered triple of alphabets, with the alphabets of the tuple pointed at by a0, a1, and a2.

Method on AlphabetTuple: AlphabetTuple (int i, const Alphabet **a)
Constructor for AlphabetTuple class that creates a tuple consisting of i alphabets pointed to by elements of the array a.

Method on AlphabetTuple: AlphabetTuple (const AlphabetTuple &a)
Copy constructor for AlphabetTuple class.

Method on AlphabetTuple: ~AlphabetTuple (void)
Destructor for AlphabetTuple class.

Method on AlphabetTuple: int num_alphabets (void) const
Returns the number of alphabets in the tuple.

Method on AlphabetTuple: int num_normal (void) const
Returns the number of normal BaseTuples that are elements of the cartesian product of the alphabets represented by the AlphabetTuple.

Method on AlphabetTuple: const Alphabet * operator [] (int i) const
Returns a pointer to the ith alphabet in the tuple.

Method on AlphabetTuple: int same_as (const AlphabetTuple *other) const
Returns 1 if the AlphabetTuple pointed at by other is identical elementwise to this AlphabetTuple. Returns 0 otherwise.

Method on AlphabetTuple: int index (const BaseTuple & bt) const
Returns a unique integer corresponding to the BaseTuple bt when considered as an element of the AlphabetTuple. This is useful for indexing a linear array of BaseTuples. This is the inverse of the function unindex described below.

Method on AlphabetTuple: BaseTuple *unindex (int index) const
Returns the BaseTuple corresponding to the integer index. This is the inverse of the function index described above.
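One common way to obtain such a unique integer is mixed-radix numbering, sketched below. This is an assumption about how index and unindex might work, not a description of the library's actual ordering. The function names, the use of std::vector, and the representation of a tuple as a list of per-alphabet character positions are all hypothetical.

```cpp
#include <cstddef>
#include <vector>

// sizes[k]  : number of normal characters in the k-th alphabet (hypothetical input)
// digits[k] : position of the k-th base within its alphabet, 0 <= digits[k] < sizes[k]

// Number of tuples in the cartesian product (compare num_normal).
int tuple_count(const std::vector<int>& sizes) {
    int count = 1;
    for (std::size_t k = 0; k < sizes.size(); ++k)
        count *= sizes[k];
    return count;
}

// Mixed-radix encoding; the first alphabet is the most significant digit.
int tuple_index(const std::vector<int>& sizes, const std::vector<int>& digits) {
    int idx = 0;
    for (std::size_t k = 0; k < sizes.size(); ++k)
        idx = idx * sizes[k] + digits[k];
    return idx;
}

// Inverse of tuple_index: recover the per-alphabet positions.
std::vector<int> tuple_unindex(const std::vector<int>& sizes, int idx) {
    std::vector<int> digits(sizes.size());
    for (std::size_t k = sizes.size(); k-- > 0; ) {
        digits[k] = idx % sizes[k];
        idx /= sizes[k];
    }
    return digits;
}
```

With a 4-letter and a 20-letter alphabet, for instance, the indices run from 0 to 79, which is what makes a linear array of per-tuple values straightforward to allocate and address.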

Method on AlphabetTuple: void print_unindex (ostream &out, int index) const
Prints to stream out the BaseTuple corresponding to the integer index.

Method on AlphabetTuple: void print_command (ostream &out) const
Prints to stream out the script commands for constructing the AlphabetTuple.

For additional input and output, the following functions are available:

AlphabetTuple: AlphabetTuple * read_AlphabetTuple (istream &in) const
Reads in a script command representation of an AlphabetTuple and returns it. The commands are of the form

      Alphabet= <alphabet_name>
      AlphabetPair= <alphabet_name> <alphabet_name>
      AlphabetTriple= <alphabet_name> <alphabet_name> <alphabet_name>
      AlphabetTuple=  <number> <alphabet_name> ... <alphabet_name>

as would be output by print_command. If the first word is not recognized, it is looked up as an alphabet name, as if preceded by "Alphabet=".

AlphabetTuple: ostream & operator<< (ostream &out, const AlphabetTuple &a)
Prints the Alphabets of the AlphabetTuple a to stream out.

BaseTuple Class

The following methods are available for class BaseTuple:

Method on BaseTuple: BaseTuple (const AlphabetTuple &a)
Constructor for BaseTuple class. Argument a is the AlphabetTuple that the newly constructed BaseTuple is a member of.

Method on BaseTuple: ~BaseTuple ()
Destructor for BaseTuple class.

Method on BaseTuple: Base & operator [] (int i)
Returns the ith base of the BaseTuple.

Method on BaseTuple: const Base operator [] (int i) const
Returns the ith base of the BaseTuple.

The following functions are for input and output of BaseTuples to streams:

BaseTuple: ostream & operator<< (ostream &out, const BaseTuple &bt)
Prints the Bases of the BaseTuple bt to stream out.

BaseTuple: istream & operator>> (istream &in, BaseTuple &bt)
Reads the Bases of a BaseTuple bt from stream in.

The Prob, ShortProb, LargeReal, ShortLargeReal classes

All four classes are found in `Prob.h'. They are implemented with macros rather than templates.

The Prob class is designed to allow convenient manipulation of probability values. It usually stores probabilities in logarithmic form, but the exact implementation is hidden from client code. To allow renormalization within the class, values greater than unity are allowed.
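The benefit of a logarithmic representation can be seen in a small sketch using plain doubles. The two functions below are hypothetical illustrations, not part of Prob: one multiplies probabilities directly, the other keeps the running product as a natural-log value so that multiplication becomes addition.

```cpp
#include <cmath>

// Multiplies n copies of p directly; long products underflow to 0.0.
double direct_product(double p, int n) {
    double result = 1.0;
    for (int i = 0; i < n; ++i)
        result *= p;
    return result;
}

// The same product kept as a natural-log value:
// multiplying probabilities corresponds to adding their logs.
double log_product(double p, int n) {
    const double log_p = std::log(p);
    double log_result = 0.0;             // log(1.0)
    for (int i = 0; i < n; ++i)
        log_result += log_p;
    return log_result;
}
```

For a thousand factors of 1E-5 the direct product underflows to zero, while the log form remains an ordinary finite number near -11513.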

The ShortProb class interface is identical to Prob, but may use a smaller internal representation (float instead of double). Casting operators are provided for changing from one form to the other, though the ShortProb inherently has less precision.

The ProbBase and ShortProbBase classes have all the functionality of Prob and ShortProb (indeed, they are inherited by the latter), except for the constructors. Users may want to use these variants for large arrays in which constructor calls could be a significant part of execution time. Corresponding LargeRealBase classes are not available at this time.

Should you want arbitrary range or negative numbers, look at LargeReal.

Prob and ShortProb

Constructors

Probs may be declared in several ways:

Prob P;
Declares P with initial value 0.0. This is the only declaration suitable for a ProbBase, and the value of a ProbBase declared this way is undefined rather than 0.0.
Prob P(Prob::Zero);
Also declares P with initial value 0.0
Prob P(Prob::One);
Declares P with initial value 1.0
Prob P(Prob::Invalid);
Declares P with an initial invalid value
Prob P(0.53);
Declares P with initial value 0.53
Prob P1; Prob P2(P1);
Copy constructor.

Usage

In general, Probs behave like real numbers. They may be manipulated with the standard arithmetic and relational operations:

+ - * / += -= *= /= < > <= >= == !=

Probabilities and normal numeric values may not be freely intermixed. The conversion from a double is PRIVATE, so you can not get away with mixing them. To use a number with a Prob, the from_double cast must be used:

P = (from_double)0.67; // create a prob of .67
P = Q * (from_double)0.5;

If you have a double whose value represents a LOG of a probability, and you want to stuff that into a Prob, you must use the cast (from_log):

P = (from_log) (-0.28768); // the value of ln(.75), note the explicit minus sign!

Alternatives to these casting operations are included among the member functions (see section Member Functions).

Member Functions

There are several methods and functions unique to Probs.

Method on Prob: double ret_epsilon (void)
Returns the smallest probability value representable by a 64-bit floating-point number, or about 1E-307. Values below this return 0 when cast to a double, but are represented as being smaller than zero in their log value.

Method on Prob: int valid (void)
Returns 0 when the probability is extremely small (very close to zero), and 1 otherwise. A return of 0 indicates that underflow has occurred. Probabilities may also be set to an invalid value explicitly (see section Constructors).

Method on Prob: int non_prob (void)
Returns a 1 when a Prob is not a true probability: when either it is not valid(), or it represents a number that is greater than 1. A 0 return indicates that the Prob represents a number on the closed unit interval.

Method on Prob: int is_zero (void)
Returns 1 when the probability is zero.

Method on Prob: int is_one (void)
Returns 1 when the probability is 1.0.

Method on Prob: double ret_double (void)
Returns probability value as a double.

Method on Prob: Prob& set_double (double dval)
Set the value of a probability given its value as a double.

Method on Prob: double ret_log (void)
Returns the log of a probability value as a double.

Method on Prob: Prob& set_log (double logval)
Set the value of a probability given its log as a double.

Method on Prob: Prob& set_const (ConstProb c)
Set the value of a probability to one of the constant probability values, either Prob::Zero or Prob::One.

Method on Prob: Prob operator~ (void)
Returns complement of the probability. Equivalent to ((from_double)1.0 - P).

Method on Prob: double entropy (void)
Computes the standard entropy, defined as - P * ln(P). The return value is positive.
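For a value held in log form, the entropy term can be computed with a single exponentiation. The following is a minimal sketch on a plain double log value. The function entropy_from_log is hypothetical, and returning 0 for a probability of zero (the limit of -P * ln(P) as P goes to 0) is an assumption about the boundary case.

```cpp
#include <cmath>

// Entropy term -P * ln(P) for a probability given by its natural-log value.
double entropy_from_log(double log_p) {
    const double p = std::exp(log_p);
    if (p == 0.0)
        return 0.0;      // assumed limit of -P*ln(P) as P -> 0
    return -p * log_p;   // non-negative whenever 0 < P <= 1
}
```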

Friend to Prob: int approxEqual (Prob const & a, Prob const & b)
Friend to Prob: int approxEqual (from_double const & a, Prob const & b)
Friend to Prob: int approxEqual (Prob const & a, from_double const & b)
Returns 1 when the log forms of a and b are less than 1E-7 apart. This constant is the same for Prob and ShortProb, about right for the latter, and likely too large for some applications using the former. Casts to (from_double) are required to avoid ambiguity with other approxEqual functions.

Friend to Prob: double log2 (Prob const & a)
Returns base 2 logarithm of a Prob.

Friend to Prob: double log (Prob const & a)
Returns natural logarithm of a Prob.

Friend to Prob: Prob pow (Prob const & base, double exponent)
Friend to Prob: Prob pow (Prob const & base, int exponent)
The power function. To return a valid Prob, exponent cannot be negative. If you give it a negative exponent, the return is an invalid Prob; valid() will be false.

Friend to Prob: double pow_double (Prob const & base, double exponent)
Same as pow, but returns a double, so exponent can be any value. Since C++ cannot overload functions on return type alone, this function has to have a different name.

LargeReal and ShortLargeReal

The LargeReal and ShortLargeReal classes implement signed numbers using Prob (or ShortProb) as the hidden internal representation.

Constructors

LargeReals may be declared in several ways:

LargeReal R;
Declares R with initial value 0.0
LargeReal R(LargeRealSign::Pos);
Also declares R with initial value 0.0
LargeReal R(LargeRealSign::Neg, 10);
Declares R with initial value -10.0
LargeReal R(-1, 10);
Also declares R with initial value -10.0
LargeReal R(-123.8);
Declares R with initial value -123.8
LargeReal R(aProbClassThing);
Declares R with positive initial value of the passed prob
LargeReal R(LargeReal::Zero);
Also declares R with initial value 0.0
LargeReal R(LargeReal::One);
LargeReal R(LargeReal::Unity);
Declares R with initial value 1.0
LargeReal R(LargeReal::Invalid);
Declares R with an initial invalid value

Usage

In general, LargeReals behave like signed real numbers. The standard arithmetic and relational operators are defined.

Member Functions

As with Probs, there are several unique functions.

Method on LargeReal: int is_zero (void)
Returns 1 when the internal magnitude (ignoring sign) is zero (internally the same as Prob::is_zero()).

Method on LargeReal: int valid (void)
Returns 1 when the internal magnitude, and hence the number, is valid (internally the same as Prob::valid()).

Method on LargeReal: Prob ret_mag (void)
Returns the magnitude as a Prob.

Method on LargeReal: int is_nonneg (void)
A data member access function. Not intended for real use. It returns the value of the internal sign flag.

Method on LargeReal: int sgn (void)
Returns 0 if value is zero, +1 if value is greater than zero, -1 if value is less than zero.

Method on LargeReal: int is_positive (void)
Returns 1 if value is zero or greater. This ignores the case of negative zero. This is fast as it only checks the internal sign flag.

Friend to LargeReal: int bigger (LargeReal a, LargeReal b)
Compares only the magnitudes of the values, ignoring sign. Returns 1 when the magnitude of a is greater than or equal to the magnitude of b.

Friend to LargeReal: int approxEqual (LargeReal const & a, LargeReal const & b)
Friend to LargeReal: int approxEqual (double const & a, LargeReal const & b)
Friend to LargeReal: int approxEqual (LargeReal const & a, double const & b)
Returns 1 when the signs of a and b are the same and their magnitudes are Prob::approxEqual(), or when the signs are different and both individually are approximately equal to zero. Return 0 otherwise. Note that this function is not uniform about 0.

Friend to LargeReal: LargeReal pow (LargeReal base, double exponent)
The power function.

Friend to LargeReal: LargeReal log (LargeReal base, double exponent)
Natural logarithm.

Printing

The two print functions are print and rawprint.

Prob and LargeReal Implementation

The Prob and LargeReal implementations depend on IEEE754 32- and 64-bit floating point numbers. Since not all compilers or systems define the same constants in their header files, the Prob class relies on none of these, instead defining its own constants based on the assumed standard format of single- and double-precision floating point numbers.

One goal of the class is to provide the same range for Prob and its single-precision version ShortProb to ensure error-free conversion between the two formats. Thus, the 64-bit Prob class only has more precision than the ShortProb class, not more range.

In the IEEE754 standard, double-precision numbers range to about 1.8E308, while single-precision numbers range to about 3.4E38. For this reason, zero has been chosen to be exp(-2E35) (Prob uses the natural logarithm), while the sentinel Prob::Invalid value is -2E37, and any Prob whose log value is smaller than -1E37 is considered invalid. If the range between the zero probability and the invalid probability were further spread, it would be possible to semi-safely perform multiplication (addition of the underlying log values) without checking for zero. However, as they are defined here, the check for zero must occur, or adding together the log values of one thousand zeros would result in a ShortProb becoming invalid. If the infinity and NaN (Not A Number) checks were guaranteed to be done in hardware, rather than software, relying on their IEEE definitions would be another means of speeding the code.
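The need for the zero check can be demonstrated with plain floats. The sketch below reuses the constants quoted above, but the function names are hypothetical and the code is not taken from the library. It multiplies zero probabilities together in the log domain, once with and once without the check.

```cpp
// Constants quoted in the text; the names are hypothetical.
const float kLogZero = -2e35f;        // log value standing for probability zero
const float kInvalidLimit = -1e37f;   // log values below this count as invalid

bool is_valid(float log_value) { return log_value >= kInvalidLimit; }

// Log-domain multiplication with no check: simply add the log values.
float multiply_unchecked(float a, float b) { return a + b; }

// Log-domain multiplication that pins any product involving zero at zero.
float multiply_checked(float a, float b) {
    if (a <= kLogZero || b <= kLogZero)
        return kLogZero;
    return a + b;
}

// Multiplies n zero probabilities together, starting from one (log value 0).
float product_of_zeros(int n, bool check_zero) {
    float acc = 0.0f;
    for (int i = 0; i < n; ++i)
        acc = check_zero ? multiply_checked(acc, kLogZero)
                         : multiply_unchecked(acc, kLogZero);
    return acc;
}
```

Without the check, the running log value drifts past the invalid limit after a few dozen factors; with it, the result stays exactly at the zero value.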

64-bit IEEE floating-point numbers are used for comparison and operations which require exponentiation. The smallest IEEE denormal is about exp(-744), or 1E-323.306. Not trusting denormals (for the same reasons as not relying on infinity and NaN), however, provides a smallest number of about exp(-708), or 1E-307.65. Two probabilities are approxEqual if they differ by less than 1E-307 (ret_epsilon()), or if the difference in their log representations is less than 708. Also, in the conversion to a log, any double that is smaller than epsilon is converted to zero to prevent underflow in the call to log(). In the pow_double(a,r) routines, epsilon is similarly used to avoid calling exp() with too small a number.

The addition of two Probs is another interesting problem. Here, the numbers must be exponentiated, summed, and then reduced again to the log domain. To reduce this operation to only one exponentiation, the difference between the two log values is exponentiated. Suppose P>Q, and both are log values. Then, we want to calculate:

log(exp(P)+exp(Q)), or 
log(exp(P)*(1.0+exp(Q)/exp(P))), or
P+log(1.0+exp(Q-P)).

When the 64-bit representation is being used, each log value has 54 bits of precision. Thus, if exp(Q-P) is less than 2**-54, the quantity added to P will be zero. This motivates the private _maxDiff() value of -log(2**-54), or 37.4. If the log values of two Probs differ by more than this, addition of the Probs simply results in the return of the larger value. For ShortProbs, the threshold is -log(2**-24), or 16.6. This is the only constant that varies between the Prob and ShortProb classes.

Addition is further sped by a check for zero: if one argument is zero, the other argument is returned.
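The addition scheme above can be sketched as a free function on plain doubles. This is an illustration under stated assumptions rather than the Prob implementation: log_add, kLogZero and kMaxDiff are hypothetical names, and kMaxDiff is computed as 54*log(2) to match the 37.4 quoted above.

```cpp
#include <cassert>
#include <cmath>

const double kLogZero = -2e35;                      // log-domain zero
const double kMaxDiff = 54.0 * 0.6931471805599453;  // -log(2**-54), about 37.4

// Add two probabilities held as natural logs, using a single exp().
inline double log_add(double p, double q) {
    assert(!std::isnan(p) && !std::isnan(q));
    if (p <= kLogZero) return q;             // zero + q = q
    if (q <= kLogZero) return p;             // p + zero = p
    if (p < q) { double t = p; p = q; q = t; }  // ensure p >= q
    if (p - q > kMaxDiff) return p;          // q is lost in rounding anyway
    return p + std::log(1.0 + std::exp(q - p));
}
```

For example, log_add(log(0.25), log(0.5)) agrees with log(0.75) to rounding error, and log_add(-1.0, -100.0) returns -1.0 without calling exp() at all.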

Histogram Classes

Hist is the basic histogram class. It has functions to add and delete samples, and get sample statistics such as the mean and variance.

SmoothHist is a virtual class derived from Hist. It defines all of the functions in Hist, plus smoothing-related functions.

The derived classes of SmoothHist implement specific kernels. Currently, RectHist, NormalHist, and LogNormalHist are implemented.

Constructors

Method: Hist Hist (int low, int high)
Construct a Hist with bins from low to high. Bins on the end are infinitely wide: all instances less than or equal to low are put in low, and all instances greater than or equal to high are put in high.

Method: RectHist RectHist (int low, int high)
Constructs a SmoothHist which uses a rectangular kernel.

Method: NormalHist NormalHist (int low, int high)
Constructs a SmoothHist which uses a normal kernel.

Method: LogNormalHist LogNormalHist (int low, int high)
Constructs a SmoothHist which uses a log normal kernel. Because of systematic bias, this kernel is not recommended.

Adding and subtracting instances

Method: Hist add_instance (int instance)
Method: Hist operator+= (int instance)
Increments the count for the appropriate bin in the Histogram. Bins at the end of the range are infinitely wide.

Method: Hist sub_instance (int instance)
Method: Hist operator-= (int instance)
Decrements the count for the appropriate bin in the Histogram. Bins at the end of the range are infinitely wide.
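The behavior of the end bins can be sketched with a small stand-alone class. SketchHist is a hypothetical name and is not the library's Hist. It only shows how instances outside the range from low to high are clamped into the two end bins.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

class SketchHist {
public:
    SketchHist(int low, int high)
        : low_(low), high_(high), counts_(width(low, high), 0) {}

    SketchHist& operator+=(int instance) { ++counts_[bin_of(instance)]; return *this; }
    SketchHist& operator-=(int instance) { --counts_[bin_of(instance)]; return *this; }

    // Raw count for a bin (out-of-range bins map to the end bins).
    int operator()(int bin) const { return counts_[bin_of(bin)]; }

private:
    static std::size_t width(int low, int high) {
        assert(low <= high);
        return static_cast<std::size_t>(high - low) + 1;
    }
    // End bins are infinitely wide: clamp the instance into [low, high].
    std::size_t bin_of(int instance) const {
        if (instance < low_) instance = low_;
        if (instance > high_) instance = high_;
        return static_cast<std::size_t>(instance - low_);
    }
    int low_;
    int high_;
    std::vector<int> counts_;
};
```

After constructing SketchHist h(0, 10), the statements h += -5 and h += 0 both increment bin 0, and h += 99 increments bin 10.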

Setting smoothing parameters

In all classes, the kernel is initially defined to be the identity function. (Note: this should probably be changed to some suitable fraction of the range.) Setting the kernel width will cause the kernel to be recalculated. You can either set the kernel explicitly, or the class will calculate a kernel width based on the variance of the data and the number of samples. The class will print an error message if the kernel width is set to zero.

Method: Hist set_kernel_width (double sigma)
Set the kernel width parameter to sigma.

Method: Hist set_kernel_width (void)
With no argument, the class sets the kernel width to an internally-calculated value based on the variance of the data and the number of samples.

Method: Hist smoothing (DYNAMIC)
Set dynamic smoothing. The smoothed counts will be updated every time an instance is added or subtracted. This is the default smoothing type.

Method: Hist smoothing (LAZY)
Set lazy smoothing. A dirty bit is set every time the counts are updated or the kernel is changed, and the smoothed counts are recomputed whenever the count information is accessed using the prob(), count(), or countVec() methods.

Method: Hist smoothing (USER)
User controls smoothing with the smooth() function. For smoothed classes, the count information will not reflect any updates to either the kernel or the counts until smoothing is called explicitly with the smooth() method.

Method: Hist smooth (void)
Recompute smoothed counts by multiplying counts by the kernel.
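The lazy mode can be illustrated with a dirty bit around a simple rectangular kernel. This sketch is not the library's RectHist: the class name, the half-width parameter, and the clamping of kernel mass into the end bins are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

class LazyRectSketch {
public:
    explicit LazyRectSketch(std::size_t bins)
        : raw_(bins, 0.0f), smoothed_(bins, 0.0f),
          half_width_(0), dirty_(false), recomputes_(0) {
        assert(bins > 0);
    }

    void add_instance(std::size_t bin) {
        assert(bin < raw_.size());
        raw_[bin] += 1.0f;
        dirty_ = true;          // smoothed counts are now stale
    }

    void set_half_width(std::size_t w) {
        half_width_ = w;
        dirty_ = true;          // kernel changed, recompute on next access
    }

    float counts(std::size_t bin) {
        assert(bin < raw_.size());
        if (dirty_) smooth();   // recompute only when something changed
        return smoothed_[bin];
    }

    float raw_counts(std::size_t bin) const { return raw_[bin]; }
    int recomputes() const { return recomputes_; }

private:
    // Spread each raw count equally over [i - w, i + w], clamped at the ends.
    void smooth() {
        const long n = static_cast<long>(raw_.size());
        const long w = static_cast<long>(half_width_);
        const float weight = 1.0f / static_cast<float>(2 * w + 1);
        smoothed_.assign(raw_.size(), 0.0f);
        for (long i = 0; i < n; ++i) {
            for (long j = i - w; j <= i + w; ++j) {
                const long k = j < 0 ? 0 : (j >= n ? n - 1 : j);
                smoothed_[static_cast<std::size_t>(k)] +=
                    raw_[static_cast<std::size_t>(i)] * weight;
            }
        }
        dirty_ = false;
        ++recomputes_;
    }

    std::vector<float> raw_;
    std::vector<float> smoothed_;
    std::size_t half_width_;
    bool dirty_;
    int recomputes_;
};
```

Several calls to add_instance() followed by several calls to counts() trigger only one recomputation, which is the point of the lazy setting. The dynamic setting would instead call smooth() inside add_instance().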

Retrieving count information

The following refer to smoothed counts if the class defines a kernel, otherwise they refer to raw counts.

Method on Hist: int operator () (int bin) const
Method on Hist: float counts (int bin) const
Return counts for bin, which may be fractional for smoothed classes.

Method on Hist: float prob (int bin) const
Returns the probability of a bin.

The entire count Vector can be retrieved as follows:

Method on Hist: const floatVec countVec (int) const
Returns a read-only reference to the internal count vector.

Method on Hist: float mean (void) const
Returns the mean of the samples.

Method on Hist: float variance (void) const
Returns the variance of the samples.

Raw Count Information

For smoothed classes, the unsmoothed information can be retrieved by appending raw_ to the previous functions:

Method on Hist: float raw_counts (int bin) const
Return the raw counts of bin.

Method on Hist: float raw_prob (int bin) const
Return the raw (unsmoothed) probability of a bin.

Method on Hist: const floatVec raw_countVec (int) const
Return a read-only reference to the internal raw count vector.

Input and Output

Method on Hist: virtual void printOn (ostream& ostr)
Print the Histogram in a human-readable form on ostr. Overloaded for SmoothHist to print extra information.

Method on Hist: virtual void dumpOn (ostream& ostr)
For debugging. Print more complete information than printOn().

Method on Hist: virtual void plotOn (ostream& ostr)
Format suitable for feeding to a plotting package. Print out bucket, count pairs on ostr (counts for Hist, smoothed counts for SmoothHist).

Method on Hist: ostream& operator<< (ostream& ostr, Hist& histogram)
Invoke the appropriate version of printOn().

Method on Hist: virtual void scanFrom (istream& istr )
Not yet implemented. Read in bucket, count pairs from istr. Set counts for Hist, raw counts for SmoothHist.

Method on Hist: istream& operator>> (istream& istr, Hist& histogram)
Not yet implemented. Invoke the appropriate version of scanFrom().

Method on Hist: virtual void storeOn (ostream& ostr)
Not yet implemented. Save the state of the object on ostr (not in human-readable form).

Method on Hist: virtual void restoreFrom (istream& istrm)
Set the state of an object from istrm, assuming the state was saved with storeOn.

Run-time Type Information

Method on Hist: static const char* classID (void)
Return a reference to a const string identifying the class (the identifier is just the class name).

Method on Hist: virtual const char* type (void) const
As above, but when called on an object, dynamically returns the type. For example:

Hist h;

if (h.type() == RectHist::classID())
	{ /* do something specific to RectHist */ }

Input

The following functions provide useful, uniform ways of dealing with comments in input files.

Input Function: void get_word (istream & in, char * word)
Reads a (white-space-terminated) word from the input into a pre-allocated buffer provided. Ignores comments starting with # or //. No length checking is done so this should only be used for input that is certain to be safe.

Input Function: char get_word (istream & in, char * word, char stop_char, int skip)
Reads a (white-space-terminated) word from the input into a pre-allocated buffer provided. Ignores comments starting with # or //. Stops immediately after reading stop_char. A comment is treated as a newline character for stopping when stop_char is newline. If skip is 1, then all input following the first white-space-terminated word is skipped over up to and including the first stop_char. If skip is 0, then all whitespace following the first white-space-terminated word is skipped over up to and including the first stop_char or the first non-whitespace character, whichever comes first. The default value of skip is 1. The value returned is the last character consumed from stream in. No length checking is done so this should only be used for input that is certain to be safe.
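Since get_word does no length checking, a length-safe variant can be useful when the input is not trusted. The following sketch shows the comment-skipping behavior using std::string. The names sketch_get_word and sketch_words are hypothetical, and the sketch recognizes comments only at the start of a word, which is an assumption about the intended behavior.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

inline bool sketch_is_space(int c) {
    return c == ' ' || c == '\t' || c == '\n' || c == '\r';
}

// Discard the remainder of the current line (used for comments).
inline void sketch_skip_line(std::istream& in) {
    std::string rest;
    std::getline(in, rest);
}

// Skip white space and comments starting with # or //, then read one
// white-space-terminated word. Returns false when no word is left.
inline bool sketch_get_word(std::istream& in, std::string& word) {
    const int eof = std::istream::traits_type::eof();
    word.clear();
    int c;
    while ((c = in.peek()) != eof) {
        if (sketch_is_space(c)) { in.get(); continue; }
        if (c == '#') { sketch_skip_line(in); continue; }
        if (c == '/') {
            in.get();
            if (in.peek() == '/') { sketch_skip_line(in); continue; }
            word += '/';  // a lone slash starts an ordinary word
        }
        break;
    }
    while ((c = in.peek()) != eof && !sketch_is_space(c))
        word += static_cast<char>(in.get());
    return !word.empty();
}

// Usage: split a whole text into words, ignoring comments.
inline std::vector<std::string> sketch_words(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> words;
    std::string w;
    while (sketch_get_word(in, w)) words.push_back(w);
    assert(!in.good());  // the loop ends only when the input is exhausted
    return words;
}
```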

Input Function: int verify_word (istream & in, const char * should_be)
Reads a word and makes sure it is identical to should_be. Returns 1 if the next word in the input is should_be. Otherwise, prints an error message and returns 0.

Input Function: char * read_line (istream & in)
Reads a line (skipping comments and blank lines) and strips leading white space. Allocates new storage for the line, and returns the char * string of the new storage. Returns 0 if there is no more data.

Input Function: int SkipSeparators (istream & in, int & at_sep, int dest_sep, char Separator)
Skip dest_sep - at_sep separators. Return 1 if ok, 0 if not enough separators were found. Reset at_sep to at_sep + number of skipped separators. In normal use, at_sep is set to 0 at the beginning of a line.

Input Function: int SkipSeparators (istream & in, int num_to_skip, char Separator)
Skip num_to_skip occurrences of Separator on the input stream in. This form of SkipSeparators may be simpler if the position on the input line is not important.
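The position-tracking form of SkipSeparators can be sketched as follows. The function names are hypothetical stand-ins written against standard streams, and the helper sketch_field is added only to show a typical use: extracting the n-th field of a delimited line.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Skip separators until at_sep reaches dest_sep, updating at_sep as
// separators are consumed. Returns 1 if ok, 0 if the input ran out first.
inline int sketch_skip_separators(std::istream& in, int& at_sep,
                                  int dest_sep, char separator) {
    assert(dest_sep >= at_sep);
    char c;
    while (at_sep < dest_sep) {
        if (!in.get(c)) return 0;
        if (c == separator) ++at_sep;
    }
    return 1;
}

// Usage: return the n-th (0-based) field of a separator-delimited line,
// or an empty string if the line has too few fields.
inline std::string sketch_field(const std::string& line, int n, char sep) {
    std::istringstream in(line);
    int at_sep = 0;  // set to 0 at the beginning of a line
    if (!sketch_skip_separators(in, at_sep, n, sep)) return std::string();
    std::string field;
    std::getline(in, field, sep);
    return field;
}
```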

The following classes are useful for keeping track of objects that need to be named, with classes for hash tables of objects and command scripts.

The ClassNameRegistry

Note that ClassNameRegistry is an obsolete feature and should be replaced by NamedClass (see section NamedClass Class).

The class ClassNameRegistry is used for registering names of classes in a hash table. Registering a class is intended to facilitate I/O and most importantly, to enable a program to create an object of certain type without knowing at compile time what it is. For example, a program could create an object based on data in a text file by using the ClassNameRegistry and the name the object is registered under.

A class should normally be registered using the following macro:

ClassNameRegistry Macro: RegisterClass (name, char * id, create_name, _init_name1, __init__name2)
Registers a class name. name is the name of the class to register, id is a char * string of the class name. create_name, _init_name1, __init__name2 should be identifiers that are unique within the scope of the file in which the call to RegisterClass is made. create_name will be used as the name of a function to create a new object of type name. _init_name1 and __init__name2 will be used for structures necessary for automatic initialization and entry of name into the ClassNameRegistry. These last two identifiers should never need to be used by the programmer again. If a class has already been registered under the same identifier, then the error is reported and the program aborted.

A class needs to have a void constructor to be registered. In addition, a class must have the following 3 members defined before it can be correctly registered:

To see if a class has been set up correctly, the value of the virtual function type should equal the value of the static function classID. For example, suppose we have a class cl which we have registered under "cl", and that x is a pointer to an instance of class cl. Then the expression

x->type() == cl::classID()

should be true.

When a class has been properly registered, the value returned by the static function classID should be the same as the value returned by ClassNameRegistry::ID. In the previous example, the expression

ClassNameRegistry::ID("cl") == cl::classID()

should be true. These two expressions can be used as consistency checks when registering a class.

These methods are also defined:

Method on ClassNameRegistry: static void add_class (const char * id, const CreateFcn creator)
Register the class id and a function that creates an object of the class type. This method is not intended for direct use. A class should be added to the ClassNameRegistry using the RegisterClass macro described above.

Method on ClassNameRegistry: static const char * ID (const char * id)
Return the canonical pointer corresponding to id.

Method on ClassNameRegistry: static void * create (const char * id)
Return a new object of the type corresponding to id.

NamedObject Class

The class NamedObject should be used as a base class for objects that have names. In addition to having a name, a NamedObject can also be provided with a help string containing useful information about the particular NamedObject. If a NamedObject is not provided with a help string, then a default help string is provided which says "No help available.".

Class NamedObject can be used in conjunction with class NameToPtr (see section NameToPtr Class) when a lookup table of objects is needed.

The following methods are available for NamedObject:

Method on NamedObject: NamedObject (void)
Void constructor for NamedObject class.

Method on NamedObject: NamedObject (const char * nm, const char * help)
Constructor for NamedObject class. Argument nm is a char * string containing the name of the NamedObject and help is a string containing help information. The default value of help is 0. Note that the instance of NamedObject constructed owns a copy of the string nm. However, the instance does not own a private copy of the help string help.

Method on NamedObject: ~NamedObject (void)
Destructor for NamedObject class.

Method on NamedObject: const char * name (void) const
Returns the char * string containing the name of the NamedObject.

Method on NamedObject: void set_name (const char * x)
Sets the name of the NamedObject to the name pointed to by x. Note that the NamedObject owns a copy of the string x.

Method on NamedObject: const char * help (void) const
Returns the char * string containing the help information for the NamedObject. If the NamedObject has no help string, then a string reading "No help available." is returned instead.

Method on NamedObject: void set_help (const char * h)
Sets the help string of the NamedObject to the string pointed to by h.

Method on NamedObject: void read_name (istream & in)
Reads a name from input stream in. The name must be preceded by Name = in the input stream.

Method on NamedObject: void write_name (ostream & out) const
Writes a name to output stream out. The name will be preceded by Name = and terminated with a newline.
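The Name = format used by read_name and write_name can be sketched with two free functions. These are hypothetical stand-ins, not the NamedObject methods, and they assume that Name, the equals sign, and the name itself are separated by white space and that the name is a single word.

```cpp
#include <cassert>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Write "Name = <name>" followed by a newline.
inline void sketch_write_name(std::ostream& out, const std::string& name) {
    assert(!name.empty() && name.find_first_of(" \t\n") == std::string::npos);
    out << "Name = " << name << "\n";
}

// Read "Name = <name>". Returns false (leaving name untouched) if the
// input does not start with the expected "Name =" prefix.
inline bool sketch_read_name(std::istream& in, std::string& name) {
    std::string key, eq, value;
    if (!(in >> key >> eq >> value)) return false;
    if (key != "Name" || eq != "=") return false;
    name = value;
    return true;
}

// Usage: write a name to a buffer and read it back.
inline std::string sketch_round_trip(const std::string& name) {
    std::stringstream buf;
    sketch_write_name(buf, name);
    std::string back;
    return sketch_read_name(buf, back) ? back : std::string();
}
```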

IdObject Class

The class IdObject is the type used for giving unique IDs to classes. It is derived from class NamedObject (see section NamedObject Class). In addition, it is needed for maintaining the `is_a' hierarchy. The `is_a' hierarchy allows querying of an object to see what classes it is derived from. Note that class IdObject is used only for these purposes of registering classes derived from NamedClass (see section NamedClass Class).

The two types NamedClassFunction and IdObjectFunction are defined in IdObject. The type NamedClassFunction is a pointer to a function of zero arguments returning a pointer to NamedClass (see section NamedClass Class). Type IdObjectFunction is a function taking a pointer to an IdObject as an argument with no return value. They are declared as follows:

typedef NamedClass * (*NamedClassFunction) (void);
typedef void IdObjectFunction(IdObject *);

The following methods are available:

Method on IdObject: IdObject (const char * name, NamedClassFunction createfn, IdObjectFunction *init_is_a, const char * help)
Constructor for IdObject. The argument name is a string containing the name to be used in looking up the object. The argument createfn is a function which should allocate a new object of the type associated with name. The argument init_is_a is a pointer to a function which initializes the `is_a' hierarchy for the type. Finally, help is a help string describing the class. The arguments createfn, init_is_a, and help all have a default value of 0.

Method on IdObject: ~IdObject (void)
Destructor for IdObject class.

Method on IdObject: static IdObject * id (const char * name)
Returns the IdObject associated with name. If no object is associated with name, then returns 0. The lookup of name is not case sensitive.

Method on IdObject: NamedClass * create (void) const
Returns a pointer to a newly allocated object of the same type as associated with the IdObject if there is such a type. Returns 0 otherwise.

Method on IdObject: int is_a (IdObject * x)
Returns 1 if the IdObject is an x. Otherwise, returns 0.

Method on IdObject: void add_is_a (IdObject * x)
Adds x to the `is_a' list of the IdObject.

Method on IdObject: static void apply_all (IdObjectFunction * fnc)
Calls the function pointed to by fnc on all objects of type IdObject created.

Method on IdObject: static const NameToPtr * lookup_table (void)
Returns a pointer of type NameToPtr * that points to the lookup table containing all objects of type IdObject.
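The combination of name lookup, object creation, and the `is_a' list can be sketched in a few lines. The class SketchId below is a hypothetical stand-in for IdObject built on std::map. It assumes that is_a is transitive and that every type is itself, neither of which is stated in the text above.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct SketchBase { virtual ~SketchBase() {} };
typedef SketchBase* (*SketchCreateFn)();

class SketchId {
public:
    SketchId(const char* name, SketchCreateFn createfn = 0)
        : name_(name), createfn_(createfn) {
        table()[lower(name_)] = this;  // register under a lower-case key
    }
    // Case-insensitive lookup; returns 0 if the name is unknown.
    static SketchId* id(const char* name) {
        std::map<std::string, SketchId*>::iterator it = table().find(lower(name));
        return it == table().end() ? 0 : it->second;
    }
    // Returns a new object, or 0 if no creation function was given.
    SketchBase* create() const { return createfn_ ? createfn_() : 0; }
    void add_is_a(SketchId* x) {
        assert(x != 0 && x != this);
        parents_.push_back(x);
    }
    int is_a(const SketchId* x) const {
        if (x == this) return 1;
        for (std::size_t i = 0; i < parents_.size(); ++i)
            if (parents_[i]->is_a(x)) return 1;
        return 0;
    }
private:
    static std::string lower(std::string s) {
        for (std::size_t i = 0; i < s.size(); ++i)
            s[i] = static_cast<char>(std::tolower(static_cast<unsigned char>(s[i])));
        return s;
    }
    static std::map<std::string, SketchId*>& table() {
        static std::map<std::string, SketchId*> t;  // built on first use
        return t;
    }
    std::string name_;
    SketchCreateFn createfn_;
    std::vector<SketchId*> parents_;
};

// A demo type and its creation function, used only for illustration.
struct SketchDeriv : SketchBase {};
inline SketchBase* create_sketch_deriv() { return new SketchDeriv; }
```

Keeping the table inside a function-local static avoids depending on the initialization order of static objects in different files, which matters when IDs are themselves static class members.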

NamedClass Class

The class NamedClass should be used as a base class for classes that need to have run-time type determination. Run-time type determination is convenient for I/O and is essential for programs that need to create objects whose exact type may not be known at compile-time. Class NamedClass also provides an `is_a' mechanism for expressing class relationships. The class IdObject (see section IdObject Class) is used for representing this hierarchy as well as for providing unique IDs for classes.

NamedClass methods

The following members and functions are available for use by all derived classes:

Method on NamedClass: int is_a (IdObject * x)
Returns 1 if the object is an x according to the `is_a' hierarchy information.

Method on NamedClass: void write (ostream & out)
Writes the object to the output stream out.

Method on NamedClass: static NamedClass * read_new (istream & in)
Reads in a text representation of a class derived from NamedClass and returns a pointer to a newly allocated object created based on the input. Returns 0 if there is an error during input.

NamedClass: ostream & operator << (ostream &out, const NamedClass nc)
Outputs the NamedClass object nc on stream out.

Deriving a class from NamedClass

In order to work properly, a class derived from NamedClass must have certain members in support of identification, I/O, and the `is_a' hierarchy.

All classes derived from NamedClass should have the following private member, which provides the unique IdObject for the class:

static IdObject ID;

All classes derived from NamedClass should have the following public member functions:

Method on NamedClass: static IdObject* classID (void)
Returns a pointer to the required static member ID described above.

Method on NamedClass: virtual IdObject* type (void) const
Also returns a pointer to the required static member ID described above. Note that this function is a virtual function and is intended for dynamic type determination of an object.

In addition, a class derived from NamedClass may need the following member functions:

Method on NamedClass: static void init_is_a (IdObject * self)
Initializes the `is_a' hierarchy for the derived class. This static member function is needed if a class is derived from a higher class. The argument self is a pointer to the ID member of the derived class. When a class derived from NamedClass does need the `is_a' hierarchy and hence this member function, a pointer to this member function should be passed as an argument on construction of the derived class's ID member.

Method on NamedClass: virtual int read_knowing_type (istream & in)
Reads a text representation of a class derived from NamedClass (see section NamedClass commands). Returns 1 if it has read the EndClassName = command that terminates the text of a NamedClass command script. Otherwise, returns 0. Note that the static member function NamedClass::read_new(), which calls this method for input, will read the opening ClassName = command, and will also try to read the EndClassName = command if this method returns a 0.

Method on NamedClass: virtual void write_knowing_type (ostream & out)
Writes the object to output stream out. This method is called by the method NamedClass::write() for output. NamedClass::write() will output the bracketing ClassName = and EndClassName =, while this method is responsible for everything in between.

Finally, if objects of the derived class will need to be created in situations where the exact type may not be known at compile time, such as for input and output, the derived class will need a void constructor.

An Example

Consider the following virtual class virt derived from NamedClass. Its declaration in a header file would be

class virt: public NamedClass
{       private:
                static IdObject ID;
        public:
                static  IdObject *classID(void) {return &virt::ID;}
                virtual IdObject *type(void) const {return &virt::ID;}

        // ... and whatever is needed for the class itself.
};

In its corresponding .cc file, we would need the initialization

IdObject virt::ID ("virt",0,0,"a demo pure virtual class");

Note that since this is a pure virtual class, the second and third arguments to the constructor of ID are both 0.

If we derive a class deriv from virt, the header file could look like


class deriv: public virt
{
    private:
        static IdObject ID;
        static void init_is_a(IdObject *self)
        {    self->add_is_a(virt::classID());
        }
        int i;

        virtual int read_knowing_type(istream &in)
        {   in >> i;
            return 0;
        }

        virtual void write_knowing_type(ostream &out) const
        {out << i << "\n";}
    public:
        static  IdObject *classID(void) {return &deriv::ID;}
        virtual IdObject *type(void) const {return &deriv::ID;}
        deriv(void) {i=0;}

        // ... and whatever is needed for the class itself.
};

To initialize deriv the .cc file would need to have


static NamedClass *create_deriv(void) {return new deriv;}
IdObject deriv::ID ("deriv", create_deriv, deriv::init_is_a,
    "a demo derived class\n");

Note that since deriv is derived from class virt, deriv has a static member function init_is_a. Also, the static function create_deriv is needed so that an object of class deriv can be created dynamically.

NamedClass commands

The class NamedClass supports a script style format for the input and output of NamedClass objects. All commands should have the form


CommandName = arg1 arg2 arg3 . . .

That is, a command has the format of the command name followed by an `=', followed by any arguments of the command.

Since NamedClass is meant to be the basic root class from which to derive other classes that need to be named, it supports just two rudimentary commands, one for beginning the description of an object and another for ending the description.

NamedClass command: ClassName = name_of_the_class
NamedClass command: EndClassName = name_of_the_class
The first command begins an object description, while the second command ends the description. The single argument of each command is the name of the class of the object to be input. This name must be identical to the name used when defining the required ID member (see section Deriving a class from NamedClass). An error is signaled if the class name arguments for matching ClassName = and EndClassName = commands are not the same.
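The bracketing rule can be sketched with a small reader that checks that the two class names match. This is a hypothetical illustration written against standard streams, not NamedClass::read_new(). It assumes that command names, equals signs, and arguments are separated by white space, and it simply collects everything between the two commands as a list of tokens.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Read one "ClassName = X ... EndClassName = X" description.
// Returns true only if both commands are present and name the same class.
inline bool sketch_read_description(std::istream& in, std::string& class_name,
                                    std::vector<std::string>& body) {
    std::string key, eq, tok;
    if (!(in >> key >> eq >> class_name)) return false;
    if (key != "ClassName" || eq != "=") return false;
    assert(!class_name.empty());
    body.clear();
    while (in >> tok) {
        if (tok == "EndClassName") {
            std::string end_name;
            return (in >> eq >> end_name) && eq == "=" && end_name == class_name;
        }
        body.push_back(tok);  // everything between the two commands
    }
    return false;  // ran out of input before EndClassName
}

// Usage: check a complete text for a well-formed description.
inline bool sketch_is_well_formed(const std::string& text) {
    std::istringstream in(text);
    std::string class_name;
    std::vector<std::string> body;
    return sketch_read_description(in, class_name, body);
}
```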

NameToPtr Class

Class NameToPtr provides a hash table for mapping the name of a thing to a pointer to a thing. The things must be derived from the class NamedObject (see section NamedObject Class).

The case sensitivity of lookups in the hash table can be controlled using the method ignore_case described below. By default, lookups are case-sensitive. The setting of ignore_case affects only lookups. It does not matter what its setting is at the time a name is added to the table.

For flexibility in lookups, insertions or deletions, special control flags are allowed to handle situations in which a name may or may not be already entered into a table. These are the ifold and ifnew flags. An ifnew flag is used to handle not finding a name in a table. Allowable values for this flag are:

An ifold flag is used to check that a name in the table doesn't already exist. Allowable values for this flag are:

The following methods are for lookups, insertions, and deletions:

Method on NameToPtr: NamedObject * FindOldName (const char * name, OptionIfNew ifnew)
Look up name in the hash table. If name is found, then a pointer to the object is returned. Otherwise, if ifnew is ErrorIfNew, then an error is reported and the function aborts. If ifnew is ZeroIfNew, then a 0 pointer is returned. A value of CreateIfNew for ifnew is not allowed. The default value of ifnew is ErrorIfNew.

Method on NameToPtr: void AddName (NamedObject * object, OptionIfOld ifold)
Insert object into the hash table. If there already exists an object associated with name and ifold has value ErrorIfOld then an error is reported and the function aborted. If ifold is not ErrorIfOld, then object is always added to the table, replacing any value previously having the same name as object. The default value of ifold is ReplaceIfOld.

Method on NameToPtr: void DeleteName (const char * name, OptionIfNew ifnew)
The entry with name is removed from the hash table. If name is not in the table and ifnew has value ErrorIfNew, then an error is reported and the function aborted. If name is not in the table and ifnew has value ZeroIfNew, then the call to this method has no effect on the hash table. A value of CreateIfNew for ifnew is not allowed. The default value of ifnew is ErrorIfNew.
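The effect of the ifnew and ifold flags can be sketched with a small table built on std::map. SketchTable and SketchNamed are hypothetical stand-ins for NameToPtr and NamedObject. The sketch throws std::runtime_error where the text says an error is reported and the function aborted, and it does not model case-insensitive lookup.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

enum SketchIfNew { ErrorIfNew, ZeroIfNew };
enum SketchIfOld { ErrorIfOld, ReplaceIfOld };

struct SketchNamed {
    std::string name;
    explicit SketchNamed(const std::string& n) : name(n) {}
};

class SketchTable {
public:
    // Return the object registered under name, or handle a miss per ifnew.
    SketchNamed* find_old_name(const std::string& name,
                               SketchIfNew ifnew = ErrorIfNew) const {
        std::map<std::string, SketchNamed*>::const_iterator it = table_.find(name);
        if (it != table_.end()) return it->second;
        if (ifnew == ErrorIfNew) throw std::runtime_error("unknown name: " + name);
        return 0;
    }
    // Insert object, replacing any previous entry unless ifold forbids it.
    void add_name(SketchNamed* object, SketchIfOld ifold = ReplaceIfOld) {
        assert(object != 0);
        if (ifold == ErrorIfOld && table_.count(object->name) != 0)
            throw std::runtime_error("duplicate name: " + object->name);
        table_[object->name] = object;
    }
    // Remove the entry for name; a miss is an error only with ErrorIfNew.
    void delete_name(const std::string& name, SketchIfNew ifnew = ErrorIfNew) {
        if (table_.erase(name) == 0 && ifnew == ErrorIfNew)
            throw std::runtime_error("unknown name: " + name);
    }
private:
    std::map<std::string, SketchNamed*> table_;
};
```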

These methods are also available:

Method on NameToPtr: void ignore_case (int ignore)
If ignore is 1 then all subsequent lookups into the hash table will not be case sensitive. If it is 0, then all subsequent lookups will be case sensitive. Note that case-sensitivity can be turned on or off dynamically using this method. Also, the setting of ignore_case only affects subsequent lookups. The setting at the time a name is added to the table is irrelevant. The default value for ignore is 1.

Method on NameToPtr: NameToPtr NameToPtr (int size)
Creates a new NameToPtr hash table with capacity size.

Method on NameToPtr: void Rehash (int newsize)
Increases the size of the hash table to newsize.

Method on NameToPtr: void ~NameToPtr (void)
Destructor for NameToPtr. Deallocates hash table.

Method on NameToPtr: int RetNumNames (void) const
Returns the number of names entered into the hash table.

Method on NameToPtr: int RetHashSize (void) const
Returns the current size of the hash table.

Method on NameToPtr: void ApplyAll (FunctionNameObj fun)
Apply the function fun to every entry in the hash table. The type FunctionNameObj is a pointer to a function which accepts a single parameter of type NamedObject *.

Method on NameToPtr: void ApplyAll (FunctionNameConstObj fun)
Apply the function fun to every entry in the hash table. The type FunctionNameConstObj is a pointer to a function which accepts a single parameter of type const NamedObject *.

Command Class

Class Command should be used for defining script commands. It is derived from class NamedObject (see section NamedObject Class). It provides functions for defining script commands, as well as reading and executing script files.

When defining a command, a function is associated with the command name. The function provides the definition for what the command is to do. The function must take three arguments of the following types in order:

  1. An input stream of class istream. This argument is intended to be the stream of the script input. By reading from this stream, the function can obtain any additional parameters or information that are needed to execute the associated command.
  2. A pointer to Command, type Command *. This is intended to be a pointer to the command object that actually invoked this command action function. This is mainly used for when the action function needs to know what command called it.
  3. An output stream of class ostream. The function can print the output of the command to this stream.

The function must return an integer. A value of 0 should be returned if the script should terminate execution, either due to an error or simply because the end of the script has been reached. Otherwise, a value of 1 should be returned. The type of a pointer to such a function is called CommandFunctionPtr and is defined as

typedef int (*CommandFunctionPtr)(istream &, Command *, ostream &);

The following methods are available for Command:

Method: Command Command (char *nm, CommandFunctionPtr c, const char *use)
Constructor for Command class. Argument nm is the name of the command, c is the function which defines the action the command performs, and use is a help string for describing what the command does. Both c and use have default values of 0.

Method: Command Command (char *alias_nm, Command * c)
Constructor for Command class. Argument alias_nm is a new alternative name for the command and c is a pointer to the Command object of the original command.

Method on Command: int execute (istream &in, ostream &log) const
Executes the command function of the command with in as the input stream argument and log as the output stream for logging the results.

The following static members are also available:

Method on Command: static Command * command (const char * nm)
Returns a pointer to the Command object associated with name nm.

Method on Command: static NameToPtr * command_table (const char * nm)
Returns a pointer to the table of all commands.

Method on Command: static void remove_from_table (char *nm)
Removes the command associated with nm from the command table.

Method on Command: static int read_command (istream &in, ostream &log)
Reads a command from in and executes it, logging the output of the command to log. A 0 is returned if the end of input is reached, and a 1 is returned otherwise.

Method on Command: static void read_script (istream &in, ostream &log, ostream *prompt)
Reads and executes commands from stream in, logging the results to stream log. Parameter prompt is a pointer to an output stream that is used to prompt the user for input, and has a default value of 0.
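The dispatch scheme described in this section can be sketched with a table of function pointers. Everything below is a hypothetical stand-in for the Command class: the table is an explicit std::map rather than a static member, the script format is assumed to be a command name followed by its arguments, and the reader stops on end of input, on an unknown command, or when an action returns 0.

```cpp
#include <cassert>
#include <istream>
#include <map>
#include <ostream>
#include <sstream>
#include <string>

struct SketchCommand;
typedef int (*SketchCommandFn)(std::istream&, SketchCommand*, std::ostream&);

struct SketchCommand {
    std::string name;
    SketchCommandFn action;
};

typedef std::map<std::string, SketchCommand> SketchCommandTable;

inline void sketch_define(SketchCommandTable& table, const std::string& name,
                          SketchCommandFn fn) {
    assert(fn != 0);
    SketchCommand c;
    c.name = name;
    c.action = fn;
    table[name] = c;
}

// Read one command name and run its action on the rest of the input.
inline int sketch_read_command(SketchCommandTable& table, std::istream& in,
                               std::ostream& log) {
    std::string name;
    if (!(in >> name)) return 0;  // end of input
    SketchCommandTable::iterator it = table.find(name);
    if (it == table.end()) {
        log << "unknown command " << name << "\n";
        return 0;
    }
    return it->second.action(in, &it->second, log);
}

inline void sketch_read_script(SketchCommandTable& table, std::istream& in,
                               std::ostream& log) {
    while (sketch_read_command(table, in, log)) {}
}

// Example actions: echo reads one argument, quit stops the script.
inline int sketch_echo(std::istream& in, SketchCommand* self, std::ostream& out) {
    std::string arg;
    in >> arg;
    out << self->name << ": " << arg << "\n";  // self tells us who called
    return 1;
}
inline int sketch_quit(std::istream&, SketchCommand*, std::ostream&) { return 0; }

// Usage: run a script held in a string and return the log.
inline std::string sketch_run(const std::string& script) {
    SketchCommandTable table;
    sketch_define(table, "echo", sketch_echo);
    sketch_define(table, "say", sketch_echo);  // an alias sharing one action
    sketch_define(table, "quit", sketch_quit);
    std::istringstream in(script);
    std::ostringstream log;
    sketch_read_script(table, in, log);
    return log.str();
}
```

The second argument of the action function is what lets echo and say share one function while still reporting the name under which they were invoked.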

EqualStrings

EqualStrings: int EqualStrings (const char * a, const char * b, int ignorecase)
Compares strings a and b. Returns 1 if the strings are equal, 0 if they are not equal. The argument ignorecase is optional, with a default value of 0. If ignorecase is 0, then the comparison is case sensitive. Otherwise, the comparison is not case sensitive.
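A minimal sketch of a comparison with the same calling convention follows. The name sketch_equal_strings is hypothetical, and the case folding uses std::tolower, which is an assumption about how the case-insensitive comparison is done.

```cpp
#include <cassert>
#include <cctype>

// Returns 1 if a and b are equal, 0 otherwise. With ignorecase nonzero,
// letters are compared after folding to lower case.
inline int sketch_equal_strings(const char* a, const char* b, int ignorecase = 0) {
    assert(a != 0 && b != 0);
    for (; *a && *b; ++a, ++b) {
        int ca = static_cast<unsigned char>(*a);
        int cb = static_cast<unsigned char>(*b);
        if (ignorecase) {
            ca = std::tolower(ca);
            cb = std::tolower(cb);
        }
        if (ca != cb) return 0;
    }
    return (*a == *b) ? 1 : 0;  // equal only if both strings ended together
}
```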

OrderSort

OrderSort: int * OrderSort (const float * value, int num_to_sort)
Takes the array of floats value and allocates and returns an array index such that value[index[i+1]] >= value[index[i]]. num_to_sort is the size of the array value. OrderSort is useful when you want to know the sorted order of an array of floats without rearranging the original array.
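The idea of sorting an index array instead of the values can be sketched with the standard library. The function sketch_order_sort is a hypothetical stand-in. Unlike OrderSort it returns a std::vector rather than a newly allocated int array, and it uses a stable sort so that equal values keep their original relative order.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Return index such that value[index[i+1]] >= value[index[i]],
// leaving the array value itself untouched.
inline std::vector<int> sketch_order_sort(const float* value, int num_to_sort) {
    assert(num_to_sort >= 0 && (value != 0 || num_to_sort == 0));
    std::vector<int> index(static_cast<std::vector<int>::size_type>(num_to_sort));
    for (int i = 0; i < num_to_sort; ++i) index[i] = i;
    std::stable_sort(index.begin(), index.end(),
                     [value](int a, int b) { return value[a] < value[b]; });
    return index;
}
```

For the values {0.5, -1.0, 2.0, 0.0}, the returned index is {1, 3, 0, 2}.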

Log2

The file log2.h provides a fast double precision function to compute the logarithm base 2 of a number.

log2: double log2 (double x)
Returns the logarithm base 2 of x.

log2: double clip_log2 (double p)
Returns the logarithm base 2 of p for positive p. When p <= 0.0, returns -1000.

LogGamma

The functions in LogGamma.h provide a fast way to compute the log Gamma(x), the natural logarithm of the Gamma function, and its first two derivatives.

LogGamma: double LogGamma (double x)
Returns the log Gamma(x).

LogGamma: double Psi (double x)
A commonly used alternative name for LogGamma(x) above.

LogGamma: double LogGamma_1 (double x)
Returns the first derivative of log Gamma(x).

LogGamma: void LogGamma_derivs (double x, double &log_gamma, double &log_gamma_1, double &log_gamma_2)
Returns the value of log Gamma(x) in log_gamma, the first derivative of log Gamma(x) in log_gamma_1 and the second derivative of log Gamma(x) in log_gamma_2.

LogGamma: double LogGamma_print_summary (ostream &out)
Prints LogGamma cache statistics onto stream out.

Multinomial

The file Multinomial.h declares functions for computing the multinomial probability of a sequence of counts. Both functions return probabilities in the form of class Prob (see section The Prob, ShortProb, LargeReal, ShortLargeReal classes).

Multinomial: Prob Multinomial (const float *counts, int num_dim)
Compute the number of ways that a sample with the sum over counts[i] elements can have the distribution given by counts. Argument num_dim is the length of the array counts.

Multinomial: Prob Multinomial (const float *counts, int num_dim, double SumCounts)
Compute the number of ways that a sample with the sum over counts[i] elements can have the distribution given by counts. Argument num_dim is the length of the array counts. Argument SumCounts is the sum of all the elements of counts.
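
As a rough illustration of the quantity involved, the sketch below computes the natural logarithm of the multinomial coefficient N! / (c_0! c_1! ...), where N is the sum of the counts. It assumes that real-valued counts are handled by generalizing the factorial through the Gamma function, and it returns a plain double instead of a Prob; the name log_multinomial_sketch is hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Natural log of the multinomial coefficient N! / (c_0! c_1! ...),
// with factorials generalized to real counts via lgamma(c + 1).
double log_multinomial_sketch(const float *counts, int num_dim) {
    double sum = 0.0;
    double log_denominator = 0.0;
    for (int i = 0; i < num_dim; ++i) {
        assert(counts[i] >= 0.0f);
        sum += counts[i];
        log_denominator += std::lgamma(counts[i] + 1.0);
    }
    return std::lgamma(sum + 1.0) - log_denominator;
}
```

Working in log space keeps the result representable for large samples, which is presumably one reason the library returns a Prob rather than a raw floating-point number.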

Regularizer Class

The class Regularizer and its derived classes provide ways of estimating a normalized probability distribution from a sample of observed counts. Regularizers allow a prior distribution to be specified. Such a prior distribution should incorporate information about what one expects the distribution to be in the absence of any observed samples. The prior distribution is especially important to avoid overfitting when the number of observed samples is small. Typically, as the observed sample size grows, the prior distribution is given less weight by a good regularizer.

The class Regularizer is an abstract class meant to provide the basic interface for deriving regularizer classes. The Regularizer class represents the set of elemental events of the probability distribution as tuples of alphabets (see section AlphabetTuple and BaseTuple Classes). The Regularizer class is a pure virtual class derived from class NamedClass (see section NamedClass Class) and supports the methods required of a NamedClass. It is also derived from class NamedObject (see section NamedObject Class) so that regularizers can be named.

Regularizer methods

In addition to the methods from NamedClass and NamedObject, Regularizers support the following methods:

Method on Regularizer: Regularizer (void)
Void constructor for class Regularizer.

Method on Regularizer: Regularizer (const Alphabet *a, const char *nm)
Constructor for class Regularizer. Sets the alphabet tuple of the regularizer to the singleton tuple containing the alphabet pointed to by a, and sets the name of the regularizer to the string pointed to by nm if nm is not 0. By default, nm has value 0.

Method on Regularizer: Regularizer (const AlphabetTuple *at, const char *nm)
Constructor for class Regularizer. Sets the alphabet tuple of the regularizer to a copy of the tuple pointed to by at, and sets the name of the regularizer to the string pointed to by nm, if nm is not 0. By default, nm has value 0.

Method on Regularizer: ~Regularizer (void)
Destructor for class Regularizer.

Method on Regularizer: virtual Regularizer * copy (void) const
Creates a new regularizer that is a copy of the current regularizer. The resulting regularizer is of the same type with all fields copied from the current regularizer. Note that this is an abstract virtual function, and must be defined in any class derived from class Regularizer.

Method on Regularizer: int alphabet_size (void) const
Returns the number of distinct base tuples of the alphabet tuple of the regularizer.

Method on Regularizer: const AlphabetTuple * alphabet_tuple (void) const
Returns a pointer to the alphabet tuple of the regularizer.

Method on Regularizer: void set_alphabet_tuple (AlphabetTuple* at)
Sets the alphabet tuple of the regularizer to the tuple pointed to by at. The regularizer then owns the tuple pointed to by at and so is responsible for deleting it. Note that the alphabet tuple cannot be reset once the input order of the regularizer has been set.

Method on Regularizer: void set_alphabet (Alphabet *a)
Sets the alphabet tuple of the regularizer to the singleton tuple containing the alphabet pointed to by a. Note that the alphabet tuple cannot be reset once the input order of the regularizer has been set.

Method on Regularizer: void print_order (ostream &out) const
Prints the order in which the regularizer's alphabet tuple elements are ordered. This order is in accord with the order obtained using the alphabet tuple's index() member function.

Method on Regularizer: void read_order (istream &in)
Reads the input order for base tuples of the regularizer's alphabet tuple.

Method on Regularizer: const int * input_order (void) const
Returns a pointer to an array of integers containing the input order of the regularizer. The integer at position i in the array is the index of the base tuple whose information (such as pseudocounts) should occur at position i for any input commands in which data needs to be specified for each base tuple. Note that the input order need not be the same as the order obtained from print_order.

Method on Regularizer: virtual void print_info (void) const
Prints useful information about the regularizer. Note that the kind of information printed will vary across different types of regularizers.

Method on Regularizer: static Regularizer * read_new (istream &in, IdObject * required_type)
Reads a new regularizer from input stream in and returns a pointer to it. The argument required_type is a pointer to the IdObject of the type of regularizer that should be read in. If the type of the regularizer read from stream in is different from required_type, then an error message is printed and 0 is returned.

Method on Regularizer: static Regularizer * read_new (const char *filename, IdObject * required_type)
Reads a new regularizer from the file named filename and returns a pointer to it. The argument required_type is a pointer to the IdObject of the type of regularizer that should be read in. If the type of the regularizer read from the file is different from required_type, then an error message is printed and 0 is returned.

Method on Regularizer: virtual void get_modified_counts (const float *TrainCounts, float *ModifiedCounts)
Returns an unnormalized probability distribution in array ModifiedCounts for the base tuples of the regularizer's alphabet tuple, using array TrainCounts as an observed sample distribution. In both array arguments, the index of the base tuple from the alphabet tuple's index() method is used for matching a base tuple to its corresponding count.

Method on Regularizer: void get_probs (const float *TrainCounts, float *probs)
Returns a normalized probability distribution in array probs for the base tuples of the regularizer's alphabet tuple, using array TrainCounts as an observed sample distribution. In both array arguments, the index of the base tuple from the alphabet tuple's index() method is used for matching a base tuple to its corresponding count.

Method on Regularizer: float encodingCostForColumnCounts (const float *RealProbs, const float *TrainCounts, float *EstProbs)
Returns a normalized probability distribution in array EstProbs for the elements of the regularizer's alphabet tuple, using array TrainCounts as an observed sample distribution. The return value of the function is the encoding cost in bits of the column with observed sample counts TrainCounts and actual probabilities of RealProbs.

Method on Regularizer: virtual void normalize (void)
Normalizes parameters of the regularizer for those that have an extra degree of freedom and may need to be normalized for numeric stability.

Method on Regularizer: int verify_partials1 (const float *TrainCounts, float tolerance)
Verifies that the first partial derivatives of the parameters are correct for the count vector TrainCounts within the tolerance given by tolerance. Returns 1 if the partials are correct. Returns 0 if they are not correct. The default value of tolerance is 0.01.

Method on Regularizer: int verify_partials2 (const float *TrainCounts, float tolerance)
Verifies that the second partial derivatives of the parameters are correct for the count vector TrainCounts within the tolerance given by tolerance. Returns 1 if the partials are correct. Returns 0 if they are not correct. The default value of tolerance is 0.01.

Regularizer commands

Class Regularizer supports the NamedClass script command format (see section NamedClass commands). In addition to the basic commands supported by NamedClass, all regularizers have the following commands:

Regularizer command: Alphabet = alphabet_name
Sets the alphabet tuple of the regularizer to the singleton tuple containing the Alphabet with name alphabet_name.

Regularizer command: AlphabetPair = name_1 name_2
Sets the alphabet tuple to the pair of alphabets with names name_1 and name_2.

Regularizer command: AlphabetTriple = name_1 name_2 name_3
Sets the alphabet tuple to the triple of alphabets with names name_1, name_2, and name_3.

Regularizer command: AlphabetTuple = n name_1 ... name_n
Sets the alphabet tuple to the tuple of n dimensions of alphabets with names name_1 ... name_n.

Regularizer command: Name = name
Sets the name of the regularizer to name.

Regularizer command: Order = bt_1 bt_2 ... bt_n
Modifies the input order of the regularizer. This means that the order in which the data in other commands (such as numbers for pseudocounts) is associated with the base tuples of the regularizer's alphabet tuple is changed so that the first number from the input is associated with base tuple bt_1, the second number with bt_2, and so on. Thus, there should be as many base tuple arguments following the Order = command name as there are elements in the regularizer's alphabet tuple. If this command does not appear in a regularizer script, then the default order is the order in which the base tuples are ordered according to the method index() from class AlphabetTuple (see section AlphabetTuple and BaseTuple Classes). When this command is used, the regularizer's alphabet tuple must already have been set.

Regularizer command: Comment = comments_to_end_of_line
Declares a comment within a regularizer description. Comments start at the Comment = command name and extend to the end of the line.
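
The effect of the Order = command on later per-tuple data can be sketched as a simple permutation. The code below is an illustration only: the function name reorder_input_sketch is hypothetical, and it assumes the convention documented for input_order(), namely that the integer at position i is the index() of the base tuple whose datum appears at position i of the input.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// input_order[i] is the index() of the base tuple whose datum appears at
// position i of the input. The result is laid out in index() order.
std::vector<float> reorder_input_sketch(const std::vector<int> &input_order,
                                        const std::vector<float> &input_values) {
    assert(input_order.size() == input_values.size());
    std::vector<float> by_index(input_values.size(), 0.0f);
    for (std::size_t i = 0; i < input_order.size(); ++i) {
        const int target = input_order[i];
        assert(target >= 0 &&
               static_cast<std::size_t>(target) < by_index.size());
        by_index[static_cast<std::size_t>(target)] = input_values[i];
    }
    return by_index;
}
```

With an input order of 2, 0, 1 and input values 10, 20, 30, the value 10 is stored for the base tuple with index 2, 20 for index 0, and 30 for index 1.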

MLPReg Class

Class MLPReg implements a maximum likelihood regularizer with a pseudocount prior distribution. In a MLPReg, for each base tuple of the alphabet tuple, there is a corresponding pseudocount. Given a sample of observed counts, the modified (posterior) count for a base tuple is computed by adding the observed count for the base tuple to the base tuple's corresponding pseudocount.
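
The pseudocount computation described above can be sketched in a few lines. This is a standalone illustration, not the MLPReg implementation: the function names are hypothetical, and std::vector is used in place of the raw float arrays of the library interface.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Modified (posterior) count: observed count plus pseudocount, per base tuple.
std::vector<float> pseudocount_modified_counts(
        const std::vector<float> &train_counts,
        const std::vector<float> &pseudocounts) {
    assert(train_counts.size() == pseudocounts.size());
    std::vector<float> modified(train_counts.size());
    for (std::size_t i = 0; i < train_counts.size(); ++i) {
        modified[i] = train_counts[i] + pseudocounts[i];
    }
    return modified;
}

// Divide each modified count by their sum to obtain probabilities.
std::vector<float> normalize_counts(const std::vector<float> &modified) {
    double total = 0.0;
    for (float c : modified) total += c;
    assert(total > 0.0);
    std::vector<float> probs(modified.size());
    for (std::size_t i = 0; i < modified.size(); ++i) {
        probs[i] = static_cast<float>(modified[i] / total);
    }
    return probs;
}
```

For observed counts 3, 0, 1, 0 and pseudocounts of 1 for every base tuple, the modified counts are 4, 1, 2, 1 and the normalized estimate is 0.5, 0.125, 0.25, 0.125.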

MLPReg methods

In addition to the methods inherited from Regularizer (see section Regularizer methods), class MLPReg supports the following methods:

Method on MLPReg: MLPReg (void)
Void constructor for class MLPReg.

Method on MLPReg: MLPReg (const Alphabet *a, istream &in, const char *name)
Constructor for class MLPReg. Sets the alphabet tuple of the regularizer to the singleton tuple containing the alphabet pointed to by a, and sets the name to the string pointed to by name. Other information on the regularizer, such as the pseudocounts, is read from stream in.

Method on MLPReg: MLPReg (const Alphabet *a, const float *ps, const char *name)
Constructor for class MLPReg. Sets the alphabet tuple of the regularizer to the singleton tuple containing the alphabet pointed to by a, sets the pseudocounts to the values in the array ps, and sets the name to the string pointed to by name. The default value of name is 0.

Method on MLPReg: MLPReg (const AlphabetTuple *at, const float *ps, const char *name)
Constructor for class MLPReg. Sets the alphabet tuple of the regularizer to a copy of the alphabet tuple pointed to by at, sets the pseudocounts to the values in the array ps, and sets the name to the string pointed to by name. The default value of name is 0.

Method on MLPReg: ~MLPReg (void)
Destructor for class MLPReg.

Method on MLPReg: const float * pseudocounts (void) const
Returns a pointer to an array of floats containing the pseudocounts of the regularizer's prior distribution.

Method on MLPReg: void set_pseudocounts (const float *ps)
Sets the pseudocounts of the regularizer to the values contained in the array ps. The element of array ps indexed by i should correspond to the base tuple indexed by i according to the alphabet tuple's index() method.

Method on MLPReg: void freeze_dist (void)
Freezes the pseudocount distribution of the regularizer. After invoking this method, the relative sizes of the pseudocounts will remain the same, and only the sum of the pseudocounts can change, until the method unfreeze_dist is invoked.

Method on MLPReg: void unfreeze_dist (void)
Unfreezes the pseudocounts distribution of the regularizer, so that the relative values of the pseudocounts can be modified.

MLPReg commands

Along with the commands supported by its parent class Regularizer (see section Regularizer commands), MLPReg has one additional command:

MLPReg command: Pseudocounts = pc_1 pc_2 ... pc_n
Sets the pseudocounts of the MLPReg to the numbers pc_1 ... pc_n, with one number per base tuple of the regularizer's alphabet tuple. The order in which pseudocounts are associated with base tuples can be changed with the Order = command (see section Regularizer commands).

MLZReg Class

Class MLZReg implements a maximum likelihood regularizer with a uniform prior distribution. The posterior counts are computed from observed counts by adding a positive number (the zero offset) to each of the observed counts for base tuples. The number added to the observed counts is the same for each base tuple. Thus, an MLZReg is just like an MLPReg whose pseudocounts are all the same.
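
A minimal sketch of the zero-offset estimate follows. It is not the MLZReg implementation; the name zero_offset_probs is hypothetical. Its main purpose is to show that base tuples with an observed count of 0 still receive a nonzero probability.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Add the same zero offset to every observed count, then normalize.
std::vector<float> zero_offset_probs(const std::vector<float> &train_counts,
                                     float zero_offset) {
    assert(zero_offset > 0.0f && !train_counts.empty());
    double total = 0.0;
    for (float c : train_counts) total += c + zero_offset;
    std::vector<float> probs(train_counts.size());
    for (std::size_t i = 0; i < train_counts.size(); ++i) {
        probs[i] = static_cast<float>((train_counts[i] + zero_offset) / total);
    }
    return probs;
}
```

For observed counts 0, 0, 2, 0 and a zero offset of 0.5, the estimate is 0.125, 0.125, 0.625, 0.125. A smaller offset moves the estimate closer to the raw observed frequencies.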

MLZReg methods

In addition to the methods inherited from Regularizer (see section Regularizer methods), class MLZReg supports the following methods:

Method on MLZReg: MLZReg (void)
Void constructor for class MLZReg.

Method on MLZReg: MLZReg (const Alphabet *a, istream &in, const char *name)
Constructor for class MLZReg. Sets the alphabet tuple of the regularizer to the singleton tuple containing the alphabet pointed to by a, and sets the name to the string pointed to by name. Other information for the regularizer, such as the zero offset, is read from stream in.

Method on MLZReg: MLZReg (const Alphabet *a, float *zofs, const char *name)
Constructor for class MLZReg. Sets the alphabet tuple of the regularizer to the singleton tuple containing the alphabet pointed to by a, sets the zero offset to the value zofs, and sets the name to the string pointed to by name. The default value of name is 0, and the default value of zofs is 0.0001.

Method on MLZReg: MLZReg (const AlphabetTuple *at, float *zofs, const char *name)
Constructor for class MLZReg. Sets the alphabet tuple of the regularizer to a copy of the alphabet tuple pointed to by at, sets the zero offset to the value zofs, and sets the name to the string pointed to by name. The default value of name is 0, and the default value of zofs is 0.0001.

Method on MLZReg: void set_zero_offset (float *zofs)
Sets the zero offset of the regularizer to zofs.

Method on MLZReg: float zero_offset (void) const
Returns the zero offset of the regularizer.

MLZReg commands

In addition to the commands from Regularizer, MLZReg has the following command:

MLZReg command: ZeroOffset = zero_offset
Sets the zero offset of the MLZReg to zero_offset.

DirichletReg Class

Class DirichletReg implements regularizers that use Dirichlet mixtures for prior distributions. As mixtures, class DirichletReg requires parameters for each of the component distributions as well as a mixture coefficient for each of the components. Since the components are Dirichlet distributions, the parameters required of an individual component are the pseudocounts, one for each base tuple of the regularizer's alphabet tuple.
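
To make the role of the mixture coefficients and per-component pseudocounts concrete, the sketch below applies the standard Dirichlet mixture formulas: the posterior weight of component j is proportional to its mixture coefficient times B(alpha_j + n) / B(alpha_j), where B is the multivariate Beta function and n is the vector of observed counts, and the estimated probability of base tuple i is the weighted average of (n_i + alpha_ji) / (|n| + |alpha_j|). Whether DirichletReg computes exactly these quantities is an assumption; the struct and function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

struct DirichletMixtureSketch {
    std::vector<double> mix;                 // mixture coefficient per component
    std::vector<std::vector<double>> alpha;  // pseudocounts per component
};

// Natural log of the multivariate Beta function B(a).
static double log_beta(const std::vector<double> &a) {
    double sum = 0.0, log_prod = 0.0;
    for (double x : a) { sum += x; log_prod += std::lgamma(x); }
    return log_prod - std::lgamma(sum);
}

// Posterior probability of each component given the observed counts.
std::vector<double> component_posteriors(const DirichletMixtureSketch &m,
                                         const std::vector<double> &counts) {
    assert(m.mix.size() == m.alpha.size() && !m.mix.empty());
    std::vector<double> log_w(m.mix.size());
    double max_log = -std::numeric_limits<double>::infinity();
    for (std::size_t j = 0; j < m.mix.size(); ++j) {
        assert(m.alpha[j].size() == counts.size());
        std::vector<double> a_plus_n(counts.size());
        for (std::size_t i = 0; i < counts.size(); ++i)
            a_plus_n[i] = m.alpha[j][i] + counts[i];
        log_w[j] = std::log(m.mix[j]) + log_beta(a_plus_n) - log_beta(m.alpha[j]);
        max_log = std::max(max_log, log_w[j]);
    }
    // Subtract the maximum before exponentiating to avoid underflow.
    std::vector<double> post(log_w.size());
    double total = 0.0;
    for (std::size_t j = 0; j < log_w.size(); ++j) {
        post[j] = std::exp(log_w[j] - max_log);
        total += post[j];
    }
    for (double &p : post) p /= total;
    return post;
}

// Posterior mean estimate of the base tuple probabilities.
std::vector<double> posterior_mean_probs(const DirichletMixtureSketch &m,
                                         const std::vector<double> &counts) {
    const std::vector<double> post = component_posteriors(m, counts);
    double n_sum = 0.0;
    for (double c : counts) n_sum += c;
    std::vector<double> probs(counts.size(), 0.0);
    for (std::size_t j = 0; j < post.size(); ++j) {
        double a_sum = 0.0;
        for (double a : m.alpha[j]) a_sum += a;
        for (std::size_t i = 0; i < counts.size(); ++i)
            probs[i] += post[j] * (counts[i] + m.alpha[j][i]) / (n_sum + a_sum);
    }
    return probs;
}
```

With a single component the estimate reduces to the pseudocount rule: pseudocounts 1, 1 and observed counts 3, 1 give probabilities 2/3 and 1/3.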

DirichletReg methods

In addition to the methods inherited from Regularizer (see section Regularizer methods), class DirichletReg supports the following methods:

Method on DirichletReg: DirichletReg (void)
Void constructor for class DirichletReg.

Method on DirichletReg: DirichletReg (const Alphabet *a, istream &in, const char *name)
Constructor for class DirichletReg. Sets the alphabet tuple of the regularizer to the singleton tuple containing the alphabet pointed to by a, and sets the name to the string pointed to by name. Other information on the regularizer, such as the alphas and the mixture coefficients, is read from stream in.

Method on DirichletReg: DirichletReg (const Alphabet *a, const char *name, int size)
Constructor for class DirichletReg. Sets the alphabet tuple of the regularizer to the singleton tuple containing the alphabet pointed to by a, sets the name to the string pointed to by name, and sets the number of components of the Dirichlet mixture to size. The default value of name is 0, and the default value of size is 0.

Method on DirichletReg: DirichletReg (const DirichletReg & dreg)
Copy constructor for class DirichletReg.

Method on DirichletReg: DirichletReg (const MLPReg & mlpreg)
Conversion constructor for class DirichletReg from class MLPReg. The prior distribution of mlpreg is converted into the sole component of a one-component Dirichlet mixture.

Method on DirichletReg: DirichletReg (const MLZReg & mlzreg)
Conversion constructor for class DirichletReg from class MLZReg. The uniform prior distribution of mlzreg is converted into the sole component of a one-component Dirichlet mixture.

Method on DirichletReg: ~DirichletReg (void)
Destructor for class DirichletReg.

Method on DirichletReg: DirichletReg * posterior_mixture (const float *TrainCounts)
Returns a pointer to a new DirichletReg representing the correct posterior distribution using the current DirichletReg as the prior and TrainCounts as an observed sample.

Method on DirichletReg: void AddComponent (float MixCoeff, const float *comp)
Adds a new component to the Dirichlet mixture with a mixture coefficient of MixCoeff, and pseudocounts from the array comp. The argument comp should be a pointer to an array of floats with length equal to the size of the alphabet of the regularizer.

Method on DirichletReg: void print_ordered_component (ostream &out, int comp_num) const
Print the bases of the alphabet of the regularizer sorted by how much each letter is favored by component comp_num.

Method on DirichletReg: void get_moments (const float *TrainCounts, double *ex_prob, double *ex_prob2)
Returns the expected values of the probabilities in the array ex_prob on seeing TrainCounts. If ex_prob2 is non-null, return the expected values of the squares of the probabilities in ex_prob2. The default value of ex_prob2 is 0.

Method on DirichletReg: int num_components (void) const
Returns the number of components of the Dirichlet mixture.

Method on DirichletReg: void set_component (int comp_num, int lett, float z)
Sets the pseudocount for the base with index lett in the component comp_num to the value z.

Method on DirichletReg: void scale_component (int comp_num, float multiplier)
Rescales the component numbered comp_num by the factor multiplier.

Method on DirichletReg: void delete_component (int comp_num)
Removes the component numbered comp_num from the Dirichlet mixture. If comp_num is not the highest component number of the mixture, the highest numbered component before the deletion replaces the deleted component at position comp_num.

Method on DirichletReg: void set_mixture (int comp_num, float mix_coeff)
Sets the mixture coefficient for the component numbered comp_num to the value mix_coeff.

Method on DirichletReg: float mixture_coeff (int comp_num) const
Returns the mixture coefficient for the component numbered comp_num.

Method on DirichletReg: double sum_component (int comp_num) const
Returns the sum of the pseudocounts for the component numbered comp_num.

Method on DirichletReg: const float * component (int comp_num) const
Returns a pointer to an array of floats containing the pseudocounts for the component numbered comp_num.

Method on DirichletReg: const float component (int comp_num, int lett) const
Returns the pseudocount for the base tuple with index lett for the component numbered comp_num.

Method on DirichletReg: void freeze_components (void)
Freezes the pseudocounts for each component of the mixture. After invoking this method, the relative sizes of the pseudocounts within each component remain the same, and only the sum of the pseudocounts for each component can be changed, until the method unfreeze_components is invoked.

Method on DirichletReg: void unfreeze_components (void)
Unfreezes the pseudocounts of the components, so that the relative sizes of the pseudocounts can be modified.

Method on DirichletReg: void freeze_mixture (void)
Freezes the mixture coefficients of the mixture. After invoking this method, the relative sizes of the mixture coefficients remain the same, and only the sum of the mixture coefficients can be changed, until the method unfreeze_mixture is invoked.

Method on DirichletReg: void unfreeze_mixture (void)
Unfreezes the mixture coefficients of the mixture, so that the relative sizes of the coefficients can be modified.

Method on DirichletReg: void component_probs (const float * TrainCounts, double & SumTrainCounts, double *comp_probs, double *log_sum)
Returns the probability of each component of the mixture, given the observed sample from the array TrainCounts, in the array pointed to by comp_probs. The array pointed to by TrainCounts should be an array the size of the regularizer's alphabet tuple, and the array pointed to by comp_probs should be of length equal to the number of components of the mixture. The sum of the counts in the array TrainCounts is also returned in the argument SumTrainCounts. If log_sum is non-null, then the natural logarithm of the sum over all components of the mixture coefficient times the probability of TrainCounts given each component is returned in the location pointed at by log_sum. The default value of log_sum is 0.

Method on DirichletReg: const double * component_probs (void) const
Returns the component probabilities cached by the last call to get_modified_counts, get_moments, or log_probability.

Method on DirichletReg: double log_probability (const float *TrainCounts, float *deriv1, float *deriv2)
Returns the natural logarithm of the probability of the counts in the array TrainCounts. If deriv1 is non-null, returns the values of the first partial derivatives of the regularizer parameters in the array pointed to by deriv1. If deriv2 is non-null, returns the values of the second partial derivatives of the regularizer parameters in the array pointed to by deriv2. The default value of both deriv1 and deriv2 is 0.

Method on DirichletReg: Prob Probability (const float *TrainCounts)
Returns the probability of the counts in the array TrainCounts.

Method on DirichletReg: Prob UnorderedProbability (const float *TrainCounts)
Returns the probability of the counts in the array TrainCounts, regarding the counts of the array TrainCounts as an unordered vector.
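
The renumbering rule documented for delete_component (the highest numbered component takes the place of the deleted one) is the usual "swap with last" removal. The sketch below shows that rule on a generic std::vector; the function name is hypothetical and the code does not reflect how DirichletReg stores its components.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Remove element comp_num by moving the last element into its slot.
// Runs in constant time but does not preserve the order of the elements.
template <typename T>
void delete_by_swap_with_last(std::vector<T> &components, std::size_t comp_num) {
    assert(comp_num < components.size());
    if (comp_num + 1 != components.size()) {
        components[comp_num] = std::move(components.back());
    }
    components.pop_back();
}
```

A practical consequence is that component numbers saved before a deletion may refer to a different component afterwards, so callers that hold on to a comp_num should refresh it after calling delete_component.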

DirichletReg commands

In addition to the commands supported by Regularizer (see section Regularizer commands), DirichletReg has commands for specifying the parameters of a Dirichlet mixture.

The following commands specify parameters global to a DirichletReg:

DirichletReg command: AlphaChar = alphabet_size
Specifies the size of the alphabet tuple of the regularizer. If alphabet_size does not match the size of the regularizer's alphabet tuple, an error is signalled.

DirichletReg command: NumDistr = num_components
Sets the number of components that the Dirichlet mixture of the regularizer should have to num_components.

The following commands are for specifying a component of the DirichletReg.

DirichletReg command: Number = component_num
Sets the number of the current component of the Dirichlet mixture to read from input to component_num. All data from succeeding commands in the file will be for the component component_num until another Number = command is encountered. The numbering of components starts at 0. This command should be followed immediately by the corresponding Mixture = and Alpha = commands for the component numbered component_num. The Number = commands should occur in ascending order by component_num. If they do not, a warning message is printed.

DirichletReg command: Mixture = mixture_coeff
Sets the mixture coefficient of the current component (the component of the last Number = command encountered) to the value mixture_coeff.

DirichletReg command: Alpha = pc_sum pc_1 pc_2 ... pc_n
Sets the pseudocounts of the current component to the values pc_1 ... pc_n, with one number pc_i per base tuple of the regularizer's alphabet tuple. These pseudocounts are associated with base tuples of the regularizer's alphabet tuple according to the regularizer's input order, which can be changed with the command Order = (see section Regularizer commands). Note that the first argument pc_sum should be the sum of all of the following pc_i.

DirichletReg command: FullUpdate = comment
DirichletReg command: QUpdate = comment
DirichletReg command: StructID = comment
These commands are currently treated as comments.

GribskovReg Class

Class GribskovReg implements a regularizer using Gribskov's average score method. The parameters of a GribskovReg are the elements of the square alphabet_size()*alphabet_size() score matrix along with the background probabilities of each of the base tuples of the alphabet tuple. The modified count of the base tuple indexed by i is computed by multiplying the background probability of base tuple i by the exponential of the inner product of row i of the score matrix and the vector of observed counts.
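
The modified-count rule above can be sketched as follows. This is an illustration rather than the GribskovReg implementation: the function name is hypothetical, the score matrix is passed as a row-major array, and the scores are assumed to be already expressed as natural logarithms (no conversion by a log base is applied).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// modified[i] = background[i] * exp( sum_j scores[i][j] * counts[j] ),
// with the n-by-n score matrix stored in row-major order.
std::vector<double> gribskov_modified_counts(const std::vector<double> &background,
                                             const std::vector<double> &scores,
                                             const std::vector<double> &counts) {
    const std::size_t n = background.size();
    assert(counts.size() == n && scores.size() == n * n);
    std::vector<double> modified(n);
    for (std::size_t i = 0; i < n; ++i) {
        double inner = 0.0;
        for (std::size_t j = 0; j < n; ++j) {
            inner += scores[i * n + j] * counts[j];
        }
        modified[i] = background[i] * std::exp(inner);
    }
    return modified;
}
```

When every score is 0, the modified counts equal the background probabilities regardless of the observed counts, so the scores act as corrections on top of the background distribution.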

GribskovReg methods

In addition to the methods inherited from class Regularizer (see section Regularizer methods), class GribskovReg provides the following methods:

Method on GribskovReg: GribskovReg (void)
Void constructor for class GribskovReg.

Method on GribskovReg: GribskovReg (const Alphabet *a, istream &in, const char *name, double l_base)
Constructor for class GribskovReg. Sets the alphabet tuple of the regularizer to the singleton tuple containing alphabet a, and sets the name of the regularizer to name. Other information for constructing the regularizer, such as the score matrix and background probabilities, is read from the stream in. The argument l_base is the natural logarithm of the base in which numbers from the input stream in are to be interpreted.

Method on GribskovReg: GribskovReg (const Alphabet *a, const char *name)
Constructor for class GribskovReg. Sets the alphabet tuple of the regularizer to the singleton tuple containing alphabet a, and sets the name of the regularizer to name. The default value of name is 0.

Method on GribskovReg: GribskovReg (const AlphabetTuple *at, const char *name)
Constructor for class GribskovReg. Sets the alphabet tuple of the regularizer to a copy of the tuple at, and sets the name of the regularizer to name. The default value of name is 0.

Method on GribskovReg: ~GribskovReg (void)
Destructor for class GribskovReg.

Method on GribskovReg: double log_base (void) const
Returns the natural logarithm of the base that the input and output of the regularizer's score matrix are expressed in.

Method on GribskovReg: void set_log_base (double l_base)
Sets the natural logarithm of the base that the input and output of the regularizer's score matrix are expressed in.

Method on GribskovReg: float & element (int i, int j)
Returns a reference to the element at row i and column j in the regularizer's score matrix.

Method on GribskovReg: float element (int i, int j) const
Returns value of the element at row i and column j in the regularizer's score matrix.

Method on GribskovReg: float & background (int i)
Returns a reference to the background probability for the base tuple with index i.

Method on GribskovReg: float background (int i) const
Returns the value of the background probability for the base tuple with index i.

GribskovReg commands

In addition to the Regularizer commands (see section Regularizer commands), GribskovReg has the following commands:

GribskovReg command: Background = p_1 p_2 ... p_n
Sets the background probabilities of the base tuples for the regularizer to p_1 ... p_n, where n is the size of the regularizer's alphabet tuple. The order in which the probabilities are associated with base tuples can be changed using the Regularizer command Order = (see section Regularizer commands).

GribskovReg command: Scores = s_11 s_12 ... s_nn
Sets the scores of the score matrix for the regularizer to s_11 ... s_nn, where n is the size of the regularizer's alphabet tuple. The n^2 scores should be in row major order. The order in which the scores are associated with base tuples can be changed using the Regularizer command Order = (see section Regularizer commands).

GribskovReg command: LogBase = l_base
Sets the natural logarithm of the base of the logarithm in which the score matrix elements are expressed for input and output.

GribskovReg command: Base = base
Sets the base of the logarithm in which the score matrix elements are expressed for input and output.

SubstPseudoReg Class

The class SubstPseudoReg implements substitution matrices as regularizers. In addition to the basic substitution matrix, there are options for adding pseudocounts and scaled counts. Addition of pseudocounts can improve performance when the sample size is 0, while addition of scaled counts can help when the sample size is very large.

The parameters for a SubstPseudoReg include the elements of the substitution matrix, which has alphabet_size()*alphabet_size() many entries. The entry at row i and column j of the matrix should be the probability of base tuple i given a sample of size 1 containing base tuple j.

If the option to add pseudocounts is used, then the parameters also include one pseudocount for each base tuple of the alphabet tuple.

When neither the pseudocounts nor the scaled counts option is used, the modified count of the base tuple with index i is computed as the inner product of row i of the matrix with the vector of observed counts. If the pseudocounts option is used, then the pseudocount corresponding to base tuple i is added to the above inner product to obtain the modified count. When the scaled counts option is used, the observed count from the sample for base tuple i is scaled (multiplied) by the total size of the observed sample and added to the above inner product to get the modified count.
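
The three cases above can be sketched in one function. This is an illustration under stated assumptions, not the SubstPseudoReg implementation: the function name and the two boolean flags are hypothetical, and the substitution matrix is passed as a row-major n-by-n array.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// modified[i] = sum_j matrix[i][j] * counts[j]
//             + pseudocounts[i]          (if use_pseudocounts)
//             + counts[i] * total        (if use_scaled_counts)
// where total is the sum of the observed counts.
std::vector<double> subst_modified_counts(const std::vector<double> &matrix,
                                          const std::vector<double> &pseudocounts,
                                          const std::vector<double> &counts,
                                          bool use_pseudocounts,
                                          bool use_scaled_counts) {
    const std::size_t n = counts.size();
    assert(matrix.size() == n * n);
    assert(!use_pseudocounts || pseudocounts.size() == n);
    double total = 0.0;
    for (double c : counts) total += c;
    std::vector<double> modified(n);
    for (std::size_t i = 0; i < n; ++i) {
        double value = 0.0;
        for (std::size_t j = 0; j < n; ++j) {
            value += matrix[i * n + j] * counts[j];
        }
        if (use_pseudocounts) value += pseudocounts[i];
        if (use_scaled_counts) value += counts[i] * total;
        modified[i] = value;
    }
    return modified;
}
```

The sketch also shows why the two options help at the extremes: with an empty sample the inner product is 0 for every base tuple, so only the pseudocounts contribute, while for a large sample the scaled term grows with the square of the sample size and dominates the matrix term.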

SubstPseudoReg methods

In addition to the methods inherited from class Regularizer (see section Regularizer methods), class SubstPseudoReg supports the following methods:

Method on SubstPseudoReg: SubstPseudoReg (void)
Void constructor for class SubstPseudoReg.

Method on SubstPseudoReg: SubstPseudoReg (const Alphabet *a, istream &in, const char *name)
Constructor for class SubstPseudoReg. Sets the alphabet tuple of the regularizer to the singleton tuple containing alphabet a, and sets the name of the regularizer to name. Other information for constructing the regularizer, such as the substitution matrix, is read from the stream in.

Method on SubstPseudoReg: SubstPseudoReg (const Alphabet *a, const char *name)
Constructor for class SubstPseudoReg. Sets the alphabet tuple of the regularizer to the singleton tuple containing alphabet a, and sets the name of the regularizer to name. The default value of name is 0.

Method on SubstPseudoReg: SubstPseudoReg (const AlphabetTuple *at, const char *name)
Constructor for class SubstPseudoReg. Sets the alphabet tuple of the regularizer to a copy of the tuple at, and sets the name of the regularizer to name. The default value of name is 0.

Method on SubstPseudoReg: ~SubstPseudoReg (void)
Destructor for class SubstPseudoReg.

Method on SubstPseudoReg: int num_columns (void) const
Returns the number of columns of the substitution matrix for the regularizer. If the regularizer uses pseudocounts, then the number of columns is one greater than the size of the alphabet tuple. Otherwise, it is equal to the size of the alphabet tuple.

Method on SubstPseudoReg: int use_scaled_counts (void) const
Returns a nonzero integer if the regularizer is set to use scaled counts. Returns 0 otherwise.

Method on SubstPseudoReg: void use_scaled_counts (int i)
Turns on the use of scaled counts if i is non-zero. Turns off the use of scaled counts if i is 0.

Method on SubstPseudoReg: int use_pseudocounts (void) const
Returns a nonzero integer if the regularizer is set to use pseudocounts. Returns 0 otherwise.

Method on SubstPseudoReg: void use_pseudocounts (int i)
Turns on the use of pseudocounts if i is non-zero. Turns off the use of pseudocounts if i is 0.

Method on SubstPseudoReg: void freeze_columns (void)
Freezes the columns of the substitution matrix so that optimization of the regularizer's parameters will only adjust sums of the columns of the matrix rather than the entire matrix.

Method on SubstPseudoReg: void unfreeze_columns (void)
Turns off freezing of columns on optimization of parameters.

Method on SubstPseudoReg: void freeze_pseudocounts (void)
Freezes the pseudocounts so that the relative sizes of the pseudocounts cannot be changed.

Method on SubstPseudoReg: void unfreeze_pseudocounts (void)
Turns off freezing of pseudocounts.

Method on SubstPseudoReg: float element (int row, int col) const
Returns the element at row row and column col of the substitution matrix.

Method on SubstPseudoReg: void set_element (int row, int col, float val)
Sets the element at row row and column col of the substitution matrix to the value val.

Method on SubstPseudoReg: float min_element (int row, int col) const
Returns the minimum allowable value for the element at row row and column col of the substitution matrix.

Method on SubstPseudoReg: float sum_col (int col) const
Returns the sum across all rows of column col of the substitution matrix.

SubstPseudoReg commands

In addition to the commands supported by Regularizer (see section Regularizer commands), SubstPseudoReg has the following commands:

SubstPseudoReg command: Order = bt_1 bt_2 ... bt_n option_word
Sets the input order of the regularizer as with the command of the same name from class Regularizer (see section Regularizer commands). In addition, an option word should follow the base tuple arguments. The argument option_word should be one of the following:

  1. subst_only. Use neither the pseudocounts nor the scaled counts option.
  2. pseudocounts. Use the pseudocounts option.
  3. scaled_counts. Use the scaled counts option.
  4. pseudocounts+scaled_counts. Use both the pseudocounts and the scaled counts options.

SubstPseudoReg command: Subst = sm_11 sm_12 ... sm_nn
SubstPseudoReg command: Subst = sm_11 sm_12 ... sm_1n pc_1 ... sm_n1 sm_n2 ... sm_nn pc_n
Sets the elements of the substitution matrix to sm_11 ... sm_nn. If the regularizer uses pseudocounts, the second form of the command should be used. The substitution matrix elements should be in row major order. In the second form, the pseudocount for each row of the matrix follows the last element of the row.
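
The value layout of the two forms can be illustrated by a small stand-alone sketch that separates a flat list of values into matrix elements and pseudocounts. The struct SubstValues and the function split_subst_values are hypothetical names introduced here; they are not part of the library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical holder for the values of a Subst = command.
struct SubstValues {
    std::vector<float> matrix;        // n*n elements in row major order
    std::vector<float> pseudocounts;  // n elements, or empty for the first form
};

// Split the values following "Subst =" for an n x n matrix. In the second
// form each row carries one extra value, the pseudocount for that row.
SubstValues split_subst_values(const std::vector<float>& values,
                               std::size_t n, bool has_pseudocounts)
{
    const std::size_t row_len = has_pseudocounts ? n + 1 : n;
    assert(values.size() == n * row_len);

    SubstValues out;
    for (std::size_t r = 0; r < n; ++r) {
        for (std::size_t c = 0; c < n; ++c)
            out.matrix.push_back(values[r * row_len + c]);
        if (has_pseudocounts)
            out.pseudocounts.push_back(values[r * row_len + n]);
    }
    return out;
}
```

The row length used here agrees with the description of num_columns: one more than the size of the alphabet tuple when pseudocounts are in use.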

FeatureReg Class

Class FeatureReg uses feature partitions as a regularizer. In the feature partitioning method, the alphabet of the regularizer is divided up into disjoint sets. Such a partitioning is referred to as a feature alphabet. The feature alphabet is then given a zero offset distribution. In a regularizer, many feature alphabets may be used to obtain the posterior counts. The posterior count for a base tuple is computed by multiplying together the posterior counts of the base tuple from each of the separate feature alphabets. The posterior count for a base tuple from a single feature alphabet is in turn computed by adding the zero offset of the feature alphabet to the sum of the observed counts of all the base tuples that are in the same set of the partition as the target base tuple.
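
The computation just described can be written out as a brief stand-alone sketch. It assumes, for illustration only, that each feature alphabet is given as an array mapping a base tuple index to the number of its disjoint set; the function name feature_posterior_counts and the std::vector types are not taken from the library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch. which_set[k][t] is the number of the disjoint set
// that base tuple t belongs to in feature alphabet k.
std::vector<float> feature_posterior_counts(
    const std::vector<std::vector<int> >& which_set,
    const std::vector<float>& zero_offsets,
    const std::vector<float>& observed)
{
    assert(which_set.size() == zero_offsets.size());
    const std::size_t n = observed.size();
    std::vector<float> posterior(n, 1.0f);

    for (std::size_t k = 0; k < which_set.size(); ++k) {
        assert(which_set[k].size() == n);

        // Count the sets in this feature alphabet.
        int num_sets = 0;
        for (std::size_t t = 0; t < n; ++t) {
            assert(which_set[k][t] >= 0);
            if (which_set[k][t] + 1 > num_sets)
                num_sets = which_set[k][t] + 1;
        }

        // Each set total starts at the zero offset and accumulates the
        // observed counts of the base tuples in that set.
        std::vector<float> set_total(static_cast<std::size_t>(num_sets),
                                     zero_offsets[k]);
        for (std::size_t t = 0; t < n; ++t)
            set_total[which_set[k][t]] += observed[t];

        // Multiply this feature alphabet's contribution into the result.
        for (std::size_t t = 0; t < n; ++t)
            posterior[t] *= set_total[which_set[k][t]];
    }
    return posterior;
}
```

For example, with observed counts 1, 2, 3, 4, one feature alphabet grouping the tuples as {0,1} and {2,3} with zero offset 0.5, and a second grouping them as {0,2} and {1,3} with zero offset 1, the first base tuple receives (0.5 + 3) * (1 + 4) = 17.5.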

FeatureReg methods

In addition to the methods inherited from Regularizer (see section Regularizer methods), class FeatureReg supports the following methods:

Method on FeatureReg: FeatureReg (void)
Void constructor for class FeatureReg.

Method on FeatureReg: FeatureReg (const Alphabet *a, istream &in, const char *name)
Constructor for class FeatureReg. Sets the alphabet tuple of the regularizer to the singleton tuple containing alphabet a, and sets the name of the regularizer to name. Other information for constructing the regularizer, such as the feature partitions, is read from the stream in.

Method on FeatureReg: FeatureReg (const Alphabet *a, const char *name)
Constructor for class FeatureReg. Sets the alphabet tuple of the regularizer to the singleton tuple containing alphabet a, and sets the name of the regularizer to name. The default value of name is 0.

Method on FeatureReg: FeatureReg (const AlphabetTuple *at, const char *name)
Constructor for class FeatureReg. Sets the alphabet tuple of the regularizer to a copy of the tuple at, and sets the name of the regularizer to name. The default value of name is 0.

Method on FeatureReg: ~FeatureReg (void)
Destructor for class FeatureReg.

Method on FeatureReg: int num_alphs (void) const
Returns the number of feature alphabets used by the regularizer.

Method on FeatureReg: void set_zero_offset (int i, int z)
Sets the zero offset for the feature alphabet with index i to the value z.

Method on FeatureReg: float zero_offset (int i) const
Returns the value of the zero offset for the feature alphabet with index i.

Method on FeatureReg: const FeaturePartition * partition (int i) const
Returns a pointer of type FeaturePartition * that points to the feature partition with the index i.

Method on FeatureReg: void add_partition (FeaturePartition * fp, float z)
Adds partition fp to the regularizer with a zero offset of z. The default value of z is 1.

Method on FeatureReg: FeaturePartition * pop_partition (int delete_this)
If delete_this is non-negative, then delete the feature partition with index delete_this from the set of feature partitions used by the regularizer. If delete_this is negative, then delete the feature partition with index delete_this + num_alphs(). Thus, the value of delete_this must be in the range -num_alphs() to num_alphs() - 1. The default value of delete_this is -1, so that the partition with the highest index is deleted by default. If the partition deleted was not of the highest index, then the partition of the highest index takes on the index of the deleted partition. The value returned is a pointer to the deleted partition.
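
The index handling of pop_partition can be sketched on an ordinary container. The template pop_with_swap below is a hypothetical stand-in written for this illustration; it models only the indexing rules stated above, not the library's storage.

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch of the indexing rules: a negative index counts from
// the end, and the highest-indexed item moves into the vacated slot.
template <class T>
T pop_with_swap(std::vector<T>& items, int which = -1)
{
    const int n = static_cast<int>(items.size());
    assert(n > 0 && which >= -n && which < n);

    const int index = (which >= 0) ? which : which + n;
    T removed = items[index];
    items[index] = items[n - 1];   // last item takes the removed item's index
    items.pop_back();
    return removed;
}
```

Starting from four items, removing index 1 leaves the former item 3 at index 1, and a call with the default argument then removes whichever item currently has the highest index.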

Method on FeatureReg: void add_best_partition (const float * Summary, int min_features, int max_features)
Try to find a feature partition to add to the regularizer that will improve its performance on predicting probabilities from samples of size one. The number of features in the new partition will be at least min_features. If max_features is not 0, then the number of features in the new partition will be at most max_features. The default value of min_features is 2, and the default value of max_features is 0. Argument Summary should be a pointer to an array of float with Summary[i*alphabet_size() + j] being the frequency of character i, having seen a sample containing character j.

FeatureReg commands

In addition to the commands available from class Regularizer (see section Regularizer commands), FeatureReg supports the following command:

FeatureReg command: Partition = ( ds_1 , ds_2, ..., ds_n ) zero_offset
Adds a partition to the regularizer with a zero offset of zero_offset. Each disjoint set ds_i should be a list of base tuples separated by `+' signs. The sets ds_i should be separated by commas and enclosed in parentheses. Following the close parenthesis of the partition should be the partition's zero offset. An example of a Partition = command is

Partition = ( D + E, K + R + H, N + Q, S + T, I + L + V,
              F + W + Y, C, M, A + G, P )  0.764163

In this example partition of amino acids, there are ten disjoint sets in the partition, with a zero offset of 0.764163. The specification of a FeatureReg must contain one Partition = command for each feature partition that the regularizer uses.
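
To make the command syntax concrete, the following stand-alone sketch splits the text that follows "Partition =" into its disjoint sets and zero offset. It is not the library's reader: the names ParsedPartition, trim, split, and parse_partition are hypothetical, and base tuples are kept as plain strings.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

struct ParsedPartition {                          // hypothetical, for illustration
    std::vector<std::vector<std::string> > sets;  // one entry per disjoint set
    double zero_offset;
};

// Remove leading and trailing blanks, tabs, and newlines.
static std::string trim(const std::string& s)
{
    const char* ws = " \t\r\n";
    const std::string::size_type first = s.find_first_not_of(ws);
    if (first == std::string::npos) return "";
    const std::string::size_type last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// Split s at every occurrence of sep and trim each piece.
static std::vector<std::string> split(const std::string& s, char sep)
{
    std::vector<std::string> pieces;
    std::string::size_type start = 0;
    for (;;) {
        const std::string::size_type pos = s.find(sep, start);
        if (pos == std::string::npos) {
            pieces.push_back(trim(s.substr(start)));
            break;
        }
        pieces.push_back(trim(s.substr(start, pos - start)));
        start = pos + 1;
    }
    return pieces;
}

// Parse "( ds_1, ds_2, ..., ds_n ) zero_offset".
ParsedPartition parse_partition(const std::string& text)
{
    const std::string::size_type open = text.find('(');
    const std::string::size_type close = text.find(')');
    assert(open != std::string::npos && close != std::string::npos && open < close);

    ParsedPartition result;
    const std::vector<std::string> sets =
        split(text.substr(open + 1, close - open - 1), ',');
    for (std::size_t i = 0; i < sets.size(); ++i)
        result.sets.push_back(split(sets[i], '+'));   // base tuples of one set
    result.zero_offset = std::atof(text.substr(close + 1).c_str());
    return result;
}
```

A real reader would also check that every base tuple of the alphabet tuple appears in exactly one set, which is the condition reported by the OK method of FeaturePartition.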

FeaturePartition class

The class FeaturePartition supports the implementation of the class FeatureReg by providing a representation of a feature partition. It has the following methods:

Method on FeaturePartition: FeaturePartition (const AlphabetTuple * at)
Constructor for class FeaturePartition. The argument at is the alphabet tuple that is partitioned into disjoint sets by the FeaturePartition.

Method on FeaturePartition: FeaturePartition (const FeaturePartition & partition)
Copy constructor for class FeaturePartition.

Method on FeaturePartition: ~FeaturePartition (void)
Destructor for class FeaturePartition.

Method on FeaturePartition: int OK (void) const
Returns 1 if all base tuples of the partition's alphabet tuple have been assigned to a disjoint set of the partition. Returns 0 if there are still some base tuples that have not been assigned to a disjoint set.

Method on FeaturePartition: const AlphabetTuple * alphabet_tuple (void) const
Returns the alphabet tuple of the partition.

Method on FeaturePartition: int which_feature (int i) const
Returns the number of the disjoint set that the base tuple with index i is contained in.

Method on FeaturePartition: int & which_feature (const BaseTuple bt)
Returns an integer reference to the number of the disjoint set that base tuple bt is contained in.

Method on FeaturePartition: void set_feature (int letter, int which)
Sets the disjoint set that letter is contained in to which.

Method on FeaturePartition: int num_features (void) const
Returns the number of features in the partition.

Method on FeaturePartition: void ReduceCounts (const float * counts, float zero_offset, float * reduced) const
Returns in the array pointed to by reduced the counts for a reduced partition. The length of the array pointed to by reduced should be equal to the number of features of the partition, while the array pointed to by counts should be of length equal to the alphabet size of the partition. The zero offset added to the reduced counts is zero_offset.

Method on FeaturePartition: void print (ostream & out)
Prints out the partition on stream out. The format output is as described for the FeatureReg command Partition = (see section FeatureReg commands).

Method on FeaturePartition: void read (istream & in)
Reads in a partition from stream in. The input format is as described for the FeatureReg command Partition = (see section FeatureReg commands).

The following function can be used to optimize a FeatureReg:

Function: FeaturePartition * best_feature_partition (Regularizer *d, const float *summary, int min_features, int max_features, float &best_z)
Try to find a feature partition to add to the regularizer pointed to by d that will improve its performance on predicting probabilities from samples of size one. The number of features of the partition returned will be at least min_features and at most max_features. The argument summary is a pointer to an array with summary[i*alphabet_size() + j] being the frequency of character i on seeing a sample containing character j. The best zero offset for the partition is returned in best_z.

Hash Table Functions

There are three hash table classes: SimpleHashClass, DictionaryClass, UserDefinitions. All are based on a safer derivative of the GNU string class. See section The StringListClass class, and section `The String Class' in Libg++ User's Guide. The DictionaryClass adds file reading to the basic hash table class, while UserDefinitions is a hash table for global initialization data read from an initialization file.

The SimpleHashClass class

A SimpleHashClass class implements a simple hash table using GNU string classes. The DictionaryClass is built on top of this class.

The data it holds are pairs of GNU String Classes, one part being the name to lookup with and the other being the value.

Constructors

Methods

The supported functions are: looking up a name to see whether it exists and returning its value if it does; inserting a new name/value pair; obtaining the number of stored items; and clearing all the contents of the table.

Method on SimpleHashClass: String lookup (String key)
Checks to see if key is a stored name. If so, the corresponding value is returned (value may be NULL). If there is no entry, the GNU NULL String value is returned.

Method on SimpleHashClass: long insert (String key, String value)
Inserts the key/value pair into the table. Overwrites the previous value. The return value is unused at present and is always 0.

Method on SimpleHashClass: long listSize (void)
Returns the number of stored items.

Method on SimpleHashClass: void reset (void)
Clears the table by deleting the current internal data and creating a new empty state. This function does not free the GNU Strings that are stored, so its use is a potential memory leak.

Printing

You can dump the values to a stream. The format is name blank value newline.

Method on SimpleHashClass: void print (ostream& file)
Prints the table contents to the given file. One name/value pair per line. The name is separated from the value by a single space.

The DictionaryClass class

A DictionaryClass class implements a hash table with a file reader. The functions are the same as SimpleHashClass along with functions to read data from a stream or the environment.

The data it holds are pairs of GNU String Classes, one part being the name to look up and the other being the value. It functions like a hash table: you look up using the name, and the return is the value. See section The SimpleHashClass class.

Constructors

Methods

The only supported functions are: looking up a name to see whether it exists and returning its value if it does; inserting a new name/value pair; and obtaining the number of stored items.

Method on DictionaryClass: String lookup (String key)
Checks to see if key is a stored name. If so, the corresponding value is returned (value may be NULL). If there is no entry, the GNU NULL String value is returned.

Method on DictionaryClass: long insert (String key, String value)
Inserts the key/value pair into the table. Overwrites the previous value. The return value is unused at present and is always 0.

Method on DictionaryClass: long listSize (void)
Returns the number of stored items.

Printing

You can dump the values to a stream. The format is name blank value newline.

Method on DictionaryClass: void print (ostream& file)
Prints the table contents to the given file. One name/value pair per line. This is the same format that it reads in.
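
The line format, one name/value pair per line, can be illustrated with a stand-alone sketch built on the standard library rather than on the GNU String class. The names Table, read_pairs, print_pairs, and round_trip are hypothetical. The sketch further assumes that the value is everything after the first blank on the line, and it prints in sorted name order only because std::map iterates that way; neither detail is stated by this guide.

```cpp
#include <cassert>
#include <istream>
#include <map>
#include <ostream>
#include <sstream>
#include <string>

typedef std::map<std::string, std::string> Table;  // stand-in for the hash table

// Read "name value" lines. A later line with the same name overwrites
// the earlier value, as insert does.
void read_pairs(std::istream& in, Table& table)
{
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty()) continue;
        const std::string::size_type blank = line.find(' ');
        const std::string name = line.substr(0, blank);  // whole line if no blank
        assert(!name.empty());   // sketch assumption: names are never empty
        table[name] = (blank == std::string::npos) ? std::string()
                                                   : line.substr(blank + 1);
    }
}

// Write one pair per line in the format: name blank value newline.
void print_pairs(std::ostream& out, const Table& table)
{
    for (Table::const_iterator it = table.begin(); it != table.end(); ++it)
        out << it->first << ' ' << it->second << '\n';
}

// Read a block of text and print it back, to show that the two formats agree.
std::string round_trip(const std::string& text)
{
    std::istringstream in(text);
    Table table;
    read_pairs(in, table);
    std::ostringstream out;
    print_pairs(out, table);
    return out.str();
}
```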

The UserDefinitions class

The UserDefinitions class holds the values that the user of the program set before program execution. It is intended that this be a unified way for programs to access standard startup information. It is intended that there only be one of these classes created, and it should be a global that is visible to everyone. It should be a static global, so it is created at startup time.

This class is a restricted version of the DictionaryClass. The data it holds are pairs of GNU String Classes, one part being the name to look up and the other being the value. It functions like a hash table: you look up using the name, and the return is the value. See section The DictionaryClass class.

What is loaded into this class is the environment as it existed at program start, and the contents of a startup file.

Being a static object, it is not permitted to delete it. In fact the only valid things to do with this class are to lookup values, insert values, find out how many values are stored, and print the contents to an ostream.

Constructors

The UserDefinitions class should be initialized once in the main function. If you require the functionality, use a DictionaryClass.

Methods

The only supported functions are: looking up a name to see whether it exists and returning its value if it does; inserting a new name/value pair; and obtaining the number of stored items.

Method on UserDefinitions: String lookup (String key)
Checks to see if key is a stored name. If so, the corresponding value is returned (value may be NULL). If there is no entry, the GNU NULL String value is returned.

Method on UserDefinitions: long insert (String key, String value)
Inserts the key/value pair into the table. Overwrites the previous value. The return value is unused at present and is always 0.

Method on UserDefinitions: long listSize (void)
Returns the number of stored items.

Printing

You can dump the values to a stream. The format is name blank value newline.

Method on UserDefinitions: void print (ostream& file)
Prints the table contents to file. One name/value pair per line. This is the same format that it reads in.

The StringListClass class

A StringListClass class is a simple wrapper for an array of GNU string classes. The emphasis is on runtime safety, not speed. It does checking of the indexes you give it. It dynamically grows if it needs to when you add a new string. It is 1 based, meaning that the index of the first item is 1, not 0. The SimpleHashClass is built on top of this class.

Constructors

Methods

The functions supported are: get an item at a given index; get the number of items in the list; add a new item at the end or at a given index; remove an item from a given index or search to find the given item, and then remove it; and search the list for a given item and return the index. The value used for NULL data and returns, called CURRENTNULL, can be set.

Method on StringListClass: String operator [] (long index)
Method on StringListClass: String getItem (long index)
Returns the value at index. If index is out of range, then the value of the CURRENTNULL is returned. The default CURRENTNULL is the GNU NULL.

Method on StringListClass: void putAppend (String item)
Puts item at the end of the list. The end is always the value of listSize() + 1.

Method on StringListClass: long listSize (void)
Returns the number of stored items.

Method on StringListClass: void putItemAtIndex (long index, String item)
Puts item into the list at index. Overwrites the previous value. Does not delete the previous value if one existed.

Method on StringListClass: void removeItemAtIndex (long index)
Removes the item at the given index. What it does is to take the last item in the list and move it to the index of the item that is being removed. This operation does not preserve any sorted order. The list size is reduced by 1. Does not delete the previous value if one existed.

Method on StringListClass: void removeItem (String item)
Performs a search for item, and then removes it as above. Combination of indexOfItem, and removeItemAtIndex.

Method on StringListClass: long indexOfItem (String item)
Performs a search for item and returns its index. If item is not found, the return is 0 (the first element is 1). Search is linear starting at 1, and returns immediately upon finding a match.

Method on StringListClass: void setCurrentNull (String item=GNU NULL)
Change the current null item for this list. The default null item is the GNU NULL. This is the value that will be returned from getItem when indexes are out of range, and the value that will be put into the list to overwrite a removed item.

Method on StringListClass: String getCurrentNull (void)
Get the current null item for this list.
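
The combination of 1-based indexing, checked access, and removal by moving the last item can be modelled in a few lines. The class OneBasedList below is a hypothetical stand-in that uses std::string and std::vector instead of the GNU String class; in particular, ignoring an out-of-range index in removeItemAtIndex is an assumption of this sketch.

```cpp
#include <cassert>
#include <string>
#include <vector>

class OneBasedList {               // hypothetical stand-in, not the library class
public:
    OneBasedList() : null_item_("") {}

    long listSize() const { return static_cast<long>(items_.size()); }

    void putAppend(const std::string& item) { items_.push_back(item); }

    // Checked access: an out-of-range index yields the current null item.
    std::string getItem(long index) const {
        if (index < 1 || index > listSize()) return null_item_;
        return items_[index - 1];
    }

    // Linear search from index 1; returns 0 if the item is not found.
    long indexOfItem(const std::string& item) const {
        for (long i = 1; i <= listSize(); ++i)
            if (items_[i - 1] == item) return i;
        return 0;
    }

    // The last item moves into the vacated slot, so order is not preserved.
    void removeItemAtIndex(long index) {
        if (index < 1 || index > listSize()) return;
        assert(!items_.empty());
        items_[index - 1] = items_.back();
        items_.pop_back();
    }

    void setCurrentNull(const std::string& item) { null_item_ = item; }

private:
    std::vector<std::string> items_;
    std::string null_item_;
};
```

Because indexOfItem returns 0 for a missing item and valid indexes start at 1, the result of a search can be tested directly as a truth value.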

MasPar Support

The Baskin Center's MasPar MP-2204 has 4096 32-bit SIMD processing elements, each with 64 Kbytes of local memory, a mesh interconnection network, a global router, and 128 Mbytes of global memory connected to the router that can be used for parallel independent file access.

Documentation of the MasPar ganesha can be found in `/usr/maspar/doc', and an excellent tutorial on the DECmpp, another name for the MasPar, can be found in `~rph/220/mppdoc'.

Linear Array Support

Many biosequence projects fit well on a linear array of processing elements. Unfortunately, the MasPar x-net is not perfectly suited for providing a chain of processing elements rather than a square mesh. The following routines, provided in `mp_linear.m' and `mp_linear.h' in the `ultimate/include' and `ultimate/maspar' directories, provide the necessary functionality.

Linear Shift Operations

The following routines shift data in all processing elements: the active set is not obeyed. Also, if processing elements are grouped for more memory or computation power, more efficient variants on these routines using xnetpipe routines could speed operations by a factor proportional to the group size.

MasPar Macro: xnetE (BasicPluralVariable pvar)
MasPar Macro: xnetW (BasicPluralVariable pvar)
These two macros take the values of a basic plural variable (anything that can be used with the MPL xnet primitive) and shift them in all processing elements one element to the east or one element to the west, respectively. The shift treats the processor array as linear, meaning that in the former case, the value in processing element iproc is shifted to processing element iproc+1, which may be on a different row. The final element in the array, processing element nproc-1 is shifted to processing element 0. For shifting quantities larger than 64 bits (long long), see below.

MasPar Function: xnetShift (int dist, plural char *src, plural char *dest, int bytes)
Shift the plural block of memory src, which is bytes long, east or west according to a linear ordering of the MasPar processing elements. If dist is positive, values are shifted to the SE, and if dist is negative, to the NW. Values are moved from (iproc) to (iproc+dist) with wraparound. The result is stored in the plural block of memory pointed to by dest, which may be the same as src. If dist is zero, a plural bcopy is performed. The routine is based on ss_xfetch, and could easily be modified for plural pointers to plural data.
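
The data movement performed by a linear shift can be simulated on the front end with an ordinary array holding one value per processing element. This is a sketch in C++ rather than MPL; the function linear_shift is hypothetical and models only the rule that the value in element iproc moves to element iproc+dist with wraparound.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical simulation: src[iproc] is the value held by processing
// element iproc. dest may be the same vector as src.
void linear_shift(int dist, const std::vector<int>& src, std::vector<int>& dest)
{
    assert(src.size() == dest.size());
    const int nproc = static_cast<int>(src.size());

    std::vector<int> shifted(src.size());   // temporary, so dest may alias src
    for (int iproc = 0; iproc < nproc; ++iproc) {
        int target = (iproc + dist) % nproc;
        if (target < 0) target += nproc;    // wrap negative targets around
        shifted[target] = src[iproc];
    }
    dest = shifted;
}
```

With four elements and dist equal to 1, the value of element 3 wraps around to element 0, matching the behavior described for xnetE.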

Linear Host to PE Array Functions

MasPar Function: BlockInLine (char *from, plural char *to, int start, int npe, int size)
This procedure transfers data from the host memory (not the ACU) to the PE array using a linear ordering of the processing elements. It is based on the MasPar routine BlockIn, which performs a similar function for rectangular subarrays of processing elements using the host to array DMA channel. Data is copied from the block of memory starting at from of length (size*npe). The first size bytes are copied to the length size block of memory starting at to in processor number start, the second block of size bytes is copied to processor (start+1) starting at to, and so on until the final block of memory is copied to processing element (start+npe-1). The active set is ignored. If data is originating in a file, it may be faster to use parallel file access and the IORAM.

MasPar Function: BlockOutLine (plural char *from, char *to, int start, int npe, int size)
This procedure transfers data to the host memory (not the ACU) from the PE array using a linear ordering of the processing elements. It is based on the MasPar routine BlockOut, which performs a similar function for rectangular subarrays of processing elements using the host to array DMA channel. Data is copied to the block of memory starting at to of length (size*npe). The first size bytes are copied from the length size block of memory starting at from in processor number start, the second block of size bytes is copied from processor (start+1) starting at from, and so on until the final block of memory is copied from processing element (start+npe-1). The active set is ignored. If data is to be sent to a file, it may be faster to use parallel file access and the IORAM.
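
The block layout used by BlockInLine and BlockOutLine can be illustrated by a front-end simulation in which the local memory of each processing element is an ordinary byte array. The type PEArray and the functions block_in_line and block_out_line are hypothetical; in particular, the plural pointers to and from are modelled here as byte offsets into each element's local memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

typedef std::vector<std::vector<char> > PEArray;  // pe[i] = local memory of PE i

// Copy npe consecutive blocks of size bytes from host memory into
// processing elements start .. start+npe-1, at the same local offset.
void block_in_line(const char* from, PEArray& pe, std::size_t to_offset,
                   std::size_t start, std::size_t npe, std::size_t size)
{
    assert(start + npe <= pe.size());
    for (std::size_t k = 0; k < npe; ++k) {
        assert(to_offset + size <= pe[start + k].size());
        std::memcpy(pe[start + k].data() + to_offset, from + k * size, size);
    }
}

// The reverse transfer: gather one block from each processing element
// into a contiguous host buffer of length size*npe.
void block_out_line(const PEArray& pe, std::size_t from_offset, char* to,
                    std::size_t start, std::size_t npe, std::size_t size)
{
    assert(start + npe <= pe.size());
    for (std::size_t k = 0; k < npe; ++k) {
        assert(from_offset + size <= pe[start + k].size());
        std::memcpy(to + k * size, pe[start + k].data() + from_offset, size);
    }
}
```

Transferring a buffer in with block_in_line and out again with the same start, npe, and size arguments reproduces the original host buffer.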

Probabilities in MPL

The MPL compiler does not support parallel C++, and is thus unable to make use of the Ultimate Probability class (see section The Prob, ShortProb, LargeReal, ShortLargeReal classes). The following routines in `mp_Prob.h' and `mp_Prob.m' in the `include' and `maspar' directories support probabilities with a hidden representation as log-probabilities stored in 32-bit integers. Addition is performed using table lookup on a plural table of 7600 short integers (using 1.5 kBytes of local PE memory on each PE). Future versions may optionally implement this table in IORAM to save space in the processor memory. This, of course, would be significantly slower.

The routines are all macros or inline function definitions, and include singular (ACU) and plural (DPU) variants. The plural versions are all defined as macros. The most involved, the group for adding probabilities using the lookup table, require temporary `register' arguments to be used for intermediate values. Make sure that these arguments really are registers, not memory locations, or the routines will grind to a halt.

The internal format of these probabilities may not be compatible with those of the Prob class (see section The Prob, ShortProb, LargeReal, ShortLargeReal classes). Code that links these routines to G++ (see section Linking G++ with MPL) will have to convert between formats. The simplest way to do this is to use the appropriate function to convert the probability to a double log-probability, and then create the new probability using that value (see section Member Functions).

The smallest representable probability, Prob_val (Prob_zero()), is exp(-20), or approximately 2.06E-9.

A call must be made to the Prob_init() function before using singular or plural probabilities to compute the table used for adding probabilities.

Note that MPL (and C in general) regards a typedef as an alias for a type, not a new type. Therefore, unlike with the C++ probability class (see section The Prob, ShortProb, LargeReal, ShortLargeReal classes), statements like prob + 0.5 will not generate an error, and will certainly not be the same as prob + Prob_make (0.5).

Singular Probability Functions

MPL Type: Prob
The basic type of a singular probability within MPL.

MPL Function: Prob Prob_zero (void)
Returns a Prob of value 0.

MPL Function: Prob Prob_strong_zero (void)
Returns a Prob with a very strong value of zero. In log-prob terms, this function returns a value much larger than the log-prob of the smallest representable number. Used in the protein HMM code for initialization of boundary conditions and such.

MPL Function: Prob Prob_one (void)
Returns a Prob of value unity.

MPL Function: Prob Prob_make (float prob_val)
Return the Prob corresponding to the 32-bit floating-point number prob_val.

MPL Function: Prob Prob_from_ln (float log_val)
Return the Prob corresponding to the 32-bit floating-point log-probability number log_val. This is based on the natural logarithm. Natural logarithms are generally more efficient as they are the primary source of logarithms (that is, logs in other bases, such as 2, are computed from the natural logarithm).

MPL Function: float Prob_val (Prob prob)
Return prob as a 32-bit floating-point number. This function call hides an exponentiation.

MPL Function: float Prob_ln (Prob prob)
Return the natural logarithm of probability prob.

MPL Function: int Prob_is_zero (Prob prob)
Return 1 if prob is zero according to the granularity of the internal representation.

MPL Function: Prob Prob_mult (Prob p1, Prob p2)
Return the product of two probabilities.

MPL Function: Prob Prob_add (Prob prob1,Prob prob2)
Return the sum of prob1 and prob2.
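
The idea of integer log-probabilities with table-driven addition can be sketched in C++ on the front end. The actual internal format is hidden and is not documented here, so everything in this sketch is an assumption: the lp_ names, the representation as the rounded value of -ln(p) times a scale of 1000, and the table contents. The scale was chosen for the sketch because the correction term then rounds to zero once two values differ by about 7600, so a table of that size suffices.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

typedef int LogProb;                    // hypothetical: round(-ln(p) * LP_SCALE)
const int LP_SCALE = 1000;              // assumed fixed-point scale
const LogProb LP_ZERO = 20 * LP_SCALE;  // stands for exp(-20), treated as zero
const int LP_TABLE_SIZE = 7600;

static std::vector<short> lp_table;     // lp_table[d] = round(LP_SCALE * ln(1 + exp(-d / LP_SCALE)))

// Must be called once before lp_add is used.
void lp_init()
{
    lp_table.resize(LP_TABLE_SIZE);
    for (int d = 0; d < LP_TABLE_SIZE; ++d)
        lp_table[d] = static_cast<short>(std::floor(
            LP_SCALE * std::log(1.0 + std::exp(-double(d) / LP_SCALE)) + 0.5));
}

LogProb lp_make(double p)
{
    if (p <= 0.0) return LP_ZERO;
    const double v = -std::log(p) * LP_SCALE;
    return v >= LP_ZERO ? LP_ZERO : static_cast<LogProb>(std::floor(v + 0.5));
}

double lp_value(LogProb x) { return std::exp(-double(x) / LP_SCALE); }

// A product of probabilities is a sum of log-probabilities.
LogProb lp_mult(LogProb a, LogProb b)
{
    const LogProb s = a + b;
    return s > LP_ZERO ? LP_ZERO : s;
}

// ln(p + q) = ln(p) + ln(1 + q/p); the second term comes from the table.
LogProb lp_add(LogProb a, LogProb b)
{
    assert(!lp_table.empty());          // lp_init() must have been called
    if (a > b) { const LogProb t = a; a = b; b = t; }  // a is the larger probability
    const int d = b - a;
    if (d >= LP_TABLE_SIZE) return a;   // the smaller term is negligible
    return a - lp_table[d];
}
```

In such a scheme multiplication costs one integer addition, while addition costs a comparison, a subtraction, and a table lookup, with no exponentiation or logarithm at run time.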

Plural Probability Functions

The plural functions are quite similar to the above singular functions, with the exception that, to aid hand optimization of MPL code (perhaps not as needed as originally, now that the -Omax compiler flag is available), many variants of the Prob_add function (see section Singular Probability Functions) are provided.

Plural versions of the constant functions are not provided. This type conversion (a data broadcast) can be performed by the MPL compiler automatically.

MPL Type: p_Prob
The basic type of a plural probability within MPL. This is currently just a plural Prob, but is defined separately in case future changes are required.

MPL Function: p_Prob p_Prob_make (plural float prob_val)
Return the p_Prob corresponding to the 32-bit plural floating-point number prob_val.

MPL Function: p_Prob p_Prob_from_ln (plural float log_val)
Return the p_Prob corresponding to the 32-bit plural floating-point log-probability number log_val. This is based on the natural logarithm.

MPL Function: plural float p_Prob_val (p_Prob prob)
Return prob as a 32-bit plural floating-point number. This function call hides an exponentiation.

MPL Function: plural float p_Prob_ln (p_Prob prob)
Return the natural logarithm of plural probability prob.

MPL Function: plural int p_Prob_is_zero (p_Prob prob)
Return 1 if prob is zero according to the granularity of the internal representation.

MPL Function: p_Prob p_Prob_mult (p_Prob p1, p_Prob p2)
Return the product of two plural probabilities.

MPL Function: p_Prob p_Prob_add (p_Prob prob1,p_Prob prob2)
Return the sum of prob1 and prob2.

The following functions are defined as macros rather than inline functions, and all require temporary registers as arguments.

MPL Macro: p_Prob p_Prob_add3 (p_Prob p1, p_Prob p2, p_Prob p3, p_Prob tmp1, p_Prob tmp2, p_Prob tmp3)
Return the sum of probabilities p1, p2, and p3, given three temporary p_Prob registers. The three probabilities, assumed to be in memory, are copied into registers before computing on them.

MPL Macro: p_Prob p_Prob_add2 (p_Prob p1, p_Prob p2, p_Prob tmp1, p_Prob tmp2)
Return the sum of probabilities p1 and p2, given two temporary p_Prob registers. The first probability should be in memory, the second in a register.

MPL Macro: p_Prob p_Prob_add3X (p_Prob p1, p_Prob p2, p_Prob p3, p_Prob tmp1)
Return the sum of three probabilities, all contained in registers. This macro will destroy the value in p1.

MPL Macro: p_Prob p_Prob_increase (p_Prob p1, p_Prob p2, p_Prob tmp1, p_Prob tmp2, p_Prob tmp3)
Computes p1 += p2, with the probabilities initially residing in memory. The three temporary registers must be p_Prob registers.

MPL Macro: p_Prob p_Prob_increase_reg (p_Prob p1, p_Prob p2, p_Prob tmp1, p_Prob tmp2, p_Prob tmp3)
As above, except that the accumulator p1 is assumed to be a register.

Linking G++ with MPL

As of this writing, G++ (see section `Top' in Gcc User's Guide) is not supported on the MasPar.

However, it is possible to link G++ object code with MPL object code to create a program where the G++ part executes only on the front end DECstation, and the MPL runs normally. The key to this is the G++ compiler option -fno-gnu-linker (see section `Code Gen Options' in Gcc User's Guide).

To create a program that runs on the MasPar, the final linker has to be mpld. The idea is to compile all the G++ code first, link it all into one object file (using ld -r), and then use mpld as the final link step to merge the MPL object files and the final G++ object file into a program.

This works except for static class objects. There has to be special initialization code for these that occurs before main is called, and the destructors for them have to be called upon exit. Constructors are pretty easy, but the destructors are not.

The GNU G++ compiler can be instructed to output the code that will call the static constructor and destructor code. This makes it possible to use a different linker (in this case, mpld) but still have the static object constructor and destructor code be operational. I believe that main must be in your C++ code for this all to work.

The key step here is to create the code that knows what the static objects are so they can be built. The program that does this is called findconstructors. It is a pretty simple operation; the standard GNU program collect2 does exactly the same thing (I got the routines from there, but collect2 tries to pretend to be the whole linker, and I decided to do this explicitly).

This program uses the output of nm to find the static object constructors and destructors, and then outputs a C source file that has a table of these functions. Upon linking, the names get turned into addresses to functions, and the startup (and exit) code will use this table to call the functions.

# compile all your G++ code
g++ -fno-gnu-linker -c (*.C)
# prelink all the G++ .o files, MUST USE -r option
g++ -fno-gnu-linker -r -o allC++.o  (*.o)
# now find the static objects
nm allC++.o | findconstructors > Constructors.c
# compile it as normal C, don't use G++!!
gcc -c Constructors.c
# now link it into the rest of the C++ code
gcc -r -o totalC++.o allC++.o Constructors.o
# now the C++ is done, link it in with the rest of your
# MPL code
mpld -o program  (MPL code .o) totalC++.o
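The table-generation step described above (reading nm output and writing a C file with a table of functions) could be sketched roughly as follows. This is not the findconstructors source. The marker strings `_GLOBAL_$I$` and `_GLOBAL_$D$` are an assumption about how the compiler names its static constructor and destructor functions, and the table names static_ctor_table and static_dtor_table are hypothetical; the real names depend on the G++ version, the target, and the startup code.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// ASSUMED markers for compiler-generated static constructor and
// destructor functions; real names vary by G++ version and target.
static const char kCtorMarker[] = "_GLOBAL_$I$";
static const char kDtorMarker[] = "_GLOBAL_$D$";

// Return the last whitespace-separated field of an nm line,
// which is the symbol name.
static std::string last_field(const std::string& line) {
    std::istringstream fields(line);
    std::string field, last;
    while (fields >> field) last = field;
    return last;
}

// Write one zero-terminated table of function pointers as C source.
static void emit_table(std::ostream& out, const char* table_name,
                       const std::vector<std::string>& funcs) {
    for (std::size_t i = 0; i < funcs.size(); ++i)
        out << "extern void " << funcs[i] << "();\n";
    out << "void (*" << table_name << "[])() = {\n";
    for (std::size_t i = 0; i < funcs.size(); ++i)
        out << "    " << funcs[i] << ",\n";
    out << "    0\n};\n";
}

// Scan nm output and return C source holding the two tables.
// Symbol names are emitted verbatim; a real tool may need to strip a
// target-specific leading underscore.
std::string make_constructor_table(std::istream& nm_output) {
    std::vector<std::string> ctors, dtors;
    std::string line;
    while (std::getline(nm_output, line)) {
        const std::string symbol = last_field(line);
        if (symbol.find(kCtorMarker) != std::string::npos)
            ctors.push_back(symbol);
        else if (symbol.find(kDtorMarker) != std::string::npos)
            dtors.push_back(symbol);
    }
    std::ostringstream out;
    emit_table(out, "static_ctor_table", ctors);  // hypothetical name
    emit_table(out, "static_dtor_table", dtors);  // hypothetical name
    return out.str();
}
```

A small main that passes std::cin to make_constructor_table and prints the result would play the role of findconstructors in the pipeline above. The startup and exit code would then walk each table up to its terminating zero entry.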

Projects and other things left to do

Coming Attractions

Wish List

Some things that people have mentioned that they would like to see in the Ultimate Parser library, but for which there have not been any offers:

Index

&

  • & on SeqList:
  • & on Sequence:, & on Sequence:
*

  • *unindex on AlphabetTuple:
a

  • AA
  • abbrev on Alphabet:
  • add_best_partition on FeatureReg:
  • add_class on ClassNameRegistry:
  • add_instance on Hist
  • add_is_a on IdObject:
  • add_partition on FeatureReg:
  • AddComponent on DirichletReg:
  • AddName on NameToPtr:
  • Alph
  • Alpha =
  • Alphabet
  • Alphabet =
  • Alphabet Creation
  • alphabet on Alph:
  • Alphabet on Alphabet
  • alphabet on Sequence:
  • alphabet_size on Regularizer:
  • alphabet_tuple on FeaturePartition:
  • alphabet_tuple on Regularizer:
  • AlphabetPair =
  • AlphabetTriple =
  • AlphabetTuple
  • AlphabetTuple =
  • AlphabetTuple on AlphabetTuple:, AlphabetTuple on AlphabetTuple:, AlphabetTuple on AlphabetTuple:, AlphabetTuple on AlphabetTuple:, AlphabetTuple on AlphabetTuple:
  • AlphaChar =
  • AminoAlphabet
  • apply_all on IdObject:
  • ApplyAll on NameToPtr:, ApplyAll on NameToPtr:
  • approxEqual, approxEqual, approxEqual, approxEqual, approxEqual, approxEqual
  • ASN Sequences, ASN Sequences
  • AsnStream
b

  • Background =
  • background on GribskovReg:, background on GribskovReg:
  • bad on BaseStream:
  • bad_char on Base:
  • badbit
  • badtype on BaseStream:
  • badtypebit
  • Base
  • Base =
  • base_int
  • BaseStream
  • BaseTuple
  • BaseTuple on BaseTuple:
  • best_feature_partition
  • bigger
  • Biosequence
  • BlockInLine
  • BlockOutLine
  • BuildIndex on AsnStream:
c

  • canon on Base:
  • classID on Hist:
  • classID on NamedClass:
  • ClassName =
  • ClassNameRegistry
  • clear on BaseStream
  • clear on SeqList:
  • clip_log2
  • Command
  • Command on Command, Command on Command
  • command on Command:
  • command_table on Command:
  • Comment =
  • complement on NucleicAlphabet:
  • component on DirichletReg:, component on DirichletReg:
  • component_probs on DirichletReg:, component_probs on DirichletReg:
  • Configuration
  • Configuration, NCBI
  • copy on Regularizer:
  • copy on Sequence:
  • Counting, Histogram
  • counts on Hist:
  • countVec on Hist:
  • create on ClassNameRegistry:
  • create on IdObject:
  • Creation of Alphabets
d

  • Data Format, ASN
  • Data Format, NCBI
  • data on Sequence:, data on Sequence:
  • delete_component on DirichletReg:
  • DeleteName on NameToPtr:
  • describe on Alph
  • DictionaryClass
  • DirichletReg on DirichletReg:, DirichletReg on DirichletReg:, DirichletReg on DirichletReg:, DirichletReg on DirichletReg:, DirichletReg on DirichletReg:, DirichletReg on DirichletReg:
  • DNA
  • DNAAlphabet
  • dumpOn on Hist:
e

  • elem on Sequence:, elem on Sequence:
  • element on GribskovReg:, element on GribskovReg:
  • element on SubstPseudoReg:
  • encodingCostForColumnCounts on Regularizer:
  • EndClassName =
  • Entrez Sequences
  • EntrezStream
  • entropy on Prob:
  • EqualStrings
  • execute on Command:
  • ExtAA
  • ExtAminoAlphabet
  • ExtDNA
  • ExtDNAAlphabet
  • ExtractDescr on BaseStream:
  • ExtractTitle on BaseStream:
f

  • FeaturePartition on FeaturePartition:, FeaturePartition on FeaturePartition:
  • FeatureReg on FeatureReg:, FeatureReg on FeatureReg:, FeatureReg on FeatureReg:, FeatureReg on FeatureReg:
  • FindFilePos on AsnStream:
  • FindOldName on NameToPtr:
  • FindUIDSet on EntrezStream:
  • first_char on Alphabet:
  • first_var on Alphabet:
  • first_wc on Alphabet:
  • freeze_columns on SubstPseudoReg:
  • freeze_components on DirichletReg:
  • freeze_dist on MLPReg:
  • freeze_mixture on DirichletReg:
  • freeze_pseudocounts on SubstPseudoReg:
  • full_name on Alphabet:
  • FullUpdate =
g

  • G++, MPL
  • Ganesha, MasPar MP-2204
  • get_modified_counts on Regularizer:
  • get_moments on DirichletReg:
  • get_probs on Regularizer:
  • get_word, get_word
  • getCurrentNull on StringListClass:
  • getItem on StringListClass:
  • GoNextSeqSet on AsnStream:
  • good on BaseStream:
  • GribskovReg on GribskovReg:, GribskovReg on GribskovReg:, GribskovReg on GribskovReg:, GribskovReg on GribskovReg:
  • Groups of sequences
h

  • help on NamedObject:
  • Hist
  • Hist on Hist
  • Histogram
i

  • id on Alphabet:
  • ID on ClassNameRegistry:
  • id on IdObject:
  • IdObject
  • IdObject on IdObject:
  • ignore_case on NameToPtr:
  • index on Alphabet:
  • index on AlphabetTuple:
  • indexOfItem on StringListClass:
  • init_is_a on NamedClass:
  • Input
  • input_order on Regularizer:
  • insert on DictionaryClass:
  • insert on SimpleHashClass:
  • insert on UserDefinitions:
  • is_a on IdObject:
  • is_a on NamedClass:
  • is_amino on Alphabet:
  • is_nonneg on LargeReal:
  • is_normal on Base:
  • is_nucleic on Alphabet:
  • is_null on Base:
  • is_null_char on Base:
  • is_one on Prob:
  • is_positive on LargeReal:
  • is_pyrimidine on NucleicAlphabet:
  • is_rnucleic on Alphabet:
  • is_valid on Alphabet:
  • is_variant on Base:
  • is_wild on Base:
  • is_zero on LargeReal:
  • is_zero on Prob:
l

  • LargeReal
  • last_char on Alphabet:
  • last_var on Alphabet:
  • last_wc on Alphabet:
  • limits on Base
  • Linear parallel processing
  • Linking G++ and MPL
  • listSize on DictionaryClass:
  • listSize on SimpleHashClass:
  • listSize on StringListClass:
  • listSize on UserDefinitions:
  • LoadSequence on EntrezStream:
  • log, log
  • Log Probabilities, MPL
  • Log-Normal Smoothing
  • log2, log2
  • log_base on GribskovReg:
  • log_probability on DirichletReg:
  • LogBase =
  • LogGamma
  • LogGamma_1
  • LogGamma_derivs
  • LogGamma_print_summary
  • LogNormalHist
  • LogNormalHist on LogNormalHist
  • lookup on DictionaryClass:
  • lookup on SimpleHashClass:
  • lookup on UserDefinitions:
  • lookup_table on IdObject:
m

  • MasPar
  • matches on Alphabet:
  • max_num_base on Alphabet:
  • max_num_var_base on Alphabet:
  • mean on Hist:
  • min_element on SubstPseudoReg:
  • Mixture =
  • mixture_coeff on DirichletReg:
  • MLPReg on MLPReg:, MLPReg on MLPReg:, MLPReg on MLPReg:, MLPReg on MLPReg:
  • MLZReg on MLZReg:, MLZReg on MLZReg:, MLZReg on MLZReg:, MLZReg on MLZReg:
  • MPL, Probabilities
  • Multinomial, Multinomial
  • Multiple sequences
n

  • Name =
  • name on Alphabet:
  • name on NamedObject:
  • name_to_alphabet on Alph:
  • NamedClass
  • NamedObject
  • NamedObject on NamedObject:, NamedObject on NamedObject:
  • NameToPtr
  • NameToPtr on NameToPtr:
  • NCBI Configuration
  • NCBI Data Format
  • NO_VARS
  • no_wc_match on Alphabet:
  • no_wc_match on Base:, no_wc_match on Base:
  • non_prob on Prob:
  • norm_length on Alphabet:
  • norm_wc_length on Alphabet:
  • Normal Smoothing
  • NormalHist
  • NormalHist on NormalHist
  • normalize on Regularizer:
  • noseqset on BaseStream:
  • noseqsetbit
  • Nucleic
  • NucleicAlphabet
  • null on Alphabet:
  • null on Base:
  • null_char on Base:
  • NULL_NULL_FALSE
  • NULL_NULL_TRUE
  • num_alphabets on Alph:
  • num_alphabets on AlphabetTuple:
  • num_alphs on FeatureReg:
  • num_columns on SubstPseudoReg:
  • num_components on DirichletReg:
  • num_features on FeaturePartition:
  • num_matches on Alphabet:
  • num_normal on AlphabetTuple:
  • Number =
  • NumDistr =
o

  • offset on Sequence:
  • OK on FeaturePartition:
  • operator <<
  • operator Base * on Sequence
  • operator const Base * on Sequence
  • operator on AlphabetTuple:
  • operator on BaseTuple:, operator on BaseTuple:
  • operator on Hist:
  • operator on SeqList:
  • operator on Sequence:, operator on Sequence:, operator on Sequence:
  • operator on StringListClass:
  • operator+= on Hist
  • operator-= on Hist
  • operator<<, operator<<, operator<<, operator<<, operator<<
  • operator<< on Hist:
  • operator= on SeqList:
  • operator= on Sequence:, operator= on Sequence:
  • operator>>, operator>>, operator>>, operator>>
  • operator>> on BaseStream:
  • operator>> on Hist:
  • operator~ on Prob:
  • Order =, Order =
  • OrderSort
p

  • p_Prob
  • p_Prob_add
  • p_Prob_add2
  • p_Prob_add3
  • p_Prob_add3X
  • p_Prob_from_ln
  • p_Prob_increase
  • p_Prob_increase_reg
  • p_Prob_is_zero
  • p_Prob_ln
  • p_Prob_make
  • p_Prob_mult
  • p_Prob_val
  • Parallel computation
  • Parallel processing, linear
  • Parition =
  • partition on FeatureReg:
  • plotOn on Hist:
  • pop_partition on FeatureReg:
  • posterior_mixture on DirichletReg:
  • pow, pow, pow
  • pow_double
  • print on DictionaryClass:
  • print on FeaturePartition:
  • print on SimpleHashClass:
  • print on UserDefinitions:
  • print_command on AlphabetTuple:
  • print_info on Regularizer:
  • print_order on Regularizer:
  • print_ordered_component on DirichletReg:
  • print_unindex on AlphabetTuple:
  • printOn on Hist:
  • printOn on Sequence:
  • Prob, Prob
  • prob on Hist:
  • Prob_add
  • Prob_from_ln
  • Prob_is_zero
  • Prob_ln
  • Prob_make
  • Prob_mult
  • Prob_one
  • Prob_strong_zero
  • Prob_val
  • Prob_zero
  • Probabilities, MPL
  • Probability on DirichletReg:
  • Pseudocounts =
  • pseudocounts on MLPReg:
  • Psi
  • putAppend on StringListClass:
  • putItemAtIndex on StringListClass:
q

  • QUpdate =
r

  • raw_counts on Hist:
  • raw_countVec on Hist:
  • raw_int on Base:
  • raw_prob on Hist:
  • read on AsnStream:
  • read on FeaturePartition:
  • read_AlphabetTuple
  • read_command on Command:
  • read_knowing_type on NamedClass:
  • read_line
  • read_name on NamedObject:
  • read_new on NamedClass:
  • read_new on Regularizer:, read_new on Regularizer:
  • read_order on Regularizer:
  • read_script on Command:
  • ReadIndex on AsnStream:
  • Rectangular Smoothing
  • RectHist
  • RectHist on RectHist
  • ReduceCounts on FeaturePartition:
  • RegisterClass
  • Regularizer
  • Regularizer on Regularizer:, Regularizer on Regularizer:, Regularizer on Regularizer:
  • Rehash on NameToPtr:
  • remove_from_table on Command:
  • removeItem on StringListClass:
  • removeItemAtIndex on StringListClass:
  • reset on SimpleHashClass:
  • restoreFrom on Hist:
  • ret_default on Alph:
  • ret_double on Prob:
  • ret_epsilon on Prob:
  • ret_log on Prob:
  • ret_mag on LargeReal:
  • RetHashSize on NameToPtr:
  • RetNumNames on NameToPtr:
  • RNA
  • RNAAlphabet
s

  • same_as on AlphabetTuple:
  • same_group on NucleicAlphabet:
  • scale_component on DirichletReg:
  • scanFrom on Hist:
  • scanFrom on Sequence:
  • Scores =
  • SeqList
  • SeqList on SeqList, SeqList on SeqList, SeqList on SeqList
  • Sequence
  • Sequence on Sequence, Sequence on Sequence, Sequence on Sequence, Sequence on Sequence, Sequence on Sequence, Sequence on Sequence, Sequence on Sequence
  • Sequence Streams
  • Sequences, sets of
  • set_alphabet on Regularizer:
  • set_alphabet_tuple on Regularizer:
  • set_component on DirichletReg:
  • set_const on Prob:
  • set_default on Alph:
  • set_double on Prob:
  • set_element on SubstPseudoReg:
  • set_feature on FeaturePartition:
  • set_help on NamedObject:
  • set_int on Base:
  • set_kernel_width on Hist, set_kernel_width on Hist
  • set_log on Prob:
  • set_log_base on GribskovReg:
  • set_mixture on DirichletReg:
  • set_name on NamedObject:
  • set_pseudocounts on MLPReg:
  • set_zero_offset on FeatureReg:
  • set_zero_offset on MLZReg:
  • setCurrentNull on StringListClass:
  • Sets of sequences
  • SetupToRead on BaseStream:
  • sgn on LargeReal:
  • ShortLargeReal
  • ShortProb
  • silent_convert on Alph:
  • SimpleHashClass
  • size on SeqList:
  • size on Sequence:
  • SkipSeparators, SkipSeparators
  • smooth on Hist
  • SmoothHist
  • smoothing on Hist, smoothing on Hist, smoothing on Hist
  • Smoothing, Log-Normal
  • Smoothing, Normal
  • Smoothing, Rectangular
  • Statistics, Histograms
  • storeOn on Hist:
  • String on operator
  • StringListClass
  • StructID =
  • sub_instance on Hist
  • SubSequence on Sequence:
  • Subset matching, wildcard
  • Subst =, Subst =
  • SubstPseudoReg on SubstPseudoReg:, SubstPseudoReg on SubstPseudoReg:, SubstPseudoReg on SubstPseudoReg:, SubstPseudoReg on SubstPseudoReg:
  • sum_col on SubstPseudoReg:
  • sum_component on DirichletReg:
t

  • to_base on Alphabet:
  • to_char on Alphabet:
  • total_size on SeqList:
  • type on Hist:
  • type on NamedClass:
u

  • unfreeze_columns on SubstPseudoReg:
  • unfreeze_components on DirichletReg:
  • unfreeze_dist on MLPReg:
  • unfreeze_mixture on DirichletReg:
  • unfreeze_pseudocounts on SubstPseudoReg:
  • unindex on Alphabet:
  • UnorderedProbability on DirichletReg:
  • use_pseudocounts on SubstPseudoReg:
  • use_pseusocounts on SubstPseudoReg:
  • use_scaled_counts on SubstPseudoReg:, use_scaled_counts on SubstPseudoReg:
  • UserDefinitions
v

  • valid on LargeReal:
  • valid on Prob:
  • valid_or_null on Alphabet:
  • variance on Hist:
  • VARS
  • verify_partials1 on Regularizer:
  • verify_partials2 on Regularizer:
  • verify_word
  • virtual on Alphabet
  • void on Alphabet, void on Alphabet, void on Alphabet, void on Alphabet, void on Alphabet
  • void on SeqList, void on SeqList
  • void on Sequence, void on Sequence, void on Sequence
w

  • wc_length on Alphabet:
  • wc_match on Alphabet:
  • wc_match on Base:
  • wc_subset on Alphabet:
  • wc_subset on Base:
  • which_feature on FeaturePartition:, which_feature on FeaturePartition:
  • Wildcard subset matching
  • write on NamedClass:
  • write_knowing_type on NamedClass:
  • write_name on NamedObject:
  • WriteIndex on AsnStream:
x

  • xnetE
  • xnetShift
  • xnetW
z

  • zero_offset on FeatureReg:
  • zero_offset on MLZReg:
  • ZeroOffset =
~

  • ~AlphabetTuple on AlphabetTuple:
  • ~BaseTuple on BaseTuple:
  • ~DirichletReg on DirichletReg:
  • ~FeaturePartition on FeaturePartition:
  • ~FeatureReg on FeatureReg:
  • ~GribskovReg on GribskovReg:
  • ~IdObject on IdObject:
  • ~MLPReg on MLPReg:
  • ~NamedObject on NamedObject:
  • ~NameToPtr on NameToPtr:
  • ~Regularizer on Regularizer:
  • ~SubstPseudoReg on SubstPseudoReg:

This document was generated on 28 October 1996 using the texi2html translator version 1.51.