GEN_SEQUENCE
an open-source library

Gen_sequence is a program for generating random sequences of amino acids with lengths and compositions typical of those found in real protein databases.

The program comes with a small library of open-source routines for generating random variates according to normal, beta, Dirichlet, and mixture of Dirichlet distributions.

This library is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; version 2.1 of the License. This library is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose. See the GNU Lesser General Public License for more details.

The random-variate algorithms in this library were selected more for robustness and simplicity of implementation than for raw speed. Even so, generation appears to be quite efficient, taking about 1 microsecond per beta variate and 0.6 microseconds per normal variate on a DEC Alpha XP1000.

The random number generator can be changed by editing the DRAND macro definitions in the .c files. Since all the generators rely on successive pairs of uniformly distributed random numbers, a high-quality generator should be used. The additive random number generator "random" in the standard UNIX libraries is such a generator, so it was chosen for this application.
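As an illustration of the kind of change described above, here is a minimal sketch of a DRAND-style macro built on the POSIX random() call. The macro body and the helper fill_uniform are assumptions made for this example and are not copied from the package; the actual definitions are in the .c files.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 600   /* asks POSIX headers to declare random() and srandom() */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical DRAND definition (an assumption, not the package's own):
 * random() returns a long in [0, 2^31 - 1], so dividing by 2^31
 * yields a double in [0, 1). */
#define DRAND() ((double) random() / 2147483648.0)

/* Hypothetical helper: fill out[0..n-1] with uniform variates from DRAND. */
static void fill_uniform(double *out, size_t n)
{
    assert(out != NULL);
    for (size_t i = 0; i < n; i++)
        out[i] = DRAND();
}
```

To try a different generator under this sketch, only the body of DRAND would change; code that calls DRAND() stays the same.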

Test programs are provided for each of the generators. The tests are far from exhaustive, checking only the first two moments for a few parameter values (covering each of the different algorithms for gen_beta). The test programs do not decide whether the generators are working; they simply report the first and second moments of the sample alongside the analytically expected values. It is up to the user to decide whether the match is adequate. Although the test programs were written for debugging the random-variate generators, their main function now is to measure the speed of the generators.
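The comparison described above can be sketched as follows for the beta case. This is not the code of test_beta.c; the type and function names are hypothetical, and the use of raw (uncentered) moments is an assumption. For a Beta(a, b) variate the analytic values are E[X] = a/(a+b) and E[X^2] = a(a+1)/((a+b)(a+b+1)).

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* First and second raw moments, E[X] and E[X^2]. */
typedef struct {
    double first;
    double second;
} moments;

/* Analytic raw moments of a Beta(a, b) distribution. */
static moments beta_moments(double a, double b)
{
    moments m;
    assert(a > 0.0 && b > 0.0);
    m.first = a / (a + b);
    m.second = a * (a + 1.0) / ((a + b) * (a + b + 1.0));
    return m;
}

/* Sample raw moments of x[0..n-1]. */
static moments sample_moments(const double *x, size_t n)
{
    moments m;
    double s1 = 0.0, s2 = 0.0;
    assert(x != NULL && n > 0);
    for (size_t i = 0; i < n; i++) {
        s1 += x[i];
        s2 += x[i] * x[i];
    }
    m.first = s1 / (double) n;
    m.second = s2 / (double) n;
    return m;
}
```

A test program in this style would print both structs side by side and leave the judgment to the reader, as the paragraph above describes.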

Access to source code

The whole package in gzipped tar format
README
The README file is mostly redundant with this web page. The documentation for the individual routines is in the source code itself.
gen_beta.c
Beta variate generator.
gen_beta.h
Header for gen_beta.
test_beta.c
Test program for gen_beta.
gen_dirch.c
Generator for vectors distributed according to a Dirichlet density. This routine requires gen_beta also.
gen_dirch.h
Header for gen_dirch.
test_dirch.c
Test program for gen_dirch.
gen_dirch_mix.c
Generator for vectors distributed according to a mixture of Dirichlet densities. This routine requires gen_dirch and gen_beta also.
gen_dirch_mix.h
Header for gen_dirch_mix.
test_dirch_mix.c
Test program for gen_dirch_mix.
gen_norm.c
Generator for random numbers distributed according to the standard normal density function.
gen_norm.h
Header for gen_norm.
test_norm.c
Test program for gen_norm.
gen_sequence.c
Main program for generating random sequences of amino acids in FASTA format. This program requires gen_beta, gen_dirch, gen_dirch_mix, and gen_norm.

The length of each sequence is taken from a discretized log-normal distribution that was fit to the sequences in RSDB-60 (see Park J, Holm L, Heger A, Chothia C. RSDB: representative protein sequence databases have high information content. Bioinformatics 2000 May;16(5):458-64).

The amino acids of a sequence are generated by an independent, identically distributed process. The probabilities for that distribution are selected from a mixture of Dirichlet densities. The mixture of Dirichlet densities was also trained on RSDB-60. The particular mixture chosen here is not the best fit, but a compromise between the number of components and the fit.
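The process described above can be sketched in three steps: pick a mixture component according to its weight, draw a probability vector from that component's Dirichlet density, and then draw each residue independently from that vector. The code below is an illustration under stated assumptions, not the package's implementation. All function names are hypothetical, the residue ordering is arbitrary, and the Dirichlet step uses the standard stick-breaking construction from beta variates, which is only assumed to resemble what gen_dirch does (the page says only that gen_dirch requires gen_beta). Random inputs are passed in as arguments so that each step can be checked by hand.

```c
#include <assert.h>
#include <stddef.h>

/* The 20 standard amino acids; this ordering is arbitrary. */
static const char RESIDUES[] = "ACDEFGHIKLMNPQRSTVWY";
#define N_RESIDUES 20

/* Inverse-CDF selection: return the first index whose cumulative
 * probability exceeds the uniform variate u in [0, 1). Usable both for
 * choosing a mixture component and for choosing a residue. */
static size_t pick_index(const double *p, size_t n, double u)
{
    double cum = 0.0;
    assert(p != NULL && n > 0);
    for (size_t i = 0; i < n; i++) {
        cum += p[i];
        if (u < cum)
            return i;
    }
    return n - 1;   /* guards against round-off in the cumulative sum */
}

/* Stick-breaking (assumed construction): given beta variates
 * b[i] ~ Beta(alpha[i], alpha[i+1] + ... + alpha[n-1]) for i < n-1,
 * fill p[0..n-1] with a Dirichlet(alpha) probability vector. */
static void dirichlet_from_betas(const double *b, size_t n, double *p)
{
    double remaining = 1.0;
    assert(b != NULL && p != NULL && n > 0);
    for (size_t i = 0; i + 1 < n; i++) {
        p[i] = b[i] * remaining;   /* take a fraction of what is left */
        remaining -= p[i];
    }
    p[n - 1] = remaining;
}

/* Draw len residues i.i.d. from p (length N_RESIDUES), one uniform per
 * residue; out must have room for len + 1 characters. */
static void residues_from_uniforms(const double *p, const double *u,
                                   size_t len, char *out)
{
    assert(u != NULL && out != NULL);
    for (size_t i = 0; i < len; i++)
        out[i] = RESIDUES[pick_index(p, N_RESIDUES, u[i])];
    out[len] = '\0';
}
```

Because the probability vector is redrawn for each sequence, residues are independent within a sequence while composition varies from one sequence to the next.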