This directory contains predictions for all the S. cerevisiae ORFs,
using the SAM-T02 prediction method.

orf_coding.fasta.gz
    has the DNA for all the orfs.
    It is not used anywhere else on this site.

orf_trans.fasta.gz
    has the protein sequences for all 6360 orfs, as originally
    obtained from SGD.

corrected.seqs
    5 sequences with in-frame stops in SGD, corrected to full-length
    form from Swissprot or TREMBL.

kumar.seqs
    Short genes found by Anuj Kumar.

kumar_trans.fasta
    kumar.seqs translated in the first reading frame.

orf_trans
    has corrected sequences from Swissprot plus Anuj Kumar's
    sequences.  (Indexed for NCBI-BLAST)

orf_trans.ids
    has the names of the orfs extracted from orf_trans.

prefix.counts
    counts the number of subdirectories and maximum number of orfs
    per subdirectory we would get if we split the set of orfs
    according to the first 1, 2, 3, ... letters of their names.
    We have chosen to use the first four letters as the dividing
    point, so there are 95 main directories with up to 113 ORFs in
    each.

scripts/
    scripts used for the predictions (note: some scripts are
    installed elsewhere; this directory contains only the ones
    unique to this set of predictions)

Makefile
    instructions for making the orf_trans.ids files, distributing the
    sequences to subdirectories, and other top-level operations.

Make.main
    shared instructions for doing the individual predictions

starter-directory
    master files copied into the individual subdirectories (with the
    id inserted appropriately)

mail
    communication with Nathan Baker and Olga Troyanskaya (not public)

protein-protein.rdb
    RDB file indexed by pair of proteins, providing
        min_{T a template} (max (Evalue(prot1,T), Evalue(prot2,T)))
    This file is useful for clustering yeast proteins by predicted
    structural similarity.

protein-protein-sorted.rdb
    same info as protein-protein.rdb, sorted by E_value.

diagonal.rdb
    the diagonal elements of protein-protein.rdb, which is the same
    as the minimum E_value of any template for that target protein.
    Note that about 44% of the yeast proteins have E_values < 1.e-06
    (strong predictions) and 43% have E_values > 1.0 (very weak
    predictions).

id-subsets/*
    files with subsets of the ids, used in selecting which chain ids
    to build for.

    redone-recent-*.ids
        chains that had searches redone because of similarity to new
        PDB files
    used-adpstyle1.ids
        long chains that triggered out-of-memory problems in hmmscore
        with posterior decoding.  Used Viterbi alignment instead for
        the final step of the target2k script.
    kumar.ids
        sequences provided by Anuj Kumar (not in SGD)
    corrected.ids
        sequences with in-frame stop codons for which full-length
        sequences were found in Swissprot

----------------------------------------
TEMPORARY FILES (these will be discarded when no longer needed):

*.log
    Log of the scripts/condor-all script, to observe the progress of
    the build.

------------------------------------------------------------
26 Feb 2004 Kevin Karplus
The monthly updates to the yeast predictions are beginning to get
quite slow.  One problem is the huge number of protein kinases that
get updated every time there is a new protein kinase release in PDB.
Perhaps I need to update fewer targets each month, only updating
those whose best hit in the new PDB files is better than the best
previous hit in PDB.  To save recomputation, this would require
caching a "best pdb blast hit" for each target.

22 April 2004 Kevin Karplus
Another problem with the updates is that there may be some good
templates missed for the targets that have fairly weak predictions.
For example, I just discovered this week that there are better
templates for YKL149C (the lariat-debranching enzyme) that did not
get picked up in the updates.  Perhaps when we get the updates
runnable on the kilokluster, I can do a massive update of all the
predictions.

Tue Dec 28 20:57:15 PST 2004 Kevin Karplus
I'm concerned that 1624 warnings for "chain not found" were issued
for recent pdb files; that's most of recent-pdb.ids!
PDBFinder has a record of the files, and they seem to be on the PDB
web site, but they aren't in either the pdbaa or dunbrack-pdbaa
files.  (Well, some are in dunbrack-pdbaa under different names, but
I'm not worried about those.)  I need to check on how much delay is
involved in getting PDB files into the dunbrack-pdbaa files.  For
example, 1q13.pdb has a REVDAT of 16-Nov-04, but was not in the
21 Dec pdbaa.  Perhaps RCSB was slow about releasing files, and the
REVDAT is not accurate for when it was actually released?

Sat Jan 1 17:50:25 PST 2005 Kevin Karplus
It may just be an update problem.  The 1q13A chain is present in the
pdbaa file from Dunbrack's web site.  Wait, I seem to be getting it
from the old ftp site, which hasn't been updated since 13 Nov 2004.
I should probably get a new list of yeast ids to update, based on
recent PDB files, but remove any from the list that have just been
updated.

Sun Jan 2 13:23:58 PST 2005 Kevin Karplus
I updated the dunbrack-pdb file yesterday, and selected new targets
to redo today.  I deliberately avoided including any targets that
were just updated this week, so the next update should probably
include pdb files as far back as 20 Dec 2004.

Mon Jan 3 12:46:53 PST 2005 Kevin Karplus
I updated the Make.main file yesterday to have a new MAX_NUM_BEST
limit of 50 templates, but I realized today that this may cause some
problems with multi-domain proteins.  For example, for YPL231W, there
are templates for 5 or 6 different domains, but there are many
templates for the top-scoring domain, so the tighter limit on
MAX_NUM_BEST may crowd out some of the still-good alignments to the
other domains.  I may have to loosen MAX_NUM_BEST, but to do so, I'll
want to speed up the way that pairwise alignments are checked; the
recursive make is far too slow.

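The crowding concern in the 3 Jan 2005 entry can be illustrated with
a small sketch.  This is not the Make.main logic: the template names,
domain labels, and E-values are invented for illustration, and
keep_best is a hypothetical stand-in for a cap such as MAX_NUM_BEST
applied to one list of hits sorted by E-value.

```python
# Hypothetical illustration only: template names, domain labels, and
# E-values are invented; keep_best is NOT the actual Make.main rule.

def keep_best(hits, max_num_best):
    """Keep the max_num_best hits with the smallest E-values."""
    return sorted(hits, key=lambda h: h["evalue"])[:max_num_best]


def domains_covered(hits):
    """Return the set of domain labels that still have a template."""
    return {h["domain"] for h in hits}


# 60 very strong templates for the top-scoring domain "A" ...
hits = [
    {"template": f"tA{i:02d}", "domain": "A", "evalue": 1e-30 * (i + 1)}
    for i in range(60)
]
# ... plus weaker but still-good templates for two other domains.
hits += [
    {"template": "tB01", "domain": "B", "evalue": 1e-8},
    {"template": "tC01", "domain": "C", "evalue": 1e-6},
]

tight = keep_best(hits, 50)    # cap of 50, as in the log entry
loose = keep_best(hits, 100)   # a looser cap

print(sorted(domains_covered(tight)))  # ['A']
print(sorted(domains_covered(loose)))  # ['A', 'B', 'C']
```

With the cap at 50, every retained template belongs to domain A, so
the alignments to B and C are lost even though their E-values are
still good.
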
Sun Jan 30 10:00:05 PST 2005 Kevin Karplus
Today's remake will only use new templates in the library, and not
new PDB files, since the PDBFIND.TXT.gz file on the ftp server I
fetch it from does not seem to have been updated since 11 Dec 2004.
If I can get a newer version of PDBFIND.TXT.gz, I may have to redo
the yeast update (probably taking the difference in the IDS list so
that I don't redo targets unnecessarily).

Mon Jan 31 03:44:13 PST 2005 Kevin Karplus
The bug in PDBFIND.TXT.gz updates was fixed, so another 83 proteins
are having the predictions redone starting today.  There were some
very strong hits (including two yeast proteins themselves) that were
too close to existing structures to have been included in the SAM
template library, but which quite likely will improve the structures
reported on the web pages.

Sun Nov 5 20:15:10 PST 2006 Kevin Karplus
I updated the Make.main file to use the protocol used in CASP7, and
am testing it first on YFL026W (STE2).  If it works, I'll gradually
start redoing the pages with the new protocol (and REDO_ALL=1).
The old Make.main file is preserved as Make.main-preNov2006.
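
As an illustration of the score stored in protein-protein.rdb
(described in the file list at the top of this file), here is a
minimal sketch of the min-over-templates, max-over-the-pair
computation.  The protein names (orfA, orfB), template names
(t1..t3), and E-values are invented, pair_score is a hypothetical
helper, and the sketch assumes that only templates scored for both
proteins are considered; it does not read or write RDB files.

```python
# Hypothetical sketch: names and E-values are invented, and this is
# not the script that actually built protein-protein.rdb.

def pair_score(evalues, prot1, prot2):
    """min over templates T of max(Evalue(prot1,T), Evalue(prot2,T)).

    evalues maps protein -> {template: E-value}.  Assumption: only
    templates scored for both proteins are used; returns None if the
    two proteins share no template.
    """
    e1, e2 = evalues[prot1], evalues[prot2]
    shared = e1.keys() & e2.keys()
    if not shared:
        return None
    return min(max(e1[t], e2[t]) for t in shared)


evalues = {
    "orfA": {"t1": 1e-20, "t2": 0.5, "t3": 3.0},
    "orfB": {"t1": 2.0, "t2": 1e-4, "t3": 1e-9},
}

# Off-diagonal entry: the best template that fits BOTH proteins.
print(pair_score(evalues, "orfA", "orfB"))  # 0.5

# Diagonal entries reduce to the protein's own minimum E-value,
# because max(E, E) == E.
print(pair_score(evalues, "orfA", "orfA"))  # 1e-20
print(pair_score(evalues, "orfB", "orfB"))  # 1e-09
```

A small pair score means some single template matches both proteins
well, which is why the file is useful for clustering by predicted
structural similarity.
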