CHANGE the blastall in the Makefile for recent.blast to use blastall -F -m 9
Fix the conversion from eps to pdf so that long sequence logos are not truncated.
Fix the sequences from kumar.seqs (the target sequences have a * at the end, though this has been fixed in the kumar.seqs file). Alternatively, fix undertaker to accept and ignore a * character in a sequence.
Change Make.main and associated scripts to use the latest neural nets and scripts for building web pages (see pce/protein-predict/SAM_T05test).
Revise Makefile to use a more selective search for targets to redo---BLAST e-value should be no worse than 100 times the E-value recorded in new-diagonal.rdb.
DONE:
We need to devise a procedure for updating the predictions periodically as new sequences and structures are added to the underlying databases. For example, we could score the yeast sequences with each new addition to the template library and mark any that score well for re-prediction. Once a month we could redo the predictions for the marked sequences using the new template library, either with the existing t2k alignments or with newly generated ones. Other triggers for re-prediction are possible: manual requests by users (to be confirmed by us, to avoid massive recomputations), BLAST searches with new PDB sequences as keys, and so on. The find-recent-ids script in /projects/compbio/data/pdbfinder may be useful for selecting the subset of PDB to be considered new.
Big chunks of the procedure have been put into the Makefile in the main yeast directory (see the URL above or look in /projects/compbio/experiments/protein-predict/yeast/Makefile), including extracting the recent PDB sequences, blasting them against the yeast ORFs, and collecting the hits as an IDs file. I still have not automated the decision about what constitutes "recent", nor the submission of IDs to be searched again. This is all pretty easy---a bigger problem is figuring out what to do about the delay between PDB releasing a structure and its appearing in the PDBAA non-redundant list that we search (about a month) or in our template library (a week to a month). Based on a search done at the end of December 2002, it looks like we'll need to redo about 60 searches a month---about 1% of the database.
I've more or less automated the update of the yeast predictions now, so there isn't much to do here. [April 2003]
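As a rough illustration of the "collect the hits as an IDs file" step, here is a minimal Python sketch. It is not the code in the Makefile. It assumes the BLAST run used tabular output (blastall -m 8 or -m 9, 12 tab-separated columns, with # comment lines for -m 9), that the recent PDB sequences were the queries so the yeast ORF is in the subject column, and that a fixed E-value cutoff decides what counts as a hit. The function name hit_ids and the default cutoff are made up for the example.

```python
def parse_evalue(text):
    """Convert a BLAST E-value string to a float.

    Some legacy BLAST reports write very small values as 'e-102';
    prefixing '1' makes such strings parseable (defensive assumption).
    """
    text = text.strip()
    if text.startswith("e"):
        text = "1" + text
    return float(text)


def hit_ids(blast_lines, max_evalue=1e-5):
    """Return the set of subject IDs having at least one hit with
    E-value <= max_evalue.

    blast_lines: iterable of lines in BLAST tabular format, where
    column 2 is the subject ID and column 11 is the E-value.
    max_evalue: hypothetical cutoff, not taken from the real Makefile.
    """
    marked = set()
    for line in blast_lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and -m 9 comment lines
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # not a complete hit line
        subject_id = fields[1]
        if parse_evalue(fields[10]) <= max_evalue:
            marked.add(subject_id)
    return marked
```

Typical use would be something like hit_ids(open("recent.blast")), writing the sorted result one ID per line; the file name here is only a guess based on the Makefile target mentioned above.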
Once a year we should probably update the t2k alignments. Rather than having a massive annual update, it might be better to have a rolling update, redoing the alignments a few at a time (about 2% a week), oldest alignments first. Note that this maintenance is much more expensive than the "recent PDB" maintenance, both per sequence and in number of sequences.
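The rolling update could be sketched as follows. This is only an illustration, assuming we have a date for each t2k alignment (for example, from file modification times); the function name oldest_batch and the dictionary layout are hypothetical. At 2% a week, the whole set is cycled through in about 50 weeks, which matches the once-a-year goal.

```python
import math
from datetime import date


def oldest_batch(alignment_dates, fraction=0.02):
    """Pick the oldest `fraction` of alignments for this week's update.

    alignment_dates: dict mapping sequence ID -> date the t2k alignment
    was last built (hypothetical layout).
    Returns a list of sequence IDs, oldest first; ties are broken by ID
    so the result is deterministic.
    """
    if not alignment_dates:
        return []
    # Always redo at least one alignment so small sets still make progress.
    batch_size = max(1, math.ceil(len(alignment_dates) * fraction))
    by_age = sorted(alignment_dates,
                    key=lambda seq_id: (alignment_dates[seq_id], seq_id))
    return by_age[:batch_size]
```

After the selected alignments are rebuilt, their dates would be reset to the current date, which moves them to the back of the queue for the next run.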
When we do a re-prediction, we will have to make sure that whatever database is used for indexing the results is properly updated. Having timestamps for both the t2k alignment and the structure prediction in the database would probably help in devising a good update procedure.
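One way the two timestamps might be used is a simple consistency check: a structure prediction that is older than the t2k alignment for the same sequence was made from a previous alignment and is a candidate for redoing. The sketch below assumes one record per sequence with both dates; the class and field names are invented for illustration and do not describe any existing database schema.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class PredictionRecord:
    """Hypothetical index entry for one yeast sequence."""
    seq_id: str
    alignment_date: date   # when the t2k alignment was last built
    prediction_date: date  # when the structure prediction was last made


def stale_predictions(records):
    """Return IDs whose prediction predates the current alignment."""
    return [r.seq_id for r in records
            if r.prediction_date < r.alignment_date]
```

The same records could also feed the choice of which alignments to redo first, since the alignment date is already stored.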