Neural Networks:

Use the Weka program to analyze structure data from pdb. You should select an annotated structure of your choice and make predictions for it.

You may want to look at this documentation:

You can also use the FANN library for your experiments.

Deliverables:

Post a written report on your Assignments page using the Lab Report Guidelines. Your report should include the following:
  1. A description of your experiences and the accuracy you were able to achieve with Weka.
  2. The pdb structure you analyzed (see if you can find the gene you identified as being differentially expressed in Down syndrome).
  3. The Weka settings and statistics.

If your results are poor (I got around 30% accuracy with the default values), then try adding more hidden layers (for example, 10,10,10 in the hiddenLayers field), using a lower learning rate, training on more samples, etc.
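If you would rather script these experiments than click through the GUI, the settings above can also be passed on the Weka command line. The following is a minimal sketch, not part of the assignment files: the helper name mlp_command, the weka.jar location, and the chosen values (learning rate 0.1, 500 epochs, 10-fold cross-validation) are assumptions for illustration. The options themselves (-H for hiddenLayers, -L for learning rate, -N for training epochs, -t for the training file, -x for cross-validation folds) are standard MultilayerPerceptron options.

```python
def mlp_command(train_arff, hidden_layers="10,10,10", learning_rate=0.1,
                epochs=500, folds=10, weka_jar="weka.jar"):
    """Build the argument list for a Weka MultilayerPerceptron run.

    weka_jar is a hypothetical path; point it at your own Weka install.
    """
    return [
        "java", "-cp", weka_jar,
        "weka.classifiers.functions.MultilayerPerceptron",
        "-t", train_arff,          # training file (ARFF)
        "-x", str(folds),          # number of cross-validation folds
        "-H", hidden_layers,       # same string as the hiddenLayers field
        "-L", str(learning_rate),  # Weka's default is 0.3
        "-N", str(epochs),         # number of training epochs
    ]


if __name__ == "__main__":
    # Print the command so it can be pasted into a shell.
    print(" ".join(mlp_command("hiv.arff")))
```

To run it directly from Python, pass the list to subprocess.run(mlp_command("hiv.arff"), check=True). Weka prints its accuracy statistics to standard output.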

You should use more than one protein from pdb to make sure your network is not specific to a single protein. You can look proteins up at pdb or NCBI Structure. At pdb you will want to click on the Sequence tab, and down in the Chain Display section you will find a link to [Sequence & DSSP]. Copy and paste the results (minus the title at the top) into a pdb file. You can find secondary structure annotations for all PDB structures at http://www.rcsb.org/pdb/files/ss.txt. A copy of this file is in /usr/local/data/ss.txt on psoda4. If you don't want to go through the annotations by hand, you can use the stride application to retrieve the annotations from pdb files.

Datasets

The attached hiv.pdb file was copied and pasted from the pdb database. You can run the command "perl makearff.pl hiv.pdb hiv.arff" to create an arff file. The script uses 13 inputs; edit the makearff.pl file if you want more or fewer.
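To see what changing the number of inputs means, here is a minimal Python sketch of one common way to build such a file. It is not the makearff.pl script, whose internals are not shown here. It assumes each instance is a window of residues centered on the residue being predicted, that the window size is odd, and that positions past either end of the chain are filled with a padding symbol "X". The function names, attribute names, and padding symbol are all hypothetical.

```python
AMINO_ACIDS = "ARDNCEQGHILKMFPSTWYV"   # the 20 residues listed in this assignment
STRUCTURES = "HBEGITSN"                # structure symbols, with N = nothing
PAD = "X"                              # hypothetical filler for window ends


def windows(sequence, structure, size=13):
    """Yield (window_residues, label) for every residue in the chain."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    half = size // 2
    padded = PAD * half + sequence + PAD * half
    for i, label in enumerate(structure):
        # The window is centered on residue i of the original sequence.
        yield list(padded[i:i + size]), label


def to_arff(sequence, structure, size=13, relation="secondary_structure"):
    """Return ARFF text with `size` nominal inputs and one class attribute."""
    residue_values = ",".join(AMINO_ACIDS + PAD)
    lines = ["@relation " + relation, ""]
    for k in range(size):
        lines.append("@attribute pos%d {%s}" % (k, residue_values))
    lines.append("@attribute structure {%s}" % ",".join(STRUCTURES))
    lines += ["", "@data"]
    for residues, label in windows(sequence, structure, size):
        lines.append(",".join(residues + [label]))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Tiny made-up chain, only to show the output layout.
    print(to_arff("ARDNC", "HHNEE", size=3))
```

A larger window gives the network more context per residue but also more inputs to train. Residues outside the 20 listed here would need to be added to the attribute values before Weka will load the file.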

Another method for making your arff files is to use the ss.txt file referred to above. Copy and paste a few of the sequence and structure pairs into a file; ss1.txt is an example of such a file. Do not change the formatting of the pairs. Then run the command "perl ssparser.pl ss1.txt ss1.arff" to create an arff file. You are free to edit the ssparser.pl file to change the number of inputs. The ss.txt file is huge, so you can use it to get a lot of training data.
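The formatting matters because the sequence and structure records have to line up position by position. The Python sketch below illustrates this. It is not the ssparser.pl script, and it assumes a particular layout: each record starts with a header such as ">101M:A:sequence" or ">101M:A:secstr", the data may be wrapped over several lines, and a blank in the structure record means no annotation. Check these assumptions against your own copy of the file. The function name parse_ss is hypothetical.

```python
def parse_ss(text):
    """Return {(pdb_id, chain): (sequence, structure)} from ss.txt-style text."""
    records = {}
    key = None
    for line in text.splitlines():
        if line.startswith(">"):
            # Assumed header layout: >ID:CHAIN:sequence or >ID:CHAIN:secstr
            pdb_id, chain, kind = line[1:].strip().split(":")
            key = (pdb_id, chain, kind)
            records[key] = ""
        elif key is not None:
            # Do not strip data lines: blanks in a structure record are data.
            records[key] += line

    pairs = {}
    for (pdb_id, chain, kind), sequence in records.items():
        if kind != "sequence":
            continue
        if (pdb_id, chain, "secstr") not in records:
            continue  # skip chains that have no structure record
        structure = records[(pdb_id, chain, "secstr")]
        # Restore trailing blanks lost in copy and paste, then mark
        # unannotated positions with N (nothing).
        structure = structure.ljust(len(sequence)).replace(" ", "N")
        pairs[(pdb_id, chain)] = (sequence, structure)
    return pairs


if __name__ == "__main__":
    # Made-up record in the assumed layout, only to show the result.
    sample = ">101M:A:sequence\nMVLSEG\n>101M:A:secstr\n   HHH\n"
    print(parse_ss(sample))
```

If an editor trims or reflows the blanks in a structure record, the labels shift relative to the residues, which is why the pairs should be pasted exactly as they appear.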

The pdb files will have the following symbols: H=alpha helix; B=residue in isolated beta bridge; E=extended beta strand; G=3-10 helix; I=pi helix; T=hydrogen bonded turn; S=bend; N=Nothing (my perl code inserts this into the arff file).

Amino Acids: A,R,D,N,C,E,Q,G,H,I,L,K,M,F,P,S,T,W,Y,V