Commit 40086dd8 by junyu_yao

protein-sol全部改成当前路径

parents
October 2017
Sequence-based prediction code used in University of Manchester protein-sol server.
Available from download tab at www.protein-sol.manchester.ac.uk.
see Hebdtich et al (2017) Bioinformatics 33:3098-3100.
Jim Warwicker and Mex Hebditch, Manchester
*****************
Copyright (C) 2017 Jim Warwicker and Max Hebdtich
These programs are free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License.
These programs are distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
The code is available 'as is', we are planning further developments to the server
and code, and here is just a snapshot that should allow interested users to make
calculations with multiple fasta sequences (as opposed to the single sequence
operation of the web server).
*****************
CODE is in various perl scripts (.pl)
fasta_seq_reformat_export.pl
seq_compositions_perc_pipeline_export.pl
server_prediction_seq_export.pl
seq_props_ALL_export.pl
profiles_gather_export.pl
RUN is initiated with
'multiple_prediction_wrapper_export.sh sequence_input_file'
sequence_input_file has something like a fasta format, it will convert to
paired records of ID and sequence for each entry:
>blah11
MVKVYAPASSANMSVGFDVLFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFDKLPSEPRENIVYQCWERFCQE
>blah22
MVKVYAPASSANMSVGFDVLFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFDKLPSEPRENIVYQCWERFCQE
but regular fasta should be OK (fasta_seq_reformat_export to convert).
OTHER INPUT (DATA) FILES:
ss_propensities.txt (sec struc propensities)
seq_reference_data_NIWA.txt (fitting to experimental solubility data)
RUNNING, as set up, will occur with all files in the local directort/
OUTPUT
seq_prediction.txt (CSV) contains the data relating to that provided on the server:
LEGEND records - brief information on features used for the predictions
HEADERS records - matched with keywords for the data output SEQUENCE records:
HEADERS PREDICTIONS with SEQUENCE PREDICTIONS
percent-sol, scaled-sol, population-sol, pI
HEADERS FEATURES ORIGINAL gives definitions of the features
HEADERS FEATURES PLOT give short-hand names for the features
both of these HEADERS lines are matched columns with SEQUENCE WEIGHTS and SEQUENCE DEVIATIONS
lines, where:
SEQUENCE WEIGHTS - weights from the fit to experimental data, only 10 features are non-zero
SEQUENCE DEVIATIONS - these are the z-score (see publication) deviation for each feature
then follows PROFILE data across successive 21 amino acid windows for:
Kyte-Doolittle
Uversky fold value - plotted on server
sequence entropy
windowed net charge - plotted on server
TM REGION PREDICTED for >sp|P61626|LYSC_HUMAN - with non-normalised KD of 1.78 compared with threshold of 1.6
HIS will be set at charge 0
column o/p file for each sequence = no - set with extra_output env variable
KDnorm for y is 0.355555555555556
KDnorm for e is 0.111111111111111
KDnorm for c is 0.777777777777778
KDnorm for Q is 0.111111111111111
KDnorm for W is 0.4
KDnorm for n is 0.111111111111111
KDnorm for q is 0.111111111111111
KDnorm for m is 0.711111111111111
KDnorm for N is 0.111111111111111
KDnorm for f is 0.811111111111111
KDnorm for x is 0.5
KDnorm for X is 0.5
KDnorm for F is 0.811111111111111
KDnorm for l is 0.922222222222222
KDnorm for v is 0.966666666666667
KDnorm for K is 0.0666666666666667
KDnorm for s is 0.411111111111111
KDnorm for g is 0.455555555555555
KDnorm for P is 0.322222222222222
KDnorm for H is 0.144444444444444
KDnorm for R is 0
KDnorm for D is 0.111111111111111
KDnorm for r is 0
KDnorm for I is 1
KDnorm for G is 0.455555555555555
KDnorm for a is 0.7
KDnorm for L is 0.922222222222222
KDnorm for M is 0.711111111111111
KDnorm for h is 0.144444444444444
KDnorm for S is 0.411111111111111
KDnorm for A is 0.7
KDnorm for p is 0.322222222222222
KDnorm for E is 0.111111111111111
KDnorm for C is 0.777777777777778
KDnorm for w is 0.4
KDnorm for T is 0.422222222222222
KDnorm for k is 0.0666666666666667
KDnorm for d is 0.111111111111111
KDnorm for Y is 0.355555555555556
KDnorm for t is 0.422222222222222
KDnorm for i is 1
KDnorm for V is 0.966666666666667
HIS will be set at charge 0
column o/p file for each sequence = no - set with extra_output env variable
KDnorm for P is 0.322222222222222
KDnorm for D is 0.111111111111111
KDnorm for Y is 0.355555555555556
KDnorm for x is 0.5
KDnorm for N is 0.111111111111111
KDnorm for w is 0.4
KDnorm for e is 0.111111111111111
KDnorm for H is 0.144444444444444
KDnorm for R is 0
KDnorm for l is 0.922222222222222
KDnorm for i is 1
KDnorm for a is 0.7
KDnorm for y is 0.355555555555556
KDnorm for C is 0.777777777777778
KDnorm for M is 0.711111111111111
KDnorm for F is 0.811111111111111
KDnorm for r is 0
KDnorm for n is 0.111111111111111
KDnorm for k is 0.0666666666666667
KDnorm for Q is 0.111111111111111
KDnorm for I is 1
KDnorm for q is 0.111111111111111
KDnorm for W is 0.4
KDnorm for E is 0.111111111111111
KDnorm for S is 0.411111111111111
KDnorm for d is 0.111111111111111
KDnorm for A is 0.7
KDnorm for v is 0.966666666666667
KDnorm for V is 0.966666666666667
KDnorm for m is 0.711111111111111
KDnorm for h is 0.144444444444444
KDnorm for s is 0.411111111111111
KDnorm for L is 0.922222222222222
KDnorm for f is 0.811111111111111
KDnorm for G is 0.455555555555555
KDnorm for g is 0.455555555555555
KDnorm for c is 0.777777777777778
KDnorm for p is 0.322222222222222
KDnorm for X is 0.5
KDnorm for K is 0.0666666666666667
KDnorm for T is 0.422222222222222
KDnorm for t is 0.422222222222222
HIS will be set at charge 0
column o/p file for each sequence = no - set with extra_output env variable
KDnorm for i is 1
KDnorm for s is 0.411111111111111
KDnorm for F is 0.811111111111111
KDnorm for g is 0.455555555555555
KDnorm for m is 0.711111111111111
KDnorm for Y is 0.355555555555556
KDnorm for S is 0.411111111111111
KDnorm for v is 0.966666666666667
KDnorm for q is 0.111111111111111
KDnorm for D is 0.111111111111111
KDnorm for w is 0.4
KDnorm for P is 0.322222222222222
KDnorm for M is 0.711111111111111
KDnorm for V is 0.966666666666667
KDnorm for W is 0.4
KDnorm for R is 0
KDnorm for t is 0.422222222222222
KDnorm for c is 0.777777777777778
KDnorm for Q is 0.111111111111111
KDnorm for y is 0.355555555555556
KDnorm for h is 0.144444444444444
KDnorm for C is 0.777777777777778
KDnorm for f is 0.811111111111111
KDnorm for H is 0.144444444444444
KDnorm for a is 0.7
KDnorm for l is 0.922222222222222
KDnorm for X is 0.5
KDnorm for T is 0.422222222222222
KDnorm for L is 0.922222222222222
KDnorm for N is 0.111111111111111
KDnorm for K is 0.0666666666666667
KDnorm for k is 0.0666666666666667
KDnorm for r is 0
KDnorm for p is 0.322222222222222
KDnorm for E is 0.111111111111111
KDnorm for d is 0.111111111111111
KDnorm for I is 1
KDnorm for A is 0.7
KDnorm for n is 0.111111111111111
KDnorm for x is 0.5
KDnorm for G is 0.455555555555555
KDnorm for e is 0.111111111111111
HIS will be set at charge 0
column o/p file for each sequence = no - set with extra_output env variable
KDnorm for g is 0.455555555555555
KDnorm for R is 0
KDnorm for N is 0.111111111111111
KDnorm for Y is 0.355555555555556
KDnorm for c is 0.777777777777778
KDnorm for i is 1
KDnorm for K is 0.0666666666666667
KDnorm for G is 0.455555555555555
KDnorm for x is 0.5
KDnorm for e is 0.111111111111111
KDnorm for D is 0.111111111111111
KDnorm for p is 0.322222222222222
KDnorm for a is 0.7
KDnorm for s is 0.411111111111111
KDnorm for M is 0.711111111111111
KDnorm for v is 0.966666666666667
KDnorm for X is 0.5
KDnorm for l is 0.922222222222222
KDnorm for t is 0.422222222222222
KDnorm for I is 1
KDnorm for m is 0.711111111111111
KDnorm for h is 0.144444444444444
KDnorm for P is 0.322222222222222
KDnorm for S is 0.411111111111111
KDnorm for r is 0
KDnorm for q is 0.111111111111111
KDnorm for H is 0.144444444444444
KDnorm for E is 0.111111111111111
KDnorm for f is 0.811111111111111
KDnorm for y is 0.355555555555556
KDnorm for L is 0.922222222222222
KDnorm for F is 0.811111111111111
KDnorm for C is 0.777777777777778
KDnorm for n is 0.111111111111111
KDnorm for W is 0.4
KDnorm for Q is 0.111111111111111
KDnorm for T is 0.422222222222222
KDnorm for A is 0.7
KDnorm for w is 0.4
KDnorm for d is 0.111111111111111
KDnorm for V is 0.966666666666667
KDnorm for k is 0.0666666666666667
HIS will be set at charge 0
column o/p file for each sequence = no - set with extra_output env variable
KDnorm for a is 0.7
KDnorm for L is 0.922222222222222
KDnorm for i is 1
KDnorm for V is 0.966666666666667
KDnorm for D is 0.111111111111111
KDnorm for d is 0.111111111111111
KDnorm for w is 0.4
KDnorm for A is 0.7
KDnorm for m is 0.711111111111111
KDnorm for N is 0.111111111111111
KDnorm for g is 0.455555555555555
KDnorm for c is 0.777777777777778
KDnorm for n is 0.111111111111111
KDnorm for s is 0.411111111111111
KDnorm for P is 0.322222222222222
KDnorm for G is 0.455555555555555
KDnorm for W is 0.4
KDnorm for K is 0.0666666666666667
KDnorm for v is 0.966666666666667
KDnorm for T is 0.422222222222222
KDnorm for H is 0.144444444444444
KDnorm for l is 0.922222222222222
KDnorm for p is 0.322222222222222
KDnorm for e is 0.111111111111111
KDnorm for x is 0.5
KDnorm for k is 0.0666666666666667
KDnorm for S is 0.411111111111111
KDnorm for h is 0.144444444444444
KDnorm for f is 0.811111111111111
KDnorm for Q is 0.111111111111111
KDnorm for M is 0.711111111111111
KDnorm for t is 0.422222222222222
KDnorm for E is 0.111111111111111
KDnorm for X is 0.5
KDnorm for F is 0.811111111111111
KDnorm for Y is 0.355555555555556
KDnorm for y is 0.355555555555556
KDnorm for r is 0
KDnorm for I is 1
KDnorm for C is 0.777777777777778
KDnorm for q is 0.111111111111111
KDnorm for R is 0
HIS will be set at charge 0
column o/p file for each sequence = no - set with extra_output env variable
KDnorm for H is 0.144444444444444
KDnorm for P is 0.322222222222222
KDnorm for E is 0.111111111111111
KDnorm for K is 0.0666666666666667
KDnorm for l is 0.922222222222222
KDnorm for e is 0.111111111111111
KDnorm for T is 0.422222222222222
KDnorm for m is 0.711111111111111
KDnorm for t is 0.422222222222222
KDnorm for F is 0.811111111111111
KDnorm for p is 0.322222222222222
KDnorm for d is 0.111111111111111
KDnorm for S is 0.411111111111111
KDnorm for a is 0.7
KDnorm for Y is 0.355555555555556
KDnorm for f is 0.811111111111111
KDnorm for L is 0.922222222222222
KDnorm for G is 0.455555555555555
KDnorm for D is 0.111111111111111
KDnorm for N is 0.111111111111111
KDnorm for M is 0.711111111111111
KDnorm for s is 0.411111111111111
KDnorm for A is 0.7
KDnorm for X is 0.5
KDnorm for k is 0.0666666666666667
KDnorm for x is 0.5
KDnorm for v is 0.966666666666667
KDnorm for V is 0.966666666666667
KDnorm for i is 1
KDnorm for g is 0.455555555555555
KDnorm for r is 0
KDnorm for R is 0
KDnorm for c is 0.777777777777778
KDnorm for q is 0.111111111111111
KDnorm for I is 1
KDnorm for y is 0.355555555555556
KDnorm for Q is 0.111111111111111
KDnorm for h is 0.144444444444444
KDnorm for W is 0.4
KDnorm for C is 0.777777777777778
KDnorm for n is 0.111111111111111
KDnorm for w is 0.4
HIS will be set at charge 0
column o/p file for each sequence = no - set with extra_output env variable
KDnorm for W is 0.4
KDnorm for T is 0.422222222222222
KDnorm for P is 0.322222222222222
KDnorm for Q is 0.111111111111111
KDnorm for v is 0.966666666666667
KDnorm for I is 1
KDnorm for g is 0.455555555555555
KDnorm for N is 0.111111111111111
KDnorm for k is 0.0666666666666667
KDnorm for p is 0.322222222222222
KDnorm for y is 0.355555555555556
KDnorm for K is 0.0666666666666667
KDnorm for C is 0.777777777777778
KDnorm for i is 1
KDnorm for R is 0
KDnorm for L is 0.922222222222222
KDnorm for H is 0.144444444444444
KDnorm for m is 0.711111111111111
KDnorm for e is 0.111111111111111
KDnorm for t is 0.422222222222222
KDnorm for h is 0.144444444444444
KDnorm for n is 0.111111111111111
KDnorm for r is 0
KDnorm for Y is 0.355555555555556
KDnorm for l is 0.922222222222222
KDnorm for X is 0.5
KDnorm for S is 0.411111111111111
KDnorm for E is 0.111111111111111
KDnorm for a is 0.7
KDnorm for M is 0.711111111111111
KDnorm for G is 0.455555555555555
KDnorm for c is 0.777777777777778
KDnorm for F is 0.811111111111111
KDnorm for V is 0.966666666666667
KDnorm for x is 0.5
KDnorm for d is 0.111111111111111
KDnorm for D is 0.111111111111111
KDnorm for w is 0.4
KDnorm for A is 0.7
KDnorm for s is 0.411111111111111
KDnorm for f is 0.811111111111111
KDnorm for q is 0.111111111111111
HIS will be set at charge 0
column o/p file for each sequence = no - set with extra_output env variable
KDnorm for X is 0.5
KDnorm for q is 0.111111111111111
KDnorm for C is 0.777777777777778
KDnorm for Y is 0.355555555555556
KDnorm for h is 0.144444444444444
KDnorm for t is 0.422222222222222
KDnorm for D is 0.111111111111111
KDnorm for c is 0.777777777777778
KDnorm for A is 0.7
KDnorm for a is 0.7
KDnorm for G is 0.455555555555555
KDnorm for H is 0.144444444444444
KDnorm for T is 0.422222222222222
KDnorm for y is 0.355555555555556
KDnorm for E is 0.111111111111111
KDnorm for g is 0.455555555555555
KDnorm for s is 0.411111111111111
KDnorm for I is 1
KDnorm for P is 0.322222222222222
KDnorm for F is 0.811111111111111
KDnorm for i is 1
KDnorm for K is 0.0666666666666667
KDnorm for w is 0.4
KDnorm for v is 0.966666666666667
KDnorm for x is 0.5
KDnorm for m is 0.711111111111111
KDnorm for W is 0.4
KDnorm for R is 0
KDnorm for e is 0.111111111111111
KDnorm for M is 0.711111111111111
KDnorm for V is 0.966666666666667
KDnorm for k is 0.0666666666666667
KDnorm for p is 0.322222222222222
KDnorm for l is 0.922222222222222
KDnorm for r is 0
KDnorm for n is 0.111111111111111
KDnorm for Q is 0.111111111111111
KDnorm for f is 0.811111111111111
KDnorm for N is 0.111111111111111
KDnorm for d is 0.111111111111111
KDnorm for L is 0.922222222222222
KDnorm for S is 0.411111111111111
HIS will be set at charge 0
column o/p file for each sequence = no - set with extra_output env variable
KDnorm for E is 0.111111111111111
KDnorm for y is 0.355555555555556
KDnorm for G is 0.455555555555555
KDnorm for m is 0.711111111111111
KDnorm for l is 0.922222222222222
KDnorm for N is 0.111111111111111
KDnorm for f is 0.811111111111111
KDnorm for V is 0.966666666666667
KDnorm for w is 0.4
KDnorm for S is 0.411111111111111
KDnorm for I is 1
KDnorm for c is 0.777777777777778
KDnorm for R is 0
KDnorm for M is 0.711111111111111
KDnorm for d is 0.111111111111111
KDnorm for T is 0.422222222222222
KDnorm for t is 0.422222222222222
KDnorm for Q is 0.111111111111111
KDnorm for s is 0.411111111111111
KDnorm for h is 0.144444444444444
KDnorm for a is 0.7
KDnorm for P is 0.322222222222222
KDnorm for v is 0.966666666666667
KDnorm for p is 0.322222222222222
KDnorm for X is 0.5
KDnorm for W is 0.4
KDnorm for i is 1
KDnorm for q is 0.111111111111111
KDnorm for n is 0.111111111111111
KDnorm for H is 0.144444444444444
KDnorm for D is 0.111111111111111
KDnorm for k is 0.0666666666666667
KDnorm for Y is 0.355555555555556
KDnorm for r is 0
KDnorm for C is 0.777777777777778
KDnorm for x is 0.5
KDnorm for L is 0.922222222222222
KDnorm for g is 0.455555555555555
KDnorm for F is 0.811111111111111
KDnorm for e is 0.111111111111111
KDnorm for A is 0.7
KDnorm for K is 0.0666666666666667
HIS will be set at charge 0
column o/p file for each sequence = no - set with extra_output env variable
KDnorm for e is 0.111111111111111
KDnorm for a is 0.7
KDnorm for n is 0.111111111111111
KDnorm for x is 0.5
KDnorm for W is 0.4
KDnorm for I is 1
KDnorm for s is 0.411111111111111
KDnorm for S is 0.411111111111111
KDnorm for N is 0.111111111111111
KDnorm for X is 0.5
KDnorm for T is 0.422222222222222
KDnorm for v is 0.966666666666667
KDnorm for M is 0.711111111111111
KDnorm for h is 0.144444444444444
KDnorm for H is 0.144444444444444
KDnorm for l is 0.922222222222222
KDnorm for k is 0.0666666666666667
KDnorm for Q is 0.111111111111111
KDnorm for C is 0.777777777777778
KDnorm for A is 0.7
KDnorm for P is 0.322222222222222
KDnorm for G is 0.455555555555555
KDnorm for t is 0.422222222222222
KDnorm for m is 0.711111111111111
KDnorm for y is 0.355555555555556
KDnorm for d is 0.111111111111111
KDnorm for c is 0.777777777777778
KDnorm for V is 0.966666666666667
KDnorm for R is 0
KDnorm for q is 0.111111111111111
KDnorm for Y is 0.355555555555556
KDnorm for F is 0.811111111111111
KDnorm for r is 0
KDnorm for D is 0.111111111111111
KDnorm for E is 0.111111111111111
KDnorm for L is 0.922222222222222
KDnorm for w is 0.4
KDnorm for f is 0.811111111111111
KDnorm for g is 0.455555555555555
KDnorm for K is 0.0666666666666667
KDnorm for p is 0.322222222222222
KDnorm for i is 1
HIS will be set at charge 0
column o/p file for each sequence = no - set with extra_output env variable
KDnorm for S is 0.411111111111111
KDnorm for r is 0
KDnorm for W is 0.4
KDnorm for A is 0.7
KDnorm for a is 0.7
KDnorm for f is 0.811111111111111
KDnorm for R is 0
KDnorm for L is 0.922222222222222
KDnorm for h is 0.144444444444444
KDnorm for Q is 0.111111111111111
KDnorm for e is 0.111111111111111
KDnorm for N is 0.111111111111111
KDnorm for V is 0.966666666666667
KDnorm for s is 0.411111111111111
KDnorm for d is 0.111111111111111
KDnorm for C is 0.777777777777778
KDnorm for k is 0.0666666666666667
KDnorm for q is 0.111111111111111
KDnorm for c is 0.777777777777778
KDnorm for I is 1
KDnorm for F is 0.811111111111111
KDnorm for P is 0.322222222222222
KDnorm for M is 0.711111111111111
KDnorm for D is 0.111111111111111
KDnorm for G is 0.455555555555555
KDnorm for t is 0.422222222222222
KDnorm for v is 0.966666666666667
KDnorm for E is 0.111111111111111
KDnorm for g is 0.455555555555555
KDnorm for x is 0.5
KDnorm for T is 0.422222222222222
KDnorm for K is 0.0666666666666667
KDnorm for w is 0.4
KDnorm for i is 1
KDnorm for m is 0.711111111111111
KDnorm for p is 0.322222222222222
KDnorm for H is 0.144444444444444
KDnorm for l is 0.922222222222222
KDnorm for X is 0.5
KDnorm for Y is 0.355555555555556
KDnorm for y is 0.355555555555556
KDnorm for n is 0.111111111111111
HIS will be set at charge 0
column o/p file for each sequence = no - set with extra_output env variable
KDnorm for n is 0.111111111111111
KDnorm for d is 0.111111111111111
KDnorm for r is 0
KDnorm for y is 0.355555555555556
KDnorm for e is 0.111111111111111
KDnorm for D is 0.111111111111111
KDnorm for T is 0.422222222222222
KDnorm for s is 0.411111111111111
KDnorm for f is 0.811111111111111
KDnorm for a is 0.7
KDnorm for Q is 0.111111111111111
KDnorm for N is 0.111111111111111
KDnorm for x is 0.5
KDnorm for h is 0.144444444444444
KDnorm for i is 1
KDnorm for p is 0.322222222222222
KDnorm for W is 0.4
KDnorm for t is 0.422222222222222
KDnorm for L is 0.922222222222222
KDnorm for V is 0.966666666666667
KDnorm for F is 0.811111111111111
KDnorm for P is 0.322222222222222
KDnorm for l is 0.922222222222222
KDnorm for X is 0.5
KDnorm for q is 0.111111111111111
KDnorm for K is 0.0666666666666667
KDnorm for A is 0.7
KDnorm for I is 1
KDnorm for c is 0.777777777777778
KDnorm for S is 0.411111111111111
KDnorm for m is 0.711111111111111
KDnorm for H is 0.144444444444444
KDnorm for k is 0.0666666666666667
KDnorm for M is 0.711111111111111
KDnorm for G is 0.455555555555555
KDnorm for R is 0
KDnorm for g is 0.455555555555555
KDnorm for v is 0.966666666666667
KDnorm for w is 0.4
KDnorm for Y is 0.355555555555556
KDnorm for C is 0.777777777777778
KDnorm for E is 0.111111111111111
#!/usr/bin/perl
use FindBin;
$file_here = "$FindBin::Bin/blah.txt";
printf $file_here;
open (BLAH, ">>$file_here") or die "cannot open blah.txt\n"; # file for comments, throughout server code
open (IN, "< $FindBin::Bin/reformat.in") or die "cannot open reformat_in\n";
@in=<IN>;
close (IN);
$file_here = "$FindBin::Bin/reformat_out";
open (OUT, ">$file_here") or die "cannot open reformat_out\n";
$seq_stored = 'no';
$skip_seq = 'no';
foreach $line (@in) {
chomp $line;
@words = split (" ",$line);
if ((substr ($line,0,1) eq '>') or ($words[0] eq '//')) { # new sequence or a // terminator
if ($seq_stored eq 'yes') { # process and write out the previous sequence, if no errors
$seqUC = uc($seq_here); # uppercase the sequence
$seqlen = length($seqUC);
$seq_new = "";
$naa = 0;
$n = 1; # ?? VERY ODD does the loop not fix starting n at 1 ??
for ($n==1; $n<=$seqlen; $n++) {
$char_here = substr($seqUC,$n-1,1);
if (($char_here ne ' ') and ($char_here ne '*')) { # ignore blanks and asterisks
if ($char_here =~ /[ACDEFGHIKLMNPQRSTVWY]/) {
$seq_new .= $char_here;
$naa++;
# } else { # not happy, but we can get <cr> at end, and easier to take only aa than skip allowed non-aa
# printf BLAH "**** stopping - fasta_seq_reformat - non 20 aa detected for protein ID = $current_id\n";
# $skip_seq = 'yes';
}
}
}
if ($skip_seq ne 'yes') {
if ($naa >= 21) {
printf OUT "$current_id\n"; # print the stored id
printf OUT "$seq_new\n";
} else { # no aas
printf BLAH "**** stopping - fasta_seq_reformat - shorter than 21 aas for protein ID = $current_id\n";
}
}
$skip_seq = 'no';
}
if (substr ($line,0,1) eq '>') {
$current_id = $line; # store the new seq ID - printer later, if seq is cleared
$seq_stored = 'yes'; # set up for new sequence
$seq_here = '';
}
} else { # add on to curr seq, if not blank
$len_here = length($line);
if ($len_here != 0) {
$seq_here .= $line;
}
}
}
if ($seq_stored eq 'yes') { # print the last seq if it conforms
$seqUC = uc($seq_here); # uppercase the sequence
$seqlen = length($seqUC);
$seq_new = "";
$naa = 0;
$n = 1; # ?? VERY ODD does the loop not fix starting n at 1 ??
for ($n==1; $n<=$seqlen; $n++) {
$char_here = substr($seqUC,$n-1,1);
if (($char_here ne ' ') and ($char_here ne '*')) { # ignore blanks and asterisks
if ($char_here =~ /[ACDEFGHIKLMNPQRSTVWY]/) {
$seq_new .= $char_here;
$naa++;
}
}
}
if ($skip_seq ne 'yes') {
if ($naa >= 21) {
printf OUT "$current_id\n"; # print the stored id
printf OUT "$seq_new\n";
} else { # no aas
printf BLAH "**** stopping - fasta_seq_reformat - shorter than 21 aas for protein ID = $current_id\n";
}
}
$skip_seq = 'no';
}
close (OUT);
close (BLAH);
exit;
#!/bin/bash
FASTA_in=$1
SCRIPTDIR="/opt/workspace/D3/src/static/protein-sol"
cp $FASTA_in $SCRIPTDIR/reformat.in
perl $SCRIPTDIR/fasta_seq_reformat_export.pl > $SCRIPTDIR/run.log
mv $SCRIPTDIR/reformat_out $FASTA_in
cp $FASTA_in $SCRIPTDIR/composition.in
perl $SCRIPTDIR/seq_compositions_perc_pipeline_export.pl >> $SCRIPTDIR/run.log
mv $SCRIPTDIR/composition_all.out $SCRIPTDIR/seq_composition.txt
perl $SCRIPTDIR/server_prediction_seq_export.pl >> $SCRIPTDIR/run.log
cp $FASTA_in $SCRIPTDIR/seq_props.in
perl $SCRIPTDIR/seq_props_ALL_export.pl >> $SCRIPTDIR/run.log
mv $SCRIPTDIR/seq_prediction.txt $SCRIPTDIR/seq_prediction_OLD.txt
perl $SCRIPTDIR/profiles_gather_export.pl > $SCRIPTDIR/run.log
#rm bins.txt reformat.in seq_props.in seq_props.out STYprops.out composition.in seq_prediction_OLD.txt
#!/usr/bin/perl -w
use FindBin;
$file_here = "$FindBin::Bin/blah.txt";
open (BLAH, ">>$file_here"); # file for comments, throughout server code
open (my $PROPS, "<", "$FindBin::Bin/STYprops.out") or die "cannot open STYprops.out profiles_gather\n";
@props = <$PROPS>;
close ($PROPS);
$nseq = 0;
foreach $line (@props) {
chomp $line;
@words = split (" ",$line);
$nwords = @words;
if (exists $words[0]) { # non-null line
if ($words[0] eq 'STARTID') { # new sequence - store the ID
$id_here = $words[1];
$nseq++;
$seqs{$id_here} = $nseq;
$nwin = 0;
} elsif ($words[0] eq 'NONP') { # a window line, assume associate with id_here
$nwin++;
$KD[$nseq][$nwin] = $words[5];
$FI[$nseq][$nwin] = $words[6];
$ent[$nseq][$nwin] = $words[7];
$meanQ[$nseq][$nwin] = $words[10];
} elsif ($words[0] eq 'ENDID') { # have the profiles
if ($words[1] ne $id_here) {
printf BLAH "**** STOPPING - IDs mismatch in profiles_gather.pl\n";
exit;
}
$numwin[$nseq] = $nwin; # store the number of windows (naa - 20) for this sequence
}
}
}
# read in the seq_prediction_OLD.txt file and write out the updated version with profiles added (interstitial)
$file_here = "$FindBin::Bin/seq_prediction_OLD.txt";
open (OLD, "<$file_here") or die "cannot open seq_prediction_OLD.txt";
@old = <OLD>;
close (OLD);
$file_here = "$FindBin::Bin/../../tmp/seq_prediction.txt";
open (NEW, ">$file_here") or die "cannot open seq_prediction.txt";
foreach $line (@old) {
chomp $line;
@words = split (",",$line);
$nwords = @words;
if (exists $words[0]) { # non-null line
if ($words[0] eq 'SEQUENCE DEVIATIONS') { # write matched seq profiles after this line
printf NEW "$line\n"; # write the SEQUENCE DEVIATIONS line
$id_here = $words[1];
if (exists $seqs{$id_here}) {
$nseq_here = $seqs{$id_here};
$nwin_here = $numwin[$nseq_here];
printf NEW "SEQUENCE PROFILE KyteDoolittle,$id_here";
for ($n=1; $n<=$nwin_here; $n++) { printf NEW ",$KD[$nseq_here][$n]"; }
printf NEW "\n";
printf NEW "SEQUENCE PROFILE Uversky??,$id_here";
for ($n=1; $n<=$nwin_here; $n++) { printf NEW ",$FI[$nseq_here][$n]"; }
printf NEW "\n";
printf NEW "SEQUENCE PROFILE entropy,$id_here";
for ($n=1; $n<=$nwin_here; $n++) { printf NEW ",$ent[$nseq_here][$n]"; }
printf NEW "\n";
printf NEW "SEQUENCE PROFILE charge,$id_here";
for ($n=1; $n<=$nwin_here; $n++) { printf NEW ",$meanQ[$nseq_here][$n]"; }
printf NEW "\n";
} else {
# ?? - sequence unmatched - what to do?
}
} else {
printf NEW "$line\n"; # all other (than SEQUENCE DEVIATIONS) lines write and carry on
}
} else { # null line write
printf NEW "$line\n";
}
}
close (NEW);
close (BLAH);
exit;
This source diff could not be displayed because it is too large. You can view the blob instead.
#!/usr/bin/perl -w
use FindBin;
$file_here = "$FindBin::Bin/blah.txt";
open (BLAH, ">>$file_here"); # file used for comments throughout server code
$window[1] = 21;
$window_half[1] = 10;
$window[2] = 51;
$window_half[2] = 25;
$window_flag[1] = "WIN21";
$window_flag[2] = "WIN51";
@aa_single = ("A","C","D","E","F","G","H","I","K","L","M","N","P","Q","R","S","T","V","W","Y");
if ((exists $ENV{MEMBRANE_EXCLUDE}) and ($ENV{MEMBRANE_EXCLUDE} eq "YES")) {
$mem_exclude = "yes";
$KD_threshold = 1.6; # value of 19 win avg (-4.5 to 4.5) KD suggested by KD - we'll use for 21 win
printf BLAH "membrane proteins will be excluded, based on KD of 21 aa window > ?? value absent \n";
} else {
$mem_exclude = "no";
$KD_threshold = 1.6; # value of 19 win avg (-4.5 to 4.5) KD suggested by KD - we'll use for 21 win, still use even if not excl
}
# e.g. for excel correlation calc, if we have the abundance/solub data in the ID line
if ((exists $ENV{WHOLE_SEQ_OUT}) and ($ENV{WHOLE_SEQ_OUT} eq "yes")) {
$whole_seq_out = "yes";
printf BLAH "WHOLE_SEQ_OUT set: for WHOLE-SEQ only, output file ID line with whole seq data, in whole_seq_data.txt\n";
} else {
$whole_seq_out = "no";
}
$file_here = "$FindBin::Bin/composition.in";
open (COMP_IN, "<$file_here") or die "cannot open composition.in\n";
$nseq = 0;
$first_seq = 'yes';
while ($line = <COMP_IN>) {
chomp $line;
@words = split ("",$line);
@fields = split (" ",$line);
if (exists $words[0]) {
if ($words[0] eq ">") { # new protein ID
if ($first_seq eq 'yes') { # first ID, no seqbuf
$seqbuf = "";
$idbuf = $fields[0];
$textbuf = $line;
$first_seq = 'no';
} else {
if ($seqbuf ne "") { # only process further if non-null seq
$nseq++;
$seqtext[$nseq] = $textbuf;
$seqid[$nseq] = $idbuf;
$seqtmp{$seqid[$nseq]} = $seqbuf; # store away the sequence
}
$idbuf = $fields[0];
$textbuf = $line;
$seqbuf = "";
}
} else {
if ($first_seq eq 'yes') {
printf BLAH "did not see first seq ID\n";
exit;
}
$seqbuf .= $line;
}
}
}
if ($seqbuf ne "") { # catch the last seq, if non-null
$nseq++;
$seqtext[$nseq] = $textbuf;
$seqid[$nseq] = $idbuf;
$seqtmp{$seqid[$nseq]} = $seqbuf; # store away the sequence
}
$nseqs = $nseq;
close (COMP_IN);
$file_here = "$FindBin::Bin/composition_all.out";
open (COMP_OUT, ">$file_here") or die "cannot open composition_all.out\n";
if ($whole_seq_out eq "yes") {
$file_here = "$FindBin::Bin/whole_seq_data.txt";
open (DATA_OUT, ">$file_here") or die "cannot open whole_seq_data.txt\n";
}
printf COMP_OUT "\nALL WIN21-MAX, WIN21-MIN, WIN51-MAX, WIN51-MIN data lines for each sequence\n";
printf COMP_OUT "output header lines flagged by each data type for grep purposes\n\n";
@flags = ('WHOLE-SEQ', 'WIN21-MAX', 'WIN21-MIN', 'WIN51-MAX', 'WIN51-MIN');
foreach $win_flag (@flags) {
printf COMP_OUT "$win_flag, ORF-ID,K-R,D-E,naa,totperc,";
printf COMP_OUT "A,C,D,E,F,G,H,I,K,L,";
printf COMP_OUT "M,N,P,Q,R,S,T,V,W,Y,";
printf COMP_OUT "K+R,D+E,K+R-D-E,K+R+D+E,F+W+Y,pI,";
printf COMP_OUT "KyteDoo,abs-charge,FoldIndex,disorder,entropy,betapropensity\n\n";
if (($whole_seq_out eq 'yes') and ($win_flag eq 'WHOLE-SEQ')) {
printf DATA_OUT "$win_flag, ORF-ID,K-R,D-E,naa,totperc,";
printf DATA_OUT "A,C,D,E,F,G,H,I,K,L,";
printf DATA_OUT "M,N,P,Q,R,S,T,V,W,Y,";
printf DATA_OUT "K+R,D+E,K+R-D-E,K+R+D+E,F+W+Y,pI,";
printf DATA_OUT "KyteDoo,abs-charge,FoldIndex,disorder,entropy,betapropensity\n";
}
}
$Kall = 0;
$Rall = 0;
$Dall = 0;
$Eall = 0;
$Fall = 0;
$Wall = 0;
$Yall = 0;
$naa_all = 0;
$nmem_exclude = 0;
$nall_exclude = 0;
for ($nseq=1; $nseq<=$nseqs; $nseq++) {
$seqhere = $seqtmp{$seqid[$nseq]};
$lentmp = length $seqhere;
$KD_max_original = 999;
@words = split ("",$seqhere);
$naa = 0;
$Kcount = 0;
$Rcount = 0;
$Dcount = 0;
$Ecount = 0;
$Fcount = 0;
$Wcount = 0;
$Ycount = 0;
for ($aa=1; $aa<=20; $aa++) {
$aa_count[$aa-1] = 0;
}
@seq_for_pI = (); # hopefully this zeroes the array
$nseq_for_pI = 0;
for ($n=1; $n<=$lentmp; $n++) {
if ($words[$n-1] ne " ") {
if ($words[$n-1] eq "*") {
if ($naa == 0) {
printf BLAH "STOPPING - first seq char is a * - Plation\n";
exit;
}
$phosupd[$naa] = "P";
} else {
$seq_for_pI[$nseq_for_pI] = $words[$n-1];
$nseq_for_pI++;
$naa++;
$sequpd[$naa] = $words[$n-1];
$phosupd[$naa] = " ";
if ($sequpd[$naa] eq 'K') { $Kcount++; }
if ($sequpd[$naa] eq 'R') { $Rcount++; }
if ($sequpd[$naa] eq 'D') { $Dcount++; }
if ($sequpd[$naa] eq 'E') { $Ecount++; }
if ($sequpd[$naa] eq 'F') { $Fcount++; }
if ($sequpd[$naa] eq 'W') { $Wcount++; }
if ($sequpd[$naa] eq 'Y') { $Ycount++; }
for ($aa=1; $aa<=20; $aa++) {
if ($sequpd[$naa] eq $aa_single[$aa-1]) { $aa_count[$aa-1]++; }
}
}
}
}
$pI_all = protein_pI (@seq_for_pI);
$null_return = 'no';
$KD_all = protein_KD (@seq_for_pI);
if ($KD_all == 999) { $null_return = 'yes'; }
$Q_all = protein_Q (@seq_for_pI);
if ($Q_all == 999) { $null_return = 'yes'; }
$absQ_all = abs ($Q_all);
$foldind_all = 2.785*$KD_all - $absQ_all - 1.151;
$disorder_all = protein_disorder (@seq_for_pI);
if ($disorder_all == 999) { $null_return = 'yes'; }
$ent_all = protein_ent (@seq_for_pI);
if ($ent_all == 999) { $null_return = 'yes'; }
$betastrand_all = protein_betastrand (@seq_for_pI);
if ($betastrand_all == 999) { $null_return = 'yes'; }
$perc_total = 0; # work out and output the various quantities for whole sequence
for ($aa=1; $aa<=20; $aa++) {
$aa_perc[$aa-1] = (100*$aa_count[$aa-1]/$naa);
$perc_total += $aa_perc[$aa-1];
}
if ($naa != 0) {
$Kcomp = $Kcount/$naa;
$Rcomp = $Rcount/$naa;
$Dcomp = $Dcount/$naa;
$Ecomp = $Ecount/$naa;
$Fcomp = $Fcount/$naa;
$Wcomp = $Wcount/$naa;
$Ycomp = $Ycount/$naa;
$KmR_perc = 100*($Kcomp - $Rcomp);
$DmE_perc = 100*($Dcomp - $Ecomp);
} else {
$Kcomp = 0; # not really 0 of course
$Rcomp = 0; # also not really 0
$Dcomp = 0; # not really 0 of course
$Ecomp = 0; # also not really 0
$Fcomp = 0;
$Wcomp = 0;
$Ycomp = 0;
$KmR_perc = 0;
$DmE_perc = 0;
}
$KR_perc = 100*($Kcomp+$Rcomp);
$DE_perc = 100*($Dcomp+$Ecomp);
$FWY_perc = 100*($Fcomp+$Wcomp+$Ycomp);
$KR_DE_perc = $KR_perc - $DE_perc;
$KRplusDE_perc = $KR_perc + $DE_perc;
$Kall += $Kcount;
$Rall += $Rcount;
$Dall += $Dcount;
$Eall += $Ecount;
$Fall += $Fcount;
$Wall += $Wcount;
$Yall += $Ycount;
$naa_all += $naa;
if ($null_return ne 'no') {
printf COMP_OUT "\nSTOPPING - PROBLEM,$seqid[$nseq],null_return\n";
exit;
}
if ($naa != 0) { # now for the 21 and 51 aa windows, maxima and minima
for ($nwin=1; $nwin<=2; $nwin++) { # 1 = 21 and 2 = 51
$window2 = $window_half[$nwin];
$win_flag = $window_flag[$nwin];
$at_least_one = "no"; # overwritten if we see a full length window
$KR_perc_max = -100;
$KR_perc_min = 100;
$DE_perc_max = -100;
$DE_perc_min = 100;
$KR_DE_perc_max = -100;
$KR_DE_perc_min = 100;
$KRplusDE_perc_max = -100;
$KRplusDE_perc_min = 100;
$KmR_perc_max = -100;
$KmR_perc_min = 100;
$DmE_perc_max = -100;
$DmE_perc_min = 100;
$FWY_perc_max = -100;
$FWY_perc_min = 100;
for ($aa=1; $aa<=20; $aa++) {
$aa_perc_min[$aa-1] = 0;
$aa_perc_max[$aa-1] = 0;
}
$pI_max = 1;
$pI_min = 14;
$KD_max = 0;
$KD_min = 1;
$absQ_max = 0;
$absQ_min = 1;
$foldind_max = -2;
$foldind_min = 2;
$disorder_max = -1;
$disorder_min = 1;
$ent_max = 0;
$ent_min = 5;
$betastrand_max = -1;
$betastrand_min = 1;
for ($maa=1; $maa<=$naa; $maa++) {
$Kwin = 0;
$Rwin = 0;
$Dwin = 0;
$Ewin = 0;
$Fwin = 0;
$Wwin = 0;
$Ywin = 0;
for ($aa=1; $aa<=20; $aa++) {
$aa_count_win[$aa-1] = 0;
}
$win_start = $maa - $window2;
if ($win_start < 1) {$win_start = 1; }
$win_end = $maa + $window2;
if ($win_end > $naa) {$win_end = $naa; }
@seq_for_pI = (); # hopefully this zeroes the array
$nseq_for_pI = 0;
for ($win=$win_start; $win<=$win_end; $win++) {
if ($sequpd[$win] eq 'K') { $Kwin++; }
if ($sequpd[$win] eq 'R') { $Rwin++; }
if ($sequpd[$win] eq 'D') { $Dwin++; }
if ($sequpd[$win] eq 'E') { $Ewin++; }
if ($sequpd[$win] eq 'F') { $Fwin++; }
if ($sequpd[$win] eq 'W') { $Wwin++; }
if ($sequpd[$win] eq 'Y') { $Ywin++; }
for ($aa=1; $aa<=20; $aa++) {
if ($sequpd[$win] eq $aa_single[$aa-1]) { $aa_count_win[$aa-1]++; }
}
$seq_for_pI[$nseq_for_pI] = $sequpd[$win];
$nseq_for_pI++;
}
$num_win = $win_end - $win_start + 1;
if ($num_win == $window[$nwin]) { # process
$at_least_one = "yes"; # we have a full length window
$perc_total_win = 0;
for ($aa=1; $aa<=20; $aa++) {
$aa_perc_win[$aa-1] = (100*$aa_count_win[$aa-1]/$num_win);
$perc_total_win += $aa_perc_win[$aa-1];
}
$Kcomp_win = $Kwin/$num_win;
$Rcomp_win = $Rwin/$num_win;
$Dcomp_win = $Dwin/$num_win;
$Ecomp_win = $Ewin/$num_win;
$Fcomp_win = $Fwin/$num_win;
$Wcomp_win = $Wwin/$num_win;
$Ycomp_win = $Ywin/$num_win;
$KmR_perc_win = 100*($Kcomp_win - $Rcomp_win);
$DmE_perc_win = 100*($Dcomp_win - $Ecomp_win);
$KR_perc_win = 100*($Kcomp_win+$Rcomp_win);
$DE_perc_win = 100*($Dcomp_win+$Ecomp_win);
$KR_DE_perc_win = $KR_perc_win - $DE_perc_win;
$KRplusDE_perc_win = $KR_perc_win + $DE_perc_win;
$FWY_perc_win = 100*($Fcomp_win+$Wcomp_win+$Ycomp_win);
if ($KR_perc_win > $KR_perc_max) { $KR_perc_max = $KR_perc_win; }
if ($KR_perc_win < $KR_perc_min) { $KR_perc_min = $KR_perc_win; }
if ($DE_perc_win > $DE_perc_max) { $DE_perc_max = $DE_perc_win; }
if ($DE_perc_win < $DE_perc_min) { $DE_perc_min = $DE_perc_win; }
if ($KR_DE_perc_win > $KR_DE_perc_max) { $KR_DE_perc_max = $KR_DE_perc_win; }
if ($KR_DE_perc_win < $KR_DE_perc_min) { $KR_DE_perc_min = $KR_DE_perc_win; }
if ($KRplusDE_perc_win > $KRplusDE_perc_max) { $KRplusDE_perc_max = $KRplusDE_perc_win; }
if ($KRplusDE_perc_win < $KRplusDE_perc_min) { $KRplusDE_perc_min = $KRplusDE_perc_win; }
if ($KmR_perc_win > $KmR_perc_max) { $KmR_perc_max = $KmR_perc_win; }
if ($KmR_perc_win < $KmR_perc_min) { $KmR_perc_min = $KmR_perc_win; }
if ($DmE_perc_win > $DmE_perc_max) { $DmE_perc_max = $DmE_perc_win; }
if ($DmE_perc_win < $DmE_perc_min) { $DmE_perc_min = $DmE_perc_win; }
if ($FWY_perc_win > $FWY_perc_max) { $FWY_perc_max = $FWY_perc_win; }
if ($FWY_perc_win < $FWY_perc_min) { $FWY_perc_min = $FWY_perc_win; }
for ($aa=1; $aa<=20; $aa++) {
if ($aa_perc_win[$aa-1] > $aa_perc_max[$aa-1]) { $aa_perc_max[$aa-1] = $aa_perc_win[$aa-1]; }
if ($aa_perc_win[$aa-1] < $aa_perc_min[$aa-1]) { $aa_perc_min[$aa-1] = $aa_perc_win[$aa-1]; }
}
$pI_win = protein_pI (@seq_for_pI);
if ($pI_win > $pI_max) { $pI_max = $pI_win; }
if ($pI_win < $pI_min) { $pI_min = $pI_win; }
$null_return = 'no';
$KD_win = protein_KD (@seq_for_pI);
if ($KD_win == 999) { $null_return = 'yes'; }
$Q_win = protein_Q (@seq_for_pI);
if ($Q_win == 999) { $null_return = 'yes'; }
$absQ_win = abs ($Q_win);
$foldind_win = 2.785*$KD_win - $absQ_win - 1.151;
$disorder_win = protein_disorder (@seq_for_pI);
if ($disorder_win == 999) { $null_return = 'yes'; }
$ent_win = protein_ent (@seq_for_pI);
if ($ent_win == 999) { $null_return = 'yes'; }
$betastrand_win = protein_betastrand (@seq_for_pI);
if ($betastrand_win == 999) { $null_return = 'yes'; }
if ($KD_win > $KD_max) {
$KD_max = $KD_win;
$KD_max_original = (9*$KD_max) - 4.5 # rescale back to KD proper, use if we're exlcuding mem proteins
}
if ($KD_win < $KD_min) { $KD_min = $KD_win; }
if ($absQ_win > $absQ_max) { $absQ_max = $absQ_win; }
if ($absQ_win < $absQ_min) { $absQ_min = $absQ_win; }
if ($foldind_win > $foldind_max) { $foldind_max = $foldind_win; }
if ($foldind_win < $foldind_min) { $foldind_min = $foldind_win; }
if ($disorder_win > $disorder_max) { $disorder_max = $disorder_win; }
if ($disorder_win < $disorder_min) { $disorder_min = $disorder_win; }
if ($ent_win > $ent_max) { $ent_max = $ent_win; }
if ($ent_win < $ent_min) { $ent_min = $ent_win; }
if ($betastrand_win > $betastrand_max) { $betastrand_max = $betastrand_win; }
if ($betastrand_win < $betastrand_min) { $betastrand_min = $betastrand_win; }
}
} # end of this (21 51) window sliding along for this sequence
if ($nwin == 1) { # 21 aa window
# if ($mem_exclude eq "yes") { # put lower down to give flag pred TM, even if not skipping it
if ($win_flag ne "WIN21") {
printf BLAH "STOPPING - win_flag should be 21 for KD-membrane protein check\n";
exit;
}
if ($null_return ne 'no') { goto SKIPPED; } # skip protein since we cannot KD test a null return
if ($at_least_one ne "yes") { goto SKIPPED; } # skip protein since no window to KD test
if ($KD_max_original == 999) {
printf BLAH "STOPPING - KD_max_original unset\n";
} else {
if ($KD_max_original > $KD_threshold) { # any KD looks like mem window, report, count and outta here
if ($mem_exclude eq "yes") {
$nmem_exclude++;
printf BLAH "MEM EXCLUDE - $seqid[$nseq] - with KD, threshold = $KD_max_original $KD_threshold\n";
goto SKIPPED;
} else {
printf BLAH "\nTM REGION PREDICTED for $seqid[$nseq] - with non-normalised KD of ";
printf BLAH "%6.2f", $KD_max_original;
printf BLAH " compared with threshold of $KD_threshold\n";
}
}
}
printf COMP_OUT "\n";
printf COMP_OUT "WHOLE-SEQ,$seqid[$nseq],";
printf COMP_OUT "%7.2f,%7.2f,%5.0f,%7.2f", $KmR_perc, $DmE_perc, $naa, $perc_total;
for ($aa=1; $aa<=20; $aa++) { printf COMP_OUT ",%6.2f", $aa_perc[$aa-1]; }
printf COMP_OUT ",%7.2f,%7.2f,%7.2f,%7.2f,%7.2f,%6.2f", $KR_perc, $DE_perc, $KR_DE_perc, $KRplusDE_perc, $FWY_perc, $pI_all;
printf COMP_OUT ",%8.3f,%8.3f,%8.3f,%8.3f, %8.3f, %8.3f", $KD_all, $absQ_all, $foldind_all, $disorder_all, $ent_all, $betastrand_all;
printf COMP_OUT "\n";
if ($whole_seq_out eq 'yes') {
# printf DATA_OUT "\n";
printf DATA_OUT "WHOLE-SEQ,$seqtext[$nseq],"; # seqtext (rather than seqid) should be the whole ID line
printf DATA_OUT "%7.2f,%7.2f,%5.0f,%7.2f", $KmR_perc, $DmE_perc, $naa, $perc_total;
for ($aa=1; $aa<=20; $aa++) { printf DATA_OUT ",%6.2f", $aa_perc[$aa-1]; }
printf DATA_OUT ",%7.2f,%7.2f,%7.2f,%7.2f,%7.2f,%6.2f", $KR_perc, $DE_perc, $KR_DE_perc, $KRplusDE_perc, $FWY_perc, $pI_all;
printf DATA_OUT ",%8.3f,%8.3f,%8.3f,%8.3f, %8.3f, %8.3f", $KD_all, $absQ_all, $foldind_all, $disorder_all, $ent_all, $betastrand_all;
printf DATA_OUT "\n";
}
} # end nwin=1 check
if ($null_return eq 'no') { # at present have no reporting of null returns for windows
if ($at_least_one eq "yes") { # we have at least one full length window
# OTHERWISE we simply MISS OUT this write
# so NOT ALL SEQS will have writes for both windows
$flag_here = $win_flag."-MAX";
printf COMP_OUT "$flag_here,$seqid[$nseq],";
printf COMP_OUT "%7.2f,%7.2f,%5.0f,%7.2f", $KmR_perc_max, $DmE_perc_max, $window[$nwin], $perc_total_win;
for ($aa=1; $aa<=20; $aa++) { printf COMP_OUT ",%6.2f", $aa_perc_max[$aa-1]; }
printf COMP_OUT ",%7.2f,%7.2f,%7.2f,%7.2f,%7.2f,%6.2f", $KR_perc_max, $DE_perc_max, $KR_DE_perc_max, $KRplusDE_perc_max, $FWY_perc_max, $pI_max;
printf COMP_OUT ",%8.3f,%8.3f,%8.3f,%8.3f, %8.3f, %8.3f", $KD_max, $absQ_max, $foldind_max, $disorder_max, $ent_max, $betastrand_max;
printf COMP_OUT "\n";
$flag_here = $win_flag."-MIN";
printf COMP_OUT "$flag_here,$seqid[$nseq],";
printf COMP_OUT "%7.2f,%7.2f,%5.0f,%7.2f", $KmR_perc_min, $DmE_perc_min, $window[$nwin], $perc_total_win;
for ($aa=1; $aa<=20; $aa++) { printf COMP_OUT ",%6.2f", $aa_perc_min[$aa-1]; }
printf COMP_OUT ",%7.2f,%7.2f,%7.2f,%7.2f,%7.2f,%6.2f", $KR_perc_min, $DE_perc_min, $KR_DE_perc_min, $KRplusDE_perc_min, $FWY_perc_min, $pI_min;
printf COMP_OUT ",%8.3f,%8.3f,%8.3f,%8.3f, %8.3f, %8.3f", $KD_min, $absQ_min, $foldind_min, $disorder_min, $ent_min, $betastrand_min;
printf COMP_OUT "\n";
}
}
} # end nwin = 1, 2 loop
SKIPPED:
$nall_exclude++;
}
}
$KmR_all_perc = 100*(($Kall - $Rall)/$naa_all); # a few things over the whole sequence set
$DmE_all_perc = 100*(($Dall - $Eall)/$naa_all);
$Kall_perc = 100*($Kall/$naa_all);
$Rall_perc = 100*($Rall/$naa_all);
$Dall_perc = 100*($Dall/$naa_all);
$Eall_perc = 100*($Eall/$naa_all);
# $FWYall_perc = 100*(($Fall + $Wall + $Yall)/$naa_all);
printf COMP_OUT "\n";
printf COMP_OUT "SUMS, for this set of seqs, overall K minus R and D minus E percentages = ";
printf COMP_OUT "%6.3f %6.3f\n\n", $KmR_all_perc, $DmE_all_perc;
printf COMP_OUT "SUMS, for this set of seqs, overall K R D E percentages = ";
printf COMP_OUT "%6.4f %6.4f %6.4f %6.4f\n", $Kall_perc, $Rall_perc, $Dall_perc, $Eall_perc;
if ($mem_exclude eq "yes") {
printf COMP_OUT "\n\nMEMBRANE_EXCLUDE YES set with nmem_excluded = $nmem_exclude and nall_excluded = $nall_exclude\n";
printf BLAH "\n\nMEMBRANE_EXCLUDE yes set with nmem_excluded = $nmem_exclude and nall_excluded = $nall_exclude\n";
if ($whole_seq_out eq 'yes') {
printf DATA_OUT "\n\nMEMBRANE_EXCLUDE YES set with nmem_excluded = $nmem_exclude and nall_excluded = $nall_exclude\n";
}
}
close (COMP_OUT);
if ($whole_seq_out eq 'yes') {
close (DATA_OUT);
}
close (BLAH);
exit;
#-------------------------------------------------------------------------
#----------------------
# SUBROUTINE protein_pI
#----------------------
# pass a list of single letter amino acid codes, return the pI (pH at
# neutral charge) for this protein
sub protein_pI {
my @aa_seq;
my $aa;
my %aapK;
my %aafactor;
my %aanum;
my @aa_keys;
my $pI;
my $protein_charge;
my $protein_charge_last;
my $fac;
my $pH;
my $pH_dec;
my $loop;
my $pH_pK;
my $charge_inc;
# put the passed sequence into the aa_seq array
@aa_seq = @_;
# data initialisation --------------------------------------------------
# set up the ionisable group pKs, note that we are leaving out Cys
# (assume that cysteines are oxidised in disulphides)
# phosphates (X) are -1 always and -1 to -2 with pK 7
%aapK = (
"D" => 4.0,
"E" => 4.4,
"H" => 6.3,
"K" => 10.4,
"R" => 12.0,
"X" => 7.0,
"Y" => 10.1);
# set up the -1 or +1 factor that gives us the charge sign (acid=-,
# base=+), and is used to get the right sign of (pH-pK)
%aafactor = (
"D" => -1.0,
"E" => -1.0,
"H" => 1.0,
"K" => 1.0,
"R" => 1.0,
"X" => -1.0,
"Y" => -1.0);
# initialise counter of amino acid types (for those with pKs of interest)
%aanum = (
"D" => 0,
"E" => 0,
"H" => 0,
"K" => 0,
"R" => 0,
"X" => 0,
"Y" => 0);
# -----------------------------------------------------------------------
# add the number of each ionisable aa type of interest
my $ionisable_aas = 0;
foreach $aa (@aa_seq) {
if (exists $aapK{$aa}) {
$aanum{$aa}++;
$ionisable_aas++;
}
}
@aa_keys = keys %aanum;
# for any ionisable group, using the aafactor of -1 or +1 as set above,
# the charge at pH is: (aafactor) x { 1/(1+10**[(aafactor)(pH-pK)]) }
# and the overall protein charge is the sum of such over ionisable groups
# our strategy is to come down from a high pH, look for charge sign
# swap (- to +), and then linearly interpolate, (as in the 2CPS notes)
$loop = 0;
$pH = 14;
$pH_dec = 0.5;
$protein_charge = -100; # dummy value, calculate as we go
while (($protein_charge < 0) and ($pH > 0.9)) {
$loop++;
$protein_charge_last = $protein_charge;
$protein_charge = 0.0;
foreach $aa (@aa_keys) {
if ($aanum{$aa} > 0) {
$fac = $aafactor{$aa};
$pH_pK = $pH - $aapK{$aa};
if ($aa eq "X") {
$charge_inc = -1.0 + $fac * (1/(1+10**($fac*$pH_pK)));
} else {
$charge_inc = $fac * (1/(1+10**($fac*$pH_pK)));
}
$protein_charge += $aanum{$aa}*$charge_inc;
}
}
$pH -= $pH_dec; # i.e. $pH - $pH-$pH_dec
}
# There are various conditions to look out for and signal on return:
# with no ionisable groups, neutral at all pH: return 999 - NO, just return pH 7.0
# if pI > 14 (e.g. only basic aas): return 14
# if pI < 1 (e.g. only acidic aas): return 1
if ($ionisable_aas == 0) {
# $pI = 999;
$pI = 7;
} elsif ($loop == 1) {
$pI= 14;
} elsif ($loop == 27) {
$pI = 1;
} else {
$pI = ($pH+$pH_dec) + $pH_dec*($protein_charge/($protein_charge-$protein_charge_last));
}
$pH = 8.75; # add a specific charge calculation for selected pH
$charge_here = 0;
foreach $aa (@aa_keys) {
if ($aanum{$aa} > 0) {
$fac = $aafactor{$aa};
$pH_pK = $pH - $aapK{$aa};
if ($aa eq "X") {
$charge_inc = -1.0 + $fac * (1/(1+10**($fac*$pH_pK)));
} else {
$charge_inc = $fac * (1/(1+10**($fac*$pH_pK)));
}
$charge_here += $aanum{$aa}*$charge_inc;
}
}
# printf OUT "\n$pdbwords[0] $pH ";
# printf OUT "%7.3f\n", $charge_here;
return $pI;
}
#-------------------------------------------------------------------------
#-------------------------------------------------------------------------
#----------------------
# SUBROUTINE protein_KD
#----------------------
# pass a list of single letter amino acid codes, return the pI (pH at
# neutral charge) for this protein
sub protein_KD {
# KD hash is Kyte-Doolittle hydropathy index : JMB (1982) 157:105
# for FOld Index, scale/norm following Uversky : Proteins (2000) 41:415, (-4.5,+4.5) to (0,1)
# use KDnorm to get KDnorm_mean = sum(KDnorm) / winlen
$KD {'A'} = 1.8;
$KD {'C'} = 2.5;
$KD {'D'} = -3.5;
$KD {'E'} = -3.5;
$KD {'F'} = 2.8;
$KD {'G'} = -0.4;
$KD {'H'} = -3.2;
$KD {'I'} = 4.5;
$KD {'K'} = -3.9;
$KD {'L'} = 3.8;
$KD {'M'} = 1.9;
$KD {'N'} = -3.5;
$KD {'P'} = -1.6;
$KD {'Q'} = -3.5;
$KD {'R'} = -4.5;
$KD {'S'} = -0.8;
$KD {'T'} = -0.7;
$KD {'V'} = 4.2;
$KD {'W'} = -0.9;
$KD {'Y'} = -1.3;
$KD {'X'} = 0.0;
$KD {'a'} = 1.8;
$KD {'c'} = 2.5;
$KD {'d'} = -3.5;
$KD {'e'} = -3.5;
$KD {'f'} = 2.8;
$KD {'g'} = -0.4;
$KD {'h'} = -3.2;
$KD {'i'} = 4.5;
$KD {'k'} = -3.9;
$KD {'l'} = 3.8;
$KD {'m'} = 1.9;
$KD {'n'} = -3.5;
$KD {'p'} = -1.6;
$KD {'q'} = -3.5;
$KD {'r'} = -4.5;
$KD {'s'} = -0.8;
$KD {'t'} = -0.7;
$KD {'v'} = 4.2;
$KD {'w'} = -0.9;
$KD {'y'} = -1.3;
$KD {'x'} = 0.0;
# Normalise the KD hydropathy values.
@KD_keys = keys %KD;
foreach $keyhere (@KD_keys) {
$KDtmp = $KD{$keyhere} + 4.5;
$KDnorm{$keyhere} = $KDtmp / 9.0;
# printf BLAH "KDnorm for $keyhere is $KDnorm{$keyhere}\n";
}
# put the passed sequence into the aa_seq array
@aa_seq = @_;
$KDnorm_tot = 0;
$n_KD_residues = 0;
foreach $aa (@aa_seq) {
if (exists $KD{$aa}) {
$KDnorm_tot = $KDnorm_tot + $KDnorm{$aa};
$n_KD_residues++;
}
}
if ($n_KD_residues != 0) {
$KDnorm_average = $KDnorm_tot / $n_KD_residues;
} else {
$KDnorm_average = 999; # no residues, value unset, pick up in parent code
}
return $KDnorm_average;
}
#-------------------------------------------------------------------------
#----------------------
# SUBROUTINE protein_Q
#----------------------
# pass a list of single letter amino acid codes, return the net charge per residue
# ?? his_charged ENV variable can be used to set his to +1 rather than zero, not used currently ??
sub protein_Q {
$chr {'A'} = 0.0;
$chr {'C'} = 0.0;
$chr {'D'} = -1.0;
$chr {'E'} = -1.0;
$chr {'F'} = 0.0;
$chr {'G'} = 0.0;
$chr {'H'} = 0.0;
$chr {'I'} = 0.0;
$chr {'K'} = 1.0;
$chr {'L'} = 0.0;
$chr {'M'} = 0.0;
$chr {'N'} = 0.0;
$chr {'P'} = 0.0;
$chr {'Q'} = 0.0;
$chr {'R'} = 1.0;
$chr {'S'} = 0.0;
$chr {'T'} = 0.0;
$chr {'V'} = 0.0;
$chr {'W'} = 0.0;
$chr {'Y'} = 0.0;
$chr {'X'} = 0.0;
$chr {'a'} = 0.0;
$chr {'c'} = 0.0;
$chr {'d'} = -1.0;
$chr {'e'} = -1.0;
$chr {'f'} = 0.0;
$chr {'g'} = 0.0;
$chr {'h'} = 0.0;
$chr {'i'} = 0.0;
$chr {'k'} = 1.0;
$chr {'l'} = 0.0;
$chr {'m'} = 0.0;
$chr {'n'} = 0.0;
$chr {'p'} = 0.0;
$chr {'q'} = 0.0;
$chr {'r'} = 1.0;
$chr {'s'} = 0.0;
$chr {'t'} = 0.0;
$chr {'v'} = 0.0;
$chr {'w'} = 0.0;
$chr {'y'} = 0.0;
$chr {'x'} = 0.0;
if ((exists $ENV{his_charged}) and ($ENV{his_charged} eq 'yes')) {
$chr {'H'} = 1.0;
$chr {'h'} = 1.0;
}
# put the passed sequence into the aa_seq array
@aa_seq = @_;
$Q_tot = 0;
$n_Q_residues = 0;
foreach $aa (@aa_seq) {
if (exists $chr{$aa}) {
$Q_tot = $Q_tot + $chr{$aa};
$n_Q_residues++;
}
}
if ($n_Q_residues == 0) {
$Q_mean = 999;
} else {
$Q_mean = $Q_tot / $n_Q_residues;
}
return $Q_mean;
}
#-------------------------------------------------------------------------
#----------------------------
# SUBROUTINE protein_disorder
#----------------------------
# pass a list of single letter amino acid codes, return the disorder calculation
sub protein_disorder {
# dis hash is Rune and Linding disorder propensity : GlobPlot : NAR (2003) 31:3701
$dis {'A'} = -0.26154;
$dis {'C'} = -0.01515;
$dis {'D'} = 0.22763;
$dis {'E'} = -0.20469;
$dis {'F'} = -0.22557;
$dis {'G'} = 0.43323;
$dis {'H'} = -0.00122;
$dis {'I'} = -0.42224;
$dis {'K'} = -0.10009;
$dis {'L'} = -0.33793;
$dis {'M'} = -0.22590;
$dis {'N'} = 0.22989;
$dis {'P'} = 0.55232;
$dis {'Q'} = -0.18768;
$dis {'R'} = -0.17659;
$dis {'S'} = 0.14288;
$dis {'T'} = 0.00888;
$dis {'V'} = -0.38618;
$dis {'W'} = -0.24338;
$dis {'Y'} = -0.20751;
$dis {'X'} = 0.0;
$dis {'a'} = -0.26154;
$dis {'c'} = -0.01515;
$dis {'d'} = 0.22763;
$dis {'e'} = -0.20469;
$dis {'f'} = -0.22557;
$dis {'g'} = 0.43323;
$dis {'h'} = -0.00122;
$dis {'i'} = -0.42224;
$dis {'k'} = -0.10009;
$dis {'l'} = -0.33793;
$dis {'m'} = -0.22590;
$dis {'n'} = 0.22989;
$dis {'p'} = 0.55232;
$dis {'q'} = -0.18768;
$dis {'r'} = -0.17659;
$dis {'s'} = 0.14288;
$dis {'t'} = 0.00888;
$dis {'v'} = -0.38618;
$dis {'w'} = -0.24338;
$dis {'y'} = -0.20751;
$dis {'x'} = 0.0;
# put the passed sequence into the aa_seq array
@aa_seq = @_;
$dis_tot = 0;
$ndis_tot = 0;
foreach $aa (@aa_seq) {
if (exists $dis{$aa}) {
$dis_tot += $dis{$aa};
$ndis_tot++;
}
}
if ($ndis_tot == 0) {
$dis_mean = 999;
} else {
$dis_mean = $dis_tot / $ndis_tot;
}
return $dis_mean;
}
#-------------------------------------------------------------------------
#-----------------------
# SUBROUTINE protein_ent
#-----------------------
# pass a list of single letter amino acid codes, return the pI (pH at
# neutral charge) for this protein
sub protein_ent {
@aa_single_here = ("A","C","D","E","F","G","H","I","K","L","M","N","P","Q","R","S","T","V","W","Y");
# put the passed sequence into the aa_seq array
@aa_seq = @_;
for $aa (@aa_single_here) {
$aa_count_here{$aa} = 0;
}
$n_ent = 0;
foreach $aa (@aa_seq) {
$AA_UC = uc ($aa); # should be upper case already, but only want to count 20 aas, not 40
if (exists $aa_count_here{$AA_UC}) {
$aa_count_here{$AA_UC}++;
$n_ent++;
}
}
if ($n_ent == 0) {
$ent_sum = 999;
} else {
$ent_sum = 0;
$frac_sum = 0;
for $aa (@aa_single_here) {
if ($aa_count_here{$aa} != 0) {
$frac_here = $aa_count_here{$aa} / $n_ent;
$incr_here = -$frac_here * (log($frac_here) / log(2.0));
$ent_sum += $incr_here;
$frac_sum += $frac_here;
}
}
}
return $ent_sum;
}
#-------------------------------------------------------------------------
#------------------------------
# SUBROUTINE protein_betastrand
#------------------------------
# just returning the mean beta propensity - ?? could do more with 3 states
sub protein_betastrand {
# a very simple Chou-Fasman type ss prediction - see Costantini A et al (2006) BBRC 342:441-451
# read in the propensities
$sp92 = ' ';
for ($s=1; $s<=91; $s++) { $sp92 .= ' '; }
$sp81 = ' ';
for ($s=1; $s<=80; $s++) { $sp81 .= ' '; }
# no need for time_stamp - this is data, for read only
$file_here = "$FindBin::Bin/ss_propensities.txt";
open (PROPENSITIES, "<$file_here") or die "cannot open $file_here\n";
$naa_here = 0;
while ($line = <PROPENSITIES>) {
chomp $line;
my @words = split (" ",$line);
if (exists $words[0]) {
if ($words[0] eq 'data') { # amino acid data line
$naa_here++;
$aa = $words[2];
$p_alpha{$aa} = $words[4];
$p_beta{$aa} = $words[5];
$p_coil{$aa} = $words[6];
}
}
}
# printf BLAH "propensities read in for $naa_here amino acids\n";
close (PROPENSITIES);
# put the passed sequence into the aa_seq array
@aa_seq = @_;
$sum_alpha = 0;
$sum_beta = 0;
$sum_coil = 0;
$nseen = 0;
foreach $aa (@aa_seq) {
if (exists $p_alpha{$aa}) {
$sum_alpha += $p_alpha{$aa};
$sum_beta += $p_beta{$aa};
$sum_coil += $p_coil{$aa};
$nseen++;
}
}
if ($nseen == 0) {
$beta_mean = 999;
} else {
$beta_mean = $sum_beta / $nseen;
}
return $beta_mean;
}
#!/usr/bin/perl -w
use FindBin;
$file_here = "$FindBin::Bin/blah.txt";
open (BLAH, ">>$file_here"); # file for comments, throughout server code
$winlenP = 21; # window for calculation of all props except searching for charge border
$winlenP2 = 10; # = (winlenP-1)/2
if ((2*$winlenP2+1) != $winlenP) {
printf BLAH "window or half window wrong\n";
exit;
}
# winlenB/2 used to search the charge border props around STY sites
$winlenB = 5; #
$winlenB2 = 2; #
if ((2*$winlenB2+1) != $winlenB) {
printf BLAH "window or half window wrong\n";
exit;
}
$nbins_fi = 30;
$nbins_dis = $nbins_fi;
# env variable option to neutralise around STY in the charge border analysis ie test importance of Plation motif
if (exists $ENV{motifneutral}) {
$bord_neutralise = $ENV{motifneutral};
} else {
$bord_neutralise = "no";
}
# $len_neutral = 5; # not used currently
$len_neutral2 = 2;
# env variable to plot vertical lists of properties for each seq, for subsequent data handling/plotting
if (exists $ENV{extra_output}) {
$extra = $ENV{extra_output};
} else {
$extra = "no";
}
if ((exists $ENV{his_charged}) and ($ENV{his_charged} eq 'yes')) {
printf BLAH "\nHIS will be set at charge +1\n\n";
} else {
printf BLAH "\nHIS will be set at charge 0\n\n";
}
printf BLAH "column o/p file for each sequence = $extra - set with extra_output env variable\n\n";
# List the amino acids, for seq entropy (will convert to uc here at least)
@aa_single = ("A","C","D","E","F","G","H","I","K","L","M","N","P","Q","R","S","T","V","W","Y");
$dis {'A'} = -0.26154;
$dis {'C'} = -0.01515;
$dis {'D'} = 0.22763;
$dis {'E'} = -0.20469;
$dis {'F'} = -0.22557;
$dis {'G'} = 0.43323;
$dis {'H'} = -0.00122;
$dis {'I'} = -0.42224;
$dis {'K'} = -0.10009;
$dis {'L'} = -0.33793;
$dis {'M'} = -0.22590;
$dis {'N'} = 0.22989;
$dis {'P'} = 0.55232;
$dis {'Q'} = -0.18768;
$dis {'R'} = -0.17659;
$dis {'S'} = 0.14288;
$dis {'T'} = 0.00888;
$dis {'V'} = -0.38618;
$dis {'W'} = -0.24338;
$dis {'Y'} = -0.20751;
$dis {'X'} = 0.0;
$chr {'A'} = 0.0;
$chr {'C'} = 0.0;
$chr {'D'} = -1.0;
$chr {'E'} = -1.0;
$chr {'F'} = 0.0;
$chr {'G'} = 0.0;
$chr {'H'} = 0.0;
$chr {'I'} = 0.0;
$chr {'K'} = 1.0;
$chr {'L'} = 0.0;
$chr {'M'} = 0.0;
$chr {'N'} = 0.0;
$chr {'P'} = 0.0;
$chr {'Q'} = 0.0;
$chr {'R'} = 1.0;
$chr {'S'} = 0.0;
$chr {'T'} = 0.0;
$chr {'V'} = 0.0;
$chr {'W'} = 0.0;
$chr {'Y'} = 0.0;
$chr {'X'} = 0.0;
$dis {'a'} = -0.26154;
$dis {'c'} = -0.01515;
$dis {'d'} = 0.22763;
$dis {'e'} = -0.20469;
$dis {'f'} = -0.22557;
$dis {'g'} = 0.43323;
$dis {'h'} = -0.00122;
$dis {'i'} = -0.42224;
$dis {'k'} = -0.10009;
$dis {'l'} = -0.33793;
$dis {'m'} = -0.22590;
$dis {'n'} = 0.22989;
$dis {'p'} = 0.55232;
$dis {'q'} = -0.18768;
$dis {'r'} = -0.17659;
$dis {'s'} = 0.14288;
$dis {'t'} = 0.00888;
$dis {'v'} = -0.38618;
$dis {'w'} = -0.24338;
$dis {'y'} = -0.20751;
$dis {'x'} = 0.0;
$chr {'a'} = 0.0;
$chr {'c'} = 0.0;
$chr {'d'} = -1.0;
$chr {'e'} = -1.0;
$chr {'f'} = 0.0;
$chr {'g'} = 0.0;
$chr {'h'} = 0.0;
$chr {'i'} = 0.0;
$chr {'k'} = 1.0;
$chr {'l'} = 0.0;
$chr {'m'} = 0.0;
$chr {'n'} = 0.0;
$chr {'p'} = 0.0;
$chr {'q'} = 0.0;
$chr {'r'} = 1.0;
$chr {'s'} = 0.0;
$chr {'t'} = 0.0;
$chr {'v'} = 0.0;
$chr {'w'} = 0.0;
$chr {'y'} = 0.0;
$chr {'x'} = 0.0;
if ((exists $ENV{his_charged}) and ($ENV{his_charged} eq 'yes')) {
$chr {'H'} = 1.0;
$chr {'h'} = 1.0;
}
$KD {'A'} = 1.8;
$KD {'C'} = 2.5;
$KD {'D'} = -3.5;
$KD {'E'} = -3.5;
$KD {'F'} = 2.8;
$KD {'G'} = -0.4;
$KD {'H'} = -3.2;
$KD {'I'} = 4.5;
$KD {'K'} = -3.9;
$KD {'L'} = 3.8;
$KD {'M'} = 1.9;
$KD {'N'} = -3.5;
$KD {'P'} = -1.6;
$KD {'Q'} = -3.5;
$KD {'R'} = -4.5;
$KD {'S'} = -0.8;
$KD {'T'} = -0.7;
$KD {'V'} = 4.2;
$KD {'W'} = -0.9;
$KD {'Y'} = -1.3;
$KD {'X'} = 0.0;
$KD {'a'} = 1.8;
$KD {'c'} = 2.5;
$KD {'d'} = -3.5;
$KD {'e'} = -3.5;
$KD {'f'} = 2.8;
$KD {'g'} = -0.4;
$KD {'h'} = -3.2;
$KD {'i'} = 4.5;
$KD {'k'} = -3.9;
$KD {'l'} = 3.8;
$KD {'m'} = 1.9;
$KD {'n'} = -3.5;
$KD {'p'} = -1.6;
$KD {'q'} = -3.5;
$KD {'r'} = -4.5;
$KD {'s'} = -0.8;
$KD {'t'} = -0.7;
$KD {'v'} = 4.2;
$KD {'w'} = -0.9;
$KD {'y'} = -1.3;
$KD {'x'} = 0.0;
# Normalise the KD hydropathy values.
@KD_keys = keys %KD;
foreach $keyhere (@KD_keys) {
$KDtmp = $KD{$keyhere} + 4.5;
$KDnorm{$keyhere} = $KDtmp / 9.0;
printf BLAH "KDnorm for $keyhere is $KDnorm{$keyhere}\n";
}
# get the sequences
$file_here = "$FindBin::Bin/seq_props.in";
open (SEQS_IN, "$file_here") or die "cannot open seq_props.in\n";
my $nseq = 0;
while ($line = <SEQS_IN>) {
chomp $line;
my @words = split ("",$line);
if (exists $words[0]) {
if ($words[0] eq ">") { # new protein
$nseq++;
$seqid[$nseq] = $line;
$seqtmp{$seqid[$nseq]} = "";
}
else {
if ($nseq == 0) {
printf BLAH "did not see first seq id\n";
exit;
}
$seqtmp{$seqid[$nseq]} .= $line;
}
}
}
$nseqs = $nseq;
close (SEQS_IN);
for ($nseq=1; $nseq<=$nseqs; $nseq++) { # now put * Plations into flags
$seqhere = $seqtmp{$seqid[$nseq]};
$lentmp = length $seqhere;
@words = split ("",$seqhere);
$naa = 0;
for ($n=1; $n<=$lentmp; $n++) {
if ($words[$n-1] ne " ") {
if ($words[$n-1] eq "*") {
if ($naa == 0) {
printf BLAH "STOPPING - first seq char is a * - Plation\n";
exit;
}
$phosupd[$naa] = "P";
} else {
$naa++;
$sequpd[$naa] = $words[$n-1];
$phosupd[$naa] = " ";
}
}
}
$seqstr = "";
$phostr = "";
for ($n=1; $n<=$naa; $n++) {
$seqstr .= $sequpd[$n];
$phostr .= $phosupd[$n];
}
$seq{$seqid[$nseq]} = $seqstr;
$pho{$seqid[$nseq]} = $phostr;
}
# calculate the P (by motif) and Order (by windowed propensity) values.
# also calculate the overall hydropathy and net charge values, per aa, output with overall seq info
$naasTOT = 0;
$ndisorderTOT = 0;
open (PSEQS, ">", "$FindBin::Bin/STYprops.out") or die "cannot open STYprops.out\n";
#printf PSEQS "P/unP dis netQ fracQ meanQ KD FldInd ent chk1 bord\n";
for ($h=1; $h<=$nbins_dis; $h++) {
$n_phos[$h] = 0;
$n_unphos[$h] = 0;
$net_phos[$h] = 0;
$net_unphos[$h] = 0;
$frac_phos[$h] = 0;
$frac_unphos[$h] = 0;
$n_nf_phos[$h] = 0; # n_nf counters where we have complete windows around STY
$n_nf_unphos[$h] = 0;
$nbord_phos[$h] = 0;
$nbord_unphos[$h] = 0;
$meanQ_phos[$h] = 0;
$meanQ_unphos[$h] = 0;
$KDnorm_phos[$h] = 0;
$KDnorm_unphos[$h] = 0;
$ent_phos[$h] = 0;
$ent_unphos[$h] = 0;
}
for ($hfi=1; $hfi<=$nbins_fi; $hfi++) {
$nfi_phos[$hfi] = 0;
$nfi_unphos[$hfi] = 0;
$nfi_nf_phos[$h] = 0;
$nfi_nf_unphos[$h] = 0;
$nbordfi_phos[$hfi] = 0;
$nbordfi_unphos[$hfi] = 0;
$entfi_phos[$hfi] = 0;
$entfi_unphos[$hfi] = 0;
$nFQ_nf_phos[$h] = 0;
$nFQ_nf_unphos[$h] = 0;
$nbordFQ_phos[$hfi] = 0;
$nbordFQ_unphos[$hfi] = 0;
$nMQ_nf_phos[$h] = 0;
$nMQ_nf_unphos[$h] = 0;
$nbordMQ_phos[$hfi] = 0;
$nbordMQ_unphos[$hfi] = 0;
$nKD_nf_phos[$h] = 0;
$nKD_nf_unphos[$h] = 0;
$nbordKD_phos[$hfi] = 0;
$nbordKD_unphos[$hfi] = 0;
}
for ($nseq=1; $nseq<=$nseqs; $nseq++) { # OPEN SEQUENCES LOOP
printf PSEQS "STARTID $seqid[$nseq]\n";
printf PSEQS "P/unP dis netQ fracQ meanQ KD FldInd ent chk1 bord meanQ_PM\n";
$seqhere = $seq{$seqid[$nseq]};
$phohere = $pho{$seqid[$nseq]};
$seqlen[$nseq] = length $seqhere;
@words = split ("",$seqhere);
@wordsP = split ("",$phohere);
$propcharhere = "";
$signhere = "";
# $chrhere = "";
$chrhereA = "";
$chrhereB = "";
$ndisorder = 0;
$KDnorm_tot = 0;
$chr_tot = 0;
$dis_tot = 0;
$ndis_tot = 0;
for $key (@aa_single) {
$aa_count{$key} = 0;
}
# $motifcharhere = "";
# $skip = 0;
# $nmotifs = 0;
for ($naa=1; $naa<=$seqlen[$nseq]; $naa++) { # OPEN AAs in SEQs LOOP
$naasTOT++;
if (exists $dis{$words[$naa-1]}) {
$dis_tot += $dis{$words[$naa-1]};
$ndis_tot++;
}
#boxing - loop below added for debugging
if (exists $KDnorm{$words[$naa-1]}) {
$KDnorm_tot = $KDnorm_tot + $KDnorm{$words[$naa-1]};
$chr_tot = $chr_tot + $chr{$words[$naa-1]};
$aa_lc = $words[$naa-1];
$AA_UC = uc ($aa_lc);
if (exists $aa_count{$AA_UC}) {
$aa_count{$AA_UC}++;
} else {
# printf BLAH "STOPPING AA (UC) not known in entropy count\n";
}
} else {
printf BLAH " KDnorm PROBLEM naa words = $naa $words[$naa-1]\n";
}
$propsum = 0;
$nprop = 0;
$wlow = $naa - $winlenP2;
$wtop = $naa + $winlenP2;
$chrsum = 0;
for ($nw=$wlow; $nw<=$wtop; $nw++){ # second the disorder
if (($nw>=1) and ($nw<=$seqlen[$nseq])) {
if (exists $dis{$words[$nw-1]}) {
$propsum += $dis{$words[$nw-1]};
$nprop++;
}
if (exists $chr{$words[$nw-1]}) {
if ($bord_neutralise eq "yes") {
$ndist = $naa - $nw;
$ndistabs = abs($ndist);
if ($ndistabs > $winlenP2) {
printf BLAH "STOPPING ndist gt winlenP2\n";
exit;
}
if ($ndistabs > $len_neutral2) {
$chrsum += $chr{$words[$nw-1]};
}
} else {
$chrsum += $chr{$words[$nw-1]};
}
}
}
}
if ($nprop != 0) {
$prophere = $propsum/$nprop;
$windis[$naa] = $prophere;
if ($prophere > 0) {
$propcharhere .= "*";
$ndisorder++;
$ndisorderTOT++;
} else {
$propcharhere .= " ";
}
} else {
$propcharhere .= " ";
}
if ($chrsum < 0) {
$signinc = "-";
} elsif ($chrsum > 0) {
$signinc = "+";
} else {
$signinc = " ";
}
$signhere .= $signinc;
$abschr = abs($chrsum);
if ($abschr > 9) {
$chrhereA .= int($abschr/10);
$chrhereB .= $abschr - 10*int($abschr/10);
} else {
$chrhereA .= " ";
$chrhereB .= $abschr;
}
$all_winQ[$nseq][$naa] = $chrsum;
} # CLOSE AAs in SEQs LOOP
# $motifchar{$seqid[$nseq]} = $motifcharhere;
$propchar{$seqid[$nseq]} = $propcharhere;
$signchar{$seqid[$nseq]} = $signhere;
$chrcharA{$seqid[$nseq]} = $chrhereA;
$chrcharB{$seqid[$nseq]} = $chrhereB;
$KDnorm_seqs{$seqid[$nseq]} = $KDnorm_tot / $seqlen[$nseq];
$chrmean_seqs{$seqid[$nseq]} = abs ($chr_tot) / $seqlen[$nseq];
$foldind_seqs{$seqid[$nseq]} = 2.785 * $KDnorm_seqs{$seqid[$nseq]} - $chrmean_seqs{$seqid[$nseq]} - 1.151;
$dis_seqs{$seqid[$nseq]} = $dis_tot / $ndis_tot;
$chr_tot_seqs{$seqid[$nseq]} = $chr_tot / $seqlen[$nseq];
$ent_sum = 0;
$frac_sum = 0;
for $key (@aa_single) {
if ($aa_count{$key} != 0) {
$frac_here = $aa_count{$key} / $seqlen[$nseq];
$incr_here = -$frac_here * (log($frac_here) / log(2.0));
$ent_sum += $incr_here;
$frac_sum += $frac_here
}
}
$ent_seqs{$seqid[$nseq]} = $ent_sum;
# $frac_seqs{$seqid[$nseq]} = $frac_sum; # not used currently
@wordsPROP = split ("",$propcharhere);
@wordsSIGN = split ("",$signhere);
# @wordsCHAR = split ("",$chrhere);
# at the moment we're only processing S, T, Y phos and non-phos - of course non-phos is well-dodgy
for ($h=1; $h<=$nbins_dis; $h++) {
$n_phos_here[$h] = 0;
$n_unphos_here[$h] = 0;
$net_phos_here[$h] = 0;
$net_unphos_here[$h] = 0;
$frac_phos_here[$h] = 0;
$frac_unphos_here[$h] = 0;
$n_nf_phos_here[$h] = 0;
$n_nf_unphos_here[$h] = 0;
$nbord_phos_here[$h] = 0;
$nbord_unphos_here[$h] = 0;
$meanQ_phos_here[$h] = 0;
$meanQ_unphos_here[$h] = 0;
$KDnorm_phos_here[$h] = 0;
$KDnorm_unphos_here[$h] = 0;
$ent_phos_here[$h] = 0;
$ent_unphos_here[$h] = 0;
}
for ($hfi=1; $hfi<=$nbins_fi; $hfi++) {
$nfi_phos_here[$hfi] = 0;
$nfi_unphos_here[$hfi] = 0;
$nfi_nf_phos_here[$hfi] = 0;
$nfi_nf_unphos_here[$hfi] = 0;
$nbordfi_phos_here[$hfi] = 0;
$nbordfi_unphos_here[$hfi] = 0;
$entfi_phos_here[$hfi] = 0;
$entfi_unphos_here[$hfi] = 0;
$nFQ_nf_phos_here[$hfi] = 0;
$nFQ_nf_unphos_here[$hfi] = 0;
$nbordFQ_phos_here[$hfi] = 0;
$nbordFQ_unphos_here[$hfi] = 0;
$nMQ_nf_phos_here[$hfi] = 0;
$nMQ_nf_unphos_here[$hfi] = 0;
$nbordMQ_phos_here[$hfi] = 0;
$nbordMQ_unphos_here[$hfi] = 0;
$nKD_nf_phos_here[$hfi] = 0;
$nKD_nf_unphos_here[$hfi] = 0;
$nbordKD_phos_here[$hfi] = 0;
$nbordKD_unphos_here[$hfi] = 0;
}
for ($naa=1; $naa<=$seqlen[$nseq]; $naa++) { # OPEN AAs in SEQs LOOP
# CHARGE SIGN CHANGES in winlenB ooooooooooooooooooooooooooooooo
$nb1 = $naa - $winlenB2;
$nb2 = $naa + $winlenB2;
if (($nb1 >= 1) and ($nb2 <= $seqlen[$nseq])) {
$pos = 'no';
$neg = 'no';
for ($nb=$nb1; $nb<=$nb2; $nb++) {
if ($wordsSIGN[$nb-1] eq '+') {
$pos = 'yes';
} elsif ($wordsSIGN[$nb-1] eq '-') {
$neg = 'yes';
}
}
if (($pos eq 'yes') and ($neg eq 'yes')) {
$bord = 'yes';
} else {
$bord = 'no';
}
}
# oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo
# NET CHARGE AND CHARGE FRACTION in P WINDOW ooooooooooooooooooooo
$nb1 = $naa - $winlenP2;
$nb2 = $naa + $winlenP2;
if (($nb1 >= 1) and ($nb2 <= $seqlen[$nseq])) {
$winP = 'yes';
$sumQ = 0;
$netQ = 0;
$KDnorm_tot = 0;
$chr_tot = 0;
for $key (@aa_single) {
$aa_count{$key} = 0;
}
$subseq = "";
for ($nb=$nb1; $nb<=$nb2; $nb++) {
$subseq .= $words[$nb-1];
if (exists $chr{$words[$nb-1]}) {
$netQ += $chr{$words[$nb-1]};
$pm = $chr{$words[$nb-1]};
if (abs($pm) > 0.9) {
$sumQ++;
}
}
# boxing
if (exists $KDnorm{$words[$nb-1]}) {
$KDnorm_tot = $KDnorm_tot + $KDnorm{$words[$nb-1]};
} else {
printf BLAH "KDnorm PROBLEM 2 nb-1 words[nb-1] = $nb-1 $words[$nb-1]\n ";
}
$aa_lc = $words[$nb-1];
$AA_UC = uc ($aa_lc);
if (exists $aa_count{$AA_UC}) {
$aa_count{$AA_UC}++;
} else {
# printf BLAH "STOPPING AA (UC) not known in entropy count\n";
}
}
$fracQ = $sumQ / $winlenP;
$meanQ_win = abs ($netQ) / $winlenP;
$meanQ_winNET = ($netQ) / $winlenP;
$KDnorm_win = $KDnorm_tot / $winlenP;
$ent_sum = 0;
$frac_sum = 0;
for $key (@aa_single) {
if ($aa_count{$key} != 0) {
$frac_here = $aa_count{$key} / $winlenP;
$incr_here = -$frac_here * (log($frac_here) / log(2.0));
$ent_sum += $incr_here;
$frac_sum += $frac_here
}
}
$ent_win = $ent_sum;
$frac_win = $frac_sum;
} else {
$winP = 'no';
}
# oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo
# $am = $words[$naa-1]; # not used currently
# if (($am eq "S") or ($am eq "T") or ($am eq "Y") or ($am eq "s") or ($am eq "t") or ($am eq "y")) {
if ((exists $words[$naa-1]) and ($winP eq 'yes')) {
$prophere = $wordsPROP[$naa-1];
$bin = 1 + int (($windis[$naa]+0.5)*$nbins_dis);
if ($bin < 1) { $bin = 1; } # cludge to avoid oub
if ($bin > $nbins_dis) { $bin = $nbins_dis; } # cludge to avoid oub
if (($bin < 1) or ($bin > $nbins_dis)) {
printf BLAH "STOPPING - bin for disorder not in 1 to $nbins_dis range\n";
exit;
}
# bin Uversky fold from theoretical min, max = -2.151, 1.634 from the Prilusky paper
$foldind_here = 2.785 * $KDnorm_win - $meanQ_win - 1.151;
$binfi = 1 + int (($foldind_here + 2.151)/((1.634+2.151)/$nbins_fi));
if (($binfi < 1) or ($binfi > $nbins_fi)) {
printf BLAH "STOPPING - bin for Uversky fold not in 1 to $nbins_fi range\n";
exit;
}
# for bins of fracQ (FQ), meanQ (MQ), KD; assume each has max range of 0 to 1
$binFQ = 1 + int (($fracQ)*$nbins_dis);
if ($binFQ == $nbins_dis+1) { $binFQ = $nbins_dis; }
if (($binFQ < 1) or ($binFQ > $nbins_dis)) {
printf BLAH "STOPPING - bin for fracQ not in 1 to $nbins_dis range; $binFQ $fracQ\n";
exit;
}
$binMQ = 1 + int (($meanQ_win)*$nbins_dis);
if ($binMQ == $nbins_dis+1) { $binMQ = $nbins_dis; }
if (($binMQ < 1) or ($binMQ > $nbins_dis)) {
printf BLAH "STOPPING - bin for meanQ not in 1 to $nbins_dis range\n";
exit;
}
$binKD = 1 + int (($KDnorm_win)*$nbins_dis);
if (($binKD < 1) or ($binKD > $nbins_dis)) {
printf BLAH "STOPPING - bin = $binKD for KD not in 1 to $nbins_dis range, KDnorm_win=$KDnorm_win\n";
exit;
}
if ($wordsP[$naa-1] eq "P") {
$n_phos_here[$bin]++;
$nfi_phos_here[$binfi]++;
if ($winP eq 'yes') {
if ($bord eq 'yes') { # only increment bord bins where winp='yes'
$nbord_phos_here[$bin]++; # i.e. binned bord props should be normalised with n_nf...
$nbordfi_phos_here[$binfi]++;
$nbordFQ_phos_here[$binFQ]++;
$nbordMQ_phos_here[$binMQ]++;
$nbordKD_phos_here[$binKD]++;
$bord_write = 1;
} else {
$bord_write = 0;
}
$n_nf_phos_here[$bin]++;
$net_phos_here[$bin] += $netQ;
$frac_phos_here[$bin] += $fracQ;
$meanQ_phos_here[$bin] += $meanQ_win;
$KDnorm_phos_here[$bin] += $KDnorm_win;
$ent_phos_here[$bin] += $ent_win;
$nfi_nf_phos_here[$binfi]++;
$entfi_phos_here[$binfi] += $ent_win;
$nFQ_nf_phos_here[$binFQ]++;
$nMQ_nf_phos_here[$binMQ]++;
$nKD_nf_phos_here[$binKD]++;
$foldind = 2.785 * $KDnorm_win - $meanQ_win - 1.151;
printf PSEQS "PHOS ";
printf PSEQS "%7.3f %7.3f %7.3f", $windis[$naa], $netQ, $fracQ;
printf PSEQS "%8.3f %7.3f %8.3f %7.3f %6.3f", $meanQ_win, $KDnorm_win, $foldind, $ent_win, $frac_win;
printf PSEQS "%7.3f ", $bord_write;
printf PSEQS "%7.3f ", $meanQ_winNET;
printf PSEQS "$subseq\n";
}
} else {
$n_unphos_here[$bin]++;
$nfi_unphos_here[$binfi]++;
if ($winP eq 'yes') { # only increment bord bins where winP='yes'
if ($bord eq 'yes') {
$nbord_unphos_here[$bin]++;
$nbordfi_unphos_here[$binfi]++;
$nbordFQ_unphos_here[$binFQ]++;
$nbordMQ_unphos_here[$binMQ]++;
$nbordKD_unphos_here[$binKD]++;
$bord_write = 1;
} else {
$bord_write = 0;
}
$n_nf_unphos_here[$bin]++;
$net_unphos_here[$bin] += $netQ;
$frac_unphos_here[$bin] += $fracQ;
$meanQ_unphos_here[$bin] += $meanQ_win;
$KDnorm_unphos_here[$bin] += $KDnorm_win;
$ent_unphos_here[$bin] += $ent_win;
$nfi_nf_unphos_here[$binfi]++;
$entfi_unphos_here[$binfi] += $ent_win;
$nFQ_nf_unphos_here[$binFQ]++;
$nMQ_nf_unphos_here[$binMQ]++;
$nKD_nf_unphos_here[$binKD]++;
$foldind = 2.785 * $KDnorm_win - $meanQ_win - 1.151;
printf PSEQS "NONP ";
printf PSEQS "%7.3f %7.3f %7.3f", $windis[$naa], $netQ, $fracQ;
printf PSEQS "%8.3f %7.3f %8.3f %7.3f %6.3f", $meanQ_win, $KDnorm_win, $foldind, $ent_win, $frac_win;
printf PSEQS "%7.3f ", $bord_write;
printf PSEQS "%7.3f ", $meanQ_winNET;
printf PSEQS "$subseq\n";
}
}
}
} # CLOSE AAs in SEQs LOOP
$nSTYunphos_processed = 0;
$nSTYphos_processed = 0;
$nbordunphos_processed = 0;
$nbordphos_processed = 0;
for ($h=1; $h<=$nbins_dis; $h++) { # get sequence numbers of phos-STY and bord/phos-STY
$nSTYphos_processed += $n_nf_phos_here[$h];
$nSTYunphos_processed += $n_nf_unphos_here[$h];
$nbordphos_processed += $nbord_phos_here[$h];
$nbordunphos_processed += $nbord_unphos_here[$h];
}
$nSTYhere = $nSTYphos_processed + $nSTYunphos_processed;
if ($nSTYhere != 0) {
$frac_phos_seqs{$seqid[$nseq]} = $nSTYphos_processed / $nSTYhere;
$frac_bord_seqs{$seqid[$nseq]} = ($nbordphos_processed + $nbordunphos_processed) / $nSTYhere;
} else {
$frac_phos_seqs{$seqid[$nseq]} = 99;
$frac_bord_seqs{$seqid[$nseq]} = 99;
}
printf PSEQS "KD meanQ FldInd Dis Ent FracPhos FracBord = ";
# printf PSEQS "%6.3f %6.3f ", $KDnorm_seqs{$seqid[$nseq]}, $chrmean_seqs{$seqid[$nseq]};
printf PSEQS "%6.3f %6.3f ", $KDnorm_seqs{$seqid[$nseq]}, $chr_tot_seqs{$seqid[$nseq]};
printf PSEQS "%6.3f %6.3f %6.3f ", $foldind_seqs{$seqid[$nseq]}, $dis_seqs{$seqid[$nseq]}, $ent_seqs{$seqid[$nseq]};
printf PSEQS "%6.3f %6.3f ", $frac_phos_seqs{$seqid[$nseq]}, $frac_bord_seqs{$seqid[$nseq]};
printf PSEQS "\n";
printf PSEQS "ENDID $seqid[$nseq]\n\n";
for ($h=1; $h<=$nbins_dis; $h++) {
$n_phos[$h] += $n_phos_here[$h];
$n_unphos[$h] += $n_unphos_here[$h];
$net_phos[$h] += $net_phos_here[$h];
$net_unphos[$h] += $net_unphos_here[$h];
$frac_phos[$h] += $frac_phos_here[$h];
$frac_unphos[$h] += $frac_unphos_here[$h];
$n_nf_phos[$h] += $n_nf_phos_here[$h];
$n_nf_unphos[$h] += $n_nf_unphos_here[$h];
$nbord_phos[$h] += $nbord_phos_here[$h];
$nbord_unphos[$h] += $nbord_unphos_here[$h];
$meanQ_phos[$h] += $meanQ_phos_here[$h];
$meanQ_unphos[$h] += $meanQ_unphos_here[$h];
$KDnorm_phos[$h] += $KDnorm_phos_here[$h];
$KDnorm_unphos[$h] += $KDnorm_unphos_here[$h];
$ent_phos[$h] += $ent_phos_here[$h];
$ent_unphos[$h] += $ent_unphos_here[$h];
}
for ($hfi=1; $hfi<=$nbins_fi; $hfi++) {
$nfi_phos[$hfi] += $nfi_phos_here[$hfi];
$nfi_unphos[$hfi] += $nfi_unphos_here[$hfi];
$nfi_nf_phos[$hfi] += $nfi_nf_phos_here[$hfi];
$nfi_nf_unphos[$hfi] += $nfi_nf_unphos_here[$hfi];
$nbordfi_phos[$hfi] += $nbordfi_phos_here[$hfi];
$nbordfi_unphos[$hfi] += $nbordfi_unphos_here[$hfi];
$entfi_phos[$hfi] += $entfi_phos_here[$hfi];
$entfi_unphos[$hfi] += $entfi_unphos_here[$hfi];
$nFQ_nf_phos[$hfi] += $nFQ_nf_phos_here[$hfi];
$nFQ_nf_unphos[$hfi] += $nFQ_nf_unphos_here[$hfi];
$nbordFQ_phos[$hfi] += $nbordFQ_phos_here[$hfi];
$nbordFQ_unphos[$hfi] += $nbordFQ_unphos_here[$hfi];
$nMQ_nf_phos[$hfi] += $nMQ_nf_phos_here[$hfi];
$nMQ_nf_unphos[$hfi] += $nMQ_nf_unphos_here[$hfi];
$nbordMQ_phos[$hfi] += $nbordMQ_phos_here[$hfi];
$nbordMQ_unphos[$hfi] += $nbordMQ_unphos_here[$hfi];
$nKD_nf_phos[$hfi] += $nKD_nf_phos_here[$hfi];
$nKD_nf_unphos[$hfi] += $nKD_nf_unphos_here[$hfi];
$nbordKD_phos[$hfi] += $nbordKD_phos_here[$hfi];
$nbordKD_unphos[$hfi] += $nbordKD_unphos_here[$hfi];
}
} # CLOSE SEQUENCES LOOP
close (PSEQS);
$file_here = "$FindBin::Bin/bins.txt";
open (BINS, ">$file_here") or die "cannot open bins.txt\n";
printf BINS "dis_bin win_netQ_P win_netQ_U win_fracQ_P win_fracQ_U ";
printf BINS "Q_diffPU fracQ_diffPU ";
printf BINS "meanQ_P meanQ_U meanKD_P meanKD_U\n";
for ($h=1; $h<=$nbins_dis; $h++) {
$meanQ_phos[$h] += $meanQ_phos_here[$h];
$meanQ_unphos[$h] += $meanQ_unphos_here[$h];
$KDnorm_phos[$h] += $KDnorm_phos_here[$h];
$KDnorm_unphos[$h] += $KDnorm_unphos_here[$h];
$ent_phos[$h] += $ent_phos_here[$h];
$ent_unphos[$h] += $ent_unphos_here[$h];
# $mean_frac_unphos = $frac_unphos[$h] / $n_unphos[$h];
if ($n_unphos[$h] != 0) {
$n_PUratio[$h] = $n_phos[$h] / $n_unphos[$h];
} else {
$n_PUratio[$h] = 0;
}
if ($n_nf_phos[$h] != 0) {
$mean_net_phos = $net_phos[$h] / $n_nf_phos[$h];
$mean_frac_phos = $frac_phos[$h] / $n_nf_phos[$h];
$mean_meanQ_phos[$h] = $meanQ_phos[$h] / $n_nf_phos[$h];
$mean_KDnorm_phos[$h] = $KDnorm_phos[$h] / $n_nf_phos[$h];
$mean_ent_phos[$h] = $ent_phos[$h] / $n_nf_phos[$h];
} else {
$mean_net_phos = 0;
$mean_frac_phos = 0;
$mean_meanQ_phos[$h] = 0;
$mean_KDnorm_phos[$h] = 0;
$mean_ent_phos[$h] = 0;
}
if ($n_nf_unphos[$h] != 0) {
$mean_net_unphos = $net_unphos[$h] / $n_nf_unphos[$h];
$mean_frac_unphos = $frac_unphos[$h] / $n_nf_unphos[$h];
$mean_meanQ_unphos[$h] = $meanQ_unphos[$h] / $n_nf_unphos[$h];
$mean_KDnorm_unphos[$h] = $KDnorm_unphos[$h] / $n_nf_unphos[$h];
$mean_ent_unphos[$h] = $ent_unphos[$h] / $n_nf_unphos[$h];
} else {
$mean_net_unphos = 0;
$mean_frac_unphos = 0;
$mean_meanQ_unphos[$h] = 0;
$mean_KDnorm_unphos[$h] = 0;
$mean_ent_unphos[$h] = 0;
}
$netQ_diff[$h] = $mean_net_phos - $mean_net_unphos;
#changed below line due to printf error newer perl Max/Jim/Desnoyers 020518
#$fracQ_diff[$h] = $mean_frac_phos - $mean_frac_unphos;
printf BINS "%7.0f", $h;
printf BINS "%11.3f %10.3f", $mean_net_phos, $mean_net_unphos;
printf BINS "%12.3f %11.3f", $mean_frac_phos, $mean_frac_unphos;
printf BINS "%9.3f %12.3f", $h, $netQ_diff[$h];
#changed below line due to printf error newer perl Max/Jim/Desnoyers 020518
#printf BINS "%9.3f %12.3f", $h, $netQ_diff[$h], $fracQ_diff[$h];
printf BINS "%8.3f %7.3f", $mean_meanQ_phos[$h], $mean_meanQ_unphos[$h];
printf BINS "%10.3f %8.3f\n", $mean_KDnorm_phos[$h], $mean_KDnorm_unphos[$h];
}
# Newer dis bins below to mirror the Uversky fold Table
$n_phos_tot = 0;
$n_unphos_tot = 0;
$n_nf_phos_tot = 0;
$n_nf_unphos_tot = 0;
for ($h=1; $h<=$nbins_dis; $h++) {
$n_phos_tot += $n_phos[$h];
$n_unphos_tot += $n_unphos[$h];
$n_nf_phos_tot += $n_nf_phos[$h];
$n_nf_unphos_tot += $n_nf_unphos[$h];
}
printf BINS "\n\n phos_tot, unphos_tot, nf_phos_tot, nf_unphos_tot = $n_phos_tot $n_unphos_tot $n_nf_phos_tot $n_nf_unphos_tot\n";
printf BINS "\n NOTE that other than enrich_PU, all enr/enrich props use 'nf' - complete wins around STY";
printf BINS "\n NOTE that enrich = 1.0 where a bin P/U ratio for that prop = P/U ratio overall all";
printf BINS "\n\nDis_bin Dis n_P n_U n_nf_P n_nf_U ratio_PU enrich_PU n_bord_P n_bord_U bord/STY rel_brdPU";
printf BINS " ent_P ent_U ent/STY rel_entPU\n";
for ($h=1; $h<=$nbins_dis; $h++) {
$top = $n_nf_phos[$h] * $n_nf_unphos_tot; # calc enrich with nwins P/U
$bot = $n_nf_unphos[$h] * $n_nf_phos_tot;
if (($top != 0) and ($bot != 0)) {
$enrich_nf_PU_here = $top/$bot;
} else {
$enrich_nf_PU_here = 1.0;
}
if ($n_nf_unphos[$h] != 0) {
$ratio_nf_PU_here = $n_nf_phos[$h] / $n_nf_unphos[$h];
} else {
$ratio_nf_PU_here = 1;
}
$bot = $n_nf_phos[$h] + $n_nf_unphos[$h];
$top = $nbord_phos[$h] + $nbord_unphos[$h];
if (($top != 0) and ($bot != 0)) {
$STY_in_bord_here = $top/$bot;
} else {
$STY_in_bord_here = 1;
}
$bot = $n_nf_phos[$h] + $n_nf_unphos[$h];
$top = $ent_phos[$h] + $ent_unphos[$h];
if (($top != 0) and ($bot != 0)) {
$STY_ent_here = $top/$bot;
} else {
$STY_ent_here = 1;
}
$top = $nbord_phos[$h]*$n_nf_unphos[$h];
$bot = $nbord_unphos[$h]*$n_nf_phos[$h];
if (($top != 0) and ($bot != 0)) {
$rel_enr_PU_bord_here = $top/$bot;
} else {
$rel_enr_PU_bord_here = 1.0;
}
$top = $ent_phos[$h]*$n_nf_unphos[$h];
$bot = $ent_unphos[$h]*$n_nf_phos[$h];
if (($top != 0) and ($bot != 0)) {
$rel_enr_PU_ent_here = $top/$bot;
} else {
$rel_enr_PU_ent_here = 1.0;
}
$dis_here = (($h-0.5)/$nbins_dis) - 0.5;
printf BINS "%7.0f %6.3f %9.0f %9.0f", $h, $dis_here, $n_phos[$h], $n_unphos[$h];
printf BINS "% 9.0f %9.0f %9.4f %9.3f", $n_nf_phos[$h], $n_nf_unphos[$h], $ratio_nf_PU_here, $enrich_nf_PU_here;
printf BINS " %9.0f %8.0f %9.4f %9.3f", $nbord_phos[$h], $nbord_unphos[$h], $STY_in_bord_here, $rel_enr_PU_bord_here;
printf BINS " %9.0f %8.0f %9.3f %9.3f\n", $ent_phos[$h], $ent_unphos[$h], $STY_ent_here, $rel_enr_PU_ent_here;
}
printf BINS "\n";
printf BINS " See the equivalent Table for more information on these properties\n";
printf BINS " Note: eg winlenB=11 get higher PtoU bord for higher dis - ?? charge props of Plation spec\n";
printf BINS " But overall trend is fewer STY border at higher dis, perhaps due to more charge at higher dis\n";
printf BINS " So if subs seqs have more Q, then enr_bordPU would go other way\n";
$nfi_phos_tot = 0;
$nfi_unphos_tot = 0;
$nfi_nf_phos_tot = 0;
$nfi_nf_unphos_tot = 0;
for ($hfi=1; $hfi<=$nbins_fi; $hfi++) {
$nfi_phos_tot += $nfi_phos[$hfi];
$nfi_unphos_tot += $nfi_unphos[$hfi];
$nfi_nf_phos_tot += $nfi_nf_phos[$hfi];
$nfi_nf_unphos_tot += $nfi_nf_unphos[$hfi];
}
printf BINS "\n phos_tot, unphos_tot, nf_phos_tot, nf_unphos_tot = $nfi_phos_tot $nfi_unphos_tot $nfi_nf_phos_tot $nfi_nf_unphos_tot\n";
printf BINS "\n NOTE that other than enrich_PU, all enr/enrich props use 'nf' - complete wins around STY";
printf BINS "\n NOTE that enrich = 1.0 where a bin P/U ratio for that prop = P/U ratio overall all";
printf BINS "\n\n FI_bin FldInd n_P n_U n_nf_P n_nf_U ratio_PU enrich_PU n_bord_P n_bord_U bord/STY rel_brdPU";
printf BINS " ent_P ent_U ent/STY rel_entPU\n";
for ($hfi=1; $hfi<=$nbins_fi; $hfi++) {
$top = $nfi_nf_phos[$hfi] * $nfi_nf_unphos_tot; # calc enrich with nwins P/U
$bot = $nfi_nf_unphos[$hfi] * $nfi_nf_phos_tot;
if (($top != 0) and ($bot != 0)) {
$enrich_nf_PU_here = $top/$bot;
} else {
$enrich_nf_PU_here = 1.0;
}
if ($nfi_nf_unphos[$hfi] != 0) {
$ratio_nf_PU_here = $nfi_nf_phos[$hfi] / $nfi_nf_unphos[$hfi];
} else {
$ratio_nf_PU_here = 1;
}
$bot = $nfi_nf_phos[$hfi] + $nfi_nf_unphos[$hfi];
$top = $nbordfi_phos[$hfi] + $nbordfi_unphos[$hfi];
if (($top != 0) and ($bot != 0)) {
$STY_in_bord_here = $top/$bot;
} else {
$STY_in_bord_here = 1;
}
$bot = $nfi_nf_phos[$hfi] + $nfi_nf_unphos[$hfi];
$top = $entfi_phos[$hfi] + $entfi_unphos[$hfi];
if (($top != 0) and ($bot != 0)) {
$STY_ent_here = $top/$bot;
} else {
$STY_ent_here = 1;
}
$top = $nbordfi_phos[$hfi]*$nfi_nf_unphos[$hfi];
$bot = $nbordfi_unphos[$hfi]*$nfi_nf_phos[$hfi];
if (($top != 0) and ($bot != 0)) {
$rel_enr_PU_bord_here = $top/$bot;
} else {
$rel_enr_PU_bord_here = 1.0;
}
$top = $entfi_phos[$hfi]*$nfi_nf_unphos[$hfi];
$bot = $entfi_unphos[$hfi]*$nfi_nf_phos[$hfi];
if (($top != 0) and ($bot != 0)) {
$rel_enr_PU_ent_here = $top/$bot;
} else {
$rel_enr_PU_ent_here = 1.0;
}
$fi_here = -2.151 + ($hfi-0.5)*((1.634+2.151)/$nbins_fi);
printf BINS "%7.0f %6.3f %9.0f %9.0f", $hfi, $fi_here, $nfi_phos[$hfi], $nfi_unphos[$hfi];
printf BINS "% 9.0f %9.0f %9.4f %9.3f", $nfi_nf_phos[$hfi], $nfi_nf_unphos[$hfi], $ratio_nf_PU_here, $enrich_nf_PU_here;
printf BINS " %9.0f %8.0f %9.4f %9.3f", $nbordfi_phos[$hfi], $nbordfi_unphos[$hfi], $STY_in_bord_here, $rel_enr_PU_bord_here;
printf BINS " %9.0f %8.0f %9.3f %9.3f\n", $entfi_phos[$hfi], $entfi_unphos[$hfi], $STY_ent_here, $rel_enr_PU_ent_here;
}
printf BINS "\nMaking sense of the bins:\n";
printf BINS "Fold Value below zero = unfolded, above zero = folded (Uversky / Prilusky)\n";
printf BINS "n_P/U numbers of phos/unphos STY picked up from seqs, wherever in seq\n";
printf BINS "n_nf_P/U nos of phos/unphos STY that are within complete winlenP (eg 21) windows\n";
printf BINS " these nf numbers of P/U are used for subsequent ratio/enrichment calcs\n";
printf BINS "ratio_PU simple ratio of phos / unphos in each FI bin\n";
printf BINS "enrich_PU enrichment of P to U, normalised such that average over all bins = 1\n";
printf BINS " the classic result is that we should have > 1 for unfolded proteins\n";
printf BINS "n_bord_P/U STY phos/unphos nos (in comp winlenP) at charge borders (tested in winlenB)\n";
printf BINS "bord/STY for all STY, the fraction in charge border as function of FI bin\n";
printf BINS " see a general trend, presumably related to more charge for less folded proteins\n";
printf BINS "rel_brdPU enrichment of charge borders P to U, normalised by ratio of P to U in bin\n";
printf BINS " we're looking here - may be some enrichment for P close to fold boundary\n";
printf BINS "ent_P/U sum over winlenP sequence entropies for phos / unphos sites\n";
printf BINS "ent/STY for all STY, the average seq entropy per site in this bin\n";
printf BINS "rel_entPU enrichment of these entropes P to U, normalised by ratio of P to U in bin\n";
printf BINS " what to say about the seq entropy P to U props ??\n";
# We see rel_brdPU variation on dis and FI - but what about vs charge and KD prop bins separately?
printf BINS "\n We see rel_bord variation with dis and FI bins - what about charge and KD bins?";
printf BINS "\n\nnumbin fracQ bord/STY rel_brdPU meanQ bord/STY rel_brdPU KD bord/STY rel_brdPU\n";
for ($h=1; $h<=$nbins_dis; $h++) {
printf BINS "%7.0f", $h;
$bot = $nFQ_nf_phos[$h] + $nFQ_nf_unphos[$h];
$top = $nbordFQ_phos[$h] + $nbordFQ_unphos[$h];
if (($top != 0) and ($bot != 0)) {
$STY_in_bord_here = $top/$bot;
} else {
$STY_in_bord_here = 1;
}
$top = $nbordFQ_phos[$h]*$nFQ_nf_unphos[$h];
$bot = $nbordFQ_unphos[$h]*$nFQ_nf_phos[$h];
if (($top != 0) and ($bot != 0)) {
$rel_enr_PU_bord_here = $top/$bot;
} else {
$rel_enr_PU_bord_here = 1.0;
}
printf BINS " fracQ %9.4f %9.3f", $STY_in_bord_here, $rel_enr_PU_bord_here;
$bot = $nMQ_nf_phos[$h] + $nMQ_nf_unphos[$h];
$top = $nbordMQ_phos[$h] + $nbordMQ_unphos[$h];
if (($top != 0) and ($bot != 0)) {
$STY_in_bord_here = $top/$bot;
} else {
$STY_in_bord_here = 1;
}
$top = $nbordMQ_phos[$h]*$nMQ_nf_unphos[$h];
$bot = $nbordMQ_unphos[$h]*$nMQ_nf_phos[$h];
if (($top != 0) and ($bot != 0)) {
$rel_enr_PU_bord_here = $top/$bot;
} else {
$rel_enr_PU_bord_here = 1.0;
}
printf BINS " meanQ %9.4f %9.3f", $STY_in_bord_here, $rel_enr_PU_bord_here;
$bot = $nKD_nf_phos[$h] + $nKD_nf_unphos[$h];
$top = $nbordKD_phos[$h] + $nbordKD_unphos[$h];
if (($top != 0) and ($bot != 0)) {
$STY_in_bord_here = $top/$bot;
} else {
$STY_in_bord_here = 1;
}
$top = $nbordKD_phos[$h]*$nKD_nf_unphos[$h];
$bot = $nbordKD_unphos[$h]*$nKD_nf_phos[$h];
if (($top != 0) and ($bot != 0)) {
$rel_enr_PU_bord_here = $top/$bot;
} else {
$rel_enr_PU_bord_here = 1.0;
}
printf BINS " KD %9.4f %9.3f\n", $STY_in_bord_here, $rel_enr_PU_bord_here;
}
close (BINS);
# output the sequences, with P and Order annotation lines
$file_here = "$FindBin::Bin/seq_props.out";
open (SEQS_OUT, ">$file_here") or die "cannot open seq_props.out\n";
# open (SEQS_SUM, ">POsummary.out") or die "cannot open PO.out\n";
# printf SEQS_SUM "VALUES OVER THE ENTIRE INPUT SET follow: N genes = $nseqs\n\n";
# printf SEQS_SUM "N genes with >= 1 Motif = $nwithmotif\n";
# printf SEQS_SUM "N genes with >= 1 Motif also DIS = $nwithmotifDIS\n";
# $answerA = $totalmotifs/$nseqs;
# printf SEQS_SUM "AVG Motifs per gene = $answerA\n";
# $answerB = $totalmotifsDIS/$nseqs;
# printf SEQS_SUM "AVG Motifs per gene also DIS = $answerB\n";
# printf SEQS_SUM "Perc disorder for all genes = $PercDISTOT\n\n";
# printf SEQS_SUM "VALUES FOR EACH GENE follow\n\n";
for ($nseq=1; $nseq<=$nseqs; $nseq++) {
printf SEQS_OUT "$seqid[$nseq]\n";
# printf SEQS_SUM "$seqid[$nseq]\n";
printf SEQS_OUT "KD meanQ FldInd Dis Ent FracPhos FracBord = ";
printf SEQS_OUT "%6.3f %6.3f ", $KDnorm_seqs{$seqid[$nseq]}, $chrmean_seqs{$seqid[$nseq]};
printf SEQS_OUT "%6.3f %6.3f %6.3f ", $foldind_seqs{$seqid[$nseq]}, $dis_seqs{$seqid[$nseq]}, $ent_seqs{$seqid[$nseq]};
printf SEQS_OUT "%6.3f %6.3f ", $frac_phos_seqs{$seqid[$nseq]}, $frac_bord_seqs{$seqid[$nseq]};
printf SEQS_OUT "\n";
printf SEQS_OUT "SEQ=$seq{$seqid[$nseq]}\n";
printf SEQS_OUT "DIS=$propchar{$seqid[$nseq]}\n";
printf SEQS_OUT "CHR=$signchar{$seqid[$nseq]}\n";
printf SEQS_OUT "CHR=$chrcharA{$seqid[$nseq]}\n";
printf SEQS_OUT "CHR=$chrcharB{$seqid[$nseq]}\n";
printf SEQS_OUT "\n";
if ($extra eq "yes") { # optional output of vertical data list, one file per sequence
$filehere = 'seq_'.$nseq.'.txt';
open (EXTRA, ">$filehere") or die "cannot open $filehere\n";
printf EXTRA "$seqid[$nseq]\n";
@seqextra = split ("",$seq{$seqid[$nseq]});
$aatop = length $seq{$seqid[$nseq]};
@disextra = split ("",$propchar{$seqid[$nseq]});
@signextra = split ("",$signchar{$seqid[$nseq]});
@chrextraA = split ("",$chrcharA{$seqid[$nseq]});
@chrextraB = split ("",$chrcharB{$seqid[$nseq]});
for ($aa=1; $aa<=$aatop; $aa++) {
if ($disextra[$aa-1] eq "*") {
$disout = "1";
} else {
$disout = "0";
}
if ($chrextraA[$aa-1] eq " ") {
$chrout = $chrextraB[$aa-1];
} else {
$chrout = $chrextraA[$aa-1].$chrextraB[$aa-1];
}
# printf EXTRA "$seqextra[$aa-1] $disout $all_winQ[$nseq][$aa] check=$signextra[$aa-1]$chrextra[$aa-1]\n";
printf EXTRA "$seqextra[$aa-1] $disout $all_winQ[$nseq][$aa] check=$signextra[$aa-1]$chrout\n";
}
close (EXTRA);
}
}
close (SEQS_OUT);
# close (SEQS_SUM);
close (BLAH);
exit;
# Niwa population 3049 ids matched, and a further 653 excluded by MEMBRANE_EXCLUDE
# Niwa - following averages and std deviations refer to the 3049-653 population
# Niwa - following data segment from seq_compositions_perc_pipeline.pl followed by seq_compositions_stats_pipeline w/ STATS_MODE unset
# Added POP, TOP, LOW to start of transferred data (seq_compositions), for ease of reading
# Also added the data source to each horizantal data line, for ease of reading
POP, NIWA, AVG, 53.347, DEV, 33.910, N, 2396, TAG, percentage-solubility
POP, NIWA, WHOLE-SEQ,headers,K-R,D-E,naa,totperc,A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y,K+R,D+E,K+R-D-E,K+R+D+E,F+W+Y,pI,KyteDoo,abs-charge,FoldIndex,disorder,entropy,betapropensity
POP, NIWA, WHOLE-SEQ,average, -0.922, -0.983,307.996,100.000, 9.347, 1.360, 5.654, 6.637, 3.456, 7.021, 2.543, 5.668, 4.947, 9.969, 2.754, 3.921, 4.409, 4.697, 5.868, 5.542, 5.307, 6.847, 1.297, 2.755, 10.815, 12.291, -1.476, 23.106, 7.508, 6.693, 0.475, 0.029, 0.143, -0.084, 4.067, 0.993
POP, NIWA, WHOLE-SEQ,std-dev, 3.438, 2.759,178.157, 0.000, 2.752, 1.217, 1.614, 2.132, 1.409, 2.287, 1.289, 1.835, 2.327, 2.568, 1.114, 1.557, 1.632, 1.844, 2.183, 1.876, 1.743, 2.017, 0.994, 1.383, 2.923, 2.586, 3.708, 4.088, 2.529, 1.854, 0.025, 0.027, 0.080, 0.021, 0.093, 0.026
TOP, NIWA, AVG, 113.241, DEV, 8.620, N, 145, TAG, percentage-solubility
TOP, NIWA, WHOLE-SEQ,headers,K-R,D-E,naa,totperc,A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y,K+R,D+E,K+R-D-E,K+R+D+E,F+W+Y,pI,KyteDoo,abs-charge,FoldIndex,disorder,entropy,betapropensity
TOP, NIWA, WHOLE-SEQ,average, 1.042, -1.529,188.055,100.000, 8.993, 1.277, 6.064, 7.593, 2.964, 6.639, 2.259, 5.472, 6.560, 9.202, 3.041, 3.774, 4.125, 4.691, 5.518, 5.770, 5.464, 7.112, 1.052, 2.431, 12.078, 13.657, -1.579, 25.735, 6.447, 6.572, 0.462, 0.047, 0.089, -0.084, 3.995, 0.986
TOP, NIWA, WHOLE-SEQ,std-dev, 4.125, 3.843,122.263, 0.000, 3.163, 1.502, 2.193, 2.973, 1.697, 2.544, 1.628, 1.989, 3.529, 2.816, 1.642, 1.941, 2.154, 2.685, 2.740, 2.226, 2.097, 2.674, 1.094, 1.514, 4.786, 3.540, 6.298, 5.586, 2.709, 2.227, 0.033, 0.045, 0.109, 0.026, 0.115, 0.032
LOW, NIWA, AVG, 5.208, DEV, 1.747, N, 125, TAG, percentage-solubility
LOW, NIWA, WHOLE-SEQ,headers,K-R,D-E,naa,totperc,A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y,K+R,D+E,K+R-D-E,K+R+D+E,F+W+Y,pI,KyteDoo,abs-charge,FoldIndex,disorder,entropy,betapropensity
LOW, NIWA, WHOLE-SEQ,average, -2.651, -0.520,437.096,100.000, 8.922, 1.228, 5.438, 5.957, 3.800, 6.932, 2.923, 5.307, 3.812, 10.676, 2.674, 4.060, 4.525, 4.901, 6.463, 5.892, 5.369, 6.001, 1.833, 3.286, 10.275, 11.395, -1.120, 21.670, 8.918, 6.988, 0.473, 0.016, 0.150, -0.082, 4.113, 0.992
LOW, NIWA, WHOLE-SEQ,std-dev, 2.618, 2.230,204.698, 0.000, 2.227, 0.743, 1.187, 1.547, 1.272, 1.912, 1.117, 1.682, 1.387, 2.481, 0.906, 1.623, 1.274, 1.557, 1.764, 1.517, 1.470, 1.585, 0.956, 1.565, 1.793, 1.623, 1.643, 2.998, 2.599, 1.476, 0.021, 0.011, 0.060, 0.020, 0.056, 0.021
# calc_flag, signed_diff_zscore, y/yes or n/no to include or not in prediction, taken from p-value of overall correlations
# also avoid most straight linear combinations of aas in prediction
# then, in the prediction scheme, for each feature to be included (y/n) we will have:
# feature x, let xL = x avg in LOW set, xT = x avg in TOP set, and measured prop y, yL = avg in LOW ser, yT = avg in TOP set
# for feature value x:
# if x < xL, x=xL or if x > xT, x=xT, and predicted y = yL + [(x-xL)/( xT-xL)]*(yT-yL)
# followed by linear combination with abs value of zscore_diff for each x/feature
# our current (hopefully FINAL) list of 10 features for prediction is given in as y/n
ZDF,NIWA,WHOLE-SEQ,headers,K-R,D-E,naa,totperc,A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y,K+R,D+E,K+R-D-E,K+R+D+E,F+W+Y,pI,KyteDoo,abs-charge,FoldIndex,disorder,entropy,betapropensity
ZDF, NIWA, WHOLE-SEQ, zscore-diff, 1.096, -0.333, -1.523, 0.000, 0.026, 0.043, 0.370, 0.724, -0.563, -0.132, -0.484, 0.090, 1.118, -0.557, 0.288, -0.160, -0.234, -0.099, -0.419, -0.065, 0.053, 0.522, -0.762, -0.555, 0.548, 0.876, -0.116, 0.947, -0.931, -0.225, -0.390, 1.105, -0.711, -0.103, -1.382, -0.258
ZDF, NIWA, WHOLE-SEQ, use_for_prob,y,n,y,n,n,n,n,n,n,n,y,n,n,y,n,n,n,n,n,n,n,y,n,n,n,y,n,n,y,n,n,y,y,n,y,n
#!/usr/bin/perl -w
use FindBin;
$file_here = "$FindBin::Bin/blah.txt";
open (BLAH, ">>$file_here"); # file for comments used throughout server code
# read in the reference data for this dataset.
$file_here = "$FindBin::Bin/seq_reference_data.txt";
open (REF, "<$file_here") or die "cannot open seq_reference_data.txt\n";
@ref = <REF>;
close (REF);
foreach $line_spaces (@ref) {
chomp $line_spaces;
$line = $line_spaces;
$line =~ s/\s//g;
@words = split (",",$line);
$nwords = @words;
if (exists $words[0]) { # non-null line
if (substr ($line,0,1) ne "#") { # not comment line
if ($words[2] eq 'AVG') {
if ($words[0] eq 'POP') {
$population_avg = $words[3];
}
if ($nwords != 10) { # check the number, but not format, of expected fields
printf BLAH "stopping *** number of fields in AVG line mismatch\n";
exit;
} else {
$AVG_line{$words[0]} = 'seen';
if ($words[0] eq 'TOP') { $top_prop_avg = $words[3]; }
if ($words[0] eq 'LOW') { $low_prop_avg = $words[3]; }
}
} elsif ($words[3] eq 'headers') {
if ($nwords != 40) { # check the number, but not format, of expected fields
printf BLAH "stopping *** number of fields in headers line mismatch\n";
exit;
} else {
$headers_line{$words[0]} = 'seen';
}
if ($words[0] eq 'POP') {
$headers_store = $line; # store the headers just once, for results write
}
} elsif ($words[3] eq 'average') {
if ($nwords != 40) { # check the number, but not format, of expected fields
printf BLAH "stopping *** number of fields in average line mismatch\n";
exit;
} else {
$average_line{$words[0]} = 'seen';
}
if ($words[0] eq 'TOP') { # store top and low averages to derive the linear fits (per feature)
$top_avg_line = $line;
} elsif ($words[0] eq 'LOW') {
$low_avg_line = $line;
} elsif ($words[0] eq 'POP') {
$pop_avg_line = $line;
}
} elsif ($words[3] eq 'std-dev') {
if ($nwords != 40) { # check the number, but not format, of expected fields
printf BLAH "stopping *** number of fields in std-dev line mismatch\n";
exit;
} else {
$std_dev_line{$words[0]} = 'seen';
}
if ($words[0] eq 'POP') { # store population std deviations for features
$pop_dev_line = $line;
#unused } elsif ($words[0] eq 'TOP') {
# $top_dev_line = $line;
# } elsif ($words[0] eq 'LOW') {
# $low_dev_line = $line;
}
} elsif (($words[0] eq 'ZDF') and ($words[3] eq 'zscore-diff')) {
if ($nwords != 40) { # check the number, but not format, of expected fields
printf BLAH "stopping *** number of fields in ZDF zscore-diff line mismatch\n";
exit;
} else {
$zdiff_line = $line;
$zdf_values = 'seen';
}
} elsif (($words[0] eq 'ZDF') and ($words[3] eq 'use_for_prob')) {
if ($nwords != 40) { # check the number, but not format, of expected fields
printf BLAH "stopping *** number of fields in ZDF use_for_prob line mismatch\n";
exit;
} else {
$use_for_prob_line = $line;
$zdf_use = 'seen'
}
}
} # end not comment line
} # end not null line
} # end line read
for $key_here ('POP', 'TOP', 'LOW' ) {
if (($headers_line{$key_here} ne 'seen') or ($average_line{$key_here} ne 'seen') or ($std_dev_line{$key_here} ne 'seen')) {
printf BLAH "stopping **** headers/average/std_dev line unseen for $key_here\n";
exit;
}
if ($AVG_line{$key_here} ne 'seen') {
printf BLAH "stopping **** AVG line unseen for $key_here\n";
exit;
}
}
if (($zdf_values ne 'seen') or ($zdf_use ne 'seen')) {
printf BLAH "stopping **** ZDF zscore-diff/use_for_prob line unseen\n";
}
# store the reference values needed for prediction
$low_prop = $low_prop_avg;
$top_prop = $top_prop_avg;
@lows = split (",",$low_avg_line); # values run from 4-39 (inclusive), with array index starting at 0
@tops = split (",",$top_avg_line); # as above
@pops = split (",",$pop_avg_line); # as above
@pop_devs = split (",",$pop_dev_line); # as above
@zdiffs = split (",",$zdiff_line); # as above, zdiff_scores remain signed at this point
@use = split (",",$use_for_prob_line); # as above, values are y,n
@heads = split (",",$headers_store); # as above
# read and store data values for this sequence set
$file_here = "$FindBin::Bin/seq_composition.txt";
open (COMP, "<$file_here") or die "cannot open seq_composition.txt\n";
$headers_found = "no";
$nseqs = 0;
while ($line_spaces = <COMP>) {
chomp $line_spaces;
$line = $line_spaces;
$line =~ s/\s//g;
@words = split (",",$line);
$nwords = @words;
if ($headers_found eq "no") {
if ((exists $words[0]) and (exists $words[1])) {
if (($words[0] eq "WHOLE-SEQ") and ($words[1] eq "ORF-ID")) {
$headers_found = "yes";
# @heads_comp = @words; # not used currently
}
}
}
if (exists $words[0]) {
if ($words[0] eq "WHOLE-SEQ") { # only process WHOLE-SEQ here - will need other of our code for seq-profiles
if ($words[1] ne "ORF-ID") { # check not on header line
if ($nwords != 38) { # flag, ID, 36 comp etc data fields
printf BLAH "stopping **** fields ne 38 in composition file data line\n";
exit;
} else {
$nseqs++;
$ids[$nseqs] = $words[1];
for ($n=1; $n<=36; $n++) { $data[$nseqs][$n] = $words[$n+1]; }
}
}
}
}
}
close (COMP);
$file_here = "$FindBin::Bin/seq_prediction.txt";
open (PREDA, ">$file_here") or die "cannot open seq_prediction.txt\n";
# before the sequence data writes, output things that don't change
$wt_sum = 0;
for ($d=1; $d<=36; $d++) {
if ($use[$d+3] eq 'y') {
$wt_feature[$d] = abs ($zdiffs[$d+3]); # get feature weighting for prediction (mostly zero probably)
$wt_sum += $wt_feature[$d];
} else {
$wt_feature[$d] = 0;
}
}
printf PREDA "LEGEND 35 sequence features are calculated, including 20 amino acid compositions\n";
printf PREDA "LEGEND All features are calculated over a sliding 21 amino acid window\n";
printf PREDA "LEGEND KmR=KminusR DmE=DminusE KpR=KplusR PmN=K+R-D-E PpN=K+R+D+N aro=F+W+Y\n";
printf PREDA "LEGEND fld = border of folded (pos) / unfolded (neg) Uversky et al Proteins 2000 41:415\n";
printf PREDA "LEGEND dis = disorder propensity Rune & Linding NAR 2003 31:3701\n";
printf PREDA "LEGEND bet = beta strand propensities Costantini et al 2006 BBRC 342:441\n";
printf PREDA "LEGEND mem = Kyte-Doolittle hydropathy normalised (-1 to +1) JMB 157:105\n";
printf PREDA "LEGEND Only a subset of features used for prediction, according to best fit against data\n";
printf PREDA "LEGEND underlying solubility data: cell-free expression Niwa et al PNAS 2009 106:4201\n";
printf PREDA "LEGEND Features used for prediction are scaled 0 - 1 over range in underlying dataset\n";
printf PREDA "HEADERS PREDICTIONS LINE,ID,percent-sol,scaled-sol,population-sol,pI\n";
printf PREDA "HEADERS FEATURES ORIGINAL,ID";
for ($d=1; $d<=3; $d++) { printf PREDA ",$heads[$d+3]"; }
for ($d=5; $d<=36; $d++) { printf PREDA ",$heads[$d+3]"; }
printf PREDA "\n";
printf PREDA "HEADERS FEATURES PLOT,ID,";
printf PREDA "KmR,DmE,len,";
for ($d=5; $d<=24; $d++) { printf PREDA "$heads[$d+3],"; }
printf PREDA "KpR,DpE,PmN,PpN,aro,pI,mem,chr,fld,dis,ent,bet\n";
printf PREDA "\n";
for ($n=1; $n<=$nseqs; $n++) {
$wt_sum = 0;
$pred_sum = 0;
for ($d=1; $d<=36; $d++) {
if ($d != 4) { # get the values of (X-Xavg-pop) / std-dev-pop i.e. a neg/pos scale
$norm_dev[$d] = ($data[$n][$d] - $pops[$d+3]) / $pop_devs[$d+3];
}
$wt_feature[$d] = 0; # initialise to zero, since will only overwrite if feature used for prediction
if ($use[$d+3] eq 'y') { # we are using this feature (based on correlation analysis)
$data_here = $data[$n][$d];
if ($lows[$d+3] <= $tops[$d+3]) { # range as the words say, bounds check
if ($data[$n][$d] < $lows[$d+3]) { $data_here = $lows[$d+3]; }
if ($data[$n][$d] > $tops[$d+3]) { $data_here = $tops[$d+3]; }
} else { # inverted range, bounds check the other way around
if ($data[$n][$d] > $lows[$d+3]) { $data_here = $lows[$d+3]; }
if ($data[$n][$d] < $tops[$d+3]) { $data_here = $tops[$d+3]; }
}
$pred_here = $low_prop + ($top_prop - $low_prop) * ( ($data_here - $lows[$d+3]) / ($tops[$d+3] - $lows[$d+3]) );
$wt_here = abs ($zdiffs[$d+3]);
$pred_sum += $wt_here*$pred_here;
$wt_sum += $wt_here;
$wt_feature[$d] = $wt_here;
}
}
$prediction = $pred_sum / $wt_sum;
$prediction_scaled = ($prediction - $low_prop) / ($top_prop - $low_prop);
$population_scaled = ($population_avg - $low_prop) / ($top_prop - $low_prop);
$pI_here = $data[$n][30];
printf PREDA "SEQUENCE PREDICTIONS,$ids[$n]"; # output sequence-specific data
printf PREDA ",%6.3f,%6.3f,%6.3f,%6.3f\n", $prediction,$prediction_scaled,$population_scaled,$pI_here;
printf PREDA "SEQUENCE WEIGHTS,$ids[$n]";
for ($d=1; $d<=3; $d++) {
$wt_print = $wt_feature[$d]/$wt_sum;
printf PREDA ",%6.3f", $wt_print;
}
for ($d=5; $d<=36; $d++) {
$wt_print = $wt_feature[$d]/$wt_sum;
printf PREDA ",%6.3f", $wt_print;
}
printf PREDA "\n"; # output the devs/std-dev for each feature, for this sequence
printf PREDA "SEQUENCE DEVIATIONS,$ids[$n]";
for ($d=1; $d<=3; $d++) { printf PREDA ",%6.3f", $norm_dev[$d]; }
for ($d=5; $d<=36; $d++) { printf PREDA ",%6.3f", $norm_dev[$d]; }
printf PREDA "\n\n";
}
close (PREDA);
close (BLAH);
exit;
comment amino acid sec struc propensities from
comment Constantini et al (2006) BBRC 342:441-451
comment aa3 aa1 freq Pa Pb Pc
data ALA A 7.73 1.39 0.75 0.80
data CYS C 1.84 0.74 1.31 1.05
data ASP D 5.82 0.89 0.55 1.33
data GLU E 6.61 1.35 0.72 0.86
data PHE F 4.05 1.01 1.43 0.76
data GLY G 7.11 0.47 0.65 1.62
data HIS H 2.35 0.92 0.99 1.07
data ILE I 5.66 1.04 1.71 0.59
data LYS K 6.27 1.11 0.83 1.00
data LEU L 8.83 1.32 1.10 0.68
data MET M 2.08 1.21 0.99 0.83
data ASN N 4.50 0.77 0.62 1.39
data PRO P 4.52 0.50 0.44 1.72
data GLN Q 3.94 1.29 0.76 0.89
data ARG R 5.03 1.17 0.91 0.91
data SER S 6.13 0.82 0.85 1.24
data THR T 5.53 0.76 1.23 1.07
data VAL V 6.91 0.89 1.86 0.64
data TRP W 1.51 1.06 1.30 0.79
data TYR Y 3.54 0.95 1.50 0.78
comment suggested algorithm is simply to sum over
comment window (eg 7) and take largest value = ss
Markdown is supported
0% or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment