protein-sol全部改成当前路径

40086dd8 · junyu_yao · 40086dd8 · 40086dd8 · 40086dd8 · 40086dd8
Commit 40086dd8 authored Sep 14, 2023 by junyu_yao
11 changed files
--- a/README_sequence_prediction.txt
+++ b/README_sequence_prediction.txt
+
+October 2017
+
+Sequence-based prediction code used in University of Manchester protein-sol server.
+Available from download tab at www.protein-sol.manchester.ac.uk.
+see Hebdtich et al (2017) Bioinformatics 33:3098-3100.
+
+Jim Warwicker and Mex Hebditch, Manchester
+
+*****************
+   Copyright (C) 2017 Jim Warwicker and Max Hebdtich
+
+    These programs are free software: you can redistribute it and/or modify
+    it under the terms of the GNU General Public License as published by
+    the Free Software Foundation, either version 3 of the License.
+
+    These programs are distributed in the hope that it will be useful,
+    but WITHOUT ANY WARRANTY; without even the implied warranty of
+    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+    GNU General Public License for more details.
+
+The code is available 'as is', we are planning further developments to the server
+and code, and here is just a snapshot that should allow interested users to make
+calculations with multiple fasta sequences (as opposed to the single sequence
+operation of the web server).
+*****************
+
+CODE is in various perl scripts (.pl)
+
+fasta_seq_reformat_export.pl
+seq_compositions_perc_pipeline_export.pl
+server_prediction_seq_export.pl
+seq_props_ALL_export.pl
+profiles_gather_export.pl
+
+RUN is initiated with
+
+'multiple_prediction_wrapper_export.sh sequence_input_file'
+
+sequence_input_file has something like a fasta format, it will convert to
+paired records of ID and sequence for each entry:
+
+>blah11
+MVKVYAPASSANMSVGFDVLFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFDKLPSEPRENIVYQCWERFCQE
+
+>blah22
+MVKVYAPASSANMSVGFDVLFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFDKLPSEPRENIVYQCWERFCQE
+
+but regular fasta should be OK (fasta_seq_reformat_export to convert).
+
+OTHER INPUT (DATA) FILES:
+
+ss_propensities.txt		(sec struc propensities)
+seq_reference_data_NIWA.txt	(fitting to experimental solubility data)
+
+RUNNING, as set up, will occur with all files in the local directort/
+
+OUTPUT
+
+seq_prediction.txt (CSV) contains the data relating to that provided on the server:
+
+LEGEND records - brief information on features used for the predictions
+
+HEADERS records - matched with keywords for the data output SEQUENCE records:
+
+HEADERS PREDICTIONS with SEQUENCE PREDICTIONS
+percent-sol, scaled-sol, population-sol, pI
+
+HEADERS FEATURES ORIGINAL gives definitions of the features
+HEADERS FEATURES PLOT give short-hand names for the features
+both of these HEADERS lines are matched columns with SEQUENCE WEIGHTS and SEQUENCE DEVIATIONS
+lines, where:
+SEQUENCE WEIGHTS - weights from the fit to experimental data, only 10 features are non-zero
+SEQUENCE DEVIATIONS - these are the z-score (see publication) deviation for each feature
+
+then follows PROFILE data across successive 21 amino acid windows for:
+Kyte-Doolittle
+Uversky fold value	- plotted on server
+sequence entropy
+windowed net charge	- plotted on server
+
+
+
+
--- a/blah.txt
+++ b/blah.txt
+
+TM REGION PREDICTED for >sp|P61626|LYSC_HUMAN - with non-normalised KD of   1.78 compared with threshold of 1.6
+
+HIS will be set at charge 0
+
+column o/p file for each sequence = no - set with extra_output env variable
+
+KDnorm for y is 0.355555555555556
+KDnorm for e is 0.111111111111111
+KDnorm for c is 0.777777777777778
+KDnorm for Q is 0.111111111111111
+KDnorm for W is 0.4
+KDnorm for n is 0.111111111111111
+KDnorm for q is 0.111111111111111
+KDnorm for m is 0.711111111111111
+KDnorm for N is 0.111111111111111
+KDnorm for f is 0.811111111111111
+KDnorm for x is 0.5
+KDnorm for X is 0.5
+KDnorm for F is 0.811111111111111
+KDnorm for l is 0.922222222222222
+KDnorm for v is 0.966666666666667
+KDnorm for K is 0.0666666666666667
+KDnorm for s is 0.411111111111111
+KDnorm for g is 0.455555555555555
+KDnorm for P is 0.322222222222222
+KDnorm for H is 0.144444444444444
+KDnorm for R is 0
+KDnorm for D is 0.111111111111111
+KDnorm for r is 0
+KDnorm for I is 1
+KDnorm for G is 0.455555555555555
+KDnorm for a is 0.7
+KDnorm for L is 0.922222222222222
+KDnorm for M is 0.711111111111111
+KDnorm for h is 0.144444444444444
+KDnorm for S is 0.411111111111111
+KDnorm for A is 0.7
+KDnorm for p is 0.322222222222222
+KDnorm for E is 0.111111111111111
+KDnorm for C is 0.777777777777778
+KDnorm for w is 0.4
+KDnorm for T is 0.422222222222222
+KDnorm for k is 0.0666666666666667
+KDnorm for d is 0.111111111111111
+KDnorm for Y is 0.355555555555556
+KDnorm for t is 0.422222222222222
+KDnorm for i is 1
+KDnorm for V is 0.966666666666667
+
+HIS will be set at charge 0
+
+column o/p file for each sequence = no - set with extra_output env variable
+
+KDnorm for P is 0.322222222222222
+KDnorm for D is 0.111111111111111
+KDnorm for Y is 0.355555555555556
+KDnorm for x is 0.5
+KDnorm for N is 0.111111111111111
+KDnorm for w is 0.4
+KDnorm for e is 0.111111111111111
+KDnorm for H is 0.144444444444444
+KDnorm for R is 0
+KDnorm for l is 0.922222222222222
+KDnorm for i is 1
+KDnorm for a is 0.7
+KDnorm for y is 0.355555555555556
+KDnorm for C is 0.777777777777778
+KDnorm for M is 0.711111111111111
+KDnorm for F is 0.811111111111111
+KDnorm for r is 0
+KDnorm for n is 0.111111111111111
+KDnorm for k is 0.0666666666666667
+KDnorm for Q is 0.111111111111111
+KDnorm for I is 1
+KDnorm for q is 0.111111111111111
+KDnorm for W is 0.4
+KDnorm for E is 0.111111111111111
+KDnorm for S is 0.411111111111111
+KDnorm for d is 0.111111111111111
+KDnorm for A is 0.7
+KDnorm for v is 0.966666666666667
+KDnorm for V is 0.966666666666667
+KDnorm for m is 0.711111111111111
+KDnorm for h is 0.144444444444444
+KDnorm for s is 0.411111111111111
+KDnorm for L is 0.922222222222222
+KDnorm for f is 0.811111111111111
+KDnorm for G is 0.455555555555555
+KDnorm for g is 0.455555555555555
+KDnorm for c is 0.777777777777778
+KDnorm for p is 0.322222222222222
+KDnorm for X is 0.5
+KDnorm for K is 0.0666666666666667
+KDnorm for T is 0.422222222222222
+KDnorm for t is 0.422222222222222
+
+HIS will be set at charge 0
+
+column o/p file for each sequence = no - set with extra_output env variable
+
+KDnorm for i is 1
+KDnorm for s is 0.411111111111111
+KDnorm for F is 0.811111111111111
+KDnorm for g is 0.455555555555555
+KDnorm for m is 0.711111111111111
+KDnorm for Y is 0.355555555555556
+KDnorm for S is 0.411111111111111
+KDnorm for v is 0.966666666666667
+KDnorm for q is 0.111111111111111
+KDnorm for D is 0.111111111111111
+KDnorm for w is 0.4
+KDnorm for P is 0.322222222222222
+KDnorm for M is 0.711111111111111
+KDnorm for V is 0.966666666666667
+KDnorm for W is 0.4
+KDnorm for R is 0
+KDnorm for t is 0.422222222222222
+KDnorm for c is 0.777777777777778
+KDnorm for Q is 0.111111111111111
+KDnorm for y is 0.355555555555556
+KDnorm for h is 0.144444444444444
+KDnorm for C is 0.777777777777778
+KDnorm for f is 0.811111111111111
+KDnorm for H is 0.144444444444444
+KDnorm for a is 0.7
+KDnorm for l is 0.922222222222222
+KDnorm for X is 0.5
+KDnorm for T is 0.422222222222222
+KDnorm for L is 0.922222222222222
+KDnorm for N is 0.111111111111111
+KDnorm for K is 0.0666666666666667
+KDnorm for k is 0.0666666666666667
+KDnorm for r is 0
+KDnorm for p is 0.322222222222222
+KDnorm for E is 0.111111111111111
+KDnorm for d is 0.111111111111111
+KDnorm for I is 1
+KDnorm for A is 0.7
+KDnorm for n is 0.111111111111111
+KDnorm for x is 0.5
+KDnorm for G is 0.455555555555555
+KDnorm for e is 0.111111111111111
+
+HIS will be set at charge 0
+
+column o/p file for each sequence = no - set with extra_output env variable
+
+KDnorm for g is 0.455555555555555
+KDnorm for R is 0
+KDnorm for N is 0.111111111111111
+KDnorm for Y is 0.355555555555556
+KDnorm for c is 0.777777777777778
+KDnorm for i is 1
+KDnorm for K is 0.0666666666666667
+KDnorm for G is 0.455555555555555
+KDnorm for x is 0.5
+KDnorm for e is 0.111111111111111
+KDnorm for D is 0.111111111111111
+KDnorm for p is 0.322222222222222
+KDnorm for a is 0.7
+KDnorm for s is 0.411111111111111
+KDnorm for M is 0.711111111111111
+KDnorm for v is 0.966666666666667
+KDnorm for X is 0.5
+KDnorm for l is 0.922222222222222
+KDnorm for t is 0.422222222222222
+KDnorm for I is 1
+KDnorm for m is 0.711111111111111
+KDnorm for h is 0.144444444444444
+KDnorm for P is 0.322222222222222
+KDnorm for S is 0.411111111111111
+KDnorm for r is 0
+KDnorm for q is 0.111111111111111
+KDnorm for H is 0.144444444444444
+KDnorm for E is 0.111111111111111
+KDnorm for f is 0.811111111111111
+KDnorm for y is 0.355555555555556
+KDnorm for L is 0.922222222222222
+KDnorm for F is 0.811111111111111
+KDnorm for C is 0.777777777777778
+KDnorm for n is 0.111111111111111
+KDnorm for W is 0.4
+KDnorm for Q is 0.111111111111111
+KDnorm for T is 0.422222222222222
+KDnorm for A is 0.7
+KDnorm for w is 0.4
+KDnorm for d is 0.111111111111111
+KDnorm for V is 0.966666666666667
+KDnorm for k is 0.0666666666666667
+
+HIS will be set at charge 0
+
+column o/p file for each sequence = no - set with extra_output env variable
+
+KDnorm for a is 0.7
+KDnorm for L is 0.922222222222222
+KDnorm for i is 1
+KDnorm for V is 0.966666666666667
+KDnorm for D is 0.111111111111111
+KDnorm for d is 0.111111111111111
+KDnorm for w is 0.4
+KDnorm for A is 0.7
+KDnorm for m is 0.711111111111111
+KDnorm for N is 0.111111111111111
+KDnorm for g is 0.455555555555555
+KDnorm for c is 0.777777777777778
+KDnorm for n is 0.111111111111111
+KDnorm for s is 0.411111111111111
+KDnorm for P is 0.322222222222222
+KDnorm for G is 0.455555555555555
+KDnorm for W is 0.4
+KDnorm for K is 0.0666666666666667
+KDnorm for v is 0.966666666666667
+KDnorm for T is 0.422222222222222
+KDnorm for H is 0.144444444444444
+KDnorm for l is 0.922222222222222
+KDnorm for p is 0.322222222222222
+KDnorm for e is 0.111111111111111
+KDnorm for x is 0.5
+KDnorm for k is 0.0666666666666667
+KDnorm for S is 0.411111111111111
+KDnorm for h is 0.144444444444444
+KDnorm for f is 0.811111111111111
+KDnorm for Q is 0.111111111111111
+KDnorm for M is 0.711111111111111
+KDnorm for t is 0.422222222222222
+KDnorm for E is 0.111111111111111
+KDnorm for X is 0.5
+KDnorm for F is 0.811111111111111
+KDnorm for Y is 0.355555555555556
+KDnorm for y is 0.355555555555556
+KDnorm for r is 0
+KDnorm for I is 1
+KDnorm for C is 0.777777777777778
+KDnorm for q is 0.111111111111111
+KDnorm for R is 0
+
+HIS will be set at charge 0
+
+column o/p file for each sequence = no - set with extra_output env variable
+
+KDnorm for H is 0.144444444444444
+KDnorm for P is 0.322222222222222
+KDnorm for E is 0.111111111111111
+KDnorm for K is 0.0666666666666667
+KDnorm for l is 0.922222222222222
+KDnorm for e is 0.111111111111111
+KDnorm for T is 0.422222222222222
+KDnorm for m is 0.711111111111111
+KDnorm for t is 0.422222222222222
+KDnorm for F is 0.811111111111111
+KDnorm for p is 0.322222222222222
+KDnorm for d is 0.111111111111111
+KDnorm for S is 0.411111111111111
+KDnorm for a is 0.7
+KDnorm for Y is 0.355555555555556
+KDnorm for f is 0.811111111111111
+KDnorm for L is 0.922222222222222
+KDnorm for G is 0.455555555555555
+KDnorm for D is 0.111111111111111
+KDnorm for N is 0.111111111111111
+KDnorm for M is 0.711111111111111
+KDnorm for s is 0.411111111111111
+KDnorm for A is 0.7
+KDnorm for X is 0.5
+KDnorm for k is 0.0666666666666667
+KDnorm for x is 0.5
+KDnorm for v is 0.966666666666667
+KDnorm for V is 0.966666666666667
+KDnorm for i is 1
+KDnorm for g is 0.455555555555555
+KDnorm for r is 0
+KDnorm for R is 0
+KDnorm for c is 0.777777777777778
+KDnorm for q is 0.111111111111111
+KDnorm for I is 1
+KDnorm for y is 0.355555555555556
+KDnorm for Q is 0.111111111111111
+KDnorm for h is 0.144444444444444
+KDnorm for W is 0.4
+KDnorm for C is 0.777777777777778
+KDnorm for n is 0.111111111111111
+KDnorm for w is 0.4
+
+HIS will be set at charge 0
+
+column o/p file for each sequence = no - set with extra_output env variable
+
+KDnorm for W is 0.4
+KDnorm for T is 0.422222222222222
+KDnorm for P is 0.322222222222222
+KDnorm for Q is 0.111111111111111
+KDnorm for v is 0.966666666666667
+KDnorm for I is 1
+KDnorm for g is 0.455555555555555
+KDnorm for N is 0.111111111111111
+KDnorm for k is 0.0666666666666667
+KDnorm for p is 0.322222222222222
+KDnorm for y is 0.355555555555556
+KDnorm for K is 0.0666666666666667
+KDnorm for C is 0.777777777777778
+KDnorm for i is 1
+KDnorm for R is 0
+KDnorm for L is 0.922222222222222
+KDnorm for H is 0.144444444444444
+KDnorm for m is 0.711111111111111
+KDnorm for e is 0.111111111111111
+KDnorm for t is 0.422222222222222
+KDnorm for h is 0.144444444444444
+KDnorm for n is 0.111111111111111
+KDnorm for r is 0
+KDnorm for Y is 0.355555555555556
+KDnorm for l is 0.922222222222222
+KDnorm for X is 0.5
+KDnorm for S is 0.411111111111111
+KDnorm for E is 0.111111111111111
+KDnorm for a is 0.7
+KDnorm for M is 0.711111111111111
+KDnorm for G is 0.455555555555555
+KDnorm for c is 0.777777777777778
+KDnorm for F is 0.811111111111111
+KDnorm for V is 0.966666666666667
+KDnorm for x is 0.5
+KDnorm for d is 0.111111111111111
+KDnorm for D is 0.111111111111111
+KDnorm for w is 0.4
+KDnorm for A is 0.7
+KDnorm for s is 0.411111111111111
+KDnorm for f is 0.811111111111111
+KDnorm for q is 0.111111111111111
+
+HIS will be set at charge 0
+
+column o/p file for each sequence = no - set with extra_output env variable
+
+KDnorm for X is 0.5
+KDnorm for q is 0.111111111111111
+KDnorm for C is 0.777777777777778
+KDnorm for Y is 0.355555555555556
+KDnorm for h is 0.144444444444444
+KDnorm for t is 0.422222222222222
+KDnorm for D is 0.111111111111111
+KDnorm for c is 0.777777777777778
+KDnorm for A is 0.7
+KDnorm for a is 0.7
+KDnorm for G is 0.455555555555555
+KDnorm for H is 0.144444444444444
+KDnorm for T is 0.422222222222222
+KDnorm for y is 0.355555555555556
+KDnorm for E is 0.111111111111111
+KDnorm for g is 0.455555555555555
+KDnorm for s is 0.411111111111111
+KDnorm for I is 1
+KDnorm for P is 0.322222222222222
+KDnorm for F is 0.811111111111111
+KDnorm for i is 1
+KDnorm for K is 0.0666666666666667
+KDnorm for w is 0.4
+KDnorm for v is 0.966666666666667
+KDnorm for x is 0.5
+KDnorm for m is 0.711111111111111
+KDnorm for W is 0.4
+KDnorm for R is 0
+KDnorm for e is 0.111111111111111
+KDnorm for M is 0.711111111111111
+KDnorm for V is 0.966666666666667
+KDnorm for k is 0.0666666666666667
+KDnorm for p is 0.322222222222222
+KDnorm for l is 0.922222222222222
+KDnorm for r is 0
+KDnorm for n is 0.111111111111111
+KDnorm for Q is 0.111111111111111
+KDnorm for f is 0.811111111111111
+KDnorm for N is 0.111111111111111
+KDnorm for d is 0.111111111111111
+KDnorm for L is 0.922222222222222
+KDnorm for S is 0.411111111111111
+
+HIS will be set at charge 0
+
+column o/p file for each sequence = no - set with extra_output env variable
+
+KDnorm for E is 0.111111111111111
+KDnorm for y is 0.355555555555556
+KDnorm for G is 0.455555555555555
+KDnorm for m is 0.711111111111111
+KDnorm for l is 0.922222222222222
+KDnorm for N is 0.111111111111111
+KDnorm for f is 0.811111111111111
+KDnorm for V is 0.966666666666667
+KDnorm for w is 0.4
+KDnorm for S is 0.411111111111111
+KDnorm for I is 1
+KDnorm for c is 0.777777777777778
+KDnorm for R is 0
+KDnorm for M is 0.711111111111111
+KDnorm for d is 0.111111111111111
+KDnorm for T is 0.422222222222222
+KDnorm for t is 0.422222222222222
+KDnorm for Q is 0.111111111111111
+KDnorm for s is 0.411111111111111
+KDnorm for h is 0.144444444444444
+KDnorm for a is 0.7
+KDnorm for P is 0.322222222222222
+KDnorm for v is 0.966666666666667
+KDnorm for p is 0.322222222222222
+KDnorm for X is 0.5
+KDnorm for W is 0.4
+KDnorm for i is 1
+KDnorm for q is 0.111111111111111
+KDnorm for n is 0.111111111111111
+KDnorm for H is 0.144444444444444
+KDnorm for D is 0.111111111111111
+KDnorm for k is 0.0666666666666667
+KDnorm for Y is 0.355555555555556
+KDnorm for r is 0
+KDnorm for C is 0.777777777777778
+KDnorm for x is 0.5
+KDnorm for L is 0.922222222222222
+KDnorm for g is 0.455555555555555
+KDnorm for F is 0.811111111111111
+KDnorm for e is 0.111111111111111
+KDnorm for A is 0.7
+KDnorm for K is 0.0666666666666667
+
+HIS will be set at charge 0
+
+column o/p file for each sequence = no - set with extra_output env variable
+
+KDnorm for e is 0.111111111111111
+KDnorm for a is 0.7
+KDnorm for n is 0.111111111111111
+KDnorm for x is 0.5
+KDnorm for W is 0.4
+KDnorm for I is 1
+KDnorm for s is 0.411111111111111
+KDnorm for S is 0.411111111111111
+KDnorm for N is 0.111111111111111
+KDnorm for X is 0.5
+KDnorm for T is 0.422222222222222
+KDnorm for v is 0.966666666666667
+KDnorm for M is 0.711111111111111
+KDnorm for h is 0.144444444444444
+KDnorm for H is 0.144444444444444
+KDnorm for l is 0.922222222222222
+KDnorm for k is 0.0666666666666667
+KDnorm for Q is 0.111111111111111
+KDnorm for C is 0.777777777777778
+KDnorm for A is 0.7
+KDnorm for P is 0.322222222222222
+KDnorm for G is 0.455555555555555
+KDnorm for t is 0.422222222222222
+KDnorm for m is 0.711111111111111
+KDnorm for y is 0.355555555555556
+KDnorm for d is 0.111111111111111
+KDnorm for c is 0.777777777777778
+KDnorm for V is 0.966666666666667
+KDnorm for R is 0
+KDnorm for q is 0.111111111111111
+KDnorm for Y is 0.355555555555556
+KDnorm for F is 0.811111111111111
+KDnorm for r is 0
+KDnorm for D is 0.111111111111111
+KDnorm for E is 0.111111111111111
+KDnorm for L is 0.922222222222222
+KDnorm for w is 0.4
+KDnorm for f is 0.811111111111111
+KDnorm for g is 0.455555555555555
+KDnorm for K is 0.0666666666666667
+KDnorm for p is 0.322222222222222
+KDnorm for i is 1
+
+HIS will be set at charge 0
+
+column o/p file for each sequence = no - set with extra_output env variable
+
+KDnorm for S is 0.411111111111111
+KDnorm for r is 0
+KDnorm for W is 0.4
+KDnorm for A is 0.7
+KDnorm for a is 0.7
+KDnorm for f is 0.811111111111111
+KDnorm for R is 0
+KDnorm for L is 0.922222222222222
+KDnorm for h is 0.144444444444444
+KDnorm for Q is 0.111111111111111
+KDnorm for e is 0.111111111111111
+KDnorm for N is 0.111111111111111
+KDnorm for V is 0.966666666666667
+KDnorm for s is 0.411111111111111
+KDnorm for d is 0.111111111111111
+KDnorm for C is 0.777777777777778
+KDnorm for k is 0.0666666666666667
+KDnorm for q is 0.111111111111111
+KDnorm for c is 0.777777777777778
+KDnorm for I is 1
+KDnorm for F is 0.811111111111111
+KDnorm for P is 0.322222222222222
+KDnorm for M is 0.711111111111111
+KDnorm for D is 0.111111111111111
+KDnorm for G is 0.455555555555555
+KDnorm for t is 0.422222222222222
+KDnorm for v is 0.966666666666667
+KDnorm for E is 0.111111111111111
+KDnorm for g is 0.455555555555555
+KDnorm for x is 0.5
+KDnorm for T is 0.422222222222222
+KDnorm for K is 0.0666666666666667
+KDnorm for w is 0.4
+KDnorm for i is 1
+KDnorm for m is 0.711111111111111
+KDnorm for p is 0.322222222222222
+KDnorm for H is 0.144444444444444
+KDnorm for l is 0.922222222222222
+KDnorm for X is 0.5
+KDnorm for Y is 0.355555555555556
+KDnorm for y is 0.355555555555556
+KDnorm for n is 0.111111111111111
+
+HIS will be set at charge 0
+
+column o/p file for each sequence = no - set with extra_output env variable
+
+KDnorm for n is 0.111111111111111
+KDnorm for d is 0.111111111111111
+KDnorm for r is 0
+KDnorm for y is 0.355555555555556
+KDnorm for e is 0.111111111111111
+KDnorm for D is 0.111111111111111
+KDnorm for T is 0.422222222222222
+KDnorm for s is 0.411111111111111
+KDnorm for f is 0.811111111111111
+KDnorm for a is 0.7
+KDnorm for Q is 0.111111111111111
+KDnorm for N is 0.111111111111111
+KDnorm for x is 0.5
+KDnorm for h is 0.144444444444444
+KDnorm for i is 1
+KDnorm for p is 0.322222222222222
+KDnorm for W is 0.4
+KDnorm for t is 0.422222222222222
+KDnorm for L is 0.922222222222222
+KDnorm for V is 0.966666666666667
+KDnorm for F is 0.811111111111111
+KDnorm for P is 0.322222222222222
+KDnorm for l is 0.922222222222222
+KDnorm for X is 0.5
+KDnorm for q is 0.111111111111111
+KDnorm for K is 0.0666666666666667
+KDnorm for A is 0.7
+KDnorm for I is 1
+KDnorm for c is 0.777777777777778
+KDnorm for S is 0.411111111111111
+KDnorm for m is 0.711111111111111
+KDnorm for H is 0.144444444444444
+KDnorm for k is 0.0666666666666667
+KDnorm for M is 0.711111111111111
+KDnorm for G is 0.455555555555555
+KDnorm for R is 0
+KDnorm for g is 0.455555555555555
+KDnorm for v is 0.966666666666667
+KDnorm for w is 0.4
+KDnorm for Y is 0.355555555555556
+KDnorm for C is 0.777777777777778
+KDnorm for E is 0.111111111111111
--- a/fasta_seq_reformat_export.pl
+++ b/fasta_seq_reformat_export.pl
+#!/usr/bin/perl
+
+use FindBin;
+$file_here = "$FindBin::Bin/blah.txt";
+printf $file_here;
+open (BLAH, ">>$file_here") or die "cannot open blah.txt\n";		# file for comments, throughout server code
+
+open (IN, "< $FindBin::Bin/reformat.in") or die "cannot open reformat_in\n";
+@in=<IN>;
+close (IN);
+
+$file_here = "$FindBin::Bin/reformat_out";
+open (OUT, ">$file_here") or die "cannot open reformat_out\n";
+
+$seq_stored	= 'no';
+$skip_seq	= 'no';
+foreach $line (@in) {
+  chomp $line;
+  @words	= split (" ",$line);
+  if ((substr ($line,0,1) eq '>') or ($words[0] eq '//')) {		# new sequence or a // terminator
+    if ($seq_stored eq 'yes') {			# process and write out the previous sequence, if no errors
+      $seqUC	= uc($seq_here);		# uppercase the sequence
+      $seqlen	= length($seqUC);
+      $seq_new	= "";
+      $naa	= 0;
+$n = 1;	# ?? VERY ODD does the loop not fix starting n at 1 ??
+      for ($n==1; $n<=$seqlen; $n++) {
+        $char_here	= substr($seqUC,$n-1,1);
+        if (($char_here ne ' ') and ($char_here ne '*')) {		# ignore blanks and asterisks
+	  if ($char_here =~ /[ACDEFGHIKLMNPQRSTVWY]/) {
+	    $seq_new	.= $char_here;
+	    $naa++;
+#	  } else {		# not happy, but we can get <cr> at end, and easier to take only aa than skip allowed non-aa
+#	    printf BLAH "**** stopping - fasta_seq_reformat - non 20 aa detected for protein ID = $current_id\n";
+#	    $skip_seq	= 'yes';
+	  }
+	}
+      }
+      if ($skip_seq ne 'yes') {
+        if ($naa >= 21) {
+          printf OUT "$current_id\n";		# print the stored id
+          printf OUT "$seq_new\n";
+        } else {					# no aas
+	  printf BLAH "**** stopping - fasta_seq_reformat - shorter than 21 aas for protein ID = $current_id\n";
+        }
+      }
+      $skip_seq		= 'no';
+    }
+    if (substr ($line,0,1) eq '>') {
+      $current_id	= $line;		# store the new seq ID - printer later, if seq is cleared
+      $seq_stored	= 'yes';		# set up for new sequence
+      $seq_here		= '';
+    }
+  } else {					# add on to curr seq, if not blank
+    $len_here	= length($line);
+    if ($len_here != 0) { 
+      $seq_here .= $line;
+    }
+  }
+}
+
+if ($seq_stored eq 'yes') {			# print the last seq if it conforms
+  $seqUC	= uc($seq_here);		# uppercase the sequence
+  $seqlen	= length($seqUC);
+  $seq_new	= "";
+  $naa		= 0;
+$n = 1;	# ?? VERY ODD does the loop not fix starting n at 1 ??
+  for ($n==1; $n<=$seqlen; $n++) {
+    $char_here	= substr($seqUC,$n-1,1);
+    if (($char_here ne ' ') and ($char_here ne '*')) {		# ignore blanks and asterisks
+      if ($char_here =~ /[ACDEFGHIKLMNPQRSTVWY]/) {
+        $seq_new	.= $char_here;
+	$naa++;
+      }
+    }
+  }
+  if ($skip_seq ne 'yes') {
+    if ($naa >= 21) {
+      printf OUT "$current_id\n";			# print the stored id
+      printf OUT "$seq_new\n";
+    } else {					# no aas
+      printf BLAH "**** stopping - fasta_seq_reformat - shorter than 21 aas for protein ID = $current_id\n";
+    }
+  }
+  $skip_seq	= 'no';
+}
+
+close (OUT);
+close (BLAH);
+
+exit;
--- a/multiple_prediction_wrapper_export.sh
+++ b/multiple_prediction_wrapper_export.sh
+#!/bin/bash
+
+FASTA_in=$1
+SCRIPTDIR="/opt/workspace/D3/src/static/protein-sol"
+
+cp $FASTA_in $SCRIPTDIR/reformat.in
+perl $SCRIPTDIR/fasta_seq_reformat_export.pl > $SCRIPTDIR/run.log
+
+mv $SCRIPTDIR/reformat_out $FASTA_in
+
+cp $FASTA_in $SCRIPTDIR/composition.in
+
+perl $SCRIPTDIR/seq_compositions_perc_pipeline_export.pl >> $SCRIPTDIR/run.log
+
+mv $SCRIPTDIR/composition_all.out $SCRIPTDIR/seq_composition.txt
+
+perl $SCRIPTDIR/server_prediction_seq_export.pl >> $SCRIPTDIR/run.log
+
+cp $FASTA_in $SCRIPTDIR/seq_props.in
+
+perl $SCRIPTDIR/seq_props_ALL_export.pl >> $SCRIPTDIR/run.log
+
+mv $SCRIPTDIR/seq_prediction.txt $SCRIPTDIR/seq_prediction_OLD.txt
+perl $SCRIPTDIR/profiles_gather_export.pl > $SCRIPTDIR/run.log
+
+#rm bins.txt reformat.in seq_props.in seq_props.out STYprops.out composition.in seq_prediction_OLD.txt
--- a/profiles_gather_export.pl
+++ b/profiles_gather_export.pl
+#!/usr/bin/perl -w
+
+use FindBin;
+$file_here = "$FindBin::Bin/blah.txt";
+open (BLAH, ">>$file_here");		# file for comments, throughout server code
+
+open (my $PROPS, "<", "$FindBin::Bin/STYprops.out") or die "cannot open STYprops.out profiles_gather\n";
+@props	= <$PROPS>;
+close ($PROPS);
+
+$nseq				= 0;
+foreach $line (@props) {
+  chomp $line;
+  @words			= split (" ",$line);
+  $nwords			= @words;
+
+  if (exists $words[0]) {			# non-null line
+    if ($words[0] eq 'STARTID') {		# new sequence - store the ID
+      $id_here			= $words[1];
+      $nseq++;
+
+      $seqs{$id_here}		= $nseq;
+      $nwin			= 0;
+    } elsif ($words[0] eq 'NONP')	{		# a window line, assume associate with id_here
+      $nwin++;
+      $KD[$nseq][$nwin]		= $words[5];
+      $FI[$nseq][$nwin]		= $words[6];
+      $ent[$nseq][$nwin]	= $words[7];
+      $meanQ[$nseq][$nwin]	= $words[10];
+    } elsif ($words[0] eq 'ENDID') {		# have the profiles
+      if ($words[1] ne $id_here) {
+        printf BLAH "**** STOPPING - IDs mismatch in profiles_gather.pl\n";
+	exit;
+      }
+      $numwin[$nseq]		= $nwin;	# store the number of windows (naa - 20) for this sequence
+    }
+  }
+}
+
+# read in the seq_prediction_OLD.txt file and write out the updated version with profiles added (interstitial)
+
+$file_here = "$FindBin::Bin/seq_prediction_OLD.txt";
+open (OLD, "<$file_here") or die "cannot open seq_prediction_OLD.txt";
+@old	= <OLD>;
+close (OLD);
+$file_here = "$FindBin::Bin/../../tmp/seq_prediction.txt";
+open (NEW, ">$file_here") or die "cannot open seq_prediction.txt";
+
+foreach $line (@old) {
+  chomp $line;
+  @words			= split (",",$line);
+  $nwords			= @words;
+  if (exists $words[0]) {			# non-null line
+    if ($words[0] eq 'SEQUENCE DEVIATIONS') {	# write matched seq profiles after this line
+      printf NEW "$line\n";				# write the SEQUENCE DEVIATIONS line
+      $id_here			= $words[1];
+
+      if (exists $seqs{$id_here}) {
+        $nseq_here		= $seqs{$id_here};
+	$nwin_here		= $numwin[$nseq_here];
+	printf NEW "SEQUENCE PROFILE KyteDoolittle,$id_here";
+	for ($n=1; $n<=$nwin_here; $n++) { printf NEW ",$KD[$nseq_here][$n]"; }
+	printf NEW "\n";
+	printf NEW "SEQUENCE PROFILE Uversky??,$id_here";
+	for ($n=1; $n<=$nwin_here; $n++) { printf NEW ",$FI[$nseq_here][$n]"; }
+	printf NEW "\n";
+	printf NEW "SEQUENCE PROFILE entropy,$id_here";
+	for ($n=1; $n<=$nwin_here; $n++) { printf NEW ",$ent[$nseq_here][$n]"; }
+	printf NEW "\n";
+	printf NEW "SEQUENCE PROFILE charge,$id_here";
+	for ($n=1; $n<=$nwin_here; $n++) { printf NEW ",$meanQ[$nseq_here][$n]"; }
+	printf NEW "\n";
+      } else {
+
+# ?? - sequence unmatched - what to do?
+      }
+    } else {
+      printf NEW "$line\n";			# all other (than SEQUENCE DEVIATIONS) lines write and carry on
+    }
+  } else {					# null line write
+    printf NEW "$line\n";
+  }
+}
+
+close (NEW);
+close (BLAH);
+
+exit;
--- a/seq_composition.txt
+++ b/seq_composition.txt
--- a/seq_compositions_perc_pipeline_export.pl
+++ b/seq_compositions_perc_pipeline_export.pl
+#!/usr/bin/perl -w
+
+use FindBin;
+$file_here = "$FindBin::Bin/blah.txt";
+open (BLAH, ">>$file_here");		# file used for comments throughout server code
+
+$window[1]	= 21;
+$window_half[1]	= 10;
+$window[2]	= 51;
+$window_half[2]	= 25;
+$window_flag[1]	= "WIN21";
+$window_flag[2]	= "WIN51";
+
+@aa_single	= ("A","C","D","E","F","G","H","I","K","L","M","N","P","Q","R","S","T","V","W","Y");
+
+if ((exists $ENV{MEMBRANE_EXCLUDE}) and ($ENV{MEMBRANE_EXCLUDE} eq "YES")) {
+  $mem_exclude	= "yes";
+  $KD_threshold	= 1.6;	# value of 19 win avg (-4.5 to 4.5) KD suggested by KD - we'll use for 21 win
+  printf BLAH "membrane proteins will be excluded, based on KD of 21 aa window > ?? value absent \n";
+} else {
+  $mem_exclude	= "no";
+  $KD_threshold	= 1.6;	# value of 19 win avg (-4.5 to 4.5) KD suggested by KD - we'll use for 21 win, still use even if not excl
+}
+
+# e.g. for excel correlation calc, if we have the abundance/solub data in the ID line
+if ((exists $ENV{WHOLE_SEQ_OUT}) and ($ENV{WHOLE_SEQ_OUT} eq "yes")) {
+  $whole_seq_out	= "yes";
+  printf BLAH "WHOLE_SEQ_OUT set: for WHOLE-SEQ only, output file ID line with whole seq data, in whole_seq_data.txt\n";
+} else {
+  $whole_seq_out	= "no";
+}
+
+
+$file_here = "$FindBin::Bin/composition.in";
+open (COMP_IN, "<$file_here") or die "cannot open composition.in\n";
+$nseq		= 0;
+$first_seq		= 'yes';
+while ($line 		= <COMP_IN>) {
+  chomp $line;
+  @words 	= split ("",$line);
+  @fields	= split (" ",$line);
+  if (exists $words[0]) {
+    if ($words[0] eq ">") {				# new protein ID
+      if ($first_seq eq 'yes') {			# first ID, no seqbuf
+        $seqbuf		= "";
+	$idbuf		= $fields[0];
+	$textbuf	= $line;
+  	$first_seq	= 'no';
+      } else {
+        if ($seqbuf ne "") {				# only process further if non-null seq
+	  $nseq++;
+	  $seqtext[$nseq]		= $textbuf;
+	  $seqid[$nseq]			= $idbuf;
+	  $seqtmp{$seqid[$nseq]}	= $seqbuf;	# store away the sequence
+	}
+	$idbuf		= $fields[0];
+	$textbuf	= $line;
+	$seqbuf		= "";
+      }
+    } else {
+      if ($first_seq eq 'yes') {
+	printf BLAH "did not see first seq ID\n";
+  	exit;
+      }
+      $seqbuf	.= $line;
+    }
+  }
+}
+
+if ($seqbuf ne "") {				# catch the last seq, if non-null
+  $nseq++;
+  $seqtext[$nseq]		= $textbuf;
+  $seqid[$nseq]			= $idbuf;
+  $seqtmp{$seqid[$nseq]}	= $seqbuf;	# store away the sequence
+}
+
+$nseqs			= $nseq;
+close (COMP_IN);
+
+$file_here = "$FindBin::Bin/composition_all.out";
+open (COMP_OUT, ">$file_here") or die "cannot open composition_all.out\n";
+if ($whole_seq_out eq "yes") {
+  $file_here = "$FindBin::Bin/whole_seq_data.txt";
+  open (DATA_OUT, ">$file_here") or die "cannot open whole_seq_data.txt\n";
+}
+
+printf COMP_OUT "\nALL WIN21-MAX, WIN21-MIN, WIN51-MAX, WIN51-MIN data lines for each sequence\n";
+printf COMP_OUT "output header lines flagged by each data type for grep purposes\n\n";
+
+@flags = ('WHOLE-SEQ', 'WIN21-MAX', 'WIN21-MIN', 'WIN51-MAX', 'WIN51-MIN');
+foreach $win_flag (@flags) {
+  printf COMP_OUT "$win_flag, ORF-ID,K-R,D-E,naa,totperc,";
+  printf COMP_OUT "A,C,D,E,F,G,H,I,K,L,";
+  printf COMP_OUT "M,N,P,Q,R,S,T,V,W,Y,";
+  printf COMP_OUT "K+R,D+E,K+R-D-E,K+R+D+E,F+W+Y,pI,";
+  printf COMP_OUT "KyteDoo,abs-charge,FoldIndex,disorder,entropy,betapropensity\n\n";
+  if (($whole_seq_out eq 'yes') and ($win_flag eq 'WHOLE-SEQ')) {
+    printf DATA_OUT "$win_flag, ORF-ID,K-R,D-E,naa,totperc,";
+    printf DATA_OUT "A,C,D,E,F,G,H,I,K,L,";
+    printf DATA_OUT "M,N,P,Q,R,S,T,V,W,Y,";
+    printf DATA_OUT "K+R,D+E,K+R-D-E,K+R+D+E,F+W+Y,pI,";
+    printf DATA_OUT "KyteDoo,abs-charge,FoldIndex,disorder,entropy,betapropensity\n";
+  }
+}
+
+$Kall		= 0;
+$Rall		= 0;
+$Dall		= 0;
+$Eall		= 0;
+$Fall		= 0;
+$Wall		= 0;
+$Yall		= 0;
+$naa_all	= 0;
+$nmem_exclude	= 0;
+$nall_exclude	= 0;
+for ($nseq=1; $nseq<=$nseqs; $nseq++) {
+	$seqhere	= $seqtmp{$seqid[$nseq]};
+	$lentmp		= length $seqhere;
+
+	$KD_max_original = 999;
+
+	@words		= split ("",$seqhere);
+	$naa		= 0;
+	$Kcount		= 0;
+	$Rcount		= 0;
+	$Dcount		= 0;
+	$Ecount		= 0;
+	$Fcount		= 0;
+	$Wcount		= 0;
+	$Ycount		= 0;
+	for ($aa=1; $aa<=20; $aa++) {
+	  $aa_count[$aa-1]	= 0;
+	}
+	@seq_for_pI	= ();			# hopefully this zeroes the array
+	$nseq_for_pI	= 0;
+	for ($n=1; $n<=$lentmp; $n++) {
+	  if ($words[$n-1] ne " ") {
+	    if ($words[$n-1] eq "*") {
+	      if ($naa == 0) {
+	        printf BLAH "STOPPING - first seq char is a * - Plation\n";
+		exit;
+	      }
+	      $phosupd[$naa]	= "P";
+	    } else {
+	      $seq_for_pI[$nseq_for_pI]	= $words[$n-1];
+	      $nseq_for_pI++;
+	      $naa++;
+	      $sequpd[$naa]	= $words[$n-1];
+	      $phosupd[$naa]	= " ";
+	      if ($sequpd[$naa] eq 'K') { $Kcount++; }
+	      if ($sequpd[$naa] eq 'R') { $Rcount++; }
+	      if ($sequpd[$naa] eq 'D') { $Dcount++; }
+	      if ($sequpd[$naa] eq 'E') { $Ecount++; }
+	      if ($sequpd[$naa] eq 'F') { $Fcount++; }
+	      if ($sequpd[$naa] eq 'W') { $Wcount++; }
+	      if ($sequpd[$naa] eq 'Y') { $Ycount++; }
+	      for ($aa=1; $aa<=20; $aa++) {
+	        if ($sequpd[$naa] eq $aa_single[$aa-1]) { $aa_count[$aa-1]++; }
+	      }
+	    }
+	  }
+	}
+
+        $pI_all		= protein_pI (@seq_for_pI);
+	$null_return	= 'no';
+	$KD_all		= protein_KD (@seq_for_pI);
+	if ($KD_all == 999) { $null_return = 'yes'; }
+	$Q_all		= protein_Q (@seq_for_pI);
+	if ($Q_all == 999) { $null_return = 'yes'; }
+	$absQ_all	= abs ($Q_all);
+        $foldind_all	= 2.785*$KD_all - $absQ_all - 1.151;
+	$disorder_all	= protein_disorder (@seq_for_pI);
+	if ($disorder_all == 999) { $null_return = 'yes'; }
+	$ent_all	= protein_ent (@seq_for_pI);
+	if ($ent_all == 999) { $null_return = 'yes'; }
+	$betastrand_all	= protein_betastrand (@seq_for_pI);
+	if ($betastrand_all == 999) { $null_return = 'yes'; }
+
+	$perc_total		= 0;		# work out and output the various quantities for whole sequence
+	for ($aa=1; $aa<=20; $aa++) {
+	  $aa_perc[$aa-1]	= (100*$aa_count[$aa-1]/$naa);
+	  $perc_total		+= $aa_perc[$aa-1];
+	}
+	if ($naa != 0) {
+	  $Kcomp	= $Kcount/$naa;
+	  $Rcomp	= $Rcount/$naa;
+	  $Dcomp	= $Dcount/$naa;
+	  $Ecomp	= $Ecount/$naa;
+	  $Fcomp	= $Fcount/$naa;
+	  $Wcomp	= $Wcount/$naa;
+	  $Ycomp	= $Ycount/$naa;
+	  $KmR_perc	= 100*($Kcomp - $Rcomp);
+	  $DmE_perc	= 100*($Dcomp - $Ecomp);
+	} else {
+	  $Kcomp	= 0;			# not really 0 of course
+	  $Rcomp	= 0;			# also not really 0
+	  $Dcomp	= 0;			# not really 0 of course
+	  $Ecomp	= 0;			# also not really 0
+	  $Fcomp	= 0;
+	  $Wcomp	= 0;
+	  $Ycomp	= 0;
+	  $KmR_perc	= 0;
+	  $DmE_perc	= 0;
+	}
+	$KR_perc	= 100*($Kcomp+$Rcomp);
+	$DE_perc	= 100*($Dcomp+$Ecomp);
+	$FWY_perc	= 100*($Fcomp+$Wcomp+$Ycomp);
+	$KR_DE_perc	= $KR_perc - $DE_perc;
+	$KRplusDE_perc	= $KR_perc + $DE_perc;
+	$Kall		+= $Kcount;
+	$Rall		+= $Rcount;
+	$Dall		+= $Dcount;
+	$Eall		+= $Ecount;
+	$Fall		+= $Fcount;
+	$Wall		+= $Wcount;
+	$Yall		+= $Ycount;
+	$naa_all	+= $naa;
+
+    if ($null_return ne 'no') {
+	  printf COMP_OUT "\nSTOPPING - PROBLEM,$seqid[$nseq],null_return\n";
+	  exit;
+	}
+
+	if ($naa != 0) {			# now for the 21 and 51 aa windows, maxima and minima
+	  for ($nwin=1; $nwin<=2; $nwin++) {	# 1 = 21 and 2 = 51
+	    $window2	= $window_half[$nwin];
+	    $win_flag	= $window_flag[$nwin];
+	    $at_least_one	= "no";		# overwritten if we see a full length window
+
+	    $KR_perc_max 	= -100;
+	    $KR_perc_min 	= 100;
+	    $DE_perc_max 	= -100;
+	    $DE_perc_min 	= 100;
+	    $KR_DE_perc_max 	= -100;
+	    $KR_DE_perc_min 	= 100;
+	    $KRplusDE_perc_max 	= -100;
+	    $KRplusDE_perc_min 	= 100;
+	    $KmR_perc_max	= -100;
+	    $KmR_perc_min	= 100;
+	    $DmE_perc_max	= -100;
+	    $DmE_perc_min	= 100;
+	    $FWY_perc_max	= -100;
+	    $FWY_perc_min	= 100;
+	    for ($aa=1; $aa<=20; $aa++) {
+	      $aa_perc_min[$aa-1]	= 0;
+	      $aa_perc_max[$aa-1]	= 0;
+	    }
+	    $pI_max		= 1;
+	    $pI_min		= 14;
+
+	    $KD_max		=  0;
+	    $KD_min		=  1;
+
+	    $absQ_max		=  0;
+	    $absQ_min		=  1;
+
+	    $foldind_max	= -2;
+	    $foldind_min	=  2;
+
+	    $disorder_max	= -1;
+	    $disorder_min	=  1;
+
+	    $ent_max		=  0;
+	    $ent_min		=  5;
+
+	    $betastrand_max	= -1;
+	    $betastrand_min	=  1;
+
+	    for ($maa=1; $maa<=$naa; $maa++) {
+	      $Kwin		= 0;
+	      $Rwin		= 0;
+	      $Dwin		= 0;
+	      $Ewin		= 0;
+	      $Fwin		= 0;
+	      $Wwin		= 0;
+	      $Ywin		= 0;
+	      for ($aa=1; $aa<=20; $aa++) {
+	        $aa_count_win[$aa-1]	= 0;
+	      }
+
+	      $win_start	= $maa - $window2;
+	      if ($win_start < 1) {$win_start = 1; }
+	      $win_end		= $maa + $window2;
+	      if ($win_end > $naa) {$win_end = $naa; }
+	      @seq_for_pI	= ();			# hopefully this zeroes the array
+	      $nseq_for_pI	= 0;
+	      for ($win=$win_start; $win<=$win_end; $win++) {
+	        if ($sequpd[$win] eq 'K') { $Kwin++; }
+	        if ($sequpd[$win] eq 'R') { $Rwin++; }
+	        if ($sequpd[$win] eq 'D') { $Dwin++; }
+	        if ($sequpd[$win] eq 'E') { $Ewin++; }
+	        if ($sequpd[$win] eq 'F') { $Fwin++; }
+	        if ($sequpd[$win] eq 'W') { $Wwin++; }
+	        if ($sequpd[$win] eq 'Y') { $Ywin++; }
+	        for ($aa=1; $aa<=20; $aa++) {
+	          if ($sequpd[$win] eq $aa_single[$aa-1]) { $aa_count_win[$aa-1]++; }
+	        }
+	        $seq_for_pI[$nseq_for_pI]	= $sequpd[$win];
+	        $nseq_for_pI++;
+	      }
+	      $num_win		= $win_end - $win_start + 1;
+	      if ($num_win == $window[$nwin]) {			# process
+	        $at_least_one		= "yes";		# we have a full length window
+		$perc_total_win		= 0;
+		for ($aa=1; $aa<=20; $aa++) {
+	  	  $aa_perc_win[$aa-1]	= (100*$aa_count_win[$aa-1]/$num_win);
+	  	  $perc_total_win	+= $aa_perc_win[$aa-1];
+		}
+	        $Kcomp_win	= $Kwin/$num_win;
+	        $Rcomp_win	= $Rwin/$num_win;
+	        $Dcomp_win	= $Dwin/$num_win;
+	        $Ecomp_win	= $Ewin/$num_win;
+		$Fcomp_win	= $Fwin/$num_win;
+		$Wcomp_win	= $Wwin/$num_win;
+		$Ycomp_win	= $Ywin/$num_win;
+	        $KmR_perc_win	= 100*($Kcomp_win - $Rcomp_win);
+	        $DmE_perc_win	= 100*($Dcomp_win - $Ecomp_win);
+		$KR_perc_win	= 100*($Kcomp_win+$Rcomp_win);
+		$DE_perc_win	= 100*($Dcomp_win+$Ecomp_win);
+		$KR_DE_perc_win	= $KR_perc_win - $DE_perc_win;
+		$KRplusDE_perc_win	= $KR_perc_win + $DE_perc_win;
+		$FWY_perc_win	= 100*($Fcomp_win+$Wcomp_win+$Ycomp_win);
+
+		if ($KR_perc_win > $KR_perc_max) { $KR_perc_max = $KR_perc_win; }
+		if ($KR_perc_win < $KR_perc_min) { $KR_perc_min = $KR_perc_win; }
+		if ($DE_perc_win > $DE_perc_max) { $DE_perc_max = $DE_perc_win; }
+		if ($DE_perc_win < $DE_perc_min) { $DE_perc_min = $DE_perc_win; }
+		if ($KR_DE_perc_win > $KR_DE_perc_max) { $KR_DE_perc_max = $KR_DE_perc_win; }
+		if ($KR_DE_perc_win < $KR_DE_perc_min) { $KR_DE_perc_min = $KR_DE_perc_win; }
+		if ($KRplusDE_perc_win > $KRplusDE_perc_max) { $KRplusDE_perc_max = $KRplusDE_perc_win; }
+		if ($KRplusDE_perc_win < $KRplusDE_perc_min) { $KRplusDE_perc_min = $KRplusDE_perc_win; }
+		if ($KmR_perc_win > $KmR_perc_max) { $KmR_perc_max = $KmR_perc_win; }
+		if ($KmR_perc_win < $KmR_perc_min) { $KmR_perc_min = $KmR_perc_win; }
+		if ($DmE_perc_win > $DmE_perc_max) { $DmE_perc_max = $DmE_perc_win; }
+		if ($DmE_perc_win < $DmE_perc_min) { $DmE_perc_min = $DmE_perc_win; }
+		if ($FWY_perc_win > $FWY_perc_max) { $FWY_perc_max = $FWY_perc_win; }
+		if ($FWY_perc_win < $FWY_perc_min) { $FWY_perc_min = $FWY_perc_win; }
+
+	        for ($aa=1; $aa<=20; $aa++) {
+	          if ($aa_perc_win[$aa-1] > $aa_perc_max[$aa-1]) { $aa_perc_max[$aa-1] = $aa_perc_win[$aa-1]; }
+	          if ($aa_perc_win[$aa-1] < $aa_perc_min[$aa-1]) { $aa_perc_min[$aa-1] = $aa_perc_win[$aa-1]; }
+	        }
+
+                $pI_win		= protein_pI (@seq_for_pI);
+		if ($pI_win > $pI_max) { $pI_max = $pI_win; }
+		if ($pI_win < $pI_min) { $pI_min = $pI_win; }
+
+		$null_return	= 'no';
+		$KD_win		= protein_KD (@seq_for_pI);
+		if ($KD_win == 999) { $null_return = 'yes'; }
+		$Q_win		= protein_Q (@seq_for_pI);
+		if ($Q_win == 999) { $null_return = 'yes'; }
+		$absQ_win	= abs ($Q_win);
+	        $foldind_win	= 2.785*$KD_win - $absQ_win - 1.151;
+		$disorder_win	= protein_disorder (@seq_for_pI);
+		if ($disorder_win == 999) { $null_return = 'yes'; }
+		$ent_win	= protein_ent (@seq_for_pI);
+		if ($ent_win == 999) { $null_return = 'yes'; }
+		$betastrand_win	= protein_betastrand (@seq_for_pI);
+		if ($betastrand_win == 999) { $null_return = 'yes'; }
+
+		if ($KD_win > $KD_max) {
+		  $KD_max = $KD_win;
+		  $KD_max_original	= (9*$KD_max) - 4.5	# rescale back to KD proper, use if we're exlcuding mem proteins
+		}
+		if ($KD_win < $KD_min) { $KD_min = $KD_win; }
+		if ($absQ_win > $absQ_max) { $absQ_max = $absQ_win; }
+		if ($absQ_win < $absQ_min) { $absQ_min = $absQ_win; }
+		if ($foldind_win > $foldind_max) { $foldind_max = $foldind_win; }
+		if ($foldind_win < $foldind_min) { $foldind_min = $foldind_win; }
+		if ($disorder_win > $disorder_max) { $disorder_max = $disorder_win; }
+		if ($disorder_win < $disorder_min) { $disorder_min = $disorder_win; }
+		if ($ent_win > $ent_max) { $ent_max = $ent_win; }
+		if ($ent_win < $ent_min) { $ent_min = $ent_win; }
+		if ($betastrand_win > $betastrand_max) { $betastrand_max = $betastrand_win; }
+		if ($betastrand_win < $betastrand_min) { $betastrand_min = $betastrand_win; }
+
+	      }
+	    }		# end of this (21 51) window sliding along for this sequence
+
+	    if ($nwin == 1) {			# 21 aa window
+#	      if ($mem_exclude eq "yes") {	# put lower down to give flag pred TM, even if not skipping it
+	        if ($win_flag ne "WIN21") {
+		  printf BLAH "STOPPING - win_flag should be 21 for KD-membrane protein check\n";
+		  exit;
+		}
+	        if ($null_return ne 'no') { goto SKIPPED; }	# skip protein since we cannot KD test a null return
+		if ($at_least_one ne "yes") { goto SKIPPED; }	# skip protein since no window to KD test
+		if ($KD_max_original == 999) {
+		  printf BLAH "STOPPING - KD_max_original unset\n";
+		} else {
+		  if ($KD_max_original > $KD_threshold) {	# any KD looks like mem window, report, count and outta here
+	            if ($mem_exclude eq "yes") {
+		      $nmem_exclude++;
+		      printf BLAH "MEM EXCLUDE - $seqid[$nseq] - with KD, threshold = $KD_max_original $KD_threshold\n";
+		      goto SKIPPED;
+		    } else {
+		      printf BLAH "\nTM REGION PREDICTED for $seqid[$nseq] - with non-normalised KD of ";
+		      printf BLAH "%6.2f", $KD_max_original;
+		      printf BLAH " compared with threshold of $KD_threshold\n";
+		    }
+		  }
+		}
+	      printf COMP_OUT "\n";
+	      printf COMP_OUT "WHOLE-SEQ,$seqid[$nseq],";
+	      printf COMP_OUT "%7.2f,%7.2f,%5.0f,%7.2f", $KmR_perc, $DmE_perc, $naa, $perc_total;
+	      for ($aa=1; $aa<=20; $aa++) { printf COMP_OUT ",%6.2f", $aa_perc[$aa-1]; }
+	      printf COMP_OUT ",%7.2f,%7.2f,%7.2f,%7.2f,%7.2f,%6.2f", $KR_perc, $DE_perc, $KR_DE_perc, $KRplusDE_perc, $FWY_perc, $pI_all;
+	      printf COMP_OUT ",%8.3f,%8.3f,%8.3f,%8.3f, %8.3f, %8.3f", $KD_all, $absQ_all, $foldind_all, $disorder_all, $ent_all, $betastrand_all;
+	      printf COMP_OUT "\n";
+
+	      if ($whole_seq_out eq 'yes') {
+#	        printf DATA_OUT "\n";
+	        printf DATA_OUT "WHOLE-SEQ,$seqtext[$nseq],";		# seqtext (rather than seqid) should be the whole ID line
+	        printf DATA_OUT "%7.2f,%7.2f,%5.0f,%7.2f", $KmR_perc, $DmE_perc, $naa, $perc_total;
+	        for ($aa=1; $aa<=20; $aa++) { printf DATA_OUT ",%6.2f", $aa_perc[$aa-1]; }
+	        printf DATA_OUT ",%7.2f,%7.2f,%7.2f,%7.2f,%7.2f,%6.2f", $KR_perc, $DE_perc, $KR_DE_perc, $KRplusDE_perc, $FWY_perc, $pI_all;
+	        printf DATA_OUT ",%8.3f,%8.3f,%8.3f,%8.3f, %8.3f, %8.3f", $KD_all, $absQ_all, $foldind_all, $disorder_all, $ent_all, $betastrand_all;
+	        printf DATA_OUT "\n";
+	      }
+
+	    }		# end nwin=1 check
+
+	    if ($null_return eq 'no') {		# at present have no reporting of null returns for windows
+	     if ($at_least_one eq "yes") {	# we have at least one full length window
+						# OTHERWISE we simply MISS OUT this write
+						# so NOT ALL SEQS will have writes for both windows
+	      $flag_here	= $win_flag."-MAX";
+	      printf COMP_OUT "$flag_here,$seqid[$nseq],";
+	      printf COMP_OUT "%7.2f,%7.2f,%5.0f,%7.2f", $KmR_perc_max, $DmE_perc_max, $window[$nwin], $perc_total_win;
+	      for ($aa=1; $aa<=20; $aa++) { printf COMP_OUT ",%6.2f", $aa_perc_max[$aa-1]; }
+	      printf COMP_OUT ",%7.2f,%7.2f,%7.2f,%7.2f,%7.2f,%6.2f", $KR_perc_max, $DE_perc_max, $KR_DE_perc_max, $KRplusDE_perc_max, $FWY_perc_max, $pI_max;
+	      printf COMP_OUT ",%8.3f,%8.3f,%8.3f,%8.3f, %8.3f, %8.3f", $KD_max, $absQ_max, $foldind_max, $disorder_max, $ent_max, $betastrand_max;
+	      printf COMP_OUT "\n";
+	      $flag_here	= $win_flag."-MIN";
+	      printf COMP_OUT "$flag_here,$seqid[$nseq],";
+	      printf COMP_OUT "%7.2f,%7.2f,%5.0f,%7.2f", $KmR_perc_min, $DmE_perc_min, $window[$nwin], $perc_total_win;
+	      for ($aa=1; $aa<=20; $aa++) { printf COMP_OUT ",%6.2f", $aa_perc_min[$aa-1]; }
+	      printf COMP_OUT ",%7.2f,%7.2f,%7.2f,%7.2f,%7.2f,%6.2f", $KR_perc_min, $DE_perc_min, $KR_DE_perc_min, $KRplusDE_perc_min, $FWY_perc_min, $pI_min;
+	      printf COMP_OUT ",%8.3f,%8.3f,%8.3f,%8.3f, %8.3f, %8.3f", $KD_min, $absQ_min, $foldind_min, $disorder_min, $ent_min, $betastrand_min;
+	      printf COMP_OUT "\n";
+	     }
+	    }
+	  }	# end nwin = 1, 2 loop
+SKIPPED:
+	  $nall_exclude++;
+	}
+}
+$KmR_all_perc	= 100*(($Kall - $Rall)/$naa_all);	# a few things over the whole sequence set
+$DmE_all_perc	= 100*(($Dall - $Eall)/$naa_all);
+$Kall_perc	= 100*($Kall/$naa_all);
+$Rall_perc	= 100*($Rall/$naa_all);
+$Dall_perc	= 100*($Dall/$naa_all);
+$Eall_perc	= 100*($Eall/$naa_all);
+# $FWYall_perc	= 100*(($Fall + $Wall + $Yall)/$naa_all);
+printf COMP_OUT "\n";
+printf COMP_OUT "SUMS, for this set of seqs, overall K minus R and D minus E percentages = ";
+printf COMP_OUT "%6.3f %6.3f\n\n", $KmR_all_perc, $DmE_all_perc;
+printf COMP_OUT "SUMS, for this set of seqs, overall K R D E percentages = ";
+printf COMP_OUT "%6.4f %6.4f %6.4f %6.4f\n", $Kall_perc, $Rall_perc, $Dall_perc, $Eall_perc;
+if ($mem_exclude eq "yes") {
+  printf COMP_OUT "\n\nMEMBRANE_EXCLUDE YES set with nmem_excluded = $nmem_exclude and nall_excluded = $nall_exclude\n";
+  printf BLAH "\n\nMEMBRANE_EXCLUDE yes set with nmem_excluded = $nmem_exclude and nall_excluded = $nall_exclude\n";
+  if ($whole_seq_out eq 'yes') {
+    printf DATA_OUT "\n\nMEMBRANE_EXCLUDE YES set with nmem_excluded = $nmem_exclude and nall_excluded = $nall_exclude\n";
+  }
+}
+
+close (COMP_OUT);
+if ($whole_seq_out eq 'yes') {
+  close (DATA_OUT);
+}
+
+close (BLAH);
+
+exit;
+
+
+#-------------------------------------------------------------------------
+#----------------------
+# SUBROUTINE protein_pI
+#----------------------
+# pass a list of single letter amino acid codes, return the pI (pH at
+# neutral charge) for this protein
+
+sub protein_pI {
+
+	my @aa_seq;
+	my $aa;
+	my %aapK;
+	my %aafactor;
+	my %aanum;
+	my @aa_keys;
+	my $pI;
+	my $protein_charge;
+	my $protein_charge_last;
+	my $fac;
+	my $pH;
+	my $pH_dec;
+	my $loop;
+	my $pH_pK;
+	my $charge_inc;
+
+# put the passed sequence into the aa_seq array
+
+	@aa_seq = @_;
+
+# data initialisation --------------------------------------------------
+# set up the ionisable group pKs, note that we are leaving out Cys
+# (assume that cysteines are oxidised in disulphides)
+# phosphates (X) are -1 always and -1 to -2 with pK 7
+
+	%aapK = (
+ 	"D" =>  4.0,
+	"E" =>  4.4,
+ 	"H" =>  6.3,
+ 	"K" => 10.4,
+ 	"R" => 12.0,
+	"X" =>  7.0,
+	"Y" => 10.1);
+
+# set up the -1 or +1 factor that gives us the charge sign (acid=-,
+# base=+), and is used to get the right sign of (pH-pK)
+
+	%aafactor = (
+ 	"D" =>  -1.0,
+	"E" =>  -1.0,
+ 	"H" =>   1.0,
+ 	"K" =>   1.0,
+ 	"R" =>   1.0,
+	"X" =>  -1.0,
+	"Y" =>  -1.0);
+
+# initialise counter of amino acid types (for those with pKs of interest)
+
+	%aanum = (
+ 	"D" =>  0,
+	"E" =>  0,
+ 	"H" =>  0,
+ 	"K" =>  0,
+ 	"R" =>  0,
+	"X" =>  0,
+	"Y" =>  0);
+# -----------------------------------------------------------------------
+
+# add the number of each ionisable aa type of interest
+
+	my $ionisable_aas = 0;
+	foreach $aa (@aa_seq) {
+		if (exists $aapK{$aa}) {
+			$aanum{$aa}++;
+			$ionisable_aas++;
+		}
+	}
+	@aa_keys = keys %aanum;
+
+# for any ionisable group, using the aafactor of -1 or +1 as set above,
+# the charge at pH is:	(aafactor) x { 1/(1+10**[(aafactor)(pH-pK)]) }
+# and the overall protein charge is the sum of such over ionisable groups
+
+# our strategy is to come down from a high pH, look for charge sign
+# swap (- to +), and then linearly interpolate, (as in the 2CPS notes)
+
+	$loop = 0;
+	$pH = 14;
+	$pH_dec = 0.5;
+	$protein_charge = -100;		# dummy value, calculate as we go
+	while (($protein_charge < 0) and ($pH > 0.9)) {
+		$loop++;
+		$protein_charge_last = $protein_charge;
+		$protein_charge = 0.0;
+		foreach $aa (@aa_keys) {
+			if ($aanum{$aa} > 0) {
+				$fac = $aafactor{$aa};
+				$pH_pK = $pH - $aapK{$aa};
+				if ($aa eq "X") {
+					$charge_inc = -1.0 + $fac * (1/(1+10**($fac*$pH_pK)));
+				} else {
+					$charge_inc = $fac * (1/(1+10**($fac*$pH_pK)));
+				}
+				$protein_charge += $aanum{$aa}*$charge_inc;
+			}
+		}
+		$pH -= $pH_dec;	# i.e. $pH - $pH-$pH_dec
+	}
+
+# There are various conditions to look out for and signal on return:
+# 	with no ionisable groups, neutral at all pH:	return 999 - NO, just return pH 7.0
+#	if pI > 14 (e.g. only basic aas):		return 14
+#	if pI < 1 (e.g. only acidic aas):		return 1
+
+	if ($ionisable_aas == 0) {
+#		$pI = 999;
+		$pI = 7;
+	} elsif ($loop == 1) {
+		$pI= 14;
+	} elsif ($loop == 27) {
+		$pI = 1;
+	} else {
+		$pI = ($pH+$pH_dec) + $pH_dec*($protein_charge/($protein_charge-$protein_charge_last));
+	}
+
+$pH		= 8.75;		# add a specific charge calculation for selected pH
+$charge_here	= 0;
+foreach $aa (@aa_keys) {
+	if ($aanum{$aa} > 0) {
+		$fac = $aafactor{$aa};
+		$pH_pK = $pH - $aapK{$aa};
+		if ($aa eq "X") {
+			$charge_inc = -1.0 + $fac * (1/(1+10**($fac*$pH_pK)));
+		} else {
+			$charge_inc = $fac * (1/(1+10**($fac*$pH_pK)));
+		}
+		$charge_here += $aanum{$aa}*$charge_inc;
+	}
+}
+# printf OUT "\n$pdbwords[0] $pH ";
+# printf OUT "%7.3f\n", $charge_here;
+
+	return $pI;
+}
+
+#-------------------------------------------------------------------------
+
+#-------------------------------------------------------------------------
+#----------------------
+# SUBROUTINE protein_KD
+#----------------------
+# pass a list of single letter amino acid codes, return the pI (pH at
+# neutral charge) for this protein
+
+sub protein_KD {
+
+#	KD hash is Kyte-Doolittle hydropathy index : JMB (1982) 157:105
+#	for FOld Index, scale/norm following Uversky : Proteins (2000) 41:415, (-4.5,+4.5) to (0,1)
+#	use KDnorm to get KDnorm_mean = sum(KDnorm) / winlen
+
+$KD {'A'}	=  1.8;
+$KD {'C'}	=  2.5;
+$KD {'D'}	= -3.5;
+$KD {'E'}	= -3.5;
+$KD {'F'}	=  2.8;
+$KD {'G'}	= -0.4;
+$KD {'H'}	= -3.2;
+$KD {'I'}	=  4.5;
+$KD {'K'}	= -3.9;
+$KD {'L'}	=  3.8;
+$KD {'M'}	=  1.9;
+$KD {'N'}	= -3.5;
+$KD {'P'}	= -1.6;
+$KD {'Q'}	= -3.5;
+$KD {'R'}	= -4.5;
+$KD {'S'}	= -0.8;
+$KD {'T'}	= -0.7;
+$KD {'V'}	=  4.2;
+$KD {'W'}	= -0.9;
+$KD {'Y'}	= -1.3;
+$KD {'X'}	=  0.0;
+
+$KD {'a'}	=  1.8;
+$KD {'c'}	=  2.5;
+$KD {'d'}	= -3.5;
+$KD {'e'}	= -3.5;
+$KD {'f'}	=  2.8;
+$KD {'g'}	= -0.4;
+$KD {'h'}	= -3.2;
+$KD {'i'}	=  4.5;
+$KD {'k'}	= -3.9;
+$KD {'l'}	=  3.8;
+$KD {'m'}	=  1.9;
+$KD {'n'}	= -3.5;
+$KD {'p'}	= -1.6;
+$KD {'q'}	= -3.5;
+$KD {'r'}	= -4.5;
+$KD {'s'}	= -0.8;
+$KD {'t'}	= -0.7;
+$KD {'v'}	=  4.2;
+$KD {'w'}	= -0.9;
+$KD {'y'}	= -1.3;
+$KD {'x'}	=  0.0;
+
+# Normalise the KD hydropathy values.
+
+  @KD_keys	= keys %KD;
+  foreach $keyhere (@KD_keys) {
+    $KDtmp		= $KD{$keyhere} + 4.5;
+    $KDnorm{$keyhere}	= $KDtmp / 9.0;
+#    printf BLAH "KDnorm for $keyhere is $KDnorm{$keyhere}\n"; 
+  }
+
+# put the passed sequence into the aa_seq array
+
+  @aa_seq = @_;
+
+  $KDnorm_tot	= 0;
+  $n_KD_residues	= 0;
+  foreach $aa (@aa_seq) {
+    if (exists $KD{$aa}) {
+      $KDnorm_tot	= $KDnorm_tot + $KDnorm{$aa};
+      $n_KD_residues++;
+    }
+  }
+  if ($n_KD_residues != 0) {
+    $KDnorm_average	= $KDnorm_tot / $n_KD_residues;
+  } else {
+    $KDnorm_average	= 999;		# no residues, value unset, pick up in parent code
+  }
+
+return $KDnorm_average;
+
+}
+
+#-------------------------------------------------------------------------
+#----------------------
+# SUBROUTINE protein_Q
+#----------------------
+# pass a list of single letter amino acid codes, return the net charge per residue
+
+# ?? his_charged ENV variable can be used to set his to +1 rather than zero, not used currently ??
+
+sub protein_Q {
+
+$chr {'A'}	=  0.0;
+$chr {'C'}	=  0.0;
+$chr {'D'}	= -1.0;
+$chr {'E'}	= -1.0;
+$chr {'F'}	=  0.0;
+$chr {'G'}	=  0.0;
+$chr {'H'}	=  0.0;
+$chr {'I'}	=  0.0;
+$chr {'K'}	=  1.0;
+$chr {'L'}	=  0.0;
+$chr {'M'}	=  0.0;
+$chr {'N'}	=  0.0;
+$chr {'P'}	=  0.0;
+$chr {'Q'}	=  0.0;
+$chr {'R'}	=  1.0;
+$chr {'S'}	=  0.0;
+$chr {'T'}	=  0.0;
+$chr {'V'}	=  0.0;
+$chr {'W'}	=  0.0;
+$chr {'Y'}	=  0.0;
+$chr {'X'}	=  0.0;
+
+$chr {'a'}	=  0.0;
+$chr {'c'}	=  0.0;
+$chr {'d'}	= -1.0;
+$chr {'e'}	= -1.0;
+$chr {'f'}	=  0.0;
+$chr {'g'}	=  0.0;
+$chr {'h'}	=  0.0;
+$chr {'i'}	=  0.0;
+$chr {'k'}	=  1.0;
+$chr {'l'}	=  0.0;
+$chr {'m'}	=  0.0;
+$chr {'n'}	=  0.0;
+$chr {'p'}	=  0.0;
+$chr {'q'}	=  0.0;
+$chr {'r'}	=  1.0;
+$chr {'s'}	=  0.0;
+$chr {'t'}	=  0.0;
+$chr {'v'}	=  0.0;
+$chr {'w'}	=  0.0;
+$chr {'y'}	=  0.0;
+$chr {'x'}	=  0.0;
+
+if ((exists $ENV{his_charged}) and ($ENV{his_charged} eq 'yes')) {
+  $chr {'H'}	=  1.0;
+  $chr {'h'}	=  1.0;
+}
+
+# put the passed sequence into the aa_seq array
+
+  @aa_seq = @_;
+
+  $Q_tot		= 0;
+  $n_Q_residues		= 0;
+  foreach $aa (@aa_seq) {
+    if (exists $chr{$aa}) {
+      $Q_tot	= $Q_tot + $chr{$aa};
+      $n_Q_residues++;
+    }
+  }
+  if ($n_Q_residues == 0) {
+    $Q_mean		= 999;
+  } else {
+    $Q_mean		= $Q_tot / $n_Q_residues;
+  }
+  
+  return $Q_mean;
+}
+
+
+#-------------------------------------------------------------------------
+#----------------------------
+# SUBROUTINE protein_disorder
+#----------------------------
+# pass a list of single letter amino acid codes, return the disorder calculation
+
+sub protein_disorder {
+
+# dis hash is Rune and Linding disorder propensity : GlobPlot : NAR (2003) 31:3701
+
+$dis {'A'}	= -0.26154;
+$dis {'C'}	= -0.01515;
+$dis {'D'}	=  0.22763;
+$dis {'E'}	= -0.20469;
+$dis {'F'}	= -0.22557;
+$dis {'G'}	=  0.43323;
+$dis {'H'}	= -0.00122;
+$dis {'I'}	= -0.42224;
+$dis {'K'}	= -0.10009;
+$dis {'L'}	= -0.33793;
+$dis {'M'}	= -0.22590;
+$dis {'N'}	=  0.22989;
+$dis {'P'}	=  0.55232;
+$dis {'Q'}	= -0.18768;
+$dis {'R'}	= -0.17659;
+$dis {'S'}	=  0.14288;
+$dis {'T'}	=  0.00888;
+$dis {'V'}	= -0.38618;
+$dis {'W'}	= -0.24338;
+$dis {'Y'}	= -0.20751;
+$dis {'X'}	=  0.0;
+
+$dis {'a'}	= -0.26154;
+$dis {'c'}	= -0.01515;
+$dis {'d'}	=  0.22763;
+$dis {'e'}	= -0.20469;
+$dis {'f'}	= -0.22557;
+$dis {'g'}	=  0.43323;
+$dis {'h'}	= -0.00122;
+$dis {'i'}	= -0.42224;
+$dis {'k'}	= -0.10009;
+$dis {'l'}	= -0.33793;
+$dis {'m'}	= -0.22590;
+$dis {'n'}	=  0.22989;
+$dis {'p'}	=  0.55232;
+$dis {'q'}	= -0.18768;
+$dis {'r'}	= -0.17659;
+$dis {'s'}	=  0.14288;
+$dis {'t'}	=  0.00888;
+$dis {'v'}	= -0.38618;
+$dis {'w'}	= -0.24338;
+$dis {'y'}	= -0.20751;
+$dis {'x'}	=  0.0;
+
+# put the passed sequence into the aa_seq array
+
+  @aa_seq = @_;
+
+  $dis_tot	= 0;
+  $ndis_tot	= 0;
+  foreach $aa (@aa_seq) {
+    if (exists $dis{$aa}) {
+      $dis_tot += $dis{$aa};
+      $ndis_tot++;
+    }
+  }
+
+  if ($ndis_tot == 0) {
+    $dis_mean		= 999;
+  } else {
+    $dis_mean		= $dis_tot / $ndis_tot;
+  }
+  
+  return $dis_mean;
+}
+
+
+#-------------------------------------------------------------------------
+#-----------------------
+# SUBROUTINE protein_ent
+#-----------------------
+# pass a list of single letter amino acid codes, return the pI (pH at
+# neutral charge) for this protein
+
+sub protein_ent {
+
+@aa_single_here	= ("A","C","D","E","F","G","H","I","K","L","M","N","P","Q","R","S","T","V","W","Y");
+
+# put the passed sequence into the aa_seq array
+
+  @aa_seq = @_;
+
+  for $aa (@aa_single_here) {
+    $aa_count_here{$aa}	= 0;
+  }
+
+  $n_ent		= 0;
+  foreach $aa (@aa_seq) {
+    $AA_UC		= uc ($aa);			# should be upper case already, but only want to count 20 aas, not 40
+    if (exists $aa_count_here{$AA_UC}) {
+      $aa_count_here{$AA_UC}++;
+      $n_ent++;
+    }
+  }
+
+  if ($n_ent == 0) {
+    $ent_sum		= 999;
+  } else {
+    $ent_sum		= 0;
+    $frac_sum		= 0;
+    for $aa (@aa_single_here) {
+      if ($aa_count_here{$aa} != 0) {
+        $frac_here	= $aa_count_here{$aa} / $n_ent;
+        $incr_here	= -$frac_here * (log($frac_here) / log(2.0));
+        $ent_sum		+= $incr_here;
+        $frac_sum		+= $frac_here;
+      }
+    }
+  }
+
+  return $ent_sum;
+}
+
+
+#-------------------------------------------------------------------------
+#------------------------------
+# SUBROUTINE protein_betastrand
+#------------------------------
+# just returning the mean beta propensity - ?? could do more with 3 states
+
+sub protein_betastrand {
+
+# a very simple Chou-Fasman type ss prediction - see Costantini A et al (2006) BBRC 342:441-451
+
+# read in the propensities
+
+$sp92		= ' ';
+for ($s=1; $s<=91; $s++) { $sp92 .= ' '; }
+$sp81		= ' ';
+for ($s=1; $s<=80; $s++) { $sp81 .= ' '; }
+
+# no need for time_stamp - this is data, for read only
+
+$file_here = "$FindBin::Bin/ss_propensities.txt";
+open (PROPENSITIES, "<$file_here") or die "cannot open $file_here\n";
+$naa_here		= 0;
+while ($line 		= <PROPENSITIES>) {
+	chomp $line;
+	my @words 	= split (" ",$line);
+	if (exists $words[0]) {
+	    if ($words[0] eq 'data') {			# amino acid data line
+		$naa_here++;
+		$aa		= $words[2];
+		$p_alpha{$aa}	= $words[4];
+		$p_beta{$aa}	= $words[5];
+		$p_coil{$aa}	= $words[6];
+	    }
+	}
+}
+# printf BLAH "propensities read in for $naa_here amino acids\n";
+close (PROPENSITIES);
+
+# put the passed sequence into the aa_seq array
+
+  @aa_seq = @_;
+
+  $sum_alpha	= 0;
+  $sum_beta	= 0;
+  $sum_coil	= 0;
+  $nseen	= 0;
+  foreach $aa (@aa_seq) {
+    if (exists $p_alpha{$aa}) {
+      $sum_alpha	+= $p_alpha{$aa};
+      $sum_beta		+= $p_beta{$aa};
+      $sum_coil		+= $p_coil{$aa};
+      $nseen++;
+    }
+  }
+  if ($nseen == 0) {
+    $beta_mean		= 999;
+  } else {
+    $beta_mean		= $sum_beta / $nseen;
+  }
+
+  return $beta_mean;
+}
+
--- a/seq_props_ALL_export.pl
+++ b/seq_props_ALL_export.pl
+#!/usr/bin/perl -w
+
+use FindBin;
+$file_here = "$FindBin::Bin/blah.txt";
+open (BLAH, ">>$file_here");	# file for comments, throughout server code
+
+$winlenP	= 21;		# window for calculation of all props except searching for charge border
+$winlenP2	= 10;		# = (winlenP-1)/2
+if ((2*$winlenP2+1) != $winlenP) {
+	printf BLAH "window or half window wrong\n";
+	exit;
+}
+
+# winlenB/2 used to search the charge border props around STY sites
+$winlenB	=  5;		#
+$winlenB2	=  2;		# 
+if ((2*$winlenB2+1) != $winlenB) {
+	printf BLAH "window or half window wrong\n";
+	exit;
+}
+
+$nbins_fi	= 30;
+$nbins_dis	= $nbins_fi;
+
+# env variable option to neutralise around STY in the charge border analysis ie test importance of Plation motif
+
+if (exists $ENV{motifneutral}) {
+  $bord_neutralise	= $ENV{motifneutral};
+} else {
+  $bord_neutralise	= "no";
+}
+# $len_neutral		= 5;	# not used currently
+$len_neutral2		= 2;
+
+# env variable to plot vertical lists of properties for each seq, for subsequent data handling/plotting
+
+if (exists $ENV{extra_output}) {
+  $extra		= $ENV{extra_output};
+} else {
+  $extra		= "no";
+}
+
+if ((exists $ENV{his_charged}) and ($ENV{his_charged} eq 'yes')) {
+  printf BLAH "\nHIS will be set at charge +1\n\n";
+} else {
+  printf BLAH "\nHIS will be set at charge 0\n\n";
+}
+printf BLAH "column o/p file for each sequence = $extra - set with extra_output env variable\n\n";
+
+# List the amino acids, for seq entropy (will convert to uc here at least)
+
+@aa_single	= ("A","C","D","E","F","G","H","I","K","L","M","N","P","Q","R","S","T","V","W","Y");
+
+$dis {'A'}	= -0.26154;
+$dis {'C'}	= -0.01515;
+$dis {'D'}	=  0.22763;
+$dis {'E'}	= -0.20469;
+$dis {'F'}	= -0.22557;
+$dis {'G'}	=  0.43323;
+$dis {'H'}	= -0.00122;
+$dis {'I'}	= -0.42224;
+$dis {'K'}	= -0.10009;
+$dis {'L'}	= -0.33793;
+$dis {'M'}	= -0.22590;
+$dis {'N'}	=  0.22989;
+$dis {'P'}	=  0.55232;
+$dis {'Q'}	= -0.18768;
+$dis {'R'}	= -0.17659;
+$dis {'S'}	=  0.14288;
+$dis {'T'}	=  0.00888;
+$dis {'V'}	= -0.38618;
+$dis {'W'}	= -0.24338;
+$dis {'Y'}	= -0.20751;
+$dis {'X'}	=  0.0;
+
+$chr {'A'}	=  0.0;
+$chr {'C'}	=  0.0;
+$chr {'D'}	= -1.0;
+$chr {'E'}	= -1.0;
+$chr {'F'}	=  0.0;
+$chr {'G'}	=  0.0;
+$chr {'H'}	=  0.0;
+$chr {'I'}	=  0.0;
+$chr {'K'}	=  1.0;
+$chr {'L'}	=  0.0;
+$chr {'M'}	=  0.0;
+$chr {'N'}	=  0.0;
+$chr {'P'}	=  0.0;
+$chr {'Q'}	=  0.0;
+$chr {'R'}	=  1.0;
+$chr {'S'}	=  0.0;
+$chr {'T'}	=  0.0;
+$chr {'V'}	=  0.0;
+$chr {'W'}	=  0.0;
+$chr {'Y'}	=  0.0;
+$chr {'X'}	=  0.0;
+
+$dis {'a'}	= -0.26154;
+$dis {'c'}	= -0.01515;
+$dis {'d'}	=  0.22763;
+$dis {'e'}	= -0.20469;
+$dis {'f'}	= -0.22557;
+$dis {'g'}	=  0.43323;
+$dis {'h'}	= -0.00122;
+$dis {'i'}	= -0.42224;
+$dis {'k'}	= -0.10009;
+$dis {'l'}	= -0.33793;
+$dis {'m'}	= -0.22590;
+$dis {'n'}	=  0.22989;
+$dis {'p'}	=  0.55232;
+$dis {'q'}	= -0.18768;
+$dis {'r'}	= -0.17659;
+$dis {'s'}	=  0.14288;
+$dis {'t'}	=  0.00888;
+$dis {'v'}	= -0.38618;
+$dis {'w'}	= -0.24338;
+$dis {'y'}	= -0.20751;
+$dis {'x'}	=  0.0;
+
+$chr {'a'}	=  0.0;
+$chr {'c'}	=  0.0;
+$chr {'d'}	= -1.0;
+$chr {'e'}	= -1.0;
+$chr {'f'}	=  0.0;
+$chr {'g'}	=  0.0;
+$chr {'h'}	=  0.0;
+$chr {'i'}	=  0.0;
+$chr {'k'}	=  1.0;
+$chr {'l'}	=  0.0;
+$chr {'m'}	=  0.0;
+$chr {'n'}	=  0.0;
+$chr {'p'}	=  0.0;
+$chr {'q'}	=  0.0;
+$chr {'r'}	=  1.0;
+$chr {'s'}	=  0.0;
+$chr {'t'}	=  0.0;
+$chr {'v'}	=  0.0;
+$chr {'w'}	=  0.0;
+$chr {'y'}	=  0.0;
+$chr {'x'}	=  0.0;
+
+if ((exists $ENV{his_charged}) and ($ENV{his_charged} eq 'yes')) {
+  $chr {'H'}	=  1.0;
+  $chr {'h'}	=  1.0;
+}
+
+$KD {'A'}	=  1.8;
+$KD {'C'}	=  2.5;
+$KD {'D'}	= -3.5;
+$KD {'E'}	= -3.5;
+$KD {'F'}	=  2.8;
+$KD {'G'}	= -0.4;
+$KD {'H'}	= -3.2;
+$KD {'I'}	=  4.5;
+$KD {'K'}	= -3.9;
+$KD {'L'}	=  3.8;
+$KD {'M'}	=  1.9;
+$KD {'N'}	= -3.5;
+$KD {'P'}	= -1.6;
+$KD {'Q'}	= -3.5;
+$KD {'R'}	= -4.5;
+$KD {'S'}	= -0.8;
+$KD {'T'}	= -0.7;
+$KD {'V'}	=  4.2;
+$KD {'W'}	= -0.9;
+$KD {'Y'}	= -1.3;
+$KD {'X'}	=  0.0;
+
+$KD {'a'}	=  1.8;
+$KD {'c'}	=  2.5;
+$KD {'d'}	= -3.5;
+$KD {'e'}	= -3.5;
+$KD {'f'}	=  2.8;
+$KD {'g'}	= -0.4;
+$KD {'h'}	= -3.2;
+$KD {'i'}	=  4.5;
+$KD {'k'}	= -3.9;
+$KD {'l'}	=  3.8;
+$KD {'m'}	=  1.9;
+$KD {'n'}	= -3.5;
+$KD {'p'}	= -1.6;
+$KD {'q'}	= -3.5;
+$KD {'r'}	= -4.5;
+$KD {'s'}	= -0.8;
+$KD {'t'}	= -0.7;
+$KD {'v'}	=  4.2;
+$KD {'w'}	= -0.9;
+$KD {'y'}	= -1.3;
+$KD {'x'}	=  0.0;
+
+# Normalise the KD hydropathy values.
+
+@KD_keys	= keys %KD;
+foreach $keyhere (@KD_keys) {
+  $KDtmp		= $KD{$keyhere} + 4.5;
+  $KDnorm{$keyhere}	= $KDtmp / 9.0;
+  printf BLAH "KDnorm for $keyhere is $KDnorm{$keyhere}\n";
+}
+
+# get the sequences
+
+$file_here = "$FindBin::Bin/seq_props.in";
+open (SEQS_IN, "$file_here") or die "cannot open seq_props.in\n";
+my $nseq		= 0;
+while ($line 		= <SEQS_IN>) {
+	chomp $line;
+	my @words 	= split ("",$line);
+	if (exists $words[0]) {
+	    if ($words[0] eq ">") {			# new protein
+		$nseq++;
+		$seqid[$nseq]	= $line;
+		$seqtmp{$seqid[$nseq]}	= "";
+	    }
+	    else {
+		if ($nseq == 0) {
+			printf BLAH "did not see first seq id\n";
+			exit;
+		}
+		$seqtmp{$seqid[$nseq]}	.= $line;
+	    }
+	}
+}
+$nseqs			= $nseq;
+close (SEQS_IN);
+
+for ($nseq=1; $nseq<=$nseqs; $nseq++) {		# now put * Plations into flags
+	$seqhere	= $seqtmp{$seqid[$nseq]};
+	$lentmp		= length $seqhere;
+	@words		= split ("",$seqhere);
+	$naa		= 0;
+	for ($n=1; $n<=$lentmp; $n++) {
+	  if ($words[$n-1] ne " ") {
+	    if ($words[$n-1] eq "*") {
+	      if ($naa == 0) {
+	        printf BLAH "STOPPING - first seq char is a * - Plation\n";
+		exit;
+	      }
+	      $phosupd[$naa]	= "P";
+	    } else {
+	      $naa++;
+	      $sequpd[$naa]	= $words[$n-1];
+	      $phosupd[$naa]	= " ";
+	    }
+	  }
+	}
+	$seqstr		= "";
+	$phostr		= "";
+	for ($n=1; $n<=$naa; $n++) {
+	  $seqstr	.= $sequpd[$n];
+	  $phostr	.= $phosupd[$n];
+	}
+	$seq{$seqid[$nseq]}	= $seqstr;
+	$pho{$seqid[$nseq]}	= $phostr;
+}
+
+# calculate the P (by motif) and Order (by windowed propensity) values.
+# also calculate the overall hydropathy and net charge values, per aa, output with overall seq info
+
+$naasTOT		= 0;
+$ndisorderTOT		= 0;
+
+open (PSEQS, ">", "$FindBin::Bin/STYprops.out") or die "cannot open STYprops.out\n";
+#printf PSEQS "P/unP   dis   netQ  fracQ  meanQ     KD  FldInd    ent  chk1  bord\n";
+
+for ($h=1; $h<=$nbins_dis; $h++) {
+  $n_phos[$h]		= 0;
+  $n_unphos[$h]		= 0;
+  $net_phos[$h]		= 0;
+  $net_unphos[$h]	= 0;
+  $frac_phos[$h]	= 0;
+  $frac_unphos[$h]	= 0;
+  $n_nf_phos[$h]	= 0;		# n_nf counters where we have complete windows around STY
+  $n_nf_unphos[$h]	= 0;
+  $nbord_phos[$h]	= 0;
+  $nbord_unphos[$h]	= 0;
+  $meanQ_phos[$h]	= 0;
+  $meanQ_unphos[$h]	= 0;
+  $KDnorm_phos[$h]	= 0;
+  $KDnorm_unphos[$h]	= 0;
+  $ent_phos[$h]		= 0;
+  $ent_unphos[$h]	= 0;
+}
+for ($hfi=1; $hfi<=$nbins_fi; $hfi++) {
+  $nfi_phos[$hfi]		= 0;
+  $nfi_unphos[$hfi]		= 0;
+  $nfi_nf_phos[$h]		= 0;
+  $nfi_nf_unphos[$h]		= 0;
+  $nbordfi_phos[$hfi]		= 0;
+  $nbordfi_unphos[$hfi]		= 0;
+  $entfi_phos[$hfi]		= 0;
+  $entfi_unphos[$hfi]		= 0;
+
+  $nFQ_nf_phos[$h]		= 0;
+  $nFQ_nf_unphos[$h]		= 0;
+  $nbordFQ_phos[$hfi]		= 0;
+  $nbordFQ_unphos[$hfi]		= 0;
+  $nMQ_nf_phos[$h]		= 0;
+  $nMQ_nf_unphos[$h]		= 0;
+  $nbordMQ_phos[$hfi]		= 0;
+  $nbordMQ_unphos[$hfi]		= 0;
+  $nKD_nf_phos[$h]		= 0;
+  $nKD_nf_unphos[$h]		= 0;
+  $nbordKD_phos[$hfi]		= 0;
+  $nbordKD_unphos[$hfi]		= 0;
+}
+for ($nseq=1; $nseq<=$nseqs; $nseq++) {				# OPEN SEQUENCES LOOP
+	printf PSEQS "STARTID $seqid[$nseq]\n";
+	printf PSEQS "P/unP    dis    netQ   fracQ   meanQ      KD   FldInd     ent   chk1   bord meanQ_PM\n";
+
+	$seqhere	= $seq{$seqid[$nseq]};
+	$phohere	= $pho{$seqid[$nseq]};
+	$seqlen[$nseq]	= length $seqhere;
+	@words		= split ("",$seqhere);
+	@wordsP		= split ("",$phohere);
+
+	$propcharhere	= "";
+	$signhere	= "";
+#	$chrhere	= "";
+	$chrhereA	= "";
+	$chrhereB	= "";
+	$ndisorder	= 0;
+
+	$KDnorm_tot	= 0;
+	$chr_tot	= 0;
+	$dis_tot	= 0;
+	$ndis_tot	= 0;
+	for $key (@aa_single) {
+		$aa_count{$key}	= 0;
+	}
+
+#	$motifcharhere	= "";
+#	$skip		= 0;
+#	$nmotifs	= 0;
+	for ($naa=1; $naa<=$seqlen[$nseq]; $naa++) {		# OPEN AAs in SEQs LOOP
+		$naasTOT++;
+
+		if (exists $dis{$words[$naa-1]}) {
+		  $dis_tot += $dis{$words[$naa-1]};
+		  $ndis_tot++;
+		}
+#boxing - loop below added for debugging
+	      if (exists $KDnorm{$words[$naa-1]}) {
+		$KDnorm_tot	= $KDnorm_tot + $KDnorm{$words[$naa-1]};
+		$chr_tot	= $chr_tot + $chr{$words[$naa-1]};
+		$aa_lc		= $words[$naa-1];
+		$AA_UC		= uc ($aa_lc);
+		if (exists $aa_count{$AA_UC}) {
+		  $aa_count{$AA_UC}++;
+		} else {
+#		  printf BLAH "STOPPING AA (UC) not known in entropy count\n";
+		}
+
+	      } else {
+		printf BLAH " KDnorm PROBLEM naa words = $naa $words[$naa-1]\n";
+	      }
+
+		$propsum		= 0;
+		$nprop			= 0;
+		$wlow			= $naa - $winlenP2;
+		$wtop			= $naa + $winlenP2;
+
+		$chrsum			= 0;
+
+		for ($nw=$wlow; $nw<=$wtop; $nw++){		# second the disorder
+			if (($nw>=1) and ($nw<=$seqlen[$nseq])) {
+			    if (exists $dis{$words[$nw-1]}) {
+				$propsum += $dis{$words[$nw-1]};
+				$nprop++;
+			    }
+
+			    if (exists $chr{$words[$nw-1]}) {
+			        if ($bord_neutralise eq "yes") {
+				  $ndist	= $naa - $nw;
+				  $ndistabs	= abs($ndist);
+				  if ($ndistabs > $winlenP2) {
+				    printf BLAH "STOPPING ndist gt winlenP2\n";
+				    exit;
+				  }
+				  if ($ndistabs > $len_neutral2) {
+				    $chrsum += $chr{$words[$nw-1]};
+				  }
+				} else {
+				  $chrsum += $chr{$words[$nw-1]};
+				}
+			    }
+
+			}
+		}
+		if ($nprop != 0) {
+			$prophere	= $propsum/$nprop;
+			$windis[$naa]	= $prophere;
+			if ($prophere > 0) {
+				$propcharhere	.= "*";
+				$ndisorder++;
+				$ndisorderTOT++;
+			} else {
+				$propcharhere	.= " ";
+			}
+		} else {
+			$propcharhere	.= " ";
+		}
+
+		if      ($chrsum < 0) {
+		  $signinc		= "-";
+		} elsif ($chrsum > 0) {
+		  $signinc		= "+";
+		} else {
+		  $signinc		= " ";
+		}
+		$signhere		.= $signinc;
+
+		$abschr			= abs($chrsum);
+		if ($abschr > 9) {
+		  $chrhereA		.= int($abschr/10);
+		  $chrhereB		.= $abschr - 10*int($abschr/10);
+		} else {
+		  $chrhereA		.= " ";
+		  $chrhereB		.= $abschr;
+		}
+
+		$all_winQ[$nseq][$naa]	= $chrsum;
+	}							# CLOSE AAs in SEQs LOOP
+#	$motifchar{$seqid[$nseq]}	= $motifcharhere;
+	$propchar{$seqid[$nseq]}	= $propcharhere;
+	$signchar{$seqid[$nseq]}	= $signhere;
+	$chrcharA{$seqid[$nseq]}	= $chrhereA;
+	$chrcharB{$seqid[$nseq]}	= $chrhereB;
+
+	$KDnorm_seqs{$seqid[$nseq]}	= $KDnorm_tot / $seqlen[$nseq];
+	$chrmean_seqs{$seqid[$nseq]}	= abs ($chr_tot) / $seqlen[$nseq];
+	$foldind_seqs{$seqid[$nseq]}	= 2.785 * $KDnorm_seqs{$seqid[$nseq]} - $chrmean_seqs{$seqid[$nseq]} - 1.151;
+	$dis_seqs{$seqid[$nseq]}	= $dis_tot / $ndis_tot;
+
+	$chr_tot_seqs{$seqid[$nseq]}	= $chr_tot / $seqlen[$nseq];
+	$ent_sum		= 0;
+	$frac_sum		= 0;
+	for $key (@aa_single) {
+		if ($aa_count{$key} != 0) {
+		  $frac_here		= $aa_count{$key} / $seqlen[$nseq];
+		  $incr_here		= -$frac_here * (log($frac_here) / log(2.0));
+		  $ent_sum		+= $incr_here;
+		  $frac_sum		+= $frac_here
+		}
+	}
+	$ent_seqs{$seqid[$nseq]}	= $ent_sum;
+#	$frac_seqs{$seqid[$nseq]}	= $frac_sum;	# not used currently
+
+	@wordsPROP			= split ("",$propcharhere);
+	@wordsSIGN			= split ("",$signhere);
+#	@wordsCHAR			= split ("",$chrhere);
+
+# at the moment we're only processing S, T, Y phos and non-phos - of course non-phos is well-dodgy
+	for ($h=1; $h<=$nbins_dis; $h++) {
+	  $n_phos_here[$h]	= 0;
+  	  $n_unphos_here[$h]	= 0;
+  	  $net_phos_here[$h]	= 0;
+  	  $net_unphos_here[$h]	= 0;
+  	  $frac_phos_here[$h]	= 0;
+  	  $frac_unphos_here[$h]	= 0;
+	  $n_nf_phos_here[$h]	= 0;
+	  $n_nf_unphos_here[$h]	= 0;
+  	  $nbord_phos_here[$h]	= 0;
+  	  $nbord_unphos_here[$h] = 0;
+	  $meanQ_phos_here[$h]	= 0;
+	  $meanQ_unphos_here[$h] = 0;
+	  $KDnorm_phos_here[$h]	= 0;
+	  $KDnorm_unphos_here[$h] = 0;
+	  $ent_phos_here[$h]	= 0;
+	  $ent_unphos_here[$h]	= 0;
+	}
+	for ($hfi=1; $hfi<=$nbins_fi; $hfi++) {
+  	  $nfi_phos_here[$hfi]			= 0;
+  	  $nfi_unphos_here[$hfi]		= 0;
+	  $nfi_nf_phos_here[$hfi]		= 0;
+	  $nfi_nf_unphos_here[$hfi]		= 0;
+	  $nbordfi_phos_here[$hfi]		= 0;
+	  $nbordfi_unphos_here[$hfi]		= 0;
+	  $entfi_phos_here[$hfi]		= 0;
+	  $entfi_unphos_here[$hfi]		= 0;
+
+	  $nFQ_nf_phos_here[$hfi]		= 0;
+	  $nFQ_nf_unphos_here[$hfi]		= 0;
+	  $nbordFQ_phos_here[$hfi]		= 0;
+	  $nbordFQ_unphos_here[$hfi]		= 0;
+	  $nMQ_nf_phos_here[$hfi]		= 0;
+	  $nMQ_nf_unphos_here[$hfi]		= 0;
+	  $nbordMQ_phos_here[$hfi]		= 0;
+	  $nbordMQ_unphos_here[$hfi]		= 0;
+	  $nKD_nf_phos_here[$hfi]		= 0;
+	  $nKD_nf_unphos_here[$hfi]		= 0;
+	  $nbordKD_phos_here[$hfi]		= 0;
+	  $nbordKD_unphos_here[$hfi]		= 0;
+}
+	for ($naa=1; $naa<=$seqlen[$nseq]; $naa++) {	# OPEN AAs in SEQs LOOP
+
+# CHARGE SIGN CHANGES in winlenB ooooooooooooooooooooooooooooooo
+	  $nb1			= $naa - $winlenB2;
+	  $nb2			= $naa + $winlenB2;
+	  if (($nb1 >= 1) and ($nb2 <= $seqlen[$nseq])) {
+	    $pos	= 'no';
+	    $neg	= 'no';
+	    for ($nb=$nb1; $nb<=$nb2; $nb++) {
+	      if      ($wordsSIGN[$nb-1] eq '+') {
+	        $pos	= 'yes';
+	      } elsif ($wordsSIGN[$nb-1] eq '-') {
+	        $neg	= 'yes';
+	      }
+	    }
+	    if (($pos eq 'yes') and ($neg eq 'yes')) {
+	      $bord	= 'yes';
+	    } else {
+	      $bord	= 'no';
+	    }
+	  }
+# oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo
+
+# NET CHARGE AND CHARGE FRACTION in P WINDOW ooooooooooooooooooooo
+	  $nb1			= $naa - $winlenP2;
+	  $nb2			= $naa + $winlenP2;
+	  if (($nb1 >= 1) and ($nb2 <= $seqlen[$nseq])) {
+	    $winP	= 'yes';
+	    $sumQ	= 0;
+	    $netQ	= 0;
+	    $KDnorm_tot	= 0;
+	    $chr_tot	= 0;
+	    for $key (@aa_single) {
+	      $aa_count{$key}	= 0;
+	    }
+	    $subseq		= "";
+	    for ($nb=$nb1; $nb<=$nb2; $nb++) {
+	      $subseq	.= $words[$nb-1];
+	      if (exists $chr{$words[$nb-1]}) {
+		$netQ	+= $chr{$words[$nb-1]};
+		$pm	= $chr{$words[$nb-1]};
+		if (abs($pm) > 0.9) {
+		  $sumQ++;
+		}
+	      }
+# boxing
+if (exists $KDnorm{$words[$nb-1]}) {
+	      $KDnorm_tot	= $KDnorm_tot + $KDnorm{$words[$nb-1]};
+} else {
+  printf BLAH "KDnorm PROBLEM 2 nb-1 words[nb-1] = $nb-1 $words[$nb-1]\n ";
+}
+	      $aa_lc		= $words[$nb-1];
+	      $AA_UC		= uc ($aa_lc);
+	      if (exists $aa_count{$AA_UC}) {
+		$aa_count{$AA_UC}++;
+	      } else {
+#		  printf BLAH "STOPPING AA (UC) not known in entropy count\n";
+	      }
+
+	    }
+	    $fracQ	= $sumQ / $winlenP;
+	    $meanQ_win		= abs ($netQ) / $winlenP;
+	    $meanQ_winNET	=     ($netQ) / $winlenP;
+	    $KDnorm_win	= $KDnorm_tot / $winlenP;
+	    $ent_sum		= 0;
+	    $frac_sum		= 0;
+	    for $key (@aa_single) {
+	      if ($aa_count{$key} != 0) {
+	        $frac_here	= $aa_count{$key} / $winlenP;
+	        $incr_here	= -$frac_here * (log($frac_here) / log(2.0));
+		$ent_sum	+= $incr_here;
+		$frac_sum	+= $frac_here
+	      }
+	    }
+	    $ent_win	= $ent_sum;
+	    $frac_win	= $frac_sum;
+
+	  } else {
+	    $winP	= 'no';
+	  }
+# oooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo
+
+#	  $am		= $words[$naa-1];		# not used currently
+#	  if (($am eq "S") or ($am eq "T") or ($am eq "Y") or ($am eq "s") or ($am eq "t") or ($am eq "y")) {
+	  if ((exists $words[$naa-1]) and ($winP eq 'yes')) {
+	    $prophere		= $wordsPROP[$naa-1];
+	    $bin		= 1 + int (($windis[$naa]+0.5)*$nbins_dis);
+
+	    if ($bin < 1) { $bin = 1; }				# cludge to avoid oub
+	    if ($bin > $nbins_dis) { $bin = $nbins_dis; }	# cludge to avoid oub
+
+	    if (($bin < 1) or ($bin > $nbins_dis)) {
+	      printf BLAH "STOPPING - bin for disorder not in 1 to $nbins_dis range\n";
+	      exit;
+	    }
+# bin Uversky fold from theoretical min, max = -2.151, 1.634 from the Prilusky paper
+	    $foldind_here	= 2.785 * $KDnorm_win - $meanQ_win - 1.151;
+	    $binfi		= 1 + int (($foldind_here + 2.151)/((1.634+2.151)/$nbins_fi));
+	    if (($binfi < 1) or ($binfi > $nbins_fi)) {
+	      printf BLAH "STOPPING - bin for Uversky fold not in 1 to $nbins_fi range\n";
+	      exit;
+	    }
+
+# for bins of fracQ (FQ), meanQ (MQ), KD; assume each has max range of 0 to 1
+	    $binFQ		= 1 + int (($fracQ)*$nbins_dis);
+	    if ($binFQ == $nbins_dis+1) { $binFQ = $nbins_dis; }
+	    if (($binFQ < 1) or ($binFQ > $nbins_dis)) {
+	      printf BLAH "STOPPING - bin for fracQ not in 1 to $nbins_dis range; $binFQ $fracQ\n";
+	      exit;
+	    }
+	    $binMQ		= 1 + int (($meanQ_win)*$nbins_dis);
+	    if ($binMQ == $nbins_dis+1) { $binMQ = $nbins_dis; }
+	    if (($binMQ < 1) or ($binMQ > $nbins_dis)) {
+	      printf BLAH "STOPPING - bin for meanQ not in 1 to $nbins_dis range\n";
+	      exit;
+	    }
+	    $binKD		= 1 + int (($KDnorm_win)*$nbins_dis);
+	    if (($binKD < 1) or ($binKD > $nbins_dis)) {
+	      printf BLAH "STOPPING - bin = $binKD for KD not in 1 to $nbins_dis range, KDnorm_win=$KDnorm_win\n";
+	      exit;
+	    }
+
+	    if ($wordsP[$naa-1] eq "P") {
+	      $n_phos_here[$bin]++;
+	      $nfi_phos_here[$binfi]++;
+	      if ($winP eq 'yes') {
+	        if ($bord eq 'yes') {		# only increment bord bins where winp='yes'
+	          $nbord_phos_here[$bin]++;	# i.e. binned bord props should be normalised with n_nf...
+	          $nbordfi_phos_here[$binfi]++;
+
+	          $nbordFQ_phos_here[$binFQ]++;
+	          $nbordMQ_phos_here[$binMQ]++;
+	          $nbordKD_phos_here[$binKD]++;
+
+		  $bord_write	= 1;
+	        } else {
+	          $bord_write	= 0;
+	        }
+	        $n_nf_phos_here[$bin]++;
+
+	        $net_phos_here[$bin]	+= $netQ;
+	        $frac_phos_here[$bin]	+= $fracQ;
+		$meanQ_phos_here[$bin]	+= $meanQ_win;
+		$KDnorm_phos_here[$bin]	+= $KDnorm_win;
+		$ent_phos_here[$bin]	+= $ent_win;
+
+	        $nfi_nf_phos_here[$binfi]++;
+		$entfi_phos_here[$binfi]	+= $ent_win;
+
+	        $nFQ_nf_phos_here[$binFQ]++;
+	        $nMQ_nf_phos_here[$binMQ]++;
+	        $nKD_nf_phos_here[$binKD]++;
+
+		$foldind		= 2.785 * $KDnorm_win - $meanQ_win - 1.151;
+		printf PSEQS "PHOS ";
+		printf PSEQS "%7.3f %7.3f %7.3f", $windis[$naa], $netQ, $fracQ;
+		printf PSEQS "%8.3f %7.3f %8.3f %7.3f %6.3f", $meanQ_win, $KDnorm_win, $foldind, $ent_win, $frac_win;
+		printf PSEQS "%7.3f  ", $bord_write;
+		printf PSEQS "%7.3f  ", $meanQ_winNET;
+		printf PSEQS "$subseq\n";
+	      }
+	    } else {
+	      $n_unphos_here[$bin]++;
+	      $nfi_unphos_here[$binfi]++;
+	      if ($winP eq 'yes') {		# only increment bord bins where winP='yes'
+	        if ($bord eq 'yes') {
+	          $nbord_unphos_here[$bin]++;
+	          $nbordfi_unphos_here[$binfi]++;
+
+	          $nbordFQ_unphos_here[$binFQ]++;
+	          $nbordMQ_unphos_here[$binMQ]++;
+	          $nbordKD_unphos_here[$binKD]++;
+
+		  $bord_write	= 1;
+	        } else {
+	          $bord_write	= 0;
+	        }
+	        $n_nf_unphos_here[$bin]++;
+	        $net_unphos_here[$bin]	+= $netQ;
+	        $frac_unphos_here[$bin]	+= $fracQ;
+		$meanQ_unphos_here[$bin]	+= $meanQ_win;
+		$KDnorm_unphos_here[$bin]	+= $KDnorm_win;
+		$ent_unphos_here[$bin]		+= $ent_win;
+
+	        $nfi_nf_unphos_here[$binfi]++;
+		$entfi_unphos_here[$binfi]	+= $ent_win;
+
+	        $nFQ_nf_unphos_here[$binFQ]++;
+	        $nMQ_nf_unphos_here[$binMQ]++;
+	        $nKD_nf_unphos_here[$binKD]++;
+
+		$foldind		= 2.785 * $KDnorm_win - $meanQ_win - 1.151;
+		printf PSEQS "NONP ";
+		printf PSEQS "%7.3f %7.3f %7.3f", $windis[$naa], $netQ, $fracQ;
+		printf PSEQS "%8.3f %7.3f %8.3f %7.3f %6.3f", $meanQ_win, $KDnorm_win, $foldind, $ent_win, $frac_win;
+		printf PSEQS "%7.3f  ", $bord_write;
+		printf PSEQS "%7.3f  ", $meanQ_winNET;
+		printf PSEQS "$subseq\n";
+	      }
+	    }
+	  }
+	}						# CLOSE AAs in SEQs LOOP
+
+	$nSTYunphos_processed	= 0;
+	$nSTYphos_processed	= 0;
+	$nbordunphos_processed	= 0;
+	$nbordphos_processed	= 0;
+	for ($h=1; $h<=$nbins_dis; $h++) {		# get sequence numbers of phos-STY and bord/phos-STY
+	  $nSTYphos_processed		+= $n_nf_phos_here[$h];
+	  $nSTYunphos_processed		+= $n_nf_unphos_here[$h];
+	  $nbordphos_processed		+= $nbord_phos_here[$h];
+	  $nbordunphos_processed	+= $nbord_unphos_here[$h];
+	}
+	$nSTYhere			= $nSTYphos_processed + $nSTYunphos_processed;
+	if ($nSTYhere != 0) {
+	  $frac_phos_seqs{$seqid[$nseq]}	= $nSTYphos_processed / $nSTYhere;
+	  $frac_bord_seqs{$seqid[$nseq]}	= ($nbordphos_processed + $nbordunphos_processed) / $nSTYhere;
+	} else {
+	  $frac_phos_seqs{$seqid[$nseq]}	= 99;
+	  $frac_bord_seqs{$seqid[$nseq]}	= 99;
+	}
+
+	printf PSEQS "KD meanQ FldInd Dis Ent FracPhos FracBord = ";
+#	printf PSEQS "%6.3f %6.3f ", $KDnorm_seqs{$seqid[$nseq]}, $chrmean_seqs{$seqid[$nseq]};
+	printf PSEQS "%6.3f %6.3f ", $KDnorm_seqs{$seqid[$nseq]}, $chr_tot_seqs{$seqid[$nseq]};
+	printf PSEQS "%6.3f %6.3f %6.3f ", $foldind_seqs{$seqid[$nseq]}, $dis_seqs{$seqid[$nseq]}, $ent_seqs{$seqid[$nseq]};
+	printf PSEQS "%6.3f %6.3f ", $frac_phos_seqs{$seqid[$nseq]}, $frac_bord_seqs{$seqid[$nseq]};
+	printf PSEQS "\n";
+	printf PSEQS "ENDID $seqid[$nseq]\n\n";
+
+	for ($h=1; $h<=$nbins_dis; $h++) {
+	  $n_phos[$h]		+= $n_phos_here[$h];
+	  $n_unphos[$h]		+= $n_unphos_here[$h];
+	  $net_phos[$h]		+= $net_phos_here[$h];
+	  $net_unphos[$h]	+= $net_unphos_here[$h];
+	  $frac_phos[$h]	+= $frac_phos_here[$h];
+	  $frac_unphos[$h]	+= $frac_unphos_here[$h];
+	  $n_nf_phos[$h]	+= $n_nf_phos_here[$h];
+	  $n_nf_unphos[$h]	+= $n_nf_unphos_here[$h];
+	  $nbord_phos[$h]	+= $nbord_phos_here[$h];
+	  $nbord_unphos[$h]	+= $nbord_unphos_here[$h];
+	  $meanQ_phos[$h]	+= $meanQ_phos_here[$h];
+	  $meanQ_unphos[$h]	+= $meanQ_unphos_here[$h];
+	  $KDnorm_phos[$h]	+= $KDnorm_phos_here[$h];
+	  $KDnorm_unphos[$h]	+= $KDnorm_unphos_here[$h];
+	  $ent_phos[$h]		+= $ent_phos_here[$h];
+	  $ent_unphos[$h]	+= $ent_unphos_here[$h];
+	}
+	for ($hfi=1; $hfi<=$nbins_fi; $hfi++) {
+	  $nfi_phos[$hfi]		+= $nfi_phos_here[$hfi];
+	  $nfi_unphos[$hfi]		+= $nfi_unphos_here[$hfi];
+	  $nfi_nf_phos[$hfi]		+= $nfi_nf_phos_here[$hfi];
+	  $nfi_nf_unphos[$hfi]		+= $nfi_nf_unphos_here[$hfi];
+	  $nbordfi_phos[$hfi]		+= $nbordfi_phos_here[$hfi];
+	  $nbordfi_unphos[$hfi]		+= $nbordfi_unphos_here[$hfi];
+	  $entfi_phos[$hfi]		+= $entfi_phos_here[$hfi];
+	  $entfi_unphos[$hfi]		+= $entfi_unphos_here[$hfi];
+
+	  $nFQ_nf_phos[$hfi]		+= $nFQ_nf_phos_here[$hfi];
+	  $nFQ_nf_unphos[$hfi]		+= $nFQ_nf_unphos_here[$hfi];
+	  $nbordFQ_phos[$hfi]		+= $nbordFQ_phos_here[$hfi];
+	  $nbordFQ_unphos[$hfi]		+= $nbordFQ_unphos_here[$hfi];
+	  $nMQ_nf_phos[$hfi]		+= $nMQ_nf_phos_here[$hfi];
+	  $nMQ_nf_unphos[$hfi]		+= $nMQ_nf_unphos_here[$hfi];
+	  $nbordMQ_phos[$hfi]		+= $nbordMQ_phos_here[$hfi];
+	  $nbordMQ_unphos[$hfi]		+= $nbordMQ_unphos_here[$hfi];
+	  $nKD_nf_phos[$hfi]		+= $nKD_nf_phos_here[$hfi];
+	  $nKD_nf_unphos[$hfi]		+= $nKD_nf_unphos_here[$hfi];
+	  $nbordKD_phos[$hfi]		+= $nbordKD_phos_here[$hfi];
+	  $nbordKD_unphos[$hfi]		+= $nbordKD_unphos_here[$hfi];
+
+	}
+
+
+}							# CLOSE SEQUENCES LOOP
+close (PSEQS);
+
+$file_here = "$FindBin::Bin/bins.txt";
+open (BINS, ">$file_here") or die "cannot open bins.txt\n";
+printf BINS "dis_bin win_netQ_P win_netQ_U win_fracQ_P win_fracQ_U ";
+printf BINS "Q_diffPU fracQ_diffPU ";
+printf BINS "meanQ_P meanQ_U  meanKD_P meanKD_U\n";
+for ($h=1; $h<=$nbins_dis; $h++) {
+	  $meanQ_phos[$h]	+= $meanQ_phos_here[$h];
+	  $meanQ_unphos[$h]	+= $meanQ_unphos_here[$h];
+	  $KDnorm_phos[$h]	+= $KDnorm_phos_here[$h];
+	  $KDnorm_unphos[$h]	+= $KDnorm_unphos_here[$h];
+	  $ent_phos[$h]		+= $ent_phos_here[$h];
+	  $ent_unphos[$h]	+= $ent_unphos_here[$h];
+#  $mean_frac_unphos	= $frac_unphos[$h] / $n_unphos[$h];
+  if ($n_unphos[$h] != 0) {
+    $n_PUratio[$h]	= $n_phos[$h] / $n_unphos[$h];
+  } else {
+    $n_PUratio[$h]	= 0;
+  }
+  if ($n_nf_phos[$h] != 0) {
+    $mean_net_phos	= $net_phos[$h] / $n_nf_phos[$h];
+    $mean_frac_phos	= $frac_phos[$h] / $n_nf_phos[$h];
+    $mean_meanQ_phos[$h]	= $meanQ_phos[$h] / $n_nf_phos[$h];
+    $mean_KDnorm_phos[$h]	= $KDnorm_phos[$h] / $n_nf_phos[$h];
+    $mean_ent_phos[$h]		= $ent_phos[$h] / $n_nf_phos[$h];
+  } else {
+    $mean_net_phos	= 0;
+    $mean_frac_phos	= 0;
+    $mean_meanQ_phos[$h]	= 0;
+    $mean_KDnorm_phos[$h]	= 0;
+    $mean_ent_phos[$h]		= 0;
+  }
+  if ($n_nf_unphos[$h] != 0) {
+    $mean_net_unphos	= $net_unphos[$h] / $n_nf_unphos[$h];
+    $mean_frac_unphos	= $frac_unphos[$h] / $n_nf_unphos[$h];
+    $mean_meanQ_unphos[$h]	= $meanQ_unphos[$h] / $n_nf_unphos[$h];
+    $mean_KDnorm_unphos[$h]	= $KDnorm_unphos[$h] / $n_nf_unphos[$h];
+    $mean_ent_unphos[$h]	= $ent_unphos[$h] / $n_nf_unphos[$h];
+  } else {
+    $mean_net_unphos	= 0;
+    $mean_frac_unphos	= 0;
+    $mean_meanQ_unphos[$h]	= 0;
+    $mean_KDnorm_unphos[$h]	= 0;
+    $mean_ent_unphos[$h]	= 0;
+  }
+  $netQ_diff[$h]	= $mean_net_phos - $mean_net_unphos;
+  #changed below line due to printf error newer perl Max/Jim/Desnoyers 020518
+  #$fracQ_diff[$h]	= $mean_frac_phos - $mean_frac_unphos;
+  printf BINS "%7.0f", $h;
+  printf BINS "%11.3f %10.3f", $mean_net_phos, $mean_net_unphos;
+  printf BINS "%12.3f %11.3f", $mean_frac_phos, $mean_frac_unphos;
+  printf BINS "%9.3f %12.3f", $h, $netQ_diff[$h];
+  #changed below line due to printf error newer perl Max/Jim/Desnoyers 020518
+  #printf BINS "%9.3f %12.3f", $h, $netQ_diff[$h], $fracQ_diff[$h];
+  printf BINS "%8.3f %7.3f", $mean_meanQ_phos[$h], $mean_meanQ_unphos[$h];
+  printf BINS "%10.3f %8.3f\n", $mean_KDnorm_phos[$h], $mean_KDnorm_unphos[$h];
+}
+
+# Newer dis bins below to mirror the Uversky fold Table
+
+$n_phos_tot		= 0;
+$n_unphos_tot		= 0;
+$n_nf_phos_tot		= 0;
+$n_nf_unphos_tot	= 0;
+for ($h=1; $h<=$nbins_dis; $h++) {
+  $n_phos_tot		+= $n_phos[$h];
+  $n_unphos_tot		+= $n_unphos[$h];
+  $n_nf_phos_tot	+= $n_nf_phos[$h];
+  $n_nf_unphos_tot	+= $n_nf_unphos[$h];
+}
+printf BINS "\n\n phos_tot, unphos_tot, nf_phos_tot, nf_unphos_tot = $n_phos_tot $n_unphos_tot $n_nf_phos_tot $n_nf_unphos_tot\n";
+printf BINS "\n NOTE that other than enrich_PU, all enr/enrich props use 'nf' - complete wins around STY";
+printf BINS "\n NOTE that enrich = 1.0 where a bin P/U ratio for that prop = P/U ratio overall all";
+printf BINS "\n\nDis_bin    Dis       n_P       n_U   n_nf_P    n_nf_U  ratio_PU enrich_PU  n_bord_P n_bord_U  bord/STY rel_brdPU";
+printf BINS "     ent_P    ent_U   ent/STY rel_entPU\n";
+for ($h=1; $h<=$nbins_dis; $h++) {
+  $top		= $n_nf_phos[$h] * $n_nf_unphos_tot;	# calc enrich with nwins P/U
+  $bot		= $n_nf_unphos[$h] * $n_nf_phos_tot;
+  if (($top != 0) and ($bot != 0)) {
+    $enrich_nf_PU_here	= $top/$bot;
+  } else {
+    $enrich_nf_PU_here	= 1.0;
+  }
+  if ($n_nf_unphos[$h] != 0) {
+    $ratio_nf_PU_here	= $n_nf_phos[$h] / $n_nf_unphos[$h];
+  } else {
+    $ratio_nf_PU_here	= 1;
+  }
+
+  $bot			= $n_nf_phos[$h] + $n_nf_unphos[$h];
+  $top			= $nbord_phos[$h] + $nbord_unphos[$h];
+  if (($top != 0) and ($bot != 0)) {
+    $STY_in_bord_here	= $top/$bot;
+  } else {
+    $STY_in_bord_here	= 1;
+  }
+
+  $bot			= $n_nf_phos[$h] + $n_nf_unphos[$h];
+  $top			= $ent_phos[$h] + $ent_unphos[$h];
+  if (($top != 0) and ($bot != 0)) {
+    $STY_ent_here	= $top/$bot;
+  } else {
+    $STY_ent_here	= 1;
+  }
+
+  $top			= $nbord_phos[$h]*$n_nf_unphos[$h];
+  $bot			= $nbord_unphos[$h]*$n_nf_phos[$h];
+  if (($top != 0) and ($bot != 0)) {
+    $rel_enr_PU_bord_here	= $top/$bot;
+  } else {
+    $rel_enr_PU_bord_here	= 1.0;
+  }
+  $top			= $ent_phos[$h]*$n_nf_unphos[$h];
+  $bot			= $ent_unphos[$h]*$n_nf_phos[$h];
+  if (($top != 0) and ($bot != 0)) {
+    $rel_enr_PU_ent_here		= $top/$bot;
+  } else {
+    $rel_enr_PU_ent_here		= 1.0;
+  }
+  $dis_here		= (($h-0.5)/$nbins_dis) - 0.5;
+  printf BINS "%7.0f %6.3f %9.0f %9.0f", $h, $dis_here, $n_phos[$h], $n_unphos[$h];
+  printf BINS "%  9.0f %9.0f %9.4f %9.3f", $n_nf_phos[$h], $n_nf_unphos[$h], $ratio_nf_PU_here, $enrich_nf_PU_here;
+  printf BINS " %9.0f %8.0f %9.4f %9.3f", $nbord_phos[$h], $nbord_unphos[$h], $STY_in_bord_here, $rel_enr_PU_bord_here;
+  printf BINS " %9.0f %8.0f %9.3f %9.3f\n", $ent_phos[$h], $ent_unphos[$h], $STY_ent_here, $rel_enr_PU_ent_here;
+}
+printf BINS "\n";
+printf BINS " See the equivalent Table for more information on these properties\n";
+printf BINS " Note: eg winlenB=11 get higher PtoU bord for higher dis - ?? charge props of Plation spec\n";
+printf BINS " But overall trend is fewer STY border at higher dis, perhaps due to more charge at higher dis\n";
+printf BINS " So if subs seqs have more Q, then enr_bordPU would go other way\n";
+
+$nfi_phos_tot		= 0;
+$nfi_unphos_tot		= 0;
+$nfi_nf_phos_tot	= 0;
+$nfi_nf_unphos_tot	= 0;
+for ($hfi=1; $hfi<=$nbins_fi; $hfi++) {
+  $nfi_phos_tot		+= $nfi_phos[$hfi];
+  $nfi_unphos_tot	+= $nfi_unphos[$hfi];
+  $nfi_nf_phos_tot	+= $nfi_nf_phos[$hfi];
+  $nfi_nf_unphos_tot	+= $nfi_nf_unphos[$hfi];
+}
+printf BINS "\n phos_tot, unphos_tot, nf_phos_tot, nf_unphos_tot = $nfi_phos_tot $nfi_unphos_tot $nfi_nf_phos_tot $nfi_nf_unphos_tot\n";
+printf BINS "\n NOTE that other than enrich_PU, all enr/enrich props use 'nf' - complete wins around STY";
+printf BINS "\n NOTE that enrich = 1.0 where a bin P/U ratio for that prop = P/U ratio overall all";
+printf BINS "\n\n FI_bin FldInd       n_P       n_U   n_nf_P    n_nf_U  ratio_PU enrich_PU  n_bord_P n_bord_U  bord/STY rel_brdPU";
+printf BINS "     ent_P    ent_U   ent/STY rel_entPU\n";
+for ($hfi=1; $hfi<=$nbins_fi; $hfi++) {
+  $top		= $nfi_nf_phos[$hfi] * $nfi_nf_unphos_tot;	# calc enrich with nwins P/U
+  $bot		= $nfi_nf_unphos[$hfi] * $nfi_nf_phos_tot;
+  if (($top != 0) and ($bot != 0)) {
+    $enrich_nf_PU_here	= $top/$bot;
+  } else {
+    $enrich_nf_PU_here	= 1.0;
+  }
+  if ($nfi_nf_unphos[$hfi] != 0) {
+    $ratio_nf_PU_here	= $nfi_nf_phos[$hfi] / $nfi_nf_unphos[$hfi];
+  } else {
+    $ratio_nf_PU_here	= 1;
+  }
+
+  $bot			= $nfi_nf_phos[$hfi] + $nfi_nf_unphos[$hfi];
+  $top			= $nbordfi_phos[$hfi] + $nbordfi_unphos[$hfi];
+  if (($top != 0) and ($bot != 0)) {
+    $STY_in_bord_here	= $top/$bot;
+  } else {
+    $STY_in_bord_here	= 1;
+  }
+
+  $bot			= $nfi_nf_phos[$hfi] + $nfi_nf_unphos[$hfi];
+  $top			= $entfi_phos[$hfi] + $entfi_unphos[$hfi];
+  if (($top != 0) and ($bot != 0)) {
+    $STY_ent_here	= $top/$bot;
+  } else {
+    $STY_ent_here	= 1;
+  }
+
+  $top			= $nbordfi_phos[$hfi]*$nfi_nf_unphos[$hfi];
+  $bot			= $nbordfi_unphos[$hfi]*$nfi_nf_phos[$hfi];
+  if (($top != 0) and ($bot != 0)) {
+    $rel_enr_PU_bord_here	= $top/$bot;
+  } else {
+    $rel_enr_PU_bord_here	= 1.0;
+  }
+  $top			= $entfi_phos[$hfi]*$nfi_nf_unphos[$hfi];
+  $bot			= $entfi_unphos[$hfi]*$nfi_nf_phos[$hfi];
+  if (($top != 0) and ($bot != 0)) {
+    $rel_enr_PU_ent_here		= $top/$bot;
+  } else {
+    $rel_enr_PU_ent_here		= 1.0;
+  }
+  $fi_here		= -2.151 + ($hfi-0.5)*((1.634+2.151)/$nbins_fi);
+  printf BINS "%7.0f %6.3f %9.0f %9.0f", $hfi, $fi_here, $nfi_phos[$hfi], $nfi_unphos[$hfi];
+  printf BINS "%  9.0f %9.0f %9.4f %9.3f", $nfi_nf_phos[$hfi], $nfi_nf_unphos[$hfi], $ratio_nf_PU_here, $enrich_nf_PU_here;
+  printf BINS " %9.0f %8.0f %9.4f %9.3f", $nbordfi_phos[$hfi], $nbordfi_unphos[$hfi], $STY_in_bord_here, $rel_enr_PU_bord_here;
+  printf BINS " %9.0f %8.0f %9.3f %9.3f\n", $entfi_phos[$hfi], $entfi_unphos[$hfi], $STY_ent_here, $rel_enr_PU_ent_here;
+}
+printf BINS "\nMaking sense of the bins:\n";
+printf BINS "Fold Value    below zero = unfolded, above zero = folded (Uversky / Prilusky)\n";
+printf BINS "n_P/U         numbers of phos/unphos STY picked up from seqs, wherever in seq\n";
+printf BINS "n_nf_P/U      nos of phos/unphos STY that are within complete winlenP (eg 21) windows\n";
+printf BINS "              these nf numbers of P/U are used for subsequent ratio/enrichment calcs\n";
+printf BINS "ratio_PU      simple ratio of phos / unphos in each FI bin\n";
+printf BINS "enrich_PU     enrichment of P to U, normalised such that average over all bins = 1\n";
+printf BINS "              the classic result is that we should have > 1 for unfolded proteins\n";
+printf BINS "n_bord_P/U    STY phos/unphos nos (in comp winlenP) at charge borders (tested in winlenB)\n";
+printf BINS "bord/STY      for all STY, the fraction in charge border as function of FI bin\n";
+printf BINS "              see a general trend, presumably related to more charge for less folded proteins\n";
+printf BINS "rel_brdPU     enrichment of charge borders P to U, normalised by ratio of P to U in bin\n";
+printf BINS "              we're looking here - may be some enrichment for P close to fold boundary\n";
+printf BINS "ent_P/U       sum over winlenP sequence entropies for phos / unphos sites\n";
+printf BINS "ent/STY       for all STY, the average seq entropy per site in this bin\n";
+printf BINS "rel_entPU     enrichment of these entropes P to U, normalised by ratio of P to U in bin\n";
+printf BINS "              what to say about the seq entropy P to U props ??\n";
+
+# We see rel_brdPU variation on dis and FI - but what about vs charge and KD prop bins separately?
+printf BINS "\n We see rel_bord variation with dis and FI bins - what about charge and KD bins?";
+printf BINS "\n\nnumbin  fracQ  bord/STY rel_brdPU  meanQ  bord/STY rel_brdPU     KD  bord/STY rel_brdPU\n";
+for ($h=1; $h<=$nbins_dis; $h++) {
+  printf BINS "%7.0f", $h;
+  $bot			= $nFQ_nf_phos[$h] + $nFQ_nf_unphos[$h];
+  $top			= $nbordFQ_phos[$h] + $nbordFQ_unphos[$h];
+  if (($top != 0) and ($bot != 0)) {
+    $STY_in_bord_here	= $top/$bot;
+  } else {
+    $STY_in_bord_here	= 1;
+  }
+  $top			= $nbordFQ_phos[$h]*$nFQ_nf_unphos[$h];
+  $bot			= $nbordFQ_unphos[$h]*$nFQ_nf_phos[$h];
+  if (($top != 0) and ($bot != 0)) {
+    $rel_enr_PU_bord_here	= $top/$bot;
+  } else {
+    $rel_enr_PU_bord_here	= 1.0;
+  }
+  printf BINS " fracQ %9.4f %9.3f", $STY_in_bord_here, $rel_enr_PU_bord_here;
+  $bot			= $nMQ_nf_phos[$h] + $nMQ_nf_unphos[$h];
+  $top			= $nbordMQ_phos[$h] + $nbordMQ_unphos[$h];
+  if (($top != 0) and ($bot != 0)) {
+    $STY_in_bord_here	= $top/$bot;
+  } else {
+    $STY_in_bord_here	= 1;
+  }
+  $top			= $nbordMQ_phos[$h]*$nMQ_nf_unphos[$h];
+  $bot			= $nbordMQ_unphos[$h]*$nMQ_nf_phos[$h];
+  if (($top != 0) and ($bot != 0)) {
+    $rel_enr_PU_bord_here	= $top/$bot;
+  } else {
+    $rel_enr_PU_bord_here	= 1.0;
+  }
+  printf BINS "  meanQ %9.4f %9.3f", $STY_in_bord_here, $rel_enr_PU_bord_here;
+  $bot			= $nKD_nf_phos[$h] + $nKD_nf_unphos[$h];
+  $top			= $nbordKD_phos[$h] + $nbordKD_unphos[$h];
+  if (($top != 0) and ($bot != 0)) {
+    $STY_in_bord_here	= $top/$bot;
+  } else {
+    $STY_in_bord_here	= 1;
+  }
+  $top			= $nbordKD_phos[$h]*$nKD_nf_unphos[$h];
+  $bot			= $nbordKD_unphos[$h]*$nKD_nf_phos[$h];
+  if (($top != 0) and ($bot != 0)) {
+    $rel_enr_PU_bord_here	= $top/$bot;
+  } else {
+    $rel_enr_PU_bord_here	= 1.0;
+  }
+  printf BINS "     KD %9.4f %9.3f\n", $STY_in_bord_here, $rel_enr_PU_bord_here;
+}
+
+close (BINS);
+
+# output the sequences, with P and Order annotation lines
+
+$file_here = "$FindBin::Bin/seq_props.out";
+open (SEQS_OUT, ">$file_here") or die "cannot open seq_props.out\n";
+# open (SEQS_SUM, ">POsummary.out") or die "cannot open PO.out\n";
+
+# printf SEQS_SUM "VALUES OVER THE ENTIRE INPUT SET follow: N genes = $nseqs\n\n";
+# printf SEQS_SUM "N genes with >= 1 Motif          = $nwithmotif\n";
+# printf SEQS_SUM "N genes with >= 1 Motif also DIS = $nwithmotifDIS\n";
+# $answerA	= $totalmotifs/$nseqs;
+# printf SEQS_SUM "AVG Motifs per gene              = $answerA\n";
+# $answerB	= $totalmotifsDIS/$nseqs;
+# printf SEQS_SUM "AVG Motifs per gene also DIS     = $answerB\n";
+# printf SEQS_SUM "Perc disorder for all genes      = $PercDISTOT\n\n";
+
+# printf SEQS_SUM "VALUES FOR EACH GENE follow\n\n";
+for ($nseq=1; $nseq<=$nseqs; $nseq++) {
+	printf SEQS_OUT "$seqid[$nseq]\n";
+#	printf SEQS_SUM "$seqid[$nseq]\n";
+
+	printf SEQS_OUT "KD meanQ FldInd Dis Ent FracPhos FracBord = ";
+	printf SEQS_OUT "%6.3f %6.3f ", $KDnorm_seqs{$seqid[$nseq]}, $chrmean_seqs{$seqid[$nseq]};
+	printf SEQS_OUT "%6.3f %6.3f %6.3f ", $foldind_seqs{$seqid[$nseq]}, $dis_seqs{$seqid[$nseq]}, $ent_seqs{$seqid[$nseq]};
+	printf SEQS_OUT "%6.3f %6.3f ", $frac_phos_seqs{$seqid[$nseq]}, $frac_bord_seqs{$seqid[$nseq]};
+	printf SEQS_OUT "\n";
+
+	printf SEQS_OUT "SEQ=$seq{$seqid[$nseq]}\n";
+	printf SEQS_OUT "DIS=$propchar{$seqid[$nseq]}\n";
+	printf SEQS_OUT "CHR=$signchar{$seqid[$nseq]}\n";
+	printf SEQS_OUT "CHR=$chrcharA{$seqid[$nseq]}\n";
+	printf SEQS_OUT "CHR=$chrcharB{$seqid[$nseq]}\n";
+
+
+	printf SEQS_OUT "\n";
+
+
+	if ($extra eq "yes") {			# optional output of vertical data list, one file per sequence
+	    $filehere	= 'seq_'.$nseq.'.txt';
+	    open (EXTRA, ">$filehere") or die "cannot open $filehere\n";
+	    printf EXTRA "$seqid[$nseq]\n";
+	    @seqextra	= split ("",$seq{$seqid[$nseq]});
+	    $aatop	= length $seq{$seqid[$nseq]};
+	    @disextra	= split ("",$propchar{$seqid[$nseq]});
+	    @signextra	= split ("",$signchar{$seqid[$nseq]});
+	    @chrextraA	= split ("",$chrcharA{$seqid[$nseq]});
+	    @chrextraB	= split ("",$chrcharB{$seqid[$nseq]});
+	    for ($aa=1; $aa<=$aatop; $aa++) {
+	        if ($disextra[$aa-1] eq "*") {
+		    $disout	= "1";
+		} else {
+		    $disout	= "0";
+		}
+		if ($chrextraA[$aa-1] eq " ") {
+		  $chrout	= $chrextraB[$aa-1];
+		} else {
+		  $chrout	= $chrextraA[$aa-1].$chrextraB[$aa-1];
+		}
+#	        printf EXTRA "$seqextra[$aa-1]  $disout  $all_winQ[$nseq][$aa]  check=$signextra[$aa-1]$chrextra[$aa-1]\n";
+	        printf EXTRA "$seqextra[$aa-1]  $disout  $all_winQ[$nseq][$aa]  check=$signextra[$aa-1]$chrout\n";
+	    }
+	    close (EXTRA);
+	}
+}
+close (SEQS_OUT);
+# close (SEQS_SUM);
+
+close (BLAH);
+
+exit;
--- a/seq_reference_data.txt
+++ b/seq_reference_data.txt
+
+# Niwa population 3049 ids matched, and a further 653 excluded by MEMBRANE_EXCLUDE
+# Niwa - following averages and std deviations refer to the 3049-653 population
+# Niwa - following data segment from seq_compositions_perc_pipeline.pl followed by seq_compositions_stats_pipeline w/ STATS_MODE unset
+# Added POP, TOP, LOW to start of transferred data (seq_compositions), for ease of reading
+# Also added the data source to each horizantal data line, for ease of reading
+
+POP, NIWA, AVG, 53.347, DEV, 33.910, N, 2396, TAG, percentage-solubility
+
+POP, NIWA, WHOLE-SEQ,headers,K-R,D-E,naa,totperc,A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y,K+R,D+E,K+R-D-E,K+R+D+E,F+W+Y,pI,KyteDoo,abs-charge,FoldIndex,disorder,entropy,betapropensity
+POP, NIWA, WHOLE-SEQ,average, -0.922, -0.983,307.996,100.000,  9.347,  1.360,  5.654,  6.637,  3.456,  7.021,  2.543,  5.668,  4.947,  9.969,  2.754,  3.921,  4.409,  4.697,  5.868,  5.542,  5.307,  6.847,  1.297,  2.755, 10.815, 12.291, -1.476, 23.106,  7.508,  6.693,  0.475,  0.029,  0.143, -0.084,  4.067,  0.993
+POP, NIWA, WHOLE-SEQ,std-dev,  3.438,  2.759,178.157,  0.000,  2.752,  1.217,  1.614,  2.132,  1.409,  2.287,  1.289,  1.835,  2.327,  2.568,  1.114,  1.557,  1.632,  1.844,  2.183,  1.876,  1.743,  2.017,  0.994,  1.383,  2.923,  2.586,  3.708,  4.088,  2.529,  1.854,  0.025,  0.027,  0.080,  0.021,  0.093,  0.026
+
+TOP, NIWA, AVG, 113.241, DEV, 8.620, N, 145, TAG, percentage-solubility
+
+TOP, NIWA, WHOLE-SEQ,headers,K-R,D-E,naa,totperc,A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y,K+R,D+E,K+R-D-E,K+R+D+E,F+W+Y,pI,KyteDoo,abs-charge,FoldIndex,disorder,entropy,betapropensity
+TOP, NIWA, WHOLE-SEQ,average,  1.042, -1.529,188.055,100.000,  8.993,  1.277,  6.064,  7.593,  2.964,  6.639,  2.259,  5.472,  6.560,  9.202,  3.041,  3.774,  4.125,  4.691,  5.518,  5.770,  5.464,  7.112,  1.052,  2.431, 12.078, 13.657, -1.579, 25.735,  6.447,  6.572,  0.462,  0.047,  0.089, -0.084,  3.995,  0.986
+TOP, NIWA, WHOLE-SEQ,std-dev,  4.125,  3.843,122.263,  0.000,  3.163,  1.502,  2.193,  2.973,  1.697,  2.544,  1.628,  1.989,  3.529,  2.816,  1.642,  1.941,  2.154,  2.685,  2.740,  2.226,  2.097,  2.674,  1.094,  1.514,  4.786,  3.540,  6.298,  5.586,  2.709,  2.227,  0.033,  0.045,  0.109,  0.026,  0.115,  0.032
+
+LOW, NIWA, AVG, 5.208, DEV, 1.747, N, 125, TAG, percentage-solubility
+
+LOW, NIWA, WHOLE-SEQ,headers,K-R,D-E,naa,totperc,A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y,K+R,D+E,K+R-D-E,K+R+D+E,F+W+Y,pI,KyteDoo,abs-charge,FoldIndex,disorder,entropy,betapropensity
+LOW, NIWA, WHOLE-SEQ,average, -2.651, -0.520,437.096,100.000,  8.922,  1.228,  5.438,  5.957,  3.800,  6.932,  2.923,  5.307,  3.812, 10.676,  2.674,  4.060,  4.525,  4.901,  6.463,  5.892,  5.369,  6.001,  1.833,  3.286, 10.275, 11.395, -1.120, 21.670,  8.918,  6.988,  0.473,  0.016,  0.150, -0.082,  4.113,  0.992
+LOW, NIWA, WHOLE-SEQ,std-dev,  2.618,  2.230,204.698,  0.000,  2.227,  0.743,  1.187,  1.547,  1.272,  1.912,  1.117,  1.682,  1.387,  2.481,  0.906,  1.623,  1.274,  1.557,  1.764,  1.517,  1.470,  1.585,  0.956,  1.565,  1.793,  1.623,  1.643,  2.998,  2.599,  1.476,  0.021,  0.011,  0.060,  0.020,  0.056,  0.021
+
+
+# calc_flag, signed_diff_zscore, y/yes or n/no to include or not in prediction, taken from p-value of overall correlations
+# also avoid most straight linear combinations of aas in prediction
+# then, in the prediction scheme, for each feature to be included (y/n) we will have:
+# feature x, let xL = x avg in LOW set, xT = x avg in TOP set, and measured prop y, yL = avg in LOW ser, yT = avg in TOP set
+# for feature value x:
+# if x < xL, x=xL or if x > xT, x=xT, and predicted y = yL + [(x-xL)/( xT-xL)]*(yT-yL)
+# followed by linear combination with abs value of zscore_diff for each x/feature
+
+# our current (hopefully FINAL) list of 10 features for prediction is given in as y/n
+
+ZDF,NIWA,WHOLE-SEQ,headers,K-R,D-E,naa,totperc,A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y,K+R,D+E,K+R-D-E,K+R+D+E,F+W+Y,pI,KyteDoo,abs-charge,FoldIndex,disorder,entropy,betapropensity
+ZDF, NIWA, WHOLE-SEQ, zscore-diff,     1.096,    -0.333,    -1.523,     0.000,     0.026,     0.043,     0.370,     0.724,    -0.563,    -0.132,    -0.484,     0.090,     1.118,    -0.557,     0.288,    -0.160,    -0.234,    -0.099,    -0.419,    -0.065,     0.053,     0.522,    -0.762,    -0.555,     0.548,     0.876,    -0.116,     0.947,    -0.931,    -0.225,    -0.390,     1.105,    -0.711,    -0.103,    -1.382,    -0.258
+ZDF, NIWA, WHOLE-SEQ, use_for_prob,y,n,y,n,n,n,n,n,n,n,y,n,n,y,n,n,n,n,n,n,n,y,n,n,n,y,n,n,y,n,n,y,y,n,y,n
+
--- a/server_prediction_seq_export.pl
+++ b/server_prediction_seq_export.pl
+#!/usr/bin/perl -w
+
+use FindBin;
+$file_here = "$FindBin::Bin/blah.txt";
+open (BLAH, ">>$file_here");			# file for comments used throughout server code
+
+# read in the reference data for this dataset.
+$file_here = "$FindBin::Bin/seq_reference_data.txt";
+open (REF, "<$file_here") or die "cannot open seq_reference_data.txt\n";
+@ref	= <REF>;
+close (REF);
+
+foreach $line_spaces (@ref) {
+  chomp $line_spaces;
+  $line					= $line_spaces;
+  $line					=~ s/\s//g;
+  @words				= split (",",$line);
+  $nwords				= @words;
+  if (exists $words[0]) {			# non-null line
+    if (substr ($line,0,1) ne "#") {		# not comment line
+
+      if ($words[2] eq 'AVG') {
+
+	if ($words[0] eq 'POP') {
+	  $population_avg	= $words[3];
+	}
+
+	if ($nwords != 10) {			# check the number, but not format, of expected fields
+	  printf BLAH "stopping *** number of fields in AVG line mismatch\n";
+	  exit;
+	} else {
+	  $AVG_line{$words[0]}		= 'seen';
+	  if ($words[0] eq 'TOP') { $top_prop_avg	= $words[3]; }
+	  if ($words[0] eq 'LOW') { $low_prop_avg	= $words[3]; }
+	}
+
+      } elsif ($words[3] eq 'headers') {
+	if ($nwords != 40) {			# check the number, but not format, of expected fields
+	  printf BLAH "stopping *** number of fields in headers line mismatch\n";
+	  exit;
+	} else {
+	  $headers_line{$words[0]}	= 'seen';
+	}
+	if ($words[0] eq 'POP') {
+	  $headers_store			= $line;	# store the headers just once, for results write
+	}
+
+      } elsif ($words[3] eq 'average') {
+	if ($nwords != 40) {			# check the number, but not format, of expected fields
+	  printf BLAH "stopping *** number of fields in average line mismatch\n";
+	  exit;
+	} else {
+	  $average_line{$words[0]}	= 'seen';
+	}
+        if ($words[0] eq 'TOP') {		# store top and low averages to derive the linear fits (per feature)
+	  $top_avg_line			= $line;
+	} elsif ($words[0] eq 'LOW') {
+	  $low_avg_line			= $line;
+	} elsif ($words[0] eq 'POP') {
+	  $pop_avg_line			= $line;
+	}
+
+      } elsif ($words[3] eq 'std-dev') {
+	if ($nwords != 40) {			# check the number, but not format, of expected fields
+	  printf BLAH "stopping *** number of fields in std-dev line mismatch\n";
+	  exit;
+	} else {
+	  $std_dev_line{$words[0]}	= 'seen';
+	}
+        if ($words[0] eq 'POP') {		# store population std deviations for features
+	  $pop_dev_line			= $line;
+#unused	} elsif ($words[0] eq 'TOP') {
+#	  $top_dev_line			= $line;
+#	} elsif ($words[0] eq 'LOW') {
+#	  $low_dev_line			= $line;
+	}
+
+      } elsif (($words[0] eq 'ZDF') and ($words[3] eq 'zscore-diff')) {
+	if ($nwords != 40) {			# check the number, but not format, of expected fields
+	  printf BLAH "stopping *** number of fields in ZDF zscore-diff line mismatch\n";
+	  exit;
+	} else {
+	  $zdiff_line			= $line;
+	  $zdf_values			= 'seen';
+	}
+      } elsif (($words[0] eq 'ZDF') and ($words[3] eq 'use_for_prob')) {
+	if ($nwords != 40) {			# check the number, but not format, of expected fields
+	  printf BLAH "stopping *** number of fields in ZDF use_for_prob line mismatch\n";
+	  exit;
+	} else {
+	  $use_for_prob_line		= $line;
+	  $zdf_use			= 'seen'
+	}
+      }
+
+    }						# end not comment line
+  }						# end not null line
+}						# end line read
+
+for $key_here ('POP', 'TOP', 'LOW' ) {
+ if (($headers_line{$key_here} ne 'seen') or ($average_line{$key_here} ne 'seen') or ($std_dev_line{$key_here} ne 'seen')) {
+   printf BLAH "stopping **** headers/average/std_dev line unseen for $key_here\n";
+   exit;
+ }
+ if ($AVG_line{$key_here} ne 'seen') {
+   printf BLAH "stopping **** AVG line unseen for $key_here\n";
+   exit;
+ }
+}
+if (($zdf_values ne 'seen') or ($zdf_use ne 'seen')) {
+  printf BLAH "stopping **** ZDF zscore-diff/use_for_prob line unseen\n";
+}
+
+# store the reference values needed for prediction
+
+$low_prop	= $low_prop_avg;
+$top_prop	= $top_prop_avg;
+@lows		= split (",",$low_avg_line);		# values run from 4-39 (inclusive), with array index starting at 0
+@tops		= split (",",$top_avg_line);		# as above
+@pops		= split (",",$pop_avg_line);		# as above
+@pop_devs	= split (",",$pop_dev_line);		# as above
+@zdiffs		= split (",",$zdiff_line);		# as above, zdiff_scores remain signed at this point
+@use		= split (",",$use_for_prob_line);	# as above, values are y,n
+@heads		= split (",",$headers_store);		# as above
+
+# read and store data values for this sequence set
+
+$file_here = "$FindBin::Bin/seq_composition.txt";
+open (COMP, "<$file_here") or die "cannot open seq_composition.txt\n";
+$headers_found		= "no";
+$nseqs			= 0;
+while ($line_spaces = <COMP>) {
+  chomp $line_spaces;
+  $line					= $line_spaces;
+  $line					=~ s/\s//g;
+  @words				= split (",",$line);
+  $nwords				= @words;
+  if ($headers_found eq "no") {
+    if ((exists $words[0]) and (exists $words[1])) {
+      if (($words[0] eq "WHOLE-SEQ") and ($words[1] eq "ORF-ID")) {
+        $headers_found	= "yes";
+#        @heads_comp	= @words;		# not used currently
+      }
+    }
+  }
+  if (exists $words[0]) {
+    if ($words[0] eq "WHOLE-SEQ") {		# only process WHOLE-SEQ here - will need other of our code for seq-profiles
+      if ($words[1] ne "ORF-ID") {		# check not on header line
+        if ($nwords != 38) {			# flag, ID, 36 comp etc data fields
+	  printf BLAH "stopping **** fields ne 38 in composition file data line\n";
+	  exit;
+	} else {
+	  $nseqs++;
+	  $ids[$nseqs]			= $words[1];
+	  for ($n=1; $n<=36; $n++) { $data[$nseqs][$n]	= $words[$n+1]; }
+	}
+      }
+    }
+  }
+}
+close (COMP);
+
+$file_here = "$FindBin::Bin/seq_prediction.txt";
+open (PREDA, ">$file_here") or die "cannot open seq_prediction.txt\n";
+
+# before the sequence data writes, output things that don't change
+
+$wt_sum			= 0;
+for ($d=1; $d<=36; $d++) {
+  if ($use[$d+3] eq 'y') {
+    $wt_feature[$d]	= abs ($zdiffs[$d+3]);		# get feature weighting for prediction (mostly zero probably)
+    $wt_sum		+= $wt_feature[$d];
+  } else {
+    $wt_feature[$d]	= 0;
+  }
+}
+
+printf PREDA "LEGEND 35 sequence features are calculated, including 20 amino acid compositions\n";
+printf PREDA "LEGEND All features are calculated over a sliding 21 amino acid window\n";
+printf PREDA "LEGEND   KmR=KminusR DmE=DminusE KpR=KplusR PmN=K+R-D-E PpN=K+R+D+N aro=F+W+Y\n";
+printf PREDA "LEGEND   fld = border of folded (pos) / unfolded (neg) Uversky et al Proteins 2000 41:415\n";
+printf PREDA "LEGEND   dis = disorder propensity Rune & Linding NAR 2003 31:3701\n";
+printf PREDA "LEGEND   bet = beta strand propensities Costantini et al 2006 BBRC 342:441\n";
+printf PREDA "LEGEND   mem = Kyte-Doolittle hydropathy normalised (-1 to +1) JMB 157:105\n";
+printf PREDA "LEGEND Only a subset of features used for prediction, according to best fit against data\n";
+printf PREDA "LEGEND   underlying solubility data: cell-free expression Niwa et al PNAS 2009 106:4201\n";
+printf PREDA "LEGEND Features used for prediction are scaled 0 - 1 over range in underlying dataset\n";
+printf PREDA "HEADERS PREDICTIONS LINE,ID,percent-sol,scaled-sol,population-sol,pI\n";
+printf PREDA "HEADERS FEATURES ORIGINAL,ID";
+for ($d=1; $d<=3; $d++) { printf PREDA ",$heads[$d+3]"; }
+for ($d=5; $d<=36; $d++) { printf PREDA ",$heads[$d+3]"; }
+printf PREDA "\n";
+printf PREDA "HEADERS FEATURES PLOT,ID,";
+printf PREDA "KmR,DmE,len,";
+for ($d=5; $d<=24; $d++) { printf PREDA "$heads[$d+3],"; }
+printf PREDA "KpR,DpE,PmN,PpN,aro,pI,mem,chr,fld,dis,ent,bet\n";
+printf PREDA "\n";
+
+for ($n=1; $n<=$nseqs; $n++) {
+  $wt_sum		= 0;
+  $pred_sum		= 0;
+  for ($d=1; $d<=36; $d++) {
+
+    if ($d != 4) {				# get the values of (X-Xavg-pop) / std-dev-pop i.e. a neg/pos scale
+      $norm_dev[$d]	= ($data[$n][$d] - $pops[$d+3]) / $pop_devs[$d+3];
+    }
+
+    $wt_feature[$d]	= 0;			# initialise to zero, since will only overwrite if feature used for prediction
+    if ($use[$d+3] eq 'y') {			# we are using this feature (based on correlation analysis)
+
+      $data_here	= $data[$n][$d];
+      if ($lows[$d+3] <= $tops[$d+3]) {		# range as the words say, bounds check 
+        if ($data[$n][$d] < $lows[$d+3]) { $data_here	= $lows[$d+3]; }
+	if ($data[$n][$d] > $tops[$d+3]) { $data_here	= $tops[$d+3]; }
+      } else {					# inverted range, bounds check the other way around
+        if ($data[$n][$d] > $lows[$d+3]) { $data_here	= $lows[$d+3]; }
+	if ($data[$n][$d] < $tops[$d+3]) { $data_here	= $tops[$d+3]; }
+      }
+
+      $pred_here	= $low_prop + ($top_prop - $low_prop) * ( ($data_here - $lows[$d+3]) / ($tops[$d+3] - $lows[$d+3]) );
+      $wt_here		= abs ($zdiffs[$d+3]);
+      $pred_sum		+= $wt_here*$pred_here;
+      $wt_sum		+= $wt_here;
+      $wt_feature[$d]	= $wt_here;
+
+    }
+  }
+  $prediction		= $pred_sum / $wt_sum;
+  $prediction_scaled	= ($prediction - $low_prop) / ($top_prop - $low_prop);
+  $population_scaled	= ($population_avg - $low_prop) / ($top_prop - $low_prop);
+  $pI_here		= $data[$n][30];
+  printf PREDA "SEQUENCE PREDICTIONS,$ids[$n]";			# output sequence-specific data
+  printf PREDA ",%6.3f,%6.3f,%6.3f,%6.3f\n", $prediction,$prediction_scaled,$population_scaled,$pI_here;
+  printf PREDA "SEQUENCE WEIGHTS,$ids[$n]";
+  for ($d=1; $d<=3; $d++) {
+    $wt_print		= $wt_feature[$d]/$wt_sum;
+    printf PREDA ",%6.3f", $wt_print;
+  }
+  for ($d=5; $d<=36; $d++) {
+    $wt_print		= $wt_feature[$d]/$wt_sum;
+    printf PREDA ",%6.3f", $wt_print;
+  }
+  printf PREDA "\n";						# output the devs/std-dev for each feature, for this sequence
+  printf PREDA "SEQUENCE DEVIATIONS,$ids[$n]";
+  for ($d=1; $d<=3; $d++) { printf PREDA ",%6.3f", $norm_dev[$d]; }
+  for ($d=5; $d<=36; $d++) { printf PREDA ",%6.3f", $norm_dev[$d]; }
+  printf PREDA "\n\n";
+
+}
+
+close (PREDA);
+close (BLAH);
+
+exit;
+
--- a/ss_propensities.txt
+++ b/ss_propensities.txt
+comment amino acid sec struc propensities from
+comment Constantini et al (2006) BBRC 342:441-451
+comment aa3 aa1  freq    Pa    Pb    Pc
+data    ALA   A  7.73  1.39  0.75  0.80
+data    CYS   C  1.84  0.74  1.31  1.05
+data    ASP   D  5.82  0.89  0.55  1.33
+data    GLU   E  6.61  1.35  0.72  0.86
+data    PHE   F  4.05  1.01  1.43  0.76
+data    GLY   G  7.11  0.47  0.65  1.62
+data    HIS   H  2.35  0.92  0.99  1.07
+data    ILE   I  5.66  1.04  1.71  0.59
+data    LYS   K  6.27  1.11  0.83  1.00
+data    LEU   L  8.83  1.32  1.10  0.68
+data    MET   M  2.08  1.21  0.99  0.83
+data    ASN   N  4.50  0.77  0.62  1.39
+data    PRO   P  4.52  0.50  0.44  1.72
+data    GLN   Q  3.94  1.29  0.76  0.89
+data    ARG   R  5.03  1.17  0.91  0.91
+data    SER   S  6.13  0.82  0.85  1.24
+data    THR   T  5.53  0.76  1.23  1.07
+data    VAL   V  6.91  0.89  1.86  0.64
+data    TRP   W  1.51  1.06  1.30  0.79
+data    TYR   Y  3.54  0.95  1.50  0.78
+comment suggested algorithm is simply to sum over
+comment window (eg 7) and take largest value = ss