The RASP algorithm casts chemical shift assignment as a combinatorial optimisation problem: it seeks assignments that maximise the sum of two terms, the first of which scores the agreement between the predicted and experimental chemical shifts and the second of which scores agreement between equivalent pairs of shifts from sequentially assigned spin systems:
Here a permutation of residues r is denoted π, and π expresses the assignment such that π(r) = i denotes the assignment of spin system i to residue r. RASP seeks the permutation π that maximises equation 1.
We derive first the term from the weighted shift distance between the set of chemical shifts defining the experimental spin system i and the corresponding set of predicted shifts for residue r:
where the sum is over all shifts X common to spin system i and residue r, including i-1 (r-1) shifts. The weighting terms wX are chosen to account for the varying precision with which each shift type can be predicted.
In an attempt to account for the fact that chemical shift space is not uniformly populated, and that spin systems in densely populated regions of shift space will tend to be closer to a larger number of predicted shifts, purely by chance, we normalize D such that the average shift distance for each spin system is 1, to yield N:
where n is the number of residues. Small values of N are strongly predictive of assignment accuracy, and a simple empirical function,
approximates the relationship between the normalised shift distance and the likelihood of a correct assignment. This serves as the first term of the RASP scoring function, Eq 1.
For the second term, Si,j, we score the agreement between the common shifts Xj and Xi-1 that support a sequential relationship between spin systems i and j. Because this entails the comparison of sets of experimental chemical shifts (rather than comparison of predicated shifts with experimental shifts, as in L) we adopt a more direct and more stringent scoring function:
where the cdf() is the normal cumulative distribution function with variance of 0.1 ppm for CA and CO chemical shifts, 0.2 ppm for CB, 0.06 for N and 0.03 for any proton shift. The resulting scoring function does not penalise the absence of any particular shift in either spin system.
The RASP optimisation strategy is based on the greedy randomised adaptive search procedures (GRASP) metaheuristic (1), although we make a number of adaptations of the basic GRASP protocol to suit this specific application, as outlined below.
GRASP heuristics comprise two phases, an initial construction phase generates a high-quality feasible solution, followed by a local search procedure that identifies the local optimum in the neighbourhood of this initial solution. Recognising the power of the sequential information encoded in S and following the approach of several other backbone assignment algorithms (eg, refs 2-4) we undertake the construction phase by selecting short contiguous stretches of assigned sequence from the best-ranked of an exhaustive list of all possible such assignments. As a suitable compromise between efficiency and accuracy, we use stretches of length three, which we term triplets. To assemble the list of triplets, we first split the target protein sequence into overlapping three-residue stretches, excluding all those containing proline or residues for which chemical shift predictions are not available.
For each such three-residue stretch, all possible spin-system assignments are enumerated and evaluated according to a triplet scoring function:
for the assignment of spin systems i, j and k to residues r, r-1 and r-2. The exhaustive list is pruned to keep only the best 3N possible triplets, where N is the total number of possible assignments (the smaller of the number of spin systems and the number of residues).
Triplets are selected from this pruned list using a weighted random selection such that the likelihood of any given triplet being selected is proportional to its score. Each selected triplet is then added to the assignment under construction. In the event that a selected triplet is inconsistent with the current assignment state (ie that residues or spin systems in the triplet are already assigned in a way that conflicts with the triplet) then that triplet is excluded from further consideration, and another is selected as before. The construction phase ends when the assignment is complete, or when the list of triplets is exhausted.
The multiplicative character of the triplet scoring function, which is used only in the construction phase, ensures that only triplet assignments well supported by each of the individual assignment likelihoods and by the relevant sequential information are likely to be selected. On the other hand, the random selection of elite triplets introduces sufficient diversity in the constructed solutions to ensure sampling of all residue assignments that are reasonable in the light of the available constraints, thus ensuring that the correct residue assignment is almost always sampled for each spin system.
The RASP construction phase is adaptive (as required by the GRASP metaheuristic) only in the weak sense that triplets inconsistent with the assignment under construction are not considered further. We find that a more aggressive adaptive triplet rescoring strategy that takes into account potential sequential connectivities to the assignment state at each construction step results in a reduction in the diversity of sampled assignment possibilities, and a consequent reduction in assignment accuracy, albeit with a modest increase in coverage.
Frequently, the list of elite triplets is exhausted before the assignment is complete. In this case, missing assignments are completed by applying the Hungarian assignment algorithm (5) over the set of unassigned spin systems and residues, optimising only the agreement between predicted and experimental chemical shifts (ie. using –Li,r as the cost function).
After the initial construction phase, the assignment is further refined by a systematic search for the local maximum of the assignment score (eq. 1) in the neighbourhood of the current assignment. This search is conducted by considering all pairs of spin systems, and swapping their residue assignments if such a swap will increase the assignment score. When no such swap is possible, a local maximum has been reached and the current assignment is taken as a member of the assignment ensemble.
(1) Resende, M. G. C.; Ribeiro, C. C. In Handbook of Metaheuristics; Gendreau, M., Potvin, J.-Y., Eds.; Springer: 2010, p 283.
(2) Jung, Y. S.; Zweckstetter, M. (2004) J. Biomol. NMR, 30: 11.
(3) Coggins, B. E.; Zhou, P. (2003) J. Biomol. NMR, 26: 93.
(4) Atreya, H. S.; Sahu, S. C.; Chary, K. V.; Govil, G. (2000) J. Biomol. NMR, 17, 125.
(5) Kuhn, H. W. (1955) Naval Research Logistics, 2: 83.