Core Tracing: Depicting Connections Between Features in Electron Density

This is the html version of the file http://journals.iucr.ac.uk/d/issues/1994/05/00/gr0232/gr0232.pdf.
G o o g l e automatically generates html versions of documents as we crawl the web.
To link to or bookmark this page, use the following url:

http://www.google.com/search?q=cache:IowzXKG1f0oJ:journals.iucr.ac.uk/d/issues/1994/05/00/gr0232/gr0232.pdf+gr0232.pdf&hl=en&gl=us&ct=clnk&cd=1&client=firefox-a

Google is neither affiliated with the authors of this page nor responsible for its content.

These terms only appear in links pointing to this page: gr0232 pdf

Page 1

Acta Cryst.

(1994). D50, 695-708

Core Tracing: Depicting Connections Between Features in Electron Density

BY STANLEY M. SWANSON

Biographics Laboratory, Department of Biochemistry and Biophysics, Texas A & M University,

College Station, Texas

77843-2128,

USA

(Received 23 November 1992; accepted 1 March 1994)

695

Abstract

Core tracing is a threshold-independent method of deter-

mining connectivity (long chains of high-density values)

in electron-density maps. It gives visually sparse pictures

of large volumes which are useful for initial fitting and

for molecular-boundary determination. New methods for

visual presentation of the traces are suggested by the

way that the connectivity is parameterized in terms of

local connections between maxima and the saddle (low-

est) points along the connecting paths. The algorithm

also partitions the density into small compact volumes

containing the maxima. These volumes are useful for

localization and statistical analysis.

Introduction

The interpretation of electron-density maps is one of the

most labor-intensive and demanding aspects of macro-

molecular crystallography, especially when the map is

noisy or poorly phased. Core tracing addresses this prob-

lem by highlighting dominant features in the density.

The graphical presentation of electron density has tra-

ditionally been in terms of contours: by two-dimensional

projections, by layers of two dimensional sections on

transparent sheets, or, with the advent of computer

graphics, by a wire-frame representation of iso-density

surfaces formed from superimposed two-dimensional

contoured layers in three directions. In the initial stages

of macromolecular analysis, when molecular packing

is determined, or when the initial chain tracing of a

structure is performed, contours on a typical computer-

graphics screen provide a too local and too cluttered

view of the density.

Various skeletonization techniques [Greer (1974);

Hilditch (1969); Johnson (1977, 1978);

GRINCH

(Williams, 1982; Swanson, 1979);

BONES

(Jones &

Thirup, 1986)] have been proposed to give a visually

economical depiction of density connectivity in large

volumes. The complexity of skeletons (the number

of vectors in the picture) is approximately that of

the final structural model, and at least an order of

magnitude less than that of contours. Skeletons are

complementary to contours: they provide an overview

of large volumes and, when used with contours in small

volumes, an indication of the most probable connection

path. Skeletons fail to depict aspects of density such

as shape and bulk which contours suggest, but it is

possible to encode some of these in the rendering or

presentation phase.

Implicit in the idea of skeletonization is that con-

nectivity in the density corresponds to bond chains in

the structure. In contoured density of macromolecular

structures at intermediate resolution, it does. However, at

either very low or very high resolution, or in the presence

of noise, exceptions occur. For very low resolution,

one expects only to see molecular outlines, while in

the case of very high resolution (attained in small-

molecule studies) one resolves individual atoms and

infers bonding from distance calculations. Noise, by

altering density values relative to a chosen threshold, can

break or add connections. For example, one may have

to interpret a string of islands (a sequence of nearby but

disconnected lumps) as a continuous chain.

Core tracing is a new method of density skeletoniza-

tion that has been developed to address some perceived

inadequacies in previous methods. Greer's method and

its descendents

(BONES)

force a pre-processing decision

on the lowest connection level; it is not easy to ask

what other possibilitieslie just below that threshold.

In contrast, core tracing uses a threshold-independent

top-down scan of the density map so that all local

connections can be found. The most prominent features

are noted first, and the decision about viewing threshold

level can be an interactive one at the display.

Greer's method forms skeletons by connecting ad-

jacent grid points; the paths tend to have many short

lines of about atomic bond length which appear only in

limited orientations (along lattice coordinates or diago-

nals).

GRINCH

does interpolate locally (which removes

the limitations on line orientation), but still uses many

short lines. Moreover, the interpolation is biased toward

the grid positions. In contrast, while developing core

tracing, it was found that drawing paths which connect

maxima to saddles gives an adequate and less busy pic-

ture with longer line segments. Interpolation to provide

diversity of orientation becomes less necessary because

adjacent grid points are rarely connected. (The subject

Acta Crystallographica Section D

Page 2

696

CORE TRACING

of interpolation itself deserves more careful study to

find optimum methods and to determine accuracy in the

face of experimental error. See the Appendix for more

comments.)

be called features of the density. For macromolecular

crystallography, we can ignore the caves and focus on

the core of the density.

Descriptive geometry

Mathematically, an electron density is a scalar field: a

real function defined on a three-dimensional domain.

Experience suggests that we will find one-dimensional

paths passing through high density which correspond to

bonded chains of atoms in the macromolecular struc-

ture being studied. Although an experimental density

is sampled on a grid and algorithms must deal with

that partial information, we now consider the ideal

continuous case. The gradient operator measures spatial

changes. Most points will have a non-zero slope, but a

few will be critical points where the gradient is zero.

There are critical points in addition to maxima and

minima: the saddle points with mixed partial curvatures.

The number of kinds of saddle points depends on the

dimension of the space; there are two different kinds

in three dimensions, but only one in two dimensions.

The language used previously in crystallography has

a two-dimensional bias, being based on terminology

borrowed from topography [peaks, ridges, passes, pales,

pits (Johnson, 1978)], and really describes only planar

projections of density.

To sharpen our description, consider three dimensions

specifically. Density maxima are local concentrations of

high density, or nodules. One-dimensional connections

follow the path of highest density between the nodules.

Such a path travels through two-dimensional maxima

in planes perpendicular to its direction, but will en-

counter a minimum density value somewhere between

two nodules. The minimum on a path would be seen

as a constriction or neck in a contour representation of

the density, and corresponds to one of the intermediate

non-extremal critical points. Together, the nodules, the

paths connecting them and the constrictions will form

the focus of this paper and be called the core of the

density. Note that our viewpoint has been from outside

the density: looking at a hard object or structure in the

density. The other critical points are best thought about

from the inside, as though we travel through a system

of caves. Minima correspond to voids, connections of

minimum density between voids to passages and narrow

places in passages to portals (the other non-extremal

critical point). In Johnson's terminology, the core of the

density consists of peaks, ridges, and passes; the caves

are described by pales and pits.

After having discussed three dimensions in fairly pic-

turesque language, I will also use a dimensionally more

neutral terminology: maxima (for the nodule centers) and

joins (for the constrictions), and path or core or trace

for the connections. Together, maxima and joins will

The algorithm

Now consider the algorithm, first in qualitative terms and

then in more detail. Some more technical details which

define and facilitate the implementation are discussed in

the Appendix.

Core tracing proceeds by associating successively

lower nearby points to maxima in the density, form-

ing distinct, local, growing, nodules. Eventually these

nodules will merge; the highest point at which two (or

more) nodules touch is the join between them. Line

segments from a join to the connected maxima represent

the core of the density. The ideas of nearby and touch

are defined in terms of a neighborhood of a point (other

points within a specified distance). The result is a list

of connected features and a partition of the density into

many small, compact volumes, each identified with a

feature contained within it. A two-dimensional example

is given in Fig. 1.

A neighborhood is defined by a list of nearby lattice

points, sorted so that the nearest ones come first. A single

loop then drives the investigation of the neighborhood

of a point; a re-analysis of a map with a different

neighborhood is handled by a different list of neighbors.

As concrete examples of neighborhoods, consider

cubic and hexagonal lattices with equal grid intervals

in all directions. The 27 points which form a cube

with a maximum coordinate offset of 1 grid unit from

the central point separate into four distance classes:

the single point at the center, the six points along the

coordinate directions (distance squared = 1 grid unit),

the 12 points on edges of the cube (distance squared

= 2 units), and the eight points at the cube vertices

(distance squared = 3 units). Although the conventional

neighborhood (or shell) is all 26 surface points, one can

choose fewer (18 or 6) or more merely by specifying

a new defining distance limit. For a hexagonal lattice,

the corresponding set contains 21 points arranged as

a stack of three hexagons, each with six peripheral

points and a center. There are three classes: the single

central point, the eight points at unit distance (six on the

medial hexagon and two axial points) and 12 points on

the periphery of the top and bottom hexagons (squared

distance = 2 units). Again, the search loops need not

be rewritten, one simply gets a different list of position

offsets. Non-standard density sampling schemes can be

accommodated (e.g. body-centered cubic sampling in

which alternate layers are shifted by half a grid unit).

Features (maxima and joins) are found and tabulated

as the map is examined, and are assigned an identifying

number, with smaller numbers corresponding to higher

density. We construct a list of features, their positions

Page 3

STANLEY M. SWANSON

697

and density, and a connectivity table• In the connectivity

table, each join has the set of feature numbers (usually

maxima) found together in its neighborhood, and each

maximum has a list of all the joins which include it.

More specifically, we break the algorithm into its data

structures and their initialization (0) and three steps: (1)

an initial sort that determines the sequence in which

the grid points of the map are considered; (2) the

examination of a shell of points about each grid point

for connectivity; and (3) an assessment of whether there

is any new connectivity information.

(0) Start with a three-dimensional array of density

values,

d(j,k,l)

(the map) and a corresponding

array of

marks, q(j,k,l).

A mark is initially zero,

but ultimately is the identifying number of the

associated feature, which is normally a nearby

maximum. This feature will not necessarily be

the geometrically closest one, but the closest

one among a set of features reachable by paths

through equal or higher density.

(1) Sort the grid positions by their density values,

to give a list of grid points ordered by density

. . . .

6152015

3 5 3

•

31530353015

6152015

•

52035403520

3 15 30 35 30 15

31530353016

10 15 20 15 6

52035403520

. . . .

6 15 20 15 10 16 30 35 30 15 3

. . . .

3 5 3 4 15 30 35 30 15 3

. . . . .

3 5 3 52035403520

. 6 15 20 15 7 7 15 20 15 6

3 15 30 35 30 19 8 3

. 3 15 303530

15 4 3 5 3

6152022222015

6 52035403520

. . . . . .

3 8 19 30 35 30 15 7 15 30 35 30 15 3

. . . . . .

52035403520

5 6 15 20 15 6

•

3 5 3

. . . .

31530353015

. 3 5 3

6152015

6 15

20 15

1520 15 6

3 15 30

35 30

31530353015

. 4 8 8 4

3 5 820354035

52035403520

. 6 15 20 15 6

. . . . . .

6 15 20 19 22 30 35 30

3 15 303530

15 3

. 3 8 19 30 35 30 15 3

. 3 8 19 30 35 30 22 19 20 15

•

6 15 20 15

15 25 35 41 40 35 20

3 5 3

. 315303845453830

15 3

. 3 15 30 38 45 45 38 30 15 3

5203540413525

15 6

. 5203540413525

15 6

. . . . . .

3 15 30 35 30 19 8 3

6 15 20 15 6

(a)

. . . .

c c c c c

d d d

• c

c c ¢ c c c

d d d d d

• c

c c*C*c

¢ c e e e

d d d d d d d

• c c c c c c e e e e e

d d d*D*d

d d

. . . .

c c c c c-L-o

e e e e

. . . .

d d d d d d d

. . . . .

c c c e e e*E*e

e e

f-N-d

d d d d

e e e e e e g g

. f

d d d

e e e-J-g

g g g f

f*F*f

. . . . . .

. . . . . . . . . .

e e g g g g g-O-f

f f f f f

. . . . . .

g g g*G*g

g g f f f f f

. h h h

• i i

. . . . . .

g g g g g g g ,

f f f

. . . .

h h h h

• i i i i i . . . . . .

g g g g g

h h h h h

i i i i i i i

. . . . .

g-M-g

. . . . . . .

b h h h h*H*h

i i i*I*1

i i

. . . .

a a a a g

. . . . . .

b b b h-K-h

h h

i i i i i i i

. . a a a a a a a a

•

b b b b b b h h h h

. i i i 1 i-P-.

a a a a a a a a a

. . b b b b b b b h h h h

• i i i

. a a a a*A*a

a a a a

. b b b b*B*b

b b b h

a a a a a a a a a

b b b b b b b b b

• a a a a a a a a

•

b b b b b b b b

. . . .

• a a a a a

. . . . . .

b b b b b

. . . . . .

(b)

Fig. 1. A model two-dimensional

density (a) and its partition (b)

with features indicated. This ex-

ample is generated by and anal-

ysed by the computer code fur-

nished as supplemental material

to this paper. (a) 11 peaks have

been summed in a plane. Reso-

lution is four grid points; pairs

of peaks have merged in features

A and B. Values range from 0

to 45, with the range below 3

replaced by dots to separate the

molecule from the solvent. (b)

The partition of the density is

indicated by blocks of lower case

letters, except in the regions of

very low density masked by dots.

Features are indicated by capital

letters: maxima flanked by aster-

isks ( A through I ) and joins

flanked by hyphens (-J- through

-P-). Thus, all points which are

closer to *A* than to any other

maximum are identified by 'a'.

The point -M- is the join be-

tween *A* and *G*.

Page 4

698

CORE TRACING

value, from highest to lowest. (The sort can be

performed efficiently: see

Appendix.)

(2) At each point, in the order determined in (1),

examine a shell of nearby points. Define the set

of unique feature marks found in the neighbor-

hood of the central point to be the feature set

of the point. The marks serve as proxies for the

features, indicating that a path from the mark

to the feature exists through higher density. To

the central point, assign a mark which depends

on the surrounding density values and previously

assigned marks. If the feature set is not empty,

the mark is normally the closest feature whose

mark is in the feature set (2b, 2c, 2d).

There are several alternatives:

(2a) No previous marks are seen - a new local

maximum. Mark the point with a new feature

number, and add that position to the list of

features as a maximum.

(2b) Only one kind of mark (one mark value) is

seen - the point is part of a growing nodule

which is isolated in this direction. Mark the point

as associated with the feature whose mark was

found.

(2c) A set of at least two different previous marks

are seen -"two or more nodules are merging,

possibly a new join [see (3) below].

(2d) All points in the neighborhood are marked - a

new local minimum.

Alternative (2c) requires more analysis and it is dis-

cussed as a separate step: checking for new connections.

(3) To determine which features have not previously

been connected to others in the feature set,

try to construct paths from each feature to the

others, using the existing connectivity table. If

all features are found to be interconnected, there

is no new information. If one or more subsets

are unconnected, the central point is tabulated

as a new join, and the feature set is entered into

the connectivity table. Search paths are limited

in length: small rings can be excluded from

the connectivity table, but larger ones will not

be. Increasing the length of the search paths

decreases the number of joins accepted, but may

ignore direct connections in favor of alternate,

more circuitous routes.

One way to limit searching is to use the minimum

search depth (3) and attempt to eliminate short loops

later during the rendering phase. This has emerged as

a reasonable strategy since there are not many loops at

thresholds of interest (1.3cr) and one does not irretriev-

ably lose connectivity information.

Typically, most of the lower points near a join will

involve the same feature set found originally for the

join. As another way to eliminate excessive searching,

a scheme of growing 'pancakes' (flat separating disks)

out from joins is available. In this case, grid positions

are also marked with join numbers, whereas previously

only maxima were used as feature marks. Whenever a

single join mark is found in a neighborhood (even mixed

with maxima), one assumes that one is in the vicinity of

that join, and marks the central point with the join mark

without a connectivity search. If one sees multiple join

Fig. 2. Comparison of contouring

(blue), core tracing (red), and

Greer-Hilditch skeletonization

(green). Threshold level is 1.3tr

in a four-derivative MIR-phased

Ht-d map with resolution of

about 3A. Note the longer

vectors in the core tracing and

the jaggedness of the Greer

skeleton. The Greer rendering

is unsophisticated: no attempt

is made to eliminate extra lines

at some branch points, nor is a

choice made among members of

clusters of points of equal value.

Page 5

STANLEY M. SWANSON

699

marks, alone or mixed with maximum marks the point

becomes a candidate for a join. Pancakes complicate

the logic but are effective; they tend to separate the

original nodules associated with maxima and to reduce

the number and complexity of possible joins at lower

density values.

Core tracing is actually a family of techniques. It is

controlled by choice of neighborhood, depth of connec-

tivity search, and use of pancakes. We need more experi-

ence to make definitive recommendations for parameters.

There will also be some effect from the grid-size choice

(as compared to resolution) and the amount of smoothing

used to combat truncation ripple in the map. Current

working defaults are as follows.

Neighbors: 26 for 'rectangular' lattices (or possibly

18), 20 for hexagonal.

Search depth: 3

(m-j-m

paths) or possibly 5

(m-j-m-

j-m

paths).

Pancakes: use a search depth of 3.

Core tracing finds the 'total connectivity' of the

density, but the nature of the joins found at lower density

levels is influenced by the parameter choices just dis-

cussed. Although core tracing can find the connections

at all density levels, there may not be much point in

tabulating this information below the map average (well

into the noise for initial maps). Such a restriction cuts

the amount of computation at least in half.

Presentation and analysis

There are choices to be made concerning the graphical

presentation of the core tracing. These choices are as

important as knowing the connectivity because they

influence one's perceptions of the apparent connectivity.

Some of them have been or could be applied to con-

nectivity determined by other skeletonization techniques.

Fig. 2 compares core tracing, contouring and Greer

skeletonization in a small volume of density.

Even with the economy of vectors inherent in the

method, an entire asymmetric unit or unit cell can be

too busy for visual analysis. More limited volumes may

also need pruning. Several techniques are available to

select a subset of the joins:

thresholds,

limiting the number of connections drawn to each

maximum,

lower limits on the length of continuous tracing above

threshold, and

volume restrictions based on closeness to a position

or to a partial model.

Usually a threshold is combined with one or more of

the other restrictions. Since the set of maxima and joins

is threshold independent, the choice of level can be made

or changed when a display is generated. Most selection

criteria could be made interactive with the proper inte-

gration into a display program - the number of vectors

involved is in the low thousands. To simulate dynamic

change of level, core tracings have been rendered at

several different thresholds and displayed sequentially

or superimposed.

The traditional method of restricting the complexity

of a map display is a threshold. Often the choice of

threshold has been made on a visual assessment of

contours: low enough to give connectivity but not too

low so as to result in confusion. Typical levels are of

the order of 10.. (0. is calculated by considering the map

as a statistical distribution: calculate the average density

and the square root of the variance about that average.)

The number of joins increases rapidly below about 10. in

the map. Since a count of the number of maxima and the

number of joins is kept as a core tracing is generated, one

heuristic could be the density level at which the number

of maxima equals the number of joins. This gives enough

joins to form a single trace with each join between two

maxima and each maximum between two joins (...m-j-

m-j-m...). The

level at which this occurs is somewhat

below 10.

(ca

0.90.) and gives excessive connectivity.

However, some of the maxima are in solvent volume

so that connections to and between them are irrelevant

to the protein structure. Using only the number of

joins equal to the number of maxima in the fraction of

volume occupied by plotein results in a density threshold

comparable with the heuristic rule of 1.3o" found by

Jones & Thirup (1986). This value is comfortably larger

than one estimate of statistical noise of about 0.50"

(Swanson, 1993). In two test cases (see examples below),

most of the helix backbone was evident already at

1.50", while/3-sheet structure was seen only at a lower

level (1.3-1.00"). These observations emphasize that the

optimum threshold may not be the same in all regions

of the structure.

Core tracing may find six or more joins for prominent

maxima; in addition to finding those connections likely

to correspond to chemical structure, it connects maxima

in neighboring chains, albeit at a lower level. (Indeed,

the C, N and O atoms in proteins form not more than

three bonds to each other, so one can expect only two

or three 'real' connections to a maximum.) For a given

maximum, the connection table entries order the joins

by their density value, with the first entry having the

largest density. The 'join order' of a join is the highest

order in any of the several maxima lists in which the join

appears. Thus, a join between two maxima which is first

in one list and second in the other would have join order

2. Connections of join order 1 and 2 usually correspond

to main chain, and those of order 3 to branches, although

sometimes the 'chain' visits a strong side residue or

crosses a disulfide bridge. A simpler way to include

approximately the same set of connections is to draw

all joins above threshold which are either first or second

entries for any maximum.

Page 6

700

CORE TRACING

Displays of only joins of order 1 and 2 in an initial

MIR-phased map (solvent flattened) at 3/, resolution

showed suggestions of helices and sheets and aided in

the location of molecular boundaries. With no informa-

tion other than density level, one should argue that the

highest joins give the most likely connections. However,

one may still examine some of the lower connections for

more acceptable structural shape or topology. In regions

of disorder where the connections are at a lower level,

the first and second joins to a maximum may give clues

to the correct path, even if they are nominally below the

global threshold.

By following the connectivity in the maximum-join

tables and restricting paths to be above a threshold,

the length of chains can be determined. Typically there

are many short paths, but only a few very long ones.

This provides a powerful selection method to emphasize

prominent aspects of a map by focusing only on the

longer paths, and is useful for finding molecular bound-

aries (cf. Fig. 5) or for studying a map on the scale of

domains or whole molecules.

Two conventional means of restricting display volume

are the use of hardware clipping planes (z axis) and

density contouring in a small box. A density partition

provides a more flexible way to define a volume of

interest by combining the grid points associated with

some selected subset of features. To restrict the display

to the neighborhood of a partial model, determine the

feature volumes which contain atoms of the model, and

then optionally extend this volume outward in successive

layers by adding features (and their volumes) connected

to the previous subset. One need not have even a partial

model: the original subset could be a few very prominent

features or those features within some radius of a point.

Core tracing can be restricted to such a subset of features,

or can include the entirety of any path which contains a

feature in the subset, thus providing clues for extending

the model.

A density partition provides a framework in which to

ask statistical questions. Does the density connectivity

correspond to the model connectivity? Are there statis-

tical differences between features in solvent and protein

volumes? How do feature locations and connectivity

change as a function of phasing or noise? Can one

correlate the shape or bulk of feature volumes with

certain side chains? What is the variability of density

maxima near specific atom types (Ca, carbonyl O, main

or side chain)?

To adequately explore these questions will require

comparison of a number of structures, but Table 1 gives

some preliminary results. Refined atomic coordinates

were compared to MIR maps for two structures (see next

section for references and more details). To get main-

chain statistics a sequence of atoms (...N--Ca--C--

N-..) is transformed into a sequence of features. Pairs

of maxima are determined from the density partition

corresponding to atom pairs along the chain. Two atoms

Table 1. Connection statistics

Breaks

% Main chain

% Outside

Map

1.3tr

1.0a

1.3a

1.3tr 1.5tr

Astacin

Ht-d 4a

Ht-d 4b

Ht-d 3a

Ht-d 3b

Notes: Astacin has 200 residues, each molecule of Ht-d has 202 residues.

'Ht-d 4a' and 'Ht-d 4b' designate two independent molecules in a

four-derivative map. 'Ht-d 3a' and 'Ht-d 3b' designate corresponding

molecules in an earlier three-derivative map. 'Breaks' gives the number

of discontinuities in the main-chain trace at 1.3tr. '% Main chain' gives

the percentage of joins present along the main-chain path at two levels

when compared with the total number of joins required for a connected

main chain. '% Outside' gives the percentage of joins in the molecular

volume which connect to features outside that volume.

in the same feature volume are presumed connected and

ignored. When adjacent atoms correspond to different

maxima, the density level of the join between these

maxima is found. The '% main-chain' statistic compares

the percentage of joins above a specified level to the

total number of joins required to form a complete chain.

Counting terminations of sequences of joined features at

some threshold gives the number of 'breaks'. The astacin

and Ht-d four-derivative maps are clearly better than the

three-derivative Ht-d map, but even the good maps show

only 80% main-chain connectivity at 1.0a.

Molecules in a crystal are not isolated. To estimate

intermolecular connectivity, the maxima whose feature

volumes included atoms from a single molecule were

determined. Then all joins above a threshold which

referenced any maximum in this set were found. The per-

centage of these joins which also connect to a maximum

not in the set is an estimate of the connectivity between

molecules ('% outside' in Table 1). No effort was made

to determine whether such outside paths extend for short

or long distances or actually made contact with other

molecules. All of the examples show more than 10%

outside connections at a relatively high threshold (1.5or).

The astacin map may have slightly lower values because

it was solvent flattened.

Examples

Core tracing can be instructively compared with the

original density or with the final model at several dif-

ferent scales: close up (Fig. 2, a few residues), at the

scale of elements of secondary structure (several helices

or a r-sheet, Fig. 3), or large volumes encompassing

domains, whole molecules or even multiple asymmetric

units when delineating molecular boundaries (Figs. 4 and

5).

Our examples are taken from two zinc metallopro-

teases, astacin and a snake venom type IV collagenase

Page 7

STANLEY M. SWANSON

701

-6 . ""

'"5

.--.--

r,.

O,-1

Page 8

702

CORE TRACING

""6

req.

",=

Page 9

STANLEY M. SWANSON

703

""*

.--10

,,.z,

° ,...,

t,Y=

r,,.) ,x: I

° ..

°,.., ,....-i

.a-a

Page 10

704

CORE TRACING

(Ht-d). Although neither structure was solved using core

tracing, the solution of Ht-d employed core tracing for

the determination of molecular boundaries and in the

initial fitting.

Astacin has a well formed 3 A solvent-flattened MIR-

density map (Gomis-Rtith, Stoker, Huber, Zwilling &

Bode, 1993). The map (their Model 1) is based on six

heavy-atom derivatives. Helices, some /-strands and a

number of bulky side chains are clearly visible when the

map is compared with the final model. Several figures

have been made from this structure in order to have a

wider representation in our examples.

The majority of our examples are taken from Ht-d, a

structure under investigation in our laboratory (Zhang et

al., 1994). Despite extensive effort, we did not solve this

structure from our MIR maps. From the beginning, there

were tantalizing views of helices with enough detail to

verify the handedness and fix the space group (P65).

None of the partial models would successfully refine.

In retrospect, the problem was a poor density, phased

on only three heavy-atom derivatives. The presence of

two molecules in the asymmetric unit related by non-

crystallographic symmetry also caused difficulties in

the phasing. The structure was solved by a graphical

replacement procedure using a closely related model

(Gomis-Rtith, Cress & Bode, 1993) positioned to co-

incide with identifiable parts of the three-derivative

maps (Zn, several helicies). At about the same time,

a reassessment of our data sets found a usable fourth

derivative. This produced a much better MIR map which

fits the refined model but is independent of it. A four-

derivative map was used in most of the illustrations,

resulting in clearer views than those obtainable with

a three-derivative map, but being less faithful to the

structure-solution process.

Helices were discernible at higher thresholds than/3-

sheets in both structures, although there were always

some gaps in the main chain, even at 1.0r. Figs. 3(a)

and 3(b) show the four major helices of Ht-d compared,

at 1.5or, to core tracing and contours. For the illustration,

the core tracing and contours have been restricted to

feature volumes containing the refined atoms and a

single guard layer. This is a four-derivative map; the

three-derivative maps showed only one or two helices

consistently.

fl-strands are not clearly resolved in Ht-d, even with

a four-derivative map. There tends to be more cross

connectivity between strands than connectivity along

the main chain. However, when viewed edge on, the

sheet is separable from the rest of the molecule. Figs.

3(c) and 3(d) show the/3-sheet region of astacin which

is better resolved into strands, and which shows some

of the bulkier side chains. Even this sheet has cross

connectivity at a level low enough (1.0tr) to capture most

of the main-chain connections.

Sometimes a core trace or Greet skeleton is recogniz-

able as helical, but often there are additional connections

along the axis of the helix creating a rod-like bundle of

lines. The sheets with cross connections may look more

like a net than parallel strands. Only some pieces of

random coil consistently appear chain-like. Nonetheless

there are probably characteristic signatures for helices

and sheets which we can learn to recognize.

Volumes containing a few hundred residues are not

too cluttered for viewing. Thus, it is possible to compare

a core tracing to a Co trace of an entire molecule.

The two independent molecules of Ht-d are shown in

Figs. 4(a)-4(d) and 4(h) (202 residues each). There are

some differences between the two molecules but, by

and large, the core tracings are similar. This is more

a test of the quality of the maps and phasing than of the

ability of core tracing to find connectivity. As another

example, the amino domain (residues 1-99) of astacin

is shown in Figs. 4(e)-4(g). For clarity in these printed

figures, the rendering volume has been restricted to the

neighborhood of a single molecule or domain. Although

such a restriction is not possible before a structure is

solved, the dynamic rotation, scaling and clipping on an

interactive display compensate in part.

Finally, we explore the delineation of molecular

boundaries in Fig. 5. A single static view can only

suggest what can be seen interactively. We present

projections perpendicular to the z axis, since the

asymmetric unit in P65 is relatively thin in that direction

(15 A in Ht-d). The densities used are the two MIR maps

for Ht-d analysed in Table 1. The relative quality of the

maps is apparent in the differentiation between molecule

and solvent regions. The selection of long paths for the

core tracing reduces the clutter without degrading the

signal (Figs. 5a and 5b). Some paths may appear short in

the figures because they have been clipped at a boundary

and continued elsewhere. Our structural references are

a dot plot of Ca positions (Fig. 5f) and an augmented

dot plot also containing C6 positions (Fig. 5e) which

fleshes out the molecular volume but does not obscure

the solvent volume. Dot plots of joins and maxima

were also examined as alternatives to core tracing. Figs.

5(c) and 5(d) show the highest 30% of the maxima

for Ht-d (threshold about 2.3tr). The maxima clump in

the molecular volume, especially for the four-derivative

map where only 9% of the dots lie in solvent volume

(Fig. 5d). The difference between solvent and protein is

still visible for the three-derivative map although 25%

of the dots lie in solvent.

Spin-offs: implications for Greer's algorithm

Some of the techniques developed for core tracing,

especially neighborhoods and the sort procedure, are

useful when applied to other algorithms. A sort on a

13-bit density provides several thousand bins for values,

many more than the usual dozen or so value ranges

used in Greer implementations. Thus, the list of points

Page 11

STANLEY M. SWANSON

705

to be removed at each step (set R) is smaller and

much less likely to contain neighbors. Lists of neighbors

connected to each (nearby) neighbor of the central point

can be computed once and used to determine whether

deletion of a specific point will disconnect a point set.

The computation depends only on distance, not on the

presumed shape of a neighborhood (conventionally a

27-point, 3 x 3 x 3 box), and thus is adaptable to non-

- W,. , • '

.:"-,, ,: ",,-.. W,. ,, "- p, ",, :",, C ",.-,

(a)

)'" ..5"

,,(2

)",,r.

,,,')

*':

•

¢-

•

"._

' C/"._:" J ,.,r --\"

' C/'5-" J ,A --

...,7-(- . " ,

..,r- ( - ,.

(b)

• .. ".:-:'....:,:.:-,...." .:..:..'.'. .:, :...'.-;.:,....'..,...-.

• -.

.."

"" ('.

"".d'.

"" " . . . . . .

:. ""

.:'.'"\'"

"" "..

•

i.'.,

-:-'

....

. 7,. :,.:..

• ." ". ": ."'..C ".',:,- "" ..-'.

....

. ." "..: ..- ,.

..'...

•

..:-" ....':.,.'(,-:..-:.

:.:

.." :-..

.'::.,%'-:

:: .. ;-:

-... ...-

,,.'o "." .,'.:

.-,

• -...',.....j.'':: -

•

:..'..i:. :;.... ).7:- 2 7;....-:.'..'.i: -,...-).-: : .;"

7:.....:.

• .: "::..

.... :"" . ":"" :;.:i." '- : ::: ; ./"-. " ."." :i'i .:

• ....: - :-...:.-.: ,-...'..,.-:- .-.. ;.: : :.......;..: ,..-'- .,..:..

... .:... (....-.,:-.........

. • ........-.......

....

....,

•

......

. .'....,

•

•...

. ..,

• . .

•

.. ..,.

• .

L:,'.'.-'.'-""

. . - - .

• ." "r.. '.'..,;

" ."'.

•

• "..G" "...:.,'v':..':.

: ":

• -.:'' ..."..,%"-:

.': -. ; ":

"'.. .:" :.', "." ..",

a',

• :-....-.:.','.''.'., ',

•

::..i:. .,....).':- 2 ):{...'..:.:'.i: -,-...).-.. : .', 7:;..'..':.

(c)

.•

... _,:-,:...•.

•

¢o

.(..•

•

" ",d-;

" ."

"'¢';

• •

. ; • =.. ""

. .... .. ""

• ,*•'. • ....

• e ..".

• " °' "'• ""

t" .."-

•

. "..

"-....."

"....

"..

"-.... "

o "....

.. •:.,.= .,....:./:. !,: :......-.,:= .c:...:./:. !,, •

•: , .:i""

"/.,:, : !. ": , .'.i"" : "" "

, . 2 "r •

•

• .'"

... • •...*

. • .

...

"t *" •" -.

• "

" ""

': °" "" ".

...

.....

•

...•" ;,"...,.;,... "......,., • " :....

".'. • . ......:,.' •,.

•

•.

•

.""

"'L...

• ":>. 'L...

•

• '-.

•

. ....=•.

. ;....

•

• ':"..'..'.

t" .."-

•

• '':".

"'" ""

¢ "

"..

"-.....¢

-¢.

•

• -..

.'." ..¢

, • ...

• "

. .,.,.',.=

.k'" " .

:" .'." :,

•

.-,% ..

. :.. ,..,

•

.'"

r'; "J-"

"•

r-" ".-"

":-.:i ""

.... :"

, .,:

...

: t

,. " • "1".

• .,

. .'"

•

....'....,

. .'"

. .'..

. •

...

• .

...

• ,.

"°...

• . /."::.."?.::i:..;."

%"::......:.,:.;....

(d)

• •;•,

. ":•.

•

"3 .'. 5", c'.-

-.', ,.'..-":. ;'--

•

"5 ;;a'='"

"-';:-'.

• •

'.'-:Y-.':.:

..'."

•

'.-.,,:2-.-.

...

• •

•...,

.¢,x..:...c

"...'..-,-t..:..-,"

• •.

....

¢:'...:.

• . :- ..-.t

. •.

..-.

¢:-..t.•.

• : ..,

•

."..:':'-.

"'"..'

....

r.".: . . . .

"..:0"..'."

:"".'-"'r.'r'::..

• .•.*'-t.

,.. ,

"°t r.,'.

• ...:":.'•

,.. ,

"',

.,-.

• -'..

"'...:'. *..'. -.'.',,",.

•

.'..

"L".

,.:. d-'',.'^",.

•

;.,.:.'-.:..- "," .';..; ;:.,..:" ." :.::.,. ,.'...:..:... ". 7;..: ::.,..." ? • :.::.,.

:'.

;-J

• ";"'t., ,':" "':':--'.:.

%" :".

;::

- ""J,

"."' "':" ..'.: '-.-"

.. ,.

•..

•

¢ ....

.. ,.-..

•. ¢....

• "•

..'..

i.-..

" :':

.'.

..:.o,..,..

• ".'...

....

%:.....

...-,. .-,"

....

-,...... ...-. .-,"

". :!..

•., :. ". .:,..

•., :..., ", ..

•

:'.

":.°2. 7,',V..:.

.',"

:'.

":.°:'. --D,'..:...',"

•

"...t,:';Y,:;."

:! '7 ","r., ".'"

. "...,', ;:'.:Y,:;." " :'tr'r.':

.."

.,.......',

::"',.'^-':

. . . . .

..-'....',

:. "."'.'-'.

•

".';:,'-'

-"7"J;"."

-i':

..-.:. '..."

:" : '-. "; -'.Y ..';:-j;"."

.":•--."..."

• -

•.

". %...

•

.t ...

...

• -...

.t ...

• ".

- - t • • ..,..

:...

. . . . . .

. - ",..

.-..

%:.-... ...--.

).-,"

.... ,......

"...=... : ...,'

(o)

":'.....'':..'..:C'.'./ .... ,,......:'

..?:':.:.:::/":!:....

...'.•.

....

......,:..:...

:...'.-:."...:,;.

"i-.'."?;'."" :'-:':":::'

• .

..'...°.

• •;-

..... ,

:ii!ii! ;

..:iii!;!i:

::ii : / :)

• . 2

::,. ....

.,'...:.

,.-....-..:.:;:

::(ii:.;:!i:: :

Fig. 5. Molecular boundaries and packing• Very large scale views of two MIR maps of Ht-d comparing long core-tracing segments above 1.3o

containing at least four maxima [(a), (b)] to high maxima [(c), (d)] and to selected atom positions [(e), (d0]. Views on the left [(a), (c)] are from

an early three-derivative map. Views on the right [(b), (d)] are from a much better four-derivative map. Resolution is about 3 A and neither map

is solvent flattened• The space group is P65. The views extend two unit cells in x and y and are perpendicular to the z axis (1/6 of a unit cell

thick), for a total of four asymmetric units containing eight protein molecules• (a) Core tracing of a three-derivative map (a higher threshold

gives a clearer view of the solvent volume)• (b) Core tracing of a four-derivative map. (c) The highest 30% of the maxima in a three-derivative

map. (d) The highest 30% of the maxima in a four-derivative map. (e) Ca and Cb positions. (f) Co positions alone•

Page 12

706

CORE TRACING

orthorhombic lattices. Use of the Greer 'cube' on a

hexagonal grid results in a 27-point rhomboid which is

geometrically biased. Joins and maxima can be located

(above the initial threshold) and rendered by core-tracing

techniques, or the Greer-Hilditch trace can be con-

structed by a steepest upward gradient search from joins

to maxima. I have not found that the test for hole creation

is very useful in three dimensions and believe that it can

be eliminated with no important consequences.

Implementation

The algorithm has been implemented on a VAX linked

to an E & S PS330 display (program

FRODO:

Jones,

1978; Pflugrath, Saper & Quiocho, 1984) and on the

E & S workstation (program

PRONTO,

a variant of

FRODO),

but not yet fully integrated with either system.

Currently, a typical calculation (277 200 map points) re-

quires 4.5 CPU min on a VAX station 3100 (five VPU's)

for the identification of features. Another much quicker

step produces MOL files for display with

FRODO.

The user may then choose to display combinations

of MOL files (chains, branches, at selected levels) to

examine local or global volumes. Of course, the usual

electron-density map contours can be displayed at any

time. Because of the visually overwhelming complexity

of the contour map, it is usually viewed only intermit-

tently. The (real-time) interactive application of core

tracing in the program

PRONTO

is a project for the

immediate future.

This has been a long-term slowly maturing project.

I wish to thank R. Swanson for numerous discussions

and just for listening as I tried to explain the ideas.

E. F. Meyer has provided encouragement, enthusiasm

and financial support. D. Zhang and E. F. Meyer have

provided initial user feedback in the application to Ht-d.

F. X. Gomis-Rfith and W. Bode have kindly provided the

map and model for astacin used in several of the figures.

Funds have come (indirectly, as salary and laboratory

support) from the National Science Foundation, the Of-

fice of Naval Research, the Robert A. Welch Foundation,

the Texas Agricultural Experiment Station, ICI Americas

and Schering-Plough.

APPENDIX

A simplified, two-dimensional version of the algorithm

in both Fortran and C has been submitted as supple-

mental material.* This

Appendix

is intended to address

* A simplified version of the algorithm has been deposited with

the IUCr (Reference: GR232). Copies may be obtained through the

Managing Editor, International Union of Crystallography, 5 Abbey

Square, Chester CH1 2HU, England.

some general implementation issues and to indicate

where the algorithm must be extended beyond what

was sketched in the body of the paper. Representation

of density by integers permits the use of an efficient

sorting algorithm but exacerbates the problem of equal

values in neighborhoods. There can also be problems

near the boundary surfaces of a density volume. Lastly,

interpolation of feature positions may not gain much

accuracy.

In order that points with the same density value are

added at the 'same' time to the appropriate growing

nodules all over the map, the indices of the points are

sorted by the density value of the point. The points are

visited in top-down density order, from highest to lowest.

Since the density takes on a medium-sized (8000) range

of integer values, a modified radix sort (Knuth, 1973) has

been developed which takes only two passes through the

density (speed of order N). A conventional radix sort is

multi-digit whereas we use a single 13-bit 'digit'. The

first pass counts the number of points with each value,

and allocates variably sized bins in a table of indices

to the density points; each bin will contain the indices

of all density points of the same value. The second

pass puts the indices into the corresponding variably

sized bins by using and incrementing a pointer into

the bin for the value. In practice, the table of indices

takes twice as much space (32-bit values) as the density

so that the random insertion of indices into such a

large table produces excessive memory paging. Instead a

multipass scheme has been adopted: by ignoring values

outside of a subrange, only part of the sorted table

of indices is made for a pass through the density and

all of the neighborhoods for that density subrange are

analysed before sorting a lower density subrange. This

is effectively a variable radix two-digit sort.

Since the density values are restricted to integers,

occasionally neighbors will have equal values. A local

search of equal values is used to determine whether a

constant region is an extended maximum, or just a flat

stretch on a path. The frequency of occurrence (0.4%)

and volume (two to three points) of constant regions are

small with a density range of 8000, but the frequency

increases to 8% of points above ltr with a range of

250. This is one of the motivations for using a 16-bit

density (together with allowing feature marks as large as

32 000) instead of a more economical eight-bit density

representation. With clusters of equal density values, the

radix sort may contribute a positional bias since the

density indices are entered into the sorted table by a

scan of the entire density in a particular sequence.

Boundaries on the density map give rise to truncated

neighborhoods, and to incomplete lists of associated

features (you cannot see easily beyond the boundary,

although a unit-cell/symmetry continuation could be

devised at the expense of much more complicated dis-

tance and neighbor calculations). I have not found a

satisfactory way to find features in small volumes and

Page 13

STANLEY M. SWANSON

707

combine only the feature lists without worrying about

the completeness of the search at boundaries. Perhaps

sufficiently thick guard layers would work (about eight

grid points!).

Connectivity is not a purely local relation (local in

the sense of depending only on fixed neighborhood of

a point). Connections can be long and twisting or bent,

and cannot be determined until a considerable volume

of density is analysed. The model of a single line

segment from maximum to join may miss the core of

the density in some cases. This seems to be infrequent,

since the spacing of features is usually comparable to the

resolution. It can be checked during a post-processing

phase, or when the display is generated, and taken care

of by drawing a more complicated sequence of lines.

Density is normally sampled on a grid. One is tempted

to 'gain accuracy' by interpolating positions and values

from the neighboring grid points to the 'true' off-grid

feature. Is this reasonable or justified when one is already

sampling at one half or one third the resolution? Compli-

cated interpolation schemes (cubic or higher order, some

least-squares techniques) have seemed to violate local-

ity in my limited testing; they have needed excessive

grid span or sometimes have placed the 'interpolated'

position outside of the neighborhood. I have concluded

that a'simple three-point quadratic scheme along axis

directions is all that is justified and even that may be

misleading since the density shape is not always a power

law. Gradient searches of density calculated at arbitrary

points (with a slow Fourier transform) suggest that at

most an extra bit or two may be gained in positional

accuracy. Also consider what has been implicitly used

historically to position the model: with a single low

contour level one fits to the midpoint of the contour cage,

not to the maximum of the contained density (unless

the density profile is symmetrical). Is an interpolated

maximum more or less stable to noise than a midpoint

of containing contours?

Finally, some remarks on coding details are given

below.

Density is stored in an array of 16-bit integers. Con-

ceptually this is a three-dimensional array, but the di-

mensioning is dynamic, depending on the layout of the

map for a particular problem, and actual reference is

done with a single index into a linear array. Density

values are restricted to the range -8100 to -100 by

scaling and shifting so that the same array can be used

for feature marks (positive values) which replace density

(negative values) as the classification of grid points

proceeds.

A neighborhood is defined by a list of nearby lattice

points, sorted so that the nearest ones come first. The

definition is a template for all the neighborhoods in

a given density map; it is given in terms of offsets

from the central point, and is computed only once.

For each neighbor we retain three offsets along grid

axes (used to check whether the neighbor lies within

the map volume), a linear offset for indexing into the

density array relative to the index of the central point

and information about the distance from the center. For

each density point, a call to a check routine returns a list

of legal neighbors within a specified distance which do

not fall outside the spatial bounds of the map. A single

loop then drives the investigation of the neighborhood

of the point; a re-analysis of a map with a different size

neighborhood is handled by a different list of neighbors,

not by changing limits on three nested loops along with

different boundary tests. The technique is dimension

independent: we have used it in film spot analysis (two

dimensions) as well.

The examples of cubic and hexagonal neighborhoods

in the main text are in terms of equal real-space grid

increments; actual structures often have different grid

distances along each axis. An initial sorting of distances

takes care of small discrepancies and warns of wildly

unequal axis divisions.

A merged list of features is kept, with joins intermin-

gled with maxima. For each maximum, a list of joins

referencing it is noted; for each join, the set of features

which were seen in its neighborhood. Also kept are the

array index (translatable to grid indices) and the density

value for each feature. Since the marks are assigned

sequentially, a feature with a smaller mark has a higher

(or equal) density value than one with a larger mark.

To determine whether a candidate for a join involves

new information, a search of pre-existing connectivity

information is made, by constructing 'fans' of connec-

tions from one of the marks seen in the neighborhood.

Starting with a maximum, all its joins are added to the

list, then all new maxima contained in those joins, and

so on, until either all of the features in the original

neighborhood are found, or the specified search depth

('remoteness') is exceeded. If the fan of connected

features does not extend to all the original neighboring

features, the candidate becomes a new join (and has its

own feature number). Maxima connected by a common

join have a remoteness of 3, those with two intervening

joins and an intervening maximum have a remoteness

of 5, and so on.

References

GOMIS-Rt3TH, F. X., CRESS, L. F. & BODE, W. (1993). EMBO J. 12,

4151-4157.

GOMIS-Rt3TH, F. X., STOCKER, W., HUBER, R., ZWILLING, R. • BODE, W.

(1993). J. Mol. Biol. 229, 945-968.

GREER, J. (1974). J. Mol. Biol. 82, 279-301.

HILDITCH, C. J. (1969). Mach. lntell. 4, 403-420.

JOHNSON, C. K. (1977). Report of Workshop on Computer Graphics

in Biology, 22-24 June 1976, Columbia Univ., pp. 39-46. NIH,

Biotechnology Resources Programs, USA.

JOHNSON, C. K. (1978). Acta Cryst. A34, S-353.

JONES, T. A. (1978). J. Appl. Cryst. 11, 268-272.

JONES, T. A. & THIRUP, S. (1986). EMBO J. 5, 819-822.

Page 14

708

CORE TRACING

KNUTH, D. E. (1973). The Art of Computer Programming, Vol. 3, Sort-

ing and Searching, pp. 170-178. Reading, Massachusetts: Addison-

Wesley.

P'r-LUGRATH, J. W., SAPER, M. A. & QUIOCHO, F. A. (1984). Methods and

Applications in Crystallographic Computing, edited by S. HALL & T.

ASmAKA, pp. 404--407. Oxford: Clarendon Press.

SWANSON, S. M. (1979). J. Mol. Biol. 129, 637--642

SWANSON, S. M. (1993). Am. Crystallogr. Assoc. Annu. Meet., May

1993, Poster PB 19.

WmLIAMS, T. V. (1982). PhD thesis, Univ. of North Carolina, Chapel

Hill, USA.

ZHANG, D., BOTOS, I., GOMIS-ROTH, F. X., DOLL, R., BLOOD, C.,

NJOROGE, F. G., FOX, J. W., BODE, W. & MEYER, E. (1994). Proc.

Natl Acad. Sci. USA. Submitted.