1.  Representing & Operating on Terrain

This is the big problem currently occupying me.

1.1  Representing Terrain

Current Methods

The volume of terrain data is large and increasing, because of new gathering techniques such as IFSAR and LIDAR. Digitizing the land area of the earth at a 1m horizontal resolution, would produce over 1014 points. Current representational techniques for terrain elevation are either array based or triangle based. An array or matrix of elevations has the major advantage that it is very simple to store and to operate on. However, because the Earth's surface is not developable, there are problems. First, the array must be aligned to some one coordinate system. If equal angular steps in latitude and longitude are taken, then the spacing is nonuniform in meters. The USGS does this for level-1 DEMs. On the other hand, if equal spacing in meters is used, as the USGS does for higher level DEMs, then adjacent cells don't join cleanly, and different rows will have different numbers of points. For these reasons, a Triangulated Irregular Network, to be discussed later, can be attractive.

Nevertheless, if I do choose to represent terrain with an array, many extensions are possible, such as a Fourier series. This is one of large class of representations that form a function f(x) as a linear combination of a a set of basis functions. The linear combination concept is quite powerful. Many operations on f are themselves linear, and so may be performed separately on the basis functions, and then the results combined. Examples include differentiation and integration. In addition, the Fourier series representation compats well with the underlying physics of materials, electricity, sound, and so on.

E.g., a drum sounds different from a violin string depending on which individual frequencies constructively interfere. Truncating a Fourier series representing a signal produces another legal signal, which is the low-pass filter of the original.

However a Fourier series is unsuitable for terrain because the truncated Fourier series produces something that is not real terrain. The result is too continuous and has too many local minima. In addition, a Fourier series, of any order, cannot accurately represent a discontinuous function. Indeed, the attempt causes the series to overshoot on each side of the discontinuity; this is known as the Gibbs phenomenon. Increasing the degree of the series does not reduce the amplitude of the overshoot, but merely makes it narrower.

As with a Fourier expansion, a B-spline on one variable is a linear combo of a set of basis functions; however here each basis function has a finite support, which induces the desirable property of local control. Changing one parameter, such as a control point or a weight, changes the function only over a local region. In contrast, changing one coefficient in a Fourier expansion, changes the resulting function everywhere. This makes it harder to model physical processes that have local effects.

A wavelet basis set also has finite support, but the wavelets are not necessarily smooth (C2). This is appropriate when there is no reason for the function being approximated, such as terrain, to be smooth. However, the basis set is infinite, as for a Fourier series. Therefore, an arbitrarily close approximation can usually be found by using enough terms. The degree of approximation can be reduced by truncating hi-order terms. These advantages account for wavelets' popularity.

Infinite series representations, such as wavelets and Fourier series, have the desirable property of progressive transmission. Suppose that I am transmitting the terrain over a slow communication link. The receiver can begin reconstructing the surface as soon as the first few coefficients arrive. Each new arriving coefficient improves the receiver's approximation. This allows the surface to be previewed, and the transmission to be halted if desired.

The major competition to array-based representations is a Triangulated Irregular Network (TIN), first implemented, in the GIS context, in 1973 by me, see Triangulated Irregular Network. This has the major advantage that it is not tied either to a particular coordinate system, or to a developable surface. A TIN is an approximate, or lossy, representation, unless every point is included, which would remove the purpose of using a TIN. However, as described in the previous paragraph, an exact representation is overkill. Given a cell of terrain and a desired error tolerance, e, the output is a TIN accurate to within e. The TIN is probably not optimal, in the sense of having a minimal number of triangles, since determining that probably requires exponential time. However, it is reasonable.

The TIN is actually a linear bivariate, non tensor-product spline. If the terrain is considered to be mostly C2 continuous, then a higher degree spline would be more efficient. However, our preliminary experiments do not see this.

Terrain Properties and Representation Evaluation

With a formal foundation for terrain, which doesn't yet exist, we could formally ask about best algorithms, instead of testing heuristics on test samples. Nevertheless, some important properties are obvious, such as the following.

  1. Real terrain is more irregular than databases such as DEM-1 cells. Algorithms tested only on that data might unknowingly be exploiting that artificial smoothness.
  2. Terrain is not differentiable many times, i.e., is generally not Cn for n>1. Indeed, the physical phenomena that generate terrain generally do not depend on, or generate, high order continuity. The major exception is the curvature, in the horizontal plane, of stream beds.
  3. In places, the terrain is C-1, i.e., discontinuous. Indeed, although techniques such as contour lines have difficulty representing them, these may be the most important features for many users.
  4. There might conceivably be scale variance because of physical properties such as the finite strength of rock.
  5. The data is heterogeneous; different regions have different statistics. For example, river basins occur mostly above sea level, while mid-ocean ridges occur under sea level. Some regions above sea level are karst terrain, with sink holes, while other regions have rivers.
  6. The heterogeneity gets worse if we consider other planetary bodies, such as the Moon, because of the varied formation mechanisms, such as impact craters or large volcanoes.
  7. There are long range correlations, such as river basins, that often extend from one side of a continent almost to the other ocean. In other words, terrain is often not spatially symmetric in x and y.
  8. Neither is it symmetric in Z. There are many local maxima but few local minima, since they usually fill in and become lakes.
  9. Nevertheless, 2D sampled data sets may have local minima, even if the original terrain does not, and the sampling is exact. Informally, this happens when the outlet channel for the apparent local minimum flows between two sample points. This effect causes problems for drainage network programs. If it is not remedied, typically by filling the local minima, then there will be no long drainage networks, since they will terminate quickly in these false local minima.

What is a good evaluation metric for a terrain representation schemes? All schemes are approximate; the question is what types of errors are more tolerable. The appropriate quality metrics are nonlinear; a simple metric, such as RMS error, is inappropriate. Preserving features, such as gullies, is more important than maintaining exact elevations. On a more abstract level, representation approximations that preserve the important derived properties, such as visibility and mobility, and critical. When representing multiple related layers, preserving the relationships is more important than accuracy in each layer separately.

A representation in which legal terrain was easier to represent than illegal terrain is desirable. For example, if terrain is represented by contour lines, then nothing in the data structure prevents contour lines from crossing each other, which would represent a mathematical impossibility. A program written in a defensive programming style would have to check its input contours for legality, which is tedious. On the other hand, if terrain is a grid of elevations, then this mathematical impossibility doesn't happen. This level set form of representation made a difficult problem vanish, by incorporating the important relationships between adjacent lines. This is another aspect of the importance of maintaining internal consistency mentioned below.

A current research question. is to take this legality enforcement to the next level, by making it hard to represent, say, terrain with local minima. This formal foundation might be either deep or shallow. A shallow formulation attempts to capture terrain's behavior w/o modeling the behavior's cause. E.g., a shallow model of people's height might ignore physiology, and assume only that the heights are normally distributed. On the other hand, a deep model of terrain would incorporate relevant formative agents, such as water erosion, uplift, and downcut.

1.2  Overdetermined PDE

This is my new approximation method for terrain elevation. Finding an array of elevations from contour elevations and isolated points is a classic problem, with many references. However, it seems not to be as completely solved as it may seem. The amount of actual implementation experience with many of the proposed algorithms is unclear, especially since the test cases are often small. Artifacts are often apparent in the output. Production mapmaking often involves hand-tweaking of the data, for example to infer mountain tops.

My method is conceptually simple, and handles both isolated points and continuous contour lines of data, possibly with gaps. It infers mountain tops automatically. The process goes as follows.

  1. Construct an over-determined system of linear equations, based on a Laplacian (heat flow) partial differential equation (PDE), as follows. The unknowns are the elevations at every point, including the points of known elevations.
  2. For every known point, create an equation making it equal to its known elevation.
    zij = hij
  3. For every point, known or unknown, create an equation making it the average of its four neighbors. (Border points will have 2 or 3 neighbors.)
    4zi,j = zi-1,j + zi+1,j + zi,j-1 + zi,j+1
  4. Optionally weight the latter class of equations by a factor R relative to the former class. The larger that R is, the more that smoothness is emphasized at the expense of accuracy. Experimentally, a little less accuracy allows a lot more smoothness.
  5. Solve this overdetermined sparse system.

Notes:

  1. This method is an approximation, not an interpolation. Given the quality of the source data and the increased representation efficiency allowed by using an approximation, this is quite ok.
  2. A nice feature of this method is its conceptual simplicity. It is basically solving a Laplace PDE, but with one major innovation. Making even the known points to be the average of their neighbors allows information to flow across contour lines so that the slope is continuous across the contours. That remedies a failing in some competing methods with springs and PDEs, to which it is superficially similar. However, I are not aware of prior art for the key idea of making even the points on known contour lines to be approximately the average of their neighbors. Indeed, this would not have been feasible before the availability of large sparse overdetermined equation solvers.
    The input may be a mixture of continuous lines of data and isolated points, which is not true of all competing methods. Local maxima are automatically inferred, again not often true with other methods.

    Approximating a Surface Through Nested Square Contours, With Various Smoothness
  3. Preliminary experiments are shown in the above figure, where the contour lines are nested squares. The corners make that particularly difficult to fit a smooth surface. This method works quite well, and shows that a small amount of approximation adds a large amount of smoothness, so that the original contour lines are not evident in the resulting surface.
  4. The major unsolved problem is solving the overdetermined system of equations. We start by using Matlab; solving on a 400x400 grid with 160000 unknowns, is easy. The question to be determined is when the intermediate storage swell will make Matlab infeasible, and force us to find a better solver.
  5. The Laplacian PDE might be replaced by a thin plate PDE,
    zxx2+2 zxy2+zyy2=0,
    with the goal of minimizing the curvature energy, and equation
    20 zij = 8(zi-1,j + zi+1,j + zi,j-1 + zi,j+1) - 2(zi-1,j-1+zi-1,j+1+zi+1,j-1+zi+1,j+1) - (zi-2,j + zi+2,j + zi,j-2 + zi,j+2)
    The question is whether the extra complexity is worth it.

Wavelet-Based Compression

We've tried lossy wavelet-based compression methods, with quite good results. However those wavelets, which were devised for image compression, which has different evaluation criteria from terrain, are no longer state-of-the-art. Therefore, progress might be possible.

2.  Scooping

This longer term, more speculative, and deeper part of our research is to model terrain using basis operations derived from the terrain features, and how it was formed, such as hills and gulleys. Terrain has been described in such terms, probably forever, in geography. Considerable effort has also been invested earlier in terrain modeling. Nevertheless, some things are new: the amount of data, the available computing power, and the mathematical tools. There are several challenges here. Conceptually, the problem is identify descriptive parameters that are sufficiently efficient. Operationally, since these operations are nonlinear, decomposition of a problem into subproblems, application of a linear operator, followed by recomposition into the solution, is inapplicable.

The appropriate nonlinear operators remain to be determined. However, I intend to start with a scooping operator. This operator, analogous to scooping earth out of the side of a hill, will initially proceed as follows. Starting from a given point, it will proceed in an downhill direction from there along a straight line trajectory for a given distance. It will scoop out a new gully of some width, whose bottom has some slope. As described, there are five parameters, although that could be varied. The scooping operator has several properties.

  1. It will not create a local minimum. This desirable feature contrasts to every other known terrain representation method.
  2. It naturally lends itself to the creation of complex drainage systems, again in contrast to other representations.
  3. It is quite nonlinear, and so has a power not available to linear methods.
  4. It is more appropriate to modeling eroded plateaus than, say, volcanos. However, much of the earth's surface is, approximately, eroded plateaus.

Deposition and Erosion is another possible operator. (Terrain erosion is unrelated to erosion in image processing.) This means mean to dump a volume of earth at some point on the terrain and let it, and the nearby terrain, slump for some distance. If a zero material is added, but a region is allowed to slump, then we have an operator to smooth things out. This is operator is also nonlinear, but might create local minima. That models the real world, where a landslide might dam up a new lake.

The big research question is how to model some given terrain with these operators. I plan to proceed by implementing an interactive testbed and playing, in order to develop an intuitive sense. I expect to see unexpected things, which always happens when working with real data. A prime example of this was finding terrain where higher observers were not necessarily more visible than lower ones, \citep{fr-hinbv-94}. Another was seeing the maximum error occasionally get worse as a TIN was refined.

Nevertheless, an initial approach to working with scooping operators could go as follows. Identify where the next scoop should occur. Optimize its parameters. Repeat. A good scoop should remove a large fraction of the earth in its footprint, and must not remove too much. The general goal is to model terrain, not to model how it was formed. A smaller set of operators is better. If one operator can create terrain that had several geological formative mechanisms, then so much the better.

2.1  Operating on Terrain

Important operations include compression, visibility, mobility, drainage, representing multiple related data layers, and multiple data source conflation.

Compression --- Lossy and Lossless

One necessary terrain elevation operation is compression. It is possible that, at some future time, storage devices will be so cheap and small that compression will no longer be necessary, but this will not occur soon. Indeed, new generations of computers arise, such as Personal Digital Assistants (PDAs), which have severely restricted power and resource bounds, so the compression is still necessary. More compact data can also lead to an increase in computation speed. Indeed, less I/O is necessary, and more data will fit into the caches of every level, which leads to greater speed. This concept can be validated with a simple program to write and read back larger and larger arrays. This is why tools such as Edgebreaker are so valuable.

An array-based representation can model (the sampled points of) the terrain exactly, before lossy compression techniques are applied. However a lossless representation is an unnecessary overkill, because the original data are inaccurate\label{approx}. E.g., consider the widely used level-1 DEM data, where one cell is an array of 1201x1201 points spaced at 3 arc-seconds, covering a 1 degree by 1 degree region. The objective for absolute accuracy is 130m horizontal and 30m vertical, 90% of the time, altho the relative accuracy is probably better. Therefore, since the original data has only limited accuracy, exact compression is wasteful. A lossy compression suffices, and is much more compact. However, is the reconstructed lossily compressed terrain accurate enough to support the other operations?

2.2  Hydrology

Determining water flow through various regions of the terrain is a major application, because of the damage caused by flooding, and the money spent preventing it. There are many drainage models, such as the Corp of Engineers' HEC-1 and HEC-HMS, the Soil Conservation Service's TR-20 and TR-55, the Watershed Modeling System's Rational Method, the Hydrologic Simulation Program - Fortran (HSPF) (much modified since), and the USGS-NHA-FEMA National Flood Frequency (NFF) Model. One excellent flow calculation program on large, LIDAR-derived, databases, is the TerraFlow project.

However, these algorithms require external storage to process large datasets. We can remove this restriction, as follows.

First, identifying local minima is necessary because otherwise they will trap all the drainage networks early enough that no long networks will form. These local minima are not caused only by measurement errors, but also by the sampling process. That is, the outflow from an apparent local minimum might flow between two elevation sample points, and so never be seen in the elevation array. Identifying these problems is complicated since the catchment region of a local minimum may be a region of many pixels, which all have to be found and elevated so that the water will flow out.

My idea is to use the flow between two points as an equivalence relation, do its transitive closure, and find connected components, using my fast connected component program. It was designed to find connected components in a 3D cellular array. A typical input, with 512x544x544 = 151,519,232 voxels, took 15 seconds on a 1600 MHz Pentium to find all 534,723 6-connected components. (26-components are equally handleable.) My program also handles 2D data as a special case. One test, on a 18573x19110 array, found the 32858 components in 25 seconds on a the same machine. Therefore we are confident in being able to find these regions, quickly, on large databases.

Next, calculating the flow through each point can be structured as a sparse system of linear equations. On a level-1 DEM, there will be 12012 unknowns, but only 5 times that many nonzero entries. Solving this is well within the reach of existing sparse matrix solvers.

Visibility

The visibility index of a point on the terrain, or the area of the terrain visible by an observer at that point, is an important derived property. One future application will be for communication on the surface of the Moon. The Moon has no ionosphere to reflect radio, nor any stable orbits for communication satellites. Therefore communication between distant points that are out of sight of each other will probably require either fiber optic cables, or radio relay towers. Siting those towers is a visibility problem. This is also relevant to Mars, because that planet's weak ionosphere will limit high frequency radio propagation.


Sample Viewsheds

The viewshed of an observer is the specific region of the terrain that that particular observer see. The above figure shows some sample viewsheds. Other relevant buzzwords include Observation and Fields of Fire and Concealment and Cover.

This multiple observers case is particularly interesting and complex, and has many applications, such as cell phone towers siting. Here, the identities of the observers of highest visibility index are of more interest than their exact visibility indices, or than the visibility indices of all observers. Again, a military planner needs to put observers so that there is nowhere to hide that is not visible from at least one. This leads to a corollary application, where the other side's planner may want to analyze the first side's observers to find places to hide. In this case, the problem is to optimize the targets' locations, instead of the observers'. In addition, the problem may be constantly changing as the observers are moving or vanishing. This is reminiscent of multilevel game theory.

A new potential application is posed by a group of people using short-range quasi-line-of-sight radios, perhaps in urban areas. Siting base stations to serve all the users is too strong a condition. It may be sufficient that the radios be able to hand off messages to each other, in a wireless version of the internet. Determining whether all the users form a connected set, telling the users where to move to form a connected set is an intervisibility problem.

Again, a planner for a scenic area may consider each place where a tourist might be to be an observer, and then want to locate ugly infrastructure, such as work yards, at relatively hidden sites. S/he may wish site a forest clearcut to be invisible to observers driving on a highway sited to give a good view. Finally, an architect may be trying to site a new house while following the planning board's instruction that, ``You can have a view, but you can't be the view.''

When computing visibility, the speed of execution on large datasets is of more importance than may be apparent. Many prototype implementations, demonstrated on small datasets, do not scale up well. That may happen either because of the size and complexity of the data structures used, or because of the asymptotic time behavior. In addition, large datasets may contain cases, which did not occur in the small test sets, that require tedious special programming by the designer. In a perfect software development process, all such cases would have been theoretically analyzed a priori, and treated. However, in the real world, testing on the largest available datasets increases some confidence in the program's correctness.

A large enough quantitative increase in execution speed leads to a qualitative increase in what we can do, such as by enabling massive searches for optimal solutions. Finally, although the size of available data is growing quickly, it is not necessarily true that available computing power is keeping pace. There is a military need to offload computations to small portable devices, such as a PDA. A PDA's computation power is limited by its battery, since, approximately, for a given silicon technology, each elemental computation consumes a fixed amount of energy. Batteries are not getting better very quickly; increasing the processor's cycle speed just runs down the battery faster.


Viewshed Uncertainty

How much we really know about visibility is a serious questions, because small changes in the input can cause large changes in the input. The above figure shows the viewshed of a section of the Lake Champlain West cell, seen from Mt Marcy in the lower left. The black regions are fairly certainly hidden; the light grey regions visible, with the shade indicating the elevation. The light and dark grey crosshatched regions are more of less probably visible; the difference being as follows. When a line of sight is extended from the observer to each potential target, it will almost always run between adjacent elevation points (posts). Various rules are possible to calculate an interpolated elevation for where the line of sight passes between those two points, p1 and p2, with elevations z1 and z2. If the line of sight's elevation there, z, is lower than the interpolated elevation (for any pair of adjacent points that this line of sight passes between), then the target is hidden. A darker level of grey shows that this point is hidden with respect to more of the possible interpolation rules.

The black region shows targets for which z < min(z1,z2), and the light grey region targets for which z > max(z1,z2). The other two cases use a terrain elevation, zl, that was linearly interpolated from z1 and z2. In the dark gray region, min(z1,z2) < z < zl, and in the light gray region, zl < z < max(z1,z2).

These grey regions together are half the whole plot. In other words, trivial changes in how elevations are interpolated between the known posts, changed the, computed, visibility of half of all the targets. Nevertheless, the visibility status of the other half of the targets is reasonably certain; so an important problem is to find them. This task is complicated by the nonlinearity of visibility, which is like a step function. Pondering that concept led to the idea of just good enough computation.

By analogy to using significant digits, there is no point in extracting information that isn't there. In this context, if the input elevations are noisy, and the output visibilities are very sensitive to the input, then a much quicker, less accurate, computation might produce just as good output. This will produce a serious quantitative speed increase will may translate into a qualitative improvement in what can be done.

2.3  Mobility

Determining how to move over terrain is closely related to visibility because either staying in contact with, or staying out of contact with a set of observers is often a criterion. Then there are the considerations of ground slope, surface type, etc. There has been considerable work in mobility over many years. However, this potential future area of interest may still be open because of the new higher resolution, terrain databases and new representation techniques.


50' contour lines cross the lake shoreline

Representing Multiple Related Data Layers

There may be several correlated layers of data for the same surface, such as elevation and slope. The slope may be needed for determining soil erosion, or helicopter landing feasibility. We may want to store the slope explicitly since computing it from elevations is error-prone, especially if the elevations are inaccurate. However, multiple correlated layers of data cause problems when they are lossily compressed. Indeed, the layers may become inconsistent with each other. Users may consider this to be worse than being inaccurate. Here are some examples from a commercial mapping product. The above figure is a section of a map with a layer of surface features, such as a lake shoreline, and another layer of elevation data, represented by 50 foot contour lines. It may be observed that the lake surface's elevation appears to have about 100 feet of relief. In other words, this fails to observe various spatial vertical behavior rules.


Stream flows obliquely to the contours

The above figure shows another example of an elevation layer, and a feature layer, containing a stream. (Ignore the double line feature in the left, which is a road.) Unfortunately, the stream is not flowing normally to the contours, as it should. What is frightening is the users of commercial GIS packages often think that these problems are natural.

2.4  Broader Impact

The growing impact of computer cartography is shown by the number of SW packages adding mapping tools, or having 3rd party mapping tools added. They include the mapping toolbox of Matlab, and the Spatial part of Oracle. Then there are the many companies specifically devoted to computer cartography, such as ESRI, DeLorme, Mapinfo, and Tele Atlas, to name only a few. This is a large and growing market segment. Currently, each vendor develops his own proprietary algorithms, which often leave something to be desired, as the above two figures show. Having a public set of high quality tools available to everyone will help everyone.

This market segment is growing because it impacts such diverse areas of society, from visual nuisance mitigation to flood damage minimization to military planning. Following are some impacts, just for the important application of visibility determination.

3.  Efficient Spatial Algorithms and Data Structures

The previous problem is an application of my long term unifying theme of efficient spatial algorithms and data structures, emphasizing small, simple, and fast data structures and algorithms. This is applicable to computational cartography, computer graphics, computational geometry, and geographic information science. This theme has continued since before PNPOLY, an 8-line subroutine I wrote in 1970 to test whether a polygon contains a point. In C, PNPOLY has only eight lines of executable code. I still get queries about PNPOLY, because it is still smaller, faster, and more robust than competing methods.

3.1  Geometric Operations on Hundreds of Millions of Objects

The current goal here is that anything I implement should work for N=1,000,000,000. My 3D connected component program does a universe of over 1000x1000x1000 voxels easily. My program to find mass properties of the union of lots of polygons (currently, squares) does 4&middot' 108 edges and will soon do 109. This program demonstrates the functioning of several useful lower level operations, like edge intersection and point location.

A big reason for testing algorithms on large examples is that it's fun. However, it's possible that people aren't generally running large examples now because most current implementations are much too slow, if they don't crash. Large examples are an excellent stress test of the whole system. Also, these algorithms scale down; when large examples take minutes, small examples run too fast to measure.

As computers get faster and more capacious, efficiency becomes ever more important. Processing very large datasets requires special techniques. Here are some implementations and the techniques used.

  1. Connect is a connected component program for a 1000x1000x1000 binarized cellular universe. Two voxels are in the same component if they are adjacent (either orthogonally, or optionally diagonally). A typical execution time to find all 4,539,562 components in one universe of 1,212,153,856 voxels, 608,589,768 of which were empty, was 96 CPU seconds on a 1600 MHz Pentium. For a 2-D example with 18573x19110=354,930,030 1-bit pixels, the CPU time to find the 32858 components was 25 seconds.

    2-D Connected Components
    The above figure shows a detail from that example, with the output components colored randomly.

    Area of Union of Many Squares
  2. Union3 is a fast algorithm for computing the volume, area, and other mass properties, of the union of many polyhedra. The above figure illustrates the problem, on a smaller example, in E2. Union3 is well suited for parallel machines. A prototype implementation for identical random cubes has processed 20,000,000 polyhedra, containing half a billion subfacets, on a dual processor Pentium Xeon workstation in about one hour.
    Union3 processes all the polyhedra in one pass instead of repeatedly combining them pair by pair. The first step finds the candidate output vertices. These are the 3-face intersections, edge-face intersections, and input vertices. Next, the candidates are culled by deleting those inside any polyhedron. The volume is the sum of a function of each survivor. There is no statistical sampling. Input degeneracies are processed with Simulation Of Simplicity. Since Union3 never explicitly determines the output polyhedron, messy non-manifold cases become irrelevant. No complicated topological structures are computed. Union3's simple flat data structures permit it to fork copies of itself to utilize multi-processor machines. The expected time is linear in the number of input, even when the number of intersections is superlinear.
    The principal data structure is a 3-D grid of uniform cells. Each cell records overlaps of itself with any edge, face, or polyhedron. Intersection tests are performed only between objects overlapping the same cell. However, if a cell is completely contained in some polyhedron, (covered), then no intersection tests are performed in it, since none of those intersections would be visible. Indeed, altho there may be a cubic number of intersections, all but a linear number occur in these covered cells, and are never computed. Therefore the final time is linear.
  3. Nearpt3 is a couple of subroutines to find nearest points in E^3. Preprocess takes a list of fixed points, and preprocesses them into a data structure. Query takes a query point, and returns the closest fixed point to it.
    Nearpt3 is very fast and very small. It is designed to process datasets with at least hundreds of millions of points. Test data from the following sources was used: the Stanford 3D Scanning Repository, Stanford University's Digital Michelangelo Project Archive of 3D Models, Georgia Institute of Technology's Large Geometric Models Archive, and the University of North Carolina at Chapel Hill Walkthru Project.
    We used the largest available PLY file datasets from each source. The largest dataset tested to date is the St Matthew PLY file from the Digital Michelangelo Project Archive, with 184,098,599 vertices. We used 184,088,599 as fixed points and 10,000 as queries. Total I/O, preprocessing, and query time on a 2.2MHz Xeon was 160 CPU seconds.

    Nonzero Intersections of Counties and Hydrography Regions
  4. Another implementation locates many points in a large planar subdivision in expected constant time per point, and also finds all the intersections among a large set of short edges. These operations are embedded in the map overlay application: finding the areas of all the non-empty intersections of the polygons in 2 planar maps. In the example shown in the above Figure, the first map is counties of the coterminous US, with 55068 vertices, 46116 edges, and 2985 polygons. The second map is hydrography polygons, with 76215 vertices, 69835 edges, and 2075 polygons. The total time (excluding I/O) was only 1.58 CPU seconds.
  5. My '''linear expected-time object-space hidden surface algorithm''' relies on the useful property that, if the input faces are IID (independently and identically distributed), then the expected number of visible edge pieces is linear in the number of input edges, and I can find them all in expected linear time.
    This is remarkable since the total number of intersections of the projected input edges might be superlinear if the edges' lengths do not shrink fast enough. However, in that case, most of those intersections are hidden, and I can delete them in groups w/o spending even constant time per one.

Several design principles guide these algorithms.

  1. A careful consideration of what topology to represent explicitly is one major design strategy. E.g., if a Triangulated Irregular Network is being computed, then the only explicit data type is the triangle; edges are implicit.
  2. Eschewing pointers whenever possible, in favor of more compact strucures, is another policy.
  3. Local topological data structures are strongly favored. E.g., finding the area of a general polygon (possibly including many separate or nested components with holes) requires knowing only the locations and neighborhoods of the vertices, w/o any nonlocal info, even edges. This technique also reduces the number of special cases and degenerate conditions.
  4. A uniform grid is superimposed on the data to quickly cull out pairs of edges that cannot intersect. For reasonable data distributions, this finds the number of intersections in time linear in their number.
  5. In cases such as the area of the union of the squares, many edge intersections are inside other squares and so do not cause output vertices. The edge segments causing these unnecessary intersections can often be identified, with the uniform grid, and deleted, w/o intersecting them pair by pair. Thus, even a quadratic number of such useless intersections do not affect the algorithm's linear (in the number of output vertices) expected running time.
    That last result is quite significant.

An adversary can select very unevenly distributed input to make all these algorithms execute very slowly. However, this has never been observed on real data. Indeed, some of the data structures are much more robust against uneven data than is apparent at first glance. All of the algorithms described have expected execution times linear in the size of the input plus the output. There are no log factors. These techniques also parallelize well.

3.2  Broader Impact

The impact is on Computational Geometry and its applications. The problems are spatial, and the solutions will draw on previous progress in that domain. In return, these applications will motivate new areas of research in the field. The scientific infrastructure is enhanced by my continuing to make code and data available on my website. This work also foster immediate collaborations between the fields of Computational Geometry and Geography. Indeed, I have associated with geographers and drawn motivation from their problems since high school. Future collaborations with geology and civil engineering are also forseen. However, there are also broader impacts on society.

The impact of my work is effected by the new representations and algorithms to be developed, and also by the code to be made freely available, as I have done throughout my career. This will also impact the research community by providing a base on which to build other projects.


<< | Research Page List | >>