Manage large collections of multidimensional objects (2D/3D)
A spatial object
Contains (at least) one spatial attributes that describes its location and/or geometry
A spatial relation
Is an organized collection of spatial objects of the same entity
Spatial Data
Representation
Points (Cities in large-scale map)
Extent (rivers, forest, etc.)
Vector (approximation by geometric objects)
Raster (A set of pixels in the grid)
Application
Spatial data
GIS
Segemented images
Components of CAD constructs or VLSI circuit
Stars on the sky
…
Spatial database
Users of mobile devices
Geographers, life scientists
…
Features of spatial
Dimensionality
There is no total ordering of objects in the multidimensional space that preserves spatial proximity
Complex spatial extent
No standard definitions of spatial operations and algebra
Relationa indexes (like B+ trees) and query processing methods (sort-merge join, hash-join) are not applicable
Spatial access methods (SAMs) for spatial data have to be defined
Index spatial objects
Facilitate efficient processing of simple spatial query types (e.g. range queries)
Spatial Relationships
A spatial relationship associates two objects according to their relative location and extent in space. Sometimes also called “spatial relations”.
Can refer to a database relation which stores spatial objects.
Classification
Topological relationships
Distance relationships
Directional relationships
Topological relationships
Each object is characterized by the space it occupies in the universe (A set of pixels).
A set of relationsips between their boundaries and interiors
Boundary
Interior (some may not have, points, line segments, etc.)
A hierarchy of relations
intersect (or overlaps)
equals
inside
contains
adjacent
disjoint
Distance relationships
Associate two objects based on their geometric (Euclidean distance), and it’s usually abstracted into human mind.
Distance relationships are expressed either explicitly or by some abstract distance class.
Directional relationships
Associates two object based on their relative orientation according to a global reference system.
Spatial Queries
Applied on one (or more) spatial relations to retrieve objects staisfying some spatial relationships
Nearest neighbor query
Spatial join
Range query
Spatial selction
window query
Spatial Query Processing
Evaluating spatial relationships on geometric data is slow.
A spatial object is approximated by its minimum bounding rectangle (MBR)
Process
Filter: The MBR is tested against the query predicate
Refinement: The exact geometry of objects that pass the filter step is tested for qualification
Spatial Access Methods (SAMs)
The problem of indexing spatial data
No dynamic access method with good theoretical worst-case guarantees for range queries
SAMs aim at the minimization of the expected cost.
Indexing of multidimensional points
Point access methods
Divide the apce into disjoint partitions and group the points according to the regions they belong
Not effective for extended objects (may need to be clipped into several parts which leads to data redundancy and affects performance negatively).
Object clipping can be avoided if we allow the regions of object to overlap.
Optimization
Group the objects below into 3 groups of 4 objects each such that the MBRs of the groups have the minimum overlap
Hard optimization problem
The R-tree
Concept
Group object MBRs to disk blocks hierarchically
Each group of object is a leaf of the tree
The MBRs of the leaf nodes are grouped to form nodes at the next level
Grouping is recursively applied at each level until a single group (the root) is formed
Elements
Leaf node entries: <MBR, object-id>, all leaves are in same level
Non-leaf node entries: <MBR, ptr>, pointing to entries
Root: have at least two children
Non-root node parameters
M
m
m <= M/2
Usually m = 0.4 M
Range searching using an R-tree
Range_query (query W, R-tree node n)
If n is not a leaf node
For each index entry e in n such that e.MBR intersects W
Visit node n’ pointed by e.ptr
Range_query (W, n')
If n is a leaf
For each index entry e in n such that e.MBR intersects W
Visit object o pointed by e.object-id
Test range query against exact geometry of o; If o intersects W, report o
May follow multiple paths during search
Different search predicates are used for different realtionships with W
Construction of the R-tree
Dynamically constructed/maintained
Insertions/deletions interleave with search operations
Insertion similiar to B+ Tree, but with special optimization algorithms
Choose the path where a new MBR is inserted
Split overflow nodes
Underflows in deletions
Deleting the underflow leaf node
Re-insert the remaining entries
R*-tree
Only different in the insertion algorithm (compared to R-tree), aiming at constructing a tree of high quality
A good tree
nodes with small MBRs
nodes with small overlap
nodes that look like squares
nodes as full as possible
Optimization
Minimize the area covered by an index rectangle (small area means small dead space)
Minimize overlap between node MBRs (Minimizes the number of traversed paths)
Minimize the margins of node MBRs (Square-like nodes, smaller number of intersections for a random query, better structure)
Optimize the storage utilization
Nodes in tree should be filled as much as possible
Minimizes tree height and potentially decreases dead space
Insertion heuristics (Select the path)
Least MBR enlargement after insertion
Least MBR overlap after insertion
Node Spliting
Determine the split axis
For each axis (i.e. x and y axis)
Sum=0;
sort entries by the lower value, then by upper value
for each sorting (e.g. lower value)
for k=m to M+1-m
place first k entries in group A, and the remaining ones in group B
Sum = Sum + margin(A) + margin(B)
Choose axis with the minimum Sum
Distribute entries along axis
Along the split axis, choose the distribution with minimum overlap
If there are multiple groupings with minimal overlap choose <A,B> such that area(A)+area(B) is minimized
Insertion heuristics: Forced Reinsert
Forced Reinsert
When R*-tree node n overflows, instead of splitting n immediately, try to see if some entries in n could possibly fit better in another node
Find the 30% furthest entries from the center of the group
Re-insert them to the tree (not to be repeated if another overflow occurs)
Slightly more expensive, but better tree structure:
less overlap
more space is utilized (more full nodes)
Bulk-loading R-trees
Given a static set S of rectangles, build an R-tree that indexes S.
Method 1: iteratively insert rectangles into an initially empty tree
Feature
tree reorganization is slow
tree nodes are not as full as possible: more space occupied for the tree
Method 2 (x-sorting): bulk-load the rectangles into the tree using some fast (sort or hash-based) process
sort rectangles using the x-coordinate of their center
pack M consecutive rectangles in leaf nodes
build tree bottom-up
Feature
R-tree is built fast
good space utilization
results in leaf nodes that are have long stripes as MBRs
Method 3 (Hilbert sorting): use a space-filling curve to order the rectangles
much better structure, but still the nodes have large overlap
Method 4 (sort-tile-recursive): Sort using one axis first and then groups of sqrt(n) rectangles using the other axis
Usually the best structure compared to other bulk-loading methods
K Nearest Neighbor Search
Given a spatial relation R, a query object q, and a number k <|R|, find the k-nearest neighbors of q in R.
We can have more than one k-NN sets (with multiple possible equidistant furthest points in them).
Distance measures and MBRs
Distances between MBRs lower-bound the distances between the corresponding objects
dist(MBR(oi),MBR(oj)) ≤ dist(oi, oj)
Distances between R-tree node MBRs lower-bound the distances between the entries in them
The distance between a query object q and an R-tree node MBR lower-bounds the distances between q and the objects indexed under this node
Depth-first NN search using an R-tree
Start from the root and visit the node nearest to q
Continue recursively, until a leaf node nl is visited.
Find the NN of q in nl.
Continue visiting other nodes after backtracking as long there are nodes closer to q than the current NN.
Large space can be pruned by avoiding visiting R-tree nodes and their sub-trees
Should order the entries of a node in increasing distance from q to maximize potential for a good NN found fast
Can be easily adapted for k-NN search
Requires at most one tree path to be currently in memory – good for small memory buffers
Characteristic of all depth-first search algorithms
Recall that the range search algorithm is also DF
However, does not visit the least possible number of nodes
Also, not incremental – more on this later…
代码语言:javascript
复制
1. visit root
dist(q,M1)<dist(q,oNN)
must visit node M1
2. visit M1
dist(q,m1)<dist(q,oNN)
must visit node m1
3. visit m1
check a,b,c
found new NN:
oNN = a, dist(q,oNN) = sqrt(5)
4. backtrack to M1
check m2dist(q,m2) = 3 >= sqrt(5):
No need to visit node m2
check m3dist(q,m3) = sqrt(5) >= sqrt(5):
No need to visit node m3
5. backtrack to root
check M2dist(q,M2) = sqrt(2) < sqrt(5):
must visit node M2
6. visit M2
check m4dist(q,m4) = sqrt(2) < sqrt(5):
must visit node m4
7. visit m4
check i,j,k
found new NN:
oNN = k, dist(q,oNN) = sqrt(2)
8. backtrack to M2
check m5dist(q,m5) >= sqrt(2):
No need to visit node m5
check m6dist(q,m6) >= sqrt(2):
No need to visit node m6
9. backtrack to root
check M3dist(q,M3) >= sqrt(2):
No need to visit node M3
10. backtrack from root
Algorithm terminates
oNN =k with dist(q,oNN)= sqrt(2) found
Best-first NN search
Put all entries in a priority queue and always “open” the closest one, independently of the node that contains it.
Thus the best (i.e., closest) entry is always visited first.
A more efficient algorithm (given large enough memory)
Optimal in the number of R-tree nodes visited for a given query q
Uses a priority queue to organize seen entries and prioritize the next node to be visited
Adaptable for k-NN search and incremental NN search
In the previous example, we have visited fewer nodes compared to DF-NN algorithm
Only nodes whose MBR intersect the disk centered at q with radius the real NN distance are visited (see if you can you prove this)
The algorithm can be adapted for incremental NN search
After having found the NN can we easily (incrementally) find the next NN without starting search from the beginning?
put objects on the heap
never prune, but wait until an object comes out
The algorithm can be used for k-NN search
use a second heap to organize the NN found so far (same can be done for DF-NN)
no need if we just use the inc. version of the algorithm
… but: The heap can grow very large until the algorithm terminates
代码语言:javascript
复制
Step 1: put all entries of root on heap Q
Q = M1(1), M2(sqrt(2)), M3(sqrt(8))
Step 2: get closest entry (top element of Q):
M1(1). Visit node M1. Put all entries of
visited node on heap Q
Q = M2(sqrt(2)), m1(sqrt(5)), M3(sqrt(5)), M3(sqrt(8)), m2(3)
Step 3: get closest entry (top element of Q):
M2(sqrt(2)). Visit node M2. Put all entries of
visited node on heap Q
Q =m4(sqrt(2)), m1(sqrt(5)), M3(sqrt(5)), M3(sqrt(8)), m2(3), m5(sqrt(13)), m5(sqrt(17))
Step 4: get closest entry (top element of Q):
m4(sqrt(2)). Visit node m4. m4 is a leaf node, so update NN if some object in m4 is closer than the current NN:
oNN = k, dist(q,oNN)= sqrt(2)
Q =m1(sqrt(5)), M3(sqrt(5)), M3(sqrt(8)), m2(3), m5(sqrt(13)), m5(sqrt(17))
Step 5: get closest entry (top element of Q):
m1(sqrt(5)). Since sqrt(5) >= dist(q,oNN)= sqrt(2), search stops and oNN is returned as the NN of q
incremental NN search
Example 1: find the nearest large city (>10,000 residents) to my current position
Solution 1:
find all large cities
apply NN search on the result
could be slow if many such cities
also R-tree may not be available for large cities only
Solution 2:
incrementally find NN and check if the large city requirement is satisfied; if not get the next NN
Example 2: find the nearest hotel; see if you like it; if not get the next one; see if you like it; …
Spatial Joins
Most algorithms focus on the efficient processing of the filter step.
Most spatial predicates on actual objects reduce to intersection of MBRs in the filter step. Thus all algorithms consider mainly the intersect predicate.
Types
intersection joins
Semi-join: Find the cities that intersect a river
Similarity join: Find pairs of hotels, restaurants close to each other (with distance smaller than 100m)
Closest pairs: Find the closest pair of hotels, restaurants
All-NN: For each hotel find the nearest restaurant
Iceberg distance join: Find hotels close to at least 10 restaurants
Three categories of spatial join algorithms
Both inputs are indexed (e.g., synchronized tree traversal)
One input is indexed (e.g., indexed nested loops)
Neither input is indexed (e.g., spatial hash join)
R-tree (Intersection) Join
Applies on two R-trees of spatial relations R and S
Node MBRs at the high level of the trees can prune object combinations to be checked
This pseudo-code version assumes that the trees have same height
Example
run for root(RA), root(RB)
for every intersecting pair there (e.g., A1, B1) run recursively for pointed nodes
intersecting pairs of leaf nodes are qualifying object MBR pairs
Optimization
space restriction
If an entry in n1 does not intersect the MBR of n2 it may not intersect any entry in n2.
Perform two scans in n1 and n2 to prune such entries
plane sweep
Sort entries in both nodes on their lower-x value (lower bound of x-projection)
Sweep a line to find fast all entry pairs that qualify x-intersection
for each of them check y-intersection
Worst-case sub-optimal. But very effective on the average
Worst-case optimal algorithms require advanced data structures for y-intersection. Large hidden constants, thus high cost for this problem size
Can be used with other spatial join algorithms
R-tree join
The most efficient algorithm (assuming that the relations are indexed)
Cannot be used for non-indexed inputs
unless we build on-the-fly R-trees
Comes with some I/O scheduling techniques for minimizing the page accesses
Joining non-indexed inputs
Spatial hash join
Partition based spatial merge join
Indexed Nested Loops
Seeded tree join and Bulk-load and Match build an on-the-fly R-tree
Slot-index spatial join applies hash-join using the entries of a high R-tree level
The refinement step
Step 1: find MBR pairs that intersect
Step 2: compare some more detailed approximations to make conclusions (a.k.a. geometric filter)
conservative approximations
e.g., convex hull
progressive approximation
e.g., maximum enclosed rectangle
Step 3: if still join predicate inconclusive, perform expensive refinement step
can be processed by computational geometry algorithms