Search & AnalyticsTech 3 of 6

Geospatial

Google S2 Geometry

The spherical geometry library behind Google Maps, Spanner, and every serious geospatial system

Use Cases

Spatial indexing for location-based queries (find all restaurants within 2km)Geo-fencing and point-in-region containment checksRide-sharing supply/demand matching by geographic cellsRegion covering for spatial database queriesProximity search at planetary scaleMap tile generation and spatial partitioning

Architecture

Why This Matters

Every location-based service eventually hits the same problem: how to answer "find all things near this point" without scanning every row in the database? The naive approach (compute the distance from the query point to every stored point) is O(n) and does not scale. PostGIS handles this with R-tree indexes, which works well up to tens of millions of rows. But on Spanner, DynamoDB, or Bigtable, R-trees do not exist.

S2 solves this by converting spatial queries into range scans on 64-bit integer keys. Any database with a B-tree index can do a range scan. That means spatial query support works on literally any storage system, from SQLite to Spanner, without needing specialized spatial index types.

Google built S2 internally and open-sourced the C++ implementation. It powers spatial features in Google Maps, Google Earth, Spanner's spatial queries, and Uber's H3 was partly inspired by it (though H3 took a different approach with hexagons). When someone searches for "coffee shops near me" on Google Maps, S2 is doing the spatial math.

How the Cell System Works

Start with the Earth. S2 projects the surface onto a cube with 6 faces. Each face is a square that maps to a region of the globe. The projection is not conformal (angles are not preserved) but it is area-preserving enough that cells at the same level have roughly similar sizes, within a factor of 2.

Each face gets recursively subdivided into 4 children using a quadtree. Level 0 is the face itself. Level 1 splits it into 4. Level 2 splits each of those into 4, giving 16 cells per face. At level 30, there are 4^30 cells per face, which is about 1.15 × 10^18 total cells across all 6 faces. That is more than enough to represent any point on Earth with sub-centimeter precision.

The cells are ordered along a Hilbert curve, which is a space-filling curve with a special property: it visits every cell exactly once while preserving spatial locality. Two cells that are adjacent on the Hilbert curve are (almost always) adjacent on the Earth's surface. This is the key insight that makes everything else work. Assigning each point its Hilbert curve position as a 64-bit cell ID means "find all points near X" becomes "find all cell IDs in a small numeric range around X's cell ID."

Compare this to a Z-order curve (used in some other spatial indexing schemes). Z-order curves have "jumps" where the curve crosses large spatial distances between numerically adjacent values. Hilbert curves minimize these jumps, which means fewer range scans for the same spatial query.

Using S2 for Spatial Queries

The standard pattern for building a location-based service with S2:

At write time: For each point (latitude, longitude), compute the S2 cell ID at the chosen level. Store the cell ID as an indexed column alongside the coordinates. For a ride-sharing app, level 12 (roughly 3.3 km²) works well. For a local search app, level 14 (roughly 200,000 m²) gives finer granularity.

At query time: Given a search center and radius, construct an S2Cap (spherical cap) representing the search area. Use S2RegionCoverer to compute a covering, which is a set of cell ID ranges that together cover the cap. Set max_cells to 8-12 for a good balance between coverage tightness and query count. Then issue one range scan per cell range against the indexed cell ID column. Union the results and filter out false positives (points that are within a covering cell but outside the actual search radius).

This approach works on any database. On Spanner, store the cell ID as a column in the primary key and use WHERE cell_id BETWEEN ? AND ?. On DynamoDB, use the cell ID as a sort key and issue BatchGetItem queries. On PostgreSQL, create a regular B-tree index on the cell ID column. No PostGIS extension needed.

Covering Algorithm Tuning

The RegionCoverer is where the precision-vs-performance tradeoff gets tuned. The key parameters:

max_cells controls how many cells the covering can use. More cells means the covering wraps the search region more tightly, reducing false positives. But each cell translates to a range scan, so more cells means more database queries. For most applications, 8-12 cells hit the sweet spot.

min_level and max_level constrain which levels the coverer can use. Setting min_level=12 and max_level=12 makes every covering cell exactly level 12 (roughly 3.3 km²). This simplifies indexing because all cell IDs are at the same level. Allowing mixed levels (say min_level=10, max_level=16), the coverer can use large cells for open areas and small cells for tight corners of the search region. Mixed levels give tighter coverings but require storing cell IDs at multiple levels per point, which complicates the indexing scheme.

A practical compromise: index points at a single level (say level 14), and let the coverer use cells at that same level or coarser (level 14 and below). A coarser covering cell at level 12 covers the range of all its level-14 children, which maps to a single range scan. This provides mixed-level covering benefits without multi-level indexing.

Performance Characteristics

Cell ID computation from lat/lng is O(1) and takes about 100 nanoseconds. A single core can compute 10 million cell IDs per second. This means write-time cell ID computation adds negligible overhead.

Region covering for a typical search radius (1-5 km) with max_cells=8 takes about 10-50 microseconds. This is fast enough to compute on every query without caching.

The real performance question is how many range scans the database can handle. With 8 covering cells, that means 8 range scans. On a well-tuned B-tree index, each range scan is O(log n + k) where k is the number of matching rows. For a table with 100 million points and a 2 km search radius, each range scan returns a few hundred rows, and the total query time is dominated by the database, not by S2.

At Uber's scale (tens of millions of active drivers/riders), the S2-inspired approach processes millions of spatial queries per second across their fleet. The cell ID lookup is so cheap that the bottleneck is always the database, never the spatial math.

Real-World Architecture Pattern

Here is how a food delivery app might use S2 for restaurant search:

Store each restaurant with its lat, lng, and a precomputed s2_cell_id column at level 14. Create a composite index on (s2_cell_id, cuisine_type). When a user searches for "pizza within 3 km," compute an S2 covering for the 3 km radius, then issue queries like WHERE s2_cell_id BETWEEN ? AND ? AND cuisine_type = 'pizza' for each covering cell. Merge results, compute exact distances from the user's location, filter out any points outside the true 3 km radius, and sort by distance.

This pattern turns a spatial query into an index-friendly range scan with a secondary filter. It scales to billions of points without special spatial indexes, works on any database backend, and is exactly what Google does internally with Spanner for location-aware services.

Pros

• Mathematically rigorous. Works on a sphere, not a flat plane, so distance calculations are accurate everywhere on Earth
• Hierarchical cell system gives you 30 levels of precision, from continent-sized to sub-centimeter
• Cell IDs are 64-bit integers, which means you can index them in any database with a B-tree
• Battle-tested at Google scale across Maps, Spanner, and dozens of internal services
• Hilbert curve ordering means spatially close points get numerically close cell IDs, perfect for range scans

Cons

• Steep learning curve. You need to understand spherical geometry and the cell hierarchy to use it well
• The C++ library is the reference. Ports to Go, Java, and Python vary in completeness
• No built-in persistence or query engine. It is a library, not a database
• Debugging spatial issues is hard without visualization tools
• Covering algorithms require tuning (max cells, min/max level) for your specific use case

When to use

• You are building a location-based service that needs to scale beyond PostGIS
• Spatial queries need to be converted to simple range scans on integer keys
• Your system runs on Spanner or another database without native spatial indexing
• You need consistent precision across the entire globe, not just near the equator

When NOT to use

• Simple distance calculations where the haversine formula is good enough
• You already have PostGIS and your query patterns are well-served by its spatial functions
• Your team does not have the time to learn spherical geometry concepts
• You need a turnkey geospatial database, not a library to build on top of

Key Points

•S2 projects the Earth onto a cube (6 faces), then recursively subdivides each face into 4 children using a quadtree. This produces 30 levels of hierarchy. Level 0 is an entire face (~85 million km²). Level 12 cells are about 3.3 km² and useful for city-level queries. Level 20 cells are about 0.3 m², good for individual buildings. Level 30 gets down to 0.7 cm².
•Cell IDs use a Hilbert space-filling curve, not a simple row-major ordering. The Hilbert curve preserves spatial locality far better than alternatives like Z-order curves. Two geographically adjacent points almost always have numerically adjacent cell IDs. This means a spatial query (find all points near X) becomes a small number of range scans on a 64-bit integer index. That works in literally any database.
•Region coverings are the core operation for spatial queries. Given a circular region (latitude, longitude, radius), S2 finds a set of cells that cover it. The tradeoff between precision and number of cells is controlled with min_level, max_level, and max_cells parameters. More cells means tighter coverage but more range scans. Fewer cells means some false positives that get filtered out in application code.
•S2 geometry is exact on a sphere. Most geospatial systems project onto a flat plane (like UTM or Web Mercator) and accumulate errors, especially near the poles and at long distances. S2 does all math on the unit sphere using geodesic calculations. Distance, area, and containment checks are accurate everywhere, which matters for global-scale systems.
•Cell ID encoding packs the face number (3 bits), the position along the Hilbert curve (up to 60 bits), and a sentinel bit for level information into a single 64-bit integer. This compact representation means billions of spatial references can be stored using the same storage as a bigint column. No special spatial types needed.
•S2 supports three main query types: containment (is this point inside this region?), intersection (do these two regions overlap?), and covering (what set of cells approximates this region?). All three reduce to integer operations on cell IDs once the data is indexed. Google uses this pattern in Spanner to support spatial queries without any special spatial index type.

Common Mistakes

✗Choosing the wrong cell level for the use case. Level 12 cells (~3.3 km²) work for ride-sharing and delivery matching. Level 16 cells (~150 m²) work for building-level precision. Using level 30 everywhere wastes storage and creates unnecessarily large coverings. Pick the minimum level that provides the precision actually needed.
✗Using too many cells in a covering. Setting max_cells to 100 produces a tight covering but requires 100 range scans. For a database query, 8-12 cells is usually the sweet spot. Some false positives will come from cells that partially overlap the search region, but filtering those in application code is cheaper than running 100 range scans.
✗Forgetting to handle the edge cases at cell boundaries. A point sitting exactly on the boundary between two cells might match one cell but not the other depending on rounding. Always query the covering cells plus their immediate neighbors (the 8 surrounding cells at the same level) for boundary-safe results.
✗Treating cell IDs as opaque without understanding the hierarchy. A level-12 cell ID contains all level-13 children as a contiguous range. Exploit this for hierarchical queries: to find all level-20 points within a level-12 cell, do a single range scan from cell_id_min to cell_id_max. This is free precision without extra index lookups.
✗Not pre-computing cell IDs at write time. If cell IDs are computed at query time, every query has to convert lat/lng to cell IDs, compute coverings, and then issue range scans. Instead, compute the cell ID at insert time and store it alongside the lat/lng. Then queries are pure index lookups.

Related Technologies