Search & AnalyticsTech 5 of 6

Geospatial

PostGIS

The spatial extension that turned PostgreSQL into a serious GIS database

Use Cases

Storing and querying geographic features (points, lines, polygons)Location-based search (find all stores within 10 km)Route planning and network analysisLand parcel management and urban planningGeospatial data pipelines for mapping and GIS workflowsFleet tracking with real-time spatial queries

Architecture

What PostGIS Actually Is

PostGIS is a PostgreSQL extension that adds spatial data types, spatial indexing, and about 800 functions for working with geographic data. When people say "spatial database," PostGIS is usually what they mean. It has been around since 2001, it is the reference implementation for OGC (Open Geospatial Consortium) standards, and pretty much every GIS tool on the planet can talk to it.

What makes it different from using S2 or H3 is that PostGIS understands actual geometry. It does not just discretize the Earth into cells. It can store a polygon representing the exact boundary of Central Park, compute the intersection of that polygon with a 500-meter buffer around a subway station, and return the area of the resulting shape in square meters. S2 and H3 work with cells and indices. PostGIS works with shapes.

How Spatial Indexing Works

The core of PostGIS's performance is the GiST (Generalized Search Tree) index, which implements an R-tree for spatial data. An R-tree groups nearby objects into bounding boxes, then groups those bounding boxes into larger bounding boxes, recursively. The result is a tree where each internal node contains the bounding box of all its children.

Running SELECT * FROM restaurants WHERE ST_DWithin(geom, my_point, 1000), PostGIS first asks the R-tree: "which bounding boxes overlap with a 1000-meter buffer around my_point?" This is the filter step and it is fast because it only compares rectangles. The R-tree returns a set of candidate rows. Then PostGIS runs the exact geometry test on each candidate: "is this point actually within 1000 meters of my_point?" This is the refine step and it is slower but only runs on the candidates, not the whole table.

For a table with 10 million rows and a point query, the filter step might return 200 candidates, and the refine step confirms 150 of them. That is 200 geometry calculations instead of 10 million. The R-tree makes this O(log n + k) instead of O(n).

R-trees have a weakness though: they do not parallelize as well as hash-based schemes. When the table is split across partitions, each partition has its own R-tree and queries that span partitions have to merge results. For a single node with up to ~100 million features, this is not a problem. Beyond that, consider combining PostGIS with S2 or H3 for the indexing layer and PostGIS for the complex spatial operations.

Geometry vs Geography

Geometry treats data as flat, Cartesian coordinates. Operations are fast because the math is plane geometry (Euclidean). But distances are in the coordinate system's native units. With SRID 4326 (WGS84, lat/lng), distances come back in degrees. A degree of longitude at the equator is ~111 km, but at 60° latitude it is ~55 km. Calculating "find everything within 1 degree" gives wildly different physical distances depending on location on Earth.

Geography treats data as points on a spheroid (WGS84 ellipsoid). Distances are always in meters. The math is more complex (geodesic calculations), so queries are about 3-5x slower than equivalent geometry queries. Not all functions support geography. There are about 30 geography functions versus 300+ geometry functions.

The practical approach: store data as geography(Point, 4326) for primarily distance queries on a global dataset. Use geometry(Point, 4326) for primarily spatial operations (intersection, union, containment) and handle distance separately by casting to geography or transforming to a local SRID.

Production Setup

PostGIS performance tuning starts with PostgreSQL tuning:

shared_buffers: 25% of RAM, same as regular PostgreSQL
effective_cache_size: 75% of RAM
work_mem: 256MB-1GB for complex spatial operations (buffer, union on large polygons allocate significant memory)
random_page_cost: 1.1 on SSDs (default 4.0 assumes spinning disks and makes the planner avoid index scans)
max_parallel_workers_per_gather: 4-8 for parallel sequential scans on large spatial tables

PostGIS-specific considerations: set postgis.gdal_enabled_drivers to enable raster functionality. Configure postgis.backend to GEOS for most operations. For large polygon datasets, increase geqo_threshold to prevent the genetic query optimizer from kicking in on complex spatial joins (it usually makes worse plans than the deterministic optimizer for spatial queries).

Spatial index maintenance matters more than most people think. After bulk loading, run VACUUM ANALYZE on spatial tables. The statistics collector needs to understand the data distribution for the query planner to make good decisions. For tables with clustered spatial data (like all restaurants in a city), consider CLUSTER table USING spatial_index to physically reorder rows on disk to match the spatial index order. This improves cache locality for range queries significantly.

Capacity Planning

A single PostGIS instance handles up to 100 million features comfortably on modern hardware (16 cores, 64GB RAM, NVMe). Beyond that, query planning overhead on the R-tree grows. At 500 million features, consider table partitioning by region or using a sharding strategy.

Storage: a point geometry takes 32 bytes (16 bytes for header, 16 bytes for X/Y coordinates). A polygon with 100 vertices takes about 1.6 KB. A table with 100 million points takes about 3.2 GB for geometry data alone, plus the GiST index which is typically 1.5-2x the data size. So roughly 8-10 GB total for 100 million indexed points.

Query performance for ST_DWithin on 100 million indexed points: ~2-5ms for a 1 km radius returning ~100 results. For ST_Intersects with a complex polygon against 100 million points: ~10-50ms depending on polygon complexity. For spatial joins between two 1-million-row tables: 1-10 seconds depending on the operation.

Failure Scenarios

Scenario 1: Missing spatial index causes full table scan. A developer creates a spatial table, loads 20 million rows, and starts querying with ST_DWithin. Queries take 15 seconds each. The team assumes PostGIS is slow. The actual problem is that no one ran CREATE INDEX. Without a GiST index, every ST_DWithin call computes the distance for all 20 million rows. Detection: run EXPLAIN ANALYZE on the query. A "Seq Scan" instead of "Index Scan" means the index is missing or not being used. Recovery: create the index. On 20 million rows, CREATE INDEX USING GIST takes 2-5 minutes. After that, queries drop from 15 seconds to 5 milliseconds.

Scenario 2: Coordinate order swap puts everything in the wrong place. A team imports GeoJSON data where coordinates are in (latitude, longitude) order but PostGIS expects (longitude, latitude). All their points end up mirrored across the equator and the prime meridian. Restaurant locations in San Francisco appear somewhere in the Sahara Desert. Nobody notices until a user reports the map is wrong. Detection: spot-check a few records against known locations. Run SELECT ST_AsText(geom) FROM table LIMIT 5 and verify the coordinates make sense. Prevention: always validate a sample of imported data visually before loading the full dataset. GDAL's ogr2ogr handles coordinate order correctly for most formats, so use it instead of writing custom import code.

Pros

• Full OGC standards compliance (WKT, WKB, GeoJSON, KML, GML). If a GIS tool exists, it probably talks to PostGIS
• R-tree spatial indexing via GiST handles complex polygon queries efficiently
• Raster support for satellite imagery, elevation models, and remote sensing data alongside vector data
• Topology support for network analysis, routing, and connectivity queries
• Massive function library. 800+ spatial functions covering everything from distance to voronoi diagrams

Cons

• Inherits PostgreSQL's single-writer limitation and vacuum overhead
• R-tree indexes are less parallelizable than hash-based spatial schemes like S2 or H3
• Complex geometry operations (buffer, union on large polygons) can be CPU-intensive
• Scaling horizontally requires Citus or manual sharding. Not straightforward.
• Learning curve for the full GIS stack (SRIDs, projections, coordinate systems) is significant

When to use

• You need a full-featured GIS database with standards compliance
• Complex spatial operations: polygon intersection, buffering, union, voronoi, routing
• Your team already runs PostgreSQL and wants to add spatial capabilities
• Integration with the broader GIS ecosystem (QGIS, GeoServer, MapServer, GDAL)

When NOT to use

• Simple proximity queries where S2 or H3 cell IDs on a regular B-tree would suffice
• Billions of points with simple spatial lookups (consider a key-value store with S2 indexing)
• Real-time streaming geospatial data at very high write throughput
• Your workload is purely analytics/aggregation by area (H3 in a data warehouse is simpler)

Key Points

•PostGIS has two spatial types: geometry and geography. Geometry operates on a flat Cartesian plane and is fast, but distances are in the coordinate system's units (degrees for WGS84, which are meaningless as distance). Geography operates on a spheroid and returns accurate distances in meters, but is slower and supports fewer functions. For most web applications, use geography for distance queries and geometry for everything else.
•The GiST (Generalized Search Tree) index implements an R-tree for spatial data. It works in two phases: the index returns candidate rows whose bounding boxes overlap the query region (fast, uses the index), then PostGIS applies the exact geometry test on those candidates (slower, but on a small set). This two-phase filter-and-refine approach is why spatial indexes are so effective.
•ST_DWithin is the right function for 'find everything within X meters.' It uses the spatial index, handles the spheroidal math, and is significantly faster than ST_Distance < X because ST_Distance cannot use the index. This is the single most common PostGIS performance mistake: using ST_Distance in a WHERE clause instead of ST_DWithin.
•SRIDs (Spatial Reference System Identifiers) define the coordinate system. SRID 4326 is WGS84 (lat/lng, what GPS provides). SRID 3857 is Web Mercator (what Google Maps uses for rendering). SRID 32632 is UTM Zone 32N (good for European distance calculations). Using the wrong SRID produces wrong answers. Always store in 4326 and transform to a local SRID for distance/area calculations when flat-plane precision is needed.
•PostGIS supports 3D geometries (XYZ) and measured geometries (XYZM) for applications like elevation models, flight paths, and LiDAR point clouds. Most web applications only need 2D, but the 3D support is genuinely useful for civil engineering, mining, and environmental science use cases.
•The topology module maintains the relationships between spatial features (shared edges, nodes, faces). When a parcel boundary is updated, all adjacent parcels update automatically. This prevents gaps and overlaps that plague simple polygon storage. It adds complexity but is essential for cadastral (land ownership) systems.

Common Mistakes

✗Using ST_Distance in WHERE clauses instead of ST_DWithin. ST_Distance computes the distance for every row, then filters. ST_DWithin uses the spatial index to narrow candidates first. The performance difference is easily 100x on large tables. This is the most common PostGIS mistake and it shows up in almost every beginner project.
✗Storing coordinates in the wrong order. PostGIS uses (longitude, latitude) order, which is (X, Y) in Cartesian terms. GeoJSON also uses (lng, lat). But most human-readable formats and APIs use (latitude, longitude). Swapping these puts points in the wrong hemisphere. Always double-check the coordinate order when importing data.
✗Not creating spatial indexes. PostGIS does not create them automatically. After loading data, run CREATE INDEX idx_geom ON table USING GIST (geom). Without this index, every spatial query does a sequential scan. On a table with 1 million rows, the difference between indexed and unindexed ST_DWithin is seconds vs milliseconds.
✗Using geometry type with ST_Distance and expecting meters. Geometry ST_Distance returns distance in the SRID's units. For SRID 4326, that is degrees. One degree of latitude is ~111 km, but one degree of longitude varies by latitude. For distance in meters, use the geography type or transform to a local projected SRID first.
✗Loading massive datasets without COPY and without disabling indexes first. INSERT statements one by one on a 50-million row spatial dataset can take days. Use COPY for bulk loading, drop the spatial index before loading, and recreate it after. Index creation on the full dataset is much faster than incrementally updating the index per row.

Related Technologies