Gorilla Compression (Delta-of-Delta + XOR)
The Problem: 16 Bytes Per Sample Adds Up Fast
A time-series sample is a timestamp-value pair: an int64 timestamp (8 bytes) and a float64 value (8 bytes). That is 16 bytes per sample, uncompressed.
For a monitoring system scraping 1 million time series every 15 seconds, that is 67,000 samples per second, or 5.76 billion samples per day. At 16 bytes each: 92 GB/day for a medium-sized deployment. At hyperscale (500M samples/sec): 43.2 trillion samples/day = 691 TB/day. No storage budget survives those numbers.
Generic compression algorithms help. LZ4 achieves 3-4x on time-series data. ZSTD gets 5-7x. But Gorilla achieves 12x, because it exploits two properties of time-series data that general-purpose compressors cannot see.
Timestamp Compression: Delta-of-Delta
The Insight
Metric scraping happens at fixed intervals. If you scrape every 15 seconds, consecutive timestamps differ by exactly 15,000 milliseconds. The delta is constant. And the delta of the delta is zero.
Worked Example
Consider six consecutive timestamps from a 15-second scrape:
Raw timestamps (unix ms):
T0 = 1709827200000
T1 = 1709827215000
T2 = 1709827230000
T3 = 1709827245000
T4 = 1709827260000
T5 = 1709827275000
Deltas (T[n] - T[n-1]):
D1 = 15000
D2 = 15000
D3 = 15000
D4 = 15000
D5 = 15000
Delta-of-deltas (D[n] - D[n-1]):
DD2 = 0
DD3 = 0
DD4 = 0
DD5 = 0
Every delta-of-delta is zero. Gorilla uses variable-length bit encoding for these values:
DD = 0 → 1 bit (just a '0' flag)
DD in [-63, 64] → 9 bits (2-bit header + 7-bit value)
DD in [-255, 256] → 12 bits (3-bit header + 9-bit value)
DD in [-2047, 2048] → 16 bits (4-bit header + 12-bit value)
Otherwise → 36 bits (4-bit header + 32-bit value)
With regular intervals, every timestamp after the second costs 1 bit. The first timestamp is stored in full (64 bits), the second stores the delta (14 bits), and every subsequent timestamp is a single bit.
What Happens When a Scrape Is Late?
If the fourth scrape arrives 1 second late (at T3 = 1709827246000 instead of 1709827245000):
Deltas:
D3 = 16000 (instead of 15000)
D4 = 14000 (the next scrape is back on schedule)
D5 = 15000 (normal again)
Delta-of-deltas:
DD3 = 16000 - 15000 = 1000 → 16 bits (fits in [-2047, 2048])
DD4 = 14000 - 16000 = -2000 → 16 bits
DD5 = 15000 - 14000 = 1000 → 16 bits
One late scrape ripples through three delta-of-deltas, costing 48 bits where 3 bits would have sufficed. This is why consistent scrape intervals matter: they are not just an operational nicety, they directly control compression efficiency.
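To make the cost model concrete, here is a minimal Go sketch (an illustration of the bucket scheme above, not Prometheus's or the paper's actual encoder) that counts the bits a timestamp stream would consume and reproduces both the regular and the late-scrape numbers:

```go
package main

import "fmt"

// dodBits returns the bits a Gorilla-style encoder spends on one
// delta-of-delta, using the bucket scheme described above.
func dodBits(dod int64) int {
	switch {
	case dod == 0:
		return 1 // '0' flag
	case dod >= -63 && dod <= 64:
		return 9 // '10' header + 7-bit value
	case dod >= -255 && dod <= 256:
		return 12 // '110' header + 9-bit value
	case dod >= -2047 && dod <= 2048:
		return 16 // '1110' header + 12-bit value
	default:
		return 36 // '1111' header + 32-bit value
	}
}

// timestampBits sums the cost of a timestamp stream: 64 bits for the base
// timestamp, 14 bits for the first delta (the paper's choice; assumes the
// first delta fits in 14 bits), then one bucket per later timestamp.
func timestampBits(ts []int64) int {
	total := 64
	if len(ts) < 2 {
		return total
	}
	total += 14
	prevDelta := ts[1] - ts[0]
	for i := 2; i < len(ts); i++ {
		delta := ts[i] - ts[i-1]
		total += dodBits(delta - prevDelta)
		prevDelta = delta
	}
	return total
}

func main() {
	regular := []int64{1709827200000, 1709827215000, 1709827230000,
		1709827245000, 1709827260000, 1709827275000}
	late := []int64{1709827200000, 1709827215000, 1709827230000,
		1709827246000, 1709827260000, 1709827275000}
	fmt.Println("regular scrapes:", timestampBits(regular), "bits") // 82 = 64 + 14 + 4×1
	fmt.Println("one late scrape:", timestampBits(late), "bits")    // 127 = 64 + 14 + 1 + 3×16
}
```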
Value Compression: XOR Encoding
The Insight
Consecutive metric values tend to be similar. CPU utilization might be 47.2%, 47.3%, 47.3%, 47.2%. A counter that has not incremented has the exact same value. When you XOR two identical IEEE 754 float64 representations, you get all zeros. When you XOR two similar values, most bits are the same and the XOR result has long runs of zeros.
Worked Example
Consider five consecutive CPU utilization readings:
Values: 47.2, 47.3, 47.3, 47.2, 48.1
IEEE 754 hex representation:
V0 = 0x4047999999999999 (47.2 — stored in full, 64 bits)
V1 = 0x4047A66666666666 (47.3)
V2 = 0x4047A66666666666 (47.3)
V3 = 0x4047999999999999 (47.2)
V4 = 0x40480CCCCCCCCCCD (48.1)
XOR with previous:
V1 ⊕ V0 = 0x00003FFFFFFFFFFF → nonzero, 46 meaningful bits
V2 ⊕ V1 = 0x0000000000000000 → zero! → 1 bit
V3 ⊕ V2 = 0x00003FFFFFFFFFFF → nonzero, same 46-bit pattern as before
V4 ⊕ V3 = 0x000F955555555554 → nonzero, 50 meaningful bits in a new position
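These bit patterns are easy to check. A few lines of Go (purely illustrative) print each reading's IEEE 754 representation and its XOR with the previous value:

```go
package main

import (
	"fmt"
	"math"
	"math/bits"
)

// Print each reading's IEEE 754 bit pattern and its XOR with the previous
// value, reproducing the table above. A zero XOR reports 64 leading and
// 64 trailing zeros.
func main() {
	values := []float64{47.2, 47.3, 47.3, 47.2, 48.1}
	prev := math.Float64bits(values[0])
	fmt.Printf("V0 = 0x%016X (%g)\n", prev, values[0])
	for i := 1; i < len(values); i++ {
		cur := math.Float64bits(values[i])
		xor := cur ^ prev
		fmt.Printf("V%d XOR V%d = 0x%016X (leading zeros: %d, trailing zeros: %d)\n",
			i, i-1, xor, bits.LeadingZeros64(xor), bits.TrailingZeros64(xor))
		prev = cur
	}
}
```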
The encoding for XOR results:
XOR = 0 (identical values) → 1 bit (just a '0' flag)
XOR ≠ 0, same leading/trailing zeros as previous → 2 + meaningful bits
XOR ≠ 0, different zero pattern → 2 + 5 + 6 + meaningful bits
The "leading/trailing zero" optimization is key. When V1 XOR V0 produces 0x00003FFFFFFFFFFF, there are 18 leading zero bits and 0 trailing zero bits. The encoder stores only the 46 "meaningful" bits plus metadata indicating where they sit. If the next XOR has the same zero pattern (which consecutive similar values often do), the metadata is skipped entirely.
Decimal readings like 47.2 and 47.3 have long repeating binary fractions, so their XOR results are wide; real metrics are usually kinder. Facebook's Gorilla paper measured production data and found that 51% of values XOR to exactly zero (cost: 1 bit each), about 30% compress to ~26.6 bits, and the remaining 19% need ~36.9 bits. Overall, the paper reports an average of 1.37 bytes (roughly 11 bits) per complete sample, timestamp included.
Putting It Together: End-to-End Example
Take a single time series cpu_usage{host="web-01"} with 8 samples at 15-second intervals, where CPU hovers around 47%:
Sample Timestamp (ms) Value TS bits Value bits
------ ----------------- ------ -------- ----------
0 1709827200000 47.2 64 64 (stored in full)
1 1709827215000 47.3 14 59 (delta + new XOR window, 46 meaningful bits)
2 1709827230000 47.3 1 1 (DD=0, XOR=0)
3 1709827245000 47.2 1 48 (DD=0, XOR reuses the 46-bit window)
4 1709827260000 47.3 1 48 (DD=0, XOR reuses the window)
5 1709827275000 47.3 1 1 (DD=0, XOR=0)
6 1709827290000 47.2 1 48 (DD=0, XOR reuses the window)
7 1709827305000 48.1 1 63 (DD=0, new XOR window)
----- -----
Total: 84 bits 332 bits = 416 bits = 52 bytes
Uncompressed: 8 samples × 16 bytes = 128 bytes
Compressed: 52 bytes
Ratio: ~2.5x for this short series
The first sample is expensive (128 bits), and these particular decimal values are XOR-unfriendly (46-50 meaningful bits per change, as shown above). Production metrics behave better: roughly half of consecutive values are identical, the average cost settles near 11 bits per sample, and over a 2-hour chunk (480 samples at 15-second intervals) the amortized compression ratio converges toward the paper's 12x.
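For completeness, the two cost models from the earlier sketches can be combined to reproduce this table's totals. This remains illustrative; for brevity it assumes that any nonzero delta-of-delta in this series would fit the 16-bit bucket:

```go
package main

import (
	"fmt"
	"math"
	"math/bits"
)

// Reproduce the 8-sample totals above by summing the timestamp and value
// cost models.
func main() {
	ts := []int64{1709827200000, 1709827215000, 1709827230000, 1709827245000,
		1709827260000, 1709827275000, 1709827290000, 1709827305000}
	vals := []float64{47.2, 47.3, 47.3, 47.2, 47.3, 47.3, 47.2, 48.1}

	tsBits := 64 + 14 // base timestamp + first delta
	prevDelta := ts[1] - ts[0]
	for i := 2; i < len(ts); i++ {
		delta := ts[i] - ts[i-1]
		if delta == prevDelta {
			tsBits++ // delta-of-delta is zero: 1 bit
		} else {
			tsBits += 16 // simplification: assume the 16-bit bucket suffices
		}
		prevDelta = delta
	}

	valBits := 64 // first value stored in full
	prev := math.Float64bits(vals[0])
	lead, trail := -1, -1
	for i := 1; i < len(vals); i++ {
		cur := math.Float64bits(vals[i])
		xor := cur ^ prev
		prev = cur
		switch {
		case xor == 0:
			valBits++ // identical value
		case lead >= 0 && bits.LeadingZeros64(xor) >= lead && bits.TrailingZeros64(xor) >= trail:
			valBits += 2 + (64 - lead - trail) // reuse the previous window
		default:
			lead, trail = bits.LeadingZeros64(xor), bits.TrailingZeros64(xor)
			valBits += 2 + 5 + 6 + (64 - lead - trail) // new window
		}
	}

	fmt.Printf("timestamps: %d bits, values: %d bits, total: %d bits (%d bytes)\n",
		tsBits, valBits, tsBits+valBits, (tsBits+valBits+7)/8)
	// Prints: timestamps: 84 bits, values: 332 bits, total: 416 bits (52 bytes)
}
```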
The Math at Scale
Timestamps: ~1 bit each in the common case (the paper found 96% of timestamps compress to a single bit)
Values: 1 bit when unchanged (51% of samples), roughly 26-37 bits otherwise
Measured total per sample: ~11 bits = 1.37 bytes (the paper's average for 2-hour blocks)
Uncompressed: 16 bytes per sample
Compression ratio: 16 / 1.37 ≈ 12x
At production scale:
| Throughput | Uncompressed/day | Gorilla-compressed/day | Savings |
|---|---|---|---|
| 1M samples/sec | 1.38 TB | 118 GB | 1.26 TB |
| 50M samples/sec | 69 TB | 5.9 TB | 63 TB |
| 500M samples/sec | 691 TB | 59 TB | 632 TB |
The difference between Gorilla and a generic 4x compressor at 500M samples/sec is roughly 114 TB/day of additional storage (173 TB/day versus 59 TB/day). Over a year, that is about 41 PB of excess storage cost.
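The table follows from straightforward arithmetic. A short Go sketch reproduces it, using decimal terabytes and the paper's 1.37 bytes/sample average:

```go
package main

import "fmt"

// Bytes per day at several ingest rates: uncompressed (16 bytes/sample)
// versus Gorilla-compressed (1.37 bytes/sample, the paper's average).
func main() {
	const (
		rawBytesPerSample     = 16.0
		gorillaBytesPerSample = 1.37
		secondsPerDay         = 86400.0
		tb                    = 1e12 // decimal terabyte
	)
	for _, samplesPerSec := range []float64{1e6, 50e6, 500e6} {
		samplesPerDay := samplesPerSec * secondsPerDay
		raw := samplesPerDay * rawBytesPerSample / tb
		compressed := samplesPerDay * gorillaBytesPerSample / tb
		fmt.Printf("%4.0fM samples/sec: %7.2f TB/day raw, %6.2f TB/day compressed, %7.2f TB/day saved\n",
			samplesPerSec/1e6, raw, compressed, raw-compressed)
	}
}
```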
Chunk Structure and Block Layout
Samples are not compressed individually. They are grouped into chunks, typically covering a fixed time window:
- Prometheus: 2-hour blocks. Each block contains compressed chunks for all active series in that window. An index file maps series IDs to chunk offsets.
- VictoriaMetrics: Monthly partitions with parts (similar to LSM-tree levels). Small parts are periodically merged into larger parts.
- Facebook Gorilla (original): 2-hour in-memory blocks. Each block starts with a header containing the base timestamp. Samples stream-append into the block. Once the window closes, the block becomes immutable.
Chunk boundaries introduce fixed overhead: a header with the base timestamp, the first sample stored in full, and alignment padding. This is why very short chunks (under 30 seconds of data) compress poorly. The overhead is amortized over the chunk duration.
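A back-of-the-envelope model shows how quickly the fixed per-chunk cost is amortized. The 16-bit header and the 11-bit average for subsequent samples below are assumptions for illustration; real chunk formats differ:

```go
package main

import "fmt"

// Effective compression ratio as a function of chunk length: every chunk
// pays for a small header plus a full 128-bit first sample, then roughly
// 11 bits per additional sample.
func main() {
	const (
		headerBits      = 16.0  // assumed chunk header size (illustrative)
		firstSampleBits = 128.0 // base timestamp + first value, uncompressed
		avgSampleBits   = 11.0  // paper's average for subsequent samples
		rawSampleBytes  = 16.0
	)
	// 30 seconds, 10 minutes, 30 minutes, and 2 hours at a 15-second scrape interval.
	for _, samples := range []int{2, 40, 120, 480} {
		compressedBytes := (headerBits + firstSampleBits + float64(samples-1)*avgSampleBits) / 8
		rawBytes := float64(samples) * rawSampleBytes
		fmt.Printf("%4d samples: %6.1f B compressed vs %6.0f B raw, ratio %4.1fx\n",
			samples, compressedBytes, rawBytes, rawBytes/compressedBytes)
	}
}
```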
Variants and Adaptations
Every major TSDB has adapted Gorilla's core ideas:
Prometheus TSDB uses Gorilla encoding within 2-hour blocks stored on local disk. The block structure includes an index (label-to-series mapping), chunk files (Gorilla-compressed samples), and tombstones (for deletions). The compactor merges multiple 2-hour blocks into larger blocks for efficient querying.
VictoriaMetrics applies Gorilla compression in the hot path (vminsert to vmstorage) and layers ZSTD on top for cold data exported to S3 as Parquet files. The combination gets Gorilla's 12x for recent data and an additional 2-3x from ZSTD for archived data, pushing effective compression to 25-35x on cold storage.
M3DB (Uber) developed M3TSZ, a Gorilla variant optimized for their workload of 6.6 billion active time series. The key modification is more aggressive encoding of the leading/trailing zero metadata to reduce overhead for series with highly variable values.
InfluxDB IOx uses Apache Arrow columnar format with Gorilla encoding specifically for float columns. Timestamps use delta encoding in the Arrow representation. This hybrid approach integrates Gorilla's strength (float compression) with Arrow's strength (columnar scans, predicate pushdown).
Limitations
Irregular timestamps destroy compression efficiency. If your data arrives at random intervals, delta-of-deltas are rarely zero. Example: timestamps arriving at intervals of 15s, 13s, 18s, 14s produce delta-of-deltas of -2s, +5s, -4s, each costing 9 bits instead of 1 at second resolution (16-36 bits at millisecond resolution). Compression drops from 12x to 5-7x, at which point ZSTD alone performs comparably with less complexity.
Labels and metadata are not compressed by Gorilla. The algorithm handles only the numeric timestamp-value pairs. The metric name, label keys, and label values (e.g., {service="checkout", region="us-east-1"}) require a separate inverted index. In VictoriaMetrics, the mergeset index handles this. In Prometheus, the index file within each block handles it.
Decode is sequential within a chunk. Because each sample's encoding depends on the previous sample (delta-of-delta, XOR with predecessor), you cannot jump to an arbitrary sample within a chunk. You must decode from the beginning. This is acceptable for typical metric queries (which read time ranges), but means random access within a chunk is O(n), not O(1). Prometheus mitigates this by keeping chunks small (2 hours) and using the index for coarse-grained seeking.
The first sample in every chunk is expensive. The base timestamp (64 bits) and first value (64 bits) are stored uncompressed. This 128-bit overhead is amortized across the chunk. A 2-hour chunk at 15-second intervals has 480 samples, making the overhead negligible (about 0.2% of the raw data). A 30-second chunk has only 2 samples, making the overhead 50% of the raw data.
Key Points
- •Gorilla exploits two properties unique to time-series data: timestamps arrive at regular intervals (delta-of-delta is usually zero = 1 bit), and consecutive float values are often identical or change slowly (identical values XOR to zero = 1 bit; similar values leave only a few meaningful bits). Generic compressors like LZ4 or ZSTD cannot see this structure
- •Delta-of-delta encoding stores each timestamp as the difference between consecutive deltas. With a regular 15-second scrape interval, every delta is 15s, every delta-of-delta is 0, and every timestamp after the second costs exactly 1 bit
- •XOR encoding compresses float64 values by XOR-ing consecutive samples. According to Facebook's Gorilla paper, 51% of all production metric values XOR to exactly zero with their predecessor (1 bit), 30% compress to ~26.6 bits, and 19% need ~36.9 bits; the identical-value case is what keeps the average cost low
- •The combined result is the paper's measured ~11 bits per sample (1.37 bytes) compared to 128 bits uncompressed (16 bytes). That is a 12x compression ratio. At 500M samples/sec, this is the difference between 691 TB/day and roughly 59 TB/day of storage
- •Every modern time-series database uses Gorilla or a variant: Prometheus TSDB, VictoriaMetrics, M3DB (Uber), InfluxDB IOx, and TimescaleDB. The algorithm is the foundation of practical TSDB storage
Common Mistakes
- ✗Assuming Gorilla compression works equally well on irregular data. Delta-of-delta encoding relies on consistent intervals. If timestamps arrive at random intervals, delta-of-deltas are rarely zero and timestamps cost 9-36 bits each instead of 1 bit
- ✗Confusing the theoretical 1.37 bytes/sample with production reality. Chunk headers, label storage overhead, index structures, and compression boundary effects push the effective rate to 1.5-2 bytes per sample in real systems
- ✗Applying Gorilla to non-numeric time-series data. The algorithm compresses float64 timestamp-value pairs only. Labels, string fields, and metadata require separate compression (inverted index for labels, dictionary encoding for strings)
- ✗Using excessively short chunk durations. Each chunk has fixed overhead (header, block boundary markers). Very short chunks (under 30 seconds of data) waste a higher percentage of space on overhead, reducing the effective compression ratio
- ✗Not layering a general-purpose compressor on top for cold storage. Gorilla is optimized for fast encode/decode in the hot path. For cold data read infrequently, applying ZSTD on top of Gorilla-compressed blocks can squeeze out another 2-3x