fix(pageserver): handle dup layers during gc-compaction #10430

skyzh · 2025-01-16T16:47:51Z

Problem

If gc-compaction decides to rewrite an image layer, index_part currently loses its reference to that layer. In detail:

Assume there's only one image layer of key 0000...AAAA at LSN 0x100 and generation 0xA in the system.
gc-compaction kicks in at gc-horizon 0x100 and produces 0000...AAAA at LSN 0x100 and generation 0xB.
It submits a compaction result update to the index part that unlinks 0000-AAAA-100-A and adds 0000-AAAA-100-B.

On the remote storage / local disk side, this is fine -- it unlinks things correctly and uploads the new file. However, the index_part.json itself doesn't record generations. The buggy procedure is as follows:

upload the new file
update the index part to remove the old file and add the new file
remove the new file
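The failure mode can be illustrated with a toy model. This is only a sketch under assumptions, not the pageserver's code: `ToyIndex`, `naive_update`, and `replay_bug` are hypothetical names, the layer name `0000-AAAA-100` is shorthand taken from the example above, and the add-then-unlink ordering is assumed for illustration.

```rust
use std::collections::HashSet;

/// Toy index that tracks layers by name only (key range + LSN).
/// Assumption: the generation is not part of the key, as described above.
#[derive(Debug, Default)]
struct ToyIndex {
    layers: HashSet<String>,
}

impl ToyIndex {
    /// Naive update: add the new layers first, then unlink the old ones by name.
    fn naive_update(&mut self, unlink: &[&str], add: &[&str]) {
        for name in add {
            self.layers.insert(name.to_string());
        }
        for name in unlink {
            // Old and new layer share a name, so this also drops the new one.
            self.layers.remove(*name);
        }
    }
}

/// Replays the scenario: a single image layer is rewritten under the same name.
fn replay_bug() -> ToyIndex {
    let mut index = ToyIndex::default();
    index.layers.insert("0000-AAAA-100".to_string());
    // Generations 0xA and 0xB are invisible here: both sides use the same name.
    index.naive_update(&["0000-AAAA-100"], &["0000-AAAA-100"]);
    index
}
```

In this model, `replay_bug()` returns an index with no layers, even though the rewritten file still exists in remote storage.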

Therefore, the correct update result process for gc-compaction should be as follows:

When modifying the layer map, delete the old one and upload the new one.
When updating the index, add the new one to the index without deleting the old one.
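A minimal sketch of that index rule, again on a toy name-only index rather than the real data structures. `apply_update` is a hypothetical helper, and `0000-BBBB-080` is a made-up second layer added only to show that genuinely dropped layers are still removed.

```rust
use std::collections::HashSet;

/// Applies a compaction result to a toy index keyed by layer name.
/// A name that appears in both lists is a rewrite and is never unlinked.
fn apply_update(index: &mut HashSet<String>, unlink: &[&str], add: &[&str]) {
    for name in unlink {
        // A name that is also being added is a rewrite: keep its entry.
        if !add.contains(name) {
            index.remove(*name);
        }
    }
    // Insert after removing, so an added layer can never be dropped by mistake.
    for name in add {
        index.insert(name.to_string());
    }
}
```

Applied to an index holding `0000-AAAA-100` and `0000-BBBB-080`, with both names unlinked and only `0000-AAAA-100` re-added, this leaves `0000-AAAA-100` in place and removes `0000-BBBB-080`.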

Summary of changes

Modify finish_gc_compaction to correctly order insertions and deletions.
Update the way gc-compaction uploads the layer files.
Add new tests.

github-actions · 2025-01-16T17:45:29Z

7414 tests run: 7027 passed, 0 failed, 387 skipped (full report)

Flaky tests (5)

Postgres 17

test_pageserver_gc_compaction_idempotent[after_restart]: release-arm64-with-lfc
test_explain_with_lfc_stats: release-arm64-with-lfc

Postgres 16

test_metrics_normal_work: release-x86-64-with-lfc

Postgres 15

test_metrics_normal_work: release-arm64-with-lfc

Postgres 14

test_metrics_normal_work: release-arm64-with-lfc

Code coverage* (full report)

functions: 33.5% (8495 of 25343 functions)
lines: 49.3% (71428 of 144828 lines)

* collected from Rust tests only

The comment gets automatically updated with the latest test results (1bb54ef at 2025-01-23T22:03:06.285Z).

jcsp · 2025-01-16T17:48:40Z

This deserves a test, as it was a near-miss data loss issue. I think we talked about just gc-compacting, then restarting and gc-compacting again, being enough to repro the issue?

skyzh · 2025-01-16T21:12:14Z

@jcsp I added two test cases plus one chaos test. The two test cases ensure that the "discard layer due to duplicated layer key" warning is hit.

VladLazar

It's riskier to fully fix the index-part bug, so I'd like to work around it by having gc-compaction not produce such updates.

Shard ancestor compaction already does this in a safe manner, with the important caveat that the current generation is greater than the generation of the layer being re-written (see compact_shard_ancestors and the call to rewrite_layers).

The trick is in how you call schedule_compaction_update. compacted_from arg must contain only dropped layers (no re-writes) and compacted_to must contain only re-writes or new layers. I see that you were already discarding re-writes within the same generation in KeyHistoryRetention::discard_key, so this should work fine.
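The split between the two arguments could be sketched as follows. This is an illustration under assumptions, not the actual `schedule_compaction_update` signature or call site: `Layer` and `split_update` are hypothetical, and the generation check models the caveat that a rewrite is only safe when the current generation is greater than that of the layer being rewritten.

```rust
use std::collections::HashMap;

/// Hypothetical layer descriptor: a name plus the generation that wrote it.
#[derive(Debug, Clone, PartialEq)]
struct Layer {
    name: String,
    generation: u32,
}

/// Splits a compaction result into (compacted_from, compacted_to).
/// compacted_from holds only dropped layers; compacted_to holds only
/// new layers and rewrites made in a newer generation.
fn split_update(
    inputs: &[Layer],
    outputs: &[Layer],
    current_generation: u32,
) -> (Vec<Layer>, Vec<Layer>) {
    let input_generation: HashMap<&str, u32> = inputs
        .iter()
        .map(|l| (l.name.as_str(), l.generation))
        .collect();

    // Keep an output unless it rewrites a layer from the same generation.
    let compacted_to: Vec<Layer> = outputs
        .iter()
        .filter(|out| match input_generation.get(out.name.as_str()) {
            Some(&old) => old < current_generation,
            None => true,
        })
        .cloned()
        .collect();

    // An input is dropped only if no output reuses its name.
    let compacted_from: Vec<Layer> = inputs
        .iter()
        .filter(|inp| !outputs.iter().any(|out| out.name == inp.name))
        .cloned()
        .collect();

    (compacted_from, compacted_to)
}
```

With this split, a rewritten layer never appears in `compacted_from`, so the index update cannot unlink a name that it is also adding.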

skyzh · 2025-01-17T15:35:19Z

compacted_from arg must contain only dropped layers (no re-writes) and compacted_to must contain only re-writes or new layers. I see that you were already discarding re-writes within the same generation in KeyHistoryRetention::discard_key, so this should work fine.

Yeah, I thought I was doing it the correct way, in that the same layers in the same generation are discarded... so I assume there's still a bug around remote_client