Storage decisions look cheap during installation and expensive during recovery.
Most Proxmox storage mistakes are not performance mistakes. They’re recovery mistakes discovered too late.
There is no universally correct Proxmox storage choice. ZFS buys integrity and snapshots at the cost of RAM and complexity. LVM-thin stays simple until you hit its limitations. Ceph solves distributed storage problems most homelabs never actually encounter. The right answer depends less on benchmarks and more on how you expect failures, backups, upgrades, and recovery to behave later.
Storage is not just performance. Storage determines how failures behave.
This guide covers what the Proxmox installer doesn’t explain about storage choices, the operational character of LVM-thin, ZFS, and Ceph, an operator-maturity ladder for picking the right option, what breaks first in each storage type, the hidden cost of switching storage later, and a decision matrix for matching storage to workload.
TL;DR
- LVM-thin: boring, simple, operationally efficient, snapshots functional but limited
- ZFS: integrity and recovery confidence in exchange for RAM and complexity
- Ceph: distributed systems engineering disguised as storage, rarely fits homelabs
- First Proxmox server: start with LVM-thin
- Single-node serious homelab: ZFS earns its complexity
- 3-node learning cluster: Ceph if you accept the operational overhead
- Changing storage later is more expensive than choosing carefully now
Where Proxmox actually stores VM files
Before choosing Proxmox storage, understand where data physically lives.
Proxmox separates storage locations from storage backends. A storage location is defined in /etc/pve/storage.cfg and points to a specific path or device. The backend type (LVM-thin, ZFS, directory, NFS, Ceph) determines how files are organized within that location.
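To make the location/backend split concrete, here is a minimal sketch that reads a storage.cfg-style text and lists each storage ID with its backend type. The sample text is an assumption modelled on a typical default install (a dir storage named local and an lvmthin storage named local-lvm); your own file will differ, and the parser only handles the simple "type: id" header plus indented "key value" layout.

```python
# Illustrative sample only; not read from a real host.
SAMPLE_STORAGE_CFG = """\
dir: local
        path /var/lib/vz
        content iso,vztmpl,backup

lvmthin: local-lvm
        thinpool data
        vgname pve
        content rootdir,images
"""


def parse_storage_cfg(text):
    """Return {storage_id: {"type": backend, option: value, ...}}."""
    storages = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if not raw[0].isspace():
            # Section header: "<backend type>: <storage id>"
            backend, _, storage_id = line.partition(":")
            current = {"type": backend.strip()}
            storages[storage_id.strip()] = current
        elif current is not None:
            # Indented option line: "<key> <value>"
            key, _, value = line.partition(" ")
            current[key] = value.strip()
    return storages


storages = parse_storage_cfg(SAMPLE_STORAGE_CFG)
for storage_id, options in storages.items():
    print(storage_id, "->", options["type"])
```

The storage ID is what VM configurations reference; the backend type is what decides which recovery tooling applies.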
VM disks land in different places depending on the storage type:
- LVM-thin storage: VM disks are logical volumes inside a thin pool. Visible via `lvs`. The pool itself is on a physical volume (usually a partition on the boot drive or a separate disk).
- ZFS storage: VM disks are ZFS volumes (zvols) inside a ZFS pool. Visible via `zfs list -t volume`. The pool is built from one or more vdevs (disks, mirrors, raidz).
- Directory storage: VM disks are `.raw` or `.qcow2` files inside a regular filesystem directory. Visible via standard `ls`.
- Ceph storage: VM disks are RBD (RADOS Block Device) images inside a Ceph pool. Visible via `rbd ls`. The pool is distributed across multiple OSDs (object storage daemons) on multiple nodes.
- NFS/iSCSI: VM disks live on a remote storage server, accessed over network. Different operational model entirely.
The web UI hides this. The operational reality matters when something breaks. A VM disk that’s a file inside /var/lib/vz/images/ can be copied with cp. A VM disk that’s a ZFS zvol needs zfs send. A VM disk in Ceph needs rbd export. Recovery workflows depend on knowing where data actually lives.
What Proxmox installers don’t explain
The Proxmox installer presents storage options in a way that implies they’re roughly equivalent choices. They are not.
The Proxmox installer offering ZFS does not mean your hardware is ready for ZFS. Consumer SSDs without power-loss protection, RAID controllers with proprietary on-disk formats, and hosts with 16GB RAM running heavy workloads — all “support” ZFS in the technical sense. They will also generate operational problems that LVM-thin would have avoided.
A few things the installer doesn’t tell you:
LVM-thin is default for a reason. Proxmox defaults to LVM-thin on most installations because it’s the most forgiving operationally. The installer doesn’t say this. New operators see ZFS in the dropdown and pick it because it sounds more advanced.
ZFS RAID needs proper hardware. ZFS expects to see raw disks. RAID controllers that present hardware arrays as single logical disks defeat ZFS’s redundancy model. The installer accepts the configuration without warning. Resilver behavior under failure becomes unpredictable.
Consumer SSD endurance gets ignored. ZFS writes more data than LVM-thin for the same VM workload (metadata, copy-on-write, scrubs). Consumer SSDs can wear out significantly faster than many operators expect under heavy VM and snapshot workloads. The installer asks nothing about disk wear ratings.
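ARC memory gets misunderstood. By OpenZFS default, ARC can grow to 50% of system RAM for cache. On a 16GB host, that’s 8GB before any VM starts. Recent Proxmox installers set a lower ARC limit on new ZFS installs, so check the value on your own host rather than assuming either number. The installer mentions ZFS RAM requirements briefly and most operators skip past it.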
Ceph appears as an option for clusters. This gives the impression that Ceph is something you can casually add. Operationally, Ceph is its own infrastructure layer. The installer doesn’t mention the network requirements, the operational learning curve, or the failure scenarios specific to small clusters running Ceph.
Many operators accidentally choose storage by following installer defaults instead of understanding failure behavior. The defaults are reasonable for typical cases. They are not optimal for every case. Understanding what each choice means later — when something breaks — is more valuable than what the choice means at install time.
LVM-thin — the safe default
LVM-thin builds thin-provisioned logical volumes on top of standard Linux LVM. It allocates space lazily — a 100GB VM disk that uses 20GB actually consumes 20GB on the underlying physical volume. Multiple VMs share the thin pool’s free space.
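The lazy-allocation arithmetic is worth seeing once with numbers. This is a small sketch with made-up figures (a 200GB pool and three 100GB VM disks); the function name and inputs are illustrative, not part of any LVM tooling.

```python
def thin_pool_summary(pool_size_gb, disks):
    """disks: list of (provisioned_gb, used_gb) tuples, one per VM disk."""
    provisioned = sum(p for p, _ in disks)
    used = sum(u for _, u in disks)
    return {
        "provisioned_gb": provisioned,
        "used_gb": used,
        "free_gb": pool_size_gb - used,
        # Above 1.0 means the VMs could collectively write more than the pool holds.
        "overcommit_ratio": provisioned / pool_size_gb,
    }


summary = thin_pool_summary(200, [(100, 20), (100, 35), (100, 50)])
print(summary)
```

With these numbers the pool is overcommitted 1.5x while still showing 95GB free, which is exactly the state that feels safe until the guests start filling their disks.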
What LVM-thin gives you:
- Simple operational model — standard Linux LVM tooling applies
- Low RAM overhead (no ARC, no caching layer)
- Predictable performance (no compression, no checksums consuming CPU)
- Works with any disk including those behind RAID controllers
- Snapshot capability exists but isn’t designed for the same operational workflows ZFS provides
What LVM-thin does not give you:
- Built-in data integrity checking (silent corruption goes undetected)
- The snapshot-and-replication-based backup experience most operators expect after using ZFS
- Native compression or deduplication
- Built-in replication or send/receive primitives
- Pool-level redundancy across multiple disks (relies on underlying RAID or single-disk reality)
The operator tradeoff: LVM-thin keeps storage out of your way until you actively need more from it. If you don’t know what you need yet, LVM-thin won’t punish you for finding out later — except that switching storage later is its own pain (see “Hidden cost of changing storage”).
Who LVM-thin punishes:
- Operators who forget to monitor thin pool free space (running out is catastrophic)
- People expecting advanced snapshot or integrity features
- Anyone overcommitting storage without alert thresholds
- Operators who don’t notice when a single physical disk fails
LVM-thin works well for: first Proxmox installations, single-disk hosts, homelabs where backups happen on external storage, environments where operational simplicity matters more than feature richness.
ZFS — when it earns its complexity
ZFS is integrity and recovery confidence for operators willing to pay the RAM and complexity tax.
What you’re really paying for with ZFS is operational confidence during bad days. The features that look unnecessary on day one — checksums, snapshots, native replication, compression — are what you need when a disk fails, a VM gets corrupted, or a host needs migration.
What ZFS gives you:
- End-to-end data integrity (checksums detect silent corruption)
- Fast atomic snapshots that don’t degrade VM performance
- Native send/receive for replication between hosts
- Inline compression that often improves performance (less I/O)
- Pool-level redundancy (mirrors, raidz1/raidz2/raidz3) configured at filesystem level
- Adaptive replacement cache (ARC) that significantly speeds up reads for working sets that fit
- The ability to scrub data periodically and catch problems before they cause failures
What ZFS costs you:
- RAM. Lots of it. Default ARC uses up to 50% of system memory. What is left for VMs is roughly the host’s total RAM minus ARC minus baseline overhead.
- Consumer SSD wear. ZFS writes more than LVM-thin for equivalent workloads.
- Pool expansion limitations. For most of ZFS’s history, adding a single disk to an existing raidz vdev was not supported. Raidz expansion arrived only recently (OpenZFS 2.3) and is still constrained. Plan pool topology carefully.
- Resilver time. Replacing a failed disk in a multi-TB pool can take hours to days, during which the pool is in degraded state.
- Complexity. ZFS has its own vocabulary, tuning parameters, and failure modes. Operators who never read the documentation will eventually be surprised.
ZFS feels expensive right until the first corrupted VM disk that ZFS catches and a non-ZFS system would have served silently to a confused operator three weeks later.
Who ZFS punishes:
- Low-RAM hosts (under 32GB, ZFS contention with VMs gets uncomfortable)
- Cheap consumer SSDs (endurance failure faster than expected)
- Hosts with RAID controllers in pass-through mode that aren’t actually pass-through
- Operators who never check scrub results or resilver health
- People who expect ZFS to be a transparent layer they can ignore
ZFS works well for: single-node serious homelabs with adequate RAM, hosts where data integrity matters, environments planning replication-based backup strategies, operators willing to invest 5-10 hours learning ZFS basics before relying on it.
For the RAM implications specifically, see our Proxmox RAM sizing guide covering ZFS ARC behavior in detail.
Ceph — when distributed makes sense (rarely in homelabs)
Ceph is distributed systems engineering disguised as storage.
The framing matters. Ceph isn’t “ZFS for clusters.” It’s an entirely different category — a distributed object storage system that presents block storage to Proxmox. Operating it well requires understanding distributed consensus, network behavior, and recovery patterns that simply don’t exist in single-node storage.
What Ceph gives you:
- Storage that survives node failures without VM downtime (with HA configured)
- Storage capacity that scales by adding more OSDs across more nodes
- No single point of storage failure
- Live migration without shared storage hardware (Ceph IS the shared storage)
- The ability to lose a node and have VMs continue running elsewhere
The problem with “simple HA”
Most operators discover Ceph while looking for high availability. The natural assumption is that adding Ceph to a 3-node cluster gives them HA storage automatically. Operationally, this is rarely the experience.
A 3-node Ceph cluster with default replication factor 3 means every node holds a copy of every piece of data. Lose one node and the cluster enters degraded state with no recovery target until the failed node returns. Lose a second node and data becomes inaccessible. The cluster survives single-node failures only if the failed node returns reasonably quickly.
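Adding a 4th node gives the cluster a recovery target after a single node failure, but it also adds another set of OSDs to maintain. It does not improve monitor quorum by itself: four monitors tolerate the same single failure as three, so keep the monitor count odd. The complexity grows.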
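The node-count reasoning above can be sketched as a toy model. It assumes a replicated pool with size 3 and min_size 2, one replica per node (failure domain = host), and no recovery between failures; those parameters and the function name are assumptions for illustration, and monitor quorum is deliberately left out.

```python
def replica_state(total_nodes, failed_nodes, size=3, min_size=2):
    """Worst-case view of a replicated pool right after whole-node failures."""
    surviving_nodes = total_nodes - failed_nodes
    # A placement group that had a replica on every failed node keeps this many.
    replicas_left = max(size - failed_nodes, 0)
    return {
        "replicas_left": replicas_left,
        # Below min_size, client I/O to the affected data is blocked.
        "io_available": replicas_left >= min_size,
        # Re-creating the missing replicas needs `size` distinct healthy nodes.
        "can_self_heal": replicas_left > 0 and surviving_nodes >= size,
    }


for nodes, failed in [(3, 1), (3, 2), (4, 1)]:
    print(nodes, "nodes,", failed, "failed ->", replica_state(nodes, failed))
```

Under these assumptions a 3-node cluster keeps serving I/O after one failure but cannot restore full redundancy on its own, while a 4-node cluster can.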
Network requirements are often understated. Ceph needs reliable, low-latency networking between nodes. Consumer 1Gbps networking is frequently inadequate for production-like Ceph operations. Rebalance traffic during recovery saturates the network and slows everything else. 10Gbps networking is the realistic minimum for Ceph that you’d want to depend on.
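A rough back-of-the-envelope calculation shows why link speed dominates recovery time. The sketch below assumes 2TB of data to move and a 70% usable share of the link; both numbers are illustrative guesses, and real Ceph recovery is also limited by disks, throttling settings, and client traffic.

```python
def transfer_hours(data_tb, link_gbps, efficiency=0.7):
    """Hours needed to move data_tb terabytes over one link."""
    bits = data_tb * 1e12 * 8
    usable_bits_per_second = link_gbps * 1e9 * efficiency
    return bits / usable_bits_per_second / 3600


for gbps in (1, 10):
    print(f"{gbps} Gbps: {transfer_hours(2, gbps):.1f} h to move 2 TB")
```

The tenfold gap between the two results is the difference between a recovery that finishes during dinner and one that runs most of a working day while VMs compete for the same link.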
Resource overhead is significant. Each OSD runs as a service consuming RAM and CPU. A 3-node cluster running Ceph has meaningfully less available capacity for VMs than the same cluster running local storage.
Most operators discover Ceph complexity during recovery, not during installation.
Operational complexity compounds during failures. A degraded Ceph cluster takes longer to diagnose than a degraded ZFS pool. The dashboard provides health information but interpreting it requires understanding Ceph’s internal model.
Ceph can survive node failures. It can also create debugging sessions that last longer than the outage itself.
Ceph solves problems most homelabs do not actually have. Most homelab “HA” requirements can be satisfied with reliable backup + restore procedures rather than live failover. Most homelab “scaling” needs are satisfied by buying a larger drive. Most homelab “cluster storage” requirements come from wanting to learn cluster storage, which is a legitimate goal — just not the same as needing it.
Many homelab Ceph deployments exist primarily because operators want to learn Ceph. That’s a valid reason. It’s very different from needing Ceph. The distinction matters because learning Ceph means accepting operational overhead as a teaching cost. Needing Ceph means accepting the same overhead as a business cost. Confusing these leads to disappointment.
Who Ceph punishes:
- Operators on unstable or low-bandwidth networking
- Anyone with partial understanding (Ceph rewards deep knowledge, punishes shallow)
- Operators wanting “simple HA” (Ceph is not simple)
- Tiny clusters pretending to be enterprise environments
Ceph works well for: 4+ node clusters with dedicated 10Gbps networking, operators with time and motivation to learn distributed systems, environments where storage failover requirements genuinely justify the complexity, learning labs specifically built to explore Ceph behavior.
Small homelab reality
Most homelabs are not what storage discussions assume.
Storage articles often start from assumptions that don’t match the reader: redundant disks, ECC RAM, 10Gbps networking, dedicated backup hardware, multiple nodes, UPS protection, and time to maintain it all. The typical homelab looks different.
The real typical homelab has:
- One SSD or one HDD (often whatever was left over from a desktop upgrade)
- 16-32GB RAM (with no plans for more)
- A single mini PC or repurposed thin client
- One external backup drive that gets remembered occasionally
- No UPS, or a small consumer UPS without proper integration
- 1Gbps networking shared with the household
- Limited time for storage maintenance
Ceph recommendations copied from enterprise environments often ignore this reality entirely. ZFS recommendations assume RAM that the host doesn’t have. Backup strategies assume hardware nobody bought.
The honest small-homelab pattern: one host, LVM-thin or single-disk ZFS, regular backups to external storage, accept that the host going down means downtime until backup restore. This is fine. It’s not enterprise. It doesn’t pretend to be.
The articles describing 3-node clusters with Ceph and replicated ZFS pools are describing aspirational configurations or learning environments. Useful as exposure to what’s possible. Not useful as immediate practical guidance for someone with a single mini PC.
Match storage choice to the actual hardware and time available, not to what enterprise-grade homelabs run.
Operator maturity ladder
Proxmox storage choice maps to operator experience and operational goals, not just workload size.
| Operator stage | Recommended storage |
|---|---|
| First Proxmox node | LVM-thin |
| Single-node serious homelab | ZFS |
| Multi-node HA experimentation | Ceph (carefully) |
| Wants least operational overhead | LVM-thin |
| Wants snapshots + integrity | ZFS |
| Wants distributed storage | Ceph |
| Low-RAM host (under 32GB) | LVM-thin |
| Storage-heavy workloads, adequate RAM | ZFS |
| 4+ node cluster with 10Gbps networking | Ceph |
| Production-like backup-first environment | ZFS with PBS |
The ladder isn’t strictly linear. Operators may stay at LVM-thin for years without “needing” to upgrade. Others jump to ZFS on first install because they already understand ZFS from elsewhere. The point is to match storage choice to actual operational maturity and goals, not to chase what sounds advanced.
What should most people use?
Brutal short version for readers who want the answer fast.
| If you are… | Use… |
|---|---|
| New to Proxmox | LVM-thin |
| Running one serious single host | ZFS |
| Building HA cluster to learn distributed storage | Ceph |
| Running mini PCs with low RAM | LVM-thin |
| Protecting important long-term data | ZFS |
| Running a test/lab environment | LVM-thin |
| Backing up to PBS | ZFS on PBS host |
| Wanting simple operational life | LVM-thin |
| Wanting integrity guarantees | ZFS |
| Building production-like environment with team | ZFS or Ceph depending on scale |
These are starting points, not absolute rules. Specific workload patterns may justify exceptions. The recommendations bias toward operational safety over feature richness.
What breaks first per storage
Each storage type has characteristic failure modes that show up before others.
LVM-thin — what breaks first:
- Thin pool exhaustion. The pool runs out of free space while VMs continue writing. VMs become read-only or error out. Recovery requires freeing space (deleting snapshots, removing VMs, or extending the pool) before VMs can resume normal operation.
- Single-disk failure with no underlying RAID. All VMs on that pool become inaccessible. Recovery requires restoring from backup or replacing the disk and rebuilding everything.
- Metadata exhaustion in the thin pool. Less common than space exhaustion but harder to recover from. Requires pool reconfiguration.
- Silent data corruption. No checksums means corruption is detected only when something tries to read corrupted data and notices, often weeks after the actual corruption event.
ZFS — what breaks first:
- RAM pressure during high VM load. ARC tries to release memory under pressure but can’t always release fast enough for sudden VM allocation needs. VMs fail to start with “not enough memory” errors despite apparent free RAM.
- Resilver pain on large pools. Replacing a failed disk in a multi-TB pool can take days. During this time, the pool is degraded and another failure during resilver causes data loss in raidz1 configurations.
- Fragmentation in heavily-written pools. ZFS performance degrades over time on pools that see constant overwrite patterns, especially as they fill up. ZFS has no in-place defragmentation; the practical fix is rewriting the data, typically by sending it to a freshly created pool.
- SSD endurance failure. Consumer SSDs hit their write limits faster on ZFS than LVM-thin. Pools start reporting write errors and need disk replacement.
Ceph — what breaks first:
- Network instability causing OSD flapping. OSDs lose communication briefly, get marked down by monitors, then come back. Each flap triggers rebalancing. Cluster spends more time recovering than serving.
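- Quorum weirdness on small clusters. Monitors need a strict majority, so 3 monitors tolerate exactly one outage and any second problem stops the cluster. A 4th monitor does not help, because 4 monitors still tolerate only one failure; surviving two monitor failures takes 5. A 4th node is still useful as an OSD host and recovery target.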
- Recovery storms after node failures. When a failed node returns, the rebalance traffic saturates network and disk I/O. VMs become slow or unresponsive during recovery — sometimes for hours.
- PG (placement group) imbalance. Without proper initial sizing of PGs, some OSDs end up holding much more data than others. Performance suffers asymmetrically.
- Misconfigured replication factor with cluster too small. Pool set to replicate 3x with only 3 nodes means losing any node leaves pool in degraded state with no recovery target.
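Monitor quorum is a strict-majority vote, so the tolerance for each cluster size is simple arithmetic. A minimal sketch (the function name is illustrative):

```python
def monitor_quorum(monitors):
    """Majority needed for quorum and how many monitors may fail."""
    needed = monitors // 2 + 1
    return {"needed": needed, "tolerated_failures": monitors - needed}


for count in (3, 4, 5):
    print(count, "monitors ->", monitor_quorum(count))
```

Three and four monitors both tolerate a single failure; only five tolerate two. That is why monitor counts are normally kept odd.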
Common across all three: operator inattention. Storage problems usually announce themselves through monitoring before they become disasters. Operators who don’t watch the dashboard discover the problem during the next outage instead.
The hidden cost of changing storage later
Proxmox storage choice is much harder to change than it looks at install time.
Storage migrations look easy right before they become projects.
The naive assumption: “I’ll start with LVM-thin and switch to ZFS later if I need to.” The operational reality is more painful.
What changing storage actually requires:
- VM migration windows. Each VM must be stopped or live-migrated to different storage. Live migration between storage types requires that target storage exists on the same host or accessible cluster member. Cold migration requires downtime.
- Restore-based migration when live isn’t available. Backup the VM, recreate it on the new storage, restore. Each VM takes its backup window plus restore window of downtime.
- Snapshot incompatibility. Snapshots taken on LVM-thin can’t move to ZFS. ZFS snapshots can’t move to Ceph. Migration usually means losing snapshot history.
- Local-to-shared storage pain. Moving from local storage to shared (Ceph, NFS) means rethinking backup strategies, replication patterns, and cluster behavior. This is not a simple operation.
- Replication redesign. If you had ZFS-based replication between hosts and move to Ceph, the entire replication architecture changes. PBS-based backups continue working but in-host replication needs reconfiguration.
- Backup expansion during migration. Running old and new storage simultaneously during migration usually requires temporarily having backups of everything, doubling backup storage requirements.
- Cluster rebalance pain. Adding Ceph to an existing cluster generates significant initial data movement as the cluster takes ownership of VM data. This is high-load operation.
- Datastore path issues. VM configurations reference specific storage IDs. Moving between storage types means editing VM configs for every affected VM.
A homelab with 5 VMs can absorb a storage migration over a weekend. A homelab with 30 VMs cannot. The migration becomes a multi-week project with rolling downtime.
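The scaling is easy to estimate for a restore-based migration. The per-VM figures below (20 minutes to back up, 30 minutes to restore) are hypothetical; substitute your own measurements. The result is summed downtime across VMs, not wall-clock time, and it ignores verification and troubleshooting.

```python
def migration_downtime_hours(vm_count, backup_min, restore_min):
    """Total downtime if every VM is backed up and restored one at a time."""
    return vm_count * (backup_min + restore_min) / 60


for vms in (5, 30):
    hours = migration_downtime_hours(vms, backup_min=20, restore_min=30)
    print(f"{vms} VMs: about {hours:.1f} hours of cumulative downtime")
```

Even with these optimistic numbers, 30 VMs add up to more than a full day of hands-on migration work before anything goes wrong.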
The operational lesson: initial storage choice should reflect not just current needs but realistic expectations of how the lab will grow. Choosing LVM-thin “to keep things simple” and then accumulating 30 VMs creates a future migration project. Choosing ZFS upfront for a host that has the RAM to support it avoids that future cost.
Choosing carelessly at install time means paying for the choice every time storage strategy needs to evolve.
Decision matrix — workload vs storage type
For specific workload patterns, optimal storage choice is more constrained.
| Workload | Best fit | Why |
|---|---|---|
| Single Linux services VM | LVM-thin | Simple, low overhead |
| Docker host running 10+ containers | ZFS | Snapshot before container changes |
| Windows desktop VM | LVM-thin or ZFS | Either works; ZFS for snapshots |
| Windows Server with AD | ZFS | Data integrity matters for directory services |
| Database VM (MySQL, PostgreSQL) | ZFS with tuning | ARC helps but tune block size (volblocksize for zvol-backed VM disks) |
| Backup destination (PBS host) | ZFS | Compression + integrity for backup data |
| Media server VM | LVM-thin or ZFS | Either; ZFS if you want snapshots |
| Lab VMs that get destroyed often | LVM-thin | No point investing in snapshot complexity |
| Mission-critical home services | ZFS with replication | Survives single-disk failure |
| 3+ node HA cluster requirement | Ceph | If you accept the overhead |
| Shared storage for live migration | Ceph or NFS | Local storage prevents proper live migration |
| Edge device that needs fast recovery | LVM-thin | Quicker restore from rebuild |
Edge cases worth noting:
- VMs that run Docker should usually use ZFS at the host level even if containers don’t directly use ZFS — Docker storage drivers don’t substitute for VM-level integrity.
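- Database VMs benefit from ZFS block size tuning matched to database page size (PostgreSQL 8K, MySQL InnoDB 16K). For VM disks stored as zvols the relevant property is volblocksize; recordsize (default 128K) applies to datasets, where a mismatch causes write amplification on small random writes.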
- VMs intended for long-term archive should be on ZFS specifically for periodic scrub detection of bit rot.
- Test VMs that get rebuilt regularly don’t justify ZFS complexity.
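The block-size point in the database bullet can be put into numbers. This is a deliberately simplified worst-case model: it assumes aligned writes and that ZFS rewrites every block a write touches in full, whichever block size applies (recordsize on a dataset, volblocksize on a zvol). Caching, compression, and write aggregation will usually make real numbers better than this.

```python
import math


def worst_case_amplification(write_kb, block_kb):
    """Data rewritten divided by data the guest asked to write."""
    blocks_touched = math.ceil(write_kb / block_kb)
    return blocks_touched * block_kb / write_kb


for page_kb, block_kb in [(8, 128), (8, 16), (16, 16)]:
    factor = worst_case_amplification(page_kb, block_kb)
    print(f"{page_kb}K page on {block_kb}K blocks: up to {factor:.0f}x")
```

An 8K page landing on 128K blocks is the pathological case; matching the block size to the page size brings the factor back to 1.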
Proxmox storage RAM tax and sizing
Each storage type has different RAM implications that affect host sizing.
LVM-thin RAM cost:
- Essentially zero beyond standard kernel overhead
- The full host RAM minus baseline is available for VMs
- This is one of LVM-thin’s biggest operational advantages
ZFS RAM cost:
- ARC defaults to 50% of system RAM
- Practical recommendation: 1GB ARC per 1TB of pool storage minimum
- Compression doesn’t reduce RAM needs significantly
- On 32GB host with ZFS: realistic VM allocation budget is ~16-20GB after host overhead and ARC
Ceph RAM cost:
- Each OSD process consumes 4-6GB RAM typical
- Per-node overhead for monitors, managers
- A node running 4 OSDs needs ~20GB just for Ceph services
- Less RAM available for VMs than equivalent local storage
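The three RAM costs can be compared on one host size. This sketch uses a 32GB host, a 2GB allowance for the hypervisor itself, and 5GB per OSD (the midpoint of the range above); all three are assumptions. Note that a full 50% ARC leaves less than the ~16-20GB quoted above, so that range implicitly assumes ARC stays below its ceiling or is capped.

```python
def zfs_arc_gb(host_gb, arc_fraction=0.5, arc_cap_gb=None):
    """ARC ceiling: a fraction of RAM, optionally limited by an explicit cap."""
    arc = host_gb * arc_fraction
    return min(arc, arc_cap_gb) if arc_cap_gb is not None else arc


def ceph_osd_gb(osd_count, gb_per_osd=5):
    """OSD memory only; monitors and managers come on top of this."""
    return osd_count * gb_per_osd


def vm_ram_budget_gb(host_gb, backend_gb, host_overhead_gb=2):
    """RAM left for guests after the host and the storage backend."""
    return host_gb - host_overhead_gb - backend_gb


host = 32
print("LVM-thin:          ", vm_ram_budget_gb(host, 0))
print("ZFS, 50% ARC:      ", vm_ram_budget_gb(host, zfs_arc_gb(host)))
print("ZFS, ARC cap 12GB: ", vm_ram_budget_gb(host, zfs_arc_gb(host, arc_cap_gb=12)))
print("Ceph, 4 OSDs:      ", vm_ram_budget_gb(host, ceph_osd_gb(4)))
```

The same host offers very different VM budgets depending on the backend, which is why storage choice and RAM sizing have to be decided together.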
For specific RAM sizing implications, see our RAM sizing guide covering ARC behavior, ballooning, and Windows VM allocation.
Snapshots, backups, and replication — what each actually does
A common storage misconception worth addressing directly.
Snapshots are not backups. A snapshot captures filesystem state at a point in time. The snapshot data lives on the same storage as the original VM. Storage failure destroys both. Snapshots provide rollback capability for software changes, not protection against storage failure.
Backups are not replication. A backup copies data to separate storage at a point in time. The data is independent of the source. Storage failure on the source doesn’t affect the backup. Backups protect against source loss but require restore time.
Replication is not backup. Replication continuously copies data to another storage system. The destination is always close to current. But replication also copies errors, corruption, and accidental deletions. An rm -rf propagates to the replica in seconds.
What each actually does:
| Mechanism | Protects against | Doesn’t protect against |
|---|---|---|
| Snapshot | Software changes, configuration mistakes | Storage failure, fire, theft |
| Backup | Source failure, deletion, corruption | Time gap since last backup |
| Replication | Source host failure | Coordinated failures, source corruption |
Production-like environments use all three: snapshots for fast rollback, backups for disaster recovery, replication for fast failover. Homelabs usually start with backups only (correct minimum) and add the others as needs grow.
The misconception that snapshots are sufficient backup is the most common cause of “I had snapshots, why did I lose data?” incidents.
Common Proxmox storage mistakes
A short list of failures that account for most homelab storage problems.
Choosing ZFS without enough RAM. Host has 16GB RAM, operator installs Proxmox with ZFS, VMs fight ARC for memory. Performance is poor and unpredictable. Fix: either upgrade RAM or use LVM-thin.
Trusting consumer SSDs in ZFS pools. Consumer SSDs lack power-loss protection and have limited endurance. ZFS amplifies writes. Pool reliability suffers earlier than expected. Fix: enterprise SSDs for ZFS, or use HDDs for capacity tiers.
Letting LVM-thin pools fill to 100%. Thin pool exhaustion is catastrophic — VMs become unwritable. Always configure alerting at 70%, 85%, 95% thresholds. Fix: monitoring, capacity planning, free space reserves.
Running Ceph on consumer 1Gbps networking. Ceph rebalance traffic saturates the network. VMs become slow during normal Ceph operations. Fix: 10Gbps networking minimum for Ceph, or don’t run Ceph.
Treating snapshots as backup strategy. Storage failure destroys snapshots and originals together. Fix: separate backup destination, ideally on different physical storage and physical location.
Ignoring SMART warnings. Disks announce failure before they fail completely. Operators ignoring SMART data are caught by surprise. Fix: SMART monitoring with alerts.
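Mixing disk sizes in ZFS mirrors. A mirror with mismatched disk sizes uses only the smaller size. The “extra” space on the larger disk is wasted. Fix: matched disk sizes within each vdev. Raidz does not help here, because a raidz vdev also sizes every member to its smallest disk; pair similar sizes into separate mirror vdevs instead.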
Adding capacity by adding single disks to ZFS raidz pools. Until very recently, you couldn’t expand raidz by adding individual disks. Adding capacity meant adding a whole new vdev (a lone disk added this way weakens the pool’s redundancy) or recreating the pool. Fix: plan pool topology before creating, accept inflexibility.
Not testing recovery procedures. A backup that hasn’t been restored is a hope, not a backup. Fix: periodic restore tests to a sandbox environment.
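For the mixed-disk-size mistake in particular, the wasted space is easy to estimate before buying anything. The sketch below gives approximate usable capacity per vdev, ignoring metadata, padding, and reserved space; the function names are illustrative.

```python
def mirror_capacity_tb(disks_tb):
    """A mirror is only as large as its smallest member."""
    return min(disks_tb)


def raidz_capacity_tb(disks_tb, parity=1):
    """Every raidz member is treated as the size of the smallest disk."""
    return (len(disks_tb) - parity) * min(disks_tb)


print("mirror of 4TB + 8TB:      ", mirror_capacity_tb([4, 8]), "TB usable")
print("raidz1 of 4TB + 4TB + 8TB:", raidz_capacity_tb([4, 4, 8]), "TB usable")
```

In both layouts the extra 4TB on the larger disk contributes nothing, so mismatched sizes are paid for but never used.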
Monitoring storage — what to watch
Three categories of storage monitoring matter for operational awareness.
Pool/filesystem capacity:
- LVM-thin: monitor thin pool free space and metadata usage
- ZFS: monitor pool capacity, ARC hit rate, scrub schedule
- Ceph: monitor OSD utilization balance, PG state, monitor health
Disk health:
- SMART data on all disks (regardless of storage type)
- ZFS specific: scrub results, resilver progress when applicable
- Ceph specific: OSD up/down status, scrub error counts
Performance:
- I/O wait time on the host (`iostat`, `iotop`)
- Storage backend specific metrics (ZFS via `zpool iostat`, Ceph via dashboard)
- Network throughput for Ceph (Ceph rebalance is network-bound)
For ZFS specifically, automated scrubs should run monthly (default in most Proxmox installations) and the results should be checked. A scrub that finds errors is not failure — it’s the system working as designed. A scrub that finds nothing for years means either the system is healthy or scrubs aren’t actually running.
For LVM-thin, the critical metric is thin pool free space. Running out of space in a thin pool is much harder to recover from than running out in a regular filesystem.
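A thin pool check can be as small as the sketch below. It assumes input shaped like the JSON report that lvs can produce when asked for the lv_name, data_percent, and metadata_percent fields; the sample string is hand-written for illustration, not captured from a real host. The 70/85/95 thresholds are the ones suggested earlier in this guide.

```python
import json

# Checked from most to least severe.
THRESHOLDS = [(95, "critical"), (85, "high"), (70, "warning")]

# Hand-written sample; real output may contain more fields and volumes.
SAMPLE_LVS_JSON = """
{"report": [{"lv": [
  {"lv_name": "data", "data_percent": "87.40", "metadata_percent": "4.12"},
  {"lv_name": "root", "data_percent": "", "metadata_percent": ""}
]}]}
"""


def classify(percent):
    for limit, level in THRESHOLDS:
        if percent >= limit:
            return level
    return "ok"


def thin_pool_alerts(lvs_json_text):
    """Map each thin pool name to an alert level based on its fullest dimension."""
    alerts = {}
    for section in json.loads(lvs_json_text)["report"]:
        for lv in section["lv"]:
            # Assumption: only thin pools report a metadata percentage.
            if not lv.get("metadata_percent"):
                continue
            fullest = max(float(lv["data_percent"]), float(lv["metadata_percent"]))
            alerts[lv["lv_name"]] = classify(fullest)
    return alerts


print(thin_pool_alerts(SAMPLE_LVS_JSON))
```

Taking the larger of data and metadata usage matters because metadata exhaustion stops the pool just as surely as running out of data space.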
For Ceph, cluster health is a composite metric. ceph status should show HEALTH_OK most of the time. HEALTH_WARN is normal during rebalance but shouldn’t persist for hours.
The boring storage decision is usually the right one
Proxmox storage choice tempts operators toward complexity. ZFS sounds more advanced than LVM-thin. Ceph sounds more enterprise than ZFS. The temptation is to choose based on what sounds capable rather than what fits operational reality.
Storage is not just performance. Storage determines how failures behave.
The boring storage decision — LVM-thin for first installations, ZFS when complexity is earned, Ceph rarely — produces the fewest 2 AM debugging sessions. The exciting storage decisions produce stories at conferences.
A homelab is meant to be operational, not a story. Choose the storage that lets you sleep at night and recover from the failures you’ll actually encounter.
The best storage layout is usually the one that fails predictably, recovers cleanly, and doesn’t require rebuilding your infrastructure philosophy six months later.