Deduplication is hit and miss, are there settings to determine what is matched? #17151

AncientMystic · 2025-03-15T01:55:24Z

AncientMystic
Mar 15, 2025

So i am noticing even if i move the exact same data ZFS misses a lot of duplicates in deduplication

example i copied 163gb of game files for a vm twice and the dedup function caught 72gb, yet 100% of the data is duplicated

Backups are the same, multiple identical copies of windows, linux, etc, it is catching around 44% or less of the duplicate files on average.

Another example, i was hoping to use it to have ai models in multiple VMs and not take up so much space since they are huge and so many apps want them in different locations or setup with different environments to run efficiently, so deduplication would be perfect for this, yet ai models that i have moved onto the dataset, ZFS somehow does not see a single one as duplicate despite there being duplicates of all of them for different VMs.

I am wondering is this to be expected or is there something i am missing, a setting i should specify to increase how thoroughly it checks files for deduplication, etc?

I am running ZFS on proxmox and I have 96GB of ram, so i am pretty sure i have enough at least to run deduplication.

IvanVolosyuk · 2025-03-15T01:58:44Z

IvanVolosyuk
Mar 15, 2025

Are you trying to dedup data in VM images / volumes?

3 replies

AncientMystic Mar 15, 2025
Author

They are written to zfs as raw, but not all of the files are in vm's, i am seeing the same performance from deduplication with files directly written, backups and raw vm disks.

amotin Mar 15, 2025
Collaborator

I am still not sure I understand properly whether you put the identical data inside the VMs on their file systems or directly on the ZFS dataset as files. Because if you put them inside several VM on guest file systems separately, even if VMs disks are backed by raw files, if the VM guest file system has block smaller than ZFS (quite likely), then different offsets between these can make the data not identical from ZFS side.

Alternatively, if that guess is wrong, but you are running ZFS 2.3, then this regression we found ourselves just recently: #17120 .

AncientMystic Mar 15, 2025
Author

I am still not sure I understand properly whether you put the identical data inside the VMs on their file systems or directly on the ZFS dataset as files. Because if you put them inside several VM on guest file systems separately, even if VMs disks are backed by raw files, if the VM guest file system has block smaller than ZFS (quite likely), then different offsets between these can make the data not identical from ZFS side.

Alternatively, if that guess is wrong, but you are running ZFS 2.3, then this regression we found ourselves just recently: #17120 .

Thank you that could be apart of the problem for vm disks at least. I will be sure to check that.

i have also tried it with files directly on the same dataset, that is where i am seeing an average of 44% success rate with exact files.

but i think a few vms are even worse, with even fewer files deduplicated, the ai models are the worst seeing no duplicates at all, ill bet you are right with that one, that must be why its completely missing all of them in that case at least.

I am running ZFS 2.2.7-pve1 according to proxmox.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Deduplication is hit and miss, are there settings to determine what is matched? #17151

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 1 comment 3 replies

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Deduplication is hit and miss, are there settings to determine what is matched? #17151

AncientMystic Mar 15, 2025

Replies: 1 comment · 3 replies

IvanVolosyuk Mar 15, 2025

AncientMystic Mar 15, 2025 Author

amotin Mar 15, 2025 Collaborator

AncientMystic Mar 15, 2025 Author

AncientMystic
Mar 15, 2025

Replies: 1 comment 3 replies

IvanVolosyuk
Mar 15, 2025

AncientMystic Mar 15, 2025
Author

amotin Mar 15, 2025
Collaborator

AncientMystic Mar 15, 2025
Author