[Feature] Duplicate photo detection #1968

zjy760401 · 2023-03-08T00:12:11Z

zjy760401
Mar 8, 2023

The feature

I want a feature that can detect duplicate photos to help me organize my pictures. Ideally, the feature should be able to handle photos of different resolutions. Sometimes the same photo might be published in a lower resolution on a mobile platform, while a high-resolution version is backed up on a computer. I hope the feature can automatically detect duplicates and present the options to the user for further action.

Platform

Server
Web
Mobile

AngelaDMerkel · 2023-05-26T00:08:58Z

AngelaDMerkel
May 26, 2023

It would also be useful to be able to detect updated versions of identical images. This is more applicable when publishing images from mirrorless cameras, where multiple versions might be created.

3 replies

cibernox Jul 10, 2023

Also important or something to be considered: Dedupe across users.

It is very common in a family setup that my wife and I happen to have the same photo because we've shared them through messaging apps. Having those photos/videos deduped (symlinks?) could save a significant amount of space and also processing.

tonya11en Jul 16, 2023

It might make more sense to have your filesystem perform deduplication for this scenario.

cibernox Jul 16, 2023

Good point, I didn’t think about that. That wouldn’t save processing time tho (and idk if the thumbnails and/or decompressed videos would be deduped, I believe compression algorithms are not 100% deterministic), so it may still be something worth pursuing.

agross · 2023-08-19T22:32:35Z

agross
Aug 19, 2023

I have this scenario:

I Imported my Google Photos Takeout to Immich
I found many of the imported images in my Dropbox-style "Camera Uploads", often in higher res/better quality than what Google's "Storage Saver" left me with

I still would want Immich to accept uploads of the same image in better quality. And some dedupe feature would be very nice in my current situation.

9 replies

flightmansam Sep 6, 2023

@agross This is flipping awesome! When a dupe is found, is it symlinked to the lesser file? or just delete the lesser?

Best!

agross Sep 7, 2023

When a dupe is found, is it symlinked to the lesser file? or just delete the lesser?

Excerpt from the README. A "group" refers to a set of duplicate assets:

If you click "Keep best asset" for the currently displayed group:

The best asset will be added to all albums of the group's other ("non-best") assets

The best asset will become a favorite if any asset in the group is a favorite

All "non-best" assets will be deleted

The group's information will be purged from your browser
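The "Keep best asset" steps listed above can be sketched as follows. This is a minimal illustration, not the tool's actual code: the `Asset` class and `keep_best` function are hypothetical names, how the "best" asset is picked is left to the caller, and deletion is modelled by shrinking an in-memory list.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Asset:
    # Hypothetical stand-in for an Immich asset; only the fields the steps mention.
    asset_id: str
    albums: Set[str] = field(default_factory=set)
    is_favorite: bool = False


def keep_best(group: List[Asset], best_id: str) -> Asset:
    """Merge a duplicate group into its best asset and return that asset."""
    best = next(a for a in group if a.asset_id == best_id)
    for other in group:
        if other is not best:
            # The best asset joins every album of the "non-best" assets.
            best.albums |= other.albums
    # The best asset becomes a favorite if any asset in the group is one.
    best.is_favorite = any(a.is_favorite for a in group)
    # Stand-in for deleting the "non-best" assets.
    group[:] = [best]
    return best
```

Album membership and the favorite flag are merged before anything is removed, so no information from the discarded copies is lost.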

Does that answer your question?

flightmansam Sep 8, 2023

Silly me, was right there! Yep answers my question!

janstadt Oct 10, 2023

I'm just curious if this works for an external library? I believe the existing libraries are read-only, but I would love to run through a deduplication process on those, as I have many years of dupes from a camera/HDD, to Flickr, to Google Photos, to hopefully Immich, and would like to use the existing library feature instead of uploading them with the CLI.

agross Oct 11, 2023

There is no external library support at the moment. All files are loaded from the local filesystem for duplicate detection. Duplicate metadata displayed while showing groups of dupes is loaded from the Immich API, so those assets need to be uploaded to Immich.

AngelaDMerkel · 2023-08-22T13:06:56Z

AngelaDMerkel
Aug 22, 2023

In #3816 it was mentioned that there is an open source tool available which can detect similar images, either by hash or by visual similarity, including where there are differing resolutions. If this had a manual approval process, it would be very useful for people dealing with multiple versions of the same image or rescales produced by messaging apps.

1 reply

cibernox Aug 22, 2023

IMO some kind of verification and approval step would be good, because with the "burst shot" of many cameras and phones that shoot half a dozen photos in a split second, they will be almost identical but one of them will have me with my eyes shut 😆

flightmansam · 2023-09-06T22:59:01Z

flightmansam
Sep 6, 2023

I think this would be a stunning feature

0 replies

AtlasC0R3 · 2023-09-07T23:06:36Z

AtlasC0R3
Sep 7, 2023

Cameras quite often take photos in JPEG and RAW (or DNG, NEF, whatever) format, and Immich subsequently will upload both without "stacking" them (as PhotoPrism calls it), treating both as if they were entirely different despite having the exact same file name except for the file extension.
I'd like to see where this feature goes, should it be implemented at all (as it is, it seems there is radio silence from the devs). It'd be ideally preferable to keep both and dynamically switch between them (e.g. use the JPEG for thumbnail but view the higher resolution/quality file and download either file as chosen by the user), but it could very well be a mess to adapt preexisting databases to suddenly maybe have two identical photo files, I realize.

Post edit note: I realize this really, really isn't the right issue (discussion? I'm eternally confused at these things) to have picked, it seems #2479 fits my need much more, still, I feel both ideas can be adapted as one.

5 replies

jrasm91 Sep 8, 2023
Maintainer

Radio silence from the devs... what is that supposed to even mean? That comes off a bit passive aggressive to me, which is not appreciated. We are very active for the most part and have already been engaging in the discussion of stacked photos in the one you found after the fact.

We will more than likely have a stacked photos implementation at some point in the future. Since this is an open source project and contributors work on their own priorities, who knows when that will be.

AtlasC0R3 Sep 8, 2023

Radio silence from the devs... what is that supposed to even mean? That comes off a bit passive aggressive to me, which is not appreciated. We are very active for the most part and have already been engaging in the discussion of stacked photos in the one you found after the fact.

My apologies if this came off as an offense, there has been no transparent indication on the progress of this development, is what I meant.

AngelaDMerkel Sep 8, 2023

@AtlasC0R3 This suggestion seems more suited for a raw developer than Immich in this case. A RAW file is nearly always the same resolution as the JPG + RAW created in camera and to serve the RAW to a device for viewing it must be debayered and served as a JPG anyways because a raw is not a raster image file. I don't think there is anything to be gained from image stacking in this use case.

AtlasC0R3 Sep 8, 2023

At the very least then, Immich should be able to understand the two (say, JPEG and RAW image) and decide "cool, get rid of the JPEG, we already have the RAW" as some phones will only save RAW formats for certain modes and not for others.
That would then fall under general deduplication, I just believe stacking can also be an effect caused by deduplication instead of only outright deleting it, however I acknowledge it may be a bit more complicated to implement.

AngelaDMerkel Sep 12, 2023

@AtlasC0R3 I don't think this is deduplication at all, as RAW files are not images and every RAW interpreter will produce a very different result. DNG files can sometimes be linear, but that is rare. Identifying and stacking RAW+JPEG seems like a fair approach but not the result from a deduplication feature.

amit-handa · 2023-10-13T01:25:27Z

amit-handa
Oct 13, 2023

It can be done via digikam application as well

0 replies

janstadt · 2023-10-13T02:17:00Z

janstadt
Oct 13, 2023

So if i have a bunch of photos in my external library (already loaded in the db) and i run some dedupe on the directory, what will happen with the assets in the library? Do i have to rescan the entire library or will immich handle the missing photos fine?

2 replies

alextran1502 Oct 13, 2023
Maintainer

I believe you can run "Remove Offline File" option in the library action to remove them

@etnoy is this correct?

Batwam Dec 27, 2023

I can confirm that this works for files in an external library, but the same option doesn't exist for files in the main library.

How do we refresh to remove files (and associated thumbnails/encoded-video) for files which have been removed from the main library? In my case, I regularly move the files to my external library (and convert them from HEIC to JPG in the process), so I'd want a way to remove the original ones and replace them with the ones from the external library folder.

Batwam · 2023-12-27T07:25:52Z

Batwam
Dec 27, 2023

with regards to duplicates, I think that it would be a good idea to have an optional "hide" feature which would hide files which have the same hash as an existing file. These could stay in the respective folders but only one photo is shown to avoid displaying the same picture twice. Any thoughts?
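A minimal sketch of the "hide" idea, assuming each file already has a content hash available; `visible_assets` is a hypothetical helper, not an Immich function. Nothing is deleted or moved; the function only decides which copy to display.

```python
from typing import Iterable, List, Tuple


def visible_assets(assets: Iterable[Tuple[str, str]]) -> List[str]:
    """Return the paths to show: the first path seen for each content hash.

    `assets` yields (path, content_hash) pairs in display order.
    """
    seen = set()
    shown = []
    for path, content_hash in assets:
        if content_hash not in seen:
            seen.add(content_hash)
            shown.append(path)
    return shown
```

For example, `visible_assets([("a.jpg", "h1"), ("b.jpg", "h2"), ("copy/a.jpg", "h1")])` keeps `a.jpg` and `b.jpg` and hides `copy/a.jpg`, which stays in its folder untouched.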

1 reply

cibernox Dec 27, 2023

Sounds like a non destructive start. After that having a button to manually trigger a deletion of the duplicates seems like the obvious next step.

madhavtummala · 2023-12-28T10:28:43Z

madhavtummala
Dec 28, 2023

dupeGuru is open source, and it has fuzzy-logic-based matching that works on contents and not just file names. There is a slider to adjust the match threshold as well. Can we port this code / write a version that can work with Immich? Seems like an interesting project!

5 replies

agross Dec 28, 2023

https://github.com/agross/immich-duplicates/ supports fuzzy search and is ready-made for Immich. Did you try it?

janstadt Dec 28, 2023

Does this work with external libraries?

agross Dec 28, 2023

If Immich generates thumbnails for external assets then it should work.

janstadt Dec 28, 2023

Thanks for the reply. Even if external assets are set up as readonly?

agross Dec 28, 2023

Deleting duplicates would obviously fail in this case if Immich's API respects the readonly flag. Apart from that, you will be able to browse through and ignore duplicates, provided thumbnails for external assets exist.

aisbergde · 2023-12-29T04:44:38Z

aisbergde
Dec 29, 2023

I use digiKam and allDup or other external programs for deduplication. This works great, but currently it can be used only on external libraries. When you delete duplicates in the internal libraries, these deletions are not (yet) detected or accepted by Immich.

2 replies

Batwam Dec 30, 2023

czkawka is pretty good too; it lets you choose the method and level of similarity. Same with Geeqie.

metheoryt Feb 12, 2024

I tried czkawka, dupeGuru and Video Duplicate Finder so far, and now trying digiKam - I have to say that digiKam outperformed all those tools by quality of duplicate search. Not mentioning rotated duplicates search, haven't tried it on digiKam, but it finds regular duplicates that these tools can't find.

metheoryt · 2024-02-12T18:51:14Z

metheoryt
Feb 12, 2024

Found another library for photo deduplicating, it is AI based so may perform better in many cases. This is for devs to consider to use in the server.

https://github.com/idealo/imagededup

0 replies

superdong69 · 2024-03-05T14:30:15Z

superdong69
Mar 5, 2024

Has there ever been any movement on this request? Just started using this wonderful app myself, and image duplication is a REAL pain.
Image exists on external library, phone uploads same photo, wife may have the same photo on her phone, and then we get triplicates.

2 replies

alextran1502 Mar 5, 2024
Maintainer

Duplicates will only be handled per user. We plan to add similarity detection on top of file-content comparison for duplicate detection.

AngelaDMerkel Mar 5, 2024

Is there a reason duplicate detection cannot be handled across users? Duplicates within a family, and therefore across users, are one of the more likely use cases for this. Enabling the admin (for example) to delete some duplicates would be very useful.

vale46n1 · 2024-03-18T19:38:42Z

vale46n1
Mar 18, 2024

Hello everyone,

I'm excited to share that I've developed a Python script which launches a local Streamlit website, specifically designed for finding duplicates based on either deep learning or hashing.

What makes this project particularly interesting is its capability to be integrated directly with Immich using the local server API. This should offer a seamless experience for those looking to streamline their workflows. The script is all set for testing and evaluation.
Let me know if you're interested or have any questions about the integration and how it works.

Looking forward to your interest and feedback.

Ciao
Stefano

13 replies

vale46n1 Mar 21, 2024

Considering the discussion, I'm planning to eventually implement a perceptual hash (pHash) system on an external database to enhance duplicate detection capabilities in my setup. This would complement the existing mechanisms and potentially offer more nuanced duplicate and similarity detection, especially for visually similar images that might not be byte-for-byte identical.

Amerlander Mar 22, 2024

I would love to see this integrated in the official immich release. I have a huge external library (100K+ Photos, 10K Videos, ~2TB) and some photos are duplicates or just very similar so it would be great if they were grouped. The app, that must not be named, had a feature, that created animations or time-lapse videos of sets of similar picture.

etnoy Mar 22, 2024
Collaborator

Why externally? This could be done within immich. I can help if needed

superdong69 Mar 22, 2024

DEFINITELY integrated. This needs to be in Immich, badly.

vale46n1 Mar 22, 2024

In the meantime, I have updated the release to include the capability to determine if a file is from an external source.

mdchristians · 2024-03-22T16:44:03Z

mdchristians
Mar 22, 2024

I feel like introducing a more basic check on media ingestion could be done that would solve a good majority of use cases for people. Something as simple as calculating phash when a photo is uploaded or scanned from an external directory and checking the DB for a matching phash would work in most cases.

If a match is found, just mark the image as a "potential duplicate" and either ask the user to verify or just hide those potential dupes.

Obviously this doesn't include videos and is going to miss photos that have been edited or cropped in some way, but will work for actual duplicates. If this is something that seems like a good idea, I'd love to add a PR
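The ingestion check described above can be sketched as follows. To keep the example dependency-free, a simple difference hash over an already-downscaled grayscale grid stands in for pHash (a swapped-in technique, not the one proposed), and an in-memory dictionary stands in for the DB lookup; `dhash` and `HashIndex` are hypothetical names.

```python
from typing import Dict, List, Sequence

Gray = Sequence[Sequence[int]]  # rows of 0-255 luminance values, already downscaled


def dhash(pixels: Gray) -> int:
    """Difference hash: one bit per horizontally adjacent pixel pair."""
    bits = 0
    for row in pixels:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


class HashIndex:
    """In-memory stand-in for a hash column with an index in the database."""

    def __init__(self) -> None:
        self._by_hash: Dict[int, List[str]] = {}

    def ingest(self, asset_id: str, pixels: Gray) -> List[str]:
        """Record the asset and return ids of earlier assets with the same hash."""
        h = dhash(pixels)
        matches = list(self._by_hash.get(h, []))
        self._by_hash.setdefault(h, []).append(asset_id)
        return matches
```

A non-empty return value is the point where the asset would be marked as a "potential duplicate" for the user to verify. Because the hash only encodes relative brightness between neighbouring pixels, a uniformly brightened copy still produces the same hash, while a cropped or heavily edited one generally does not.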

1 reply

vale46n1 Mar 22, 2024

Fully agreed. I plan to implement this feature in my app using a separate database, but having it integrated directly into Immich would undoubtedly be more convenient

PathToLife · 2024-03-23T10:35:11Z

PathToLife
Mar 23, 2024

I've been exploring more fuzzy duplicate detection, where we get a similarity percentage from a more sensitive phash implementation.

Similarity detection is a difficult task, as a family library of 100K+ images would mean 10 billion comparisons if every image was compared with every other image. Ideally some smart algorithm would be used to help speed things up by reducing comparisons. I'm currently using the Postgres GIN index to do this for us.

Below is an SQL table which stores the result of a similarity comparison search on all library images. The SQL table is currently external to Immich, however the sha1 column matches the asset.checksum column in Immich.
1.0 means identical, 0.6 - 0.7 is typical for images with emojis added to them.
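As a quick check of the arithmetic above, the sketch below counts both the ordered comparisons (every image against every image) and the distinct unordered pairs; `comparison_counts` is a hypothetical helper name.

```python
from typing import Tuple


def comparison_counts(n: int) -> Tuple[int, int]:
    """Return (n * n ordered comparisons, n * (n - 1) / 2 distinct pairs)."""
    return n * n, n * (n - 1) // 2


ordered, distinct = comparison_counts(100_000)
print(f"{ordered:,} ordered comparisons, {distinct:,} distinct pairs")
```

For 100,000 images this gives 10,000,000,000 ordered comparisons, or 4,999,950,000 if each pair is only compared once. Either way the count grows quadratically, which is why an index that narrows the candidates matters.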

The similarity calculation currently uses the Postgres pg_trgm extension to find similar phash hex strings. This extension module is already used by Immich in its search feature.

The pHash implementation uses the Python imagehash library.
However, I've found that the default pHash hash size is not sensitive enough for identical images with small editor modifications (such as the addition of a company logo). Thus I've upped the hash size to 16.
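To give a feel for the numbers pg_trgm produces on hex strings, here is a pure-Python approximation of its similarity score for a single alphanumeric word: the word is lower-cased, padded with two leading spaces and one trailing space, split into three-character windows, and the score is shared trigrams divided by total distinct trigrams. This is a sketch of the documented behaviour, not the extension's code, and the function names are hypothetical.

```python
def trigrams(word: str) -> set:
    """Distinct trigrams of one word, padded the way pg_trgm describes."""
    if not word:
        return set()
    padded = "  " + word.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)
```

Identical strings score 1.0. Changing one hex character alters every trigram that overlaps it, so the score drops faster than a bit-level comparison would suggest, and it depends on where in the string the change falls.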

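from PIL import Image
import imagehash

fp = 'some file path'
# hash_size=16 yields a 256-bit hash; str() gives its hex representation
phash_16 = str(imagehash.phash(Image.open(fp), hash_size=16))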

The hashes are then stored in a sql table. Note the varchar phash_16 field below is GIN indexed and used by the Postgres trgm extension to find similar values. As future work for a production implementation I'd look for the best data type for GIN indexes of pHash. For now varchar is good enough...?

CREATE TABLE IF NOT EXISTS hashes
(
    sha1      bytea
        constraint hashes_pk
            primary key,
    phash_16    varchar,
    file_size integer not null
);
CREATE INDEX phash_16_trgm_idx ON hashes USING GIN (phash_16 gin_trgm_ops);

Then we can ask Postgres to search for similar hashes like so (on 120K hashes: completed in 34 ms; execution: 23 ms, fetching: 11 ms). Note that the query below uses Python psycopg2 parameter placeholders.
For production, the search should live inside the NestJS microservices for better maintainability. Python should only be used for the hashing?

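# cur is an open psycopg2 cursor; %% is the escaped pg_trgm similarity operator %
cur.execute(
    '''SELECT sha1, phash_16, similarity(phash_16, %s) AS sml FROM hashes
       WHERE phash_16 %% %s AND sha1 != %s
       ORDER BY sml DESC, phash_16;''',
    (phash_16, phash_16, sha1)
)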

After getting some results, we can lookup the sha1 in immich assets table and workout where the file is stored (original file path) for viewing.

SELECT "originalFileName", "originalPath", checksum
FROM assets
WHERE "checksum" in (E'\\x<SHA1-HASH-HERE>')
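For reference, a minimal sketch of producing the sha1 value for a file on disk, assuming (as the post implies, though it is not verified here) that the checksum column holds the raw SHA-1 of the file's bytes. `sha1_of_file` is a hypothetical helper.

```python
import hashlib


def sha1_of_file(path: str, chunk_size: int = 1 << 20) -> bytes:
    """Return the raw 20-byte SHA-1 digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so large videos never sit fully in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.digest()
```

The returned bytes can be passed as a bytea query parameter, and calling `.hex()` on them gives the hex form used in the `E'\\x...'` literal above.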

Love to hear thoughts on your findings or possible improvements to this :)

6 replies

alextran1502 Mar 23, 2024
Maintainer

We have embedding data for CLIP search, and we can leverage that to search for similar images with a result distance of 0.1 (if I remember correctly, that means there are no differences between the two photos)

bo0tzz Mar 23, 2024
Maintainer

For searching/comparing phashes inside postgres, you might be better off leveraging the pgvecto.rs extension that we already use by storing the hashes as a binary vector. Then you can use a mechanism similar to the facial recognition to cluster them.
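A minimal sketch of what "storing the hashes as a binary vector" could look like on the Python side, assuming the hash is available as a hex string (a 16x16 pHash is 256 bits, i.e. 64 hex characters). The function names are hypothetical, and how the resulting vector would be stored or indexed in pgvecto.rs is not shown.

```python
from typing import List


def hex_to_bits(hex_hash: str) -> List[int]:
    """Expand a hex hash into a list of 0/1 values, most significant bit first."""
    value = int(hex_hash, 16)
    width = len(hex_hash) * 4
    return [(value >> (width - 1 - i)) & 1 for i in range(width)]


def hamming_distance(a: str, b: str) -> int:
    """Number of differing bits between two equal-length hex hashes."""
    if len(a) != len(b):
        raise ValueError("hashes must have the same length")
    return bin(int(a, 16) ^ int(b, 16)).count("1")
```

On 0/1 vectors the squared Euclidean distance equals the Hamming distance, so a vector index over these bit vectors ranks candidates the same way a direct bit comparison would.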

PathToLife Mar 24, 2024

Thanks for pointing me at the embedding vectors!

Here's the resultant query, the distance is set to 0.1.
The result will always contain one row representing the self comparison, with distance 0.

WITH pgv as (SELECT embedding from smart_search
                                           where "assetId" = uuid('<some-uid>'))
SELECT *
FROM (SELECT *, embedding <-> (select embedding from pgv) AS distance
      FROM smart_search) as s WHERE s.distance < 0.1;

Performance

Typical execution time on one image. Below is an image with 13 similar matches < 0.1
14 rows retrieved starting from 1 in 1 s 56 ms (execution: 1 s 48 ms, fetching: 8 ms)

This is significantly longer than the trgm GIN index on phash_16 hex strings. Note we have fewer matches here as the smart_search table supports m4v media embeddings.
12 rows retrieved starting from 1 in 37 ms (execution: 28 ms, fetching: 9 ms)

The below is the distance for two images with exact phash_16 hashes (but different sha1).
The target image has a distance of 0 when compared with itself, and a distance of ~0.00049 when compared with the exact phash_16 match.

Observations

I sampled the distance value on ~100K images, just to get an idea on the numbers we might get.

Count Images	Average Distance	Min Distance	Max Distance
100K	~0.74 - 0.85	0	~1.5 - 1.6

It seems edited images give a match distance of 0.01 - 0.09
Where exact & edited images give a match distance of <0.009

Other

I'm not familiar with how CLIP embeddings behave, but I hope the embedding prediction won't change too much if the Clip Model was retrained?

It seems the CLIP embedding is being generated here:

immich/server/src/services/smart-info.service.ts

Line 90 in c85563d

const clipEmbedding = await this.machineLearning.encodeImage(

The model config seems to be ViT-B-32__openai. huggingface

bo0tzz Mar 24, 2024
Maintainer

@mertalev for the index to get used, the vector queries need a particular query pattern right?

mertalev Mar 24, 2024
Maintainer

Yes, you need to order by the comparison and set a limit for good performance.
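A sketch of the earlier query rewritten in the pattern described here: order by the distance expression, cap the result with a limit, and apply the 0.1 threshold afterwards in application code. Table and column names are taken from the query earlier in the thread; `NEAREST_QUERY` and `close_matches` are hypothetical names, and whether the planner actually picks the vector index for this exact form has not been tested.

```python
from typing import Iterable, List, Tuple

# psycopg2-style placeholders: (asset id, limit).
NEAREST_QUERY = """
WITH pgv AS (SELECT embedding FROM smart_search WHERE "assetId" = %s)
SELECT "assetId", embedding <-> (SELECT embedding FROM pgv) AS distance
FROM smart_search
ORDER BY embedding <-> (SELECT embedding FROM pgv)
LIMIT %s;
"""


def close_matches(
    rows: Iterable[Tuple[str, float]], self_id: str, max_distance: float = 0.1
) -> List[Tuple[str, float]]:
    """Drop the self-comparison row and anything at or beyond the threshold."""
    return [(aid, d) for aid, d in rows if aid != self_id and d < max_distance]
```

Filtering after the limited fetch means a very popular image could have more near-duplicates than the limit returns, so the limit should be chosen generously.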

Deses · 2024-03-24T19:53:21Z

Deses
Mar 24, 2024

How does Immich behave when an external program or the user manually deletes pictures from the database? Is it left with a broken reference to a missing picture or does it take care of the issue somehow?

2 replies

PathToLife Mar 24, 2024

There should be a delete API an external program or user can use.

Touching the database directly is not preferred as Immich changes things between updates, what works one day, won't work the other.

If you still prefer to edit the database directly, read the server code for the delete asset route and find what it is doing to the database.

Deses Mar 24, 2024

Yeah I'm definitely not touching the DB directly. 😂

It's just that by reading this feature request there were people talking about deleting pictures with things like ImageDedup and other tools, so I wondered what happened with the deleted pictures within Immich.

If I had to dedupe media I'd use #1968 (comment), or wait for it to be integrated.

yourjelly · 2024-03-25T20:22:44Z

yourjelly
Mar 25, 2024

I would say ideally this would auto stack duplicates, on top of that though there seems to be no way to define which image defines a stack and about half the time Immich is using the raw file as the stack cover image which is a darker image.

0 replies

vale46n1 · 2024-03-26T12:38:41Z

vale46n1
Mar 26, 2024

👋 Hi there,

I've decided to proceed with implementing the FAISS way for analyzing assets via the DB API in Immich. The plan is to interface directly with Immich's asset database to retrieve images for processing. Here are the key components of the implementation strategy:

1. Image Retrieval:

Utilize Immich's DB API to fetch the images that need to be analyzed. This will involve making authenticated requests to the API and handling pagination to ensure all relevant assets are retrieved.

2. Image Preprocessing:

Before vectorization, images will be preprocessed to ensure compatibility with FAISS's requirements. This includes resizing images to a uniform dimension (or maybe getting thumbnail jpeg) and normalizing pixel values. Preprocessing steps might look something like this in Python:

from PIL import Image
import numpy as np

def preprocess_image(image_path):
    # Force RGB so every array has 3 channels, then resize to (224, 224) for uniformity
    img = Image.open(image_path).convert("RGB").resize((224, 224))
    # Convert to a float32 numpy array and normalize pixel values to [0, 1]
    img_array = np.asarray(img, dtype=np.float32) / 255.0
    return img_array

3. Feature Extraction and Vectorization:

To extract meaningful features from the images, we'll leverage a pre-trained deep learning model (such as ResNet, VGG, or any model suitable for image feature extraction) to generate a high-dimensional vector for each image. The output vectors will then be compatible with FAISS for indexing and similarity searches.

4. FAISS Indexing:

With the vectors generated, we'll create a FAISS index for efficient similarity searches. The index will allow us to quickly find similar and duplicate images by comparing vector distances. Here's a simplified example of creating a FAISS index:

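import faiss
import numpy as np

def create_faiss_index(vectors):
    # FAISS expects a C-contiguous float32 array of shape (n, dimension)
    vectors = np.ascontiguousarray(vectors, dtype=np.float32)
    dimension = vectors.shape[1]  # Dimension of the vectors
    index = faiss.IndexFlatL2(dimension)  # Exact search using L2 distance
    index.add(vectors)  # Adding the vectors to the index
    return index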

5. Similarity Search and Duplicate Identification:

For each vector, we'll perform a search against the FAISS index to identify similar and duplicate images based on defined parameters (e.g., distance thresholds). This step will categorize images into "similar" and "duplicate" based on their distance in the vector space, which will be adjustable based on specific needs.
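The categorization step can be sketched without FAISS as a brute-force pass over all pairs, which is only practical for small sets but shows the threshold logic. `categorize_pairs` and both threshold parameters are hypothetical; suitable values would have to be found empirically.

```python
import math
from typing import Dict, List, Sequence, Tuple


def categorize_pairs(
    vectors: Dict[str, Sequence[float]],
    duplicate_max: float,
    similar_max: float,
) -> Dict[str, List[Tuple[str, str]]]:
    """Label each pair of ids as "duplicate" or "similar" by Euclidean distance."""
    ids = sorted(vectors)
    result: Dict[str, List[Tuple[str, str]]] = {"duplicate": [], "similar": []}
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            distance = math.dist(vectors[a], vectors[b])
            if distance <= duplicate_max:
                result["duplicate"].append((a, b))
            elif distance <= similar_max:
                result["similar"].append((a, b))
    return result
```

One detail to watch when swapping in a FAISS search: `IndexFlatL2` reports squared L2 distances, so thresholds tuned on plain Euclidean distance would need to be squared before comparing.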

6. Results Storage:

Finally, the analysis results, including the similarity scores and categories (similar or duplicate), will be stored in a separate vector database. This enables efficient retrieval and further analysis or review of the categorized images.

Technical Details:

The script will automate the entire flow from image retrieval to result storage.
We'll ensure that the script is modular and easily configurable to accommodate different processing needs and parameters.
Special attention will be given to optimizing the preprocessing and vectorization steps to manage the computational load, especially when dealing with large image datasets.

I'm excited about leveraging FAISS for this project due to its efficiency and scalability in handling similarity searches.
I'll be starting on the implementation shortly and will keep the community updated on the progress.
Your feedback and suggestions are welcome!

2 replies

bo0tzz Mar 26, 2024
Maintainer

@vale46n1 before you put effort into this, you may want to know that one of the maintainers has started implementing duplicate detection inside Immich: #8228

vale46n1 Mar 26, 2024

Thank you for the heads up about the duplicate detection feature being developed in Immich, as mentioned in issue #8228. Before proceeding with my plan to implement a solution using FAISS, I'm curious about a few aspects of the in-progress work:

Duplicate Detection Method: Do you happen to know which method or algorithm the maintainers are employing for duplicate detection? For instance, are they using perceptual hashing (pHash) or another technique?
Similar Image Identification: Additionally, I'm interested in whether there will be support for identifying not just exact duplicates but also visually similar images. The FAISS-based approach I'm considering is particularly adept at this, allowing for fine-grained control over similarity thresholds.

Understanding these aspects will help me determine how best to complement the existing efforts or perhaps focus on areas that might not be covered by the internal implementation.

Thanks again for pointing out the ongoing work!

CraigInBrisbane · 2024-03-30T14:19:47Z

CraigInBrisbane
Mar 30, 2024

Would be nice to have a “how duplicate is this”. Sometimes I have 5 photos of the same scene taken within a minute. They won’t be 100% identical. But maybe 90%? So would be nice to be able to select the survivor and the rest will be victims.

2 replies

PathToLife Mar 30, 2024

The threshold for similarity will be highly dependent on the scene in the photo: sometimes 99.8% is good, sometimes 80% is good, etc. Images with watermarks and logos have high similarity with their unedited counterparts.

I think in your usecase of multiple images taken rapidly, pick one-or-more discard rest, it might be better to use time-based grouping rather than duplicate detection, but with better UI to automatically mark images for disposal.
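Time-based grouping of this kind can be sketched as follows: sort the shots by capture time and start a new group whenever the gap to the previous shot exceeds a threshold. `group_bursts` and the two-second default are assumptions for illustration only.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


def group_bursts(
    shots: List[Tuple[str, datetime]],
    max_gap: timedelta = timedelta(seconds=2),
) -> List[List[str]]:
    """Group asset ids whose capture times are at most `max_gap` apart in sequence."""
    groups: List[List[str]] = []
    previous: Optional[datetime] = None
    for asset_id, taken_at in sorted(shots, key=lambda s: s[1]):
        if previous is not None and taken_at - previous <= max_gap:
            groups[-1].append(asset_id)   # continues the current burst
        else:
            groups.append([asset_id])     # gap too large: start a new group
        previous = taken_at
    return groups
```

Each resulting group with more than one entry is a candidate set where the user picks one or more keepers and the UI marks the rest for disposal.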

Qhilm Aug 30, 2024

I think there's a stack feature upcoming that might help alleviate this problem a bit.

vale46n1 · 2024-04-01T19:05:55Z

vale46n1
Apr 1, 2024

I'd like to announce that I've uploaded a revised version of the Immich Duplicate Finder https://github.com/vale46n1/immich_duplicate_finder. This latest update introduces an advanced image analysis enhancement, powered by Facebook AI Similarity Search (FAISS) and deep learning feature extraction. These improvements significantly enhance our ability to identify duplicate or similar images, thus streamlining asset management and improving data quality.

Key Features:

FAISS Vector Database Generation: Leveraging the power of FAISS, we now generate and store high-dimensional feature vectors extracted from images. This allows for efficient similarity searches and duplicate detection.
Advanced Feature Extraction: Utilizing a pretrained ResNet18 model, we extract meaningful features from images, ensuring high accuracy in identifying similarities.
Euclidean Distance for Similarity Measurement: By employing Euclidean distance measures, our system accurately finds and groups similar images, aiding in the decluttering and organization of image assets.
Streamlit Integration: For an improved user experience, we've integrated Streamlit, providing an intuitive interface for progress tracking and interactive data exploration.

Next week, I plan to implement features to visually distinguish duplicates and group them together similarly to Google Photos. Additionally, I aim to introduce functionality for identifying and categorizing animals and various objects into similar groups.

These would be enhancements that I hope could be directly implemented into Immich.

Currently, I'm developing these features to address personal requirements and look forward to sharing more updates soon.

6 replies

chriexpe Apr 13, 2024

I didn't know I needed it until I've seen your project, thank you!
By any chance do you plan to natively implement it to Immich instead of using it as another webui/interface?

alextran1502 Apr 13, 2024
Maintainer

FYI, we are working on the integral mechanism in Immich #8228

vale46n1 Apr 17, 2024

That's great! I revised the logic and the code. Now it's working like a charm....

treggaz May 14, 2024

Hey vale46n1, just spun this up and I'm thoroughly impressed - truly great work and thank you for sharing it!

I do apologise if this is not the right place to ask but for duplicates that are deleted, is there some kind of block to stop these from being re-uploaded (ie through mobile auto backup)? Or the immich db can somehow retain the record of that particular upload to flag future uploads as duplicates?

I figure it may be too early for this level of integration until it's merged into the main project but figured you'd know what's possible!

Cheers :)

Qhilm Aug 30, 2024

Immich will prevent duplicates from being uploaded again – at least mine does =)

What I do still miss is that the copy that is kept is added back to all albums in which the other copies in the group were present.

For favorites, stars rating and descriptions, I don't have a simple solution. I think if a single copy of the group has stars/description, let's use this. If not, throw an error and ask for a manual decision? But that's not that important for me, I have zero ratings/stars and favourites at the moment. I do have some descriptions, but not that many.

blublub · 2024-10-04T09:19:46Z

blublub
Oct 4, 2024

Duplicates, as mentioned in this thread, can have multiple reasons:
1 - same image more than once: the user only wants to keep one version
2 - multiple Immich users want the same images in their own library: in this case Immich will need those files on the file system "n" times.

1 - can be addressed by a duplicate finder, with each user then manually deciding what to keep and what to delete

2 - is way more complex, as Immich would need to track and manage which files are identical and which users have them, and then keep track of whether any user deletes a file while another user wants to keep it. Not a trivial task.

Another way would be file-level deduplication, e.g. via ZFS. For Immich, all files/operations stay the same and all duplicates appear in the file system, but the OS/FS decides which blocks are duplicates and stores them only once. This would also dramatically decrease storage requirements in such environments.

I might give the ZFS option a try: attach a ZFS volume to my Immich VM, make a symlink for the "common" directory, and move the files over to the ZFS drive. In theory that should work just fine, as for Immich "nothing changes".

4 replies

chkuendig Oct 23, 2024

I'm using rdfind to deduplicate multiple overlapping imports (Lightroom, iCloud, manual uploads) to replace duplicates with hardlinks. It would be nice if immich could detect that these duplicates are actually the same file and somehow merge or automatically stack them without needing a manual review.

Alternatively, a way to automatically stack duplicates would be okay.

DBunevich Dec 12, 2024

In my case I rather would like to have "versions" of the same files.
Duplicates are not a problem for me as everything is managed, but I used to save hi-res and low-res JPG files, which are sort of duplicates but of a different resolution. Also, I don't need any of them deleted.
Certainly for browsing purposes I would like to use Lowres, but to download I would like to use Hires.
Is it something technically possible to be implemented?

Qhilm Dec 12, 2024

Isn't immich already generating previews for browsing, with a configurable resolution? That should fulfil your "low res" use case no?

DBunevich Dec 12, 2024

Isn't immich already generating previews for browsing, with a configurable resolution? That should fulfil your "low res" use case no?

Actually "Stacking" function does solve my problem. It is still quite manual, but, as I can see, it may be improved in the next year development plan.
There are some issues with stacking presently like if the photos are "stacked" in the timeline, they won't be stacked in "People" or "Albums". This is something which may be addressed in the future too.


Replies: 21 comments · 66 replies
