Notes from How Diffusion Models Work by DeepLearning.ai
- With Extra Noise
Taught By Sharon Zhou
Noted by Atul
- Example used throughout the course: generate 16x16 sprites for video games
- Goal: given a lot of sprite images, generate even more sprite images
What does the network learn?
- Fine details
- General outline
- Everything in between
Noising Process (bob as ink drop analogy)
(image: noising-process diagram)
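The ink-drop idea can be sketched in plain NumPy. This is a toy illustration, not the course's notebook code: the `add_noise_step` helper and the β value are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise_step(x_prev, beta, rng):
    """One forward-diffusion step: mix the image a little toward pure noise.

    x_t = sqrt(1 - beta) * x_{t-1} + sqrt(beta) * eps,  eps ~ N(0, I)
    """
    eps = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta) * x_prev + np.sqrt(beta) * eps

# A toy 16x16 "sprite" (the course's sprite size), noised step by step.
sprite = rng.uniform(-1.0, 1.0, size=(16, 16))
x = sprite
for t in range(50):
    x = add_noise_step(x, beta=0.05, rng=rng)

# After enough steps, almost none of the original sprite is left:
# like the ink drop, the image has diffused into noise.
corr = float(np.corrcoef(sprite.ravel(), x.ravel())[0, 1])
```

Each step keeps the image the same size; only the signal-to-noise ratio shrinks.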
Denoising Process (what should the NN think?)
- If it's Bob the sprite, keep it as it is
- If it's likely to be Bob, suggest more details to fill in
- If it's just an outline of a sprite, suggest general details for a likely sprite (Bob/Fred/...)
- If it's nothing, suggest the outline of a sprite
Give the NN input noise, whose pixels are sampled from a Normal distribution, and get a completely new sprite!
- Assume you have a trained NN
- At each denoising step, it predicts noise, and subtracts it to get a better image
- NOTE: At each denoising step, some random noise is added again to prevent "mode collapse"
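The steps above can be sketched as a DDPM-style sampling loop. This is a minimal NumPy sketch under assumed hyperparameters; `predict_noise` here is a hypothetical stand-in for the trained network, not the course's model.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 25
betas = np.linspace(1e-4, 0.2, T)   # assumed noise schedule, for illustration
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def predict_noise(x_t, t):
    # Stand-in for the trained UNet: returns zeros so the loop runs end to end.
    return np.zeros_like(x_t)

x = rng.standard_normal((16, 16))   # start from pure Gaussian noise
for t in reversed(range(T)):
    eps_hat = predict_noise(x, t)
    # Subtract the (scaled) predicted noise to get a better image ...
    x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
    # ... then add a little fresh noise back in (except at the final step),
    # the trick the notes mention for avoiding mode collapse.
    if t > 0:
        x = x + np.sqrt(betas[t]) * rng.standard_normal(x.shape)
```

Skipping the extra-noise line tends to make every sample drift toward the same average-looking output.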
- UNet Architecture
- Input and output of same size
- First used for image segmentation
Takes a noisy image, embeds it into a smaller space by downsampling, then upsamples back to predict the noise
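The same-size-in, same-size-out property can be illustrated without any neural network at all. This sketch is not a real UNet (no convolutions, no skip connections); it only shows the downsample-then-upsample shape flow.

```python
import numpy as np

def downsample(x):
    """Halve the spatial size with 2x2 average pooling."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(x):
    """Double the spatial size by nearest-neighbour repetition."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

x = np.random.default_rng(2).standard_normal((16, 16))
bottleneck = downsample(downsample(x))   # 16x16 -> 8x8 -> 4x4
out = upsample(upsample(bottleneck))     # 4x4 -> 8x8 -> 16x16
```

A real UNet also concatenates the downsampling activations into the upsampling path (skip connections), which is what lets it recover fine detail.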
Can take more info in the form of embeddings
- Time: related to the timestep, and hence the noise level added
- Context: guides the generation process
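One common way to embed the timestep is a sinusoidal encoding; this is a generic sketch (the course's notebook may build its time embedding differently), with `dim` and `max_period` chosen arbitrarily for the example.

```python
import numpy as np

def time_embedding(t, dim=8, max_period=1000.0):
    """Embed a scalar timestep as a `dim`-vector of sines and cosines
    at geometrically spaced frequencies."""
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])

emb = time_embedding(t=500)
```

Nearby timesteps get nearby vectors, which lets the network smoothly condition on the noise level.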
Check out `forward()` in the sampling notebook
Learns the distribution of what is "not noise"
- Sample a training image, a timestep t, and noise, all at random
- The timestep controls the level of noise added
- Randomisation ensures a stable model
- Add noise to image
- Feed this into the NN, which predicts the noise
- Compute loss between actual and predicted noise
- Backprop and learn
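The training steps above can be sketched as a single iteration. The noise schedule and `predict_noise` stub are assumptions for the example; in the real course code the prediction comes from the UNet and the loss is backpropagated through it.

```python
import numpy as np

rng = np.random.default_rng(3)
T = 500
betas = np.linspace(1e-4, 0.02, T)     # assumed linear schedule
alpha_bars = np.cumprod(1.0 - betas)

def predict_noise(x_t, t):
    # Stand-in for the UNet being trained.
    return np.zeros_like(x_t)

# 1. Sample a training image, a timestep t, and noise, all at random.
x0 = rng.uniform(-1.0, 1.0, size=(16, 16))
t = int(rng.integers(0, T))
eps = rng.standard_normal(x0.shape)

# 2. Add noise at the level implied by t (closed-form forward process):
#    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps
x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

# 3. Predict the noise and compute MSE loss against the true noise.
eps_hat = predict_noise(x_t, t)
loss = float(np.mean((eps - eps_hat) ** 2))
# 4. In a real framework, backprop through `loss` and update the weights.
```

The closed-form step is why a random timestep works: the noisy image at any t can be produced directly from the clean image, without simulating every intermediate step.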
- Embeddings are vectors; for instance, strings represented as number vectors
- Given as input to NN along with training image
- Get associated with a training example, and its properties
- Uses: Generate funky mixtures by combining embeddings
- Context formats
- Text
- Categories, one-hot encoded (e.g. hero, non-hero, spells, ...)
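One-hot context, and the "funky mixtures" trick of blending embeddings, can be sketched like this. The category list is an example based on the notes, not the course's exact label set.

```python
import numpy as np

# Hypothetical sprite categories (from the "hero, non-hero, spells ..." example).
CLASSES = ["hero", "non-hero", "food", "spell", "side-facing"]

def one_hot(label):
    """Encode a category name as a one-hot context vector."""
    vec = np.zeros(len(CLASSES))
    vec[CLASSES.index(label)] = 1.0
    return vec

hero = one_hot("hero")
spell = one_hot("spell")
# Blending embeddings: half hero, half spell -> a mixture the model
# was never trained on directly.
mix = 0.5 * hero + 0.5 * spell
```

Passing `mix` as context asks the network for something between a hero and a spell, which is where the funky mixtures come from.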
- DDPM is slow!
- Many timesteps, and the Markovian nature of sampling (each step depends on the previous one)
- DDIM is faster: skips steps, making the process deterministic
- Lower quality than DDPM
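The speed difference comes down to how many denoising steps are visited. This sketch only compares step schedules; the stride of 25 is an arbitrary example, not a value from the course.

```python
# DDPM must visit every timestep; DDIM can jump over most of them.
T = 500                                            # training timesteps
ddpm_schedule = list(range(T - 1, -1, -1))         # 500 denoising steps
stride = 25                                        # example skip size
ddim_schedule = list(range(T - 1, -1, -stride))    # only 20 denoising steps

print(len(ddpm_schedule), len(ddim_schedule))      # 500 20
```

Each DDIM step covers a larger jump deterministically, trading some sample quality for a big reduction in network evaluations.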
Other applications: Music, Inpainting, Textual Inversion