VAE
A tiny variational autoencoder (VAE) that encodes images into latent space for latent diffusion models (LDM) with limited computing power. In this implementation, I designed the VAE with three residual blocks in both the encoder and the decoder. In the encoder, I also added a squeeze-and-excitation mechanism to the trunk of the residual blocks to strengthen feature extraction without the sharp increase in computational cost that attention brings (see the sketch in the Implementation section below). The pre-trained model on the FFHQ 256x256 dataset can be found here.

To enable inpainting with LDM, the encoder and decoder in this VAE are carefully designed so that a mask in pixel space can be mapped into latent space with simple operations such as interpolation (see the sketch in the Mask section below). The decoder understands masked images in latent space and reconstructs both the masked and unmasked areas correctly in pixel space.

Repository (private)
Implementation
Encoder:

```mermaid
graph LR
    Images[Images] --> E0(Lift)
    E0(Lift) --> E1(Res0) --> E2(Res1) --> E3(Res2)
    E3(Res2) --> E4Mean(Linear) --> Mean[Mean] --> Sample(Sample)
    E3(Res2) --> E4Var(Linear) --> Var[Var] --> Sample(Sample)
    Sample(Sample) --> Embedding[Embedding]
```
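The Mean and Var heads feed the Sample node, which is presumably the standard reparameterization trick. A minimal sketch in PyTorch, assuming the variance head predicts log-variance (a common convention, not confirmed here):

```python
import torch

def sample_latent(mean: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Reparameterization trick: z = mean + sigma * eps, with eps ~ N(0, I).

    Assumes `logvar` is the log-variance predicted by the Var head.
    """
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mean + std * eps
```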
Decoder:

```mermaid
graph LR
    Embedding[Embedding] --> D4(Lift)
    D4(Lift) --> D3(Res2) --> D2(Res1) --> D1(Res0)
    D1(Res0) --> D0(Project) --> Reconstructed[Reconstructed]
```
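The Res blocks in the encoder carry the squeeze-and-excitation (SE) mechanism mentioned above. A minimal sketch of such a block in PyTorch; the class name SEResBlock, the channel counts, the activation, and the reduction ratio are illustrative assumptions, not the actual configuration:

```python
import torch
import torch.nn as nn

class SEResBlock(nn.Module):
    """Hypothetical residual block with squeeze-and-excitation in the trunk."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )
        # Squeeze-and-excitation: global pooling -> bottleneck -> channel gates.
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.SiLU(),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.trunk(x)
        h = h * self.se(h)  # reweight channels per sample
        return x + h
```

The SE branch costs only a global pooling and two 1x1 convolutions per block, which is why it adds feature-extraction capacity far more cheaply than attention.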
Training

The VAE is trained on the FFHQ 256x256 dataset on a single NVIDIA TITAN Xp with 12 GB of VRAM. Mixed-precision training is used with a batch size of 28 to maximize VRAM usage, and the learning rate is scheduled from 1e-4 to 0 with cosine annealing.
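A hedged sketch of this training setup in PyTorch; `vae`, `vae_loss`, and `loader` are hypothetical stand-ins for the actual model, objective, and FFHQ data pipeline, which are not shown here:

```python
import torch
from torch.cuda.amp import GradScaler, autocast

def train(vae, vae_loss, loader, num_steps):
    """Mixed-precision training with a cosine-annealed learning rate."""
    optimizer = torch.optim.Adam(vae.parameters(), lr=1e-4)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=num_steps, eta_min=0.0)  # anneal 1e-4 -> 0
    scaler = GradScaler()  # float16 autocast lets batch size 28 fit in 12 GB

    for images in loader:  # batches of shape (28, 3, 256, 256)
        optimizer.zero_grad(set_to_none=True)
        with autocast():
            loss = vae_loss(vae, images.cuda())
        scaler.scale(loss).backward()
        scaler.step(optimizer)  # unscales gradients, then steps the optimizer
        scaler.update()
        scheduler.step()
```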
Result
[Image grid: Original | Reconstructed]
Mask
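A minimal sketch of mapping a pixel-space mask into latent space by interpolation, as described in the introduction; the latent spatial size is an assumption, since the exact downsampling factor is not stated here:

```python
import torch.nn.functional as F

def mask_to_latent(mask, latent_hw=(32, 32)):
    """Downsample a pixel-space mask of shape (N, 1, 256, 256) to the latent grid.

    Because the encoder downsamples on a regular grid, nearest-neighbor
    interpolation keeps the pixel and latent masks spatially aligned.
    `latent_hw` is an assumed latent resolution, not the actual one.
    """
    return F.interpolate(mask, size=latent_hw, mode="nearest")
```

The interpolated mask can then be composed with the latent embedding (for example, by zeroing masked positions) before decoding; the exact composition used here is not specified.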
[Image grid: Original | Target | Masked in Latent Space]