Deep Learning Tips & Tricks

Architecture Design and Basics:

  • Choose an appropriate architecture based on the problem: CNNs for image tasks, RNNs/LSTMs/GRUs for sequential data, Transformers for sequences and attention-based tasks, etc.
  • Start with simpler architectures and gradually increase complexity if needed.
  • Model selection should be driven by the size of your dataset, available resources, and problem domain.

Convolutional Neural Networks (CNNs):

  • Larger kernels (7x7, 5x5) cover more spatial context per layer but are computationally expensive.
  • Smaller filter sizes (3x3) with multiple layers can achieve similar receptive fields with fewer parameters.
  • Padding helps retain spatial dimensions and is essential for preserving information at the edges.

Smaller or Larger Kernel Size in CNNs:

  • Convolution layers with 5x5 or 7x7 kernels grow the receptive field faster than 3x3 layers. Nevertheless, many state-of-the-art architectures prefer 3x3 convolutions, stacking several layers to reach a receptive field comparable to a single larger kernel. The comparison below shows the receptive field covered by convolution layers of different kernel sizes.

  • Receptive Field Comparison:

    • 2 consecutive 3x3 conv layers = one 5x5 conv layer
    • 3 consecutive 3x3 conv layers = one 7x7 conv layer
  • In fact, using multiple conv layers with small kernels instead of a single layer with a larger kernel brings two advantages (see the parameter-count sketch after this list):

    • The input passes through multiple non-linear activations instead of a single one, which makes the model's decision function more discriminative.

    • The number of parameters and computations decreases: for C channels, two 3x3 layers use 2·(3·3)·C² = 18C² weights versus 25C² for one 5x5 layer. The gap widens in 3D domains such as medical imaging, where kernel cost grows cubically, so small kernels are strongly recommended there.
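
A quick way to verify the savings is to count weights directly. A minimal PyTorch sketch (channel count chosen arbitrarily for illustration) comparing one 7x7 convolution against three stacked 3x3 convolutions with the same receptive field:

```python
import torch.nn as nn

def count_params(module):
    """Total number of trainable parameters in a module."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

C = 64  # channel count, chosen arbitrarily for illustration

# One 7x7 convolution vs. three stacked 3x3 convolutions
# (same 7x7 receptive field, fewer parameters, more non-linearities).
single_7x7 = nn.Conv2d(C, C, kernel_size=7, padding=3, bias=False)
stacked_3x3 = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False), nn.ReLU(),
    nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False), nn.ReLU(),
    nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False), nn.ReLU(),
)

print(count_params(single_7x7))   # 7*7*64*64     = 200_704
print(count_params(stacked_3x3))  # 3*(3*3*64*64) = 110_592
```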

Receptive Field and Layer Depth:

  • Decide the number of layers based on the problem's complexity and desired receptive field.
  • The receptive field you need relative to the input size guides network depth, balancing local detail against global context.

Striding and Pooling:

  • Strided convolutions and pooling increase the receptive field, helping the network learn long-range dependencies.
  • They also reduce the spatial dimensions of the feature maps, which lowers computational cost.
  • However, downsampled feature maps lose fine-grained, low-level detail such as organ boundaries and edges, which degrades segmentation quality; skip connections are a good way to counter this.

Multi-Scale Input Processing:

  • Multi-scale input processing lets the network examine the input image from two different perspectives at once (a minimal sketch follows below).
  • The input image is fed through two parallel convolution pathways, much as in siamese architectures.
  • The first pathway operates on the image at full resolution and extracts detailed, local appearance features; the second processes a downsampled copy of the same image and extracts higher-level, more generalized features.
  • Reference Paper: "Efficient multi-scale 3D CNN with fully connected CRF for accurate brain lesion segmentation"
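
As a rough illustration of the idea (in 2D for brevity, although the cited paper works in 3D; module names and sizes here are hypothetical), one branch sees the full-resolution input while the other sees a downsampled copy, and their features are fused for a dense prediction:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualPathwayNet(nn.Module):
    """Toy two-pathway network: a full-resolution branch for local detail
    and a downsampled branch for broader context. Names/sizes illustrative."""
    def __init__(self, in_ch=1, feat=16):
        super().__init__()
        def pathway():
            return nn.Sequential(
                nn.Conv2d(in_ch, feat, 3, padding=1), nn.ReLU(),
                nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU(),
            )
        self.local_path = pathway()    # operates on the original resolution
        self.context_path = pathway()  # operates on a downsampled copy
        self.head = nn.Conv2d(2 * feat, 2, 1)  # e.g., 2-class segmentation

    def forward(self, x):
        local = self.local_path(x)
        # Process a half-resolution view, then upsample back for fusion.
        small = F.interpolate(x, scale_factor=0.5, mode="bilinear",
                              align_corners=False)
        context = self.context_path(small)
        context = F.interpolate(context, size=local.shape[-2:],
                                mode="bilinear", align_corners=False)
        return self.head(torch.cat([local, context], dim=1))

logits = DualPathwayNet()(torch.randn(1, 1, 64, 64))  # -> (1, 2, 64, 64)
```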

Atrous (Dilated) Convolutions:

  • Atrous (dilated) convolution increases the receptive field of feature maps without reducing their resolution, which makes it well suited to preserving fine-grained detail in segmentation tasks.
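
In PyTorch, dilation is a single argument; a minimal example showing that a dilated 3x3 convolution keeps the spatial resolution while covering a 5x5 region:

```python
import torch
import torch.nn as nn

# A 3x3 kernel with dilation=2 covers a 5x5 region (larger receptive field)
# while padding=2 keeps the spatial resolution unchanged.
x = torch.randn(1, 64, 32, 32)
atrous = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)
print(atrous(x).shape)  # torch.Size([1, 64, 32, 32])
```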

Pooling and Segmentation:

  • Pooling extracts geometrically invariant features but might affect segmentation tasks negatively.
  • Skip connections help preserve spatial details while still leveraging hierarchical features.

Batch Normalization (Batch-Norm):

  • Batch normalization stabilizes training by normalizing inputs to each layer, improving convergence and generalization.
  • It is usually inserted between the convolution and the activation (conv → batch norm → ReLU) for better performance.
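
A minimal sketch of that ordering in PyTorch:

```python
import torch.nn as nn

# Conventional ordering: convolution -> batch norm -> activation.
# bias=False because BatchNorm's learned shift makes the conv bias redundant.
block = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(128),
    nn.ReLU(inplace=True),
)
```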

Depth-wise Separable Convolutions:

  • Depth-wise separable convolutions reduce computational cost by factorizing a standard convolution into a per-channel spatial convolution followed by a 1x1 point-wise convolution that mixes channels.
  • Suitable for scenarios with limited computational resources.
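
A sketch comparing the parameter counts of a standard convolution and its depth-wise separable counterpart (sizes chosen arbitrarily for illustration):

```python
import torch.nn as nn

def count_params(m):
    return sum(p.numel() for p in m.parameters())

# Standard 3x3 convolution: mixes space and channels in one step.
standard = nn.Conv2d(64, 128, kernel_size=3, padding=1, bias=False)

# Depth-wise separable: per-channel 3x3 conv (groups=in_channels),
# then a 1x1 "point-wise" conv to mix channels.
separable = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=64, bias=False),
    nn.Conv2d(64, 128, kernel_size=1, bias=False),
)

print(count_params(standard))   # 3*3*64*128       = 73_728
print(count_params(separable))  # 3*3*64 + 64*128  =  8_768
```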

Attention Mechanisms:

  • Attention mechanisms enhance model performance by focusing on relevant parts of the input sequence or image.
  • Self-attention mechanisms are particularly useful for capturing long-range dependencies.
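
At the core of most attention variants is scaled dot-product attention; a minimal single-head sketch (assuming query, key, and value share one feature size):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """Minimal single-head attention: softmax(QK^T / sqrt(d)) V."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)  # (..., L_q, L_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v

x = torch.randn(2, 10, 64)  # (batch, sequence, features)
out = scaled_dot_product_attention(x, x, x)  # self-attention: q = k = v
print(out.shape)            # torch.Size([2, 10, 64])
```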

Gradient Clipping:

  • Gradient clipping prevents exploding gradients during training by capping gradient values.
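
In PyTorch this is a one-line addition to the training loop; a toy loop for illustration:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(32, 10), torch.randn(32, 1)

for _ in range(5):
    loss = nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    # Rescale gradients so their global L2 norm is at most 1.0,
    # preventing a single bad batch from blowing up the update.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
```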

Learning Rate Annealing:

  • Gradually reduce the learning rate during training to fine-tune model convergence.
  • Techniques like step decay, cosine annealing, and one-cycle learning rates can be beneficial.
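
PyTorch ships these schedules ready-made; a minimal sketch using cosine annealing (step decay and one-cycle are drop-in alternatives):

```python
import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Cosine annealing: smoothly decay the LR toward zero over 100 epochs.
# Alternatives: StepLR(optimizer, step_size=30, gamma=0.1) for step decay,
# or OneCycleLR for the one-cycle policy.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    # ... one epoch of training goes here ...
    optimizer.step()   # placeholder for the actual training updates
    scheduler.step()   # advance the schedule once per epoch
```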

Normalization Techniques:

  • Apart from batch normalization, explore layer normalization and instance normalization for different scenarios.
  • Normalization techniques help stabilize training and improve generalization.

Label Smoothing:

  • Introduce a small amount of noise to ground-truth labels to prevent the model from becoming overconfident in predictions.
  • Label smoothing can lead to better generalization and calibration of model probabilities.
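
PyTorch's cross-entropy loss supports label smoothing directly (available since PyTorch 1.10):

```python
import torch
import torch.nn as nn

# With smoothing=0.1, the target for the true class becomes 0.9 and the
# remaining 0.1 is spread uniformly over the other classes.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

logits = torch.randn(8, 10)           # (batch, num_classes)
targets = torch.randint(0, 10, (8,))  # integer class labels
loss = criterion(logits, targets)
```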

One-Shot Learning:

  • Develop models that can learn from just a few examples, which is particularly useful for tasks with limited training data.
  • Siamese networks and metric learning techniques can be employed for one-shot learning.

Zero-Shot Learning:

  • Zero-shot learning involves training a model to recognize classes it has never seen during training.
  • This is achieved by leveraging auxiliary information or semantic embeddings.

SPP Layer (Spatial Pyramid Pooling):

  • SPP layers capture information at multiple scales, improving object detection by handling various object sizes.

Residual Networks (ResNets) and Skip Connections:

  • Residual connections facilitate gradient flow and alleviate vanishing gradient issues.
  • Skip connections help build deeper networks without suffering from degradation problems.
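
A minimal residual block sketch: the identity shortcut gives gradients a direct path around the convolutional body.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: y = ReLU(F(x) + x)."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        # The identity shortcut lets gradients bypass the body entirely.
        return torch.relu(self.body(x) + x)

y = ResidualBlock(64)(torch.randn(1, 64, 16, 16))
```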

Topological and Geometric Insights:

  • Understand the topology of the data to guide architecture design.
  • Geometric invariance might not always be desirable, especially in tasks like segmentation.

Backpropagation and Optimization:

  • Use activation functions (ReLU, Leaky ReLU) to avoid vanishing gradients and improve convergence.
  • Employ optimization techniques like Adam, RMSProp, or SGD with momentum for faster convergence.
  • Learning rate scheduling can help in balancing exploration and exploitation during training.

Regularization and Dropout:

  • Regularization techniques like L2 regularization, dropout, and batch normalization help prevent overfitting.
  • Apply dropout in moderation to prevent underfitting.

Hyperparameter Tuning:

  • Experiment with learning rates, batch sizes, and network architectures.
  • Utilize techniques like random search or Bayesian optimization for efficient hyperparameter tuning.

Neural Architecture Search (NAS):

  • NAS automates the process of finding optimal neural network architectures.
  • Techniques like reinforcement learning and evolutionary algorithms are used for NAS.

Quantization and Model Compression:

  • Reduce model size and inference latency through techniques like quantization and pruning.
  • Quantization involves representing weights with fewer bits, while pruning removes less important connections.
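
Both techniques are available out of the box in PyTorch; a minimal sketch of post-training dynamic quantization and magnitude pruning on a toy model:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# Post-training dynamic quantization: weights stored in int8,
# activations quantized on the fly. No retraining required.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# Magnitude pruning: zero out the 30% smallest weights of one layer.
prune.l1_unstructured(model[0], name="weight", amount=0.3)
```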

Domain Adaptation:

  • Adapt a model trained on one domain to perform well on a different but related domain.
  • Techniques like domain adversarial training and self-training can be used for domain adaptation.

Semi-Supervised Learning:

  • Combine labeled and unlabeled data to improve model performance, especially when labeled data is scarce.
  • Techniques like pseudo-labeling and consistency regularization are common in semi-supervised learning.

Gaussian Processes in Bayesian Deep Learning:

  • Utilize Gaussian processes for uncertainty estimation and probabilistic modeling in deep learning.
  • Bayesian neural networks and variational inference techniques also contribute to uncertainty quantification.

Data Augmentation:

  • Apply data augmentation techniques (rotation, cropping, flipping) to increase the diversity of the training data.
  • Augmentation helps improve model generalization.
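
A typical torchvision augmentation pipeline for image classification (the normalization statistics are the usual ImageNet values):

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),      # random crop + rescale
    transforms.RandomHorizontalFlip(),      # flip with probability 0.5
    transforms.RandomRotation(degrees=15),  # small random rotation
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),  # ImageNet stats
])
```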

Transfer Learning and Pretrained Models:

  • Utilize pretrained models and fine-tuning to leverage features learned on large datasets.
  • Adapt pretrained models to your specific task to improve convergence and performance.
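
A common recipe with recent torchvision: load an ImageNet-pretrained backbone, replace the classification head, and either freeze the backbone or fine-tune it with a small learning rate (the 5-class head is an arbitrary example):

```python
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained backbone.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Option 1: freeze the backbone and train only the new head.
for param in model.parameters():
    param.requires_grad = False

model.fc = nn.Linear(model.fc.in_features, 5)  # new head trains by default
# Option 2 (fine-tuning): leave requires_grad=True and use a small LR.
```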

Meta-Learning:

  • Meta-learning involves training models to learn how to learn new tasks more efficiently.
  • Few-shot learning and model-agnostic meta-learning (MAML) are common meta-learning approaches.
  • Experiment with different meta-learning algorithms and adaptation strategies.

Knowledge Distillation:

  • Train compact "student" models to mimic the behavior of larger "teacher" models.
  • Knowledge distillation helps transfer knowledge from complex models to smaller ones.
  • Experiment with different temperature settings and loss functions for effective distillation.
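
A minimal sketch of the standard distillation loss: a temperature T softens both distributions, and the soft term is scaled by T² to keep gradient magnitudes comparable (alpha balances hard and soft targets):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets,
                      temperature=4.0, alpha=0.5):
    """Blend hard-label cross-entropy with soft-label KL to the teacher."""
    hard = F.cross_entropy(student_logits, targets)
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return alpha * hard + (1 - alpha) * soft

loss = distillation_loss(torch.randn(8, 10), torch.randn(8, 10),
                         torch.randint(0, 10, (8,)))
```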

Transformers:

  • Transformers are attention-based architectures that excel at modeling long-range dependencies in sequential and structured data (e.g., text, images, videos). They replace recurrence and convolution with self-attention mechanisms, enabling parallel processing and scalability.
  • Self-Attention Mechanism:
    • Computes weighted interactions between all input tokens, dynamically focusing on relevant context. Multi-head attention extends this by capturing diverse relationships in parallel.
  • Positional Encoding:
    • Injects spatial/temporal order into input embeddings (e.g., sine/cosine functions or learned embeddings) since Transformers lack inherent sequential bias.
  • Scalability:
    • Pretrained on massive datasets (e.g., BERT for NLP, ViT for vision), Transformers transfer well to downstream tasks via fine-tuning. Larger models (e.g., GPT-4) achieve state-of-the-art results but require significant computational resources.
  • Efficiency Innovations:
    • Techniques like sparse attention, axial attention, or memory-efficient variants (e.g., Linformer) reduce the quadratic complexity of self-attention for long sequences.
  • Tips and Tricks:
    • Pretraining and Transfer Learning:
      • Start with pretrained models (e.g., Hugging Face’s Transformers library) and fine-tune on domain-specific data. Use task-specific adapters to avoid full retraining.
    • Manage Sequence Length:
      • For long inputs, truncate, chunk, or use hierarchical attention. FlashAttention or mixed-precision training can optimize GPU memory usage.
    • Hybrid Architectures:
      • Combine CNNs (local features) with Transformers (global context) in vision tasks (e.g., Swin Transformer, ConvNeXt).
    • Warmup and Decay:
      • Apply learning rate warmup (gradually increasing the LR) to stabilize early training, then follow with cosine decay for convergence; a minimal schedule sketch follows this list.
    • Regularization:
      • Use dropout in attention layers and feed-forward networks. Applying layer normalization before each sublayer (pre-norm) rather than after the residual addition (post-norm) often improves training stability.
    • Hardware Optimization:
      • Leverage tensor cores (FP16/AMP) and model parallelism (e.g., pipeline or tensor sharding) for large models.
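
The warmup-then-cosine schedule mentioned above can be expressed as a simple LR multiplier (step counts here are arbitrary placeholders):

```python
import math
import torch

model = torch.nn.Linear(10, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

warmup_steps, total_steps = 1000, 100_000

def lr_lambda(step):
    # Linear warmup, then cosine decay to zero.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# Call scheduler.step() once per optimizer step during training.
```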

Mamba:

  • Mamba is a sequence architecture built on selective state-space models (SSMs). In place of attention, it maintains a fixed-size recurrent state updated with input-dependent parameters, which gives linear-time processing of long sequences (a toy sketch of the selective recurrence follows this section).
  • Selective state spaces:
    • The SSM parameters (step size, input and output projections) are computed from the current input, letting the model choose what to write into its state and what to forget.
  • Linear-time scan:
    • Training uses a hardware-aware parallel scan; inference is a recurrence, so per-token cost and memory stay constant regardless of context length (no key-value cache).
  • Memory efficiency:
    • The constant-size state makes Mamba attractive for very long inputs and for deployment in resource-constrained environments.
  • Reference Paper: "Mamba: Linear-Time Sequence Modeling with Selective State Spaces"
  • Tips and Tricks:
    • Leverage pretraining:
      • Pretrain on similar tasks to capture domain-specific features, then fine-tune for your application.
    • Use mixed precision training:
      • Reduce computational load and speed up training while maintaining accuracy by using mixed precision techniques.
    • Tune the state dimension:
      • The state size and block expansion factor control how much history the model can retain; validate on long-context tasks when changing them.
    • Consider hybrid designs:
      • Interleaving a few attention layers with SSM blocks (as in hybrid models such as Jamba) can recover precise token-level recall while keeping most of the efficiency.

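To make the idea concrete, here is a deliberately simplified selective-recurrence sketch (an illustration only, not Mamba's actual parameterization; real Mamba replaces the Python loop with a hardware-aware parallel scan):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToySelectiveSSM(nn.Module):
    """Toy selective state-space layer (an illustration, not Mamba itself).
    Core idea: a recurrent state h is updated with *input-dependent*
    decay/gates, so the model chooses what to remember or forget."""
    def __init__(self, dim=32, state=16):
        super().__init__()
        self.to_delta = nn.Linear(dim, 1)     # input-dependent step size
        self.to_b = nn.Linear(dim, state)     # input-dependent write path
        self.to_c = nn.Linear(dim, state)     # input-dependent read gate
        self.log_a = nn.Parameter(torch.zeros(state))  # base decay rates
        self.out = nn.Linear(state, dim)

    def forward(self, x):                     # x: (batch, seq_len, dim)
        batch, seq_len, _ = x.shape
        h = x.new_zeros(batch, self.log_a.numel())
        ys = []
        for t in range(seq_len):              # linear in sequence length
            xt = x[:, t]
            delta = F.softplus(self.to_delta(xt))          # (batch, 1)
            a = torch.exp(-delta * torch.exp(self.log_a))  # decay in (0, 1)
            h = a * h + (1 - a) * self.to_b(xt)            # selective update
            ys.append(self.out(torch.sigmoid(self.to_c(xt)) * h))
        return torch.stack(ys, dim=1)         # (batch, seq_len, dim)

y = ToySelectiveSSM()(torch.randn(2, 100, 32))  # constant-size state per step
```
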
Mamba vs. Transformers:

1. Core Architecture

  • Transformers:

    • Built on self-attention mechanisms to model relationships between all input tokens (e.g., words, image patches).
    • Fixed architecture (predefined layers, heads, dimensions) with positional encoding for sequence awareness.
    • Dominates tasks requiring global context (e.g., machine translation, image classification with ViT).
  • Mamba:

    • Built on selective state-space models: a fixed-size recurrent state replaces attention and is updated with input-dependent parameters.
    • Scales linearly with sequence length, using a hardware-aware parallel scan for training.
    • Well suited to long sequences and resource efficiency (e.g., edge devices, streaming inference).

2. Attention and Context Handling

  • Transformers:

    • Global self-attention: Captures interactions between all tokens, enabling long-range dependencies.
    • Quadratic complexity (O(n^2)) limits scalability for long sequences (e.g., high-resolution images, genomics).
    • Mitigations: Sparse attention (e.g., Longformer), chunking, or memory-efficient variants (FlashAttention).
  • Mamba:

    • No pairwise attention: a selection mechanism decides what each token writes into the recurrent state, compressing history into a fixed-size summary.
    • A short causal convolution before the SSM captures local patterns, while the recurrent state carries long-range context.
    • Linear complexity (O(n)) makes very long sequences (e.g., genomics, audio, long documents) practical.

3. Computational Efficiency

  • Transformers:

    • High memory/FLOPs cost due to dense attention and large parameter counts.
    • Requires heavy optimization (mixed precision, model parallelism) for training/inference.
    • Ideal for GPU/TPU clusters but struggles on edge devices.
  • Mamba:

    • Linear-time training and constant per-token inference cost; with no key-value cache, memory does not grow with context length.
    • A hardware-aware scan implementation keeps the recurrence fast on modern GPUs.
    • Lower latency for long inputs and streaming settings (e.g., real-time video or audio processing).

4. Scalability and Training

  • Transformers:

    • Scale exceptionally with data and parameters (e.g., GPT-4, PaLM).
    • Pretraining on massive datasets is critical for downstream performance.
    • Stable training with standardized techniques (warmup, layer normalization).
  • Mamba:

    • Early results suggest scaling competitive with Transformers at small-to-mid scales, but it is far less battle-tested at the largest scales.
    • Benefits from large-scale pretraining much like any sequence model.
    • Training is largely standard, though the state recurrence needs numerically careful parameterization (e.g., keeping decay factors inside (0, 1)).

5. Use Case Suitability

  • Transformers Excel At:

    • Global context tasks: Language modeling, cross-modal retrieval (text-to-image).
    • Large-scale pretraining: Transfer learning to diverse downstream tasks.
    • High-resource environments: Cloud/TPU-based inference.
  • Mamba Excels At:

    • Very long sequences: Genomics, audio, and long-document modeling where quadratic attention is impractical.
    • Streaming and real-time processing: Constant per-token cost and memory (e.g., autonomous systems, low-latency video analysis).
    • Resource-limited applications: Edge devices, mobile/embedded systems.

6. Strengths and Weaknesses

| Aspect     | Transformers                                                                        | Mamba                                                                                                      |
|------------|-------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------|
| Strengths  | State-of-the-art accuracy; global context modeling; massive, well-proven scalability | Linear-time sequence modeling; constant-memory inference (no KV cache); strong on very long sequences       |
| Weaknesses | Quadratic complexity; high memory use; overkill for simple tasks                     | Less proven at extreme scales; weaker at precise in-context recall/copying; smaller ecosystem and tooling   |

7. Practical Tips

  • When to Choose Transformers:

    • Your task requires global context (e.g., document summarization).
    • You have abundant computational resources and pretraining data.
    • You need a well-supported architecture (e.g., Hugging Face ecosystem).
  • When to Choose Mamba:

    • Sequences are very long or streamed (e.g., genomics, audio, logs).
    • Latency/memory constraints are critical (e.g., IoT and edge devices).
    • You can work with a younger ecosystem that has fewer pretrained checkpoints and less tooling.
