F3OCUS: Federated Finetuning of Vision-Language Foundation Models with Optimal Client Layer Updating Strategy via Multi-objective Meta-Heuristics

1 Department of Engineering Science, University of Oxford
2 Oxford-Suzhou Centre for Advanced Research
3 Imperial College London
4 University of Birmingham

*Corresponding Author

CVPR 2025

Overview of F3OCUS

F3OCUS overview

Highlights

🌟 Parameter-Efficient Finetuning (PEFT) of Vision-Language Models (VLMs) for resource-constrained Federated Learning.
🌟 Uses principal eigenvalue of layerwise Neural Tangent Kernels (NTKs) as client-specific layer importance score.
🌟 Distributed layer participation: Encourages diverse layer selection across clients for optimal performance.
🌟 Joint optimization: Uses multi-objective meta-heuristics on the server to optimize both importance and diversity.
🌟 Ultra-MedVQA Dataset: 707,962 VQA triplets, 9 modality-specific clients, 12 anatomical categories.
🌟 Extensive Experiments: 10,000+ client-level runs, 6 FL settings, 58 datasets, 4 VLMs.

Motivation

How can we efficiently and fairly fine-tune massive foundation models for real-world FL in domains with strict privacy and limited resources?

🔒 Challenges

(1) Resource Constraints: Fine-tuning foundation models exceeds client resources.
(2) Layer Selection: Naive selection overfits and ignores client-specific needs.
(3) Competing Goals: Balancing client adaptation and collaborative layer usage is hard.

🔑 Solutions

Client Layer Importance: NTK spectral analysis scores each layer per client.
Server-Side Layer Balancing: Meta-heuristics maximize adaptation and distribute layer updates.
Privacy-Preserving: Optimization happens without sharing raw data.

Distinction from prior works

Approach comparison

(a) Vanilla: selects layers only on local client data, neglecting collaboration.
(b) F3OCUS: jointly maximizes client-specific layer importance and distributes layer adaptation across clients.

Client-level Layer Importance

Layerwise Neural Tangent Kernel (LNTK)

We use the Layerwise Neural Tangent Kernel (LNTK) to score which layers should be adapted for each client. The LNTK captures how sensitive the model’s output is to each layer's parameters on a client’s data.

For neural network function \(f\) trained on client data \(\mathcal{X}\):

$$ \dot{f} = -\eta \Theta(\mathcal{X}, \mathcal{X}) \nabla_{f} F (\mathcal{X}; \theta_t) $$

The NTK matrix:

$$ \Theta(\mathcal{X}, \mathcal{X}) = \nabla_{\theta} f(\mathcal{X}; \theta_t) \nabla_{\theta} f(\mathcal{X}; \theta_t)^T $$

...can be split into per-layer (LNTK) components:

$$ \Theta(\mathcal{X}, \mathcal{X}) = \sum_{l=1}^L \Theta^l(\mathcal{X}, \mathcal{X}) $$

$$ \Theta^l(\mathcal{X}, \mathcal{X}) = \nabla_{\theta^l} f (\mathcal{X}; \theta^l) \nabla_{\theta^l} f(\mathcal{X}; \theta^l)^T $$

The principal eigenvalue \(\lambda^l_1\) of each layer’s LNTK determines how quickly that layer learns and aligns with the client’s data.
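As a minimal sketch of the decomposition above, the snippet below builds each layer's LNTK as \(J_l J_l^T\) from a per-layer Jacobian, checks that the layer kernels sum to the full NTK, and estimates each principal eigenvalue by power iteration. It assumes a scalar-output model, so each kernel is an \(n \times n\) matrix over \(n\) samples. The toy Jacobians, the helper names (`gram`, `principal_eigenvalue`), and the choice of power iteration are illustrative assumptions, not the authors' implementation.

```python
import random

def gram(J):
    """Return J J^T for a Jacobian J stored as a list of rows (one row per sample)."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in J] for r1 in J]

def principal_eigenvalue(K, iters=500, seed=0):
    """Estimate the largest eigenvalue of a symmetric PSD matrix by power iteration."""
    rng = random.Random(seed)
    n = len(K)
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(iters):
        w = [sum(K[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:  # K maps v to zero (e.g. K is the zero matrix)
            return 0.0
        v = [x / norm for x in w]
    Kv = [sum(K[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(a * b for a, b in zip(v, Kv))  # Rayleigh quotient v^T K v

# Hypothetical per-layer Jacobians for 3 samples:
# layer 1 has 2 parameters, layer 2 has 3 parameters.
J1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
J2 = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

theta_1, theta_2 = gram(J1), gram(J2)          # layerwise NTKs
J_full = [r1 + r2 for r1, r2 in zip(J1, J2)]   # concatenate the parameters of all layers
theta_full = gram(J_full)                      # full NTK
theta_sum = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(theta_1, theta_2)]

lam_1 = principal_eigenvalue(theta_1)  # principal eigenvalue of layer 1's LNTK
lam_2 = principal_eigenvalue(theta_2)  # principal eigenvalue of layer 2's LNTK
```

In this toy case `theta_sum` matches `theta_full` entry by entry, which is the additive split stated above. For real models the Jacobians would come from automatic differentiation on client data rather than hand-written lists.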

Layerwise LNTK ranks illustration
LNTK-based layer ranks across clients/rounds. Darker = higher importance.

We define the importance score for layer \(l\) on client \(i\) as:

$$ S^l_i = \frac{\lambda^l_{i, 1}}{\sum_{k=1}^L \lambda^k_{i, 1}} $$

This enables selective, efficient adaptation of the most relevant layers for each client.
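The normalization above can be sketched in a few lines. The eigenvalues below are hypothetical, and `top_k_layers` shows only a purely local selection based on one client's own scores; it is an illustration, not the server-side F3OCUS procedure.

```python
def importance_scores(principal_eigs):
    """Normalize per-layer principal eigenvalues so one client's scores sum to 1."""
    total = sum(principal_eigs)
    if total == 0:
        raise ValueError("all principal eigenvalues are zero")
    return [lam / total for lam in principal_eigs]

def top_k_layers(scores, k):
    """Indices of the k highest-scoring layers, best first."""
    return sorted(range(len(scores)), key=lambda l: scores[l], reverse=True)[:k]

eigs = [3.0, 4.0, 1.0]                # hypothetical lambda_1 values for L = 3 layers
scores = importance_scores(eigs)      # [0.375, 0.5, 0.125]
selected = top_k_layers(scores, k=2)  # [1, 0]
```

Because the scores of each client sum to 1, they are comparable across clients even when the raw eigenvalue magnitudes differ.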

Server-side Multi-objective Optimization

Balancing Adaptation and Layer Diversity

F3OCUS ranks illustration
Multi-objective Optimization-based layer ranks across clients/rounds. Darker = higher importance.

On the server, we refine layer selection to jointly maximize adaptation (via client-specific importance) and balance layer usage across clients. We formalize this as a multi-objective problem:

Objectives:

  • Maximize total client importance: \( \sum_{i=1}^N \sum_{l=1}^L S_i^l \)
  • Minimize variance in layer usage: \( \frac{1}{L} \sum_{l=1}^L (n_l - \bar{n})^2 \), where \( n_l \) is the number of clients choosing layer \( l \) and \( \bar{n} \) is the average.

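Constraint: Each client \(i\) can select at most \(L_{\text{i,max}}\) layers, its own budget. Here \(m_i^l \in \{0, 1\}\) indicates whether client \(i\) selects layer \(l\), so that \(n_l = \sum_{i=1}^N m_i^l\).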

$$ \begin{cases} \mathbf{max} & \sum\limits_{i=1}^{N} \sum\limits_{l=1}^L S_{i}^{l} \\ \mathbf{min} & \frac{1}{L} \sum\limits_{l=1}^{L} ( n_l - \bar{n} )^2 \\ \text{s.t.} & \sum_{l \in \mathcal{L}_i} m_{i}^{l} \leq L_{\text{i,max}}, \ \forall i \end{cases} $$
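To make the two objectives and the budget constraint concrete, here is a small sketch that evaluates them for one candidate assignment. It assumes a binary mask `masks[i][l]` (1 if client `i` selects layer `l`) and counts importance only over selected layers. That weighting is an assumption made here: because each client's scores are normalized to sum to 1, an unmasked sum over all layers would be constant. The function names, scores, and budgets are hypothetical.

```python
def total_importance(scores, masks):
    """Sum of importance scores S_i^l over the layers each client actually selects."""
    return sum(s * m
               for s_row, m_row in zip(scores, masks)
               for s, m in zip(s_row, m_row))

def usage_variance(masks):
    """Variance of n_l, the number of clients selecting each layer l."""
    num_layers = len(masks[0])
    counts = [sum(row[l] for row in masks) for l in range(num_layers)]
    mean = sum(counts) / num_layers
    return sum((c - mean) ** 2 for c in counts) / num_layers

def is_feasible(masks, budgets):
    """Each client may select at most its own budget of layers."""
    return all(sum(row) <= b for row, b in zip(masks, budgets))

# Hypothetical example: N = 3 clients, L = 4 layers, budget of 2 layers each.
scores = [[0.4, 0.3, 0.2, 0.1],
          [0.5, 0.1, 0.3, 0.1],
          [0.1, 0.2, 0.3, 0.4]]
masks = [[1, 1, 0, 0],
         [1, 0, 1, 0],
         [0, 0, 1, 1]]
budgets = [2, 2, 2]

importance = total_importance(scores, masks)  # to be maximized
variance = usage_variance(masks)              # to be minimized
feasible = is_feasible(masks, budgets)
```

Here the layer usage counts are 2, 1, 2, 1, so the variance is 0.25, while the selected importance is about 2.2. A server-side search would compare many such feasible masks on both values.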

Layer selection histogram
Server-level optimization promotes diverse layer participation and avoids overfitting to a few layers.

Why Meta-heuristics? Traditional optimization methods struggle because the search space of client-layer assignments is huge and no client data is available on the server. We use five representative meta-heuristic algorithms to find a set of Pareto-optimal trade-offs:

  • NSGA-II: Evolves layer selections via crossover/mutation; balances exploration and exploitation using non-dominated sorting and crowding distance.
  • Artificial Bee Colony (ABC): Bees represent candidate selections, with phases for exploring, exploiting, and diversifying solutions based on importance and diversity.
  • Ant Colony Optimization (ACO): Ants build layer assignments using pheromone trails and importance scores, encouraging exploration and reinforcing successful patterns.
  • Simulated Annealing (SA): Explores layer assignments via probabilistic acceptance, gradually refining towards balanced, high-importance solutions as the "temperature" cools.
  • Multi-Objective Particle Swarm Optimization (MOPSO): Particles update assignments by combining their own best with globally best solutions, balancing adaptation and diversity.
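All five algorithms search for assignments that trade importance against usage variance, and the notion of an "optimal trade-off" rests on Pareto dominance, which NSGA-II's non-dominated sorting applies directly. The sketch below shows only the dominance test and the filtering of a candidate set; it performs no search. The candidate values are hypothetical `(importance, variance)` pairs and the function names are chosen for illustration.

```python
def dominates(a, b):
    """True if a is no worse than b on both objectives and strictly better on one.

    Each candidate is (importance, variance): importance is maximized,
    variance is minimized.
    """
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better

def pareto_front(candidates):
    """Keep the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]

# Hypothetical objective values of five candidate layer assignments.
candidates = [(2.2, 0.25), (2.4, 1.00), (2.0, 0.25), (2.4, 0.50), (1.8, 0.00)]
front = pareto_front(candidates)
# front == [(2.2, 0.25), (2.4, 0.5), (1.8, 0.0)]
```

The candidate `(2.4, 1.00)` is dropped because `(2.4, 0.50)` has the same importance with lower variance, and `(2.0, 0.25)` is dropped because `(2.2, 0.25)` has higher importance at the same variance.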

Ultra-MedVQA Dataset

Ultra-MedVQA overview

Ultra-MedVQA is a large-scale, federated medical VQA benchmark featuring 707,962 image–QA triplets across 9 imaging modalities and 12 anatomical categories, distributed as 9 simulated clients. It maximizes real medical image utility and supports robust federated VQA evaluation.

  1. Preparation of Original Dataset:
    Compiled from 10 diverse medical datasets covering 9 imaging modalities: Chest X-ray (117,976), Optical Coherence Tomography (109,309), Colon Pathology (107,180), Dermatoscope (10,015), Fundus (1,600), Ultrasound (780), Blood Microscope (17,092), Kidney Cortex Microscope (236,386), and Abdominal CT (107,624). Each modality acts as one client, split 80% train / 20% test.
  2. Question-Answer Template Design:
    Category and attribute labels are programmatically converted to QA pairs using structured templates. Example: “Q: What is the specific diagnosis for the lung in this image? A: Pneumothorax.” Questions also cover modality, anatomy, and tissue localization.
    • Question types: Modality Recognition (~10%), Anatomy Identification (~20%), Disease Diagnosis (~39%), Disease Grading (~1%), Tissue ID (~20%), Other Biological Attributes (~10%).
  3. Question-Answer Refinement:
    To maximize natural diversity, each question is paraphrased using ChatGPT-4o for varied style and expression, while preserving meaning.
  4. Manual Double Checking:
    All samples were further reviewed to ensure QA pair correctness and overall dataset quality.
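As a quick consistency check on the figures in step 1, the per-modality counts can be summed and compared with the stated total of 707,962 triplets. The counts are copied from the text; the variable names are illustrative.

```python
# Per-modality sample counts as listed in step 1 (one client per modality).
modality_counts = {
    "Chest X-ray": 117_976,
    "Optical Coherence Tomography": 109_309,
    "Colon Pathology": 107_180,
    "Dermatoscope": 10_015,
    "Fundus": 1_600,
    "Ultrasound": 780,
    "Blood Microscope": 17_092,
    "Kidney Cortex Microscope": 236_386,
    "Abdominal CT": 107_624,
}

total = sum(modality_counts.values())  # 707962
num_clients = len(modality_counts)     # 9

# Share of the dataset held by each client, to expose the size imbalance.
shares = {name: count / total for name, count in modality_counts.items()}
largest = max(modality_counts, key=modality_counts.get)
smallest = min(modality_counts, key=modality_counts.get)
```

The counts add up to the stated total, and the largest client (Kidney Cortex Microscope) holds roughly 300 times as many samples as the smallest (Ultrasound), so the benchmark is strongly imbalanced in client size.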

Ultra-MedVQA provides a challenging, multi-modal federated VQA setting with fine-grained evaluation across anatomical regions (Colon, Lung, Skin, Eye, Breast, Kidney, Blood, Femur, Heart, Liver, Pancreas, Spleen) and modalities.

FL Settings, Datasets, and Tasks

We evaluate F3OCUS for selective fine-tuning of layers and adapters across four vision-language models (ViLT, ALBEF, LLaVA-1.5, BLIP-2) and multiple federated learning settings:

  • Visual Question Answering:
    • (i) Task 1: 5 clients, domain gap (SLAKE, VQA-RAD, VQA-Med 2019/2020/2021)
    • (ii) Task 2: 8 modality clients (CT, Ultrasound, etc.)
    • (iii) Task 3: 9 clients on Ultra-MedVQA
  • Multi-label Disease Classification:
    • (iv) Task 4: 4 clients (Open-I), with label shift (Dirichlet, 15 classes)
    • (v) Task 5: 10 clients (MIMIC), with label shift (Dirichlet, 15 classes)
  • Heterogeneous Tasks:
    • (vi) Task 6: Combined VQA (3 clients) and disease-classification (2 clients)
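Label shift via a Dirichlet distribution, as used in Tasks 4 and 5, is commonly simulated by drawing, for each class, a vector of client proportions and splitting that class's samples accordingly. The sketch below is a simplified illustration that assumes one label per sample, whereas the tasks above are multi-label. The concentration parameter `alpha`, the seed, and the function names are assumptions, not values from the source.

```python
import random

def dirichlet(alpha, k, rng):
    """Sample a symmetric Dirichlet(alpha) vector of length k via normalized Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    if total == 0.0:  # extremely unlikely underflow; fall back to uniform
        return [1.0 / k] * k
    return [d / total for d in draws]

def partition_by_label(labels, num_clients, alpha, seed=0):
    """Split sample indices across clients with class-wise Dirichlet proportions."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)

    clients = [[] for _ in range(num_clients)]
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        props = dirichlet(alpha, num_clients, rng)
        start, cum = 0, 0.0
        for c in range(num_clients):
            cum += props[c]
            # The last client takes the remainder so every sample is assigned once.
            end = len(idxs) if c == num_clients - 1 else round(cum * len(idxs))
            clients[c].extend(idxs[start:end])
            start = end
    return clients

# Hypothetical setup: 1,500 samples, 15 classes, 10 clients, alpha = 0.5.
labels = [i % 15 for i in range(1500)]
clients = partition_by_label(labels, num_clients=10, alpha=0.5)
```

Smaller values of `alpha` concentrate each class on fewer clients and so produce stronger label shift; larger values approach a uniform split.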

Device Heterogeneity: We vary the number of trainable layers per client/task to simulate diverse resource budgets.

Qualitative Performance

t-SNE Visualization showing separability of the embeddings across different clients.

Explanation: The t-SNE plots above visualize the learned feature spaces for eight different medical imaging modalities. Compared to Federated Dropout and the standard approach of fine-tuning the last K layers, F3OCUS produces more distinct and well-separated feature clusters, demonstrating improved class discriminability and representation quality across clients. This highlights the effectiveness of our selective layer adaptation strategy for heterogeneous federated learning.

Quantitative Performance

Performance Table
Table: Performance comparison on VLM layer selection with heterogeneous resources across clients for six federated tasks and multiple model architectures.

Explanation: The table above summarizes the accuracy (Tasks 1-3) and F1-scores (Tasks 4-6) across different federated learning settings, including domain gap, modality gap, label shift, and task heterogeneity. We benchmark our method, F3OCUS, against a wide range of state-of-the-art baselines including pruning, layer selection, and meta-heuristic approaches across ViLT and BLIP models.

F3OCUS consistently outperforms all competing methods in every task and model setting, demonstrating superior adaptation and collaboration in heterogeneous federated scenarios. Notably, our approach achieves higher scores with fewer trainable parameters, highlighting its efficiency for real-world resource-constrained federated learning. The reduction in computation and communication needs is shown in the Table below for Task 1:

Performance Table
Table: Performance comparison with other PEFT methods on VLM layer selection across clients for Task 1.

Explanation: The table above presents a comprehensive comparison between F3OCUS and 5 leading PEFT baselines: LayerNorm Tuning (LN), LoRA, Bias Tuning, Prompt Tuning, and FedDAT. We evaluate each method in terms of communication overhead, computational requirements, parameter efficiency, and accuracy on all clients.

F3OCUS surpasses all PEFT baselines except full adapter tuning and FedDAT. Both of these fine-tune all adapters on every client, whereas F3OCUS selectively adapts only 4 adapter layers per client. This substantially reduces communication cost (down to 9.7 MBits) and computation (80.6 GFLOPs) while remaining competitive in accuracy across clients.

Conclusion: These results highlight the efficiency and scalability of F3OCUS for real-world federated learning scenarios, where client resource constraints are critical.

BibTeX


@inproceedings{saha2025f3ocus,
  title={F3OCUS: Federated Finetuning of Vision-Language Foundation Models with Optimal Client Layer Updating Strategy via Multi-objective Meta-Heuristics},
  author={Pramit Saha and Felix Wagner and Divyanshu Mishra and Can Peng and Anshul Thakur and David A. Clifton and Konstantinos Kamnitsas and J. Alison Noble},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2025}
}