Consensus‑of‑CLIPs‑IT: Agreement‑Driven Selection of Multimodal Instruction Data

Abstract

Instruction‑tuned vision–language models (VLMs) inherit their capabilities—and their failure modes—from the quality of the instruction data used for alignment. Existing curation pipelines typically score image–text alignment with a single encoder or rely on heuristic filters, which miss multimodal groundedness (does the answer rely on the image?), answerability, and diversity across skills (OCR, charts, diagrams, multilingual). We propose Consensus‑of‑CLIPs‑IT, a selector that aggregates heterogeneous CLIP‑family encoders to score instruction triples \((\text{Image}, \text{Prompt}, \text{Answer})\). Our method fuses cross‑model Agreement, robust Disagreement, probabilistic Confidence, and an image‑conditioned Groundedness term that rewards answers that depend on the image beyond either the prompt or the answer alone. Coupled with diversity‑aware subset selection, the curated instruction sets consistently improve downstream VLMs (e.g., LLaVA, Qwen‑VL, IDEFICS) on MMMU, MMBench, MM‑Vet, TextVQA, ChartQA, DocVQA, and hallucination stress tests (POPE). Ablations show encoder diversity, groundedness, and robust fusion are key drivers. We release a reproducible toolkit for instruction‑data scoring and selection.

Keywords: instruction tuning, data selection, CLIP ensemble, groundedness, hallucination, OCR, multilingual

1 Introduction

Instruction tuning aligns VLMs to follow multimodal instructions, but naïvely mixing public datasets and synthetic dialogs produces brittle models: they hallucinate visual facts, overfit to easy caption‑like prompts, and under‑represent long‑tail skills. While single‑encoder filters suppress obvious noise, they inherit one model’s biases and fail to measure whether the answer actually uses the image.

Hypothesis. Heterogeneous CLIP‑family encoders encode complementary inductive biases (hierarchy, geometry, probabilistic uncertainty, multilingual calibration). Aggregating them with a grounded, instruction‑aware objective will better identify useful instruction triples, yielding VLMs that follow instructions more accurately and hallucinate less.

Contributions.

  1. Consensus‑of‑CLIPs‑IT. A robust, instruction‑aware scoring function over triples \((I, P, R)\) that combines Agreement, Disagreement, Confidence, and Groundedness.
  2. Diversity‑aware selection. Greedy \(k\)-center on concatenated multi‑encoder embeddings with skill‑aware stratification (OCR, charts, diagrams, multilingual).
  3. Comprehensive evaluation. Consistent gains on reasoning and text‑rich benchmarks and reduced hallucination rates under matched training budgets.
  4. Ablations. Encoder‑panel diversity, groundedness efficacy, fusion rules, data scale, and domain shift.
  5. Toolkit. Open scorer + selection code with manifests for reproducibility.

3 Method: Instruction‑Aware Consensus Scoring

3.1 Setup

Each candidate is a triple \((I, P, R)\): image \(I\), instruction/prompt \(P\), and response/answer \(R\). A panel of \(M\) CLIP‑family encoders \(\{f_m\}\) provides similarities \(s_m(I, X)\) for text \(X \in \{P, R, P{+}R\}\) and (for probabilistic encoders) uncertainty proxies \(u_m(I, X)\).
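
As a concrete illustration of how the per‑encoder similarities might be computed for one panel member, the following is a minimal sketch using the open_clip library; the architecture/pretrained tag and the simple whitespace concatenation of prompt and response are assumptions, not part of the method's specification.

import torch
import open_clip
from PIL import Image

# One panel member; the architecture/pretrained tag and the whitespace join
# of P and R are illustrative choices, not the paper's specification.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-16", pretrained="laion2b_s34b_b88k")
tokenizer = open_clip.get_tokenizer("ViT-B-16")
model.eval()

@torch.no_grad()
def similarities(image_path: str, prompt: str, response: str) -> dict:
    """Return s(I, P), s(I, R), s(I, P+R) as cosine similarities.
    Note: CLIP-style text encoders truncate text to their context length,
    so very long P+R strings are clipped."""
    image = preprocess(Image.open(image_path)).unsqueeze(0)
    texts = tokenizer([prompt, response, prompt + " " + response])
    img = model.encode_image(image)
    txt = model.encode_text(texts)
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    s = (img @ txt.T).squeeze(0)
    return {"P": s[0].item(), "R": s[1].item(), "P+R": s[2].item()}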

We standardize each encoder’s similarities online:

\[ \tilde{s}_m(I, X) = \frac{s_m(I, X) - \mu_m}{\sigma_m}. \]
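
The running statistics \(\mu_m, \sigma_m\) can be maintained online, one accumulator per encoder; below is a minimal sketch assuming a Welford‑style update. The class name and the epsilon guard are illustrative choices.

import math

class RunningZScore:
    """Welford-style running mean/variance for one encoder's similarities.
    Illustrative helper: the method only specifies z-scoring against the
    encoder's running statistics, not this particular accumulator."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations

    def update(self, s: float) -> None:
        self.n += 1
        delta = s - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (s - self.mean)

    def z(self, s: float) -> float:
        std = math.sqrt(self.m2 / max(self.n - 1, 1)) if self.n > 1 else 1.0
        return (s - self.mean) / max(std, 1e-8)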

3.2 Human‑Readable Scoring Terms

Agreement (cross‑model alignment):

\[ \textbf{Agreement}(I,P,R) = \mathrm{median}_m\Big(\tilde{s}_m(I, P{+}R)\Big). \]

Disagreement (robust dispersion):

\[ \textbf{Disagreement}(I,P,R) = \mathrm{MAD}_m\Big(\tilde{s}_m(I, P{+}R)\Big). \]

Confidence (uncertainty bonus; lower variance is better):

\[ \textbf{Confidence}(I,P,R) = -\,\mathrm{mean}_m\Big(u_m(I, P{+}R)\Big). \]

Groundedness (image‑conditioned synergy beyond either text alone):

\[ \textbf{Groundedness}(I,P,R) = \mathrm{median}_m\Big(\tilde{s}_m(I, P{+}R)\Big) - \max\!\Big\{ \mathrm{median}_m(\tilde{s}_m(I,P)),\; \mathrm{median}_m(\tilde{s}_m(I,R)) \Big\}. \]

This rewards triples where image + (prompt+answer) align more strongly than either image+prompt or image+answer in isolation—i.e., cases where the answer truly depends on the image.

Final instruction score:

\[ \boxed{ \textbf{ConsensusScore}_{\text{IT}} = \text{Agreement} -\lambda\cdot\text{Disagreement} +\alpha\cdot\text{Confidence} +\gamma\cdot\text{Groundedness} } \]
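
For concreteness, here is a minimal NumPy sketch of the four terms and the final score for one triple, assuming the z‑scored similarities across the \(M\) encoders are already available as arrays; the default \(\lambda, \alpha, \gamma\) values are drawn from the Appendix A grid and are assumptions, not prescribed settings, and missing uncertainties set Confidence to 0 as in Algorithm 1.

import numpy as np

def consensus_score_it(z_ipr, z_ip, z_ir, u_ipr=None,
                       lam=0.5, alpha=0.25, gamma=1.0):
    """Compute ConsensusScore_IT for one triple (I, P, R).

    z_ipr, z_ip, z_ir : arrays of shape (M,) with z-scored similarities
                        s_m(I, P+R), s_m(I, P), s_m(I, R) across M encoders.
    u_ipr             : optional array of uncertainty proxies u_m(I, P+R).
    """
    agreement = np.median(z_ipr)
    # Robust dispersion: median absolute deviation across encoders.
    disagreement = np.median(np.abs(z_ipr - np.median(z_ipr)))
    # Uncertainty bonus; 0 when the panel has no probabilistic encoders.
    confidence = -np.mean(u_ipr) if u_ipr is not None else 0.0
    groundedness = np.median(z_ipr) - max(np.median(z_ip), np.median(z_ir))
    score = agreement - lam * disagreement + alpha * confidence + gamma * groundedness
    return score, dict(agreement=agreement, disagreement=disagreement,
                       confidence=confidence, groundedness=groundedness)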

3.3 Non‑Triviality & Safety Filters

  • Non‑captionness: down‑weight if \(\tilde{s}_m(I,P)\) and \(\tilde{s}_m(I,R)\) are both extremely high yet \(\text{Groundedness}\approx 0\) (caption‑like, low instructional value); a minimal check is sketched after this list.
  • Harmful content: keep standard safety flags (NSFW, hate, personally identifiable info) and track them in reports; apply quotas for balanced language and domain coverage.
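
A minimal sketch of the non‑captionness check referenced in the first bullet, assuming the per‑encoder z‑scores and the Groundedness value have already been computed and aggregating with the panel median as in §3.2; the thresholds and the multiplicative penalty are illustrative assumptions.

import numpy as np

def non_captionness_weight(z_ip, z_ir, groundedness,
                           tau_high=1.5, eps=0.1, penalty=0.5):
    """Down-weight caption-like triples: both image-prompt and image-answer
    similarities are very high, yet the answer adds no image-conditioned
    synergy (Groundedness ~ 0). Thresholds and penalty are illustrative."""
    caption_like = (np.median(z_ip) > tau_high and
                    np.median(z_ir) > tau_high and
                    abs(groundedness) < eps)
    return penalty if caption_like else 1.0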

3.4 Diversity‑Aware Selection

Compute image and text embeddings per encoder; whiten (PCA), concatenate, and select with greedy \(k\)-center (ANN‑accelerated). We stratify by skill tags (OCR density, charts/plots, diagrams, documents, natural scenes, multilingual) so tails are preserved via per‑bucket percentiles.
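
Below is a minimal sketch of the greedy \(k\)-center step on the whitened, concatenated embeddings; it uses exact distances and omits the ANN acceleration and the skill‑aware stratification, and the farthest‑first seeding is an illustrative choice.

import numpy as np

def greedy_k_center(X, k, seed=0):
    """Select k rows of X (n x d, whitened + concatenated embeddings) that
    approximately maximize coverage: each new point is the one farthest from
    the current selection (the standard 2-approximation for k-center).
    Exact distances shown for clarity; the pipeline uses ANN acceleration."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    selected = [int(rng.integers(n))]            # arbitrary first center
    d = np.linalg.norm(X - X[selected[0]], axis=1)
    for _ in range(1, min(k, n)):
        nxt = int(np.argmax(d))                  # farthest point so far
        selected.append(nxt)
        d = np.minimum(d, np.linalg.norm(X - X[nxt], axis=1))
    return selected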

3.5 Algorithm

Algorithm 1: Consensus-of-CLIPs-IT Selection
Input: triples {(I,P,R)}, encoders {f_m}, λ, α, γ, budget k, policy Π
for (I,P,R):
    compute z-scored similarities for (I,P), (I,R), (I,P+R)
    Agreement      ← median_m(z(I, P+R))
    Disagreement   ← MAD_m(z(I, P+R))
    Confidence     ← − mean_m(u(I, P+R))         # 0 if unavailable
    Groundedness   ← median_m(z(I, P+R)) − max{median_m(z(I,P)), median_m(z(I,R))}
    ConsensusScore_IT ← Agreement − λ·Disagreement + α·Confidence + γ·Groundedness
apply policy Π (global/stratified) to pick a provisional top-K by ConsensusScore_IT
run greedy k-center on concatenated embeddings of the top-K to get final k
Output: curated instruction set S_IT
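
The policy Π in Algorithm 1 can be global (a single score threshold) or stratified; below is a minimal sketch of the stratified variant, assuming each triple carries a single skill tag and a precomputed ConsensusScore_IT. The per‑bucket percentile q is an assumption.

import numpy as np
from collections import defaultdict

def stratified_top_k(scores, tags, q=90.0):
    """Keep, per skill bucket, the triples whose ConsensusScore_IT is at or
    above that bucket's q-th percentile; returns indices into `scores`.
    Bucket-wise thresholds preserve long-tail skills (OCR, charts, ...)."""
    buckets = defaultdict(list)
    for i, t in enumerate(tags):
        buckets[t].append(i)
    keep = []
    for t, idx in buckets.items():
        s = np.asarray([scores[i] for i in idx])
        thr = np.percentile(s, q)
        keep.extend(i for i, v in zip(idx, s) if v >= thr)
    return sorted(keep)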

4 Experimental Setup

4.1 Candidate Pools & Hygiene

  • Sources. Public instruction sets (e.g., VQA‑style, OCR‑rich, chart/diagram data, multilingual dialogs) and synthetic multimodal QA/instruction corpora.
  • Deduplication. Perceptual image hashing + multi‑encoder embedding dedupe across all pools; drop near‑duplicates with similar prompts and answers (a minimal sketch follows this list).
  • Tagging. Language ID, OCR density, layout (document/chart/diagram), and safety flags used for stratified selection and reporting.
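
A minimal sketch of the image‑side deduplication pass, assuming PIL‑readable images and precomputed concatenated multi‑encoder embeddings; the Hamming and cosine thresholds are assumptions, the prompt/answer text check is omitted, and the quadratic scan would need blocking or ANN indexing at pool scale.

import numpy as np
import imagehash
from PIL import Image

def dedupe(image_paths, embeddings, ham_thr=4, cos_thr=0.95):
    """Drop near-duplicates: first by perceptual hash (Hamming distance),
    then by cosine similarity of concatenated multi-encoder embeddings.
    Thresholds are illustrative; real pools need blocking or ANN search."""
    keep, hashes, kept_emb = [], [], []
    E = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    for i, path in enumerate(image_paths):
        h = imagehash.phash(Image.open(path))
        dup_hash = any(h - h2 <= ham_thr for h2 in hashes)
        dup_emb = kept_emb and np.max(np.stack(kept_emb) @ E[i]) >= cos_thr
        if not (dup_hash or dup_emb):
            keep.append(i)
            hashes.append(h)
            kept_emb.append(E[i])
    return keep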

4.2 Encoder Panel

The panel comprises OpenCLIP (ViT‑B/16, ViT‑L/14), a MERU‑style hyperbolic encoder, a CyCLIP‑style geometry‑regularized encoder, a ProLIP‑style probabilistic encoder, and SigLIP/SigLIP‑2 (multilingual). We ablate panel composition (§6).
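
As an illustration of how the OpenCLIP and SigLIP members of such a panel might be loaded (the hyperbolic, geometry‑regularized, and probabilistic encoders come from their respective codebases), here is a hedged sketch using the open_clip library; the pretrained tags are examples and should be verified against open_clip.list_pretrained().

import open_clip

# Example panel members loadable via open_clip; pretrained tags are
# illustrative -- verify available tags with open_clip.list_pretrained().
PANEL = [
    ("ViT-B-16", "laion2b_s34b_b88k"),
    ("ViT-L-14", "laion2b_s32b_b82k"),
    ("ViT-B-16-SigLIP", "webli"),
]

def load_panel(device="cuda"):
    panel = []
    for arch, tag in PANEL:
        model, _, preprocess = open_clip.create_model_and_transforms(arch, pretrained=tag)
        tokenizer = open_clip.get_tokenizer(arch)
        panel.append((model.eval().to(device), preprocess, tokenizer))
    return panel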

4.3 Training Protocol (Instruction Tuning)

We instruction‑tune a fixed base VLM (e.g., LLaVA‑style or Qwen‑VL‑style) on curated data of sizes \(\{2\text{M}, 5\text{M}, 10\text{M}\}\) examples with matched budgets (same schedule, optimizer, batch, and steps). Seeds \(\{0,1,2\}\). No extra pretraining—only instruction‑tuning differs by data.

4.4 Evaluation

  • Reasoning & knowledge: MMMU, MMBench (dev/test), MM‑Vet.
  • Text‑rich: TextVQA, DocVQA, ST‑VQA.
  • Charts & diagrams: ChartQA, Chart‑QA‑H, AI2D.
  • General VQA: VQAv2, OK‑VQA/A‑OKVQA.
  • Hallucination: POPE (object presence), CIDEr‑based faithfulness on caption‑like prompts. Report macro‑averages and 95% CIs (bootstrap); paired tests against baselines.
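
The 95% confidence intervals can be obtained with a percentile bootstrap over per‑example scores; a minimal sketch follows, where the number of resamples is an assumption.

import numpy as np

def bootstrap_ci(per_example_scores, n_boot=10_000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean score."""
    rng = np.random.default_rng(seed)
    x = np.asarray(per_example_scores, dtype=float)
    means = np.array([rng.choice(x, size=x.size, replace=True).mean()
                      for _ in range(n_boot)])
    lo, hi = np.percentile(means, [(1 - level) / 2 * 100, (1 + level) / 2 * 100])
    return x.mean(), (lo, hi)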

5 Results

5.1 Main Results (Instruction Tuning Only)

Table 1: Macro‑avg across MMMU / MMBench / MM‑Vet.

Selector                     | Base  | Score
Random mix                   | ViT‑L | XX.X
Single‑encoder CLIP‑score    | ViT‑L | XX.X
SigLIP‑only filter           | ViT‑L | XX.X
Consensus‑of‑CLIPs‑IT (ours) | ViT‑L | XX.X (+Δ)

Table 2: Text‑Rich & Charts.

Selector                  | TextVQA | DocVQA | ChartQA
Single‑encoder CLIP‑score | XX.X    | XX.X   | XX.X
Consensus‑of‑CLIPs‑IT     | XX.X    | XX.X   | XX.X

Table 3: Hallucination (POPE ↓).

Selector                  | Overall | Common | Adversarial
Single‑encoder CLIP‑score | XX.X    | XX.X   | XX.X
Consensus‑of‑CLIPs‑IT     | XX.X    | XX.X   | XX.X

Observations. Gains are largest on TextVQA/DocVQA and ChartQA (image‑dependent text), and on POPE (reduced hallucination), consistent with the role of Groundedness and Confidence.


6 Ablations & Analysis

  1. Encoder diversity. Removing ProLIP‑style variants increases hallucination; removing SigLIP variants hurts multilingual performance; removing the MERU‑style encoder harms hierarchical reasoning (MM‑Vet).
  2. Groundedness term. Zeroing \(\gamma\) reduces TextVQA/ChartQA by ~Δ and increases POPE by ~Δ, confirming that the synergy criterion matters.
  3. Fusion rules. Median+MAD fusion outperforms mean‑based fusion; light confidence weighting helps when uncertainty is available.
  4. Data scale. Benefits compound from 2M→10M examples, with diminishing returns; diversity control prevents oversampling easy captions.
  5. Stratification. Per‑bucket percentiles preserve long‑tail skills and sustain gains at larger budgets.
  6. Online vs. offline selection. Online filtering recovers most of the offline gains while reducing I/O, which helps rapid iteration.

Error modes. (i) Creative prompts that are answered correctly but only weakly image‑dependent can be unduly down‑weighted; (ii) domain jargon unseen by the entire panel can be under‑selected. Language/domain quotas mitigate (ii).


7 Limitations & Societal Impact

Limitations. Multi‑encoder scoring increases selection cost; rare domains absent from all encoders remain under‑curated; and groundedness uses a contrastive proxy (no explicit detector), which, while robust, may miss fine‑grained spatial grounding.

Societal impact. Better instruction curation reduces toxic content and hallucinations but encodes value choices; we publish the language/domain mix, safety statistics, and selection manifests, and support takedown on request.


8 Conclusion

Consensus‑of‑CLIPs‑IT selects instruction‑tuning data—not pretraining pairs—by fusing cross‑model agreement, robust dispersion, uncertainty, and an explicit groundedness signal, together with diversity control. The curated triples improve multimodal instruction following and text‑rich reasoning and reduce hallucinations under matched compute. We provide an open toolkit for reproducible instruction‑data selection.


Appendix A Practical Notes

  • Skill tagging. OCR density via a lightweight text detector; charts/diagrams via simple image classifiers; language via fast LID.
  • Two‑stage scoring. Screen all candidates with a small‑backbone panel; re‑rank the top‑2k with large backbones + uncertainty.
  • Hyperparameters. \(\lambda\in\{0.25,0.5,1.0\}\), \(\alpha\in\{0,0.25,0.5\}\), \(\gamma\in\{0.5,1.0,1.5\}\).
  • Release artifacts. JSONL manifests with per‑triple scores (Agreement, Disagreement, Confidence, Groundedness), the final ConsensusScore\(_\text{IT}\), and skill tags.
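
An illustrative manifest record, assuming one JSON object per line; all field names and values are hypothetical and only mirror the quantities defined in §3, not a released schema.

import json

# Hypothetical record; numeric values assume lambda=0.5, alpha=0.25,
# gamma=1.0 and are for illustration only, not released statistics.
record = {
    "image": "images/000123.jpg",
    "prompt": "What does the y-axis of the chart measure?",
    "answer": "Quarterly revenue in millions of USD.",
    "agreement": 0.83,
    "disagreement": 0.12,
    "confidence": -0.04,
    "groundedness": 0.41,
    "consensus_score_it": 1.17,
    "skill_tags": ["chart", "ocr_dense"],
    "language": "en",
}
with open("manifest.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")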


Drop‑in (LaTeX) for the Instruction‑Aware Equations

\newcommand{\Agreement}{\text{Agreement}}
\newcommand{\Disagreement}{\text{Disagreement}}
\newcommand{\Confidence}{\text{Confidence}}
\newcommand{\Groundedness}{\text{Groundedness}}

\[
\Agreement = \mathrm{median}_m\!\left(\tilde{s}_m(I, P{+}R)\right),\quad
\Disagreement = \mathrm{MAD}_m\!\left(\tilde{s}_m(I, P{+}R)\right),
\]
\[
\Confidence = -\,\mathrm{mean}_m\!\left(u_m(I, P{+}R)\right),
\]
\[
\Groundedness = \mathrm{median}_m\!\left(\tilde{s}_m(I, P{+}R)\right)
 - \max\!\left\{\mathrm{median}_m(\tilde{s}_m(I,P)),\,\mathrm{median}_m(\tilde{s}_m(I,R))\right\},
\]
\[
\textbf{ConsensusScore}_{\text{IT}} =
\Agreement - \lambda\,\Disagreement + \alpha\,\Confidence + \gamma\,\Groundedness.
\]