
IF-GEO: How Conflict-Aware Instruction Fusion Solves Multi-Query GEO Optimization
Deep dive into IF-GEO (arXiv:2601.13938), a breakthrough framework from USTC that addresses the critical challenge of optimizing content for multiple conflicting queries simultaneously. Introduces risk-aware stability metrics and a diverge-then-converge approach to GEO.
Last Updated
February 20, 2025 • 12 min read
Key Takeaways
TL;DR: IF-GEO (Zhou et al., 2025) from the University of Science and Technology of China introduces a "diverge-then-converge" framework that solves a fundamental problem in GEO: when you optimize content for one query, it often hurts visibility for other queries. Key results:
- +14.17% mean visibility improvement on the primary objective metric — outperforming all baselines including Auto-GEO (+12.99%)
- 84.07% Win-Tie Rate — content improved or held steady for 84% of queries tested
- Lowest Downside Risk (0.0054) — meaning fewer queries suffer negative impact from optimization
- Introduces 3 new risk-aware stability metrics (WCP, WTR, DR) that go beyond simple mean improvement
This paper directly addresses a limitation in current GEO tools — including our own — and points toward the next generation of content optimization.
The Problem: Why Single-Query GEO Optimization Breaks Down
The Hidden Cost of Query-Specific Optimization
Most GEO optimization approaches — including the foundational strategies from Aggarwal et al. (2024) — optimize content for a single target query. You pick a query, apply strategies (add citations, statistics, fluency improvements), and measure whether your content gets cited more for that specific query.
But here's the problem the IF-GEO paper exposes: a single document needs to serve many different queries simultaneously. When you optimize for "What is the best CRM software?", the edits you make might actively hurt your visibility for "How much does CRM cost?" or "CRM implementation timeline."
This is not a theoretical concern. The paper's empirical analysis in Appendix A demonstrates that existing GEO baselines exhibit significant performance variance across query sets: optimizing for one query frequently causes visibility losses for others.
Why This Happens: Conflicting Revision Requirements
Consider a product page that needs to rank for multiple queries:
| Query | Optimal Revision | Conflict |
|---|---|---|
| "Best CRM for small business" | Emphasize affordability, simplicity, quick setup | Wants short, simple content |
| "Enterprise CRM comparison" | Emphasize scalability, integrations, security | Wants detailed, technical content |
| "CRM implementation cost" | Emphasize pricing data, ROI statistics, timelines | Wants quantitative, fact-dense content |
Each query has different optimization preferences, but they all target the same document with a limited content budget. You can't make the page simultaneously simple and technically detailed, short and comprehensive.
The paper frames this as a constrained multi-objective optimization problem (Marler and Arora, 2004), where heterogeneous queries impose competing requirements under a fixed content budget.
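In our own notation (not necessarily the paper's), that constrained multi-objective problem can be sketched as:

```latex
\max_{d'} \; \frac{1}{n}\sum_{i=1}^{n} V(d', q_i)
\quad \text{subject to} \quad \lvert d' \rvert \le B
```

where $d'$ is the revised document, $V(d', q_i)$ is its visibility for query $q_i$, and $B$ is the content budget. Because the queries $q_1,\dots,q_n$ pull the revision in different directions, increasing one term of the sum can shrink another, which is exactly the conflict IF-GEO is built to manage.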
IF-GEO: The "Diverge-Then-Converge" Framework
Architecture Overview
IF-GEO solves this with a two-phase approach (Zhou et al., 2025):
Phase 1 — Diverge: Mine Distinct Optimization Preferences
- Latent Query Mining: Predict representative queries that the document should serve (not just one target query)
- Edit Request Generation: For each representative query, formulate specific, structured edit requests — what changes would maximize visibility for this particular query
Phase 2 — Converge: Conflict-Aware Instruction Fusion
- Conflict Detection: Identify where edit requests from different queries contradict each other
- Priority Arbitration: Resolve conflicts through a global coordination mechanism
- Blueprint Synthesis: Generate a unified "Global Revision Blueprint" that balances all query needs
- Guided Editing: Apply the blueprint to produce a single, coherent revision
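The control flow of the two phases can be sketched as follows. This is our reconstruction of the pipeline shape from the paper's description; every function body is an illustrative stand-in, since in IF-GEO each step is an LLM call rather than the rule-based logic shown here.

```python
# Diverge-then-converge control flow (illustrative sketch, not the authors' code).

def mine_latent_queries(document: str) -> list[str]:
    # Diverge, step 1: predict representative queries the document should serve.
    topic = document.split()[0]
    return [f"what is {topic}", f"{topic} cost"]

def generate_edit_requests(document: str, queries: list[str]) -> list[dict]:
    # Diverge, step 2: one structured edit request per representative query.
    return [{"query": q, "directive": f"add content answering {q!r}"} for q in queries]

def fuse_to_blueprint(requests: list[dict]) -> list[str]:
    # Converge: deduplicate and merge requests into a global revision blueprint.
    return sorted({r["directive"] for r in requests})

def apply_blueprint(document: str, blueprint: list[str]) -> str:
    # Converge, final step: blueprint-guided editing of the document.
    return document + "\n" + "\n".join(f"[edit: {d}]" for d in blueprint)

doc = "CRM software for growing teams"
revised = apply_blueprint(doc, fuse_to_blueprint(
    generate_edit_requests(doc, mine_latent_queries(doc))))
```

The important structural point is that every edit request passes through a single fusion step before anything touches the document, so no query gets to rewrite the page unilaterally.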
How Instruction Fusion Works
The key innovation is the conflict-aware instruction fusion step. Rather than simply averaging or concatenating edit requests (which would produce incoherent content), IF-GEO explicitly:
- Deduplicates overlapping directives (e.g., multiple queries wanting more statistics)
- Prioritizes edits with the broadest cross-query benefit
- Arbitrates conflicts by finding compromise formulations that serve multiple intents
- Constrains the total edit scope to preserve document coherence
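A toy, rule-based version of these four operations looks like the following. This is our illustration only; in IF-GEO the fusion step is performed by LLM prompting, not hand-written rules, and real arbitration synthesizes compromise wording rather than simply dropping directives.

```python
from collections import defaultdict

def fuse(edit_requests, max_edits=5):
    """edit_requests: (query, directive) pairs; returns a capped blueprint."""
    # 1. Deduplicate: group identical directives, tracking supporting queries.
    support = defaultdict(set)
    for query, directive in edit_requests:
        support[directive].add(query)
    # 2. Prioritize: directives with the broadest cross-query support first.
    ranked = sorted(support, key=lambda d: len(support[d]), reverse=True)
    # 3. Arbitrate: a real system would detect contradictory directive pairs
    #    here and synthesize compromise wording; omitted in this toy version.
    # 4. Constrain: cap total edit scope to preserve document coherence.
    return ranked[:max_edits]
```

For example, `fuse([("q1", "add statistics"), ("q2", "add statistics"), ("q3", "simplify wording")])` collapses the duplicate request and ranks "add statistics" first because two queries support it.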
The paper provides a detailed walkthrough in Appendix C, showing how a medical document about "coagulopathy" gets optimized. Different queries require different terminological emphasis, but the fusion step produces a single revision that clarifies key terms while maintaining medical accuracy — serving all query intents without degrading any.
Token Cost Breakdown
One practical question: how expensive is this compared to simpler approaches?
| Stage | Avg. Tokens per Document |
|---|---|
| Query Mining | 1,271 |
| Edit Request Generation | 1,750 |
| Instruction Fusion | 4,488 |
| Blueprint-Guided Revision | 2,820 |
| IF-GEO Total | 10,328 |
Compared to single-pass baselines:
| Single-Pass Method | Avg. Tokens |
|---|---|
| Cite Sources | 2,535 |
| Statistics Addition | 2,802 |
| Authority Expression | 2,484 |
IF-GEO costs roughly 4× more tokens than a single-pass baseline, but the cross-query stability gains are substantial. The instruction fusion stage accounts for 43% of the total cost — this is where the "intelligence" happens.
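As a quick sanity check on the ratios quoted above (the figures come from the tables; the variable names are ours):

```python
# Reported average token costs per document.
total_if_geo = 10_328   # IF-GEO total
fusion = 4_488          # instruction fusion stage
cite_sources = 2_535    # cheapest single-pass baseline shown above

fusion_share = fusion / total_if_geo      # ~0.43, i.e. ~43% of total cost
cost_multiple = total_if_geo / cite_sources  # ~4.07, i.e. roughly 4x a single pass
```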
New Evaluation Metrics: Beyond Mean Improvement
Why Mean Improvement is Misleading
One of the paper's most valuable contributions is introducing risk-aware stability metrics for GEO evaluation. The authors argue — correctly — that mean visibility improvement alone is insufficient because it:
- Masks tail degradation: A method that improves 80% of queries by +5% but degrades 20% by -15% looks good on average but is dangerous in practice
- Conflates upside and downside volatility: High variance could mean "sometimes amazing, sometimes terrible" — which is very different from "consistently good"
Three New Metrics
IF-GEO introduces three risk-aware metrics that should become standard in GEO evaluation:
| Metric | Definition | What It Measures |
|---|---|---|
| WCP (Worst-Case Performance) | Performance of the worst-performing query | Safety floor — "How bad can it get?" |
| WTR (Win-Tie Rate) | % of queries that improved or stayed the same | Reliability — "How often does optimization help or at least not hurt?" |
| DR (Downside Risk) | Expected loss magnitude for degraded queries | Risk — "When it hurts, how much does it hurt?" |
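Under one plausible reading of these definitions (the paper's exact formulas may differ, e.g. in whether DR averages over all queries or only the degraded ones), all three metrics can be computed directly from per-query visibility deltas:

```python
def wcp(deltas):
    """Worst-Case Performance: the most negative per-query change."""
    return min(deltas)

def wtr(deltas):
    """Win-Tie Rate: fraction of queries that improved or stayed flat."""
    return sum(d >= 0 for d in deltas) / len(deltas)

def dr(deltas):
    """Downside Risk: expected loss magnitude, averaged over all queries
    (losses counted as positive, gains contribute zero)."""
    return sum(max(-d, 0.0) for d in deltas) / len(deltas)

# Two methods with identical mean improvement but very different risk profiles:
safe  = [0.02, 0.01, 0.02, 0.01, 0.04]    # mean +0.02, no query hurt
risky = [0.10, 0.08, 0.05, -0.06, -0.07]  # mean +0.02, two queries hurt badly
```

The `safe`/`risky` pair makes the earlier point concrete: both average +0.02, but only the risk-aware metrics reveal that `risky` degrades 40% of queries.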
How IF-GEO Performs on These Metrics
Results on the primary objective metric (Obj. Overall):
| Method | Mean ↑ | VAR ↓ | WCP ↑ | WTR ↑ | DR ↓ |
|---|---|---|---|---|---|
| Cite Sources | +2.54 | 0.0246 | -0.1156 | 74.21% | 0.0089 |
| Quotation Addition | +3.10 | 0.0321 | -0.1284 | 72.58% | 0.0099 |
| RAID | +2.80 | 0.0240 | -0.1248 | 71.43% | 0.0099 |
| Auto-GEO | +12.99 | 0.0416 | -0.0578 | 78.91% | 0.0083 |
| IF-GEO | +14.17 | 0.0386 | -0.0435 | 84.07% | 0.0054 |
Key observations:
- IF-GEO achieves the highest mean improvement (+14.17%) while also having the best stability profile
- 84.07% WTR means only ~16% of queries see any degradation — and when they do, the damage is minimal (DR = 0.0054)
- The classic GEO strategies from Aggarwal et al. (Cite Sources, Quotation Addition) improve visibility by only +2–3% on average, and their WTR of ~72–74% means roughly 1 in 4 queries actually gets worse
- Auto-GEO is strong on mean (+12.99%) but has higher variance (0.0416) and worse tail risk than IF-GEO
Comparison with GEO Baselines from the Original Paper
The paper benchmarks IF-GEO against the original 9 strategies from Aggarwal et al. (2024):
| Strategy | Mean Improvement | WTR | Verdict |
|---|---|---|---|
| Traditional SEO | +1.93 | 70.28% | Marginal gains, moderate reliability |
| Unique Words | -5.99 | 56.12% | Net negative — hurts more queries than it helps |
| Simple Expression | -0.88 | 65.24% | Slightly negative, unreliable |
| Authoritative Expression | -0.02 | 65.08% | Near-zero impact |
| Fluency Expression | -1.93 | 62.56% | Negative — oversimplification hurts |
| Terminology Addition | +1.31 | 69.33% | Small positive, moderate reliability |
| Cite Sources | +2.54 | 74.21% | Best single-strategy baseline |
| Quotation Addition | +3.10 | 72.58% | Good but inconsistent |
| Statistics Addition | +0.25 | 71.57% | Minimal impact in multi-query setting |
Critical insight: "Cite Sources" remains the most reliable single strategy (highest WTR among baselines at 74.21%), confirming the original GEO paper's finding. However, even the best single strategy only achieves a WTR of ~74% — meaning 1 in 4 queries degrades. IF-GEO's 84% WTR represents a 10 percentage point improvement in reliability.
Cross-Model Generalization
The paper also tests on Gemini-2.0-Flash (Table 7 in the paper), showing that IF-GEO's gains are not tied to a specific generative engine. The rankings and stability metrics hold across models, supporting the claim that conflict-aware optimization transfers across different AI platforms — a critical property for real-world deployment where content must perform across ChatGPT, Perplexity, Google AI Overviews, and Claude simultaneously.
Rank-Stratified Performance: Does IF-GEO Only Help Already-Good Content?
A common concern with optimization methods is that they only help content that's already well-ranked. The paper addresses this directly in Appendix D with rank-stratified analysis:
| Initial Rank | Mean Improvement (Obj.) | WTR |
|---|---|---|
| Rank 1 (already top) | +13.49 | 77.92% |
| Rank 2 | +8.56 | 77.36% |
| Rank 3 | +8.71 | 82.76% |
| Rank 4 | +12.24 | 87.14% |
| Rank 5 (lowest) | +12.14 | 81.43% |
Key finding: Lower-ranked content (Rank 4–5) achieves sizable improvements comparable to top-ranked content, with even better WTR (87.14% for Rank 4). This means IF-GEO doesn't just help the already-strong get stronger — it lifts underperforming content effectively and safely.
What This Means for GEO Practitioners
1. Stop Optimizing for Single Queries
The paper's most important practical insight: single-query optimization is a local maximum. If you optimize your product page for one target query, you may be sabotaging its performance for other valuable queries. Always consider the full query set your content should serve.
2. Measure Stability, Not Just Average Improvement
The WCP/WTR/DR metrics should become part of every GEO practitioner's toolkit:
- WTR > 80% should be the target — your optimization should help or at least not hurt the vast majority of queries
- Monitor Downside Risk — a low DR means even when optimization doesn't help, it doesn't cause significant damage
- Track Worst-Case Performance — your content is only as strong as its weakest query response
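These three checks can be combined into a simple stability gate for an optimization pipeline. The WTR > 80% target comes from the guidance above; the DR and WCP thresholds here are our illustrative choices, and the function itself is a sketch, not a tested recommendation.

```python
def passes_stability_gate(deltas, min_wtr=0.80, max_dr=0.01, min_wcp=-0.10):
    """deltas: per-query visibility changes after optimization.

    Returns True only if the revision is reliable (WTR), low-damage (DR),
    and has an acceptable safety floor (WCP). Thresholds for DR and WCP
    are illustrative defaults, not values from the paper.
    """
    wtr = sum(d >= 0 for d in deltas) / len(deltas)
    dr = sum(max(-d, 0.0) for d in deltas) / len(deltas)
    wcp = min(deltas)
    return wtr >= min_wtr and dr <= max_dr and wcp >= min_wcp
```

A revision that lifts most queries slightly while barely grazing one passes the gate; a revision that craters even a minority of queries does not.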
3. Content Budget is Real
You can't make a page simultaneously serve 50 different query intents at maximum visibility. IF-GEO's "content budget" concept is crucial: there's a finite amount of information and emphasis a document can carry. Strategic prioritization of which query intents to serve — and how to resolve conflicts between them — matters more than applying every optimization strategy at once.
4. "Cite Sources" Remains the Best Single-Strategy Default
Even in the multi-query setting, "Cite Sources" achieves the highest single-strategy WTR (74.21%) and a positive mean improvement. If you can only apply one optimization strategy, adding credible citations remains the safest bet.
Connection to Our Platform
The IF-GEO paper directly addresses a limitation in our current Content Optimizer, which uses a Thompson Sampling multi-armed bandit approach to iteratively optimize content chunks for a single target query.
Here's how our approach compares and where IF-GEO points the way forward:
| Aspect | Our Current Approach | IF-GEO's Approach |
|---|---|---|
| Query scope | Single target query | Multiple latent queries simultaneously |
| Strategy selection | Thompson Sampling (online learning) | Conflict-aware instruction fusion (one-shot) |
| Feedback signal | Actual LLM citation benchmarks (PAWC, rank) | Visibility metrics with risk-aware evaluation |
| Conflict handling | Not addressed — single-query only | Explicit deduplication, prioritization, arbitration |
| Stability metrics | Rank improvement, PAWC improvement | WCP, WTR, DR (risk-aware) |
IF-GEO's contributions suggest several improvements we're exploring:
- Multi-query evaluation: Running competitor benchmarks against a set of related queries, not just one
- Risk-aware metrics: Adding WTR and DR to our optimization dashboard so users can see stability, not just average improvement
- Conflict detection: Flagging when optimizing for one query might degrade performance for related queries
Limitations and Open Questions
What the Paper Acknowledges
- Token cost: IF-GEO costs ~4× more than single-pass baselines. For practitioners optimizing hundreds of pages, this adds up.
- Latent query quality: The framework's effectiveness depends on accurately predicting which queries the document should serve. Poor query mining leads to poor optimization.
- Single generative engine per evaluation: While cross-model results on Gemini are promising, the main experiments use one primary engine.
Open Questions for Future Work
- How does IF-GEO interact with content freshness? The paper focuses on one-time revision, but real-world content gets updated regularly. Do the conflict resolutions remain stable across content updates?
- Can the instruction fusion step be learned rather than prompted? The current approach uses LLM prompting for conflict resolution — a trained model might be more consistent and cheaper.
- What about competitive dynamics? IF-GEO optimizes in isolation. When competitors also optimize using similar frameworks, does the multi-query stability advantage persist?
Paper Details
| Field | Details |
|---|---|
| Title | IF-GEO: Conflict-Aware Instruction Fusion for Multi-Query Generative Engine Optimization |
| Authors | Heyang Zhou, JiaJia Chen, Xiaolu Chen, Jie Bao, Zhen Chen, Yong Liao |
| Institution | University of Science and Technology of China (USTC); Institute of Dataspace, Hefei Comprehensive National Science Center |
| Published | January 2025 |
| arXiv | 2601.13938 |
| Key contribution | "Diverge-then-converge" framework with conflict-aware instruction fusion for multi-query GEO, plus risk-aware stability metrics (WCP, WTR, DR) |
Frequently Asked Questions
What is IF-GEO and how does it improve on existing GEO methods?
IF-GEO (Instruction Fusion for Generative Engine Optimization) is a framework from USTC (Zhou et al., 2025, arXiv:2601.13938) that solves the multi-query conflict problem in GEO optimization. Unlike traditional GEO methods that optimize content for a single query — often degrading performance for other queries — IF-GEO uses a "diverge-then-converge" approach: first mining optimization preferences for multiple representative queries, then fusing them into a unified revision blueprint through conflict-aware instruction fusion. Results show +14.17% mean visibility improvement with 84.07% Win-Tie Rate, meaning content improves or stays stable for 84% of queries tested. This significantly outperforms the best single-strategy approach ("Cite Sources" at +2.54% mean improvement, 74.21% WTR).
What are risk-aware GEO stability metrics (WCP, WTR, DR)?
Risk-aware stability metrics introduced by the IF-GEO paper (Zhou et al., 2025) measure the safety and reliability of GEO optimization beyond simple mean improvement. Worst-Case Performance (WCP) measures how badly the worst-performing query degrades — your content's safety floor. Win-Tie Rate (WTR) measures the percentage of queries that improved or stayed the same after optimization — higher is more reliable. Downside Risk (DR) measures the expected loss magnitude when optimization causes degradation — lower means less damage when things go wrong. These metrics matter because mean improvement alone can mask tail degradation: a method that improves 80% of queries but severely hurts 20% may look good on average but damages real-world performance.
How does multi-query GEO optimization differ from single-query optimization?
Single-query GEO optimizes content for one target search query (e.g., "best CRM software"), using strategies like adding citations (+30–40% improvement per Aggarwal et al., 2024) or statistics (+20–25%). Multi-query GEO, as formalized by IF-GEO (Zhou et al., 2025), recognizes that a single document must serve many queries simultaneously. The challenge: different queries impose conflicting revision requirements under a limited content budget. Optimizing for "best CRM for small business" (wants simplicity) may hurt "enterprise CRM comparison" (wants technical depth). IF-GEO addresses this through conflict-aware instruction fusion that balances competing preferences, achieving improvements across the full query set rather than one query at the expense of others.
Which GEO optimization tool supports multi-query conflict-aware optimization?
Most current GEO tools — including AI Visibility's Content Optimizer — optimize content for a single target query using strategies like citation injection, statistics addition, and structure optimization. IF-GEO's multi-query approach (Zhou et al., 2025) represents the next frontier. Our platform currently uses Thompson Sampling with 6 optimization strategies (query echo, answer-first, authority injection, semantic densification, structure optimization, conciseness boost) and evaluates results through actual LLM citation benchmarks with PAWC scoring. We are actively researching how to integrate IF-GEO's conflict-aware instruction fusion into our optimization pipeline to support multi-query evaluation and risk-aware stability metrics.
Sources
- Zhou, H., Chen, J., Chen, X., Bao, J., Chen, Z., & Liao, Y. (2025). "IF-GEO: Conflict-Aware Instruction Fusion for Multi-Query Generative Engine Optimization." arXiv:2601.13938. University of Science and Technology of China.
- Aggarwal, P., et al. (2024). "GEO: Generative Engine Optimization." arXiv:2311.09735. Princeton University, Georgia Tech, IIT Delhi.
- Lewis, P., et al. (2020). "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." arXiv:2005.11401. Meta AI.
- Karpukhin, V., et al. (2020). "Dense Passage Retrieval for Open-Domain Question Answering." arXiv:2004.04906. Facebook AI.
- Marler, R. T., & Arora, J. S. (2004). "Survey of multi-objective optimization methods for engineering." Structural and Multidisciplinary Optimization, 26(6), 369–395.
Ready to Optimize Your Content?
While multi-query conflict resolution is still emerging research, you can start optimizing today with proven GEO strategies. Use the AI Visibility GEO Optimizer to:
- Score your content across 6 GEO dimensions (Visibility, Authority, Retrievability, Verifiability, Freshness, Answerability)
- Benchmark against competitors to see who AI engines cite for your target queries
- Optimize content iteratively using RL-based chunk-level improvement
- Track your progress with PAWC and citation metrics