A Journey Through Prompt Engineering and Model Wrangling
Let me ask you something: how many AI-generated articles have you read this week that made you feel dumber after finishing them?
I'm guessing at least five. Maybe ten.
We're drowning in synthetic content. Every SEO with a ChatGPT subscription thinks they're Hemingway with a turbo button. The result? The internet now resembles what happens when you ask a drunk algorithm to explain a Wikipedia page to another drunk algorithm.
But here's the thing - and trust me on this - AI content generation can work. It can produce material that ranks in traditional search, gets cited by generative AI systems, and doesn't make readers want to claw their eyes out.
I know because I just spent 5-6 hours refining a system that does exactly that.
The challenge was straightforward: create a 3,000-word guide on "Mythical Places in Europe" for Untravel.com, a fictional alternative travel website I imagined for this experiment.
The target audience? Young couples (25-40) seeking authentic, folklore-rich destinations beyond the Instagram-saturated tourist traps.
Simple enough, right?
Except here's what made this different: instead of prompting one model and calling it done, I ran the same structured prompt through Claude, ChatGPT, Gemini, and Copilot.
Then I analyzed their outputs like a philologist dissecting ancient manuscripts, consolidated their best elements into a master brief, and executed content production across all four again.
The results? Surprising, instructive, and occasionally hilarious.
What emerged wasn't just a guide about Romanian ghost forests and Scottish fairy glens. It was a replicable framework for AI content that actually works: one that now takes 60 minutes to produce 3,000 words, 15 minutes for a product listing page, and 10 minutes for a product description.
That's not a productivity hack. That's a paradigm shift.
Let me show you how I got there.
Why Your Single-Prompt Approach is Leaving Money on the Table
The Call to Adventure: Understanding the Limitations
The Memory Degradation Problem (Why Long Prompts Break)
Think of LLMs as brilliant scholars with severe short-term memory issues.
They can hold about 200,000 tokens in their context window (roughly 150,000 words), but their working memory, the part that keeps track of what you asked them to do, degrades dramatically after the first 1,500 tokens.
This isn't speculation. I proved it.
When I asked ChatGPT to generate the full 3,000-word guide in one shot, it delivered 1,400 words.
It did so not because it couldn't count, but because by word 800, it had forgotten half the instructions about entity optimization, persona alignment, and internal link placement.
The prompt was a monster of nearly 3,000 words of instructions covering volumetric planning, constraint definitions, entity search frameworks, and SEO metadata requirements.
ChatGPT read it, understood it initially, then promptly forgot chunks of it while generating content.
Claude and Gemini did better but embedded their 3,000-word articles inside 11,000 words of production scaffolding: stop points, word count reminders, approval gates.
Technically compliant, practically unusable.
Why does this matter for SEO? Because incomplete content means incomplete entity coverage. And incomplete entity coverage means your content won't surface in AI Overviews, won't get cited by Perplexity, and won't rank for the semantic clusters that actually drive traffic.
And it matters for users because, ultimately, this is how AI slop gets generated.
The Man-in-the-Loop Imperative
Here's the uncomfortable truth we SEOs need to accept: AI isn't a replacement. It's a collaborator that needs adult supervision.
The solution isn't longer prompts. It's structured prompts with approval gates.
I broke content production into three distinct tasks:
Analysis (competitive research, entity extraction, gap identification)
Outline (structure before substance)
Brief generation (the master blueprint)
Each task ended with a stop point. Each required human approval before proceeding.
Paradox: more checkpoints produced better final output. Why? Because I caught drift early.
When Gemini's outline prioritized Nordic destinations over Romanian Carpathians (contradicting my differentiation strategy), I corrected the course before 3,000 words of wrong content got generated.
The man-in-the-loop isn't inefficient. It's quality control that prevents expensive rework.
Architecting the Prompt: A Philological Approach to AI Instruction
Meeting the Mentor: Building the Framework
Multi-Task Chain of Thought (The Three-Act Structure)
The prompt architecture followed classic narrative structure:
Act I - Analysis: Entity extraction from competitors and synthetic AI responses. Cosine similarity mapping. Gap identification in SERP coverage.
Act II - Structure: Outline proposal with H2/H3 hierarchy and narrative focus per section.
Act III - Execution: Complete brief generation with volumetric planning, constraints, entity strategy, and quality checklists.
Why this sequence? Because it mirrors how humans actually plan content. You don't start writing before you know what you're writing about. AI shouldn't either.
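If you want to see the skeleton in practice, here is a minimal sketch of how the three-act structure can be wired into a single reusable prompt with a stop point after each task. The task descriptions and stop-point wording below are illustrative, not the exact prompt I used for the experiment.

```python
# Minimal sketch of the three-act prompt scaffold with approval gates.
# Task descriptions and stop-point wording are illustrative, not the exact prompt used.

TASKS = [
    ("Analysis", "Extract entities from the reference documents, map semantic "
                 "overlap with competitors, and list SERP coverage gaps."),
    ("Outline", "Propose an H2/H3 hierarchy with a one-line narrative focus per section."),
    ("Brief generation", "Produce the master brief: volumetric plan, constraints, "
                         "entity strategy, and quality checklist."),
]

STOP_POINT = ("STOP. Do not continue to the next task until I explicitly reply "
              "'approved'. Summarize what you produced and wait.")


def build_prompt(agent_role: str, persona: str) -> str:
    """Assemble the multi-task chain-of-thought prompt with a stop point after each act."""
    parts = [f"You are {agent_role}.", f"Target persona: {persona}.", ""]
    for i, (name, instructions) in enumerate(TASKS, start=1):
        parts.append(f"TASK {i} - {name}: {instructions}")
        parts.append(STOP_POINT)
        parts.append("")
    return "\n".join(parts)


if __name__ == "__main__":
    print(build_prompt(
        agent_role="an SEO content strategist specializing in traditional search "
                   "engines, generative AI, and digital PR campaigns",
        persona="childless couples aged 25-40 seeking authentic folklore destinations",
    ))
```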
The Reference Document Strategy
I fed the models seven document types:
Google SERPs (organic results).
AI Overviews (Google's synthetic answers).
Google Web Guides.
AI Mode responses (Google's conversational results).
ChatGPT, Claude, Gemini, Copilot, and Perplexity outputs (synthetic responses to the same query).
People Also Ask (three levels deep).
Query fan-outs (related searches).
Why synthetic AI responses? Because they represent what models recognize as good answers.
When you're optimizing for AI citation, studying what AI already cites is forensic intelligence, not guesswork.
Entity extraction from these sources revealed semantic patterns: "Hallstatt" appeared in 80% of responses, "Hoia-Baciu Forest" in 40%, "Lusatia" in just 5%.
That's the differentiation opportunity right there.
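The counting behind those percentages is trivial to reproduce. Here is a quick sketch; the per-source entity lists are hypothetical placeholders, not my real dataset.

```python
# Sketch of the entity-frequency check behind those coverage percentages.
# The per-source entity sets below are hypothetical placeholders, not the real dataset.
from collections import Counter

# One entry per reference document (SERP result, AI Overview, synthetic AI response, ...)
source_entities = [
    {"Hallstatt", "Hoia-Baciu Forest", "Brocken"},
    {"Hallstatt", "Giant's Causeway"},
    {"Hallstatt", "Hoia-Baciu Forest", "Lusatia"},
    {"Hallstatt", "Brocken"},
    # ... one set of extracted entities per source
]

counts = Counter(entity for doc in source_entities for entity in doc)
total = len(source_entities)

for entity, n in counts.most_common():
    coverage = n / total * 100
    print(f"{entity}: {coverage:.0f}% of sources")

# Entities near 100% coverage are table stakes; entities near 5% are differentiation opportunities.
```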
Personas as Non-Negotiable Input
Generic prompt: "Write about magical places in Europe."
Result: Wikipedia clone in AI voice.
Persona-defined prompt: "Write for childless couples 25-40 seeking authentic folklore destinations, experiential tone, contemporary language, avoiding tourist traps."
Result: "The Carpathians preserve Europe's densest concentration of living folklore, where farmers still leave offerings at forest shrines."
The difference? Specificity of audience shapes vocabulary, sentence structure, and cultural references. Personas aren't marketing fluff; they're constraint systems that prevent generic outputs.
Agent Definition: The Role You Assign Shapes the Output
"You are a content writer" produces different results than "You are an SEO content strategist specializing in traditional search engines, generative AI, and digital PR campaigns."
The second triggers model behavior aligned with:
Entity optimization (for Knowledge Graphs)
Quotable phrase construction (for AI extraction)
Citation-worthy framing (for journalist pickup)
Role precision matters. Generic roles = generic outputs.
Volumetric Planning (The Weight System)
"Write 3,000 words" isn't enough instruction.
I specified:
Introduction: 200 words
Romanian Carpathians section: 450 words
Celtic Atlantic section: 400 words
Practical guidance section: 400 words
Why? Because word budgets prevent rushed conclusions, bloated introductions, and unbalanced depth.
When ChatGPT ignored the weight table, it produced 1,400 words with an anemic 80-word conclusion.
The 10% tolerance rule (2,700-3,300 words) accommodates natural writing flow while maintaining discipline.
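If you want to enforce that tolerance mechanically rather than by eyeballing, the weight table translates into a simple check. The budgets below are illustrative section allocations, not the full plan.

```python
# Sketch of the volumetric check: illustrative section budgets, not the full weight table.
WORD_BUDGET = {
    "Introduction": 200,
    "Romanian Carpathians": 450,
    "Celtic Atlantic": 400,
    "Practical guidance": 400,
    # ... remaining sections up to the 3,000-word total
}

TOLERANCE = 0.10  # the 10% rule


def check_section(name: str, actual_words: int) -> bool:
    """Flag sections that drift more than ±10% from their budget."""
    budget = WORD_BUDGET[name]
    low, high = budget * (1 - TOLERANCE), budget * (1 + TOLERANCE)
    ok = low <= actual_words <= high
    status = "OK" if ok else f"OUT OF RANGE (expected {low:.0f}-{high:.0f})"
    print(f"{name}: {actual_words} words, budget {budget} -> {status}")
    return ok


check_section("Introduction", 210)          # within tolerance
check_section("Romanian Carpathians", 320)  # under-delivered, needs another pass
```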
Constraints as Creative Boundaries
I defined four constraint categories:
Structural: Each H2 opens with direct answer. No meta-commentary like "This section will discuss..."
Linguistic: Conversational language. Avoid Wikipedia formality.
Example: "limestone pavement" becomes "limestone eroded into puzzle-piece patterns."
Formatting: Lists require contextualizing paragraphs. No naked bullet points.
Factual: No invented statistics. If uncertain, phrase generally or omit.
These weren't restrictions; they were guard rails preventing AI's worst habits: academic stiffness, meta-narration, hallucinated data.
Outline Before Brief (The Two-Stage Rocket)
I forced outline approval before brief generation.
Why? Because fixing structure is cheap. Fixing 3,000 words of content built on wrong structure is expensive.
When Gemini's outline over-prioritized Nordic destinations (contradicting my "Romanian Carpathians as differentiation" strategy), we corrected it in 30 seconds. That course correction saved hours of rework.
Structure approval = semantic scaffold validation before construction begins.
The Four-Model Experiment: When AI Interpretations Diverge
Tests, Allies, Enemies: The Tournament
Brief Generation Showdown
Same prompt. Four models. Four radically different philosophies.
Claude:
16-page comprehensive masterpiece.
Every conceivable instruction documented.
Pedagogical examples (✅ correct / ❌ incorrect formatting).
Quality assurance frameworks.
Entity strategies with strategic notes.
Think IKEA instructions: photos, warnings, "here's what it should NOT look like" panels.
Copilot:
9-page balanced approach.
Demonstrated its mathematical thinking (volumetric adjustments: 3,500 → 3,360 → 3,300 words, showing the logic).
Quotable phrase examples within section guidance.
Technical precision meets operational clarity.
Gemini:
3-page minimalist framework.
Clear volumetric table, entity lists, production protocol.
Assumed writer competence. Think quick-start guide for someone who's built furniture before.
ChatGPT:
4-page strategic outline.
Excellent constraint articulation and SEO metadata formatting.
But missing:
production protocol details,
entity search,
section-by-section instructions.
Think design concept document: it shows the final product but leaves the assembly details implicit.
First major finding: Model "personality" affects brief utility. Claude assumes inexperienced writer. Gemini assumes experienced writer. Both valid, different use cases.
Comparative Analysis: The Compliance Matrix
I scored against 10 mandatory elements:
| Requirement | Claude | Copilot | Gemini | ChatGPT |
|---|---|---|---|---|
| Volumetric Planning | ✅ | ✅ | ✅ | ✅ |
| Production Protocol | ✅ | ✅ | ✅ | ❌ |
| Entity Search | ✅ | ❌ | ✅ | ❌ |
| Internal Links | ✅ | ✅ | ✅ | ✅ |
| Visual Elements | ✅ | ✅ | ✅ | ✅ |
| SEO Metadata | ✅ | ✅ | ✅ | ✅ |
| Section Instructions | ✅ | ✅ | ❌ | ❌ |
| Final Checklist | ✅ | ✅ | ❌ | ❌ |
| Stop Points | ✅ | ✅ | ✅ | ✅ |
| Consolidation Instructions | ✅ | ✅ | ❌ | ❌ |
Compliance scores:
Claude 10/10.
Copilot 9/10 (missing entity search).
Gemini 6/10.
ChatGPT 5/10.
But here's where it gets interesting: compliance doesn't equal utility.
The Surprising Patterns
Claude's comprehensiveness paradox:
Included everything, risked overwhelming writers. Proposed 3,380 words "for editing flexibility" (technically exceeding the 3,300 max by 80 words).
Pedagogical brilliance, potential operational burden.
Copilot's persona mastery:
Best voice modeling. Natural phrases like "Bad weather is often the best special effect Europe can offer" and "If the gift shop arrives before the myth, the spell is already broken." These weren't instructions; they demonstrated the voice.
Critical gap: No entity search section.
Gemini's elegant minimalism:
Works beautifully for experienced content strategists.
Dangerous for junior writers who need hand-holding. Brevity as feature and bug simultaneously.
ChatGPT's strategic weakness:
Strong conceptual framing ("Liminal Europe," "Enchantment as emotional residue") but missing operational mechanics. A writer wouldn't know to follow the modular methodology or create entity hierarchies.
Key insight: The quality gap isn't about intelligence. It's about interpretation of completeness.
Claude interpreted "complete brief" as "everything needed for successful execution including quality control." ChatGPT interpreted it as "strategic framework with constraints." Both valid interpretations of ambiguous instructions.
This is why consolidation matters.
The Consolidation Strategy: Frankenstein or Masterpiece?
Approach to the Inmost Cave: Building the Master Brief
The Synthesis Decision
Why not just pick the "best" brief and move on?
Because each model excelled at different elements. Choosing one meant abandoning the others' strengths. That's not optimization; it's wasting time, energy, tokens, and quality.
Copilot nailed persona voice but lacked entity search. Claude had forensic entity strategies but risked academic stiffness. Gemini captured authentic tone in three pages but left structural gaps. ChatGPT framed concepts beautifully but missed operational details.
The question wasn't "Which is best?" It was "What does each do better than the others?"
Consolidation Architecture
Foundation: Copilot (best persona alignment, operational completeness, voice modeling)
Strategic Integrations:
From Claude: Entity search framework with "Essential vs. Optional" classifications. Example: "Transylvania - for geographical orientation only" vs. "Maramureș - must appear, defines living folklore theme." This semantic hierarchy is what makes Knowledge Graph optimization work.
From Gemini: Voice-perfect moments. The OG description "Forget the gift shops. We're diving into the real folklore" became Option 3 in my metadata. Why? Because it is the brand voice, distilled to 13 words.
From Claude: Pedagogical constraint examples.
❌ "This is the direct answer: The Carpathians offer..."
✅ "The Carpathian Mountains preserve Europe's densest concentration of living folklore..."
These good/bad pairs prevent execution failures.
From Claude: Enhanced production protocol with Step 4 (Master Doc assembly) specifying output options (consolidated text in chat OR downloadable .docx/.md file). Eliminates deliverable ambiguity.
From Claude: Quality Assurance framework: 4-category audit (Content, Structural, Brand, Technical) with explicit verification items before delivery.
What I DIDN'T Consolidate:
ChatGPT's conceptual H2 structure ("Liminal Europe," "Enchantment as emotional residue"). Too abstract for practical travel planning. Readers need destinations, not philosophy.
Claude's 16-page comprehensiveness. Would have created a 25-page consolidated monstrosity. Comprehensiveness ≠ usability.
Gemini's brevity. Its conciseness works standalone but wouldn't integrate well with other elements without losing its elegance and effectiveness.
The 11-Page Master Brief
Final structure:
Pages 1-2: Context, objectives, constraints (Copilot base)
Pages 3-4: Volumetric planning + enhanced production protocol (Copilot + Claude's Step 4)
Pages 4-5: AI-friendly principles + Claude's constraint examples
Pages 5-7: Claude's entity search framework (filling Copilot's gap)
Pages 7-9: Copilot's section instructions with Gemini voice moments
Pages 9-10: Visual/UX + internal links (Copilot base)
Page 10: Claude's QA framework + Copilot's checklist
Page 11: SEO metadata with Gemini's OG descriptions added
Why 11 pages is optimal: Comprehensive without overwhelming. Junior writers get guidance. Senior writers can skim to entity strategy and section weights.
Repeatability: This becomes a template. Change the topic, personas, and entities but keep the structure.
That's how you go from 5-6 hours prototyping to 60 minutes execution.
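Here is roughly what "change the topic, personas, and entities but keep the structure" looks like once you parameterize it. The field names are illustrative; the real master brief carries far more detail per section.

```python
# Sketch of the reusable brief template: swap the variables, keep the structure.
# Field names are illustrative; the real master brief carries far more detail per section.
from dataclasses import dataclass, field


@dataclass
class BriefTemplate:
    topic: str
    persona: str
    agent_role: str
    target_words: int
    essential_entities: list[str] = field(default_factory=list)
    optional_entities: list[str] = field(default_factory=list)
    # Fixed structure reused across every brief:
    sections: tuple = (
        "Context and objectives",
        "Volumetric planning and production protocol",
        "AI-friendly principles and constraint examples",
        "Entity search framework",
        "Section-by-section instructions",
        "Visual/UX and internal links",
        "QA framework and final checklist",
        "SEO metadata",
    )


mythical_europe = BriefTemplate(
    topic="Mythical Places in Europe",
    persona="childless couples aged 25-40 seeking authentic folklore destinations",
    agent_role="SEO content strategist specializing in traditional search, "
               "generative AI, and digital PR",
    target_words=3000,
    essential_entities=["Maramureș", "Hoia-Baciu Forest"],
    optional_entities=["Transylvania"],
)
```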
Execution Reality Check: When Models Write the Actual Content
The Ordeal: Testing the Brief
The Production Protocol Ambiguity
I fed the consolidated master brief to all four models again, with a single instruction: generate the actual 3,000-word guide.
Critical finding:
Three models misinterpreted the "modular production protocol."
Claude, Copilot, Gemini: Delivered 3,000+ words of excellent content... buried inside 7,000-11,000 words of production scaffolding. Stop points everywhere. "⚠️ MODULE 1 COMPLETE - AWAITING APPROVAL." Word count updates. Review checklists. Quality verification frameworks.
They treated the production methodology as part of the deliverable instead of instructions for creating the deliverable.
ChatGPT: Delivered clean format (no scaffolding) but catastrophically wrong length: 1,400 words. 52% under target. Right interpretation of deliverable format, complete failure on volumetric requirements.
The problem? Section 13 of my brief said, "consolidate into Master Document" but never explicitly stated "exclude production scaffolding from final output."
AI interprets literally. Humans infer contextually. That one missing sentence cost me the first round of execution.
Content Quality Assessment
Once I extracted the actual article content from the scaffolding (and gave ChatGPT another chance with explicit word count enforcement), quality ranking:
Winner: Copilot (to my surprise, I must admit) - Most sophisticated prose, natural voice, perfect depth/readability balance. Example: "Bad weather is often the best special effect Europe can offer" appeared naturally in the practical timing section. Not instruction, not forced. It felt written by someone who'd actually been to the Carpathians at dawn.
Strong Second: Claude - Excellent execution, comprehensive entity coverage, but risked slight formality. Phrases like "pre-Christian palimpsest" and "myth-active landscapes" showed depth but needed careful editing to maintain conversational tone.
Solid Third: Gemini - Excellent content, most verbose. Tendency to over-explain what could be stated simply. Strong voice consistency but needed trimming.
Problematic: ChatGPT - Even with explicit word count reinforcement, struggled with section balance. Strong opening, weak middle sections, rushed conclusion. The conceptual framing it excelled at in brief generation didn't translate to content execution discipline.
The Persona Alignment Test
I asked: which output would make a 28-year-old couple actually book a trip to Romania's Apuseni Mountains?
Copilot: "Hoia-Baciu Forest has trees that grow genuinely deformed—documented but not fully explained. There's a clearing where nothing grows, a perfect circle analyzed for everything from soil composition to radiation. No definitive answer."
Claude: "The Hoia-Baciu Forest exhibits dendrological anomalies characterized by extreme morphological deviation. The central clearing presents a 30-meter diameter zone of vegetative absence, subject to extensive geomagnetic analysis."
Same information. Copilot version gets shared. Claude version gets skimmed.
Persona alignment isn't optional; it's the difference between content that converts and content that gets closed.
The Brief Revision Insight
One sentence would have fixed everything: "The production protocol describes your working methodology. The final deliverable should contain only the article content, images, and metadata—no scaffolding, no stop points, no word count updates."
That's the forensic insight: AI needs explicit boundaries between process and product. Human writers understand this intuitively. Models don't.
I updated the consolidated brief.
Section 13 now includes: "Final deliverable format: Article text (H1→conclusion) + image markers + SEO metadata table. Exclude all production commentary, stop points, and approval gates."
Problem solved. Execution clean on second run.
The Framework: Your Replicable AI Content System
Return with the Elixir: The Method
The 6-step protocol that emerged:
Structured prompt with multi-task Chain of Thought - Analysis → Outline → Brief generation, with approval gates between each
Reference document analysis - SERP + synthetic AI responses + People Also Ask (3 levels) + query fan-outs for entity extraction and differentiation opportunities
Persona and agent definition - Specific audience characteristics + role assignment ("SEO content strategist specializing in...")
Volumetric planning with weights - Word allocation per section (±10% tolerance), preventing unbalanced content
Multi-model brief generation → consolidation - Run same prompt through 4 models, extract best elements, synthesize master brief
Execution with explicit deliverable format - Clear boundary between production methodology and final output
The time reality:
Building this prototype: 5-6 hours (once)
Subsequent 3,000-word guides: 60 minutes
Product Listing Pages (PLP): 15 minutes
Product Description Pages (PDP): 10 minutes
Why prototyping took longer: I was discovering the process, not following it. Now it's template-based.
The scalability math: 100 product descriptions = 10 hours of work with persona alignment, entity optimization, and multi-channel visibility (traditional SEO + generative AI).
When to use this system:
Content requiring persona precision
Strategic pages (category, editorial, cornerstone)
Multi-channel optimization targets (organic search + AI Overviews + AI Mode)
Brand voice consistency at scale
When to simplify:
Minor edits, quick updates
Internal documentation
Content with no strategic SEO importance
The parallel processing advantage: four models work simultaneously. You're not waiting on them one by one; you're orchestrating simultaneous execution and selecting the best output.
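If you want to operationalize that orchestration, the pattern is simple. In this sketch, generate_with() is a placeholder for whatever API client you use for each model; it is not a real library call.

```python
# Sketch of the parallel orchestration pattern. generate_with() is a placeholder for
# whatever API client you use per model; it is NOT a real library call.
from concurrent.futures import ThreadPoolExecutor

MODELS = ["Claude", "ChatGPT", "Gemini", "Copilot"]


def generate_with(model: str, prompt: str) -> str:
    """Placeholder: call the model's API here and return its text output."""
    raise NotImplementedError(f"wire up the {model} client")


def run_all(prompt: str) -> dict[str, str]:
    """Send the same structured prompt to all four models simultaneously."""
    with ThreadPoolExecutor(max_workers=len(MODELS)) as pool:
        futures = {model: pool.submit(generate_with, model, prompt) for model in MODELS}
        return {model: f.result() for model, f in futures.items()}

# The human step stays manual: compare the four outputs, score them against the
# compliance checklist, and consolidate the best elements into the master brief.
```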
Conclusion: Integration, Not Replacement
Here's what this proves: AI content generation works when you treat models as collaborators with distinct strengths, not as replacements for strategic thinking.
The synthesis principle matters. Copilot's persona mastery + Claude's entity rigor + Gemini's authentic voice + human editorial judgment = content that ranks, gets cited, and doesn't make readers stupider.
What this disproves: "Just use ChatGPT" as a content strategy. Single-model approaches leave 60-70% of capability on the table.
The uncomfortable truth for our industry: prompt engineering is now a core SEO skill. Not optional. Not a "nice to have." If you can't architect multi-task instructions with constraint systems and approval gates, you're operating at 30% capacity.
The future isn't AI replacing SEOs. It's SEOs who understand AI replacing SEOs who don't.
Your turn: Take this framework. Adapt it. Break it. Improve it. Then tell me what you learned. Because that's how this works: we iterate together, we share findings, we build better systems.
The quest continues.
A final note: you can find the multi-task prompt for generating new content here, along with the documents I used to feed it for the test described in this article. You cannot edit the prompt or the shared documents directly, so to test them you will need to create a copy and save it to your own Drive.
Article by
Gianluca Fiorelli
With almost 20 years of experience in web marketing, Gianluca Fiorelli is a Strategic and International SEO Consultant who helps businesses improve their visibility and performance in organic search. Gianluca has collaborated with clients from various industries and regions, such as Glassdoor, Idealista, Rastreator.com, Outsystems, Chess.com, SIXT Ride, Vegetables by Bayer, Visit California, Gamepix, James Edition and many others.
A very active member of the SEO community, Gianluca shares his insights and best practices on SEO, content, search marketing strategy, and the evolution of search daily on social media channels such as X, Bluesky and LinkedIn, and through the blog on his website, IloveSEO.net.