AI SEO 16 Jan 2026

Tracking AI citations: Profound vs manual prompt audits

Profound and manual prompt audits both have a place in tracking AI citations. Here's what we use each for, what we trust and where the gaps still sit.

If you are running an AI search programme in 2026, you will have asked the same question we did: should we pay for a tracking tool like Profound, run our own manual prompt audits, or do both? We have done all three for different clients over the last year and the answer is not the same for everyone.

This post is the honest comparison: where each approach earns its keep, where each one falls short and where we have landed on the question.

What Profound actually does

Profound is one of the more credible AI search tracking platforms that has emerged. It runs a configurable set of prompts across multiple LLM surfaces on a regular cadence, captures the responses including cited sources and aggregates the data into a tracking dashboard. You can see your citation share over time, compare to competitors and drill into specific prompts.

There are a handful of competitors in the space, including Athena and various plugins on top of Semrush and Ahrefs. The shape is broadly similar across them. We have spent the most time with Profound, so it is what this post is centred on, though much of the analysis applies to the wider category.

What we use Profound for

Four jobs where we find Profound earns its place:

Scale. Running 200 prompts manually across five LLM surfaces every two weeks is not realistic. Profound automates the scale and we can monitor a much wider prompt set than we could otherwise.

Trend tracking over time. Manual audits are a snapshot. Profound gives us a moving picture, which is what you actually need to understand whether interventions are working. A single audit tells you where you are. Six months of tracked data tells you what is changing.

Competitive baselining. Profound makes it easy to see citation share for you and three or four named competitors across the same prompt set. That comparative view is hard to assemble manually with any consistency.

Reporting cadence. For client reporting, having a stable dashboard that updates automatically is operationally valuable. We can show movement and isolate which interventions correlate with which shifts.
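Under the hood there is nothing exotic: citation share is just the proportion of prompt runs in a period where your domain appears among the cited sources. Here is a minimal sketch of the calculation, assuming you can export run-level records from the tool or your own scripts; the field names are illustrative, not Profound's export format.

```python
from collections import defaultdict

# Illustrative prompt-run records. A real export would carry the same shape:
# which prompt, which surface, which period, which domains were cited.
runs = [
    {"prompt": "best b2b endpoint security", "surface": "chatgpt",
     "period": "2026-01", "cited_domains": ["g2.com", "ourbrand.com"]},
    {"prompt": "best b2b endpoint security", "surface": "perplexity",
     "period": "2026-01", "cited_domains": ["competitor.com"]},
    # ... hundreds more rows per period ...
]

def citation_share(runs, domain):
    """Share of prompt runs per period in which `domain` was cited at least once."""
    cited = defaultdict(int)
    total = defaultdict(int)
    for run in runs:
        total[run["period"]] += 1
        if domain in run["cited_domains"]:
            cited[run["period"]] += 1
    return {period: cited[period] / total[period] for period in sorted(total)}

print(citation_share(runs, "ourbrand.com"))    # e.g. {"2026-01": 0.5}
print(citation_share(runs, "competitor.com"))  # the same calculation gives the competitive baseline
```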

Where Profound has limits

We are honest about these because the tools are sold harder than they should be.

The prompt set is your work, not the tool’s. Profound runs the prompts you give it. A bad prompt set produces a confident dashboard tracking the wrong thing. We spend more time refining prompt sets than we do reviewing the dashboard.

LLM responses are non-deterministic. Run the same prompt twice and you can get different answers. Profound mitigates this by running multiple times and aggregating, but the aggregation introduces its own variance. A 5% movement in citation share might be real or it might be noise. We treat short-term movement with caution.
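A rough way to calibrate your own scepticism is a two-proportion check on the before-and-after shares. This is a sanity-check sketch of our own, not Profound's aggregation method, and it treats prompt runs as independent samples, which they are not quite.

```python
import math

def share_movement_zscore(cited_a, total_a, cited_b, total_b):
    """Rough two-proportion z-score for a change in citation share between
    two tracking periods. Read it as a sanity check, not a verdict."""
    p_a = cited_a / total_a
    p_b = cited_b / total_b
    pooled = (cited_a + cited_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    return (p_b - p_a) / se

# 200 runs per period: 18% share moving to 23% looks like a five-point gain...
z = share_movement_zscore(36, 200, 46, 200)
print(round(z, 2))  # ~1.24, well inside normal sampling noise (|z| < 1.96)
```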

Some surfaces are still hard to track reliably. Microsoft Copilot inside Microsoft 365 is the most obvious one. The tool can sample the public Copilot surface but cannot see the tenant-aware version that many enterprise users actually interact with. There are also gaps in regional or language-specific surfaces.

Citation extraction is imperfect. When ChatGPT cites a source, the tool needs to parse the citation reliably. We have seen edge cases where the parsing misses a citation or attributes one incorrectly. The error rate is small but not zero, and it skews the picture most when you are comparing close competitors.

Cost. Profound and equivalents are not cheap, especially if you want a wide prompt set across multiple surfaces tracked frequently. For a small B2B firm with a focused audit need, the spend may not pay back.

What manual audits give you that tools do not

Four reasons we still run manual audits alongside the tooling.

Texture and intent reading. A dashboard tells you that you are cited 3% of the time. A manual audit tells you that when you are cited, the model frames you as “a low-cost option” when you would rather be framed as “a security-led option”. That kind of qualitative read is hard to automate.

Following the conversation. When a model answers a prompt, a real prospect often asks a follow-up: “Tell me more about it”, “How does it compare to the alternative”, “What do users say about it?”. Manual audits let you walk through that conversational arc and see how positioning holds up. Tools struggle with this because the second prompt depends on the first response.
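The dependency is obvious the moment you try to script it: every turn after the first has to carry the previous answer as context. A minimal sketch using the OpenAI Python client as a stand-in for any chat API; it will not reproduce the consumer ChatGPT surface or its web citations, which is exactly why we still run these arcs by hand.

```python
from openai import OpenAI  # stand-in for any chat-style API

client = OpenAI()

# A generic arc. In a real audit the follow-ups reference the specific
# product or competitor the first answer surfaced.
prompts = [
    "What are the leading options for <category>?",
    "Tell me more about the one you would recommend.",
    "How does it compare to the others you mentioned?",
    "What do users say about it?",
]

messages = []
for prompt in prompts:
    messages.append({"role": "user", "content": prompt})
    reply = client.chat.completions.create(model="gpt-4o", messages=messages)
    answer = reply.choices[0].message.content
    # The next prompt only makes sense in the context of this answer,
    # which is what a fixed one-shot prompt set cannot reproduce.
    messages.append({"role": "assistant", "content": answer})
    print(f"\n> {prompt}\n{answer[:300]}")
```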

Catching framing shifts early. Models occasionally start describing a category in a new way, or start treating a competitor as the canonical example. Tools tracking citation share will not flag this until it shows up in measurable share movement. A human reading the responses notices it the first time it happens.

Trust calibration. When a tool gives you a number, you have to decide how much to trust it. The only way to calibrate is to run the same prompts manually and compare. We do this quarterly for clients on Profound and it is worth the time.
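The calibration itself is plain arithmetic: rerun a subset of the tracked prompts by hand, record whether you were cited, and compare against what the dashboard logged for the same period. A small sketch, assuming yes/no citation records keyed by prompt; the structure is ours, not the tool's.

```python
# Manual observations vs what the tool recorded, same prompts, same period.
# True means "our domain was cited at least once".
manual = {
    "best soc 2 automation tools": True,
    "okta alternatives for smbs": False,
    "top endpoint security vendors": True,
    "how to roll out passkeys": False,
}
tool = {
    "best soc 2 automation tools": True,
    "okta alternatives for smbs": True,   # tool says cited, we did not see it
    "top endpoint security vendors": True,
    "how to roll out passkeys": False,
}

agreement = sum(manual[p] == tool[p] for p in manual) / len(manual)
disagreements = [p for p in manual if manual[p] != tool[p]]

print(f"Agreement: {agreement:.0%}")   # 75% here; a wide gap is the cue to investigate
print("Recheck by hand:", disagreements)
```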

How we combine the two

The pattern that has settled out across our client work:

Profound or equivalent runs the headline tracking. A prompt set of 100 to 200 queries, refreshed every two to four weeks, across the four to five LLM surfaces that matter for the client’s category.

Manual audits run monthly on a smaller set. Usually 20 to 40 high-value prompts, run conversationally rather than one-shot. We capture not just the citation but the framing, the follow-up behaviour and the texture.

Quarterly calibration. A subset of the Profound prompts run manually to validate the tool’s data. If the gap is wide, we investigate.

Ad hoc deep dives. When something moves materially in the dashboard, we go manual to understand what is actually happening. Was it a real shift, a tool artefact or a one-off model behaviour change?
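Written down, the whole cadence fits on one screen. A sketch of how we might encode it for our own run scripts; this is not Profound's configuration format and the numbers are simply the ranges quoted above.

```python
# Illustrative programme definition for our own scheduling scripts,
# not a tracking tool's configuration format.
TRACKING_PROGRAMME = {
    "tool_tracking": {
        "prompt_set_size": 150,      # 100 to 200 queries
        "refresh_days": 21,          # every two to four weeks
        "surfaces": ["chatgpt", "claude", "perplexity", "copilot"],
    },
    "manual_audit": {
        "prompt_set_size": 30,       # 20 to 40 high-value prompts
        "cadence_days": 30,          # monthly, run conversationally
        "capture": ["citation", "framing", "follow_up_behaviour"],
    },
    "calibration": {
        "cadence_days": 90,          # quarterly
        "method": "rerun a subset of tool prompts manually and compare",
    },
    "deep_dive": "ad hoc, triggered by material movement in the dashboard",
}
```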

This is more work than running just the tool. It is also more honest about what we know.

Connecting tracking to action

Tracking on its own is just a cost. The point is to drive intervention decisions. The signals we look for:

  • Citation share dropping for a specific prompt cluster, which usually points to a third-party source change
  • Competitor citation share rising suddenly, which usually means they have done something new on G2, Reddit or trade press
  • A prompt that used to cite us starting to cite a community source instead, which tells us we need to be in that community
  • Framing shifts that mean even our citations are positioning us wrongly

Each of these triggers a different intervention. A drop in citation share for a comparison prompt drives comparison page work. A community citation gain drives Reddit and forum activity. A framing shift drives homepage and category page revision.
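The routing is mechanical enough to keep as a checklist. A sketch of how we might encode it; the signal and intervention labels are shorthand for the work described above, not output any tool produces.

```python
# Signal-to-intervention routing, kept as a review checklist.
INTERVENTIONS = {
    "share drop on a comparison prompt cluster": "refresh the comparison pages and the third-party sources they lean on",
    "sudden competitor share gain": "check G2, Reddit and the trade press for their new activity",
    "community source displacing our citation": "build a genuine presence in that community",
    "framing shift in our own citations": "revise homepage and category page positioning",
}

def next_action(signal: str) -> str:
    return INTERVENTIONS.get(signal, "go manual and read the responses")
```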

Our wider piece on tracking AI search traffic covers the analytics-side measurement. This post is specifically about the citation side.

What this looks like for smaller firms

Not every firm needs Profound. For a B2B tech business with a focused product and a tight prompt set, manual audits run by a single person for a few hours a month can be enough. The pattern we use:

  • 30 to 40 prompts captured in a spreadsheet
  • Monthly run across ChatGPT, Claude, Perplexity and Microsoft Copilot
  • Citation presence and rank logged for each prompt
  • A short qualitative note on framing for each
  • Quarterly review against the previous quarter

Total time, four to six hours a month. No tool subscription. The trade-off is no automated dashboard and no scale.
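For the spreadsheet itself, a handful of columns is enough. A minimal sketch of the row we log per prompt, written as a CSV append so the whole thing stays a small monthly job; the column names are ours.

```python
import csv
import os
from datetime import date

FIELDS = ["run_date", "surface", "prompt", "cited", "citation_rank", "framing_note"]

def log_audit_row(path, surface, prompt, cited, citation_rank, framing_note):
    """Append one manual audit observation to the running log."""
    new_file = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:                         # new log: write the header first
            writer.writeheader()
        writer.writerow({
            "run_date": date.today().isoformat(),
            "surface": surface,
            "prompt": prompt,
            "cited": cited,                  # yes / no
            "citation_rank": citation_rank,  # position among cited sources, blank if not cited
            "framing_note": framing_note,    # one line on how the model described us
        })

log_audit_row("audit_log.csv", "chatgpt", "best b2b endpoint security",
              "yes", 2, "framed as low-cost rather than security-led")
```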

For larger firms with broad prompt coverage, the tool is usually worth the spend. For smaller firms, manual gets you a long way.

Connecting to the wider picture

Tracking is a feedback loop into the rest of the programme. If you are still building foundations, our AI search optimisation primer and our piece on content KPIs for the AI search era give the broader context. For the citation displacement playbook the tracking should feed into, “ChatGPT keeps recommending your competitor” is the companion read.

What we are still working out

A few honest caveats. The tooling category is young and the products will change shape over the next twelve months. The relationship between citation share and actual buyer behaviour is still being calibrated. Models change their citation logic without notice, which means historical data carries asterisks.

The pragmatic position is that some tracking beats none, automated tracking helps when scale matters and manual reading still does work no tool replicates yet.

If you’d like a second opinion on how you are tracking AI citations, drop us a line. You can also see how we approach this work on our AI SEO services page.

Frequently asked questions

Is Profound worth paying for, or can we run audits manually?
It depends on scale. For B2B tech firms with focused product lines and tight prompt sets, manual audits run by one person for a few hours a month can be enough: around 30 to 40 prompts captured monthly across ChatGPT, Claude, Perplexity and Microsoft Copilot, with citation presence and rank logged plus a qualitative note on framing. For larger firms with broad prompt coverage that want to track 100 to 200 prompts across multiple surfaces every two weeks, the tool is usually worth the spend because manual work cannot match the cadence.
What can manual audits catch that Profound misses?
Three things consistently. Texture and intent reading, where a manual audit tells you the model frames you as "a low-cost option" when you would rather be framed as "a security-led option". Conversational follow-ups, where a real prospect asks two or three questions in sequence and you can see how positioning holds up across the arc. Framing shifts, where a model starts describing a category in a new way before it shows up in measurable share movement. Tools tracking citation share will not flag these until they show in the numbers.
How much should we trust short-term movements in citation share?
Be cautious. LLM responses are non-deterministic. Run the same prompt twice and you can get different answers. Profound mitigates this by running multiple times and aggregating, but the aggregation introduces its own variance. A 5% movement in citation share might be real or it might be noise. We treat short-term movement with caution and only act on patterns sustained over four to eight weeks. Run a quarterly calibration where a subset of tool prompts is run manually to validate the data. If the gap is wide, investigate.

Want help putting this into practice?

We work with technology companies on exactly this kind of programme. Tell us about yours.