How Google crawls, discovers, and indexes pages

Summary

Overview

Google Search resolves a query against an index that is the product of an upstream pipeline: a URL must first be discovered, then crawled, then rendered, then selected for indexing, and finally judged eligible to rank. The pipeline is one-directional and conditional at every gate. A URL that fails discovery never enters the queue; a URL that fails selection after crawl never appears in results regardless of how many times it is re-submitted.

Google publicly states that "indexing isn't guaranteed; not every page that Google processes will be indexed." John Mueller (Google) adds: "most of the time when we still crawl something, it doesn't necessarily mean that we will automatically index it."

Source: Google Search Central documentation; John Mueller (Google), on-record. Confidence: Verified.

Google's Search Central documentation (last updated December 2025) states that crawling and indexing "take some time and rely on many factors" and that "in general, we cannot make predictions or guarantees about when or if your URLs will be crawled or indexed." Troubleshooting documentation says "for most sites, new pages will take several days minimum… most sites shouldn't expect same-day crawling," and that Google "strives to check and index pages in a reasonably timely manner. For most sites, this is three days or more." Mueller's on-record range: "several hours to several weeks."

Source: Google Search Central documentation, December 2025 revision; John Mueller, on-record. Confidence: Verified.

The consolidated research brief on the broader lifecycle treats this same pipeline as roughly nine distinct evaluation passes: crawl fetch, rendering, index selection / quality evaluation, canonical selection, query-time ranking, scheduled re-crawl, freshness-driven re-crawl, broad-core-update re-evaluation, and automated spam / quality system passes.

A site moves through a fixed pipeline — discovery, crawl, render, index selection, ranking, and ongoing re-evaluation — but the timing of each stage is highly variable and largely outside the owner's control. Indexing of a given page typically lands somewhere between several hours and several weeks; reaching stable rankings for a brand-new domain is realistically a multi-month affair with no Google-confirmed number attached.

Source: Compass research document, "Google Search Lifecycle," June 2026, anchored in Google's documentation, Search Central Blog, and on-record statements from John Mueller, Gary Illyes, Martin Splitt, Danny Sullivan, and Allan Scott. Confidence: Industry-consensus (synthesis).

Only the discovery pass is properly one-time-per-URL; the rest can be revisited at any time, including the indexing decision itself, which can be reversed. This page covers the first half of the pipeline — discovery, crawl, render-readiness, and index selection — together with canonicalisation and the post-2024 indexing-failure tail. The downstream half (rankings, the new-site ramp, and post-launch volatility) is covered at SEO J-curve and new-site ramp.

Crawl mechanics: how Googlebot discovers URLs

Before Googlebot can fetch a URL, Google must become aware that the URL exists. Discovery is documented by Google as occurring through a small set of channels:

Discovery happens via inbound links from already-crawled pages (the original and still-powerful route); XML sitemaps (Google uses <loc> and <lastmod>); Search Console property verification and URL Inspection's "Request indexing"; redirects from known URLs; and hosting platforms (Wix, Squarespace) that auto-notify Google on publish. For a new site, Google explicitly recommends that the "best first step is to request indexing of your homepage," from which Google can follow internal navigation to the rest.

Source: Google Search Central documentation. Confidence: Verified.

Caveat: Orphan pages — not linked from anywhere internally or externally, and not present in the sitemap — can remain invisible to Google indefinitely.

Inbound links remain the strongest channel because they double as a crawl-priority cue and an authority signal. A sitemap is a declaration of what URLs the site owner considers canonical and recently updated; it does not itself confer crawl priority.

Per Google's documentation, keeping an XML sitemap current is "adequate" for most sites. Sitemaps aid discovery (Google sees which URLs the owner considers canonical and when they last changed). They do NOT guarantee indexing — Google still chooses whether to crawl and whether to index based on quality and signals.

Source: Google Search Central sitemap documentation. Confidence: Verified.

For sites in the 50–500-URL range, the sitemap is a one-time set-up artifact: generate from the CMS, submit in Search Console, do not edit by hand. Its leverage is "did Google find this page," not "should Google rank this page." Two adjacent submission channels are widely misunderstood: IndexNow is not supported by Google, and the Indexing API is documented as restricted to JobPosting and BroadcastEvent content. The submission-tool category is treated under the indexing-failure section below.
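As an illustration of the "generate, submit, don't hand-edit" approach, the following is a minimal sketch of producing a sitemap with <loc> and <lastmod> from a list of pages using only Python's standard library. The function name build_sitemap and the example URLs are hypothetical; a real CMS will normally generate this file itself.

```python
from datetime import date
from xml.etree import ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build a sitemap string from (absolute_url, last_modified_date) pairs.

    Hypothetical helper for illustration only.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        # <lastmod> should reflect the last real content change (YYYY-MM-DD).
        ET.SubElement(url, "lastmod").text = lastmod.isoformat()
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    pages = [
        ("https://example.com/", date(2025, 1, 2)),
        ("https://example.com/pricing", date(2025, 1, 5)),
    ]
    print(build_sitemap(pages))
```

The output only declares which URLs exist and when they changed; as noted above, it does not confer crawl priority or guarantee indexing.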

The robots.txt gate

Googlebot fetches /robots.txt before crawling any URL on a host. A failure at this gate halts the entire crawl pipeline for the host:

If the server returns a 5xx (unreachable) error for robots.txt itself, Google postpones the crawl entirely rather than guessing. Google's own message reads: "your server returned a 5xx (unreachable) error when we tried to retrieve your robots.txt file. To make sure we didn't crawl any pages listed in that file, we postponed our crawl." Google caches robots.txt for up to 24 hours (longer if it can't refresh). DNS failure or server unreachability blocks the whole pipeline at this gate.

Source: Google Search Central documentation; Search Console robots.txt report messaging. Confidence: Verified (verbatim Google messaging).

A robots.txt 5xx is one of the most common silent launch failures because the surface symptom — "Google isn't indexing the site" — is indistinguishable from a discovery problem until the Crawl Stats report is inspected.
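A pre-launch check for this gate can be sketched as follows. This is a hedged illustration, not Google's logic: the function names and the outcome labels are hypothetical, and only the 5xx and unreachable cases are taken from the text above. Other status codes are deliberately returned as "review" rather than guessed at.

```python
import urllib.error
import urllib.request


def fetch_robots_status(host, timeout=10):
    """Return the HTTP status of https://<host>/robots.txt, or None if unreachable."""
    url = f"https://{host}/robots.txt"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # server answered, but with an error status
    except OSError:
        return None  # DNS failure, refused connection, or timeout


def robots_gate(status):
    """Map a robots.txt status to the crawl outcome described in this section."""
    if status is None:
        return "unreachable"  # the whole pipeline is blocked at this gate
    if 200 <= status < 300:
        return "proceed"  # file fetched; its rules apply
    if 500 <= status < 600:
        return "postponed"  # Google postpones the crawl rather than guessing
    return "review"  # not covered here; check Google's robots.txt documentation


if __name__ == "__main__":
    print(robots_gate(fetch_robots_status("example.com")))
```

Running this against the production hostname before launch, and again after any hosting or CDN change, catches the failure before it shows up as "Google isn't indexing the site."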

Crawl budget and re-crawl frequency for small sites

Crawl budget — the upper bound on how many fetches Google is willing to spend on a host in a given window — is a problem only at scale. Google's own crawl-budget documentation is explicit on this point:

Google's crawl-budget documentation states that crawl budget is a large-site concern. It explicitly tells sites under a few thousand URLs, or whose new pages are crawled same-day, that they "don't need to read this guide." Google's stated thresholds for caring are roughly: 1M+ pages updated regularly (about once a week), or 10k+ pages updated daily, or a high share of URLs stuck in "Discovered – currently not indexed" (which is itself usually a quality signal).

Source: Google Search Central, "Large site owner's guide to managing your crawl budget." Confidence: Verified.
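The thresholds above can be expressed as a small decision helper. This is a sketch under stated assumptions: the function name is hypothetical, and because the documentation gives no number for "a high share" of Discovered URLs, the backlog_threshold default of 0.5 is a placeholder, not a Google figure.

```python
def crawl_budget_matters(total_urls, update_interval_days,
                         discovered_share=0.0, backlog_threshold=0.5):
    """Rough test of whether a site falls in the crawl-budget audience.

    total_urls           -- number of unique URLs on the site
    update_interval_days -- typical days between content changes
    discovered_share     -- fraction of URLs in "Discovered - currently not indexed"
    backlog_threshold    -- ASSUMED cut-off for "a high share" (not from Google)
    """
    large_and_weekly = total_urls >= 1_000_000 and update_interval_days <= 7
    medium_and_daily = total_urls >= 10_000 and update_interval_days <= 1
    large_backlog = discovered_share >= backlog_threshold
    return large_and_weekly or medium_and_daily or large_backlog


if __name__ == "__main__":
    # A 300-page brochure site updated monthly is outside the guide's audience.
    print(crawl_budget_matters(300, 30))
```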

Re-crawl frequency for small sites is therefore driven not by a budget ceiling but by crawl demand — Google's assessment of how often a host is likely to have something new or changed worth fetching. Crawl demand on a brand-new domain starts low because Google has no history to extrapolate from:

With no history, Google has little crawl demand to work with on a new domain; it crawls conservatively and ramps up (or doesn't) based on what it finds. This is part of why a brand-new site can wait days-to-weeks for its first deep crawl while an established site sees pages indexed in minutes-to-hours. This is not a deliberate suppression — it is the absence of demand signals.

Source: Synthesis of Google documentation on crawl demand and Gary Illyes commentary on indexing speed. Confidence: Industry-consensus.

The inverse holds for established sites with earned crawl demand:

An established site with a track record of fresh, high-quality content earns higher crawl demand; Google learns its update patterns and returns faster. This is why a new post on an authoritative site can be indexed in minutes-to-hours while the same post on a new domain waits days-to-weeks.

Source: Synthesised from Google's crawl-demand documentation and Gary Illyes' quality-plus-popularity framing of indexing drivers. Confidence: Industry-consensus.

Crawl frequency is, in effect, a trailing indicator of accrued trust — earned through sustained behaviour, not manufactured directly.

JavaScript-driven discovery delays at zero authority

The discovery process for URLs surfaced only through client-side JavaScript is materially slower than for URLs surfaced through plain HTML <a href> links — and the gap widens dramatically on low-authority hosts:

Onely's November 2022 experiment (Ziemek Bućko, "Google Needs 9X More Time To Crawl JS Than HTML") ran on a brand-new, zero-authority test subdomain. Findings: "It took Google 313 hours to get to the final, seventh page of the JavaScript folder. With HTML, it took just 36 hours. That's nearly 9 times faster." At the first injected link, the gap was already 52 hours (JS) versus 25 hours (HTML).

Source: Onely (Ziemek Bućko), November 2022. Confidence: Single-source (methodologically disclosed).

Caveat: Onely sells technical-SEO audits and has a commercial interest in flagging JS-indexing risk. The methodology is disclosed and the result is directionally consistent with Google's own statements that new low-authority sites have less crawl priority.

The Onely experiment measures discovery delay specifically, not render completion of an already-known URL, and is therefore not contradicted by later studies showing favourable rendering numbers on high-authority sites: the favourable numbers come from sites with crawl-priority cushion; the punishing numbers come from sites without one.
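The arithmetic behind the Onely figures is worth making explicit, because it shows the delay compounding with link depth rather than staying constant. The sketch below uses only the four numbers quoted above; the function name is hypothetical.

```python
def slowdown(js_hours, html_hours):
    """How many times longer the JavaScript path took than the HTML path."""
    return js_hours / html_hours


first_link = slowdown(52, 25)     # first injected link
seventh_page = slowdown(313, 36)  # final, seventh page

if __name__ == "__main__":
    print(f"first link:   {first_link:.2f}x slower")
    print(f"seventh page: {seventh_page:.2f}x slower")
```

The ratio grows from roughly 2x at the first link to roughly 8.7x at the seventh page, which is consistent with each JS-only hop adding its own discovery delay.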

Index selection: why crawled pages are not always indexed

Crawling is a necessary but not sufficient precondition for indexing. Google decides per page whether to add a crawled URL to the index — a quality-and-selection judgement, not a queueing artefact. The two Search Console statuses that surface this distinction are persistently misread by site owners; they have different underlying causes and require different interventions:

The two GSC "not indexed" states have different underlying causes:

  • "Discovered – currently not indexed" = Google knows the URL exists but hasn't crawled it yet. At scale this is a crawl-priority / site-wide quality signal. Fix: site-wide quality and internal-linking architecture.

  • "Crawled – currently not indexed" = Google has crawled the page and chose not to index it. This is a per-page quality decision. Fix: improve the specific page, or accept that it isn't worth keeping.

Source: Google Search Central documentation; John Mueller commentary. Confidence: Verified.

"Discovered – currently not indexed"

The "Discovered" state means Google is aware the URL exists but has not yet allocated crawl resources to it. For small, healthy sites this typically resolves on its own. Mueller has confirmed it can also persist indefinitely:

"Discovered – currently not indexed" is the Search Console status for "the page was found by Google, but not crawled yet." For small, good-quality sites it usually resolves on its own. But John Mueller has confirmed it can persist indefinitely: "That can be forever. It's something where we just don't crawl and index all pages. And it's completely normal for any website that we don't have everything indexed."

Source: John Mueller (Google), SEO office hours. Confidence: Verified.

A large "Discovered" backlog is more significant than the per-URL status suggests. Mueller has framed it as a site-wide quality signal:

Mueller has framed a large "Discovered" backlog as a site-wide quality signal — Google declines to spend crawl resources on URL patterns it predicts will be low-value, which often reflects the domain's overall quality rather than the individual page's. Practical implication: "request indexing" on individual pages won't fix the underlying signal. The lever is whole-site content quality and internal-linking architecture, not page-by-page nudges.

Source: John Mueller (Google), Search Central office hours. Confidence: Verified.

"Crawled – currently not indexed"

The "Crawled" state means Google has fetched and evaluated the page and elected not to index it:

"Crawled – currently not indexed" is distinct from "Discovered." Here Google has fetched and evaluated the page and chose not to index it — typically a quality / selection signal (thin, duplicative, or low-value content; weak site-level trust). Owners frequently misread this as a bug; it is usually a deliberate decision. Mueller notes Google may even index a weak page briefly, then drop it if signals don't improve. It can persist for weeks or months and may never resolve without genuine quality improvement; re-requesting indexing without changes rarely helps.

Source: Google Search Central documentation; John Mueller commentary. Confidence: Verified.

The two states should therefore be read off the GSC Page Indexing breakdown by category, not by aggregate count. A rising "Discovered" count alongside a steady "Crawled" count points at site-wide quality and internal linking. A rising "Crawled" count alongside a steady "Discovered" count points at per-page quality. A widely cited practitioner rule attaches a soft escalation trigger at the four-week mark: if important pages remain unindexed after roughly four weeks despite good content and clean technicals, the appropriate response is a content-quality and internal-linking audit rather than repeated re-requests of indexing.
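The reading rule above can be sketched as a small triage function over period-over-period changes in the two counts. The function name, the tolerance parameter, and the wording of the returned actions are illustrative assumptions, not Search Console output.

```python
def diagnose(discovered_change, crawled_change, tolerance=0):
    """Map changes in the two "not indexed" counts to a fix path.

    discovered_change -- change in "Discovered - currently not indexed" count
    crawled_change    -- change in "Crawled - currently not indexed" count
    tolerance         -- ASSUMED noise band; changes at or below it count as steady
    """
    discovered_rising = discovered_change > tolerance
    crawled_rising = crawled_change > tolerance
    if discovered_rising and crawled_rising:
        return "audit site-wide quality, internal linking, and individual pages"
    if discovered_rising:
        return "audit site-wide quality and internal linking"
    if crawled_rising:
        return "improve or retire the affected pages"
    return "no escalation; keep monitoring"


if __name__ == "__main__":
    # Discovered backlog grew by 40 URLs while Crawled stayed flat.
    print(diagnose(discovered_change=40, crawled_change=0))
```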

The indexing-failure tail

Published evidence converges on a consistent picture: a non-trivial share of all published pages on the open web never gets indexed, and the share has been rising under post-2024 quality-tightening pressure.

Vendor benchmarks of indexing rates

The most-cited large-sample study is IndexCheckr's February 2025 distribution analysis:

IndexCheckr, "Insights from 16 Million Pages" (February 28, 2025). Of pages that were eventually indexed: average time-to-index 27.4 days; 14.0% within 0–7 days; 64.86% within 30 days; 76.81% within 90 days; 93.2% within six months; remaining 6.8% after day 180.

Source: IndexCheckr, "Insights from 16 Million Pages," February 28, 2025. Confidence: Single-source.

Caveat: Four stacked biases: (1) product-incentivised; (2) survivorship, since the timing distribution is computed only on pages that got indexed; (3) clock-start mismatch, since the clock starts when IndexCheckr began tracking, not at publication; (4) selection bias toward URLs people bother to monitor.

The survivorship caveat is critical. IndexCheckr's distribution describes pages that eventually got indexed; it says nothing about the share that never did. For that figure, the Indexing Insight segment study is the cleanest available source:
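The cumulative percentages can be converted into per-window shares, which makes the shape of the distribution easier to read. The sketch below also shows how a conditional figure shrinks once a never-indexed share is factored in; that share is a hypothetical input supplied by the reader, not an IndexCheckr number, and both function names are illustrative.

```python
# Cumulative share of eventually-indexed pages by day, from the IndexCheckr figures.
CUMULATIVE = [(7, 14.0), (30, 64.86), (90, 76.81), (180, 93.2)]


def window_shares(cumulative):
    """Turn cumulative percentages into the share landing in each window."""
    shares, previous = [], 0.0
    for _, pct in cumulative:
        shares.append(round(pct - previous, 2))
        previous = pct
    shares.append(round(100.0 - previous, 2))  # indexed after the last cut-off
    return shares


def unconditional_share(conditional_pct, never_indexed_fraction):
    """Rescale a share of indexed pages to a share of all published pages.

    never_indexed_fraction is an ASSUMED input, not part of the study.
    """
    return round(conditional_pct * (1 - never_indexed_fraction), 2)


if __name__ == "__main__":
    print(window_shares(CUMULATIVE))
    print(unconditional_share(14.0, never_indexed_fraction=0.16))
```

Read as windows, about half of the eventually-indexed pages land between day 8 and day 30, and any never-indexed share pushes every one of these figures lower when expressed against all published pages.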

Indexing Insight (Adam Gent), 1.7M pages across 18 sites, URL Inspection API, pulled March 31, 2025. Index-coverage scores: ~97% for news sites; <90% for e-commerce; ~70% for marketplace/listing sites. 88% of not-indexed pages were quality-driven (not technical).

Source: Indexing Insight (Adam Gent), 2025 industry analysis. Confidence: Single-source.

Caveat: Small site count (n=18); monitors only "important" sitemap-submitted pages (not a random sample of all URLs); product-incentivised. The "88% quality-driven" finding is the most actionable single number — it confirms that the typical fix path for "not indexed" is content/quality, not technical SEO.

The Google-side corroboration of a meaningful non-indexed share is John Mueller's 2021 office-hours framing:

John Mueller (Google), January 2021 (per Onely): "Especially for larger websites, it's really normal that we don't index everything… maybe we just index 1/10 of a website because it's a really large website." Mueller has also framed ~20% non-indexed as within normal bounds for healthy sites.

Source: John Mueller (Google), office-hours statement, 2021 (via Onely). Confidence: Verified (as a statement, not a measurement).

A widely-quoted Ahrefs 2023 snapshot anchors the broader "most pages don't get traffic" claim, which is downstream of the indexing-failure tail:

Ahrefs 2023 study, ~14 billion pages: 96.55% of pages get no organic traffic from Google. Cross-checked against Ahrefs 2025 (98.26% of new pages did not reach top 10 within a year) and Semrush 2022 (92% of new domains failed to stay in the top 100 over a year).

Source: Ahrefs, 2023. Confidence: Industry-consensus on direction.

Caveat: Product-incentivised; the snapshot includes all page types (parameter sprawl, low-intent URLs). The direction is robust across studies; the exact 96.55% is one vendor's snapshot.

The post-2024 tightening and the de-indexing turn

The pre-2024 framing of "most content is indexed within a week" is survivorship-skewed and should be revised downward at the tail. The claim is true for the eventually-indexed majority on healthy sites, but false as a universal expectation:

Mueller's "within about a week" and Onely's "83% in week one" describe pages that eventually got indexed — they ignore the ~16% never-indexed tail and the rising de-indexing rate. IndexCheckr's harder number is only 14% within 7 days. Honest framing: a healthy page on a decent site is often indexed within days to a few weeks, but a non-trivial share waits months or never makes it.

Source: Synthesis of John Mueller, Onely (2022), and IndexCheckr (2025). Confidence: Industry-consensus.

Two events in the 2024–2026 window materially changed the picture. The first was Google's March 2024 quality update:

Google's March 2024 update statement (updated April 26, 2024) reported: "45% less low-quality, unoriginal content in search results versus the 40% improvement we expected." A deliberate, large-scale tightening of the quality bar — Google overshot its own targeted improvement by 5 percentage points. This is a Google-confirmed measurement of a deliberate intervention, not an incidental side-effect.

Source: Google Search Central, March 2024 core update statement, updated April 26, 2024. Confidence: Verified.

Caveat: "Low-quality, unoriginal" is Google's own framing — what counts as either is opaque. The statement establishes that the bar moved and that Google considers the move successful; it does not specify which content patterns were targeted.

The second was the May 2025 de-indexing event — the first widely documented case of Google actively removing previously indexed pages at scale:

In the May 2025 indexing purge, Indexing Insight detected that ~25% of 2 million monitored pages were actively removed from Google's index, the largest de-indexing event in its monitoring. Individual sites lost 15–75% of indexed pages. The widely cited "130-day rule of thumb" (that pages not re-crawled for 130 days get de-indexed) was broken by the event.

Source: Indexing Insight (Adam Gent), 2025. Confidence: Single-source.

Caveat: Monitored-page selection bias; product-incentivised. The strategic point is that de-indexing is now a major force — getting indexed fast is easier than staying indexed.

The combined effect on the broader indexing model, synthesised across these sources, is as follows:

The 2024–2026 environment is not "Google crawls less because of AI" — Cloudflare's data shows the opposite. What changed is the quality bar to stay indexed. Discovery and crawling are not the bottleneck they once were, but the indexing quality threshold and post-indexing de-indexing risk have intensified. Getting indexed fast is easier than staying indexed and ranking.

Source: Synthesis of Cloudflare 2025, IndexCheckr 2025, Google 2024 statement, and Indexing Insight 2025. Confidence: Industry-consensus on direction.

Cannot-force-indexing and the submission-tool category

A persistent vendor category sells "fast indexing," "instant indexing," or "guaranteed indexation" services. The category has been independently tested and the results converge on a "does not work" verdict:

IndexCheckr ran its own test on 33,930 previously-unindexed pages submitted to indexing tools: 29.37% became indexed; 70.63% stayed unindexed. Tool-driven indexing via temporary backlinks often reverses when the links are removed. Independent technical SEOs call general-purpose "fast indexing" / "guaranteed indexation" tools a scam.

Source: IndexCheckr, 2025 test. Confidence: Single-source.

Caveat: Self-selected sample (pages that were already failing to index — the hardest cases). Product-incentivised, but the finding argues against the category IndexCheckr itself sits in, which mitigates the bias flag.

The Indexing API path fails on three independent grounds:

Three layered facts together kill the "use the Indexing API for general content" pitch: (1) scope restriction — Google's Indexing API is officially restricted to JobPosting and BroadcastEvent content; using it for general content violates policy and the API now requires approval; (2) the schema-cheat backfires — putting JobPosting schema on non-job pages to qualify for the Indexing API is a documented manual-action catalyst; (3) even legitimate use does not guarantee indexing — Martin Splitt: "Pushing it to the API doesn't mean indexed right away or indexed at all."

Source: Google Search Central documentation; Martin Splitt and John Mueller, on-record statements. Confidence: Verified.

The consolidated rule is that Google's inclusion mechanism is not a button that can be pushed harder:

Google cannot reliably be forced to index a page. Discovery can be aided (sitemap, internal links, URL Inspection "Request indexing," external links) and blockers removed (robots.txt, response codes, soft 404s), but inclusion or ranking cannot be guaranteed. "Search is never guaranteed" — Mueller. The Indexing API is restricted to JobPosting and BroadcastEvent. IndexNow is not supported by Google. "Instant indexing" services that pitch otherwise are operating outside documented use cases.

Source: Google Search Central documentation; John Mueller, on-record. Confidence: Verified.

"Request Indexing" in URL Inspection is "a hint, not a command" (Google's own framing); duplicate requests on the same URL are ignored within a crawl cycle and re-clicking cannot change the quality decision that produced "Crawled – currently not indexed." The contested claim that paid tooling can "buy speed" therefore fails on every available channel.

Canonicalisation

When Google encounters multiple URLs serving substantively the same content (URL parameters, http/https variants, www/non-www, trailing slash, language alternates, syndicated copies, near-duplicate landing pages), it consolidates them into a cluster and selects one URL as the canonical representative for indexing and ranking. The owner can express a preference via the rel=canonical link element, HTTP Link headers, or sitemap inclusion. None of these are directives:

Per Google's documentation, canonicalisation (via rel=canonical, redirects, sitemap inclusion) and duplicate-content hygiene are processing-efficiency tools. They consolidate signals and avoid wasted crawling. They are NOT ranking boosts. A vendor pitching a "canonical-tag audit" as a ranking lever is misrepresenting the tool. The right pitch is that it prevents Google from splitting signals across duplicates and lets the strongest version of a page accumulate the credit.

Source: Google Search Central, canonical and duplicate-content documentation. Confidence: Verified.

Caveat: Canonical signals are hints, not directives — Google overrides bad rel=canonical declarations. The canonical-clustering process and the approximately forty signals Google is documented as using for canonical selection are covered in the consolidated Google search lifecycle brief.

The practical implication is asymmetric: clean canonicalisation prevents efficiency losses but does not produce ranking lift; broken canonicalisation can split signals across duplicates and cost ranking that a single consolidated cluster would have accrued.
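For a site-side audit, duplicate URL variants can be grouped before deciding where rel=canonical should point. The sketch below is an owner-side heuristic only and makes no claim about how Google clusters URLs; the function names and the TRACKING_PARAMS set are hypothetical and would need adjusting per site.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# ASSUMED list of parameters that do not change page content.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "gclid"}


def cluster_key(url):
    """Normalise scheme, www prefix, trailing slash, and tracking parameters."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/") or "/"
    kept = sorted((k, v) for k, v in parse_qsl(parts.query)
                  if k not in TRACKING_PARAMS)
    return urlunsplit(("https", host, path, urlencode(kept), ""))


def find_duplicate_clusters(urls):
    """Return only the clusters that contain more than one URL variant."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[cluster_key(url)].append(url)
    return {key: members for key, members in clusters.items() if len(members) > 1}


if __name__ == "__main__":
    crawl = [
        "http://www.example.com/shoes/?utm_source=newsletter",
        "https://example.com/shoes",
        "https://example.com/about",
    ]
    print(find_duplicate_clusters(crawl))
```

Each multi-member cluster is a candidate for a single declared canonical plus redirects, which is the signal-consolidation benefit described above rather than a ranking boost.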

Mobile-first indexing

Mobile-first indexing — the long-running migration to using the smartphone Googlebot's view of a page as the primary indexable representation — was declared complete by Google in October 2023:

On October 31, 2023, John Mueller announced on the Google Search Central Blog ("Mobile-first indexing has landed"): "It's been a long road, getting from there to here. We're delighted to announce that the trek to Mobile First Indexing is now complete." Google turned off the indexing-crawler indicator in Search Console at the same time. The smartphone Googlebot is the primary crawler for essentially all sites.

Source: Google Search Central Blog, John Mueller, October 31, 2023. Confidence: Verified.

Caveat: "Mobile-first" does not mean "mobile-only" — desktop crawlers still exist — but anything not present in the mobile version is effectively at risk of not being seen.

The consequence for site construction is that any content, link, structured-data block, or metadata element present only on the desktop view is treated as effectively absent:

Per Google's mobile-first documentation, content, structured data, and links that exist only on the desktop version may never be seen. The documentation explicitly warns that limiting links on the mobile version "can slow down discovery of new pages."

Source: Google Search Central, mobile-first-indexing documentation. Confidence: Verified.

Caveat: Most modern responsive frameworks produce parity by default. Where this typically breaks: WordPress sites with separate mobile plugins or themes, hand-crafted mobile menus that drop secondary pages, and legacy AMP-only patterns.

Google's stated remedy for the parity-gap problem is responsive design — a single document and URL whose presentation adapts to viewport via CSS, rather than separate mobile and desktop versions:

Google recommends responsive design (same HTML and URL across devices) precisely because it eliminates mobile/desktop parity gaps by construction. There is one document; CSS changes the presentation per device width; Google sees the same content, links, structured data, and metadata regardless of viewport. The classic failure modes are a separate or aggressively-trimmed mobile template (m.example.com or dynamic-serving subset), a redesign that quietly drops content from mobile, and a mobile menu that omits secondary pages from the link graph.

Source: Google Search Central, mobile-first-indexing documentation. Confidence: Verified.

The corresponding operational rule is that the mobile version is the indexed version. Any "mobile decluttering" pattern that drops sections, links, structured data, or metadata is an indexing risk, not merely a UX choice. The fix is to verify mobile rendered HTML in URL Inspection against the desktop view before launch and after any redesign.
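A parity check on links can be sketched with Python's standard HTML parser. It assumes the rendered desktop and mobile HTML have already been saved as strings (for example, copied from URL Inspection); the class and function names are hypothetical, and the same pattern could be extended to structured data or metadata.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.add(href)


def links_in(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs


def missing_on_mobile(desktop_html, mobile_html):
    """Links present in the desktop HTML but absent from the mobile HTML."""
    return sorted(links_in(desktop_html) - links_in(mobile_html))


if __name__ == "__main__":
    desktop = '<nav><a href="/pricing">Pricing</a><a href="/case-studies">Cases</a></nav>'
    mobile = '<nav><a href="/pricing">Pricing</a></nav>'
    print(missing_on_mobile(desktop, mobile))
```

Any URL this reports is reachable only from the desktop view and is therefore at risk of dropping out of the link graph Google actually sees.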

Practical indicators in Search Console

Google Search Console exposes the pipeline through a small set of instruments. Practitioner usage centres on four:

The Google Search Console instruments practitioners actually use:

  • URL Inspection tool / API: ground truth for whether a specific URL is indexed. API allows up to 2,000 queries per day per property. "Request Indexing" is a hint, not a command; duplicate requests are ignored within a crawl cycle.

  • Page Indexing (Coverage) report: distinguishes "Discovered – currently not indexed" (queue / crawl-priority issue) from "Crawled – currently not indexed" (page-level quality/relevance issue). These are different triggers and need different fixes.

  • Performance report lag: normally 2–6 hours. The report is not real-time and is not a timing instrument for indexing.

  • Crawl Stats report (Settings → Crawl stats): host status, response-code, and file-type breakdowns.

Indexing Insight has argued that GSC misreports indexing states for pages being actively forgotten — i.e., the report can show a page as indexed when it has effectively been dropped from active serving. Treat GSC as the best available instrument, not as ground truth.

Source: Google Search Console documentation; Indexing Insight commentary. Confidence: Verified on instrument descriptions; Single-source on the "GSC misreports forgotten pages" claim.

Caveat: The 2,000 queries/day/property limit applies to the URL Inspection API; the Search Console UI is not rate-capped in the same way.
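Because of the daily cap, a full-site check through the URL Inspection API has to be spread across days once the URL list exceeds the quota. The following is a minimal scheduling sketch; the function name is hypothetical, and the actual API calls and authentication are intentionally left out.

```python
def plan_inspection_batches(urls, daily_quota=2000):
    """Split a URL list into per-day batches that respect the API quota."""
    if daily_quota <= 0:
        raise ValueError("daily_quota must be positive")
    return [urls[i:i + daily_quota] for i in range(0, len(urls), daily_quota)]


if __name__ == "__main__":
    sitemap_urls = [f"https://example.com/page-{n}" for n in range(4500)]
    batches = plan_inspection_batches(sitemap_urls)
    for day, batch in enumerate(batches, start=1):
        print(f"day {day}: {len(batch)} URLs")
```

For sites in the few-hundred-URL range the whole list fits in one day; ordering the list so that priority pages come first keeps the most useful results in the earliest batch on larger sites.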

The single most actionable downstream rule is to monitor index coverage as a leading indicator and to investigate quality before technicals when the ratio drops below a soft threshold:

Monitor index coverage of intentional pages (sitemap-submitted) as a leading indicator. Threshold to act: if coverage drops below ~85–90%, investigate quality and duplication first, technical SEO second. Indexing Insight's 1.7M-page study found 88% of not-indexed pages were quality-driven. The default debugging order in the wild is the wrong way around: people check robots.txt and sitemap status before they audit content quality. In that study, quality was the more common culprit by roughly seven to one (88% versus 12%).

Source: Practitioner consensus, anchored in Indexing Insight (Adam Gent), March 2025. Confidence: Industry-consensus (rule); Single-source (the 88% figure).
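The coverage rule reduces to a single ratio and a threshold. The sketch below assumes the submitted and indexed counts are read off the Page Indexing report for the sitemap; the function name is hypothetical, and the default of 0.85 is the lower end of the soft ~85–90% band above, not a Google-defined limit.

```python
def coverage_check(submitted, indexed, act_below=0.85):
    """Return (coverage ratio, whether to start a quality-first investigation)."""
    if submitted <= 0:
        raise ValueError("submitted must be positive")
    coverage = indexed / submitted
    return coverage, coverage < act_below


if __name__ == "__main__":
    coverage, investigate = coverage_check(submitted=200, indexed=160)
    print(f"coverage {coverage:.0%}, investigate quality first: {investigate}")
```

A stricter site can pass act_below=0.90; either way the follow-up is the same ordering: content quality and duplication first, technical checks second.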

A complementary rule treats the published averages as a distribution and plans against the failure tail rather than the median:

Plan SEO timelines using a distribution, not an average. Assume a ~15–20% chance any given valuable page is never indexed. Assume reaching the top 10 within the first year is a <10% outcome per page. Treat any faster result as upside, not the plan. The published averages (Ahrefs 1.74%, Semrush 4.2% holding top-10 a year) are distribution figures with fat failure tails. Treating the median as the expectation is what generates "we've been live 6 months, why aren't we ranking" reactions.

Source: Synthesis of John Mueller (2021), Onely (2022), Ahrefs (2025), and Semrush (2022). Confidence: Industry-consensus.
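Planning against the distribution can be made concrete with two small calculations. Both are sketches under loud assumptions: the per-page probabilities are the planning figures quoted above, and the second function treats pages as statistically independent, which real pages on one site are not. The function names are hypothetical.

```python
def expected_never_indexed(pages, p_never=0.15):
    """Expected number of pages that never get indexed (planning figure)."""
    return pages * p_never


def chance_any_top10(pages, p_top10=0.0174):
    """Chance that at least one page reaches the top 10 within a year.

    ASSUMES independence between pages, which is a simplification.
    p_top10 defaults to the Ahrefs 1.74% figure quoted above.
    """
    return 1 - (1 - p_top10) ** pages


if __name__ == "__main__":
    n = 50
    print(f"pages planned:            {n}")
    print(f"expected never indexed:   {expected_never_indexed(n):.1f}")
    print(f"chance of any top-10 hit: {chance_any_top10(n):.0%}")
```

The point of the exercise is the framing: a launch plan should budget for some pages never indexing and treat an early top-10 result as upside rather than as the baseline.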

The downstream ranking-side treatment is at SEO J-curve and new-site ramp. The contested proposition that any single average ("3–6 months to rank") is a defensible planning number fails on methodological grounds: averages hide a distribution where 94–98% of new pages do not reach the top 10 in a year, and Ahrefs' "average #1 page is five years old" is an age-of-incumbents statistic, not a promise about how long a new page takes to rank.

Operational rules for the discovery and indexing stages

Three practitioner rules consolidate the indicators above into pre-launch and early-weeks checklists:

Pre-launch discovery gates. Verify in Search Console; submit an XML sitemap with accurate <lastmod>; request indexing of the homepage. Confirm robots.txt returns 200 and isn't blocking; confirm DNS/server reachability and that the server returns clean 200/404/410 codes (no soft 404s). Ensure mobile and desktop content, links, and structured data are equivalent. The single most overlooked launch failure is a robots.txt 5xx — Google postpones the whole crawl rather than guessing.

Source: Practitioner consensus, anchored in Google Search Central documentation. Confidence: Industry-consensus.

Pre-launch technical gates. Confirm pages return 200; confirm robots.txt does not block the site or its JS/CSS; scan for a stray noindex in meta, X-Robots-Tag, and the CMS toggle (most common launch-killer); ensure important navigation uses real <a href> links, not onclick/buttons; choose SSR or SSG for any content that must be indexed; use responsive design for automatic mobile parity.

Source: Practitioner consensus, anchored in Google's mobile-first and rendering documentation. Confidence: Industry-consensus.
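The stray-noindex scan in the checklist above can be sketched as follows, covering the meta tag and the X-Robots-Tag header; the CMS toggle still has to be checked in the CMS itself. The names are hypothetical, and the match is a simple substring test, so it is a first-pass check rather than a full directive parser.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the content of <meta name="robots"> and <meta name="googlebot">."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        if name in ("robots", "googlebot"):
            self.directives.append((attr.get("content") or "").lower())


def has_noindex(html, headers=None):
    """True if the page carries noindex in a robots meta tag or X-Robots-Tag."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    if any("noindex" in value for value in parser.directives):
        return True
    for key, value in (headers or {}).items():
        if key.lower() == "x-robots-tag" and "noindex" in value.lower():
            return True
    return False


if __name__ == "__main__":
    staging_page = '<head><meta name="robots" content="noindex, nofollow"></head>'
    print(has_noindex(staging_page))
```

Running this over every template type before launch catches the common case where a staging-environment noindex is carried into production.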

Early weeks (1–8): discovery and mobile parity. Build prominent internal links from the homepage to priority pages; earn a few genuine external links for discovery and crawl demand; prefer SSR or pre-rendering if the site is JS-heavy; verify with URL Inspection's live test that rendered content matches intent; monitor the Page Indexing report. A small, declining "Discovered/Crawled – currently not indexed" count is normal; a large or rising count signals a site-wide quality / architecture problem, not a per-page fix.

Source: Practitioner consensus, anchored in Google Search Central documentation. Confidence: Industry-consensus.
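
The "small and declining is normal, large or rising is not" heuristic can be made explicit. The sketch below assumes weekly "not indexed" counts have been copied by hand from the Page Indexing report. The function name assess_not_indexed and the 10% max_share threshold are illustrative assumptions, not Google-published figures, and should be tuned per site.

```python
def assess_not_indexed(weekly_counts: list[int], total_urls: int,
                       max_share: float = 0.10) -> str:
    """Classify a series of weekly 'not indexed' counts.

    max_share is a hypothetical threshold for what counts as 'large'.
    """
    if not weekly_counts or total_urls <= 0:
        raise ValueError("need at least one count and a positive URL total")

    latest = weekly_counts[-1]
    share = latest / total_urls
    # 'Rising' here means the latest count exceeds the first observed count.
    rising = len(weekly_counts) >= 2 and latest > weekly_counts[0]

    if share > max_share or rising:
        return "escalate"  # treat as a site-wide quality / architecture issue
    return "normal"


if __name__ == "__main__":
    print(assess_not_indexed([12, 9, 5], total_urls=200))
    print(assess_not_indexed([5, 20, 60], total_urls=200))
```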

The corresponding do-not rules are equally well-attested. A large "Discovered – currently not indexed" or "Crawled – currently not indexed" count is a site-wide quality signal — escalate to a content-quality and internal-linking audit looking for thin content, near-duplicates, parameter sprawl, soft 404s, and orphans, rather than re-requesting indexing on individual pages. "Fast indexing" or "guaranteed ranking" vendors should be avoided: independent testing puts their indexed-share at 29.37%, the Indexing API is policy-restricted, and the JobPosting-schema workaround triggers manual structured-data spam actions.
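
The orphan check in that audit can be sketched as a reachability test. This is a minimal example, assuming an internal-link map has already been produced by a crawl of the site. The names find_orphans, links, and sitemap_urls are hypothetical, and the URLs are placeholders.

```python
from collections import deque


def find_orphans(links: dict[str, set[str]], home: str,
                 sitemap_urls: set[str]) -> set[str]:
    """Return sitemap URLs that cannot be reached by following links from home."""
    seen = {home}
    queue = deque([home])
    # Breadth-first walk over internal links, starting at the homepage.
    while queue:
        page = queue.popleft()
        for target in links.get(page, set()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return sitemap_urls - seen


if __name__ == "__main__":
    links = {"/": {"/a"}, "/a": {"/b"}, "/c": {"/a"}}
    print(find_orphans(links, "/", {"/", "/a", "/b", "/c"}))
```

In this example "/c" links out but nothing links to it, so it is listed in the sitemap yet unreachable from the homepage, which is the pattern the audit is looking for.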

How the model differs in the AI-search era

The arrival of Google's AI surfaces (AI Overviews, AI Mode) has not introduced a separate indexing pipeline. Cloudflare's edge-level observation in July 2025 is that Googlebot serves both classical Search and the AI surfaces from the same crawl pipeline:

Per Cloudflare's July 2025 analysis, Googlebot now serves both classical search indexing AND Google's AI surfaces (AI Overviews, AI Mode) from the same crawl pipeline. There is no separate AI crawler or AI index on the Google side. Practical implication: optimising for "AI visibility" on Google is not a separate technical track — it is the same indexing pipeline. The differentiator at the AI-surface layer is not "did you let GPTBot crawl you" — it's the quality, structure, and extractability of content already in Google's index.

Source: Cloudflare, "From Googlebot to GPTBot," July 1, 2025. Confidence: Single-source (Cloudflare's interpretation of observed crawler behaviour).

Caveat: Cloudflare is observing crawl behaviour at the edge; Google has not, to public knowledge, published an explicit "single pipeline" architectural statement.
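
A small robots.txt experiment illustrates why the GPTBot decision is a separate question from Google eligibility. The sketch uses Python's standard-library parser as a stand-in for a crawler's rule matching (an assumption: it is not Google's implementation), and the robots.txt content and URL are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: GPTBot is blocked, everything else is allowed.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed_agents(robots_txt: str, url: str, agents: list[str]) -> dict[str, bool]:
    """Map each user-agent token to whether the rules let it fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    print(allowed_agents(ROBOTS_TXT, "https://example.com/guide",
                         ["Googlebot", "GPTBot"]))
```

Under these rules Googlebot is allowed and GPTBot is not. If Cloudflare's reading is right, the Googlebot result is the one that governs eligibility for both classical Search and Google's AI surfaces.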

The implication is that the indexing pipeline described in this page is also the AI-surface eligibility pipeline, and that post-2024 indexing-quality tightening propagates into AI-surface inclusion. The downstream AI-citation behaviour is treated at AI Overview citation patterns (GEO/AEO).

Adjacent to this picture, the post-update SERP volatility documented in late 2025 and early 2026 indicates that the recurring re-evaluation passes have become more aggressive. Search Engine Roundtable documented at least nine separate volatility waves in the seven weeks following the December 2025 core update; SE Ranking's 100,000-keyword analysis found that approximately 15% of pages previously in the top 10 disappeared from the top 100 entirely:

Normally Google SERPs settle 2–4 weeks after a core update completes; the December 2025 core update did not settle cleanly. Search Engine Roundtable documented at least 9 separate volatility waves in 7 weeks (December 2025–February 2026). In the current environment, "rankings will settle in N weeks" is not a safe planning assumption.

Source: Search Engine Roundtable, December 2025–February 2026 reporting. Confidence: Industry-consensus.

Caveat: "9 waves" is an observation in third-party rank-tracking tools; the precise definition of a "wave" varies by tracker.

SE Ranking analysed 100,000 keywords after the December 2025 core update and found that approximately 15% of pages previously in the top 10 vanished from the top 100 entirely — not pages that slipped to position 12, but pages that fell off the first ten pages of results altogether.

Source: SE Ranking, post-December 2025 core update analysis, 2026. Confidence: Single-source.

Caveat: Product-incentivised (SE Ranking sells a rank-tracking tool). The 100k-keyword sample is large but the keyword-selection method is not fully described.

The combined picture for the post-2024 environment is that discovery is not the bottleneck; the quality threshold for getting and staying indexed is materially higher than pre-2024 averages suggest; and the de-indexing turn means that the indexing pipeline must now be modelled as a continuous re-evaluation rather than a one-time decision. The downstream technical-foundation discipline for new sites is consolidated at Core Web Vitals (the tiebreaker-class signals stack) and at Editorial discipline and sourcing (the content-quality side that the post-March-2024 bar measures against).