Research brief: the lifecycle of a website in Google Search — from launch to mature standing and the perpetual re-evaluation that follows (June 2026)

Summary

TL;DR

Key reframings

New sites suffer from a signals vacuum, not a penalty box. Google's consistent on-record position is that new sites rank unpredictably because Google lacks signals for them, not because a timer is holding them back. As Mueller put it in the May 28, 2021 SEO office hours: "we don't have a lot of signals for that new content yet… we have to make assumptions." This is the single most important reframing in the whole lifecycle.

Crawl budget is a non-issue for the vast majority of sites. Google's own documentation says sites with fewer than a few thousand URLs, or whose pages are crawled the same day they're published, do not need to think about crawl budget at all. The thresholds for caring are roughly 1M+ pages updated regularly, or 10k+ pages updated daily.
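The thresholds above can be turned into a rough triage check. This is a minimal sketch, not Google's logic: the SiteProfile type, the function name, the three return labels, and the 5,000-URL stand-in for "a few thousand" are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Assumption: 5,000 stands in for Google's unquantified "a few thousand URLs".
SMALL_SITE_URLS = 5_000
LARGE_SITE_URLS = 1_000_000      # roughly 1M+ pages updated regularly
FAST_CHANGING_URLS = 10_000      # roughly 10k+ pages updated daily


@dataclass
class SiteProfile:
    """Hypothetical summary of a site, filled in by hand or from your CMS."""
    total_urls: int
    urls_changed_daily: int
    crawled_same_day: bool  # new pages are typically crawled the day they publish


def crawl_budget_triage(site: SiteProfile) -> str:
    """Return 'ignore', 'manage', or 'monitor' based on the published thresholds."""
    # Small sites, and sites already crawled same-day, need not think about it.
    if site.crawled_same_day or site.total_urls < SMALL_SITE_URLS:
        return "ignore"
    # Very large sites or very fast-changing sites are the documented audience.
    if site.total_urls >= LARGE_SITE_URLS or site.urls_changed_daily >= FAST_CHANGING_URLS:
        return "manage"
    # Everything in between is a gray zone: watch Crawl Stats, do not over-engineer.
    return "monitor"
```

For example, crawl_budget_triage(SiteProfile(800, 5, False)) returns "ignore", which is the case most new sites fall into.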

"Mobile-first" is complete and total (Mobile-first indexing declared COMPLETE on October 31, 2023 — Mueller, Google Search Central Blog: "the trek to Mobile First Indexing is now complete"). The smartphone Googlebot is the primary crawler for essentially all sites, and content not present in the mobile version may simply not be seen.

The 2024 Content Warehouse leak partially vindicated skeptics on one narrow point. The documents were public on GitHub from March 27 to May 7, 2024, were surfaced by Erfan Azimi, and were analyzed by Rand Fishkin and Mike King; Google confirmed their authenticity but cautioned against "out-of-context, outdated, or incomplete information." A hostAge attribute described as being used "to sandbox fresh spam in serving time" exists, but its documented purpose is spam containment, not a blanket suppression of all new sites. That is consistent with, rather than contradictory to, Google's public messaging.

The pipeline at a glance

Stage 0: discovery prerequisites. robots.txt reachable, sitemap submitted, internal links in place. Google's URL discovery paths, in rough order of reliability, are inbound links, XML sitemaps, Search Console, redirects, and hosting-platform auto-notify. In sitemaps, Google uses <loc> and <lastmod> (when accurate) and openly ignores <changefreq> and <priority> (Illyes: "a bag of noise"). A 5xx response for robots.txt postpones the whole crawl, because Google will not guess. DNS and server reachability are a hard gate on the entire pipeline.
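Since only <loc> and <lastmod> carry weight, a sitemap generator can simply omit the other two tags. The sketch below uses Python's standard xml.etree.ElementTree; the build_sitemap function and its (url, date) input shape are assumptions for illustration, not a prescribed format.

```python
import xml.etree.ElementTree as ET
from datetime import date

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build sitemap XML from an iterable of (url, last_modified_date) pairs.

    Only <loc> and <lastmod> are written; <changefreq> and <priority>
    are left out on purpose because Google ignores them.
    """
    # Serialize the sitemap namespace as the default xmlns, without a prefix.
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for url, last_modified in pages:
        entry = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(entry, f"{{{SITEMAP_NS}}}loc").text = url
        # date.isoformat() yields YYYY-MM-DD, a valid W3C date for <lastmod>.
        ET.SubElement(entry, f"{{{SITEMAP_NS}}}lastmod").text = last_modified.isoformat()
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    # example.com URLs and dates are placeholders.
    print(build_sitemap([
        ("https://example.com/", date(2026, 6, 1)),
        ("https://example.com/pricing", date(2026, 5, 20)),
    ]))
```

The dates passed in should be the real last-modified dates of the content, since an inaccurate <lastmod> is the one thing that makes this signal worthless.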

Stage 1: URL discovery and queue scheduling. Crawl queue prioritization is driven by perceived importance: internal link prominence, sitemap inclusion, and historical quality. Splitt: "there is a queue in between us discovering the URL and the URL actually being crawled." The Search Console state "Discovered – currently not indexed" can persist indefinitely. Mueller: "That can be forever. It's something where we just don't crawl and index all pages." A large "Discovered – currently not indexed" backlog is a site-wide quality signal, not a per-page problem: Google declines to spend crawl resources on URL patterns it predicts will be low-value.
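Because the backlog is a pattern-level signal, it helps to look at not-indexed share per URL section rather than page by page. The following is a rough sketch under stated assumptions: the input is a hand-built list of (url, status) pairs, for example copied from a Page Indexing export, and "section" is simply the first path segment. Neither is a Search Console API.

```python
from collections import Counter
from urllib.parse import urlparse

# Status labels as they appear in the Search Console Page Indexing report.
NOT_INDEXED_STATES = {
    "Discovered – currently not indexed",
    "Crawled – currently not indexed",
}


def not_indexed_share_by_section(pages):
    """Return {section: share of URLs stuck in a not-indexed state}.

    pages: iterable of (url, status) pairs (hypothetical input shape).
    A section is the first path segment, e.g. "/tag" for /tag/shoes.
    """
    totals, stuck = Counter(), Counter()
    for url, status in pages:
        path = urlparse(url).path.strip("/")
        section = "/" + path.split("/")[0] if path else "/"
        totals[section] += 1
        if status in NOT_INDEXED_STATES:
            stuck[section] += 1
    return {section: stuck[section] / totals[section] for section in totals}
```

A section whose share is high or rising (thin tag pages, faceted listings) is a candidate for pruning or consolidation, which is a site-level fix rather than a per-URL one.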

Stage 2: crawling. Mobile-first and response-code aware; crawl budget only matters at 1M+ pages or 10k+ daily updates. Mobile-first indexing was declared complete on October 31, 2023 (Mueller, Google Search Central Blog: "the trek to Mobile First Indexing is now complete"). Content, structured-data, and link parity between mobile and desktop is required; limiting links on the mobile version "can slow down discovery of new pages." Crawl budget is determined by two elements, crawl capacity limit and crawl demand: capacity is what the server can take without degrading, and demand is what Google wants to crawl (popularity, perceived quality, staleness). Google explicitly tells sub-few-thousand-URL sites they do not need to think about crawl budget; the thresholds for caring are roughly 1M+ pages updated regularly, or 10k+ pages updated daily. A brand-new domain has no history, so Google has little crawl demand to work with; it crawls conservatively and ramps up (or doesn't) based on what it finds. Illyes (May 2023 SEO Office Hours): indexing speed "depends on a bunch of things, but the most important one is the quality of the site, followed by its popularity on the internet." On response codes, 200 proceeds, 3xx is followed (up to a chain limit), 4xx (including 410) is dropped without wasting crawl budget, 5xx slows or pauses the crawl, and a soft 404 confuses everything. The Search Console Crawl Stats report (Settings → Crawl stats) is the primary observability surface for the crawl stage: total requests, average response time, host status, and response-code and file-type breakdowns.
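The response-code summary above can be written as a small lookup, which is handy when auditing server logs. This is an illustrative simplification of the summary in this brief, not Googlebot's actual decision logic; the function name and outcome labels are made up for the example, and it does not special-case individual codes.

```python
def crawl_outcome(status: int, looks_like_soft_404: bool = False) -> str:
    """Map an HTTP status to the coarse crawl outcome described above.

    looks_like_soft_404 is a caller-supplied flag (an assumption of this
    sketch): the page returned 200 but its body is really an error page.
    """
    if status == 200:
        # A 200 with error-page content sends mixed signals to the crawler.
        return "ambiguous" if looks_like_soft_404 else "proceed"
    if 300 <= status < 400:
        return "follow_redirect"   # followed, up to a chain limit
    if 400 <= status < 500:
        return "drop"              # includes 404 and 410
    if 500 <= status < 600:
        return "slow_or_pause"     # repeated 5xx reduces crawl rate
    return "not_covered"           # outside the summary in this brief


if __name__ == "__main__":
    for code in (200, 301, 404, 410, 503):
        print(code, crawl_outcome(code))
```

Running an access log's Googlebot requests through a function like this gives a quick tally to compare against the response-code breakdown in the Crawl Stats report.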

Stage 3: rendering. Google's Web Rendering Service (WRS) has run on evergreen Chromium since 2019, so modern JS (ES6+, Web Components, etc.) is supported; it was previously frozen at Chrome 41. Google queues all 200-status pages for rendering, JS or not. Splitt: "you don't really see how long it takes us to render, if we render at all, when we render." Splitt himself now calls the "two waves of indexing" model an oversimplification: "pretty much every website, when we see them for the first time, goes to rendering" and the waves "play less and less of a role." JS-heavy sites remain the highest-risk category at the render stage, and Google continues to recommend SSR or pre-rendering as the robust path. On March 4, 2026, Google removed the older "design for accessibility / function without JavaScript" recommendation from its JS-SEO docs as "outdated."

Stage 4: indexing and index selection. Indexing is not guaranteed. Google: "Indexing isn't guaranteed; not every page that Google processes will be indexed." Mueller: "most of the time when we still crawl something, it doesn't necessarily mean that we will automatically index it." Canonicalization clusters similar pages and selects the single most "representative" URL as canonical; the signals are hints, not directives (redirects strong, rel=canonical strong, sitemap weak, HTTPS preferred, hreflang clustering). Canonical selection uses about 40 signals; Allan Scott (Google "Dups" team) on Search Off the Record: "somewhere in the neighborhood of 40." The Search Console state "Crawled – currently not indexed" is a deliberate quality decision, not a queue state: Google fetched and evaluated the page and chose not to index it. On realistic timing, Google's own stated range is "several hours to several weeks," and Mueller suspects "most good content is picked up and indexed within about a week." Onely (Tomek Rudzki) reports that "on average, 83% of pages are indexed within the first week of publication; some pages have to wait up to eight weeks," and that ~16% of valuable, indexable pages on popular sites never get indexed, so the averages should be read with survivorship bias in mind.

Stage 5: initial ranking and stabilization (the myth-laden stage). Eight myth-busting entries apply here. (1) New-site ranking instability comes from missing signals; Mueller (May 28, 2021 SEO office hours): "we don't have a lot of signals for that new content yet… we have to make assumptions." (2) Mueller (May 2021) explicitly rejects both the "sandbox" and "honeymoon" framings: "not the case that we're explicitly trying to promote new content or demote new content. It's just, we don't know and we have to make assumptions." (3) Google has denied the "sandbox" for about 20 years: Matt Cutts (2005), Gary Illyes (2016), and John Mueller (August 19, 2019 tweet: "There is no sandbox"). (4) The 2024 Google Content Warehouse leak documents were public on GitHub from March 27 to May 7, 2024, surfaced by Erfan Azimi, and analyzed by Rand Fishkin and Mike King; Google confirmed authenticity but cautioned against "out-of-context, outdated, or incomplete information." (5) The leaked hostAge attribute is documented as being used "to sandbox fresh spam in serving time"; this is a spam-containment filter that new low-trust sites can trip, not a blanket probation on all new sites. (6) Domain age is not a ranking factor. Mueller: "No, domain age helps nothing"; asked who pushes the idea: "Primarily those who want to sell you aged domains :-)". (7) Trust accrues slowly on a new site; Mueller says a site-wide quality assessment "can easily take… a couple of months, a half a year, sometimes even longer than a half a year, for us to recognize significant changes in the site's overall quality." (8) There is no Google-confirmed numeric "time to rank" figure: Google gives ranges and refuses ranking timelines, and vendor "X months to rank" numbers are marketing.

Stage 6: ongoing re-crawl and maintenance. Re-crawl cadence is driven by page importance, change frequency, and freshness demand; <lastmod> is used "if it's consistently and verifiably accurate." Lying with <lastmod> erodes Google's trust in your sitemap: "eventually we're not going to believe you anymore" (Google, 2023). Gary Illyes' stated mission for 2024 was to "crawl even less" and schedule more intelligently, focusing on "URLs that more likely to deserve crawling." Mature sites get faster indexing as a trailing indicator of accrued trust: a new post on an authoritative site can be indexed in minutes to hours, while the same post on a new domain waits days to weeks. Broad core updates are not penalties. Sullivan: "this doesn't mean all sites will go back up to wherever they were if they are down from a previous peak." Recovery requires substantive improvement plus waiting for the next update. 2026 cadence: March 27–April 8 and May 21–June 2.

Staged playbook (rules)

Stage A: pre-launch / launch week. Rule: remove the gates. Verify GSC, submit an honest sitemap, request homepage indexing, confirm robots.txt, DNS, and 200/404/410 hygiene, and ensure mobile/desktop parity.
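One launch-week gate that is easy to automate is checking that robots.txt does not block the pages you most want crawled. The sketch below uses Python's standard urllib.robotparser on robots.txt text you supply. Note the assumptions: the function name is hypothetical, and Python's parser is not Google's parser (wildcard handling in particular may differ), so treat the result as a smoke test rather than a verdict.

```python
from urllib.robotparser import RobotFileParser


def blocked_priority_urls(robots_txt: str, urls, user_agent: str = "Googlebot"):
    """Return the subset of urls that the given robots.txt text disallows.

    robots_txt is the file's content as a string, so the check runs
    offline, for example against a staging copy before launch.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    # Placeholder rules and URLs for illustration only.
    robots = "User-agent: *\nDisallow: /staging/\n"
    priority = [
        "https://example.com/",
        "https://example.com/pricing",
        "https://example.com/staging/new-home",
    ]
    print(blocked_priority_urls(robots, priority))
```

A leftover Disallow rule from a staging environment is a common way for a launch to stall silently, which is why this check belongs before the sitemap is submitted.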

Stage B: weeks 1–8. Rule: earn discovery and indexing. Build internal links from the homepage to priority pages, earn a few genuine external links, prefer SSR/pre-rendering if JS-heavy, and monitor GSC Page Indexing for trend.

Stage C: months 2–12. Rule: let trust accrue, don't panic. Keep <lastmod> honest, and read ranking volatility as Google "making assumptions," not as a penalty. Rule: do not buy aged domains, "instant indexing" services, or "sandbox escape" packages; the causes they address are not real. Rule: if important pages remain unindexed after ~4 weeks despite good content and clean technicals, escalate to a content-quality and internal-linking audit; do NOT just keep clicking "request indexing".
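The four-week rule is simple enough to run as a recurring check. This is a minimal sketch, assuming you keep your own list of priority pages with publish dates and indexed status (for instance, recorded by hand from URL Inspection). The tuple shape and function name are hypothetical, and the "good content and clean technicals" precondition still has to be judged by a person.

```python
from datetime import date, timedelta

# The brief's threshold is approximate ("~4 weeks"); adjust to taste.
ESCALATION_AGE = timedelta(weeks=4)


def pages_to_escalate(pages, today: date):
    """Return URLs that are still unindexed at least four weeks after publishing.

    pages: iterable of (url, published_on, is_indexed) tuples.
    """
    return [
        url
        for url, published_on, is_indexed in pages
        if not is_indexed and today - published_on >= ESCALATION_AGE
    ]


if __name__ == "__main__":
    # Placeholder data for illustration only.
    tracked = [
        ("https://example.com/guide", date(2026, 4, 20), False),
        ("https://example.com/news", date(2026, 5, 28), False),
        ("https://example.com/about", date(2026, 4, 1), True),
    ]
    print(pages_to_escalate(tracked, today=date(2026, 6, 1)))
```

Anything this returns goes to the content-quality and internal-linking audit, not back to the "request indexing" button.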

Cross-cutting hygiene. Rule: update sitemap <lastmod> only on substantive content changes; churning the date erodes Google's trust in the signal. Rule: a large or rising "Discovered/Crawled – currently not indexed" count is a site-wide quality signal; fix the site, not the page.
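One way to keep <lastmod> honest is to tie it to a fingerprint of the page's main content, so the date moves only when the content does. The sketch below is one possible approach, not a Google recommendation. Its function names are hypothetical, and it treats any change to the whitespace-normalized text as "substantive", which is cruder than editorial judgment. Pass in only the main body text, without footers, timestamps, or other boilerplate.

```python
import hashlib
from datetime import date


def content_fingerprint(body_text: str) -> str:
    """Hash the main content after collapsing whitespace differences."""
    normalized = " ".join(body_text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def next_lastmod(previous_fingerprint: str, previous_lastmod: date,
                 body_text: str, today: date):
    """Return (fingerprint, lastmod), bumping lastmod only if content changed."""
    fingerprint = content_fingerprint(body_text)
    if fingerprint == previous_fingerprint:
        # Nothing substantive changed: keep the old date.
        return previous_fingerprint, previous_lastmod
    return fingerprint, today
```

Store the returned fingerprint alongside the page so the next sitemap build can compare against it; a rebuild or redeploy with unchanged text then leaves every date untouched.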

Genuine unknowns

What Google won't disclose: exact crawl-queue priority math, render-queue position, signal weightings (the 2024 leak listed attributes without weights), precise re-rendering triggers, and whether or when a given page will ever rank.

Source: compass_artifact research document, June 2026. Anchored in Google's documentation, Search Central Blog, and on-record statements from John Mueller, Gary Illyes, Martin Splitt, Danny Sullivan, and Allan Scott.