How Google Crawls and Indexes Your Website

Published: 2016-02-18

How Google Crawls Your Website's AMP Pages, JavaScript, and AJAX

Last updated: March 6, 2021

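A major step forward in how Google crawls Ajax content! Google is the best in the world at crawling content that uses JavaScript and AJAX, but it is still refining the process. Does getting your website crawled feel like a battlefield?

Google has changed how it handles AJAX calls to web content. Google's John Mueller says the handling can be "on and off", so search professionals can use additional SEO tactics to tell Google their crawl intent for content. Going forward, it is best not to rely on past methods. An agile marketing process will help you respond to changes faster.

You can take advantage of the insights found in Google Search Console, and the answers and explanations in this article will help you read and interpret them. Your Core Web Vitals report reveals PageSpeed issues.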

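What is Google Crawling and Indexing?

Crawling and indexing remain distinct tasks. Crawling means that Googlebot looks at all the content and code on a page and analyzes it. Indexing means that the page is eligible to be included and shown in Google's search results. Since the Google Panda update, the importance of the domain has risen significantly. Your business growth online depends on your web pages being crawled and indexed correctly.

Some businesses do all the work of creating great content and optimizing their site, yet fail to get key content indexed. That is why we recommend that your business planning sessions and strategy account for this up front.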

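What is GoogleBot?

Googlebot is the search bot software that Google uses to collect documents from the web and build a searchable index for the Google search engine.

Whether your focus is paid search or another search approach, a correct understanding of GoogleBot helps SEOs improve their search strategy.

GoogleBot is the part of Google's search engine that crawls your web pages and creates an index. It is also known as a spider. GoogleBot uses machine learning to crawl every page you allow it to access and adds each one to Google's index, where it can be retrieved and returned to match users' search queries. Making it clear to Google which pages of your site you want crawled, and which you do not, can also feel like a battle.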

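How Do I Know if My Site is in Google's Index?

The Google Index report in your Google Search Console tests the URLs in your property. Its URL Inspection tool shows a URL's current index status. You need to enter the full URL to inspect it and get an index status report.

You can use the Fetch as Google tool to see how your site looks when Google crawls it. From there, site owners can take advantage of more granular options and choose how content is indexed on a page-by-page basis. One example is seeing how your page appears with or without a snippet in the cached version, which is an alternate copy held on Google's servers in case the live page cannot currently be viewed.

Another way to check the index status of any site URL is the info: operator. Open the Google Chrome browser and type info:URL into the navigation bar. This triggers Google to display options such as: Show Google's cache of "example-domain-url". Find web pages that are similar to "example-domain-url".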

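How does Google Indexing Work on JavaScript Sites?

When Google needs to crawl a JavaScript site, it requires an extra stage that traditional HTML content does not need. This is called the rendering stage, and it takes additional time. The indexing stage and the rendering stage are separate, which lets Google index the non-JavaScript content first.

JavaScript is more time-consuming for Google to crawl and index because it must first be downloaded, then parsed, and then executed.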

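How is JavaScript Pre-rendering Handled in Rendered Output?

GoogleBot can handle pre-rendered JavaScript that is used in the rendered output. The tech giant looks at pre-rendered JS from a user-experience perspective. This makes things easier because it removes the need to strip JS from pre-rendered pages. If your site relies on JS rather than AJAX requests to manage minor content and layout updates, keep following Google's spokesperson Martin Splitt, who handles advanced techniques for such crawling and indexing cases.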

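How Important is Google's Crawling and Indexing Process?

Getting your website crawled by Google and indexed correctly is key to online marketing success. It is the starting point for everything. Your site must be able to be crawled and indexed in order to succeed. Without a sitemap uploaded to your site's root folder, crawling can take a long time: it may take 24 hours or longer to index a new blog post or a deep website.

Most Internet surfers never realize the different steps you take to improve your website's crawlability and indexing.

According to Google, "The Google Search index contains hundreds of billions of webpages and is well over 100,000,000 gigabytes in size".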

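How Can I Manage Old Assets for the Best Crawl History?

Googlebot crawls various stale assets that currently return a 404 status. In general, old assets should be maintained until they stop being crawled. Eventually, Google will recrawl the HTML content, evaluate the new site assets, and update what its crawler requests. We do not recommend using 401 or 404 to manage old assets; this can lead to broken rendering and should be avoided. It sometimes happens when caching with the Rails Asset Pipeline.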

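For Google to crawl your website, you need to understand two important concepts:

1. If you want your website crawled and indexed, search engine spiders need to be able to view your site correctly.

2. There is a lot you can do to make sure your site is crawled correctly by Google's spiders.

A web crawler, sometimes called a spider, spiderbot, or web spider and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web, typically for the purpose of web indexing. If you add all the necessary schema markup types, you can help Google better understand your content and file it in Google's library.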

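New Insights on Syndicated Content and How Google Crawls AJAX

Google can now index ajax calls, and it is important to understand what this means in Google search results.

When John Mueller was asked last Friday, during the English Google Webmaster Central office-hours hangout, how syndicated content and ajax calls are handled, he answered: "In the past, we have essentially ignored that. What you can do is use JS." What I find interesting is what has changed and what we can expect. You can indicate that you want a page rendered for indexing while keeping a dynamic title. If you want to exclude part of a page, you can use your robots.txt file to express that wish. Keep the content of your Accelerated Mobile Page as close as possible to your desktop version.

For example, take a product detail page containing descriptions, reviews, buttons, and questions and answers (Q&A). The syndicated-call part of the page can be hidden; he suggested moving that content to a separate directory within the site where you aggregate it, so that your robots.txt file can block the aggregated content. This avoids the appearance that you are auto-generating content that does not really exist on your website.

"What we try to do is render the page like it would look in a browser, take that final result, and use those results in search," Mueller added. If you simply want part of a page left out of consideration, that text can be handled by webmasters. Interestingly, even Mueller hinted that it can be tricky to indicate which AJAX content on a given page you do not want parsed while the rest of the page is parsed for AJAX content.

Google does not crawl all of your JavaScript all the time. "It is still on and off, but moving in the direction of more and more," Mueller said. It is clear that Google's developers are watching tests of how Google crawls AJAX and want titles injected by JavaScript to be indexed correctly and more consistently. This is not about "sneaking content to users that isn't indexed".

If you have specific content that you do or do not want crawled, it is still a good idea to maintain your site's robots.txt file manually.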

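An Overview of How Google Crawls Websites

1. The first thing to know is that your website is always being crawled. Google has stated: "Googlebot shouldn't access your site more than once every few seconds on average." In other words, as long as your site is correctly set up to be available to crawlers, it is crawled on an ongoing basis. Google's "crawl rate" refers to the speed of Googlebot's requests; it is not about how often your site is crawled. Generally, the better known a business is, which comes in part from greater freshness, relevant backlinks with authority, social shares, mentions, and so on, the more likely its site is to appear in search results. Consider how many crawls Googlebot performs; it is not always feasible or necessary for it to crawl every page of your site all the time.

2. Google's routine is to first visit a site's robots.txt file. From there it learns the site owner's rules about what Google is permitted to crawl on the site. Any web page marked as "disallowed" will not be crawled.

As with SEO work in general, keeping your robots.txt file current is important. It is not a one-and-done deal. Knowing how to use a robots.txt file for crawling is a skilled task. Your technical website audit should cover the coverage and syntax of your robots.txt and tell you how to fix any existing issues.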
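As a rough illustration of the robots.txt check described above, the sketch below uses Python's standard `urllib.robotparser` module to test whether a URL is allowed for a given user agent. The robots.txt content and the example.com URLs are hypothetical. Python's parser applies the first matching rule in file order, while Google documents a most-specific-match rule, so treat the result as an approximation rather than a statement of what Googlebot will do.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cart/
Disallow: /search
Allow: /
"""


def is_crawlable(robots_txt: str, url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the given robots.txt text allows user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network request is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://www.example.com/blog/new-post",
        "https://www.example.com/cart/checkout",
    ):
        print(url, "allowed" if is_crawlable(ROBOTS_TXT, url) else "blocked")
```

Running a check like this over your key landing pages during an audit is one way to catch an accidental Disallow before it costs you crawl coverage.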

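3. Google reads the sitemap.xml next. While search engines do not need a sitemap to discover the areas of a site to crawl and index, it still has a practical use. Because websites are built and optimized so differently, web crawlers may not automatically crawl every page or section. Some content benefits more from a professional, well-constructed sitemap: for example, dynamic content, lower-ranked pages, expansive content archives, and PDF files with little internal linking. Sitemaps also help GoogleBot quickly understand the metadata within categories such as news articles, video, images, PDFs, and mobile.

4. Search engines crawl sites with established trust factors more frequently. If your web pages have gained significant PageRank, we have seen Googlebot award a site what is called "crawl budget". The more trust and niche authority your business site has earned, the more crawl budget you can expect to benefit from.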

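Why a Site's Link Structure May Impact Crawl Rate and Domain Trust

Once you understand how Google's crawler works, new updates may reflect whether Google has lifted or applied certain search filters, whether a new patch is more responsive, or whether a domain's link structure has changed. Benchmark how your site performs in SERP ranking changes against your competition to see whether everyone got a spike in converting traffic at a particular time. This will help rule out an isolated incident.

Be ethical and earn domain trust. Rather than trying to keep a web server secret, follow Google's search best practices from the start. "As soon as someone follows a link from your 'secret' server to another web server, your 'secret' URL may appear in the referrer tag and can be stored and published by the other web server in its referrer log. Similarly, the web has many outdated and broken links," the search giant states. Whenever someone publishes an incorrect link to your site or fails to update links to reflect changes on your server, GoogleBot will try to download an incorrect link from your site.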

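How Marking Up Your Content Helps Google's Crawler

When SEO professionals correctly implement Google structured data to mark up web content, Google can better understand the context in which to show you in search. This means your web pages can be better distributed to Internet users on Google Search. It is achieved by marking up content properties and enabling schema actions where relevant. This makes the content eligible for inclusion in Google Now Cards, large displays in Answer Boxes, and featured rich snippets.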

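Steps to Mark Up Web Content Properties for GoogleBot

1. Pinpoint the best data type from the tables provided at schema.org.

Find the type that best fits your content, then use that type's markup reference guide to find the required and recommended properties. Markup for multiple content types may be added to a single HTML or AMP HTML content page to help with your next Google crawl. We find that users love news articles that include video content, which creates an excellent opportunity to add markup that helps your content page become eligible for the Top Stories news carousel or for rich results for videos.

2. Craft a section of markup that covers your key products and services.

With the required structured data properties in place, make your site as easy as possible to crawl for the visual presentation you want in the SERPs. SEOs now have an extensive data type reference to draw on, with many examples of customizable markup. Crawling and indexing can be improved by using the speakable schema.org property, which identifies sections within an article. It can surface answers in your informational pages.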
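To make the two markup steps above more concrete, here is a minimal sketch that builds a JSON-LD block for a product page. It assumes the schema.org `Product` and `Offer` types with the `name`, `description`, `price`, and `priceCurrency` properties; the helper name `product_jsonld` and the sample product are hypothetical, and a real page would normally need more of the properties listed in the reference guide for its type.

```python
import json


def product_jsonld(name: str, description: str, price: float, currency: str) -> str:
    """Return a <script> tag holding schema.org Product markup as JSON-LD."""
    data = {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": name,
        "description": description,
        "offers": {
            "@type": "Offer",
            # Price is emitted as a string with two decimals.
            "price": f"{price:.2f}",
            "priceCurrency": currency,
        },
    }
    body = json.dumps(data, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


if __name__ == "__main__":
    # Hypothetical product, for illustration only.
    print(product_jsonld("Trail Shoe", "Lightweight running shoe", 89.5, "USD"))
```

Generating the markup from the same data that renders the page helps keep the structured data consistent with the visible content.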

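"If you deliver a server-side rendered page to us and you have JavaScript on that page that removes all of the content or reloads all of the content in a way that can break, then that can break our indexing. So that's one thing I would make sure of: if you deliver a server-side rendered page and you still have JavaScript on there, make sure it is built in a way that when the JavaScript breaks, it doesn't remove the content; rather, it just hasn't been able to replace the content yet." – John Mueller of Google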

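What is Google Crawl Budget?

"The best way to think about it is that the number of pages that we crawl is roughly proportional to your PageRank. So if you have a lot of incoming links on your root page, we'll definitely crawl that. Then your root page may link to other pages, and those will get PageRank and we'll crawl those as well. As you get deeper and deeper in your site, however, PageRank tends to decline," said Eric Enge of Stone Temple.

Before you discuss crawl optimization with a prospective consultant, make sure they fully understand the essentials. Crawl budget is a term that some are unfamiliar with. It refers to the amount of time or the number of pages that Google allocates to crawling your site. If you fix key issues that hold back your site's performance, crawling may improve.

Google's Matt Cutts gave SEOs key guidance on the number of pages crawled. He said in 2010, "There isn't really such a thing as an indexation cap. A lot of people think that a domain will only get a certain number of pages indexed, and that's not really the way that it works. There is also not a hard limit on our crawl."

We find it helps to view the number of pages crawled as proportional to your PageRank and domain trust. He added, "So if you have a lot of incoming links on your root page, we'll definitely crawl that." Read more about John Mueller's comments on a site's backlink profile.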

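Sitemap Data in Index Coverage

Answers to lingering questions about how Google crawls websites.

With the new Google Search Console complete, many are asking which reports are still available for better understanding crawling and indexing.

"As we move forward with the new Search Console, we're turning off the old sitemaps report. The new sitemaps report has most of the functionality of the old report, and we aim to bring the rest of the information, specifically for images and video, to the new reports over time. Moreover, to track URLs submitted in sitemap files, within the Index Coverage report you can select and filter using your sitemap files. This makes it easier to focus on the URLs that you care about." – John Mueller, January 25, 2019

Given the rise of Google visual search, having your image and video files indexed correctly matters more today. Strong visual assets can drive sales. Correct indexing of product pages and images powers Google product carousels.

On July 9, 2016, John Mueller talked about how it causes a delay if Google must render a page and then sees a redirect. When asked "Is there any timeline for when a page is crawled?" he answered, "It's scientific."

Asked whether including structured data with information such as prices, or content for items that may be out of stock, increases the crawl rate so that the data stays accurate, the response was, "It's a complicated technical area." John Mueller added, "I think structured data can be given to us in different ways. Use a sitemap to let us know. Just because there is some pricing information doesn't mean the data will be updated quickly."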

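The Panda Algorithm is Continuous but Doesn't Run at Crawl Time

We know that crawling is the process by which Googlebot discovers new and updated pages to add to the Google index, and that it does so algorithmically: computer programs determine which sites to crawl and how often to crawl their pages. Because the quality assessment is made as Google reprocesses most of a site's content, a small site can usually be reprocessed within a few months. Publishing posts directly on your Google Business Listing helps get those URLs straight into Google's index.

The Panda algorithm does run continuously rather than on any predetermined schedule, but it does take some time, several months for some sites, to gather the relevant semantic signals for crawling. According to Google's John Mueller, crawl frequency varies from site to site.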

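Q. "If you have both a desktop version and an AMP version of your site as the mobile version, what is the best way to use dynamic serving?"

A. "I believe that if you use dynamic serving, we would crawl with the normal GoogleBot and would never see the AMP page. Depending on the parameters you use for the user agent, you would suddenly get an AMP page instead of an HTML page." Dynamically served AMP pages are also complicated if non-Google clients such as Twitter want to pull the AMP version of a page. John Mueller urges webmasters to avoid technical problems on their sites.

Q. How does GoogleBot handle disavowing a page after it has been targeted by negative SEO? A disavow file can be used to break the association with such backlinks. "Is it still important for you to crawl the backlinking page in order to update the disavow file?"

A. "We drop the links after we recrawl or reprocess the other page. If we don't bother to recrawl it, it would not carry a lot of weight anyway, even if it takes us six months to crawl that page again." A classifier determines when a site is ready to be recrawled and tries to evaluate general guidance for forming the mobile index. The search giant is trying to figure out what is truly relevant to mobile crawling.

Watch the full Google Webmaster Central Hangout for complete details. Google tries to focus on the more important URLs in a site. Submit a spam report if needed; Google then tries to identify negative SEO by others, meaning attempts to stop a site from being crawled successfully. For the most part, focus on what you can do to improve your site, build a positive search history, and make it stronger and better.

"We don't crawl URLs with the same frequency all the time. So some URLs we will crawl daily. Some URLs maybe weekly. Other URLs every couple of months, maybe even once every half year or so.
If you made significant changes across your whole website, then probably a lot of those changes are picked up fairly quickly, but there will be some leftover ones.
This is where a sitemap file with last modification dates comes in, so that Google can go off and try to double-check those a bit faster than otherwise." – John Mueller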
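The sitemap with last modification dates mentioned in the quote above can be sketched as follows. This example assumes the sitemaps.org 0.9 format with `<loc>` and `<lastmod>` elements and uses only Python's standard library; the function name `build_sitemap` and the example.com URLs are hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import date
from typing import Iterable, Tuple

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages: Iterable[Tuple[str, date]]) -> str:
    """Build a sitemap XML string from (url, last_modified) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, last_modified in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        # lastmod uses the W3C date format, for example 2021-03-06.
        ET.SubElement(url, "lastmod").text = last_modified.isoformat()
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical pages, for illustration only.
    print(build_sitemap([
        ("https://www.example.com/", date(2021, 3, 6)),
        ("https://www.example.com/blog/new-post", date(2021, 3, 1)),
    ]))
```

The lastmod value is only useful if it changes when the page content really changes, so it is best derived from your CMS's stored modification date rather than the time the sitemap was generated.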

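Existing Issues Google's Crawler May Still Face

1. Sites with complex URL structures, mostly URL parameter issues. Mixing things like session IDs into the path can result in many URLs being crawled. In practice, Google does not really get stuck. But it can waste a lot of resources that your site needs to use more wisely.

2. When GoogleBot finds the same path segments repeated over and over, it may slow its crawling.

3. Rendered content. If Google's crawler cannot pull the page content right away, it renders the page to see what comes up. If any elements on the page require you to click something or take an action before the content is visible, that content may be missed. GoogleBot does not click around to see what might appear. John Mueller said, "I don't think we try any of the clicking stuff. It's not like we scroll on and on."

What I learned from the conversation is that it helps to distinguish between content that is not loaded until the user takes an action and content that GoogleBot cannot see without scrolling down.

Rather than spending a lot of time configuring JavaScript to manage what appears on the page, look for what delivers the most correct and complete content to end users. Consider adjusting your site's pagination, JavaScript, and techniques that help users have a better experience. "For the third and final time, take a look at AMP," Andrey Lipattsev reiterated at the end of the event.

We strongly recommend that every website be fully prepared for the rise of Google mobile search. In addition, GoogleBot may time out while trying to fetch embedded content. For users, this makes accessibility more difficult.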

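The most important information you want crawled is:

Web URL – the web address of your pages, posts, and key documents.
Page title tag – the title tag indicates the name of the web page, blog post, or news article.
Metadata – this can contain many things, such as the page description, structured data markup, and prevalent keywords.

This is the primary information that GoogleBot retrieves when it crawls your site. It is also most likely what you see indexed. That is the basic concept. For a growing site, how your site may be crawled and how search results are returned, organized, and given the chance to show in rich snippets is far more complex. Note that Google takes notice of reviews posted on its platform; if for some reason it does not index your website, Google reviews may still show up in the local pack and elsewhere.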

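How Google Crawls New Domain Extensions

On July 7, 2015, Google announced how it plans to handle the ranking of new domain names such as .news, .social, .ninja, .doctor, .insurance, .shopping, and .video. In summary: they will rank exactly the same as .com and .net. In this creative digital environment, watching a live demonstration of how Google experiences your site during a crawl shows how advanced SEO can better serve everyday searches on the Internet with real, tangible content. With new domain extensions here and expanding, if you use them, make sure that your site can be crawled automatically and that content is delivered very naturally.

Google offered a glimpse of how its crawler will handle these upcoming domains in search results, hoping to head off possible misconceptions about how the newest domain extension options will be treated. Asked whether a .BRAND TLD will be given any more or less weight than a .com, Google answered: "No. Those TLDs will be treated the same as other gTLDs. They will require the same geotargeting settings and configuration, and they won't have more weight or influence in the way we crawl, index, or rank URLs."

For webmasters who may wonder how the new gTLDs affect search, we learned that Google will crawl new gTLDs like other gTLDs (such as .com, .net, and .org). By our reading of the post, keywords in a TLD do not affect a site by granting any particular advantage or disadvantage in SERP rankings.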

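How Often does GoogleBot Crawl a Website?

Newer sites and sites that are updated infrequently are crawled less often. On average, if everything works correctly, Googlebot may discover and crawl a new website in as little as four days. We have also seen it take four weeks. However, this is really an "it depends" answer. We have heard others claim same-day indexing. Google states that crawling and indexing are processes that can take some time and that rely on many factors.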

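How to Make AJAX Web Applications Crawlable?

For webmasters who choose to use an AJAX application with content intended to show up in search results, Google announced a new process that, when implemented, can help Google (and potentially other major search engines) crawl and index your content. In the past, AJAX web applications were challenging for search engines to process because AJAX content may need to be handled dynamically.

Most site owners have more pressing tasks than setting up restrictions for crawling, indexing, or serving web pages. It takes someone deep into SEO to specify which pages, or which sections of a page, are eligible to appear in search results. In most cases, if your web content is well optimized, your pages should be indexed without extra steps. For the more granular approach that large shopping carts often need, many options are available for indicating how a site owner allows Google to crawl and index the site. Most of this work is carried out through Google Search Console and a file called "robots.txt".

John Mueller invites webmasters to comment on crawling AJAX. As this develops further, Google has become more positive about GoogleBot's ability to parse JavaScript and Ajax, though exactly how well it does so is less clear. Before acting on too many opinions, it is best to watch how communication on the topic develops. For now, we recommend that you not entrust a lot of important site elements or web content to Ajax/JavaScript.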

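More Advanced Ways to Help Google Crawl Your Website

In your Google Search Console (formerly Google Webmaster Tools), URL parameters can be set. For a simple website this is usually not needed; even Google forewarns users that they should already have the expertise before using this SEO tactic. Whether your site faces duplicate content issues may be the deciding factor.

Crawl issues may be caused by dynamic URLs, which in turn may mean that you face some challenges with URL parameter indexing. The URL Parameters section lets webmasters configure their choices for how Google crawls and indexes the site with URL parameters. By default, pages are crawled according to how GoogleBot determines it should crawl them. Pages containing key answers that people need should be checked carefully; many people find answers in the People Also Ask section.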
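One way to see how session IDs and other URL parameters multiply the number of crawlable URLs is to normalize a set of URLs and count how many distinct pages remain. The sketch below is an assumption-laden illustration, not a description of how Google treats parameters: the list of ignorable parameter names is hypothetical and would have to be chosen for your own site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameters that do not change page content on this example site.
IGNORED_PARAMS = {"sessionid", "sid", "utm_source", "utm_medium", "utm_campaign"}


def canonicalize(url: str) -> str:
    """Drop ignorable query parameters and sort the rest into a stable order."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    ]
    kept.sort()
    # The fragment is dropped because it is not sent to the server.
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(kept), "")
    )


if __name__ == "__main__":
    crawled = [
        "https://Example.com/shoes?sid=abc123&color=red&size=9",
        "https://example.com/shoes?size=9&color=red&sid=zzz999",
    ]
    unique = {canonicalize(u) for u in crawled}
    print(len(crawled), "crawled URLs map to", len(unique), "distinct page(s)")
```

Comparing crawled URLs from your server logs against their normalized forms gives a quick sense of how much crawl activity is being spent on duplicates.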

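It helps to have fresh content in order to earn more frequent Google crawls. The more you publish on your blog, the more often you can expect to be crawled. Previously, Google Search Console stored only up to 90 days of historical crawl data. More historical data is now available, and the SEOs who requested a longer time span are glad to have more data for discovering Google's crawl habits as they relate to your website.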

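Prepare for Mobile Web Performance and Faster Google Crawling

Do you know how well your mobile site is crawled? Google's Accelerated Mobile Pages (AMP) can go a long way toward helping site owners improve their search rankings and crawlability in a mobile-first world. Switching to Google AMP and understanding how it will affect your site's crawlability often needs someone experienced at the helm. For those focused on site visibility and positioning, we know load speed matters. If your pages are similar in every characteristic except speed, expect GoogleBot to favor the faster site that is easier to crawl, which is also what users expect to find near the top of the SERPs.

If you need help updating your AMP pages and then testing how your mobile site is crawled, read here for solutions. A site may load differently on various mobile devices, which affects load performance. Test to see whether Google's cache servers load faster on slower connections.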

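Quickly Fix Server Connectivity Issues that Harm Web Crawls

Business owners are often unaware of the quality of their hosting and of the server they are on. This brings up a very important point. If your site has connectivity errors, Google may be unable to access it because your site or its server is down. In particular, if you run a Google Ads campaign that links to a landing page that fails to load because of a server issue, the results can be very damaging. You may receive warnings in your Google AdWords console, and with too many warnings your ads can be canceled.

But beyond that, you have a lot more to weigh. If the situation continues unattended, Google may even stop visiting your site, your site's health will suffer, your page rankings are likely to fall, and as a result your traffic may drop substantially. It is pure logic: if Google cannot reach your site for a long period, it needs, just as we would, to move on to tasks that are workable. Set up alerts, and keep a close eye on your server connectivity and crawl errors.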

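How is Schema Used for Indexing Websites?

Gary Illyes confirmed at Pubcon 2017 that schema plays a role in indexing and ranking web pages, not only in helping them appear in rich search features. Jennifer Slegg reported on The SEM Post that more sites need to use it first, and she then warned that "you want to be careful that your schema doesn't become spammy as well; only use the schema types that are appropriate for the page or site. Otherwise, a site risks a manual action for spammy structured data."

If you have a shopping cart, implementing e-commerce JSON-LD structured data is especially important. Schema markup helps each of your pages be better understood for the topic it directly addresses, so the content on your website is indexed and returned in search results more effectively, and it can help your website rank better for every form of content. Keep a closer eye on the top 100 results in each category.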

How does GoogleBot Check Web Page Resources?

Most of your web pages use CSS and/or JavaScript to load. How your site is built and how many of these resources it uses affect your load times. Typically both CSS and JavaScript are loaded as external files that are linked from your HTML. Google must be able to access these resources in order to fully understand your web pages. Often someone unfamiliar with technical SEO issues and with how Google crawls a website will block these files in the robots.txt file. Read the reports in your Search Console to better understand Google's crawler.

You can check to determine if your website is adhering correctly to this guideline.

Take advantage of the Google guidelines tool while employing your SEO techniques to learn which files (if any) are set up as "blocked" from Googlebot. It stands to reason that if web crawlers cannot understand your site's contents, they cannot rank you. Google needs to be able to crawl your web pages in order to understand them fully and match your content to relevant search queries. Put your page through the SEO tool to get a better idea of how Google sees your site, or ask us to perform this vital task for you. Then we can go over the results together so that you address any issues correctly.

Requesting crawl rate adjustments:

Submit your website to Google and wait at least 24 hours before seeking to determine if your crawl rate changed. Google support states that “The term crawl rate means how many requests per second Googlebot makes to your site when it is crawling it: for example, 5 requests per second. You cannot change how often Google crawls your site, but if you want Google to crawl new or updated content on your site, you should use Fetch as Google”.

If you use the Google Webmaster tools and go to site settings, you can request a limit on your crawl rate. The new rate lasts for ninety days.

Google Crawls Sites that Follow their Webmaster Guidelines

The checks below help you confirm that your site correctly follows the Google webmaster guidelines*** for being crawled.

* Page headers are present when accessed by Googlebot, and the site data architecture is correct.

* Well-formed static links are discovered.

* The number of on page links is not excessive.

* Page avoids ordinary accessibility issues.

* Robots.txt file found and is correctly formed.

* All images have alt text to help GoogleBot understand what they show.

* All CSS and JavaScript files test as visible to Googlebot.

* Sitemaps for both search engines and users are available.

* No page speed issues.

NOTE: Additionally, you'll want to know that your web server correctly supports the If-Modified-Since HTTP header. This helps your web server to tell GoogleBot if your content has changed or updated since its last crawl. Having this feature working for you saves on your website's bandwidth and overhead.
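The If-Modified-Since behavior in the note above can be sketched as a small decision function: if the page has not changed since the date the crawler sent, answer 304 Not Modified with no body; otherwise answer 200 with the full page. This is a simplified illustration using Python's standard library, not a complete server. The function name `conditional_status` is hypothetical, and the page's modification time is assumed to be a timezone-aware datetime.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def conditional_status(if_modified_since: Optional[str], last_modified: datetime) -> int:
    """Return 304 if the resource is unchanged since the header date, else 200.

    last_modified is assumed to be timezone-aware.
    """
    if not if_modified_since:
        return 200
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        # Unparseable header: ignore it and send the full response.
        return 200
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop microseconds first.
    if last_modified.replace(microsecond=0) <= since:
        return 304
    return 200


if __name__ == "__main__":
    page_changed = datetime(2021, 3, 6, 10, 0, 0, tzinfo=timezone.utc)
    print(conditional_status("Sat, 06 Mar 2021 10:00:00 GMT", page_changed))
    print(conditional_status("Fri, 05 Mar 2021 10:00:00 GMT", page_changed))
```

Most web servers and frameworks handle this for static files already, so a check like this matters mainly for dynamically generated pages.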

As businesses grasp the importance of how Google crawls their site, we increasingly get requests to help them get new content indexed fast.

6 Ways to Get New Content Indexed Fast

1. Link to fresh content from your home page or a prominent web page on your website

2. Publish a Google Post about your new content

3. Invite Google bots by sharing a link to one or two of your new blog posts along with a YouTube video

4. Add your new page to your site map and resubmit your sitemap

5. Make sure new content is added to your RSS feed and that the RSS feed is accessible to web crawlers

6. Add your Site to Qirina.com

“Google essentially gathers the pages during the crawl process and then creates an index, so we know exactly how to look things up. Much like the index in the back of a book, the Google index includes information about your site's ontology, words and their locations. When you search, at the most basic level, our algorithms look up your search terms in the index to find the appropriate pages.” – Matt Cutts of Google

“The web spider crawls to a website, indexes its information, crawls on to the next website, indexes it, and keeps crawling wherever the Internet's chain of links leads it. Thus, the mighty index is formed.” – Crazy Egg

“Search engines crawl your site to get the contents into their index. The bigger your site gets, the longer this crawl takes. An important concept while talking about crawling is the concept of crawl depth. Say you had 1 link, from 1 site to 1 page on your site. This page linked to another, to another, to another, etc. Googlebot will keep crawling for a while. At some point though, it'll decide it's no longer necessary to keep crawling.” – Yoast on Crawl Efficiency

“We strongly encourage you to pay very close attention to the Quality Guidelines below, which outline some of the illicit practices that may lead to a site being removed entirely from the Google index or otherwise affected by an algorithmic or manual spam action. If a site has been affected by a spam action, it may no longer show up in results on Google.com or on any of Google's partner sites.” – Google Webmaster Guidelines

How does Google Crawler handle redirect loops?

When fetching robots.txt, GoogleBot follows at least five redirect hops. Because no rules have been fetched yet, the redirects are followed for at least five hops, and if no robots.txt is discovered, the search giant treats it as a 404 for the robots.txt. Handling logical redirects for the robots.txt file based on HTML content that returns 2xx, such as frames, JavaScript, or meta refresh-type redirects, is discouraged; in that case the content of the first page is used for finding applicable rules.

In our audits, we find that old tracking pixels are a common issue. They should be either removed or updated so that they do not slow the site down while serving no purpose.
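To check your own redirect chains against the five-hop figure above, you can simulate them offline. The sketch below is not Google's implementation; it walks a hypothetical redirect map (source URL to target URL) and gives up after a configurable number of hops, which also catches loops.

```python
from typing import Dict, Optional

MAX_HOPS = 5  # hop count taken from the paragraph above


def resolve_redirects(
    start: str, redirects: Dict[str, str], max_hops: int = MAX_HOPS
) -> Optional[str]:
    """Follow redirects from start; return the final URL, or None if the
    chain is longer than max_hops (which includes redirect loops)."""
    url = start
    for _ in range(max_hops):
        if url not in redirects:
            return url
        url = redirects[url]
    # The hop budget is used up: succeed only if we landed on a final URL.
    return url if url not in redirects else None


if __name__ == "__main__":
    # Hypothetical redirect maps, for illustration only.
    chain = {
        "http://example.com/robots.txt": "https://example.com/robots.txt",
        "https://example.com/robots.txt": "https://www.example.com/robots.txt",
    }
    loop = {"/a": "/b", "/b": "/a"}
    print(resolve_redirects("http://example.com/robots.txt", chain))
    print(resolve_redirects("/a", loop))
```

A result of None for your robots.txt URL suggests the chain should be shortened to a single redirect or removed.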

How long can my robots.txt file be?

Google updated its Crawling and Indexing Documentation on August 27, 2020 to say that “Google currently enforces a size limit of 500 kibibytes (KiB), and ignores content after that limit.”*** It is the first time we have heard of any robots.txt length limit. Very few sites will be affected by this limit on the size of the robots.txt file.
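The arithmetic behind the limit is simple: 500 KiB is 500 × 1024 = 512,000 bytes. The sketch below checks a robots.txt file's raw bytes against that figure and shows which portion would still be read if everything past the limit is ignored, as the quoted documentation states. The function name is hypothetical.

```python
from typing import Tuple

MAX_ROBOTS_BYTES = 500 * 1024  # 500 KiB = 512,000 bytes, the limit quoted above


def effective_robots_txt(raw: bytes) -> Tuple[bytes, bool]:
    """Return the part of the file within the size limit and whether any
    content was cut off."""
    truncated = len(raw) > MAX_ROBOTS_BYTES
    return raw[:MAX_ROBOTS_BYTES], truncated


if __name__ == "__main__":
    small = b"User-agent: *\nDisallow: /cart/\n"
    kept, truncated = effective_robots_txt(small)
    print(len(kept), "bytes kept; truncated:", truncated)
```

If the second value comes back True, rules near the end of the file would fall outside the limit, so consolidating patterns is the safer fix.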

What does a “server error” mean in my GSC reports?

If you've ever wondered what it actually means when server errors are reported, Google now tells us that “Google treats unsuccessful requests or incomplete data as a server error.” The quality of the server that hosts your website is very important. Slow servers are often the reason that page load timeouts occur and are labeled as incomplete data. Google's customer is the person using its search capabilities. People, especially those who search from mobile devices, want fast results. That means a slow server that cannot deliver your web content quickly is a real concern to prioritize.

How Google Regards 404/410 Status Codes and Indexing Old Pages

Frequently the question resurfaces as to how Google handles 404 and 410 error codes and how that impacts crawling a website. Google's John Mueller responded to a question about web pages that no longer exist and the best way that a webmaster should manage it.

In a recent Webmaster Hangout, Google's John Mueller responded to the question: “If a 404 error goes to a page that doesn't exist, should I make them a 410?” with the following answer:

“From our point of view, in the midterm/long term, a 404 is the same as a 410 for us. So in both of these cases, we drop those URLs from our index.
We, generally, reduce crawling a little bit of those URLs so that we don't spend too much time crawling things that we know don't exist.
The subtle difference here is that a 410 will sometimes fall out a little bit faster than a 404. But usually, we're talking on the order of a couple of days or so.
So if you're just removing content naturally, then that's perfectly fine to use either one. If you've already removed this content long ago, then it's already not indexed so it doesn't matter for us if you use a 404 or 410.” – John Mueller

It is worth noting that by using the 410 status code, SEOs can slightly speed up the process of Google removing the web page from its index. Mueller also stated that “the 410 response is primarily intended to assist the task of web maintenance by notifying the recipient that the resource is intentionally unavailable and that the server owners desire that remote links to that resource be removed”.
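The distinction above can be expressed as a simple rule on the server: answer 410 Gone for pages you removed on purpose and 404 Not Found for everything else that does not exist. The sketch below is one hedged way to organize that decision; the path sets and the function name `status_for` are hypothetical, and a real site would read them from its CMS or routing table.

```python
from http import HTTPStatus

# Hypothetical site inventory, for illustration only.
LIVE_PATHS = {"/", "/blog/how-google-crawls"}
RETIRED_PATHS = {"/old-promo-2015"}  # pages removed intentionally


def status_for(path: str) -> HTTPStatus:
    """Pick the response status for a requested path."""
    if path in LIVE_PATHS:
        return HTTPStatus.OK
    if path in RETIRED_PATHS:
        # 410 signals that the removal was deliberate.
        return HTTPStatus.GONE
    return HTTPStatus.NOT_FOUND


if __name__ == "__main__":
    for path in ("/", "/old-promo-2015", "/typo-url"):
        status = status_for(path)
        print(path, status.value, status.phrase)
```

Since Mueller describes the two codes as equivalent in the medium to long term, keeping a retired-paths list is worthwhile mainly when you want removals picked up a little sooner.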

“It turns out webmasters shoot themselves in the foot pretty often. Pages go missing. People misconfigure sites. Sites go down. People block GoogleBot by accident.
So if you look at the entire web, the crawl team has to design to be robust against that. So with 404s, along with I think 401s and maybe 403s, if we see a page and we get a 404, we are going to protect that page for 24 hours in the crawling system.” – John Mueller**

A Major Part of SEO is Crawling and Indexing

With so many tasks involved today in digital marketing and in improving site performance with current SEO best practices, many small businesses feel challenged to give sufficient time and effort to Google crawl optimization. If you fall into this bucket, it is quite possible you are missing a significant amount of traffic. We can help you ensure that the primary pages that serve your audience's needs are crawled and indexed correctly.

Crawl optimization should be a high priority for any large website seeking to improve its SEO efforts. Even with the best e-Commerce Schema implementation, if your site isn't indexed correctly, you have a real problem. By implementing tracking, monitoring your Google Analytics SEO reports, and directing GoogleBot to your key web content, you can gain an advantage over your competition.

Summary

In order to be indexed and returned in search engine results, your website must first be easy to crawl. If you think your business website is poorly indexed or rarely returned in results, it is important to determine whether your site is crawled correctly. Start with a full website SEO audit, implement improvements, and then see how much you gain in increased Internet traffic and site views.

Remember, reaching your goal of having your website indexed by Google is only the first step in successful digital marketing. To improve your website beyond being crawled and indexed, make sure you're following basic SEO principles, creating high-value content users want, and getting rich data insights from Google Analytics. Then, you'll be in a better position to integrate organic and paid search.

Hill Web Creations can offer you new ideas on how to “encourage” Google to re-crawl your website, or select web pages that have been recently updated. Call 661-206-2410 and ask for Jeannie. The benefits of our work will show up in your future comprehensive SEO Reports.

Or you can start by checking out our Types of Website Audits Available

* https://support.google.com/webmasters/answer/35769

** https://www.youtube.com/watch?v=kQIyk-2-wRg

*** https://developers.google.com/search/docs/advanced/robots/robots_txt