Did Google actually index it? Using a custom search engine to check a site's indexing status in bulk and in near real time (coding required)

Crystal KUO

11 min read · Dec 17, 2022

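Why do this?

Early this year I used Google's Inspection API to check whether our site's pages were being indexed, but the metric I had been relying on stopped working (sob), and I could no longer tell whether pages were being indexed in real time. To see the actual indexing status, I ended up turning to this API instead.

A small warning: I'm a Python newbie, so the code may be ugly. If that bothers you, feel free to close the window now, or keep reading and leave me some suggestions!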

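Who might need this tool?

Anyone who needs to know, in bulk and in near real time, whether pages have been indexed by Google
Anyone who doesn't want to run a manual site: search for every single URL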

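What to prepare before writing code

A working Python installation
The URLs to check; this time I pulled them from our own sitemap
Custom Search API and Google Sheets API enabled in GCP (Google Cloud)
A configured search engine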

Custom Search API

Limits

Free searches: 100 per day
Beyond 100 you pay US$5 (about NT$155) per 1,000 queries, up to a maximum of 10,000 queries per day
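To get a feel for what those limits mean in practice, here is a rough cost estimate. It is only a sketch based on the numbers listed above. The function name and constants are my own, I'm assuming the 10,000 cap counts free and paid queries together, and actual billing may be rounded or calculated differently.

```python
FREE_PER_DAY = 100      # free queries per day (from the limits above)
PRICE_PER_1000 = 5.0    # USD per 1,000 paid queries (from the limits above)
MAX_PER_DAY = 10_000    # assumed to be the total daily cap, free + paid


def estimate_daily_cost(queries: int) -> float:
    """Rough USD cost of making `queries` calls in a single day."""
    if queries < 0:
        raise ValueError("queries must not be negative")
    if queries > MAX_PER_DAY:
        raise ValueError("more than the daily maximum of 10,000 queries")
    paid = max(0, queries - FREE_PER_DAY)  # only calls beyond the free tier cost money
    return paid * PRICE_PER_1000 / 1000
```

So checking 600 titles in one day would come to roughly US$2.50 under these assumptions, while staying at 100 or fewer costs nothing.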

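Getting an API key

On the Search Engine API key page, click Get a Key and Google will walk you through the remaining steps and give you a key. Copy it somewhere safe; you'll need it when writing the code!

If you follow the documented steps, the Custom Search API is usually enabled along the way. If it isn't, go to the Library under APIs & Services in the GCP menu, find Custom Search API, and enable it.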

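Custom Search setup

Next, go to Programmable Search Engine to create your search engine. (It was presumably meant for sites that lack the resources to build their own search, and here I am using it like this XD)

First, add a new search engine.

Here you can choose whether to search Google's entire index or only specific domains. If you're checking your own site, I recommend restricting it to that site, which makes the later comparison easier.

Then open your search engine, find its ID, and copy it. That's it for this part.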
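With the API key and the search engine ID in hand, a request URL can be put together like this. This is a minimal sketch using only the standard library. The helper name `build_search_url` and the sample values are made up for illustration, while `key`, `cx`, and `q` are the query parameters the Custom Search JSON API expects.

```python
from urllib.parse import urlencode

BASE_URL = "https://customsearch.googleapis.com/customsearch/v1"


def build_search_url(api_key: str, engine_id: str, query: str) -> str:
    """Build a Custom Search request URL (hypothetical helper, not part of any library)."""
    params = {
        "key": api_key,   # the API key copied earlier
        "cx": engine_id,  # the search engine ID copied above
        "q": query,       # what to search for, e.g. an article title
    }
    # urlencode takes care of escaping spaces and non-ASCII titles
    return f"{BASE_URL}?{urlencode(params)}"
```

Calling `build_search_url("KEY", "ENGINE", "hello world")` gives a URL ending in `?key=KEY&cx=ENGINE&q=hello+world`, which can be handed to any HTTP client.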

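Google Sheets API

Search for something like "how to connect the Google Sheets API" and you'll find plenty of guides on hooking up a Google Spreadsheet, or just follow the official tutorial. I won't go into it here!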

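Start coding

Packages used

import requests as req  # call the API
import pandas as pd  # filter data
import json  # convert formats
import time  # timed waits
import sys  # stop in the middle of a loop
from datetime import datetime  # timestamp each API call
from google.oauth2.service_account import Credentials  # write to the spreadsheet
import gspread  # write to the spreadsheet
import gspread_dataframe as gd  # write to the spreadsheet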

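Steps

Get the list of titles and links to check. I scraped the sitemap with Beautiful Soup and won't go into that part!
Send each title to the custom search engine and compare URLs to see whether a matching result comes back
If some URLs can't be found in the search engine, keep looping until every one of them is indexed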
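For the first step, here is one way the link list could be pulled out of a sitemap. Note that this is not my Beautiful Soup code: it is a sketch that swaps in the standard library's `xml.etree.ElementTree`, and it assumes a plain sitemap in the sitemaps.org format. A standard sitemap has no titles, so those would have to come from somewhere else (for example, the pages themselves). The function name and the dict keys are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemaps (sitemaps.org protocol)
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text: str) -> list:
    """Return one dict per <url> entry with its link and last-modified date."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = url.findtext("sm:lastmod", default="", namespaces=SITEMAP_NS).strip()
        if loc:  # skip entries without a URL
            entries.append({"Link": loc, "LastModified": lastmod})
    return entries
```

Feed it the sitemap's XML text (downloaded however you like) and loop over the returned list to get the links to check.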

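Send the title to the custom search engine and compare URLs for a matching result

Once your list is ready, send the things you want to search for to the search engine.

I wrote a function to call the API. Its jobs are to

Collect the information needed from the API response
Check whether the thing I'm looking for is there
Pack the needed information into a dict

At the very start I set a variable (d) to count how many API calls have been made, so I don't exceed the quota and have to pay. Inside the function, it goes up by one on every call.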


d = 4  # number of API calls made so far; start from the count shown in GCP, stops automatically at 99

def getAPI():
    # Link, Query, sort, publicDate, publicTime, modifiedDate and modifiedTime are read
    # as globals, so they must be set for the current article before each call
    global d
    if d >= 99:
        sys.exit("API usage has reached 99 calls! You might have to pay")
    d += 1
    print(d)
    # Query is the keyword to search for; I search by article title
    response = req.get('https://customsearch.googleapis.com/customsearch/v1?cx={search engine ID}&key={API key}', params={'q': Query}).json()
    apiDate = datetime.now().strftime("%Y-%m-%d")  # date of the API call
    apiTime = datetime.now().strftime("%H:%M:%S")  # time of the API call

    status = False
    serpTitle = 'No Related Result'  # the search returned no results at all
    if response.get('items') is not None:
        serpTitle = 'No Result'  # results came back, but none of them matched
        for i in response['items']:  # collect what we need from the API response
            if Link == i["link"]:  # check for the thing I'm looking for, compared by URL
                status = True
                serpTitle = i["title"]
                break

    # pack everything into the dict I want: the current date and time, the article's publish time,
    # its last-modified time, whether there was a result, category, link, title, and the title on the SERP
    serpChecked = {"Date": apiDate, "Time": apiTime, "PublicDate": publicDate, "PublicTime": publicTime, "ModifiedDate": modifiedDate, "ModifiedTime": modifiedTime, "Status": status, "Sort": sort, "Link": Link, "Title": Query, "SERPTitle": serpTitle}
    temp.append(serpChecked)  # one row per call, so a later match can't be hidden behind an earlier miss

temp = []  # holds all the result dicts

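One thing worth noting: a search can either return no results at all, or return results that simply don't include the one you want. I treat these as two separate cases here. If you don't want the hassle, you could probably merge the two with try/except.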
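To make the two cases easier to see (and to test without spending any quota), the comparison can be pulled out into a small function that only looks at the parsed response. This is a sketch, not the code I actually ran: the function name `check_result` is made up, and it assumes the response is the usual dict with an optional `items` list whose entries carry `title` and `link`.

```python
def check_result(response: dict, link: str) -> dict:
    """Classify one search response for the URL we are looking for."""
    items = response.get("items")
    if not items:
        # Case 1: the search came back with no results at all
        return {"Status": False, "SERPTitle": "No Related Result"}
    for item in items:
        if item.get("link") == link:
            # Found: the URL shows up in the results, so it is indexed
            return {"Status": True, "SERPTitle": item.get("title", "")}
    # Case 2: there are results, but none of them is the URL we want
    return {"Status": False, "SERPTitle": "No Result"}
```

Because it returns a single answer per search, a match in the second or third result is not overshadowed by the non-matching results before it.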

Next, check whether everything has been indexed! I use a DataFrame to filter here.

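df = pd.DataFrame(temp)  # turn the list of dicts into a DataFrame
df.drop_duplicates(subset='Link', keep='first', inplace=True)  # drop duplicate Links
print("printall")
appendSheet()  # my own function that writes to the spreadsheet
print(df)

statusFilter = df["Status"] == False  # mask that selects the rows whose Status is False
if not statusFilter.any():
    print('All indexed')  # no article with a False status means everything is indexed (hooray!)
    sys.exit()  # stop the whole script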

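If it still isn't found, loop until it is!

First confirm that the number of False rows is not zero, then wait 3 minutes. Hammering the API won't make Google index any faster, and you don't want to burn through your quota right away.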

nofoundList = df[statusFilter]

if len(nofoundList) != 0:

    print("wait for 3mins")
    time.sleep(180)  # pause; adjust as you like

    print("Start finding when will be indexed")

    nofoundDict = nofoundList.to_dict('records')  # turn the DataFrame back into a list of dicts, easier to read from

    while nofoundDict:  # keep going as long as something is still not indexed

        print('start!!')
        temp = []

        for b in nofoundDict:  # take the sitemap entries whose status is still False and send them to the search engine
            # set the globals that getAPI() reads for this entry
            Link, Query, sort = b["Link"], b["Title"], b["Sort"]
            publicDate, publicTime = b["PublicDate"], b["PublicTime"]
            modifiedDate, modifiedTime = b["ModifiedDate"], b["ModifiedTime"]
            getAPI()

        df = pd.DataFrame(temp)
        df.drop_duplicates(subset='Link', keep='first', inplace=True)  # drop duplicate Links
        appendSheet()  # write the results to the spreadsheet

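If everything was found, that's All indexed and the program stops
If some still weren't found, filter out the False rows and repeat the steps above: wait 3 minutes and go again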

        if df['Status'].all():
            print('All indexed')
            break

        else:
            statusFilter = df["Status"] == False  # select the rows whose Status is False
            nofoundList = df[statusFilter]
            nofoundDict = nofoundList.to_dict('records')
            print("wait for 3mins")
            time.sleep(180)
            print("Another Finding")
    else:  # this else pairs with the while loop: it runs only when the loop ends without hitting break
        print("All indexed")
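For reference, the whole "check, wait, check again" idea can also be written as one small function. This is a sketch under my own assumptions rather than the script above: `poll_until_indexed`, `is_indexed`, and `max_calls` are hypothetical names, the actual API call is passed in as a function so the loop can be tried out without using any quota, and the call budget plays the role of the counter d.

```python
import time


def poll_until_indexed(pending, is_indexed, wait_seconds=180, max_calls=95, sleep=time.sleep):
    """Re-check links until all are indexed or the call budget runs out.

    Returns (links_still_not_indexed, calls_used).
    """
    calls = 0
    pending = list(pending)
    while pending:
        remaining = []
        for i, link in enumerate(pending):
            if calls >= max_calls:
                # Budget used up: report everything not confirmed as indexed
                return remaining + pending[i:], calls
            calls += 1
            if not is_indexed(link):
                remaining.append(link)
        pending = remaining
        if pending:
            sleep(wait_seconds)  # no point hammering the API; wait before the next round
    return pending, calls
```

Here `is_indexed` would wrap the search call and the URL comparison for one link. Passing a fake `sleep` (for example `lambda s: None`) makes it easy to dry-run the loop instantly.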

That's it. This is the approach I came up with while battling SEO. If you have a more convenient method, please tell me in the comments (╯✧∇✧)╯

References:

Writing to a spreadsheet: https://pythonviz.com/google-cloud/google-sheets-api-read-write-pandas-dataframe/
Using DataFrames: https://www.learncodewithmike.com/2020/11/python-pandas-dataframe-tutorial.html
What global variables are: http://kaiching.org/pydoing/py/python-global.html