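As you may already know, the official Google Places API returns at most 5 reviews per place. For that reason, developers turn to scraping in order to get all the reviews for any business on Google Maps.
Scraping Google, with all its protections and dynamically rendered pages, can be a challenging task. Fortunately, there are many tools you can use from Python or other programming languages to scrape reviews. In this blog post you will see the two most common ways to scrape Google reviews: browser emulation and the Outscraper platform. Either one is enough to get all the reviews from any listing on Maps.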
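Scraping Google Reviews in Python with a Browser That Renders Dynamic Content
We will use Selenium to control the Chrome browser. The browser will render the dynamic Google reviews pages. To start building a review scraper with Selenium, we need the following: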
- Python 3+.
- The Chrome browser installed.
- Selenium 3.141.0+ (Python package).
- ChromeDriver (the build for your operating system).
- Parsel or any other library for extracting data from HTML, such as Beautiful Soup.
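Install Selenium and Other Packages
Install the Selenium and Parsel packages by running the following commands. We will use Parsel later, when we parse content out of the HTML.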
pip install selenium
pip install parsel # to extract data from HTML using XPath or CSS selectors
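Start the Browser
Before starting the driver, make sure you have completed the previous steps and that you know the path to the chromedriver file. Initialize the driver with the following code. You should see a new browser window open.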
from selenium import webdriver
chromedrive_path = './chromedriver' # use the path to the driver you downloaded from previous steps
driver = webdriver.Chrome(chromedrive_path)
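On a Mac you may see the following message: "chromedriver cannot be opened because the developer cannot be verified." To get past it, Control-click chromedriver in Finder, choose Open from the menu, and then click Open in the dialog that appears. You should see "ChromeDriver was started successfully" in the terminal window that opens. Close it; after that, you will be able to start ChromeDriver from your code.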
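The listing above passes the driver path as a positional argument, which matches Selenium 3.141.0. Selenium 4 deprecated that form and recent 4.x releases no longer accept it. A minimal sketch of the equivalent start-up, assuming Selenium 4 is installed; the helper name `start_chrome` and its `headless` flag are illustrative and not part of the original tutorial:

```python
def start_chrome(driver_path='./chromedriver', headless=False):
    """Start Chrome through Selenium 4's Service object and return the driver."""
    # Imported inside the function so this sketch can be loaded
    # even on a machine where Selenium is not installed yet.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    options = webdriver.ChromeOptions()
    if headless:
        # Assumption: a recent Chrome build that understands the new headless mode.
        options.add_argument('--headless=new')

    # The Service object carries the path to the chromedriver binary.
    service = Service(executable_path=driver_path)
    return webdriver.Chrome(service=service, options=options)
```

With this helper, `driver = start_chrome(chromedrive_path)` would take the place of the `webdriver.Chrome(chromedrive_path)` call.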
Download the Reviews Page
Once the driver is running, you are ready to open pages. To open any page, use the driver's `get` method.
url = 'https://www.google.com/maps/place/Central+Park+Zoo/@40.7712318,-73.9674707,15z/data=!3m1!5s0x89c259a1e735d943:0xb63f84c661f84258!4m16!1m8!3m7!1s0x89c258faf553cfad:0x8e9cfc7444d8f876!2sTrump+Tower!8m2!3d40.7624284!4d-73.973794!9m1!1b1!3m6!1s0x89c258f1fcd66869:0x65d72e84d91a3f14!8m2!3d40.767778!4d-73.9718335!9m1!1b1?hl=en&hl=en'
driver.get(url)
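Google Maps typically loads more reviews only as the reviews pane is scrolled, so a single page load usually exposes just the first batch. The following is a rough sketch of a scroll loop that keeps going until the number of loaded reviews stops growing. It is deliberately driver-agnostic: `scroll_until_stable` and its parameters are hypothetical names, and how you count and scroll is supplied by the caller.

```python
import time


def scroll_until_stable(count_items, scroll_once, max_rounds=50, patience=3, pause=0.0):
    """Call scroll_once() until count_items() stops growing.

    count_items: callable returning how many reviews are currently loaded.
    scroll_once: callable that scrolls the reviews pane one step.
    patience:    consecutive rounds without growth before giving up.
    Returns the final item count.
    """
    last = count_items()
    stale = 0
    for _ in range(max_rounds):
        scroll_once()
        if pause:
            time.sleep(pause)  # give the page time to render new reviews
        current = count_items()
        if current > last:
            last = current
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                break
    return last


# Possible wiring with Selenium (assumptions: `driver` is already on the reviews
# page and PANE_XPATH is an XPath you found yourself for the scrollable pane):
#
#   from selenium.webdriver.common.by import By
#   pane = driver.find_element(By.XPATH, PANE_XPATH)
#   scroll_until_stable(
#       count_items=lambda: len(driver.find_elements(By.XPATH, '//div[@data-review-id]')),
#       scroll_once=lambda: driver.execute_script(
#           'arguments[0].scrollTop = arguments[0].scrollHeight', pane),
#       pause=1.0,
#   )
```

Passing the counting and scrolling steps in as callables keeps the loop independent of Google's markup, which changes over time.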
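Parse the Reviews
Once the page is open, you will see it in the Chrome window controlled by your code. Run the following code to get the page's HTML content from the driver.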
page_content = driver.page_source
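To inspect the HTML comfortably, open the developer console: click the Chrome menu in the top-right corner of the browser window and choose More Tools > Developer Tools. You should now be able to see the elements of your page.
You can parse the HTML content with whichever parsing tool you prefer. We will use Parsel in this tutorial.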
from parsel import Selector
response = Selector(page_content)
Iterate over the reviews.
results = []
# Each review sits in a div that carries a data-review-id attribute
for el in response.xpath('//div/div[@data-review-id]/div[contains(@class, "content")]'):
    results.append({
        'title': el.xpath('.//div[contains(@class, "title")]/span/text()').extract_first(''),
        'rating': el.xpath('.//span[contains(@aria-label, "stars")]/@aria-label').extract_first('').replace('stars', '').strip(),
        'body': el.xpath('.//span[contains(@class, "text")]/text()').extract_first(''),
    })
print(results)
Output from the Google reviews scraper (shortened).
[
{
'title': 'Wanda Garrett',
'rating': '5',
'body': 'Beautiful ✨ park with a family-friendly atmosphere! I had a great time here; seeing all of the animals and learning all of the interesting facts was a fantastic way to spend the day. The zoo is beautifully landscaped and surrounded by …'
},
{
'title': 'Bernadette Bennett',
'rating': '4',
'body': 'Worth going for the seals! They are the main attraction and located in the center of the zoo. We watched a live feeding and it was great. The kids loved it. The zoo is well manicured surrounded by gorgeous gardens. Lots of benches to rest …'
},
{
'title': 'Mary Cutrufelli',
'rating': '3',
'body': "So not gonna lie... We came from PA. My kid expected to see lions and hippos and zebra from Madagascar. None of that which is there. It's clean it's a nice zoo. I wouldn't go again though."
},
...
]
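The scraper returns every field as a string, including the rating. A small post-processing sketch, assuming `results` has the shape printed above: it converts ratings to integers and serializes the rows as CSV. The helper names `parse_rating`, `clean_reviews`, and `to_csv` are illustrative only.

```python
import csv
import io
import re


def parse_rating(label):
    """Return the first integer found in a rating string, or None if there is none."""
    match = re.search(r'\d+', label or '')
    return int(match.group()) if match else None


def clean_reviews(results):
    """Strip whitespace and turn the rating into an int (None when missing)."""
    return [
        {
            'title': r.get('title', '').strip(),
            'rating': parse_rating(r.get('rating')),
            'body': r.get('body', '').strip(),
        }
        for r in results
    ]


def to_csv(rows):
    """Serialize cleaned reviews to CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=['title', 'rating', 'body'])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

To write a file instead of a string, the text returned by `to_csv(clean_reviews(results))` can be saved with an ordinary `open(..., 'w', newline='')` call.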
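Stop the Browser
It is important to start the driver before scraping and stop it afterwards, just as you open and close a browser before and after surfing the web. Stop the browser by running the following code.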
driver.quit()
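Although the HTML structure of Google reviews is tricky, with Selenium and a good understanding of XPath and CSS selectors you can achieve quite good scraping results. Because this approach drives a real browser, it should help you avoid being blocked. If you scale your application up, however, consider using proxies to avoid unexpected problems.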
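Multiprocessing and Other Tips
It is possible to run drivers in multiple processes (not in multiple threads), but each driver will consume one CPU. Make sure you have enough of them.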
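As a rough sketch of that setup, the standard library's `multiprocessing.Pool` can run one scraping job per process, with the number of workers capped by the CPUs available. Everything named here (`pick_worker_count`, `scrape_place`, `scrape_many`, the `reserve` parameter) is hypothetical, and `scrape_place` is only a placeholder for the start, get, parse, and quit steps shown earlier.

```python
import os
from multiprocessing import Pool


def pick_worker_count(n_urls, cpu_count=None, reserve=1):
    """Choose how many browser processes to run.

    Leaves `reserve` CPUs free for the rest of the system and never
    starts more workers than there are URLs. Always returns at least 1.
    """
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    usable = max(1, cpu_count - reserve)
    return max(1, min(n_urls, usable))


def scrape_place(url):
    """Placeholder worker: one browser per call.

    A real version would start its own driver here, open `url`,
    parse the reviews, and call driver.quit() before returning.
    """
    return {'url': url, 'reviews': []}


def scrape_many(urls):
    """Run scrape_place over all URLs in separate processes."""
    workers = pick_worker_count(len(urls))
    # The worker must be a top-level function so it can be pickled.
    with Pool(processes=workers) as pool:
        return pool.map(scrape_place, urls)
```

Each worker creates and quits its own driver; a driver object should not be shared between processes.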
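The Easiest Way to Scrape Google Reviews in Python
Extracting data from Google with a browser has its pros and cons. You can develop the scraper yourself, but as you scale, handling browser emulation can mean paying for servers with many CPUs. In addition, someone has to maintain the crawler and update it whenever Google's site changes.
Through the Outscraper platform, API, or SDKs, Outscraper gives businesses and individuals the easiest way to start scraping reviews from Google without dealing with proxies, browser emulation, or an investment in development.
Scraping Reviews in Python with the SDK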
1. You need Python 3+ and this Python package. Install the package by running the following command.
pip install google-services-api
from outscraper import ApiClient

# Replace KEY_FROM_OUTSCRAPER with your own API key
api_client = ApiClient(api_key='KEY_FROM_OUTSCRAPER')
response = api_client.google_maps_reviews(
    'https://www.google.com/maps/place/Do+or+Dive+Bar/@40.6867831,-73.9570104,17z/data=!3m2!4b1!5s0x89c25b96a0b10eb9:0xfe4f81ff249e280d!4m5!3m4!1s0x89c25b96a0b30001:0x643d0464b3138078!8m2!3d40.6867791!4d-73.9548217',
    language='en',
    limit=100
)
5. Wait a few seconds until the reviews are fetched.
[
{
"name": "Do or Dive Bar",
"full_address": "1108 Bedford Ave, Brooklyn, NY 11216, United States",
"borough": "Bedford-Stuyvesant",
"street": "1108 Bedford Ave",
"city": "Brooklyn",
"postal_code": "11216",
"country_code": "US",
"country": "United States of America",
"us_state": "New York",
"state": "New York",
"plus_code": null,
"latitude": 40.686779099999995,
"longitude": -73.9548217,
"time_zone": "America/New_York",
"site": "https://www.doordivebedstuy.com/",
"phone": "+1 917-867-5309",
"type": "Bar",
"rating": 4.5,
"reviews": 425,
"reviews_data": [
{
"google_id": "0x89c25b96a0b30001:0x643d0464b3138078",
"autor_link": "https://www.google.com/maps/contrib/115539085325450648866?hl=en-US",
"autor_name": "Sam Grjaznovs",
"autor_id": "115539085325450648866",
"autor_image": "https://lh3.googleusercontent.com/a-/AOh14GgxmEH7a10v6Bo8AFb6OkbyxxfIBPXbMYVAxeSIRA=c0x00000000-cc-rp-ba3",
"review_text": "Cozy shin dig with an assortment of drinks. They have a strong specialty for 10bucks and merch too. They have out side dining as well as back yard area. Ask for Brandon every other Saturday. He\u2019s hella cute!",
"review_img_url": "https://lh5.googleusercontent.com/p/AF1QipPNs8QvvdkBonV5wuxdoylFjLY3k7L6muepbDq-",
"owner_answer": null,
"owner_answer_timestamp": null,
"owner_answer_timestamp_datetime_utc": null,
"review_link": "https://www.google.com/maps/reviews/data=!4m5!14m4!1m3!1m2!1s115539085325450648866!2s0x0:0x643d0464b3138078?hl=en-US",
"review_rating": 5,
"review_timestamp": 1603781021,
"review_datetime_utc": "10/27/2020 06:43:41",
"review_likes": 0,
"reviews_id": "7222934207919784056"
},
{
"google_id": "0x89c25b96a0b30001:0x643d0464b3138078",
"autor_link": "https://www.google.com/maps/contrib/110571545135018844510?hl=en-US",
"autor_name": "Arabella Stephens",
"autor_id": "110571545135018844510",
"autor_image": "https://lh3.googleusercontent.com/a-/AOh14GisqDfheDO0Aq0cu1Z7YBTbzLyvSEvM3IMDKg3q=c0x00000000-cc-rp",
"review_text": "Great atmosphere, always fun vibe and good beers. I live in the area and this is a very reliable standby. Would recommend to anyone who is not pretentious and likes a bit of clutter in their watering hole.",
"review_img_url": "https://lh3.googleusercontent.com/a-/AOh14GisqDfheDO0Aq0cu1Z7YBTbzLyvSEvM3IMDKg3q",
"owner_answer": null,
"owner_answer_timestamp": null,
"owner_answer_timestamp_datetime_utc": null,
"review_link": "https://www.google.com/maps/reviews/data=!4m5!14m4!1m3!1m2!1s110571545135018844510!2s0x0:0x643d0464b3138078?hl=en-US",
"review_rating": 5,
"review_timestamp": 1614111762,
"review_datetime_utc": "02/23/2021 20:22:42",
"review_likes": 0,
"reviews_id": "7222934207919784056"
},
{
"google_id": "0x89c25b96a0b30001:0x643d0464b3138078",
"autor_link": "https://www.google.com/maps/contrib/101725757133547547783?hl=en-US",
"autor_name": "Jack Parker",
"autor_id": "101725757133547547783",
"autor_image": "https://lh3.googleusercontent.com/a-/AOh14GjFK9CLb8__u5PtJzH1rGuX4DVgPvjaEeIkSJnCNw=c0x00000000-cc-rp",
"review_text": "All the bartenders are rad. Cheap drinks, and a nice backyard. They have space heaters, but I would still recommend bundling up if you plan on spending a while there. Jeopardy night is always fun too. Can\u2019t wait to sit inside again!",
"review_img_url": "https://lh3.googleusercontent.com/a-/AOh14GjFK9CLb8__u5PtJzH1rGuX4DVgPvjaEeIkSJnCNw",
"owner_answer": null,
"owner_answer_timestamp": null,
"owner_answer_timestamp_datetime_utc": null,
"review_link": "https://www.google.com/maps/reviews/data=!4m5!14m4!1m3!1m2!1s101725757133547547783!2s0x0:0x643d0464b3138078?hl=en-US",
"review_rating": 5,
"review_timestamp": 1611947492,
"review_datetime_utc": "01/29/2021 19:11:32",
"review_likes": 0,
"reviews_id": "7222934207919784056"
},
...
]
}
]
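Each place in the output above nests its reviews under `reviews_data`. A short sketch of flattening that structure into one row per review, assuming `response` is the parsed list shown above. The field names (`autor_name`, `review_rating`, `review_text`, `review_datetime_utc`) are taken from that sample; the helper names and the trimmed `sample` data are illustrative.

```python
from datetime import datetime


def flatten_reviews(response):
    """Turn the nested places/reviews structure into a flat list of rows."""
    rows = []
    for place in response:
        # reviews_data may be missing or null for a place without reviews
        for review in place.get('reviews_data') or []:
            rows.append({
                'place': place.get('name'),
                'author': review.get('autor_name'),
                'rating': review.get('review_rating'),
                'text': review.get('review_text'),
                # The sample output uses month/day/year timestamps in UTC
                'posted_at': datetime.strptime(
                    review['review_datetime_utc'], '%m/%d/%Y %H:%M:%S'),
            })
    return rows


def average_rating(rows):
    """Mean rating over the rows that have one, or None for an empty list."""
    ratings = [r['rating'] for r in rows if r['rating'] is not None]
    return sum(ratings) / len(ratings) if ratings else None


# Trimmed copy of the sample output, with review texts shortened for the example.
sample = [{
    'name': 'Do or Dive Bar',
    'reviews_data': [
        {'autor_name': 'Sam Grjaznovs', 'review_rating': 5,
         'review_text': 'Cozy shin dig with an assortment of drinks.',
         'review_datetime_utc': '10/27/2020 06:43:41'},
        {'autor_name': 'Arabella Stephens', 'review_rating': 5,
         'review_text': 'Great atmosphere, always fun vibe and good beers.',
         'review_datetime_utc': '02/23/2021 20:22:42'},
    ],
}]

rows = flatten_reviews(sample)
print(len(rows), average_rating(rows))
```

Flat rows like these are easier to load into a spreadsheet or a DataFrame than the nested response.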
Python package ► https://pypi.org/project/google-services-api
API ► https://app.outscraper.com/api-docs
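FAQ
Most frequent questions and answers
Thanks to Outscraper's Google Maps Reviews API, it is possible to scrape all Google Maps reviews. Outscraper's API service lets you scrape without any limits.
There is an API service for Google Maps reviews: Outscraper's Google Maps Reviews API. With Outscraper's service, you can export and download all Google Maps reviews.
Reviews can be scraped with Python and Selenium. This is explained in detail in the article "Scrape All Google Reviews with Python".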