scrpae google maps by selenium
Controlling Chrome browser with Selenium

Using Selenium in data scraping is quite a common thing. It automates browsers and allows you to simulate real user activities to avoid getting blocked. It is specially helpful while scraping JavaScript dynamically generated sites like Google. Moreover, it helps to avoid any kind of crawling protection as it basically simulates real clients' behavior.

In this article, you will see a step-by-step tutorial of developing a scraper that will be able to parse places from Google Maps without getting blocked even after 1000+ pages.

What We'll Need to Build The Crawler

  1. Python 3+.
  2. Chrome browser installed.
  3. Selenium 3.141.0+ (python package).
  4. Chrome Driver (for your OS).
  5. Parsel or any other library to extract data from HTML like Beautiful Soup.

Step by Step Tutorial to Build The Google Maps Extractor

Download and Save ChromeDriver

To use Selenium with Google Chrome you will need to link python code to the browser by using ChromeDriver.

Download the version of ChromeDriver that matches your browser version and OS type. Your Chrome browser version number can be found here: Chrom -> Menu icon (upper right corner) -> Help -> About Google Chrome.

Unarchive chromedriver file and save it somewhere on your system (path to it we will use later on). In this tutorial example, we store the file in the project folder.

Install Selenium And Parsel Packages

Install Selenium and Parsel packages by running the following commands. We will use Parsel later when we will parse content from HTML.

					pip install selenium
pip install parsel # to extract data from HTML using XPath or CSS selectors

Initialize And Start Webdriver

Before initializing Webdriver make sure you did the previous steps and you have the path to your chromedriver file. Initialize the driver by the following code. You should see the new Chrome window opens.

					from selenium import webdriver

chromedrive_path = './chromedriver' # use the path to the driver you downloaded from previous steps
driver = webdriver.Chrome(chromedrive_path)

On mac you might see the following: "chromedriver cannot be opened because the developer cannot be verified". To overcome this control-click the chromedriver in Finder, choose Open from the menu, and then click Open in the dialog that appears. You should see "ChromeDriver was started successfully" in the opened terminal windows. Close it and after this, you will be able to start ChromeDriver from your code.

Download Google Maps Page

Once you start the driver you are ready to open some pages. To open any page, use the "get" command.

					url = ',+USA/@40.7443439,-74.0197995,13z'

Parse Maps Search Content

Once your page is opened you will see the page in your browser window that is controlled by your python code. You can run the following code to get the HTML page content from ChromeDriver.

					page_content = driver.page_source

To comfortably see the HTML content, open the developer console in Chrome by opening the Chrome Menu in the upper-right-hand corner of the browser window and selecting More Tools > Developer Tools. Now you should be able to see elements of your page.

google maps HTML parsing
Finding xpath to elements we want to get with developer console

You can parse the content from the HTML page by using your favorite parsing tools. We'll use Parsel in this tutorial.

					from parsel import Selector

response = Selector(page_content)

Iterate over listings and get each place's data.

					results = []

for el in response.xpath('//div[contains(@aria-label, "Results for")]/div/div[./a]'):
        'link': el.xpath('./a/@href').extract_first(''),
        'title': el.xpath('./a/@aria-label').extract_first('')

Output from Google Maps places (shortened).

    'link': '!4m5!3m4!1s0x89c259acf2c7299d:0x149a07481483ce!8m2!3d40.7542649!4d-73.990364?authuser=0&hl=en&rclk=1',
    'title': 'Black Iron Burger'
    'link': '!4m5!3m4!1s0x89c259846e633763:0x69420cb6024065f9!8m2!3d40.723028!4d-73.989794?authuser=0&hl=en&rclk=1',
    'title': 'Fools Gold NYC'
    'link': '!4m5!3m4!1s0x89c25976492b11ff:0x14f9cacdef3eb7c1!8m2!3d40.7280546!4d-73.9812525?authuser=0&hl=en&rclk=1',
    'title': '11th St. Bar'

Stop the Driver Once The Process of Extraction is Finished

It's important to start and stop the driver before and after the scraping accordingly. It's the same as you would open and close your browser before and after surfing the internet. Close the driver by running the following code.


Conclusion And Recommendations on Scaling And Making The Scraper More Robust

Despite the tricky HTML structure of Google Maps, with Selenium and a good knowledge of XPath and CSS selectors, you can achieve quite good results in scraping. This method of using a browser emulator should protect you from getting blocked. However, if you are going to scale your application you might consider using proxies as well in order to avoid some unexpected troubles.

Running in Parallel

It's possible to run drivers in multiprocessing (not multithreading) but each driver will consume one CPU. Make sure you have enough of them.

The Ultimate Way of Scraping Google Maps at Enterprise Scale

As scraping Google might be quite a challenge, many companies and big enterprises want to start scraping millions of pages without spending time on developing and maintaining their own crawlers.

The easiest way of getting started with scraping Google Maps is using Outscraper Platform (to get results in CSV files), API, or SDKs (to integrate into code).

The Easiest Way of Getting Started With Google Maps Scraping

Install Outscraper's SDK by running the following command.

					pip install google-services-api

Initialize the client and search for listings on Google or parse specific ones by sending place IDs.

* you can get API token (SECRET_API_KEY) from the Profile page.

					from outscraper import ApiClient

api_client = ApiClient(api_key='SECRET_API_KEY')

# Search for businesses in specific locations:
result = api_client.google_maps_search('restaurants brooklyn usa', limit=20, language='en')

# Get data of the specific place by id
result = api_client.google_maps_search('ChIJrc9T9fpYwokRdvjYRHT8nI4', language='en')

# Search with many queries (batching)
result = api_client.google_maps_search([
    'restaurants brooklyn usa',
    'bars brooklyn usa',
], language='en')


Result output (shortened).

      "name": "Colonie",
      "full_address": "127 Atlantic Ave, Brooklyn, NY 11201",
      "borough": "Brooklyn Heights",
      "street": "127 Atlantic Ave",
      "city": "Brooklyn",
      "postal_code": "11201",
      "country_code": "US",
      "country": "United States of America",
      "us_state": "New York",
      "state": "New York",
      "plus_code": null,
      "latitude": 40.6908464,
      "longitude": -73.9958422,
      "time_zone": "America/New_York",
      "popular_times": null,
      "site": "",
      "phone": "+1 718-855-7500",
      "type": "American restaurant",
      "category": "restaurants",
      "subtypes": "American restaurant, Cocktail bar, Italian restaurant, Organic restaurant, Restaurant, Wine bar",
      "posts": null,
      "rating": 4.6,
      "reviews": 666,
      "reviews_data": null,
      "photos_count": 486,
      "google_id": "0x89c25a4590b8c863:0xc4a4271f166de1e2",
      "place_id": "ChIJY8i4kEVawokR4uFtFh8npMQ",
      "reviews_link": "",
      "reviews_id": "-4277250731621359134",
      "photo": "",
      "street_view": "",
      "working_hours_old_format": "Monday: 5\\u20139:30PM | Tuesday: Closed | Wednesday: Closed | Thursday: 5\\u20139:30PM | Friday: 5\\u20139:30PM | Saturday: 11AM\\u20133PM,5\\u20139:30PM | Sunday: 11AM\\u20133PM,5\\u20139:30PM",
      "working_hours": {
        "Monday": "5\\u20139:30PM",
        "Tuesday": "Closed",
        "Wednesday": "Closed",
        "Thursday": "5\\u20139:30PM",
        "Friday": "5\\u20139:30PM",
        "Saturday": "11AM\\u20133PM,5\\u20139:30PM",
        "Sunday": "11AM\\u20133PM,5\\u20139:30PM"
      "business_status": "OPERATIONAL",
      "reserving_table_link": "",
      "booking_appointment_link": "",
      "owner_id": "114275131377272904229",
      "verified": true,
      "owner_title": "Colonie",
      "owner_link": "",
      "location_link": ",-73.9958422,14z/data=!4m8!1m2!2m1!1sColonie!3m4!1s0x89c25a4590b8c863:0xc4a4271f166de1e2!8m2!3d40.6908464!4d-73.9958422"

Extra: The Easiest Way of Getting Started With Google Reviews Scraping

You can extract reviews by running the following code (assumed you have installed the python package and initiated the client).

					# Get reviews of the specific place by id
result = api_client.google_maps_reviews('ChIJrc9T9fpYwokRdvjYRHT8nI4', reviewsLimit=20, language='en')

# Get reviews for places found by search query
result = api_client.google_maps_reviews('Memphis Seoul brooklyn usa', reviewsLimit=20, limit=500, language='en')

# Get only new reviews during last 24 hours
from datetime import datetime, timedelta
yesterday_timestamp = int(( - timedelta(1)).timestamp())

result = api_client.google_maps_reviews(
    'ChIJrc9T9fpYwokRdvjYRHT8nI4', sort='newest', cutoff=yesterday_timestamp, reviewsLimit=100, language='en')

Result output (shortened).

  "name": "Memphis Seoul",
  "address": "569 Lincoln Pl, Brooklyn, NY 11238, \\u0421\\u043f\\u043e\\u043b\\u0443\\u0447\\u0435\\u043d\\u0456 \\u0428\\u0442\\u0430\\u0442\\u0438",
  "address_street": "569 Lincoln Pl",
  "owner_id": "100347822687163365487",
  "owner_link": "",
  "reviews_data": [
      "google_id": "0x89c25bb5950fc305:0x330a88bf1482581d",
      "autor_link": "",
      "autor_name": "Eliott Levy",
      "autor_id": "112314095435657473333",
      "review_text": "Very good local comfort fusion food ! \\nKimchi coleslaw !! Such an amazing idea !",
      "review_link": "!4m5!14m4!1m3!1m2!1s112314095435657473333!2s0x0:0x330a88bf1482581d?hl=en-US",
      "review_rating": 5,
      "review_timestamp": 1560692128,
      "review_datetime_utc": "06/16/2019 13:35:28",
      "review_likes": null
      "google_id": "0x89c25bb5950fc305:0x330a88bf1482581d",
      "autor_link": "",
      "autor_name": "fenwar1",
      "autor_id": "106144075337788507031",
      "review_text": "Great wings with several kinds of hot sauce. The mac and cheese ramen is excellent.\\nUPDATE:\\nReturned later to try the meatloaf slider, a thick meaty slice  topped with slaw and a fantastic sauce- delicious. \\nConsider me a regular.\\ud83d\\udc4d",
      "review_link": "!4m5!14m4!1m3!1m2!1s106144075337788507031!2s0x0:0x330a88bf1482581d?hl=en-US",
      "review_rating": 5,
      "review_timestamp": 1571100055,
      "review_datetime_utc": "10/15/2019 00:40:55",
      "review_likes": null


Project Manager Linkedin

0 تعليقات

اترك تعليقاً

عنصر نائب للصورة الرمزية