16 November 2022

Web Scraping of JavaScript Pages using a combination of Selenium & Beautiful Soup

Why Selenium?

Beautiful Soup makes web scraping easy by traversing the DOM (Document Object Model). But it handles only static scraping and cannot execute JavaScript. Essentially, Beautiful Soup parses the HTML the server returns, without the help of a browser, so we get exactly what we see in "view page source". If the data we are looking for is available in "view page source", then Beautiful Soup is sufficient for web scraping. But if we need data that gets rendered only upon clicking a JavaScript link, then we need a dynamic web scraping method. In such a situation we first use Selenium to automate the browser, click on the JavaScript link and wait for the elements to load, and then use Beautiful Soup to extract the elements.
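
As a quick sanity check (a minimal sketch, not part of the original scraper), you can fetch the page with plain requests and see whether the text you care about is already present in the raw HTML; if it is not, you will need a browser-driven approach like the one below. The marker string "View Details" is the link text used later in this post.

# Static-vs-dynamic check (assumes the requests library is installed)
import requests

url = "https://whitepropertymanagement.appfolio.com/listings/"
raw_html = requests.get(url, timeout=30).text

# If the marker text is missing from the raw HTML, the content is rendered by
# JavaScript and static scraping alone will not see it.
print("View Details" in raw_html)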

Step 1: Open the page with selenium

First we will open our target web page with Selenium. Here is the link: https://whitepropertymanagement.appfolio.com/listings/


Let us now translate this into actual Python code. First, let's import the required libraries for the program.

# Import required libraries
import time
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.keys import Keys
import pandas as pd
from bs4 import BeautifulSoup
import pdb

Next, let us install the Chrome driver. For Selenium to work, it must have access to the browser driver; here, webdriver_manager downloads and installs it for us. If the headless option is enabled (it is commented out below), Selenium drives Chrome without actually opening a browser window. We also open the browser in incognito mode. The implicitly_wait method sets a sticky timeout to implicitly wait for an element to be found, or a command to complete. This method only needs to be called once per session.

# Define options
opts=webdriver.ChromeOptions()
# opts.headless=True # Open in headless mode. 
opts.add_argument('--incognito') # open in incognito mode

# Install Chrome Driver
driver = webdriver.Chrome(ChromeDriverManager().install() ,options=opts)
driver.implicitly_wait(10)
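
Note that the code in this post uses the Selenium 3 style API (find_element_by_* helpers, and passing the driver path straight to webdriver.Chrome). If you are on Selenium 4, where those helpers were removed, the equivalent setup looks roughly like the sketch below; this is an assumption about your environment, not part of the original script.

# Selenium 4 style setup (a sketch; assumes selenium>=4 and webdriver_manager are installed)
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

opts = webdriver.ChromeOptions()
opts.add_argument('--incognito')   # open in incognito mode
# opts.add_argument('--headless')  # uncomment to run without a visible window

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=opts)
driver.implicitly_wait(10)

# Locators go through the By class instead of the find_element_by_* helpers, e.g.:
# driver.find_elements(By.LINK_TEXT, "View Details")
# driver.find_element(By.TAG_NAME, "body")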

Next, let us define a function that scrolls to the bottom of the page, waits for 10 seconds, and then scrolls back to the top of the page. We will call this function after we open our web page.


def scroll_to_end(driver):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);") # scroll to the bottom of the page
    time.sleep(10) # sleep_between_interactions
    driver.find_element_by_tag_name('body').send_keys(Keys.CONTROL + Keys.HOME) # use CTRL + HOME keys. It will scroll to the top of the page
    

search_url = "https://whitepropertymanagement.appfolio.com/listings/" # Specify search URL
driver.get(search_url)
driver.maximize_window()
time.sleep(10)
scroll_to_end(driver)
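
Instead of fixed time.sleep() calls, you can wait explicitly for the listing links to appear, which is usually faster and more robust. A minimal sketch, assuming Selenium's support module and the same "View Details" link text used in the next step:

# Explicit wait sketch: block until the "View Details" links are present (or 20 seconds elapse)
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 20)
links = wait.until(EC.presence_of_all_elements_located((By.LINK_TEXT, "View Details")))
print("Number of Property Listings on page: {}".format(len(links)))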

Step 2: Extract corresponding link of each property

If you take a closer look at the web page, you will notice that each property has a link labelled "View Details". (When we click on it, another web page opens.) Let us find the URL behind each of these links.

# Find all elements by link text
try:
    links = driver.find_elements_by_link_text("View Details")
except:
    print('Unexpected error in driver.find_elements_by_link_text("View Details")')
    
print("Number of Property Listings on page: {}".format(len(links)))

# Store all links in a list
list_of_links_url = []
for link in links:
   link_url =  link.get_attribute('href') # use get_attribute() to get all href
   list_of_links_url.append(link_url)
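
The same collection step can also be written more compactly, and deduplicated in case a listing renders the link twice; a sketch equivalent to the loop above:

# Compact equivalent of the loop above, with duplicates removed while preserving order
list_of_links_url = list(dict.fromkeys(link.get_attribute('href') for link in links))
print("Collected {} unique property URLs".format(len(list_of_links_url)))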

Step 3: Open all links and extract the property address with Beautiful Soup

Once we click on a "View Details" link, a details page opens for that property.



We can use the driver.get() method to open each of these links, and then extract the property information using an XPath or a class. This is where we bring in Beautiful Soup to extract the data. In this particular case, empty strings are collected as well, so the last piece of logic removes the empty strings from the list.

# link_url = ""
addresses_list = []
for link_url in list_of_links_url:
   driver.get(link_url)
   time.sleep(10)
   # Use BeautifulSoup to extract data
   page_source = driver.page_source
   soup = BeautifulSoup(page_source, "html.parser")
   address_result_set = soup.find(class_ = "fw-normal js-show-title")
   for address in address_result_set:
       print(address.extract().strip())
       addresses_list.append(address.extract().strip())

# Remove empty strings from list of strings
while("" in addresses_list):
    addresses_list.remove("")
   
driver.quit() #closes all the browser windows and terminates the WebDriver session.
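
If the .extract() calls in the loop above feel opaque, the inner extraction can be written a little more simply; a sketch of the loop body, assuming each details page still carries the same "fw-normal js-show-title" class:

# Simpler loop body sketch: take the tag's stripped text instead of iterating over its children
address_tag = soup.find(class_="fw-normal js-show-title")
if address_tag is not None:
    addresses_list.append(address_tag.get_text(strip=True))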

In the event that you want to use XPath instead, your browser can hand you the path directly: in Chrome DevTools, right-click an element and choose Copy > Copy XPath (or Copy full XPath). There are also browser plugins that give you the absolute XPath of any web element.
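
For example, with an XPath in hand, Selenium can locate the element directly inside the Step 3 loop (before driver.quit() is called). The expression below is a hypothetical placeholder built around the js-show-title class used above, not one copied from the live page:

# Hypothetical XPath lookup (Selenium 3 style, to match the rest of this post)
address_element = driver.find_element_by_xpath("//*[contains(@class, 'js-show-title')]")
print(address_element.text.strip())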

Step 4: Write the Property Addresses to csv

With the code below, we load the addresses into a pandas DataFrame and from there write them to a CSV file.

# Write to pandas DataFrame
df_address = pd.DataFrame(addresses_list, columns = ['PropertyAddress']) # add column
df_address['PropertyAddress'] = df_address['PropertyAddress'].str.strip() # remove whitespaces

# Write to a csv file
df_address.to_csv('propertyAddress.csv', index=False)
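
As an optional sanity check, you can read the file straight back with pandas and confirm that the number of rows matches the number of listings you scraped:

# Optional check: read the CSV back and inspect it
df_check = pd.read_csv('propertyAddress.csv')
print(df_check.shape)
print(df_check.head())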

Here is the full program: 

# -*- coding: utf-8 -*-
"""
Created on Thu Nov 16 14:16:59 2022

@author: pavan
"""
# Import required libraries
import time
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.keys import Keys
import pandas as pd
from bs4 import BeautifulSoup
import pdb

# Define options
opts=webdriver.ChromeOptions()
# opts.headless=True # Open in headless mode. 
opts.add_argument('--incognito') # open in incognito mode

# Install Chrome Driver
driver = webdriver.Chrome(ChromeDriverManager().install() ,options=opts)
driver.implicitly_wait(1)

def scroll_to_end(driver):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);") # scroll to the bottom of the page
    time.sleep(5) # sleep_between_interactions
    driver.find_element_by_tag_name('body').send_keys(Keys.CONTROL + Keys.HOME) # use CTRL + HOME keys. It will scroll to the top of the page
    time.sleep(5) # sleep_between_interactions
    

search_url = "https://whitepropertymanagement.appfolio.com/listings/" # Specify search URL
driver.get(search_url)
driver.maximize_window()
time.sleep(5)
scroll_to_end(driver)

# Find all elements by link text
try:
    links = driver.find_elements_by_link_text("View Details")
except:
    print('Unexpected error in driver.find_elements_by_link_text("View Details")')
    
print("Number of Property Listings on page: {}".format(len(links)))

# Store all links in a list
list_of_links_url = []
for link in links:
   link_url =  link.get_attribute('href') # use get_attribute() to get all href
   list_of_links_url.append(link_url)
   
# link_url = ""
addresses_list = []
for link_url in list_of_links_url:
   driver.get(link_url)
   time.sleep(5)
   # Use BeautifulSoup to extract data
   page_source = driver.page_source
   soup = BeautifulSoup(page_source, "html.parser")
   address_result_set = soup.find(class_ = "fw-normal js-show-title")
   for address in address_result_set:
       print(address.extract().strip())
       addresses_list.append(address.extract().strip())

# Remove empty strings from list of strings
while("" in addresses_list):
    addresses_list.remove("")
  
driver.quit() #closes all the browser windows and terminates the WebDriver session.

# Write to pandas DataFrame
df_address = pd.DataFrame(addresses_list, columns = ['PropertyAddress']) # add column
df_address['PropertyAddress'] = df_address['PropertyAddress'].str.strip() # remove whitespaces

# Write to a csv file
df_address.to_csv('propertyAddress.csv', index=False)

Step 5: Alternate method for web scraping

We can also scrape the property information from the main listings page itself, instead of clicking "View Details" and then scraping the page that loads afterwards. Let's take a look at scraping directly from the URL: https://whitepropertymanagement.appfolio.com/listings/

Here is the code for scraping:
 
# Import required libraries
import time
from datetime import datetime
import os
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.keys import Keys
# import Action chains 
from selenium.webdriver.common.action_chains import ActionChains
import pandas as pd
from bs4 import BeautifulSoup
import logging
import pdb

# Set the start time of the program
start_time = time.time()

# Prefix datetime to the csv file name and log file name
dt = datetime.today().replace(microsecond=0)
dt_string = dt.strftime("%Y%m%d-%H%M%S")
csv_file_name = dt_string + '-' + 'scraped_data_whiteproperty.csv'
log_file_name = dt_string + '-' + 'log_of_whiteproperty.log'

# Set the output directory
full_file_path = os.path.realpath(__file__) # the actual file path (during runtime only)
current_directory = os.path.dirname(full_file_path) # the directory path of the current file
one_folder_back = os.path.normpath(os.getcwd() + os.sep + os.pardir)
dest_directory = one_folder_back + "\\PropertyData\\WhiteProperty\\"
os.chdir(dest_directory)

# Handling the Logger
logger = logging.getLogger(__name__) # Create the logger
logger.setLevel(logging.INFO) # set log level
f = logging.Formatter('%(asctime)s - %(levelname)s - %(filename)s - %(message)s') # Format the logger 
fh = logging.FileHandler(log_file_name) # Set file handler
fh.setFormatter(f) # Set the formatting
logger.addHandler(fh) # Add filehandler to logger

# Install Chrome Driver
opts=webdriver.ChromeOptions()
opts.headless=True

driver = webdriver.Chrome(ChromeDriverManager().install() ,options=opts)
driver.implicitly_wait(1)

# create action chain object
action = ActionChains(driver)

# Specify search URL
search_url = "https://whitepropertymanagement.appfolio.com/listings/"

# =============================================================================
# Write a function to take the cursor to the end of the page.
# Scrolling helps us reach the end of the page, and the sleep that follows
# ensures we do not try to read elements that have not loaded yet.
# =============================================================================

def scroll_to_end(driver):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(10)#sleep_between_interactions
    driver.find_element_by_tag_name('body').send_keys(Keys.CONTROL + Keys.HOME) #use CTRL + HOME keys. It will scroll to the top of the page

# Main program
driver.get(search_url)
driver.maximize_window()
time.sleep(10)
scroll_to_end(driver)

# =============================================================================
# We need to identify an attribute such as a class, id, etc. that is common
# across all these unique properties. Here it is the link text of the button
# labelled "View Details".
# =============================================================================

# Find all elements by link text
try:
    links = driver.find_elements_by_link_text("View Details")
except:
    logger.info('Unexpected error in driver.find_elements_by_link_text("View Details")')

logger.info("Number of Property Listings on page: {}".format(len(links)))
    
list_of_links_url = []
for link in links:
   # get_attribute() to get all href
   link_url =  link.get_attribute('href')
   # logger.info(link_url)
   list_of_links_url.append(link_url)

# Use BeautifulSoup to see if data can be  extracted
page_source = driver.page_source
soup = BeautifulSoup(page_source, "html.parser")

address_result_set = soup.find_all(class_ = "u-pad-rm js-listing-address")
addresses_list = []

# This uses address.getText()
for address in address_result_set:
    logger.info(address.getText()) # also can use address.contents instead of address.getText(). It creates a list with len = 6    
    addresses_list.append(address.getText())
    
df_address = pd.DataFrame(addresses_list, columns = ['PropertyAddress']) # add column
df_address['PropertyAddress'] = df_address['PropertyAddress'].str.strip() # remove whitespaces

df_address['DateTime'] = pd.to_datetime(dt) # create new column DateTime
df_address['CompanyName'] = 'White Property Management'
df_address.to_csv(csv_file_name, index=False)

driver.quit() #closes all the browser windows and terminates the WebDriver session.
logger.info("Time taken to run the program: {} seconds".format(time.time() - start_time))
logging.shutdown() # Orderly shutdown by flushing and closing all handlers

Step 6: Web scraping with only Beautiful Soup

Since we don't need to click the "View Details" link, do we really need Selenium? The answer is no: we can fetch the page with requests and use Beautiful Soup and lxml to extract the property details that we need. And instead of finding elements by class name, we shall use XPath.
This is the link from which we shall extract the property information: https://whitepropertymanagement.appfolio.com/listings . It is the same URL as in the previous step.

# -*- coding: utf-8 -*-
"""
Created on Fri Sep  3 21:01:41 2021

@author: pavan
"""
# =============================================================================
# Uses an iframe 
# Given website: https://www.whitepropertymgmt.com/rental-search/
# The Source   : https://whitepropertymanagement.appfolio.com/listings?1630685522050&theme_color=%23676767&filters%5Border_by%5D=date_posted
# 
# Methodology: Obtain the source attribute of the iframe and then use requests module and parse its contents
# =============================================================================

import time
import requests
from bs4 import BeautifulSoup
from lxml import html
import pandas as pd
from collections import OrderedDict
import pdb

source_url = 'https://whitepropertymanagement.appfolio.com/listings?1630686283930&theme_color=%23676767&filters%5Border_by%5D=date_posted'

# s = requests.Session()
response = requests.get(source_url)

if response.status_code == 200:
    print ('OK! Good to go as response code is : {}'.format(response.status_code))
else:
    print ('Boo. Bad Response code from requests: {}'.format(response.status_code))

content = response.text
soup = BeautifulSoup(content, "html.parser")

# extracting all the URLs found within a page’s <a> tags:
all_urls_list = []
for link in soup.find_all('a'):
    # print(link.get('href'))
    all_urls_list.append(link.get('href'))
    
# remove duplicate entries from the above list
all_urls_list_unique = list(OrderedDict.fromkeys(all_urls_list))

# remove items from the list if they don't contain the string /listings/detail
approved_list = ['/listings/detail']
all_urls_list_unique[:] = [url for url in all_urls_list_unique if any(sub in url for sub in approved_list)]
print('# of unique listings in the web page: {}'.format(len(all_urls_list_unique)))

# =============================================================================
# The # of unique listings  is a good indicator of the number of distinct entities in the iframe.
# We will use this to pick up all the necessary elements
# =============================================================================
 
# Parsing the page
tree = html.fromstring(response.content)

# =============================================================================
# 
#        Note on using xpath:
#        Reference: https://www.geeksforgeeks.org/web-scraping-using-lxml-and-xpath-in-python/
# The full Xpath for the first price is '/html/body/main/div[2]/div[1]/div[2]/div/dl/div[1]/dd'
# The full Xpath for the second price is '/html/body/main/div[2]/div[2]/div[2]/div/dl/div[1]/dd'
# --
# --
# The full Xpath for the eighth price is '/html/body/main/div[2]/div[8]/div[2]/div/dl/div[1]/dd'
# 
# To retrieve the text from the eighth price: '/html/body/main/div[2]/div[8]/div[2]/div/dl/div[1]/dd/text()'
# 
# For web scraping the eighth price, it can be split into three parts:
# part 1: '/html/body/main/div[2]'
# part 2: '/div[8]'
# part 3: '/div[2]/div/dl/div[1]/dd/text()'
# 
# Example Code:
# Get element using XPath
# rent = tree.xpath(
#     '/html/body/main/div[2]/div[1]/div[2]/div/dl/div[1]/dd/text()')
# print(rent)
# =============================================================================

address_list = []
address_part1 = '/html/body/main/div[2]'
address_part2 = '/div[i]' # capture i dynamically as shown below
address_part3 = '/div[2]/p[1]/span/text()'

rent_list = []
rent_part1 = '/html/body/main/div[2]'
rent_part2 = '/div[i]' # capture i dynamically as shown below
rent_part3 = '/div[2]/div/dl/div[1]/dd/text()'

sft_list = []
sft_part1 = '/html/body/main/div[2]'
sft_part2 = '/div[i]'
sft_part3 = '/div[2]/div/dl/div[2]/dd/text()'

bedbath_list = []
bedbath_part1 = '/html/body/main/div[2]'
bedbath_part2 = '/div[i]'
bedbath_part3 = '/div[2]/div/dl/div[3]/dd/text()'

available_list = []
available_part1 = '/html/body/main/div[2]'
available_part2 = '/div[i]'
available_part3 = '/div[2]/div/dl/div[4]/dd/text()'
# pdb.set_trace()
for i in range(len(all_urls_list_unique)):
    address_part2 = '/div[' + str(i+1) + ']'
    address_path = address_part1 + address_part2 + address_part3
    print('\n', i+1)
    print('Address Path : {}'.format(address_path))
    address_value = tree.xpath(address_path)
    print('Address Value: {}'.format(address_value))
    
    rent_part2 = '/div[' + str(i+1) + ']'
    rent_path = rent_part1 + rent_part2 + rent_part3
    print('Rent Path : {}'.format(rent_path))
    rent_value =  tree.xpath(rent_path)
    print('Rent Value : {}'.format(rent_value))
    
    sft_part2 = '/div[' + str(i+1) + ']'
    sft_path = sft_part1 + sft_part2 + sft_part3
    print('SFT Path : {}'.format(sft_path))
    sft_value = tree.xpath(sft_path)
    print('SFT Value: {}'.format(sft_value))
    
    bedbath_part2 = '/div[' + str(i+1) + ']'
    bedbath_path = bedbath_part1 + bedbath_part2 + bedbath_part3
    print ('Bed/Bath path: {}'.format(bedbath_path))
    bedbath_value = tree.xpath(bedbath_path)
    print('Bed Bath value: {}'.format(bedbath_value))
    
    available_part2 = '/div[' + str(i+1) + ']'
    available_path = available_part1 + available_part2 + available_part3
    print ('Available path: {}'.format(available_path))
    available_value = tree.xpath(available_path)
    print('Available value: {}'.format(available_value))
    
    address_list.append(address_value)
    rent_list.append(rent_value)
    sft_list.append(sft_value)
    bedbath_list.append(bedbath_value)
    available_list.append(available_value)

# my_dict = {'Address': address_list, 'Rent': rent_list}        
# df = pd.DataFrame(my_dict)
df_address = pd.DataFrame(address_list, columns = ['Address'])
df_rent = pd.DataFrame(rent_list, columns = ['Rent'])
df_sft = pd.DataFrame(sft_list, columns = ['SFT'])
df_bedbath = pd.DataFrame(bedbath_list, columns = ['Bed/Bath'])
df_available = pd.DataFrame(available_list, columns = ['Available'])

df_res1 = df_address.join(df_rent, how='outer')
df_res2 = df_sft.join(df_bedbath, how='outer')

df_res3 = df_res1.join(df_res2, how='outer')

df_final = df_res3.join(df_available, how='outer')
df_final.to_csv('scraped_data_whiteproperty.csv', index=False)
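
The comment block at the top of this script mentions that the AppFolio URL is really the src of an iframe embedded in https://www.whitepropertymgmt.com/rental-search/. If you would rather discover that URL at runtime instead of hard-coding it, and prefer class-based selectors over the brittle absolute XPaths above, here is a minimal sketch (assuming the listings iframe is the first iframe on the parent page and that the js-listing-address class from Step 5 is still present):

# Sketch: discover the iframe src from the parent page, then scrape addresses by class
import requests
from bs4 import BeautifulSoup

parent_url = 'https://www.whitepropertymgmt.com/rental-search/'
parent_soup = BeautifulSoup(requests.get(parent_url, timeout=30).text, 'html.parser')

iframe = parent_soup.find('iframe')  # assumes the listings iframe is the first one on the page
listings_url = iframe['src'] if iframe is not None else None
print('Listings iframe src: {}'.format(listings_url))

if listings_url:
    listings_soup = BeautifulSoup(requests.get(listings_url, timeout=30).text, 'html.parser')
    # "js-listing-address" is the same class used in Step 5
    addresses = [tag.get_text(strip=True)
                 for tag in listings_soup.find_all(class_='js-listing-address')]
    print(addresses)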
