Why Selenium?
Beautiful Soup makes web scraping easy by traversing the DOM (Document Object Model). But it handles only static scraping: it parses the HTML that the server returns and cannot execute JavaScript. In other words, we get exactly what appears in "view page source". If the data we are looking for is available in "view page source", then Beautiful Soup (paired with a simple HTTP request) is sufficient for web scraping. But if the data gets rendered only after clicking a JavaScript link, we need dynamic web scraping methods. In such a situation we first use Selenium to automate the browser, click the JavaScript link, wait for the elements to load, and then use Beautiful Soup to extract the elements.
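For example, when the data is already present in the raw HTML, a simple request plus Beautiful Soup is enough. Here is a minimal static-scraping sketch; the URL and the listing-title class are placeholders for illustration only:

# Static scraping sketch: fetch the raw HTML and parse it, no browser needed.
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/listings")   # placeholder URL
soup = BeautifulSoup(response.text, "html.parser")

# "listing-title" is a hypothetical class name used only for illustration.
for title in soup.find_all(class_="listing-title"):
    print(title.get_text(strip=True))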
Step 1: Open the page with Selenium
First we will open our target website with Selenium. A screenshot of the page is given below, and here is the actual link: https://whitepropertymanagement.appfolio.com/listings/
Let us now translate this into actual Python code. First, let's import the required libraries for the program.
# Import required libraries
import time
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.keys import Keys
import pandas as pd
from bs4 import BeautifulSoup
import logging
import pdb
Next, let us install the Chrome driver. For Selenium to work, it must have access to a browser driver. Here, Selenium drives the Chrome browser; the headless option (commented out below) would run Chrome without opening a browser window. We also open Chrome in incognito mode. The implicitly_wait method sets a sticky timeout for implicitly waiting for an element to be found or for a command to complete; it only needs to be called once per session.
# Define options
opts = webdriver.ChromeOptions()
# opts.headless = True             # Open in headless mode.
opts.add_argument('--incognito')   # open in incognito mode

# Install Chrome Driver
driver = webdriver.Chrome(ChromeDriverManager().install(), options=opts)
driver.implicitly_wait(10)
def scroll_to_end(driver):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")  # scroll to the bottom of the page
    time.sleep(10)  # sleep between interactions
    driver.find_element_by_tag_name('body').send_keys(Keys.CONTROL + Keys.HOME)  # CTRL + HOME scrolls back to the top of the page

search_url = "https://whitepropertymanagement.appfolio.com/listings/"  # Specify search URL
driver.get(search_url)
driver.maximize_window()
time.sleep(10)
scroll_to_end(driver)
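Note that the find_element_by_* helpers used above come from the older Selenium 3 API; Selenium 4 removed them in favour of the By locator class. If your installation is Selenium 4, the equivalent calls look roughly like this (a sketch, not a drop-in replacement for the whole script):

# Selenium 4 style driver setup and locators (sketch).
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from webdriver_manager.chrome import ChromeDriverManager

opts = webdriver.ChromeOptions()
opts.add_argument('--incognito')
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=opts)

driver.get("https://whitepropertymanagement.appfolio.com/listings/")
# By.TAG_NAME replaces find_element_by_tag_name
driver.find_element(By.TAG_NAME, 'body').send_keys(Keys.CONTROL + Keys.HOME)
# By.LINK_TEXT replaces find_elements_by_link_text
links = driver.find_elements(By.LINK_TEXT, "View Details")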
Step 2: Extract the corresponding link of each property
If we take a closer look at the webpage, we notice that each property has an element titled "View Details". (When we click on it, it opens another webpage.) Let us find the URL behind this element.
# Find all elements by link text
links = []
try:
    links = driver.find_elements_by_link_text("View Details")
except:
    print('Unexpected error in driver.find_elements_by_link_text("View Details")')

print("Number of Property Listings on page: {}".format(len(links)))

# Store all links in a list
list_of_links_url = []
for link in links:
    link_url = link.get_attribute('href')  # use get_attribute() to get each href
    list_of_links_url.append(link_url)
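The fixed time.sleep() pauses work, but an explicit wait is usually more reliable because it polls until the links are actually present. Here is a sketch, assuming the Selenium 4 style imports mentioned earlier:

# Wait up to 20 seconds for the "View Details" links to appear (sketch).
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

links = WebDriverWait(driver, 20).until(
    EC.presence_of_all_elements_located((By.LINK_TEXT, "View Details"))
)
list_of_links_url = [link.get_attribute('href') for link in links]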
Step 3: Open all links and extract the property address with Beautiful Soup
Once we click on the link "View Details", this is how our page looks:
We can use the driver.get() method to open each of these links, and then extract the property information by XPath or class. This is where we bring in Beautiful Soup to extract the data. In this particular case, empty strings are also collected, so the last piece of logic removes them.
# link_url = ""
addresses_list = []
for link_url in list_of_links_url:
    driver.get(link_url)
    time.sleep(10)

    # Use BeautifulSoup to extract data
    page_source = driver.page_source
    soup = BeautifulSoup(page_source, "html.parser")
    address_result_set = soup.find(class_="fw-normal js-show-title")
    for address in address_result_set:
        text = address.extract().strip()
        print(text)
        addresses_list.append(text)

# Remove empty strings from list of strings
while "" in addresses_list:
    addresses_list.remove("")

driver.quit()  # closes all the browser windows and terminates the WebDriver session
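An alternative to extracting child nodes one by one and then filtering out empty strings is to let Beautiful Soup collapse the element's text directly. A sketch, meant to sit inside the same loop over list_of_links_url:

# Collapse the address element's text in one call (sketch).
soup = BeautifulSoup(driver.page_source, "html.parser")
title_tag = soup.find(class_="fw-normal js-show-title")
if title_tag is not None:
    # get_text(strip=True) joins the element's text nodes and trims whitespace,
    # so no empty strings need to be removed afterwards.
    addresses_list.append(title_tag.get_text(" ", strip=True))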
In the event that you want to use XPath, you can take a look at the following plugins, which give you the absolute XPath of any web element (an example of using one follows the list):
- ChroPath : https://autonomiq.io/deviq-chropath.html
- SelectorsHub: https://selectorshub.com/selectorshub/
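Once a plugin gives you an absolute XPath, it can be passed straight to Selenium. Here is a minimal sketch using the same older-style locator calls as the rest of this post; the path mirrors the address XPath used later in Step 6, but treat it as illustrative since the layout can change:

# Locate an element by absolute XPath (sketch). The trailing /text() from the
# Step 6 path is dropped because Selenium returns elements, not text nodes.
address_element = driver.find_element_by_xpath('/html/body/main/div[2]/div[1]/div[2]/p[1]/span')
print(address_element.text)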
Step 4: Write the Property Addresses to CSV
With the code below, we write the addresses to a pandas DataFrame and from there to a CSV file.
# Write to pandas DataFrame
df_address = pd.DataFrame(addresses_list, columns=['PropertyAddress'])      # add column
df_address['PropertyAddress'] = df_address['PropertyAddress'].str.strip()   # remove whitespace

# Write to a csv file
df_address.to_csv('propertyAddress.csv', index=False)
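If the scraper is run repeatedly, one option is to append each run to the same file instead of overwriting it. A sketch; the header is written only when the file does not exist yet:

# Append to the CSV on repeated runs instead of overwriting it (sketch).
import os

write_header = not os.path.exists('propertyAddress.csv')  # write the header only once
df_address.to_csv('propertyAddress.csv', mode='a', header=write_header, index=False)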
Here is the full program:
# -*- coding: utf-8 -*-
"""
Created on Thu Nov 16 14:16:59 2022

@author: pavan
"""

# Import required libraries
import time
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.keys import Keys
import pandas as pd
from bs4 import BeautifulSoup
import pdb

# Define options
opts = webdriver.ChromeOptions()
# opts.headless = True             # Open in headless mode.
opts.add_argument('--incognito')   # open in incognito mode

# Install Chrome Driver
driver = webdriver.Chrome(ChromeDriverManager().install(), options=opts)
driver.implicitly_wait(1)

def scroll_to_end(driver):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")  # scroll to the bottom of the page
    time.sleep(5)  # sleep between interactions
    driver.find_element_by_tag_name('body').send_keys(Keys.CONTROL + Keys.HOME)  # CTRL + HOME scrolls back to the top of the page
    time.sleep(5)  # sleep between interactions

search_url = "https://whitepropertymanagement.appfolio.com/listings/"  # Specify search URL
driver.get(search_url)
driver.maximize_window()
time.sleep(5)
scroll_to_end(driver)

# Find all elements by link text
links = []
try:
    links = driver.find_elements_by_link_text("View Details")
except:
    print('Unexpected error in driver.find_elements_by_link_text("View Details")')

print("Number of Property Listings on page: {}".format(len(links)))

# Store all links in a list
list_of_links_url = []
for link in links:
    link_url = link.get_attribute('href')  # use get_attribute() to get each href
    list_of_links_url.append(link_url)

# link_url = ""
addresses_list = []
for link_url in list_of_links_url:
    driver.get(link_url)
    time.sleep(5)

    # Use BeautifulSoup to extract data
    page_source = driver.page_source
    soup = BeautifulSoup(page_source, "html.parser")
    address_result_set = soup.find(class_="fw-normal js-show-title")
    for address in address_result_set:
        text = address.extract().strip()
        print(text)
        addresses_list.append(text)

# Remove empty strings from list of strings
while "" in addresses_list:
    addresses_list.remove("")

driver.quit()  # closes all the browser windows and terminates the WebDriver session

# Write to pandas DataFrame
df_address = pd.DataFrame(addresses_list, columns=['PropertyAddress'])      # add column
df_address['PropertyAddress'] = df_address['PropertyAddress'].str.strip()   # remove whitespace

# Write to a csv file
df_address.to_csv('propertyAddress.csv', index=False)
Step 5: Alternate method for web scraping
We can also scrape the property information from the main link itself, instead of clicking on "View Details" and then scraping the page that loads. Let's take a look at scraping directly from the URL: https://whitepropertymanagement.appfolio.com/listings/
Before we go any further, here is the screenshot of the webpage.
And here is the code for scraping:
# Import required libraries
import time
from datetime import datetime
import os
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.keys import Keys
# import Action chains
from selenium.webdriver.common.action_chains import ActionChains
import pandas as pd
from bs4 import BeautifulSoup
import logging
import pdb

# Set the start time of the program
start_time = time.time()

# Prefix datetime to the csv file name and log file name
dt = datetime.today().replace(microsecond=0)
dt_string = dt.strftime("%Y%m%d-%H%M%S")
csv_file_name = dt_string + '-' + 'scraped_data_whiteproperty.csv'
log_file_name = dt_string + '-' + 'log_of_whiteproperty.log'

# Set the output directory
full_file_path = os.path.realpath(__file__)           # the actual file path (during runtime only)
current_directory = os.path.dirname(full_file_path)   # the directory path of the current file
one_folder_back = os.path.normpath(os.getcwd() + os.sep + os.pardir)
dest_directory = one_folder_back + "\\PropertyData\\WhiteProperty\\"
os.chdir(dest_directory)

# Handling the Logger
logger = logging.getLogger(__name__)   # Create the logger
logger.setLevel(logging.INFO)          # set log level
f = logging.Formatter('%(asctime)s - %(levelname)s - %(filename)s - %(message)s')  # Format the logger
fh = logging.FileHandler(log_file_name)  # Set file handler
fh.setFormatter(f)                       # Set the formatting
logger.addHandler(fh)                    # Add filehandler to logger

# Install Chrome Driver
opts = webdriver.ChromeOptions()
opts.headless = True
driver = webdriver.Chrome(ChromeDriverManager().install(), options=opts)
driver.implicitly_wait(1)

# create action chain object
action = ActionChains(driver)

# Specify search URL
search_url = "https://whitepropertymanagement.appfolio.com/listings/"

# =============================================================================
# Write a function to take the cursor to the end of the page.
# This helps us reach the end of the page; the sleep time afterwards avoids
# trying to read elements from the page before they have loaded.
# =============================================================================
def scroll_to_end(driver):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(10)  # sleep between interactions
    driver.find_element_by_tag_name('body').send_keys(Keys.CONTROL + Keys.HOME)  # CTRL + HOME scrolls back to the top of the page

# Main program
driver.get(search_url)
driver.maximize_window()
time.sleep(10)
scroll_to_end(driver)

# =============================================================================
# We need to identify an attribute such as class, id, etc. which is common
# across all these unique properties. Here it is the link text of the button
# labelled "View Details".
# =============================================================================
text_search = driver.find_elements_by_link_text("View Details")
logger.info("Number of Property Listings on page: {}".format(len(text_search)))

# Find all elements by link text
links = []
try:
    links = driver.find_elements_by_link_text("View Details")
except:
    logger.info('Unexpected error in driver.find_elements_by_link_text("View Details")')

list_of_links_url = []
for link in links:
    # get_attribute() to get each href
    link_url = link.get_attribute('href')
    # logger.info(link_url)
    list_of_links_url.append(link_url)

# Use BeautifulSoup to see if data can be extracted
page_source = driver.page_source
soup = BeautifulSoup(page_source, "html.parser")
address_result_set = soup.find_all(class_="u-pad-rm js-listing-address")

addresses_list = []
# This uses address.getText()
for address in address_result_set:
    logger.info(address.getText())  # address.contents could be used instead of address.getText(); it creates a list with len = 6
    addresses_list.append(address.getText())

df_address = pd.DataFrame(addresses_list, columns=['PropertyAddress'])      # add column
df_address['PropertyAddress'] = df_address['PropertyAddress'].str.strip()   # remove whitespace
df_address['DateTime'] = pd.to_datetime(dt)                                 # create new column DateTime
df_address['CompanyName'] = 'White Property Management'
df_address.to_csv(csv_file_name, index=False)

driver.quit()  # closes all the browser windows and terminates the WebDriver session

logger.info("Time taken to run the program: {} seconds".format(time.time() - start_time))
logging.shutdown()  # Orderly shutdown by flushing and closing all handlers
Step 6: Web scraping with only Beautiful Soup
Since we don't need to click the "View Details" link, do we really need Selenium? The answer is no: we can fetch the page with the requests module and extract the property details with Beautiful Soup and lxml. And instead of finding elements by class name, this time we shall use XPath.
This is the web link from which we shall extract the property information: https://whitepropertymanagement.appfolio.com/listings . It is the same URL as in the previous step.
# -*- coding: utf-8 -*-
"""
Created on Fri Sep 3 21:01:41 2021

@author: pavan
"""

# =============================================================================
# Uses an iframe
# Given website: https://www.whitepropertymgmt.com/rental-search/
# The Source : https://whitepropertymanagement.appfolio.com/listings?1630685522050&theme_color=%23676767&filters%5Border_by%5D=date_posted
#
# Methodology: Obtain the source attribute of the iframe, then use the requests
# module and parse its contents
# =============================================================================
import time
import requests
from bs4 import BeautifulSoup
from lxml import html
import pandas as pd
from collections import OrderedDict
import pdb

source_url = 'https://whitepropertymanagement.appfolio.com/listings?1630686283930&theme_color=%23676767&filters%5Border_by%5D=date_posted'

# s = requests.Session()
response = requests.get(source_url)
if response.status_code == 200:
    print('OK! Good to go as response code is : {}'.format(response.status_code))
else:
    print('Boo. Bad Response code from requests: {}'.format(response.status_code))

content = response.text
soup = BeautifulSoup(content, "html.parser")

# extracting all the URLs found within the page's <a> tags:
all_urls_list = []
for link in soup.find_all('a'):
    # print(link.get('href'))
    all_urls_list.append(link.get('href'))

# remove duplicate entries from the above list
all_urls_list_unique = list(OrderedDict.fromkeys(all_urls_list))

# remove items from the list if they don't contain the string /listings/detail
approved_list = ['/listings/detail']
all_urls_list_unique[:] = [url for url in all_urls_list_unique if any(sub in url for sub in approved_list)]
print('# of unique listings in the web page: {}'.format(len(all_urls_list_unique)))

# =============================================================================
# The # of unique listings is a good indicator of the number of distinct
# entities in the iframe. We will use this to pick up all the necessary elements.
# =============================================================================

# Parsing the page
tree = html.fromstring(response.content)

# =============================================================================
# Note on using xpath:
# Reference: https://www.geeksforgeeks.org/web-scraping-using-lxml-and-xpath-in-python/
# The full Xpath for the first price is  '/html/body/main/div[2]/div[1]/div[2]/div/dl/div[1]/dd'
# The full Xpath for the second price is '/html/body/main/div[2]/div[2]/div[2]/div/dl/div[1]/dd'
# --
# --
# The full Xpath for the eighth price is '/html/body/main/div[2]/div[8]/div[2]/div/dl/div[1]/dd'
#
# To retrieve the text from the eighth price: '/html/body/main/div[2]/div[8]/div[2]/div/dl/div[1]/dd/text()'
#
# For web scraping the eighth price, it can be split into three parts:
# part 1: '/html/body/main/div[2]'
# part 2: '/div[8]'
# part 3: '/div[2]/div/dl/div[1]/dd/text()'
#
# Example Code:
# Get element using XPath
# rent = tree.xpath(
#     '/html/body/main/div[2]/div[1]/div[2]/div/dl/div[1]/dd/text()')
# print(rent)
# =============================================================================
address_list = []
address_part1 = '/html/body/main/div[2]'
address_part2 = '/div[i]'  # capture i dynamically as shown below
address_part3 = '/div[2]/p[1]/span/text()'

rent_list = []
rent_part1 = '/html/body/main/div[2]'
rent_part2 = '/div[i]'  # capture i dynamically as shown below
rent_part3 = '/div[2]/div/dl/div[1]/dd/text()'

sft_list = []
sft_part1 = '/html/body/main/div[2]'
sft_part2 = '/div[i]'
sft_part3 = '/div[2]/div/dl/div[2]/dd/text()'

bedbath_list = []
bedbath_part1 = '/html/body/main/div[2]'
bedbath_part2 = '/div[i]'
bedbath_part3 = '/div[2]/div/dl/div[3]/dd/text()'

available_list = []
available_part1 = '/html/body/main/div[2]'
available_part2 = '/div[i]'
available_part3 = '/div[2]/div/dl/div[4]/dd/text()'

# pdb.set_trace()
for i in range(len(all_urls_list_unique)):
    address_part2 = '/div[' + str(i+1) + ']'
    address_path = address_part1 + address_part2 + address_part3
    print('\n', i+1)
    print('Address Path : {}'.format(address_path))
    address_value = tree.xpath(address_path)
    print('Address Value: {}'.format(address_value))

    rent_part2 = '/div[' + str(i+1) + ']'
    rent_path = rent_part1 + rent_part2 + rent_part3
    print('Rent Path : {}'.format(rent_path))
    rent_value = tree.xpath(rent_path)
    print('Rent Value : {}'.format(rent_value))

    sft_part2 = '/div[' + str(i+1) + ']'
    sft_path = sft_part1 + sft_part2 + sft_part3
    print('SFT Path : {}'.format(sft_path))
    sft_value = tree.xpath(sft_path)
    print('SFT Value: {}'.format(sft_value))

    bedbath_part2 = '/div[' + str(i+1) + ']'
    bedbath_path = bedbath_part1 + bedbath_part2 + bedbath_part3
    print('Bed/Bath path: {}'.format(bedbath_path))
    bedbath_value = tree.xpath(bedbath_path)
    print('Bed Bath value: {}'.format(bedbath_value))

    available_part2 = '/div[' + str(i+1) + ']'
    available_path = available_part1 + available_part2 + available_part3
    print('Available path: {}'.format(available_path))
    available_value = tree.xpath(available_path)
    print('Available value: {}'.format(available_value))

    address_list.append(address_value)
    rent_list.append(rent_value)
    sft_list.append(sft_value)
    bedbath_list.append(bedbath_value)
    available_list.append(available_value)

# my_dict = {'Address': address_list, 'Rent': rent_list}
# df = pd.DataFrame(my_dict)
df_address = pd.DataFrame(address_list, columns=['Address'])
df_rent = pd.DataFrame(rent_list, columns=['Rent'])
df_sft = pd.DataFrame(sft_list, columns=['SFT'])
df_bedbath = pd.DataFrame(bedbath_list, columns=['Bed/Bath'])
df_available = pd.DataFrame(available_list, columns=['Available'])

df_res1 = df_address.join(df_rent, how='outer')
df_res2 = df_sft.join(df_bedbath, how='outer')
df_res3 = df_res1.join(df_res2, how='outer')
df_final = df_res3.join(df_available, how='outer')

df_final.to_csv('scraped_data_whiteproperty.csv', index=False)
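Absolute XPaths like the ones above break as soon as the page layout changes. A more resilient variation of the same idea is to key off the class names that the listing cards carry, as Step 5 did with js-listing-address. Here is a sketch; only the js-listing-address class comes from the earlier code, the rest is plain requests plus Beautiful Soup:

# Class-based extraction with requests + Beautiful Soup (sketch).
import requests
from bs4 import BeautifulSoup

response = requests.get('https://whitepropertymanagement.appfolio.com/listings/')
soup = BeautifulSoup(response.text, "html.parser")

# js-listing-address is the class used in Step 5 for the address block.
addresses = [tag.get_text(" ", strip=True)
             for tag in soup.find_all(class_="js-listing-address")]
print(addresses)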