How do you scrape data from a website using python?
How to scrape websites?
First, let's install the required modules:
pip install -U selenium
pip install webdriver-manager
pip install beautifulsoup4
You might be asking: what are these modules?
Here's the rundown.
selenium lets us automate browsers. It opens a browser on your device that your Python script can control.
webdriver-manager is a helper module for selenium. Selenium needs a "driver" binary to interface with your chosen browser, and the Selenium docs link to pages where you can download these drivers manually. webdriver-manager sidesteps that chore by downloading and caching the right driver automatically.
And finally, beautifulsoup4 is our chosen HTML parser.
Next, create a new file called main.py.
You can do this manually or by running the command below (Linux/macOS only):
touch main.py
Now add these lines to the file:
from selenium import webdriver
from selenium.webdriver.firefox.service import Service as FirefoxService
from webdriver_manager.firefox import GeckoDriverManager
driver = webdriver.Firefox(service=FirefoxService(GeckoDriverManager().install()))
Choose your URL
First, let's pick a site to scrape. The example here is czone.com.pk, a Pakistani computer retailer.
URL = "https://czone.com.pk/laptops-pakistan-ppt.74.aspx"
driver.get(URL)
We can get the page source from the driver and pass it to our HTML parser (note the import; the 'lxml' backend also needs a separate pip install lxml, or you can use the built-in 'html.parser' instead):
from bs4 import BeautifulSoup

soup = BeautifulSoup(driver.page_source, 'lxml')
products = soup.find_all(
    'div', {'class': 'product'}
)
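If you want to try the parsing step without launching a browser, here is a minimal offline sketch run on a small hand-written HTML snippet. The snippet mimics the product markup the tutorial assumes (div.product containing an h4 and a div.price); the actual live site's markup may differ.

```python
# Offline demo of the BeautifulSoup extraction step.
# The HTML below is made up for illustration; it only mirrors the
# class names ('product', 'price') used in this tutorial.
from bs4 import BeautifulSoup

html = """
<div class="product"><h4>Laptop A</h4><div class="price">Rs. 100,000</div></div>
<div class="product"><h4>Laptop B</h4><div class="price">Rs. 150,000</div></div>
"""

# 'html.parser' ships with Python, so no extra install is needed here
soup = BeautifulSoup(html, "html.parser")
products = soup.find_all("div", {"class": "product"})

print(len(products))                                  # 2
print(products[0].find("h4").get_text().strip())      # Laptop A
```

This is the same find_all call as above, just fed a static string instead of driver.page_source.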
Next, let's pull the name and price out of each product and collect them in a list:
records = []
for product in products:
    product_name = product.find('h4').get_text().strip()
    product_price = product.find('div', {'class': 'price'}).get_text().strip()
    records.append((product_name, product_price))
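Once you have the records list, you'll probably want to save it somewhere. One simple option is a CSV file via Python's standard library; the records below are sample data standing in for whatever your scrape actually returned, and the filename laptops.csv is just a placeholder.

```python
# Sketch: write the (name, price) tuples to a CSV file.
import csv

# Stand-in data; in the real script this is the list built above.
records = [("Sample Laptop 1", "Rs. 100,000"), ("Sample Laptop 2", "Rs. 150,000")]

with open("laptops.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "price"])  # header row
    writer.writerows(records)           # one row per product
```

The newline="" argument matters on Windows, where omitting it produces blank lines between rows.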
