When you want to scrape data from a website in Python, you might think of packages such as BeautifulSoup, Selenium, and Scrapy, but wouldn’t it be better if there were a simpler way? If you need to scrape tabular data from a web page, pandas is also a good option. Pandas makes it easy to read HTML tables straight into DataFrames.
In this article, I will demonstrate:
- How to scrape tabular data from Wikipedia with pandas: https://en.wikipedia.org/wiki/Wikipedia:Multiyear_ranking_of_most_viewed_pages
- How to use Web Scraper to scrape data from Yelp
Web scraping with Pandas
First, check out the website (Wikipedia: Multiyear ranking of most viewed pages). There are several tables on this page, and each table has its own header.
Use read_html to read the HTML tables
Pass the URL directly to the pandas function pd.read_html to read in the web page. It returns a list of DataFrames, one for each table it finds.
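As a minimal sketch of this step, the snippet below calls pd.read_html on a small HTML string. The SAMPLE_HTML content is made up for illustration (placeholder pages and numbers, not real Wikipedia data) so that the example runs offline; passing the article's URL instead should work the same way, assuming the site accepts the request and a parser such as lxml is installed.

```python
from io import StringIO

import pandas as pd

# Made-up stand-in for a page with two tables (not real Wikipedia data).
SAMPLE_HTML = """
<table>
  <tr><th>Rank</th><th>Page</th><th>Views</th></tr>
  <tr><td>1</td><td>Page A</td><td>100</td></tr>
  <tr><td>2</td><td>Page B</td><td>50</td></tr>
</table>
<table>
  <tr><th>Year</th><th>Page</th></tr>
  <tr><td>2020</td><td>Page A</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> element found.
tables = pd.read_html(StringIO(SAMPLE_HTML))

# For the live page, the URL can be passed in place of the string buffer:
# url = "https://en.wikipedia.org/wiki/Wikipedia:Multiyear_ranking_of_most_viewed_pages"
# tables = pd.read_html(url)

print(type(tables))
```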
Check the total number of tables found
Using the len() function, we find that there are 21 tables on this web page.
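A short sketch of the count, again on a made-up two-table sample rather than the live page (where the article reports 21 tables):

```python
from io import StringIO

import pandas as pd

# Made-up sample with two tables, used so the sketch runs offline.
SAMPLE_HTML = """
<table>
  <tr><th>Rank</th><th>Page</th></tr>
  <tr><td>1</td><td>Page A</td></tr>
</table>
<table>
  <tr><th>Year</th><th>Page</th></tr>
  <tr><td>2020</td><td>Page A</td></tr>
</table>
"""

tables = pd.read_html(StringIO(SAMPLE_HTML))

# The result is a plain Python list, so len() gives the number of tables found.
table_count = len(tables)
print(f"Total tables: {table_count}")
```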
Access a particular table
Simply access an element of the list by its position.
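For instance, assuming the made-up sample below stands in for the page, the first table is the element at position 0:

```python
from io import StringIO

import pandas as pd

# Made-up sample with two tables (placeholder values only).
SAMPLE_HTML = """
<table>
  <tr><th>Rank</th><th>Page</th><th>Views</th></tr>
  <tr><td>1</td><td>Page A</td><td>100</td></tr>
  <tr><td>2</td><td>Page B</td><td>50</td></tr>
</table>
<table>
  <tr><th>Year</th><th>Page</th></tr>
  <tr><td>2020</td><td>Page A</td></tr>
</table>
"""

tables = pd.read_html(StringIO(SAMPLE_HTML))

# List positions start at 0, so tables[0] is the first table on the page.
first_table = tables[0]
print(first_table.head())
```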
Set a particular column as index
If you would like to set a particular column as the index, use the index_col parameter to choose which column becomes the index. For this example, we choose the page column as our index.
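A hedged sketch of index_col, using a made-up table in which the page column sits at position 1. The position is an assumption about this sample; on the real page, the column position may differ from table to table.

```python
from io import StringIO

import pandas as pd

# Made-up sample table; "Page" is the second column (position 1).
SAMPLE_HTML = """
<table>
  <tr><th>Rank</th><th>Page</th><th>Views</th></tr>
  <tr><td>1</td><td>Page A</td><td>100</td></tr>
  <tr><td>2</td><td>Page B</td><td>50</td></tr>
</table>
"""

# index_col takes the zero-based position of the column to use as the index.
tables = pd.read_html(StringIO(SAMPLE_HTML), index_col=1)
by_page = tables[0]
print(by_page)
```

Note that index_col applies to every table returned by the call, so it fits best when the tables share a layout or when match narrows the result to one table.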
Return tables containing a string or regex
If you want a specific table, use the match parameter with a keyword (a string or regex). Only tables containing text that matches the keyword are returned.
Use Web Scraper to scrape data from Yelp
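A small sketch of match on a made-up sample: only the second table contains the text "Year", so only that table comes back. The keyword is chosen for this sample; on the live page you would pick text that appears in the table you want.

```python
from io import StringIO

import pandas as pd

# Made-up sample: only the second table contains the text "Year".
SAMPLE_HTML = """
<table>
  <tr><th>Rank</th><th>Page</th></tr>
  <tr><td>1</td><td>Page A</td></tr>
</table>
<table>
  <tr><th>Year</th><th>Page</th></tr>
  <tr><td>2020</td><td>Page A</td></tr>
</table>
"""

# match keeps only tables containing text that matches the string or regex.
matched = pd.read_html(StringIO(SAMPLE_HTML), match="Year")
print(len(matched))
print(matched[0])
```

If no table matches, read_html raises a ValueError, so it can be worth wrapping the call in a try/except when the keyword is uncertain.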
What is Web-Scraper?
- Web Scraper is a browser extension that automates data extraction from websites.
- The goal of Web Scraper is to make web data extraction as simple as possible. Configure scraper by simply pointing and clicking on elements. No coding required.
First, install Web Scraper in Google Chrome:
Next, I will show you how to scrape bakery listings from Yelp.
1. Right-click on the web page you are going to scrape, then click Inspect.
2. Choose the Web Scraper tab on the right.
3. Create Sitemap:
- Sitemap name: can be any name you like, but with no capital letters or spaces.
- Start URL: the current page, that is, our starting page.
4. After creating the sitemap, we can add a new selector for each bakery link.
- ‘_root’ is our starting page
- We can give this selector an ID. For this demo, I use the ID bakery_link.
- Type can be Link, Text, Image, and so on, depending on the type of data you select.
- Select the area by clicking ‘Select’ under Selector and then clicking the links. The selected links are highlighted in red.
- Make sure you check ‘Multiple’ if more than one link or data item is selected.
5. Go into the bakery_link selector, then add more selectors inside it.
- Add a selector for each piece of data we would like to scrape, such as the bakery name. Repeat this step for the other data you want, for example phone number, address, rating, and so on.
6. Once you have created all the selectors, check ‘Selector graph’ to see whether all the data is included.
7. Scrape data from more pages
- Add a selector for the page links (page_link) inside ‘_root’.
- Edit ‘bakery_link’ to include ‘page_link’ as a parent selector.
- Review the selector graph
8. Start scraping
9. Export data as CSV
In this article, you have learned how to scrape the web with two simple methods: pandas and the Google Chrome extension Web Scraper. If you would like to learn more about Web Scraper, you can visit the following website: