How to Scrape HTML Tabular Data with Python

Many of you may have already read articles about scraping data from websites. Most of them suggest using Node.js with the Cheerio library or Python with Beautiful Soup. Although these tools are very effective once you master them, it takes time and effort to write all the code for requesting the page, finding the elements you need, and cleaning the data into a dataframe before you can start the actual analysis. (And, of course, some additional time to fix all the bugs and errors.)

This short article shows the easiest way to scrape tabular data from a website with just three lines of Python!

Example of Scraping Real-time COVID-19 Data from Worldometer:

For example, suppose you want to get the tabular data from the Worldometer website. Because this dataset is dynamic and changes over time, scraping makes sense: you get the most up-to-date result every time you run the script!

Example Covid-19 Tabular Data from Worldometer

To scrape this dataset, get your machine ready with Python and pandas (read_html() also needs an HTML parser such as lxml installed). We are going to use pandas' read_html() to extract all the tables from a webpage. However, we cannot simply pass it the URL, because you might get a 403 Forbidden error. To avoid that, we first request the page with the requests module to get the HTML body and then let pandas read it. Overall, the script looks like this:

import requests, pandas as pd
r = requests.get('http://www.worldometers.info/coronavirus/')
dfs = pd.read_html(r.text)

The pandas.read_html() function searches the input you provide (here, the HTML text returned by requests) for <table> elements. It always returns a list of DataFrames, even if the page has only one table.
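To see the "always a list" behaviour without hitting the network, here is a minimal sketch that parses a tiny hand-written page. The helper names parse_tables and fetch_tables, the User-Agent value, the timeout, and the sample HTML are all assumptions made for illustration; they are not part of the original three-line script. Wrapping the text in StringIO is a precaution, since recent pandas versions warn when given a literal HTML string.

```python
from io import StringIO

import pandas as pd
import requests


def parse_tables(html):
    """Return every <table> found in an HTML string as a list of DataFrames."""
    # StringIO makes the text look like a file, which newer pandas prefers.
    return pd.read_html(StringIO(html))


def fetch_tables(url):
    """Download a page with requests, then parse its tables."""
    # Assumed header value: some sites reject clients that do not look
    # like a browser, which is one common cause of 403 responses.
    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()  # stop early on 403 or other HTTP errors
    return parse_tables(response.text)


# A tiny page with a single table (made-up numbers, not real data).
sample_html = """
<table>
  <tr><th>Country</th><th>TotalCases</th></tr>
  <tr><td>A</td><td>10</td></tr>
  <tr><td>B</td><td>20</td></tr>
</table>
"""

tables = parse_tables(sample_html)
print(type(tables), len(tables))  # a list, even with only one table
print(tables[0])
```

With a live page you would call fetch_tables('http://www.worldometers.info/coronavirus/') instead and index into the returned list in the same way.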

dfs[0]
         #           Country,Other  TotalCases NewCases  TotalDeaths NewDeaths  TotalRecovered  ActiveCases  Serious,Critical  Tot Cases/1M pop  Deaths/1M pop  TotalTests  Tests/ 1M pop    Population
0      NaN                   World     5622939  +38,672     348715.0    +1,102       2393539.0    2880685.0           53131.0             721.0           44.7         NaN            NaN           NaN
1      1.0                     USA     1709243   +3,017      99883.0       +78        465668.0    1143692.0           17116.0            5167.0          302.0  15204572.0        45961.0  3.308117e+08
2      2.0                  Brazil      376669      NaN      23522.0       NaN        153833.0     199314.0            8318.0            1773.0          111.0    735224.0         3461.0  2.124098e+08
3      3.0                  Russia      362342   +8,915       3807.0      +174        131129.0     227406.0            2300.0            2483.0           26.0   9160590.0        62775.0  1.459285e+08
4      4.0                   Spain      282480      NaN      26837.0       NaN        196958.0      58685.0             854.0            6042.0          574.0   3556567.0        76071.0  4.675305e+07
...
dfs[1]
         #           Country,Other  TotalCases NewCases  TotalDeaths NewDeaths  TotalRecovered  ActiveCases  Serious,Critical  Tot Cases/1M pop  Deaths/1M pop  TotalTests  Tests/ 1M pop    Population
0      NaN                   World     5584267  +90,184     347613.0    +3,096       2362984.0    2873670.0           53167.0             716.0           44.6         NaN            NaN           NaN
1      1.0                   China       82985      +11       4634.0       NaN         78268.0         83.0               7.0              58.0            3.0         NaN            NaN  1.439324e+09
2      2.0                     USA     1706226  +19,790      99805.0      +505        464670.0    1141751.0           17114.0            5158.0          302.0  15187647.0        45910.0  3.308117e+08
3      3.0                  Brazil      376669  +13,051      23522.0      +806        153833.0     199314.0            8318.0            1773.0          111.0    735224.0         3461.0  2.124098e+08
4      4.0                  Russia      353427   +8,946       3633.0       +92        118798.0     230996.0            2300.0            2422.0           25.0   8945384.0        61300.0  1.459285e+08
...
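Notice that columns such as NewCases come back as text ("+38,672") rather than numbers. One possible way to convert them is sketched below. It assumes a small stand-in DataFrame built from a few rows of the output above instead of the live dfs[0], and the cleaning rule (drop "+" and ",") is an assumption about how the site formats these values.

```python
import pandas as pd

# Stand-in for dfs[0]: a few values copied from the output above.
df = pd.DataFrame({
    "Country,Other": ["World", "USA", "Brazil"],
    "TotalCases": [5622939, 1709243, 376669],
    "NewCases": ["+38,672", "+3,017", None],
})

# Remove the leading "+" and the thousands separators.
cleaned = df["NewCases"].str.replace(r"[+,]", "", regex=True)

# Convert to numbers; errors="coerce" turns unparseable values into NaN.
df["NewCases"] = pd.to_numeric(cleaned, errors="coerce")

print(df.dtypes)
print(df)
```

Missing entries stay missing (NaN) after the conversion, so rows without a NewCases value are not silently turned into zeros.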
