Studying and Predicting the Progress of COVID-19 using Pandas and ARIMA

COVID-19 has been around for nearly 4 months since the outbreak. In this notebook, we will study some useful statistics on the number of confirmed, death, and recovered cases as a function of time for each country/region. We will use the dataset that has been made publicly available on www.kaggle.com.

What Will You Learn?

– How to use Pandas to load .csv files
– How to check for attributes with missing values and, if necessary, get rid of those attributes
– How to generate some useful statistics from the dataset
– How to use visualisation to better understand the patterns in the data
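
As a rough illustration of the first two points, here is a minimal sketch of loading a .csv file with Pandas and checking for attributes with missing values. The sample rows, the helper names (load_covid_csv, missing_share, drop_sparse_columns) and the 40% threshold are assumptions made for this example only; in the notebook you would pass the path to covid_19_data.csv instead of the in-memory sample.

```python
import io

import pandas as pd

# Tiny made-up sample standing in for covid_19_data.csv.
# The values are illustrative only, not real figures.
SAMPLE_CSV = """Sno,ObservationDate,Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered
1,01/22/2020,Hubei,Mainland China,1/22/2020 17:00,444,17,28
2,01/22/2020,,Japan,1/22/2020 17:00,2,0,0
3,01/23/2020,Hubei,Mainland China,1/23/20 17:00,444,17,28
4,01/23/2020,,Japan,1/23/20 17:00,2,0,0
"""


def load_covid_csv(source):
    """Load the dataset from a file path or a file-like object."""
    return pd.read_csv(source)


def missing_share(df):
    """Return the fraction of missing values in each column."""
    return df.isna().mean()


def drop_sparse_columns(df, threshold=0.4):
    """Keep only columns whose missing share is at most `threshold` (assumed cut-off)."""
    share = missing_share(df)
    keep = share[share <= threshold].index
    return df[keep]


df = load_covid_csv(io.StringIO(SAMPLE_CSV))
print(missing_share(df))

cleaned = drop_sparse_columns(df, threshold=0.4)
print(list(cleaned.columns))
```

Whether to drop a sparse attribute or keep it and fill the gaps depends on the analysis; Province/State, for instance, is empty for many countries but still useful for those that report it.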

What is inside the COVID-19 Dataset?

The main file in this dataset is covid_19_data.csv. Below is a summary of the attributes in this .csv file:

  1. Sno – Serial number
  2. ObservationDate – Date of the observation in MM/DD/YYYY
  3. Province/State – Province or state of the observation (Could be empty when missing)
  4. Country/Region – Country of observation
  5. Last Update – Time in UTC at which the row was updated for the given province or country. (Not standardised, so please clean it before use)
  6. Confirmed – Cumulative number of confirmed cases till that date
  7. Deaths – Cumulative number of deaths till that date
  8. Recovered – Cumulative number of recovered cases till that date
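
Because Confirmed, Deaths and Recovered are cumulative and are reported per province, a per-country time series needs two steps: summing the provinces for each date, then differencing consecutive dates to get daily new cases. The sketch below shows one way to do this with Pandas. The sample rows and the helper names (country_totals, daily_new) are assumptions made for illustration, not part of the dataset or the original notebook.

```python
import io

import pandas as pd

# Tiny made-up sample with the columns described above (illustrative values only).
SAMPLE_CSV = """Sno,ObservationDate,Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered
1,01/22/2020,Hubei,Mainland China,1/22/2020 17:00,444,17,28
2,01/22/2020,Guangdong,Mainland China,1/22/2020 17:00,26,0,0
3,01/22/2020,,Japan,1/22/2020 17:00,2,0,0
4,01/23/2020,Hubei,Mainland China,1/23/20 17:00,444,17,28
5,01/23/2020,Guangdong,Mainland China,1/23/20 17:00,32,0,2
6,01/23/2020,,Japan,1/23/20 17:00,2,0,0
"""

COUNT_COLUMNS = ["Confirmed", "Deaths", "Recovered"]


def country_totals(df):
    """Sum the cumulative counts over provinces for each country and date."""
    out = df.copy()
    # ObservationDate is documented as MM/DD/YYYY, so parse it explicitly.
    out["ObservationDate"] = pd.to_datetime(out["ObservationDate"], format="%m/%d/%Y")
    return (
        out.groupby(["Country/Region", "ObservationDate"])[COUNT_COLUMNS]
        .sum()
        .sort_index()
    )


def daily_new(totals):
    """Difference the cumulative counts within each country to get daily changes."""
    # The first date of each country has no previous value, so it becomes NaN.
    return totals.groupby(level="Country/Region").diff()


raw = pd.read_csv(io.StringIO(SAMPLE_CSV))
totals = country_totals(raw)
print(totals)
print(daily_new(totals))
```

The Last Update column is left untouched here since its format is not standardised; ObservationDate is the safer column to use as the time axis.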

Are Country Level Datasets Available?

Country-level datasets are also available. If you are interested in country-level data, please refer to the following Kaggle datasets:
India
South Korea
Italy
Brazil
USA
Switzerland
Indonesia

You can download the Jupyter Notebook for COVID-19 Analysis here:
