Photo by Fusion Medical Animation on Unsplash
One of the most commonly analyzed reports in 2020 was the one showing the number of Covid19 infections. I would like to look at it from a slightly different angle. It is said that if anything positive came out of the 2020/2021 pandemic, it is the defeat of the influenza virus. With the short analysis below, I wanted to check whether the coronavirus really contributed to a significant reduction, or even a complete defeat, of the flu in both Europe and Poland.
Data on the number of detected Covid19 and influenza cases were obtained from two different sources:
To download the flu data, I went to the website above and filtered the report shown there by period and country. I was then able to download a spreadsheet file with the selected data.
I converted the downloaded Excel file to .csv format and moved it to my GitHub repository. The tool I use most often for data analysis is Jupyter Notebook.
At the beginning, I imported all the necessary Python libraries and loaded the first file, containing the number of influenza cases, which I named "flu_detected".
As you can see, the table has 11 columns. First, I looked at what the "Region" column contains. It holds two values: "EU / EEA" and "WHO Europe".
I suspected that the table contains duplicates, because if a country belongs to the European Economic Area, its data will appear under both "EU / EEA" and "WHO Europe". It is worth checking this on an example country. Let it be Poland, which is part of the EEA:
As you can see, we have the same values for Poland in both "EU / EEA" and "WHO Europe". Therefore, I kept only the rows assigned to "WHO Europe", which covers all of the EEA countries as well as countries that are not part of the free-trade zone.
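A minimal sketch of this check and filter on a tiny made-up frame. Only the "Region" column and its two values come from the article; the `Country` and `Detected` column names and all numbers are assumptions for illustration.

```python
import pandas as pd

# Made-up sample: Poland appears under both regions, Ukraine only under "WHO Europe".
flu_detected = pd.DataFrame({
    "Region": ["EU / EEA", "WHO Europe", "WHO Europe"],
    "Country": ["Poland", "Poland", "Ukraine"],
    "Detected": [120, 120, 45],
})

# Compare the rows for one country across both regions.
poland = flu_detected[flu_detected["Country"] == "Poland"]
same_values = bool(poland["Detected"].nunique() == 1)

# Keep only the "WHO Europe" rows, which already cover the EEA countries.
flu_detected = flu_detected[flu_detected["Region"] == "WHO Europe"].reset_index(drop=True)
```

If `same_values` were false for some country, dropping the "EU / EEA" rows would lose information, so the check is worth running before the filter.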
I looked at a few details:
As you can see, there are 9 763 rows and 11 columns. We also see the data types (integer "int64" or text "object"), and we know that the table contains no blank values because every column has the same number of non-null entries.
Next, I planned a few tasks based on the table we have opened:
I started by renaming the "Week" column to "YearWeek", as the values in that column basically show both the year and the week (see table above):
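The rename itself is a one-liner in pandas. In this sketch the sample values (and the `2016-W01` format) and the `Detected` column are assumptions; only the two column names come from the article.

```python
import pandas as pd

# Made-up rows; the "2016-W01" value format is assumed.
flu_detected = pd.DataFrame({"Week": ["2016-W01", "2016-W02"], "Detected": [10, 12]})

# rename returns a new frame, so the result is assigned back.
flu_detected = flu_detected.rename(columns={"Week": "YearWeek"})
```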
Then, I checked the unique values in the "Surveillance System Type" column, which had only one value, "Non-sentinel", and in the "Season" column, which in turn had the following values:
out: ['Non-sentinel']
out: ['2015/2016', '2016/2017', '2017/2018', '2018/2019', '2019/2020', '2020/2021']
I decided to remove both the "Surveillance System Type" and the "Season" columns. The first one contains only a single value, which does not matter to me. The second one is basically a slimmed-down version of the "YearWeek" column:
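A short sketch of the uniqueness check and the column removal. The three column names are the ones mentioned in the article; the `Detected` column and the row values are made up.

```python
import pandas as pd

flu_detected = pd.DataFrame({
    "YearWeek": ["2016-W01", "2016-W40"],
    "Season": ["2015/2016", "2016/2017"],
    "Surveillance System Type": ["Non-sentinel", "Non-sentinel"],
    "Detected": [10, 3],  # hypothetical value column
})

# A column with a single unique value carries no information for the analysis.
system_types = flu_detected["Surveillance System Type"].unique().tolist()

# Drop both columns in one call.
flu_detected = flu_detected.drop(columns=["Surveillance System Type", "Season"])
```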
The next step is to summarize all the flu types:
Now it's time to create a pivot table that will improve the readability of the above data:
I think the table looks much better. All flu type names are now in the "Flu Type" column, the values are in the "Detected_Cases" column, and the "YearWeek" column has been split into two separate columns named "Year" and "Week".
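One way to arrive at this shape is pandas `melt` followed by a string split. This is a sketch rather than the original script: the wide input frame, its `Country` column, and the `2016-W01` format of "YearWeek" are assumptions.

```python
import pandas as pd

# Made-up wide table: one column per flu type.
wide = pd.DataFrame({
    "Country": ["Poland", "Poland"],
    "YearWeek": ["2016-W01", "2016-W02"],
    "A": [5, 7],
    "B": [1, 0],
})

# Unpivot the flu-type columns into "Flu Type" / "Detected_Cases" pairs.
tidy = wide.melt(
    id_vars=["Country", "YearWeek"],
    var_name="Flu Type",
    value_name="Detected_Cases",
)

# Split "2016-W01" into Year (int) and Week ("W01"), then drop the source column.
tidy[["Year", "Week"]] = tidy["YearWeek"].str.split("-", expand=True)
tidy["Year"] = tidy["Year"].astype(int)
tidy = tidy.drop(columns="YearWeek")
```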
Although I was focusing on the total number of detected influenza cases, I also wanted to see the share of each flu type. Before doing that, I decided to shorten the name of each flu type a bit to make the chart easier to read:
Now the values in the "Flu Type" column look like this:
out: ['A', 'A (H1)', 'A (H3)', 'B', 'B / Vic', 'B / Yam', 'Total Detected Cases']
I would like to check whether we have enough data for each year. To do this, I checked the total number of detected flu cases and the number of weeks covered in each year:
As you can see in the example above, 2015 is incomplete (we have data for only 14 weeks). Of course, 2021 does not look much better, but this is because I downloaded the data at the end of May 2021. So I removed only 2015 from the table:
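The completeness check can be sketched with a grouped aggregation. The frame below is made up and much smaller than the real table; the `Total_Cases` and `Weeks` result names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Year": [2015, 2015, 2016, 2016, 2016],
    "Week": ["W52", "W53", "W01", "W02", "W03"],
    "Detected_Cases": [4, 6, 10, 12, 8],
})

# Per year: total cases and the number of distinct weeks present.
summary = df.groupby("Year").agg(
    Total_Cases=("Detected_Cases", "sum"),
    Weeks=("Week", "nunique"),
)

# Drop the incomplete first year.
df = df[df["Year"] != 2015]
```

Counting distinct weeks rather than rows matters here, because the same week appears once per country and flu type.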
In the script below, I checked the share of each flu type in 2016-2021:
As you can see in the chart above, 2021 seems to be free of the flu virus. As for the remaining years, in almost every year the most common types of influenza virus are type A, which occurs in both humans and animals (pigs, horses, seals, minks, whales and birds), and type B, which occurs almost exclusively in humans. You can find a lot of information about each type of flu on Wikipedia, among other places. I encourage you to check it out.
I didn't need the data for each flu type anymore, so I made another table with the total flu cases. I called the table "df_flu":
Here I looked at the total number of flu cases in 2016-2021:
At first glance, you can see a decrease in flu cases in 2019-2021 (May). It is true that the 2021 data ends in May, but (as far as I remember) most cases occur in the first quarter of each year. I verify this later in the analysis.
Finally, I removed the letter "W" from the "Week" column, so that the column holds integer week numbers, e.g. "1" instead of "W01":
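A minimal sketch of this conversion, on a made-up "Week" column:

```python
import pandas as pd

df_flu = pd.DataFrame({"Week": ["W01", "W09", "W52"]})

# Remove the literal "W" (no regex needed) and cast the rest to integers.
df_flu["Week"] = df_flu["Week"].str.replace("W", "", regex=False).astype(int)
```

Casting to `int` also drops the leading zero, so "W01" becomes 1.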
Done, one last look at the final table with data showing the number of flu cases in Europe:
It's time for the second report, which shows the number of SARS-CoV-2 infections in Europe. In the end, the second table will have the same form as the flu table above, so that I can combine them and make a comparison.
All the COVID-19 data I used were taken from the repository of the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University.
Website address: https://github.com/CSSEGISandData/COVID-19
These data are the basis of a very popular graph which presents the current situation of the pandemic.
To download all of the reports from the above repository, I used the "Beautiful Soup" library. The links to all the reports were saved in the variable "urls", and the dates for each file were saved in "df_list_names":
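As an alternative to scraping with Beautiful Soup, the links can also be generated from dates, because the daily reports follow an `MM-DD-YYYY.csv` naming pattern. The sketch below is a swapped-in technique, not the original script, and the helper name `daily_report_urls` is hypothetical.

```python
from datetime import date, timedelta

BASE_URL = (
    "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/"
    "csse_covid_19_data/csse_covid_19_daily_reports/"
)

def daily_report_urls(start, end):
    """Return (urls, names) for every day from start to end, inclusive."""
    urls, names = [], []
    day = start
    while day <= end:
        name = day.strftime("%m-%d-%Y")  # file names look like 01-01-2021
        names.append(name)
        urls.append(f"{BASE_URL}{name}.csv")
        day += timedelta(days=1)
    return urls, names

urls, df_list_names = daily_report_urls(date(2021, 1, 1), date(2021, 1, 5))
```

This avoids parsing the repository page, at the cost of assuming that a file exists for every day in the range.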
A look at the "urls" and "df_list_names" variables:
out:
['https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/01-01-2021.csv',
'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/01-02-2021.csv']
out: ['01-01-2021', '01-02-2021', '01-03-2021', '01-04-2021', '01-05-2021']
The downloaded files from the repository contain the following columns:
FIPS, Admin2, Province_State, Country_Region, Last_Update, Lat, Long_, Confirmed, Deaths, Recovered, Active, Incident_Rate, Case_Fatality_Ratio (%)
However, I decided that I would need only the following columns:
Country_Region, Last_Update, Confirmed, Deaths, Recovered.
In addition, I also downloaded two columns which might be needed: Lat, Long_
After downloading the data, it turned out that the column names had changed slightly over the year. Because of this, I had to add conditional statements to the code:
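One way to handle such naming drift, sketched here with a rename mapping instead of if/else branches. The older spellings in `RENAMES` are assumptions about the early files, the helper `normalize_report` is hypothetical, and the sample rows are made up.

```python
import pandas as pd

# Assumed older spellings mapped to the names used in this article.
RENAMES = {
    "Country/Region": "Country_Region",
    "Last Update": "Last_Update",
    "Latitude": "Lat",
    "Longitude": "Long_",
}
WANTED = ["Country_Region", "Last_Update", "Confirmed", "Deaths", "Recovered", "Lat", "Long_"]

def normalize_report(df):
    """Rename old-style columns, keep only the wanted ones, add missing ones as NaN."""
    return df.rename(columns=RENAMES).reindex(columns=WANTED)

# Made-up rows in the assumed old and new layouts.
old_style = pd.DataFrame({"Country/Region": ["Poland"], "Last Update": ["2020-03-10"], "Confirmed": [22]})
new_style = pd.DataFrame({
    "Country_Region": ["Poland"], "Last_Update": ["2021-01-02"], "Confirmed": [100],
    "Deaths": [2], "Recovered": [50], "Lat": [51.9], "Long_": [19.1], "Active": [48],
})

old_clean = normalize_report(old_style)
new_clean = normalize_report(new_style)
```

Columns that are absent from a file come back as empty (NaN) columns, which then have to be filled or dropped in a later step.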
A glance at the downloaded data:
The next step is to change the data type of two columns, "Last_Update" and "File_Name".
As you can see above, we have a few columns with no values, which I'm going to fix:
The next step is to summarize the number of cases: Confirmed / Deaths / Recovered by country and column "Last_Update":
The "Last_Update" column shows the date of the data. However, to be able to join and compare the Covid and influenza tables, I needed a week number for each day. To create the columns with the week number, I used the "isocalendar" function:
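A short sketch of deriving ISO year and week numbers with pandas, assuming "Last_Update" has already been converted to a datetime column. The three dates are chosen for illustration.

```python
import pandas as pd

df_covid = pd.DataFrame({
    "Last_Update": pd.to_datetime(["2021-05-17", "2021-05-23", "2021-01-03"]),
})

# dt.isocalendar() returns a frame with "year", "week" and "day" columns.
iso = df_covid["Last_Update"].dt.isocalendar()
df_covid["Year"] = iso["year"].astype(int)
df_covid["Week"] = iso["week"].astype(int)
```

Note that the ISO year can differ from the calendar year around New Year: 3 January 2021 belongs to week 53 of 2020. Taking the year from `isocalendar` as well keeps the year and week columns consistent.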
Let's see what the example table for Poland looks like now. I chose week 20 of 2021:
I took the maximum value within each week from the table above because I needed weekly-level data:
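A sketch of the weekly reduction, assuming the daily counts are cumulative running totals, so that the weekly maximum is the value reached by the end of the week. The numbers are made up.

```python
import pandas as pd

daily = pd.DataFrame({
    "Country_Region": ["Poland"] * 4,
    "Year": [2021, 2021, 2021, 2021],
    "Week": [19, 19, 20, 20],
    "Confirmed": [100, 130, 150, 190],
    "Deaths": [1, 2, 3, 4],
})

# One row per country and week, holding the highest cumulative value seen that week.
weekly = daily.groupby(["Country_Region", "Year", "Week"], as_index=False)[
    ["Confirmed", "Deaths"]
].max()
```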
I also changed the column name for the region:
Everything seems to be fine now. However, one more issue puzzles me: do I have the same country names in both tables, or are there some differences?
The flu table is my main table, because we need only European countries. As you can see, a few values are missing in the "cov" column, which comes from the "df_covid" table. Below we can see the country names that differ between the two tables:
I made a list of the countries to rename and refreshed the table:
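The name comparison and the fix can be sketched with a set difference and a replacement mapping. The `Country` column name and the Czechia / Czech Republic pair are a made-up example, not the actual list of differences.

```python
import pandas as pd

df_flu = pd.DataFrame({"Country": ["Poland", "Czech Republic"]})
df_covid = pd.DataFrame({"Country": ["Poland", "Czechia", "Brazil"]})

# Flu-table countries that have no exact match in the Covid table.
missing = sorted(set(df_flu["Country"]) - set(df_covid["Country"]))

# Hypothetical mapping from the Covid-table spelling to the flu-table spelling.
to_change = {"Czechia": "Czech Republic"}
df_covid["Country"] = df_covid["Country"].replace(to_change)

# After renaming, every flu-table country should be found.
still_missing = sorted(set(df_flu["Country"]) - set(df_covid["Country"]))
```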
I renamed the "Detected Cases" column so that it is clear in the final table that it holds influenza cases:
I merged these two tables and checked which cells had no value for 2021. All blank cells were replaced with zeros, as I needed only numerical values throughout the column:
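A minimal sketch of the merge and the zero fill. A left join is assumed because the flu table is treated as the main table; the column names `Country`, `Flu_Cases` and `Confirmed` and all values are illustrative.

```python
import pandas as pd

df_flu = pd.DataFrame({
    "Country": ["Poland", "Poland"],
    "Year": [2019, 2021],
    "Week": [5, 5],
    "Flu_Cases": [40, 0],
})
df_covid = pd.DataFrame({
    "Country": ["Poland"], "Year": [2021], "Week": [5], "Confirmed": [1500],
})

# Keep every flu row; Covid values are attached where country, year and week match.
final_df = df_flu.merge(df_covid, on=["Country", "Year", "Week"], how="left")

# Weeks before the pandemic have no Covid row, so the gaps become zeros.
final_df["Confirmed"] = final_df["Confirmed"].fillna(0).astype(int)
```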
For analysis purposes and chart clarity, I added year quarters:
final_df = final_df.merge(quarters, on='Week', how='inner')
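The `quarters` lookup used in this merge could be built as follows. The 13-weeks-per-quarter split (with week 53 counted in the fourth quarter) and the integer quarter labels are assumptions, as is the `Flu_Cases` column.

```python
import pandas as pd

# Lookup table: ISO week number -> quarter, 13 weeks per quarter, week 53 in Q4.
quarters = pd.DataFrame({"Week": range(1, 54)})
quarters["Quarter"] = ((quarters["Week"] - 1) // 13 + 1).clip(upper=4)

# Made-up rows standing in for the merged flu / Covid table.
final_df = pd.DataFrame({"Week": [1, 13, 14, 40, 53], "Flu_Cases": [5, 9, 2, 1, 3]})
final_df = final_df.merge(quarters, on="Week", how="inner")
```

Because the lookup covers weeks 1 to 53, the inner join does not drop any rows as long as the week numbers are valid.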
Currently we have 9 359 rows, no missing weeks, no empty cells, and no negative values. Everything looks fine, so I saved the file on my computer and then sent it to my repository:
I created a separate Jupyter Notebook file for the analysis, named "Flu_Covid_Analysis.ipynb". Of course, I started by importing the libraries and loading the file itself:
It is worth recalling the column names and describing their meaning:
There are a few points I additionally wanted to check based on the table I have prepared:
AD1. How many flu cases were detected on a weekly basis (for Europe and Poland)?
To see the flu history, I created the following function:
Checking the result:
The highest numbers of detected flu cases were in 2018 and 2019. In 2021, flu detection was close to zero. Of course, the data does not cover the full year of 2021, but the biggest increase in detections occurs between the fifth and tenth week of each year, and in 2021 there is no increase in those weeks.
Checking the same report, but for Poland:
The graph looks similar to the previous one for Europe. However, the biggest increase was in 2016, and you can see only a really small increase in flu detection after week 50 (in Europe we had more cases of influenza at the same time). What Europe and Poland have in common is virtually no flu cases in 2021.
AD2. How many covid cases were detected on a weekly basis (for Europe and Poland)?
What we know about the pandemic is that there were no officially confirmed cases of this virus prior to 2020. We can confirm this with the code line below:
out: [2021, 2020]
I would like to see the results for Covid19 in a chart and compare them with the corresponding flu data. So, I have decided to remove the values for the weeks from my table and keep only the years and quarters. I put the new data in a table called "df_q":
Now we can write a function for our chart:
As seen above, we have a large increase in COVID cases in the fourth quarter of 2020. The first quarter of 2021 was not much better. However, there is a clear improvement in the second quarter of 2021:
In Poland, the trend is quite similar to the one in Europe.
Now we can create one chart containing the flu and covid cases. However, it is important to remember that this will only be done to show a certain trend, as there are far more coronavirus infections in comparison to the flu detected cases.
The first function divides the values in the table by a thousand. I did this mainly to avoid showing values in the millions on the chart. I also rounded the numbers to two decimal places, which improves readability as well. The last step is to create a new column combining years and quarters:
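A sketch of such a helper. The function name `scale_for_chart`, the column names, and the "2018 Q1" label format are assumptions; the two large figures are the peak values quoted in this article, and the small flu value is made up.

```python
import pandas as pd

df_q = pd.DataFrame({
    "Year": [2018, 2021],
    "Quarter": [1, 2],
    "Flu_Cases": [184900, 12],
    "Covid_Cases": [0, 52285000],
})

def scale_for_chart(df, columns, divisor=1000):
    """Return a copy with the given columns divided by divisor and rounded to 2 decimals."""
    out = df.copy()
    out[columns] = (out[columns] / divisor).round(2)
    # Combined label for the x-axis, e.g. "2018 Q1".
    out["YearQuarter"] = out["Year"].astype(str) + " Q" + out["Quarter"].astype(str)
    return out

chart_df = scale_for_chart(df_q, ["Flu_Cases", "Covid_Cases"])
```

Working on a copy leaves `df_q` untouched, so the unscaled values remain available for other calculations.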
And below is the final function for a graph with virus comparison:
There is a clear relationship between Covid19 and the flu: when the pandemic came to Europe, the flu almost disappeared. Remember that the values on the chart are divided by 1000.
Thus, the highest number of influenza cases was in Q1 2018: 184,900 cases
On the other hand, the highest number of Covid19 cases was in Q2 2021: 52,285,000 cases
Let's see how it looked in Poland:
The trend in Poland is similar to what we saw on the chart for the whole of Europe.
AD3. Top 10 countries with flu / covid19
The next step was to check the top 10 countries with the highest influenza and covid19 detection.
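A ranking like this can be sketched with a grouped sum and `nlargest`. The helper `top_countries`, the column names, and the data are hypothetical. Note that summing is only appropriate for per-week counts such as the flu detections; for cumulative totals the last or maximum value would be the one to rank by.

```python
import pandas as pd

final_df = pd.DataFrame({
    "Country": ["A", "B", "A", "C"],
    "Flu_Cases": [5, 20, 10, 1],
})

def top_countries(df, column, n=10):
    """Sum the given column per country and return the n largest totals."""
    return df.groupby("Country")[column].sum().nlargest(n)

top_flu = top_countries(final_df, "Flu_Cases", n=2)
```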
The most influenza cases reported:
The most covid19 cases reported:
AD4. Finally, I will check how many flu cases we had after the second quarter of 2020
On the basis of the data I have prepared, we can say with certainty that the SARS-CoV-2 pandemic influenced the detection of influenza. However, the number of flu cases has never (at least since 2016) come close to the scale of the Covid19 pandemic across Europe.
The scale of the coronavirus is incredibly high, and we cannot doubt that we have been hit by the pandemic.
Please feel free to visit my GitHub account where you can find all of the scripts from this project with their description.