Statisical Analysis of the Global Innovation Index dataset and Proposed Storyboard.
Ref: https://www.euroscience.org/news/science-policy-esof2020-innovation-in-times-of-covid-19/
As the global Covid-19 pandemic hits, the theme of the Global Innovation Index (GII) 2020 is “Who will Finance Innovation,” which presents the current state and evolution of financial support mechanisms while exploring needed advances and remaining challenges. The GII model includes 131 countries/economies, which represent 93.5% of the world’s population. The GII relies on two sub-indices – the Innovation Input Sub-Index and the Innovation Output Sub-Index.
Our project aims to analyse and identify patterns regarding the GII during Covid-19 pandemic. We intend to draw conclusions from the data and generate visualization of the data for the respective countries or regions especially in Singapore.
For this particular report, I will be focusing on the statistical analysis of the GII data while my teammate, Elaine, will take care of the exploratory data analysis (EDA).
packages = c('tidyverse', 'plotly', 'heatmaply', 'ggcorrplot',
'dendextend', 'psych', 'stringr', 'countrycode',
'factoextra', 'gridExtra', 'ggpubr', 'ggstatsplot')
for (p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p, character.only = T)
}
The datasets are obtained from the Global Innovation Index (GII) website, which is co-published by Cornell University, INSEAD, and the World Intellectual Property Organization (WIPO), a specialized agency of the United Nations. The datasets can be downloaded from HERE.
For this particular project, we have downloaded full dataset in CSV for year 2013 to 2020. But we only used from year 2015 to 2020 due to inconsistency of the indicator naming convention in year 2013 and 2014.
The dataset consists of 7 main indicators and 101 sub-indicators which are used to compute the overall GII score. These indicators are represented as rows in the dataset. Each of the 131 different countries are represented as a column. Each indicator comes with 5 different parameters: Rank, Score, Value, Strength, and Weakness.
head(read_csv('data/2020.csv'))
# A tibble: 6 x 663
X1 X2 `Rank Indicator~ Albania Algeria Argentina Armenia
<chr> <chr> <lgl> <chr> <chr> <dbl> <chr>
1 Indi~ <NA> NA <NA> <NA> NA <NA>
2 Glob~ <NA> NA 83 121 80 61
3 Inno~ <NA> NA 108 125 77 26
4 Inno~ <NA> NA 74 111 80 83
5 Inno~ <NA> NA 91 126 73 47
6 Index <NA> NA <NA> <NA> NA <NA>
# ... with 656 more variables: Australia <chr>, Austria <dbl>,
# Azerbaijan <chr>, Bahrain <chr>, Bangladesh <chr>, Belarus <chr>,
# Belgium <dbl>, Benin <chr>, `Bolivia (Plurinational State
# of)` <chr>, `Bosnia and Herzegovina` <chr>, Botswana <chr>,
# Brazil <chr>, `Brunei Darussalam` <chr>, Bulgaria <dbl>, `Burkina
# Faso` <chr>, `Cabo Verde` <chr>, Cambodia <chr>, Cameroon <chr>,
# Canada <chr>, Chile <dbl>, China <chr>, Colombia <dbl>, `Costa
# Rica` <dbl>, `Côte d'Ivoire` <chr>, Croatia <dbl>, Cyprus <dbl>,
# `Czech Republic (the)` <dbl>, Denmark <dbl>, `Dominican Republic
# (the)` <chr>, Ecuador <chr>, Egypt <chr>, `El Salvador` <chr>,
# Estonia <dbl>, Ethiopia <chr>, Finland <dbl>, France <dbl>,
# Georgia <dbl>, Germany <dbl>, Ghana <chr>, Greece <dbl>,
# Guatemala <chr>, Guinea <chr>, Honduras <chr>, `Hong Kong,
# China` <chr>, Hungary <dbl>, Iceland <dbl>, India <dbl>,
# Indonesia <dbl>, `Iran (Islamic Republic of)` <chr>,
# Ireland <dbl>, Israel <dbl>, Italy <dbl>, Jamaica <chr>,
# Japan <chr>, Jordan <chr>, Kazakhstan <dbl>, Kenya <chr>,
# Kuwait <chr>, Kyrgyzstan <chr>, `Lao People's Democratic Republic
# (the)` <chr>, Latvia <dbl>, Lebanon <chr>, Lithuania <dbl>,
# Luxembourg <dbl>, Madagascar <chr>, Malawi <chr>, Malaysia <dbl>,
# Mali <chr>, Malta <dbl>, Mauritius <chr>, Mexico <dbl>,
# Mongolia <chr>, Montenegro <chr>, Morocco <dbl>,
# Mozambique <chr>, Myanmar <chr>, Namibia <chr>, Nepal <chr>,
# `Netherlands (the)` <dbl>, `New Zealand` <chr>, `Niger
# (the)` <chr>, Nigeria <chr>, `North Macedonia` <chr>,
# Norway <dbl>, Oman <chr>, Pakistan <chr>, Panama <dbl>,
# Paraguay <chr>, Peru <chr>, Philippines <dbl>, Poland <dbl>,
# Portugal <dbl>, Qatar <dbl>, `Republic of Korea (the)` <dbl>,
# `Republic of Moldova (the)` <chr>, Romania <dbl>, `Russian
# Federation (the)` <dbl>, Rwanda <chr>, `Saudi Arabia` <chr>,
# Senegal <chr>, ...
For this project, we are only using the Score component of the dataset.
The start end and end year variables are going to be used throughout the data cleaning process, hence, we will set these as global variables. As mentioned in the Data Source section, we are only using dataset from year 2015 to 2020 due to the inconsistent naming convention in year 2013 and 2014.
start_year = 2015
end_year = 2020
We will load all the datasets from year 2015 to 2020 and dynamically assign them to a variable.
score_compiled = data.frame()
gii_df = data.frame()
We are going to 1. Extract the Score component of the dataset for each year, 2. Remove unwanted brackets from the score, 3. Pivot the dataset to a longer dataset
Refer to the code below for more info.
for (year in start_year:end_year){
var_name_2 <- paste("score", year, sep = "_")
year_df <- get(paste("year", year, sep = "_"))
score_col = which(colnames(year_df)=="Score")
value_col = which(colnames(year_df)=="Value")
score <- year_df %>%
select(c(1:2, (score_col+2):value_col-1)) %>%
slice(-c(1,6)) %>%
mutate(Indicator = X1) %>%
select(-c(1:2)) %>%
relocate(Indicator, .before = Albania_1)
score[] <- lapply(score, gsub, pattern = "\\[", replacement = "")
score[] <- lapply(score, gsub, pattern = "\\]", replacement = "")
score[] <- lapply(score, gsub, pattern = "n/a", replacement = 0)
score <- score %>%
type.convert %>%
pivot_longer(cols = !Indicator, names_to = "Country", values_to = "Data") %>%
mutate(Year = year, Category = "Score")
score$Country <- substr(score$Country, 1, nchar(score$Country)-2)
assign(var_name_2, score)
score_compiled <- rbind(score_compiled, get(var_name_2))
}
Below is how the compiled score dataset looks like after data cleaning.
head(score_compiled)
# A tibble: 6 x 5
Indicator Country Data Year Category
<fct> <chr> <dbl> <int> <chr>
1 Global Innovation Index Afghanistan 10.6 2015 Score
2 Global Innovation Index Albania 30.7 2015 Score
3 Global Innovation Index Algeria 24.4 2015 Score
4 Global Innovation Index American Samoa 10.6 2015 Score
5 Global Innovation Index Andorra 10.6 2015 Score
6 Global Innovation Index Angola 26.2 2015 Score
Many of the country names are not consistent through the years. Hence, we need to standardise them so that the same country with different names would not be picked up as 2 different countries. For example, United States of America is named as “United States of America (the)” in year 2019 and year 2020 while named as “United States of America” from year 2015 to year 2018.
score_compiled %>%
select(Country, Year) %>%
unique() %>%
filter(str_detect(Country, "^United States"))
# A tibble: 6 x 2
Country Year
<chr> <int>
1 United States of America 2015
2 United States of America 2016
3 United States of America 2017
4 United States of America 2018
5 United States of America (the) 2019
6 United States of America (the) 2020
unique_cty <- score_compiled %>%
select(Country) %>%
unique() %>%
count()
print(paste("There are", unique_cty,"unique countries in the dataset BEFORE standardsing the country names."))
[1] "There are 250 unique countries in the dataset BEFORE standardsing the country names."
What we did is to assign a ISO3C country code to each country based on the best matched country code using the “countrycode” package and then using the ISO3C country code to generate a standardised country name. Some of the countries cannot be found in the ISO standard naming convention and thus, we used the original name in the dataset.
At the same time, we also differentiate the overall indicators by changing them to uppercase.
gii_df <- score_compiled %>%
mutate(iso3c = countrycode(Country, origin = 'country.name', destination = 'iso3c')) %>%
mutate(cty_name = countrycode(iso3c, origin = 'iso3c', destination = 'country.name.en' )) %>%
mutate(Country = ifelse(is.na(cty_name), Country, cty_name)) %>%
select(c(1:5)) %>%
filter(Indicator != "Index") %>%
mutate(Indicator = str_to_upper(Indicator, locale = 'en'))
unique_cty <- gii_df %>%
select(Country) %>%
unique() %>%
count()
print(paste("There are", unique_cty,"unique countries in the dataset AFTER standardsing the country names."))
[1] "There are 231 unique countries in the dataset AFTER standardsing the country names."
The naming convention of some indicators also differs from year to year. To make the name of the indicators consistent, we will align the name of the indicator based on year 2020 according to the section number of each indicator.
indicator_compiled = data.frame()
for (year in start_year:end_year){
year_df <- get(paste("year", year, sep = "_"))
indicator <- year_df %>%
select(c(1:2)) %>%
mutate(Year = year) %>%
filter(X2 != "") %>%
rename(Indicator = X1) %>%
rename("Indicator Name" = X2)
indicator_compiled <- rbind(indicator_compiled, indicator)
}
indicator_unique <- indicator_compiled %>%
arrange(desc(Year)) %>%
distinct(Indicator, .keep_all = TRUE) %>%
select(c(1:2))
gii_df <- gii_df %>%
left_join(indicator_unique, by = "Indicator") %>%
mutate(`Indicator Name` = replace_na(`Indicator Name`, "")) %>%
mutate(Indicator = str_trim(paste(Indicator, `Indicator Name`, sep = " "))) %>%
select(c(1:5))
For the purpose of visualisation such as bubble plots, we pivot the dataframe to a longer stucture with each indicator as a column.
gii_df_wide_score <- gii_df %>%
filter(Category == "Score") %>%
pivot_wider(names_from = Indicator, values_from = Data)
Preview of the wider dataframe.
head(gii_df_wide_score)
# A tibble: 6 x 117
Country Year Category `GLOBAL INNOVAT~ `INNOVATION EFF~
<chr> <int> <chr> <dbl> <dbl>
1 Afghan~ 2015 Score 10.6 3.8
2 Albania 2015 Score 30.7 0.5
3 Algeria 2015 Score 24.4 0.5
4 Americ~ 2015 Score 10.6 3.8
5 Andorra 2015 Score 10.6 3.8
6 Angola 2015 Score 26.2 1
# ... with 112 more variables: `INNOVATION INPUT SUB-INDEX` <dbl>,
# `INNOVATION OUTPUT SUB-INDEX` <dbl>, `1. Institutions` <dbl>,
# `1.1. Political environment` <dbl>, `1.1.1. Political and
# operational stability` <dbl>, `1.1.2. Government
# effectiveness` <dbl>, `1.2. Regulatory environment` <dbl>,
# `1.2.1. Regulatory quality` <dbl>, `1.2.2. Rule of law` <dbl>,
# `1.2.3. Cost of redundancy dismissal, salary weeks` <dbl>, `1.3.
# Business environment` <dbl>, `1.3.1. Ease of starting a
# business` <dbl>, `1.3.2. Ease of resolving insolvency` <dbl>,
# `1.3.3. Ease of paying taxes` <dbl>, `2. Human capital and
# research` <dbl>, `2.1. Education` <dbl>, `2.1.1. Expenditure on
# education, % GDP` <dbl>, `2.1.2. Government funding per secondary
# student` <dbl>, `2.1.3. School life expectancy` <dbl>, `2.1.4.
# Assessment in reading, mathematics, and science` <dbl>, `2.1.5.
# Pupil-teacher ratio, secondary` <dbl>, `2.2. Tertiary
# education` <dbl>, `2.2.1. Tertiary enrolment` <dbl>, `2.2.2.
# Graduates in science and engineering` <dbl>, `2.2.3. Tertiary
# level inbound mobility` <dbl>, `2.3. Research and development
# (R&D)` <dbl>, `2.3.1. Researchers` <dbl>, `2.3.2. Gross
# expenditure on R&D (GERD)` <dbl>, `2.3.3. Global R&D companies,
# top 3, mn US$` <dbl>, `3. Infrastructure` <dbl>, `3.1.
# Information and communication technologies (ICTs)` <dbl>, `3.1.1.
# ICT access` <dbl>, `3.1.2. ICT use` <dbl>, `3.1.3. Government's
# online service` <dbl>, `3.1.4. Online e-participation` <dbl>,
# `3.2. General infrastructure` <dbl>, `3.2.1. Electricity
# output` <dbl>, `3.2.2. Logistics performance` <dbl>, `3.2.3.
# Gross capital formation` <dbl>, `3.3. Ecological
# sustainability` <dbl>, `3.3.1. GDP per unit of energy use` <dbl>,
# `3.3.2. Environmental performance` <dbl>, `3.3.3. ISO 14001
# environmental certificates` <dbl>, `4. Market
# sophistication` <dbl>, `4.1. Credit` <dbl>, `4.1.1. Ease of
# getting credit` <dbl>, `4.1.2. Domestic credit to private
# sector` <dbl>, `4.1.3. Microfinance institutions' gross loan
# portfolio` <dbl>, `4.2. Investment` <dbl>, `4.2.1. Ease of
# protecting minority investors` <dbl>, `4.2.2. Market
# capitalization` <dbl>, `4.2.3. Venture capital deals` <dbl>,
# `4.2.4. Venture capital deals` <dbl>, `4.3. Trade, competition,
# and market scale` <dbl>, `4.3.1. Applied tariff rate, weighted
# mean` <dbl>, `4.3.2. Intensity of local competition` <dbl>, `5.
# Business sophistication` <dbl>, `5.1. Knowledge workers` <dbl>,
# `5.1.1. Employment in knowledge-intensive services` <dbl>,
# `5.1.2. Firms offering formal training, %` <dbl>, `5.1.3. GERD
# performed by business enterprise` <dbl>, `5.1.4. GERD financed by
# business enterprise` <dbl>, `5.1.5. Females employed with
# advanced degrees` <dbl>, `5.2. Innovation linkages` <dbl>,
# `5.2.1. University/industry research collaboration` <dbl>,
# `5.2.2. State of cluster development` <dbl>, `5.2.3. GERD
# financed by abroad` <dbl>, `5.2.4. Joint venture/strategic
# alliance deals` <dbl>, `5.2.5. Patent families filed in at least
# two offices` <dbl>, `5.3. Knowledge absorption` <dbl>, `5.3.1.
# Intellectual property payments` <dbl>, `5.3.2. High-tech
# imports` <dbl>, `5.3.3. ICT services imports` <dbl>, `5.3.4.
# Foreign direct investment, net inflows` <dbl>, `6. Knowledge and
# technology outputs` <dbl>, `6.1. Knowledge creation` <dbl>,
# `6.1.1. Patent applications by origin` <dbl>, `6.1.2. PCT
# international applications by origin` <dbl>, `6.1.3. Utility
# model applications by origin` <dbl>, `6.1.4. Scientific and
# technical publications` <dbl>, `6.1.5. Citable documents H
# index` <dbl>, `6.2. Knowledge impact` <dbl>, `6.2.1. Growth rate
# of GDP per person engaged` <dbl>, `6.2.2. New business
# density` <dbl>, `6.2.3. Total computer software spending` <dbl>,
# `6.2.4. ISO 9001 quality certificates` <dbl>, `6.2.5. High-tech
# and medium high-tech manufacturing, %` <dbl>, `6.3. Knowledge
# diffusion` <dbl>, `6.3.1. Intellectual property receipts` <dbl>,
# `6.3.2. High-tech exports` <dbl>, `6.3.3. ICT services
# exports` <dbl>, `6.3.4. Foreign direct investment, net
# outflows` <dbl>, `7. Creative outputs` <dbl>, `7.1. Intangible
# assets` <dbl>, `7.1.1. Trademark application class count by
# origin` <dbl>, `7.1.2. Global brand value, top 5000, %
# GDP` <dbl>, `7.1.3. Industrial designs by origin/bn PPP$
# GDP` <dbl>, `7.1.4. ICTs and organizational model
# creation` <dbl>, `7.2. Creative goods and services` <dbl>,
# `7.2.1. Cultural and creative services exports` <dbl>, ...
Since there are over 100 indicators, we are going to focus on the main 7 indicators in year 2020 for our statistical analysis.
The main 7 indicators are:
We filter only the data in year 2020 and select the relevant columns. All the NA are replaced with 0.
df <- gii_df_wide_score %>%
filter(Year == 2020) %>%
select(c("Country",
"GLOBAL INNOVATION INDEX",
"1. Institutions",
"2. Human capital and research",
"3. Infrastructure",
"4. Market sophistication",
"5. Business sophistication",
"6. Knowledge and technology outputs",
"7. Creative outputs")) %>%
replace(is.na(.), 0)
df <- column_to_rownames(df, var = "Country")
pairs.panels(df,
method = "pearson", # correlation method
hist.col = "#00AFBB",
density = TRUE, # show density plots
ellipses = TRUE # show correlation ellipses
)

From the correlation plot, all indicators are strongly correlated with the Global Innovation Index, with the lowest contributor being “Market sophistication” at 0.76. “Market sophistication” is also not so highly correlated with all of the other indicators.
cor.test.p <- function(x){
FUN <- function(x, y) cor.test(x, y)[["p.value"]]
z <- outer(
colnames(x),
colnames(x),
Vectorize(function(i,j) FUN(x[,i], x[,j]))
)
dimnames(z) <- list(colnames(x), colnames(x))
z
}
p <- cor.test.p(df)
heatmaply_cor(cor(df),
node_type = "scatter",
point_size_mat = -log10(p),
point_size_name = "-log10(p-value)",
label_names = c("x", "y", "Correlation"),
k_col = 2,
k_row = 2)
The correlation can be seen more clearly with the heatmap above. For our project application, we will be using the above correlation plot since it is more interactive and easier to identify which is the higher correlated variable.
gii_df_wide_score_continent <- gii_df_wide_score %>%
mutate(region = countrycode(Country, origin = 'country.name.en', destination = 'un.region.name'))
gg2 <- ggplot(gii_df_wide_score_continent %>%
filter(Year == 2020),
aes(x = region, y = `GLOBAL INNOVATION INDEX`)) +
geom_boxplot(aes(fill=region)) + theme_bw() +
theme(axis.title = element_text(family = "Helvetica", size = (8)))
ggplotly(gg2)
gg3 <- ggplot(gii_df_wide_score_continent %>%
filter(Year == 2020),
aes(x=region,y=`GLOBAL INNOVATION INDEX`))+
geom_violin(aes(fill=region),alpha=0.7)+ theme_bw() +
theme(axis.title = element_text(family = "Helvetica", size = (8)))
ggplotly(gg3)
ggbetweenstats(
data = gii_df_wide_score_continent %>%
filter(Year == 2020),
x = region,
y = `GLOBAL INNOVATION INDEX`,
title = 'Scatter + Box + Violin',
type = 'parametric',
conf.level = 0.95,
pairwise.comparisons = TRUE,
pairwise.display = 'significant',
p.adjust.method = 'holm',
ggtheme = ggplot2::theme_classic()
)

Since p-value is <0.05, we conclude that there is a significant difference between the GII score between the 5 different continents/regions.
heatmap_indicator <- gii_df_wide_score %>%
filter(Year == 2020) %>%
select(c("Country",
"1. Institutions",
"2. Human capital and research",
"3. Infrastructure",
"4. Market sophistication",
"5. Business sophistication",
"6. Knowledge and technology outputs",
"7. Creative outputs")) %>%
replace(is.na(.), 0) %>%
column_to_rownames(var = "Country") %>%
percentize()
a <- fviz_nbclust(heatmap_indicator, hcut, method = "silhouette")
b <- fviz_nbclust(heatmap_indicator, hcut, method = "wss")+
geom_vline(xintercept = 3, linetype = 2)
c <- fviz_nbclust(heatmap_indicator, hcut, method = "gap_stat")
grid.arrange(a,b,c)

Based on the 3 methods, the best bet is to use 2 or 3 as the optimal number of clusters. For this project, we chose the optimal number of clusters to be 3.
heatmaply(heatmap_indicator, k_row = 3, k_col = 2,
dist_method = "euclidean", hclust_method ="complete")
Interestingly, you can see that “Market sophistication” is clustered on its own from the rest of the indicators which is what we described during the correlation analysis.
Lastly, we visualise the different clusters via a cluster map. This allows us to interactively identify which countries are in which cluster.
heatmap_indicator_clus <- heatmap_indicator %>%
rownames_to_column(var = "Country") %>%
mutate(cluster = cutree(hclust(dist(heatmap_indicator, method = "euclidean"), method = "complete"), k = 3)) %>%
column_to_rownames(var = "Country")
clus_plot <- fviz_cluster(list(data = heatmap_indicator_clus, cluster = heatmap_indicator_clus$cluster),
geom = "text",
labelsize = 10,
show.clust.cent = FALSE,
main = "Clustering of Countries by Indicator Score",
legend = "none",
ggtheme = theme_minimal(),
xlab = FALSE,
ylab = FALSE)
ggplotly(clus_plot, tooltip = row.names(clus_plot))
The current correlation plot only consists of the 7 main indicators and only for year 2020. However, we have over 100 over sub-indicators for us to perform correlation analysis on and 6 years of data. Therefore, for our shiny app, we are going to allow users to interactively select:

The current statistical plot only focus on the GII score and does not involve any other indicators. Therefore, for our shiny app, we are going to allow users to interactively select:

The current hierarchical clustering only consists of the 7 main indicators for year 2020, and also fixed at 3 clusters. Therefore, for our shiny app, we are going to allow users to interactively select:

Currently, all the graphs are limited by the current selection of variables and therefore, are not as useful for visual analytics. With the help of shiny and the interactiveness of plotly package, the proposed visualisation storyboard will be able to allow user to generate more insights from the GII dataset.