Dimensional Reduction and KDE of the Fragile State Index
Abstract
The Fragile State Index uses 12 indicator variables to quantify countries’ risk of state failure or collapse. However, any analysis built on those variables, say a linear regression, would likely suffer from two issues: many of the variables are correlated with one another (multicollinearity), and the high number of variables might cause a model to overfit to noise (the curse of dimensionality).
This project aims to reduce the number of indicator variables through dimensional reduction, using principal component analysis (PCA).
Through PCA, we derived a single new numeric variable, which I have interpreted as “Humanitarian Public Goods.” Kernel density estimation (KDE) was then used to estimate the probability density function of this new variable.
Introduction
The Fund for Peace publishes an annual “Fragile State Index” that aims to assess how vulnerable states are to collapse. These are situations where a state’s government might lose sovereignty over its territory, or even lose all credibility as the governing authority domestically and internationally.
The index draws on several qualitative and quantitative indicators expected to affect a country’s stability, such as the economy, inter-group violence, and foreign intervention. There are 12 of these “indicator variables,” each expressed on an ordinal scale from 1-10, with higher values representing more unstable situations. These indicators are then summed into an aggregate score representing the country’s overall instability, its fragility index. (The Fund for Peace, 2017)
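The scoring arithmetic can be sketched in a few lines; the indicator values below are entirely hypothetical, chosen only to illustrate how the 12 scores roll up into the index total:

```python
# Hypothetical indicator scores for one highly fragile country; each score is
# on the 1-10 ordinal scale, and the index total is simply their sum
scores = {"C1": 9.5, "C2": 9.0, "C3": 8.5, "E1": 9.2, "E2": 8.0, "E3": 7.5,
          "P1": 9.8, "P2": 9.3, "P3": 8.9, "S1": 9.1, "S2": 9.4, "X1": 9.6}

total = sum(scores.values())
print(f"Fragile State Index total: {total:.1f} / 120")
```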
Choropleth Map of States’ Fragility Indexes in 2020; Click here for an interactive version which also has data from 2010-2020
However, there are primarily two problems with this dataset that must be addressed before any kind of meaningful analysis can be conducted using the indicators.
First is the curse of dimensionality. There are 12 indicator variables in total that make up the index. A model generated from the data, say a multiple regression model, would therefore suffer from both sparsity and overfitting concerns. Sparsity warns us that there may not be enough data to support the complexity (i.e., too many input variables) of a model that includes all the indicators. And even then, overfitting might still occur: as one adds more variables from this dataset into a model, it will naturally fit better, but we cannot be confident whether we are fitting to noise or to the actual population parameters we wish to examine.
Second, there is the issue of multicollinearity. Many of the indicator variables in the dataset are highly correlated and predictive of one another. This creates problems, since many common predictive models and methods of analysis assume that there is no multicollinearity in the input data. Additionally, the existence of multicollinearity suggests that a set of highly correlated input variables could be described with fewer variables that still capture the same information.
Essentially, this boils down to a concern over parsimony. We would prefer a less complex dataset with orthogonal variables while still retaining the information the original dataset gives us. We need to separate the signal from the noise.
Dimensional reduction serves as a strong solution to these issues. By transforming the original dataset using principal component analysis (PCA), we can derive a lower-dimensional, orthogonal dataset (a reduced number of variables which are not collinear) that should still retain most of the information the original dataset contains. One could then use this transformed data to perform further analysis.
Since we ultimately derived just one new variable through PCA, we can use kernel density estimation (KDE) to estimate the probability density function of this now 1-dimensional dataset. This process will help us describe the new variable and draw some insights into how states distribute themselves along this metric.
Ultimately, this project serves as a way to explore a common use case of PCA to handle complex and high dimensional data, a common issue in social science fields such as international relations. Then we can use KDE to pull some cursory analyses from this transformed data that we would not have observed if we had just worked with the raw data itself.
Data Exploration and Cleaning
There are 17 variables contained in the dataset including the 12 aforementioned indicator variables. The other variables are country, year, rank (amongst other countries in the fragile state index), the country’s fragile state index total, and percent change from the previous year.
Before cleaning up the data, it is always worth perusing the codebook to get a grasp of what the variables actually represent as the indicator variables might not be immediately easy to interpret. This will be especially important later on for interpreting the meaning of any new variables derived from PCA, where some subjectivity will need to be exercised.
The 12 indicator variables of interest are as follows:
C1- Security Apparatus; how able is the state to respond to security issues like crime and terrorism?
C2- Factionalized Elites; are the leaders of the state unified or split along ethnic, partisan, religious, etc. lines?
C3- Group Grievances; is there inter-group violence or discrimination?
E1- Economy; what is the general condition of the state’s economy?
E2- Economic Inequality; are wealth and economic resources distributed evenly in the state?
E3- Human Flight and Brain Drain; are the state’s educated professionals and leaders fleeing the country?
P1- State Legitimacy; does the state’s government have the confidence of its citizens?
P2- Public Services; how equitably and readily accessible are public goods such as education, healthcare, and infrastructure?
P3- Human Rights; does the state respect and protect the human rights and civil freedoms of its citizens?
S1- Demographic Pressures; how sustainable is the population growth, and are there concerns with public health?
S2- Refugees and Internally Displaced Persons (IDPs); is the state able to sustain incoming refugees or IDPs, if any?
X1- External Intervention; are other countries interfering in the country overtly or covertly?
(The Fund for Peace, 2017)
The data collected by the Fund for Peace spans 2010-2020 for all available countries, and all years were included in the original dataset. The ultimate unit of analysis for our purposes is the country, not the country-year.
So, to avoid issues of pseudoreplication down the line, we took the median value of each indicator variable per state, across all its available years. For example, Afghanistan had 10 observations in the dataset, so we took the median value of the C1 indicator variable across all those years to stand in for the country’s C1 value. In the end, this leaves us with a sample size of 181 countries, with 12 indicator variables or columns. Now the data is prepared for PCA.
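The aggregation step can be sketched with pandas on a tiny hypothetical slice of the data (the country names are real, but the values and the two-indicator subset are illustrative only):

```python
import pandas as pd

# Hypothetical slice of the raw data: one row per country-year
raw = pd.DataFrame({
    "Country": ["Afghanistan"] * 3 + ["Norway"] * 3,
    "Year": [2018, 2019, 2020] * 2,
    "C1": [9.5, 9.7, 9.9, 1.2, 1.0, 1.1],
    "E1": [8.8, 9.0, 9.1, 1.5, 1.4, 1.3],
})

# Collapse country-years into one observation per country via the median,
# avoiding pseudoreplication from repeated yearly measurements
per_country = raw.groupby("Country")[["C1", "E1"]].median()
print(per_country)
```

The real analysis would do the same over all 12 indicator columns and 2010-2020.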
Checking for Multicollinearity
The indicator variables in the dataset, even at first glance, most likely suffer from multicollinearity. One could easily propose a simple causal mechanism as to why the variables are correlated with, and predict, one another. For example, take the variables C1 and E1, security apparatus and the economy respectively. A state with a poor economy probably cannot fund a robust military and police force. A poorly funded police force and military, in turn, cannot easily respond to crime and terrorist attacks, which harms infrastructure and economic confidence in the country, hurting the economy. The now-poorer state has even less revenue to fund its already lacking security forces, leading to even worse responses to crime and terrorism, and so on in a vicious, self-reinforcing cycle.
But of course, it would be most prudent to check our suspicion quantitatively. This was done by generating a Pearson correlation matrix between all 12 indicator variables. The following heatmap illustrates the results:
Heatmap of the Pearson Correlation Matrix Between all Indicator Variables
Here, one can easily observe that multicollinearity will be an issue: many of the variables have a strong, positive linear relationship with one another, whereas ideally we would want variables far more orthogonal to each other. On top of that, we also want to reduce the dimensionality of the dataset to avoid the curse of dimensionality. Thus, this dataset is a prime candidate for dimensional reduction.
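A minimal sketch of this check, using synthetic data in place of the real indicator table (the column names and the correlation structure below are assumed purely for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in for the indicator table: two columns sharing a latent
# factor (so they correlate strongly) and one independent column
base = rng.normal(size=200)
indicators = pd.DataFrame({
    "C1": base + rng.normal(scale=0.3, size=200),
    "E1": base + rng.normal(scale=0.3, size=200),
    "X1": rng.normal(size=200),
})

# Pairwise Pearson correlations; this matrix is what a heatmap
# (e.g. seaborn.heatmap(corr)) would visualize
corr = indicators.corr(method="pearson")
print(corr.round(2))
```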
PCA Part 1: Eigenvalues of Principal Components and Screeplot Assessment
After normalizing all the indicator variables by taking their Z-scores, a PCA was run to generate the eigenvalue of each principal component (PC). The eigenvalues are reported as follows, in order of decreasing value:
Eigenvalues of Each Principal Component
Moreover, the percent of the variance explained by each PC has been calculated and is reported as follows, in the same order:
Percent of Variance Explained by Each Principal Component
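The eigenvalue calculation can be sketched in NumPy on a synthetic stand-in matrix (the one-latent-factor structure is assumed purely for illustration; the real analysis would use the 181x12 indicator matrix):

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical stand-in for the 181x12 indicator matrix:
# one shared latent factor plus noise
latent = rng.normal(size=(181, 1))
X = latent + rng.normal(scale=0.4, size=(181, 12))

# Z-score each column, then eigendecompose the correlation matrix:
# its eigenvalues are the PC variances, its eigenvectors the loadings
Z = (X - X.mean(axis=0)) / X.std(axis=0)
eigenvalues = np.sort(np.linalg.eigvalsh(np.corrcoef(Z, rowvar=False)))[::-1]
pct_explained = 100 * eigenvalues / eigenvalues.sum()

print(eigenvalues.round(2))     # decreasing; they sum to 12 (the trace)
print(pct_explained.round(1))
```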
With this matrix, we can start to deduce which principal components explain the most variation in the data and are the most informative. However, the selection of which principal components are meaningful varies depending on the specific criterion used to assess them. Three criteria will be shown here for the sake of demonstration: the elbow criterion, the Kaiser criterion, and Horn’s criterion. Ultimately, Horn’s criterion will be the determining one, for reasons explained below. In any case, a screeplot charting the eigenvalues against their principal components can be incredibly informative.
Screeplot Plotting Eigenvalues vs. Principal Components
Here we can readily make two observations: the first PC captures a majority of the variation in the data, and the eigenvalues drop off dramatically in magnitude after the first PC.
If we first consider the elbow criterion, it tells us that only one PC is meaningful here: the very first. The elbow criterion considers any PCs after a “sharp drop-off” (such as the drop-off between PCs 1 and 2) not to be meaningful, and any PCs before the drop-off to be meaningful.
Moving on to the Kaiser criterion, only PCs with an eigenvalue above 1.0 are considered meaningful, while the rest are written off as noise. This can be shown on the screeplot with a Kaiser line, the horizontal purple line at eigenvalue = 1:
Screeplot with Kaiser Criteria Line
The Kaiser criterion, like the elbow criterion, shows that only one PC is meaningful here and exceeds the Kaiser line (principal component 2 falls just short of the criterion, with an eigenvalue of ~0.99).
Yet both the Kaiser and elbow criteria tend to be regarded as overly liberal, and in other cases might grant too much leniency to principal components. Horn’s criterion may be the most robust method.
Horn’s criterion (also known as parallel analysis) works by simulating a noise distribution of the expected eigenvalues for each PC; any PC whose eigenvalue exceeds the value expected under noise is considered meaningful. A noise distribution for this screeplot was generated with 10,000 resamples and overlaid on the screeplot to see which PCs exceed the noise:
Screeplot with Noise Distribution for Horn’s Criteria; The green splotch is made up of 10,000 lines representing the expected screeplot for each PC to simulate a noise distribution
From this, one can see that, as with the other two criteria, only the first PC is meaningful: PC 1 exceeds the noise distribution illustrated by the green splotch running across the screeplot.
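A sketch of Horn’s parallel analysis on synthetic stand-in data (an assumed one-factor structure and far fewer resamples than the 10,000 used here, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 181, 12

# Hypothetical correlated sample standing in for the real indicator matrix
latent = rng.normal(size=(n, 1))
X = latent + rng.normal(scale=0.4, size=(n, p))

def sorted_eigenvalues(M):
    """Eigenvalues of the correlation matrix of M, largest first."""
    return np.sort(np.linalg.eigvalsh(np.corrcoef(M, rowvar=False)))[::-1]

observed = sorted_eigenvalues(X)

# Parallel analysis: eigenvalues expected from pure noise of the same shape
# (500 resamples here instead of 10,000, to keep the sketch fast)
noise = np.array([sorted_eigenvalues(rng.normal(size=(n, p)))
                  for _ in range(500)])
threshold = np.percentile(noise, 95, axis=0)  # 95th-percentile noise line

# A PC is meaningful when its observed eigenvalue exceeds the noise line
meaningful = observed > threshold
print("PCs exceeding the noise distribution:", int(meaningful.sum()))
```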
Thus, by Horn’s criterion, we will consider only the first PC meaningful enough to sufficiently capture the information conveyed in the original dataset. Since this principal component is expressed in an entirely new numeric scale, a coordinate space different from that of the original variables, we now need to interpret its meaning and its relation to the original dataset.
PCA Part 2: Loading Matrix Analysis
PCA generates a second important matrix: the loading matrix. By comparing the loadings for each factor in each principal component, we can observe which variables in the original dataset the principal component in question “points toward”: that is, which variables have the highest affinity/correlation with a principal component and thus best describe and contribute to its explanatory power. Since we found only one meaningful PC, we will examine only its loadings and plot them on a bar plot for ease of assessment.
Loading Matrix Barplot
This loading matrix shows us that the loadings on factors 8 and 9 have the highest positive values. Factors 8 and 9 correspond to the variables “P2- Public Services” and “P3- Human Rights” respectively. These two variables are therefore fairly representative of the information in this principal component, so we want to interpret and label the principal component with regard to them.
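A sketch of how such loadings fall out of the eigendecomposition, on synthetic data deliberately rigged so that two columns (here labeled P2 and P3) carry the strongest shared signal; the weights and values are assumptions for illustration, not taken from the real data:

```python
import numpy as np

rng = np.random.default_rng(3)
names = ["C1", "C2", "C3", "E1", "E2", "E3",
         "P1", "P2", "P3", "S1", "S2", "X1"]

# Hypothetical indicators where P2 and P3 (indices 7 and 8) load strongest
latent = rng.normal(size=(181, 1))
weights = np.array([0.3] * 7 + [1.0, 1.0] + [0.3] * 3)
X = latent * weights + rng.normal(scale=0.5, size=(181, 12))

Z = (X - X.mean(axis=0)) / X.std(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
pc1_loadings = eigvecs[:, -1]                 # eigenvector of largest eigenvalue
pc1_loadings = pc1_loadings * np.sign(pc1_loadings.sum())  # fix arbitrary sign

# The variables with the largest loadings dominate the first PC
top_two = [names[i] for i in np.argsort(pc1_loadings)[-2:]]
print(top_two)
```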
With this in mind, I would interpret this principal component to mean “Humanitarian Public Goods.” To elaborate, the commonality between public services and human rights is that both reflect the availability of some kind of public good in a state.
A public good is a good or service that is non-rival and non-excludable. Its supply is not diminished by consumption, and it is impossible to prevent any particular person, where the good is provided, from enjoying its benefits. An example is national security/military defense. Everyone in a country with some form of military enjoys the service equally; one person enjoying the protection of the military does not prevent her neighbor from enjoying the same benefit, since the “supply” of military protection has not diminished. Moreover, it is impossible to exclude a citizen from enjoying the service so long as they are in the country. A country would be hard-pressed to say person x is not allowed to be protected by its military while everyone else is; that situation is rather absurd.
For contrast, a private good is the complete opposite: a rival, excludable good or service. An example is any commodity you could buy in a store. If I buy a soda at the store, that is one soda another person cannot enjoy, and I have thus reduced the store’s current supply of sodas. Moreover, it is excludable: if one could not pay for the soda, one would be excluded from enjoying the benefits of consuming it.
Linking this back to the original variables, public services are clearly public goods. Everyone benefits from greater availability and quality of healthcare, education, and public infrastructure, and one’s enjoyment of quality education, for example, does not prevent her fellow citizens from attaining the same benefits. It would also be quite difficult to make any of them excludable; a country would be hard-pressed to say citizens x, y, or z cannot use bridge A under any circumstances (and if it could, the bridge would no longer really be a public service).
Furthermore, human rights are also a form of public good in a state. Everyone in the state benefits from more equitable protection in the eyes of the law and from respect for their human and civil rights, and consumption cannot reduce the “supply” of these protections. Moreover, to make human rights excludable would be rather oxymoronic; the point of them is to include everyone under equal protection of the law.
Having argued why the two variables are linked as public goods, I want to elaborate on why I described them as “humanitarian.” One should note that military protection also constitutes a form of public good, and that kind of good is distinctly not humanitarian in nature. We would not want to over-attribute factor 1 (C1- Security Apparatus), for example, as it clearly does not load highly on this principal component. The term “humanitarian” thus makes the needed distinction: public goods such as infrastructure or human rights address humanitarian concerns, not security concerns.
With all of this in mind, the PCA suggests that states, based on this dataset, are best described along an axis of how available and well-developed their humanitarian public goods are. This axis captures the most information in the dataset. Moreover, we now have a single new variable describing the dataset, having successfully reduced its dimensionality from 12 variables down to one.
Now we can take a look at the original data expressed in terms of this new variable and its new scale.
PCA Part 3: Examining the Transformed Data
From PCA we generate an entirely new set of variables with their own numeric scale, different from that of the original dataset. These new values would be meaningless without some subjective interpretation, but we have already provided that in the previous step by saying the new variable distilled from PCA quantifies the humanitarian public goods (hereafter HPG) in a state. I took the HPG values and aligned them with their respective states. A snippet of this new dataset, in descending order, is reported as follows:
The Transformed Data Aligned by Country; Note how the new scale of HPG is different from the 1-10 ordinal scale of the original dataset
This small snippet shows us that the new scale for the variable ranges from ~ -5.63 to ~6.64, and that based on the context of the original dataset, states with higher, positive scores in this variable have a higher availability of humanitarian public goods and vice versa for lower, more negative scores. This is the signal amidst the noise of the original dataset.
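The projection behind this table can be sketched in a few lines; the country names, indicator values, and PC1 direction below are all hypothetical placeholders (the real scores come from projecting the standardized data onto the PC1 found earlier):

```python
import numpy as np

rng = np.random.default_rng(4)
countries = np.array(["Country A", "Country B", "Country C", "Country D"])

# Hypothetical standardized indicator rows for four countries, and a
# stand-in unit-length PC1 direction
Z = rng.normal(size=(4, 12))
pc1 = rng.normal(size=12)
pc1 /= np.linalg.norm(pc1)

# Each country's HPG score is its projection onto the first PC
hpg = Z @ pc1

# Report in descending order, as in the table of transformed data
order = np.argsort(hpg)[::-1]
for name, score in zip(countries[order], hpg[order]):
    print(f"{name}: {score:+.2f}")
```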
A next step with this new variable is to find a way to characterize or visualize it. Since we now essentially have a 1-dimensional dataset, a histogram is the immediately obvious visualization, with its median marked off.
Histogram of HPG; Median = -0.54, Mean = ~0.00, Standard Deviation = ~3.05
This histogram shows us that the empirical distribution of HPG is not necessarily normal and appears to be lightly right-skewed (the mean is ~0, while the median is -0.54). However, we would like to be far more specific in characterizing this dataset, since a histogram can only tell us so much. Moreover, it would be nice to see whether there are interval divisions we can make along the dataset to “group” certain states together, akin to a classifier.
Since a histogram is only a few steps removed from a probability density estimate, we might as well treat the HPG variable as a random variable and estimate its probability density function using KDE.
Kernel Density Estimation
KDE estimates the probability density function of a random variable, and we can use the resulting estimates to visualize a probability density distribution. We must first determine two parameters before running our random variable, HPG, through KDE: the bandwidth and the type of kernel.
The bandwidth parameter determines the width of the kernel, or window, that KDE slides across the dataset; it controls how heavily the data points surrounding any point x are weighted in the estimate of the function at that point. This parameter is quite important: generally speaking, larger bandwidths produce a smoother distribution, while smaller bandwidths produce a more granular, rougher one. In other words, a larger bandwidth generalizes the probability distribution more, since more points are considered in any window, and vice versa for smaller bandwidths.
This is essentially a trade-off between bias (underfitting) and variance (overfitting), and as with any other model, we would like to strike a balance between the two. Silverman’s rule provides a rule-of-thumb formula for the bandwidth based on the sample size and the inter-quartile range of the sample. For the sake of simplicity, we will use this formula to determine our bandwidth, calculated to be ~0.97.
Meanwhile, the “kernel” is essentially the “shape” of the window iterated over the random variable’s values. This can range from a Gaussian/normal kernel, to a uniform/“top-hat” kernel, to even a triangular or pyramid-shaped kernel. We have gone with a simple Gaussian kernel, again for the sake of simplicity and because it is the distribution most compatible with Silverman’s rule.
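Both steps, the Silverman bandwidth and a Gaussian KDE built by hand, can be sketched on synthetic stand-in scores (the data, and hence the resulting bandwidth, are illustrative only, not the real ~0.97):

```python
import numpy as np

rng = np.random.default_rng(5)
# Hypothetical HPG scores standing in for the 181 real values
hpg = rng.normal(scale=3.0, size=181)
n = len(hpg)

# Silverman's rule of thumb: h = 0.9 * min(sd, IQR / 1.34) * n^(-1/5)
q75, q25 = np.percentile(hpg, [75, 25])
h = 0.9 * min(hpg.std(ddof=1), (q75 - q25) / 1.34) * n ** (-1 / 5)

# Gaussian KDE: the density at each grid point is the average of normal
# "bumps" of width h centered at every observation
grid = np.linspace(hpg.min() - 3 * h, hpg.max() + 3 * h, 400)
bumps = np.exp(-0.5 * ((grid[:, None] - hpg[None, :]) / h) ** 2)
density = bumps.sum(axis=1) / (n * h * np.sqrt(2 * np.pi))

print(f"Silverman bandwidth = {h:.2f}")
```

In practice a library routine such as `scipy.stats.gaussian_kde` does the same job; the by-hand version just makes the kernel-and-bandwidth mechanics explicit.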
All said and done, with our KDE parameters picked out, we can now estimate and visualize the probability density distribution of the HPG variable.
Probability Density Distribution of HPG
Unsurprisingly, the shape of this distribution approximates that of the histogram from before. It also gives some evidence that HPG is not normally distributed among countries: notice the “fat tail” on the right side of the distribution compared to the thinner tail on the left. Still, extreme, high-magnitude values of HPG are rather unlikely compared to more moderate values.
But we can take this one step further by dividing the distribution into intervals and trying to categorize countries along the HPG axis based on those intervals. A standard classifier algorithm like k-nearest neighbors would not work here, since we have no class labels to train on. Instead, we might perform this sort of “classification” by simply dividing the density function at its local minima, and then treating the local maxima as a sort of “centroid” for each interval. For this particular probability density function, we find one local minimum at ~2.38 HPG and thus two intervals, with local maxima at ~ -1.12 and ~3.13. Plotting these points and shading the two intervals yields this plot:
Probability Density Distribution of HPG with Markers; Blue points represent local maxima, the green point represents the local minima, intervals are shaded blue and red
This plot illustrates that the probability density distribution can be divided into exactly two intervals based on the local minimum, which the green point represents. The intervals are colored red and blue, while the blue points mark the local maximum of each interval.
We might consider these two intervals a slightly more principled way to separate states along the HPG axis than, say, splitting them above and below the median or mean value of HPG; the local minimum and maxima respect the underlying probability density function, after all.
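The extrema-based split described above can be sketched in NumPy on a synthetic bimodal curve standing in for the estimated density (the mixture parameters are assumptions, chosen only to roughly mimic the plot’s shape):

```python
import numpy as np

# Synthetic bimodal curve standing in for the estimated density of HPG
grid = np.linspace(-6, 7, 400)
density = (0.7 * np.exp(-0.5 * ((grid + 1.12) / 1.5) ** 2)
           + 0.3 * np.exp(-0.5 * ((grid - 3.13) / 1.0) ** 2))
density /= density.sum() * (grid[1] - grid[0])   # normalize to integrate to 1

# Interior local minima/maxima: grid points lower/higher than both neighbors
is_min = (density[1:-1] < density[:-2]) & (density[1:-1] < density[2:])
is_max = (density[1:-1] > density[:-2]) & (density[1:-1] > density[2:])
minima = grid[1:-1][is_min]    # the split point(s) between intervals
maxima = grid[1:-1][is_max]    # the "centroid" of each interval

print("split at:", minima.round(2), "| interval centroids:", maxima.round(2))
```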
Furthermore, we could say the states in the blue interval have a relatively high availability of HPG and are more stable, while states in the red interval have a relatively low availability of HPG and thus poor stability. One could also argue that the two intervals are distributed differently.
Going off of this interpretation, we observe that many states are simply not relatively stable and do not have a high availability of HPG; even in the blue interval, the local maximum shows a high concentration of stable states troublingly close to the threshold between stable and unstable states. Meanwhile, in the low-stability interval, most states are relatively unstable, but many do not appear to be in particularly dire situations with extremely low HPG, as the local maximum is centered at a moderate HPG value of ~ -1.12. In fact, the probability density tapers off rapidly to the right of this point and stabilizes a bit around the local minimum. This could be interpreted to mean that many states are in a transitory stage of developing their availability of HPG.
Concluding Remarks
By the end of this project, we have taken a 12-variable dataset quantifying state instability and, using PCA, transformed it into a single new variable we have dubbed “Humanitarian Public Goods.” And by using KDE, we can clearly see that this new variable is not normally distributed and can separate states into two intervals of relatively low and high stability. This HPG axis embedded in the original dataset appears to be the most informative axis for describing the stability of states. All in all, we might now feel more comfortable using this transformed data in a model of sorts.
However, we might still want to be conservative with using this dataset, as after all the data we have derived from PCA is only as worthwhile as the integrity of the data we put in. Several indicators in the first place are quite qualitative and subjective measures, such as state legitimacy. The CAST framework used by the Fund for Peace to quantify these kinds of indicators leaves a lot of room for interpretation and could have possibly generated error in the original sampling.
Moreover, the exact number of indicator variables used in the index could be argued to be too many or too few to evaluate the stability of a state; for example do foreign interventions really affect the stability of a state or do domestic matters far outweigh them? Or perhaps certain government types or democratization matter for stability and should be added as a variable. Depending on how one answers this question, one could preempt the PCA with some manual feature selection to cut irrelevant variables. Thus this issue becomes more of a political science question rather than a data science question.
But working with what we have, we should now have a clearer picture of some of the most important factors that likely, and more precisely, characterize state stability.
Works Cited:
Fragile States Index and CAST Framework Methodology. The Fund for Peace, Washington, D.C., 2017.
Data Source:
The Fund for Peace. Fragile States Index. Washington D.C. https://fragilestatesindex.org/excel/.