One of the challenges in real data science is getting data from different sources in many different formats. In this notebook, we will explore some facts about the world using data taken from a TSV file, an Excel spreadsheet, and a database query.
First, we'll read in a TSV file containing the most recent CIA World Factbook data using the meta-csv library.
Expanding the results in the data viewer tells us that there are some nil values in columns of interest, and that our TSV importer was thrown off by this fact and so didn't convert the numerical columns to number types.
We're going to post-process this table a bit with ordinary Clojure sequence functions to filter out rows that have nils for our columns of interest, select those columns, convert strings to numbers, and (because we're Clojurists) convert keys to keywords.
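The clean-up steps described above might look roughly like the following sketch. The string column headers ("Country", "GDP per capita", "Life expectancy") and the "Placeholder" row are assumptions made for illustration, since the real headers depend on the TSV file; only the China and India values come from the table in this notebook.

```clojure
;; Stand-in for what a TSV reader might return: string keys, string values,
;; and nil where a cell was empty. Header names here are hypothetical.
(def raw-rows
  [{"Country" "China" "GDP per capita" "9800" "Life expectancy" "75.15"}
   {"Country" "Placeholder" "GDP per capita" nil "Life expectancy" "70.0"}
   {"Country" "India" "GDP per capita" "4000" "Life expectancy" "67.8"}])

(def wanted-columns ["Country" "GDP per capita" "Life expectancy"])

(def cleaned
  (->> raw-rows
       ;; keep only rows where every column of interest has a value
       (filter (fn [row] (every? some? (map row wanted-columns))))
       ;; build a new map with keyword keys and numeric values
       (map (fn [row]
              {:country         (row "Country")
               :gdp             (Long/parseLong (row "GDP per capita"))
               :life-expectancy (Double/parseDouble (row "Life expectancy"))}))))
```

Building a fresh map per row handles column selection, key conversion, and number parsing in one step, at the cost of listing each column explicitly.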
Things look pretty good in the data structure browser, but it would be easier to get an overview in tabular form. Luckily, Clerk's built-in table viewer can infer how to handle all of the most common configurations of rows and columns automatically.
:country | :gdp | :life-expectancy |
---|---|---|
China | 9800 | 75.15 |
India | 4000 | 67.8 |
European Union | 34500 | 80.02 |
United States | 52800 | 79.56 |
Indonesia | 5200 | 72.17 |
Brazil | 12100 | 73.28 |
Pakistan | 3100 | 67.05 |
Nigeria | 2800 | 52.62 |
Bangladesh | 2100 | 70.65 |
Russia | 18100 | 70.16 |
Japan | 37100 | 84.46 |
Mexico | 15600 | 75.43 |
Philippines | 4700 | 72.48 |
Ethiopia | 1300 | 60.75 |
Vietnam | 4000 | 72.91 |
Egypt | 6600 | 73.45 |
Turkey | 15300 | 73.29 |
Germany | 39500 | 80.44 |
Iran | 12800 | 70.89 |
Thailand | 9900 | 74.18 |
199 more elided |
We can also graph the data to see if there is any visible correlation between our two variables of interest, GDP per capita and life expectancy.
Unsurprisingly, it seems that living in an extremely poor country has negative consequences for life expectancy. On the other hand, it looks like things start to flatten out once GDP/capita goes above $10-15k/year. Some other interesting patterns also emerge: Singapore and Japan have similar life expectancies, despite the former's GDP being twice the latter's, and Qatar, the richest nation in the dataset by GDP/capita, has about the same average life expectancy as the Dominican Republic.
Now, let's try the same experiment using information from a spreadsheet containing the GINI coefficient — a widely used measure of income inequality — for each country. We're going to use a library called Docjure that provides access to Microsoft Office file formats.
Docjure's API is a bit low-level and doesn't make the obvious tasks easy, so we're going to use this helper function to make the code below clearer. Check out the line-by-line comments to see how this function works.
Now we're going to use a few lines of code to:
- use clojure.set's join function to combine our freshly loaded GINI spreadsheet with our previously prepared life expectancy data, which works because they are both sequences of maps that have a :country key.
- set the :gini key in each map to the World Bank's number, but falling back to the CIA's estimate. (These kinds of small programmatic tasks are a constant feature of data wrangling.)

Expanding the Clojure data structures makes it look like this will work for our comparisons. Let's plot the data to see a list of countries from most to least equal:
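As a minimal sketch of the join-and-fallback step described above, here are two rows per dataset, with values copied from the table further down. The var names life-expectancy, gini-sheet, and joined are chosen for illustration and are not necessarily the notebook's own.

```clojure
(require '[clojure.set :as set])

(def life-expectancy
  [{:country "Brazil"    :life-expectancy 73.28 :gdp 12100}
   {:country "Hong Kong" :life-expectancy 82.78 :gdp 52700}])

(def gini-sheet
  [{:country "Brazil"    :giniWB 53.4 :giniCIA 53.9}
   {:country "Hong Kong" :giniWB nil  :giniCIA 53.9}])

(def joined
  (->> (set/join life-expectancy gini-sheet) ; natural join on the shared :country key
       (map (fn [{:keys [giniWB giniCIA] :as row}]
              ;; prefer the World Bank figure, fall back to the CIA estimate
              (assoc row :gini (or giniWB giniCIA))))))
```

Note that set/join returns a set, so row order is not preserved; sort the result explicitly before plotting or tabulating it.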
And now to have a look at whether inequality and life expectancy are correlated:
It seems like most of the long-lived countries are also in the lower two thirds of the inequality distribution. A little filtering shows us that the only really long-lived country above a GINI coefficient of ~50 is Hong Kong.
:yearCIA | :pop2021 | :life-expectancy | :giniCIA | :gini | :gdp | :giniWB | :yearWB | :country |
---|---|---|---|---|---|---|---|---|
2014 | 60041.994 | 49.56 | 63 | 63 | 11500 | 63 | 2014 | South Africa |
2010 | 2015.494 | 49.87 | 50.7 | 50.7 | 1200 | 50.7 | 2010 | Guinea-Bissau |
2003 | 4919.981 | 51.35 | 43.6 | 56.2 | 700 | 56.2 | 2008 | Central African Republic |
2015 | 18920.651 | 51.83 | 57.1 | 57.1 | 1800 | 57.1 | 2015 | Zambia |
2015 | 2587.344 | 51.85 | 59.1 | 59.1 | 8200 | 59.1 | 2015 | Namibia |
2014 | 32163.047 | 52.6 | 54 | 54 | 1200 | 54 | 2014 | Mozambique |
2015 | 2397.241 | 54.06 | 53.3 | 53.3 | 16400 | 53.3 | 2015 | Botswana |
2018 | 33933.61 | 55.29 | 51.3 | 51.3 | 6300 | 51.3 | 2018 | Angola |
2017 | 15092.171 | 55.68 | 44.3 | 50.3 | 600 | 50.3 | 2019 | Zimbabwe |
2017 | 223.368 | 64.22 | 56.3 | 56.3 | 2200 | 56.3 | 2017 | Sao Tome and Principe |
nil | 404.914 | 68.49 | nil | 53.3 | 8800 | 53.3 | 1999 | Belize |
nil | 591.8 | 71.69 | nil | 57.9 | 12900 | 57.9 | 1999 | Suriname |
2018 | 213993.437 | 73.28 | 53.9 | 53.4 | 12100 | 53.4 | 2019 | Brazil |
2018 | 51265.844 | 75.25 | 50.4 | 51.3 | 11100 | 51.3 | 2019 | Colombia |
2016 | 184.4 | 77.41 | 51.2 | 51.2 | 13100 | 51.2 | 2016 | Saint Lucia |
2016 | 7552.81 | 82.78 | 53.9 | 53.9 | 52700 | nil | nil | Hong Kong |
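The filtering behind a table like the one above could be sketched as follows, assuming rows shaped like the joined maps. The South Africa and Hong Kong values come from the table; the "Placeholder" row is invented purely to show a row being filtered out.

```clojure
(def rows
  [{:country "Hong Kong"    :gini 53.9 :life-expectancy 82.78}
   {:country "Placeholder"  :gini 30.0 :life-expectancy 80.0} ; made-up row, not real data
   {:country "South Africa" :gini 63   :life-expectancy 49.56}])

(def very-unequal
  (->> rows
       ;; keep rows that have a GINI value and where it exceeds 50
       (filter (fn [{:keys [gini]}] (and gini (> gini 50))))
       ;; shortest-lived first, matching the ordering of the table
       (sort-by :life-expectancy)))
```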
Let's look at happiness! This time, we'll use next.jdbc to perform a SQL query on a SQLite database containing a table of countries and their relative happiness ratings. Note that we're changing the column name :country_or_region to :country using clojure.set's rename-keys function so that this table will be easy to join with our others.
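The key renaming can be illustrated without a database. In this sketch, query-result is a hand-written stand-in for the rows the SQL query returns (values taken from the happiness table below), so no JDBC call is shown.

```clojure
(require '[clojure.set :as set])

;; Stand-in for the query result: one map per row.
(def query-result
  [{:country_or_region "Finland" :rank 1 :score 7.769}
   {:country_or_region "Denmark" :rank 2 :score 7.6}])

(def happiness
  ;; rename-keys leaves all other keys untouched
  (map #(set/rename-keys % {:country_or_region :country}) query-result))
```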
Looking at the happiness data, it appears that all the usual suspects (Nordics, Western Europeans, Canadians, and Kiwis) are living pretty good lives by their own estimation. Looking closer, we see that although the top twenty countries are all relatively prosperous, GDP is not strongly correlated with happiness within that cohort.
:generosity | :social_support | :freedom | :rank | :score | :perception_of_corruption | :gdp | :country | :healthy_life_expectancy |
---|---|---|---|---|---|---|---|---|
0.153 | 1.587 | 0.596 | 1 | 7.769 | 0.393 | 1.34 | Finland | 0.986 |
0.252 | 1.573 | 0.592 | 2 | 7.6 | 0.41 | 1.383 | Denmark | 0.996 |
0.271 | 1.582 | 0.603 | 3 | 7.554 | 0.341 | 1.488 | Norway | 1.028 |
0.354 | 1.624 | 0.591 | 4 | 7.494 | 0.118 | 1.38 | Iceland | 1.026 |
0.322 | 1.522 | 0.557 | 5 | 7.488 | 0.298 | 1.396 | Netherlands | 0.999 |
0.263 | 1.526 | 0.572 | 6 | 7.48 | 0.343 | 1.452 | Switzerland | 1.052 |
0.267 | 1.487 | 0.574 | 7 | 7.343 | 0.373 | 1.387 | Sweden | 1.009 |
0.33 | 1.557 | 0.585 | 8 | 7.307 | 0.38 | 1.303 | New Zealand | 1.026 |
0.285 | 1.505 | 0.584 | 9 | 7.278 | 0.308 | 1.365 | Canada | 1.039 |
0.244 | 1.475 | 0.532 | 10 | 7.246 | 0.226 | 1.376 | Austria | 1.016 |
0.332 | 1.548 | 0.557 | 11 | 7.228 | 0.29 | 1.372 | Australia | 1.036 |
0.144 | 1.441 | 0.558 | 12 | 7.167 | 0.093 | 1.034 | Costa Rica | 0.963 |
0.261 | 1.455 | 0.371 | 13 | 7.139 | 0.082 | 1.276 | Israel | 1.029 |
0.194 | 1.479 | 0.526 | 14 | 7.09 | 0.316 | 1.609 | Luxembourg | 1.012 |
0.348 | 1.538 | 0.45 | 15 | 7.054 | 0.278 | 1.333 | United Kingdom | 0.996 |
0.298 | 1.553 | 0.516 | 16 | 7.021 | 0.31 | 1.499 | Ireland | 0.999 |
0.261 | 1.454 | 0.495 | 17 | 6.985 | 0.265 | 1.373 | Germany | 0.987 |
0.16 | 1.504 | 0.473 | 18 | 6.923 | 0.21 | 1.356 | Belgium | 0.986 |
0.28 | 1.457 | 0.454 | 19 | 6.892 | 0.128 | 1.433 | United States | 0.874 |
0.046 | 1.487 | 0.457 | 20 | 6.852 | 0.036 | 1.269 | Czech Republic | 0.92 |
136 more elided |
Next, we're computing a linear regression for this dataset using kixi.stats.
We'll use this linear regression to augment our dataset so each datapoint also gets a :regression value.
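To make the idea concrete without depending on any library, here is a hand-rolled ordinary least-squares fit in plain Clojure. It is not the kixi.stats API, and it assumes that :score is regressed on :gdp; the function names linear-fit and add-regression are hypothetical. The sample rows are taken from the happiness table above.

```clojure
(defn linear-fit
  "Ordinary least-squares fit of y = offset + slope * x.
  Assumes xs contains at least two distinct values."
  [xs ys]
  (let [n      (count xs)
        mean-x (/ (reduce + xs) n)
        mean-y (/ (reduce + ys) n)
        sxy    (reduce + (map (fn [x y] (* (- x mean-x) (- y mean-y))) xs ys))
        sxx    (reduce + (map (fn [x] (let [d (- x mean-x)] (* d d))) xs))
        slope  (/ sxy sxx)]
    {:slope slope :offset (- mean-y (* slope mean-x))}))

(defn add-regression
  "Adds a :regression key holding the fitted score for each row's :gdp."
  [rows]
  (let [{:keys [slope offset]} (linear-fit (map :gdp rows) (map :score rows))]
    (map (fn [row] (assoc row :regression (+ offset (* slope (:gdp row))))) rows)))

(def sample
  (add-regression
   [{:country "Finland"        :gdp 1.34  :score 7.769}
    {:country "Costa Rica"     :gdp 1.034 :score 7.167}
    {:country "Luxembourg"     :gdp 1.609 :score 7.09}
    {:country "Czech Republic" :gdp 1.269 :score 6.852}]))
```

Storing the fitted value on each row means the regression line can be drawn from the same sequence of maps as the scatter points.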
Let's graph the relationship between happiness and GDP to get a bird's-eye view of the situation over our entire dataset. You can mouse over individual data points to get more info:
It looks, as we might have expected, like richer countries are happier than poor ones in general, though with variations and outliers. For example, Finland is in first place but has a GDP/capita similar to that of number 58, Japan. Perhaps even more striking, Qatar has the highest GDP/capita in the dataset, but Qataris are on average about as happy as people in El Salvador. Likewise, Botswana has five times the GDP/capita of Malawi, but its people are no happier for it. If I were forced to guess why, I might theorize that a prosperous country with all of its wealth concentrated in very few hands can still be a fairly wretched place for the average person to live.
One way to investigate this possibility is to plot the correlation between equality and happiness in the rich world. We'll use join again, but we'll first use clojure.set's project (named by analogy to SQL projection) to pluck just the :country and :score from the happiness dataset, then sort by the GDP and take the top 20 countries.
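A small sketch of the project, join, sort, and take pipeline, using two countries whose values appear in the tables above. The var names are illustrative, and the world-data rows here omit the :gini key that the real joined dataset carries, because those GINI values are not shown in this notebook's output.

```clojure
(require '[clojure.set :as set])

(def happiness
  [{:country "United States" :score 6.892 :gdp 1.433 :rank 19}
   {:country "Germany"       :score 6.985 :gdp 1.373 :rank 17}])

(def world-data
  [{:country "Germany"       :gdp 39500 :life-expectancy 80.44}
   {:country "United States" :gdp 52800 :life-expectancy 79.56}])

(def richest
  (->> (set/join (set/project happiness [:country :score]) world-data)
       (sort-by :gdp >)   ; highest GDP per capita first
       (take 20)))
```

Projecting first matters for more than tidiness: set/join performs a natural join on every key the two relations share, and both datasets have a :gdp key measured on different scales. Without the projection the join would also try to match on :gdp and find no rows.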
This does, at least at first glance, support the notion that the happiest people — just like the longest lived ones — tend to inhabit countries in the more equal part of the GINI distribution.
I hope this example gives you some ideas about things you'd like to investigate.