One of the challenges in real data science is getting data from different sources in many different formats. In this notebook, we will explore some facts about the world using data taken from a TSV file, an Excel spreadsheet, and a database query.
Life expectancy
First, we'll read in a TSV file containing the most recent CIA World Factbook data using the meta-csv library.
(def cia-factbook
  (csv/read-csv "./datasets/cia-factbook.tsv"))
({"Airp/cap" "0.28693821", "Airports" "389000000", "Birthrate" "12.17", "Cell phones" "1100000000", "Cell/cap" "0.81139339",
  "Country" "China", "Education($%GDP)" nil, "Exp/cap" "1630.1631", "Exports" "2.21e+12", "GDP/cap" "9800", 13 more elided}
 {"Airp/cap" "0.04961238", "Airports" "61338000", "Birthrate" "19.89", "Cell phones" "893862000", "Cell/cap" "0.72298773",
  "Country" "India", "Education($%GDP)" "3.2", "Exp/cap" "253.32742", "Exports" "313200000000", "GDP/cap" "4000", 13 more elided}
 {"Airp/cap" nil, "Airports" nil, "Birthrate" nil, "Cell phones" nil, "Cell/cap" nil,
  "Country" "European Union", "Education($%GDP)" nil, "Exp/cap" "4248.8308", "Exports" "2.173e+12", "GDP/cap" "34500", 13 more elided}
 {"Airp/cap" "0.76828494", "Airports" "245000000", "Birthrate" "13.42", "Cell phones" "310000000", "Cell/cap" "0.97211564",
  "Country" "United States", "Education($%GDP)" "5.4", "Exp/cap" "4938.9746", "Exports" "1.575e+12", "GDP/cap" "52800", 13 more elided}
 {"Airp/cap" "0.07886135", "Airports" "20000000", "Birthrate" "17.04", "Cell phones" "281960000", "Cell/cap" "1.1117874",
  "Country" "Indonesia", "Education($%GDP)" "2.8", "Exp/cap" "705.41482", "Exports" "178900000000", "GDP/cap" "5200", 13 more elided}
 {"Airp/cap" "0.37492946", "Airports" "75982000", "Birthrate" "14.72", "Cell phones" "248324000", "Cell/cap" "1.2253426",
  "Country" "Brazil", "Education($%GDP)" "5.8", "Exp/cap" "1207.9536", "Exports" "244800000000", "GDP/cap" "12100", 13 more elided}
 {"Airp/cap" "0.10414714", "Airports" "20431000", "Birthrate" "23.19", "Cell phones" "125000000", "Cell/cap" "0.6371882",
  "Country" "Pakistan", "Education($%GDP)" "2.1", "Exp/cap" "127.69252", "Exports" "25050000000", "GDP/cap" "3100", 13 more elided}
 {23 more elided} ... {23 more elided}
 217 more elided)
Expanding the results in the data viewer tells us that there are some nil values in columns of interest, and that our TSV importer was thrown off by this fact and so didn't convert the numerical columns to number types.
We're going to post-process this table a bit with ordinary Clojure sequence functions: filter out rows that have nils in our columns of interest, select just those columns, convert strings to numbers, and (because we're Clojurists) convert keys to keywords.
(def life-expectancy
  (->> cia-factbook
       (remove #(some nil? (map (partial get %) ["Country" "GDP/cap" "Life expectancy"])))
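The four post-processing steps described above can be sketched on a tiny in-memory sample. This is a minimal illustration rather than the notebook's own code: `sample-rows`, `columns`, `parse-number` and `clean-rows` are hypothetical names, the "Examplestan" row is invented to show the nil filter, and parsing with `Double/parseDouble` is an assumption (it yields doubles such as 9800.0, whereas the table further down shows 9800).

```clojure
(require '[clojure.set :as set])

;; Hypothetical rows shaped like the importer's output: string keys,
;; string values, and nil where a cell was empty.
(def sample-rows
  [{"Country" "China" "GDP/cap" "9800" "Life expectancy" "75.15" "Birthrate" "12.17"}
   {"Country" "Examplestan" "GDP/cap" nil "Life expectancy" "70.0" "Birthrate" nil}])

(def columns ["Country" "GDP/cap" "Life expectancy"])

(defn parse-number
  "Parse a numeric string into a double (assumed parsing strategy)."
  [s]
  (Double/parseDouble s))

(defn clean-rows
  "Drop rows with a nil in any column of interest, keep only those columns,
  parse the numeric strings, and rename the string keys to keywords."
  [rows]
  (->> rows
       (remove (fn [row] (some nil? (map #(get row %) columns))))
       (map (fn [row]
              (-> (select-keys row columns)
                  (update "GDP/cap" parse-number)
                  (update "Life expectancy" parse-number)
                  (set/rename-keys {"Country" :country
                                    "GDP/cap" :gdp
                                    "Life expectancy" :life-expectancy}))))))

(clean-rows sample-rows)
```

Only the China row survives, because the invented row has a nil in "GDP/cap".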
Things look pretty good in the data structure browser, but it would be easier to get an overview in tabular form. Luckily, Clerk's built-in table viewer is able to infer how to handle all of the most common configurations of rows and columns automatically.
(clerk/table life-expectancy)
:country        :gdp   :life-expectancy
China           9800   75.15
India           4000   67.8
European Union  34500  80.02
United States   52800  79.56
Indonesia       5200   72.17
Brazil          12100  73.28
Pakistan        3100   67.05
Nigeria         2800   52.62
Bangladesh      2100   70.65
Russia          18100  70.16
Japan           37100  84.46
Mexico          15600  75.43
Philippines     4700   72.48
Ethiopia        1300   60.75
Vietnam         4000   72.91
Egypt           6600   73.45
Turkey          15300  73.29
Germany         39500  80.44
Iran            12800  70.89
Thailand        9900   74.18
199 more elided
We can also graph the data to see if there is any visible correlation between our two variables of interest, GDP per capita and life expectancy.
(clerk/vl
 {:data {:values life-expectancy}
  :width 700
  :height 500
  :mark {:type "point"
         :tooltip {:field :country}} ; rows are keyed by :country after the rename above
  :encoding {:x {:field :gdp
                 :type :quantitative}
             :y {:field :life-expectancy
                 :type :quantitative}}})
Unsurprisingly, it seems that living in an extremely poor country has negative consequences for life expectancy. On the other hand, it looks like things start to flatten out once GDP/capita goes above $10-15k/year. Some other interesting patterns also emerge: Singapore and Japan have similar life expectancies, despite the former's GDP/capita being twice the latter's, and Qatar, the richest nation in the dataset by GDP/capita, has a similar average life expectancy to the Dominican Republic.
Inequality
Now, let's try the same experiment using information from a spreadsheet containing the GINI coefficient — a widely used measure of income inequality — for each country. We're going to use a library called Docjure that provides access to Microsoft Office file formats.
Docjure's API is a bit low-level and doesn't make the obvious tasks easy, so we're going to use this helper function to make the code below clearer. Check out the line-by-line comments to see how this function works.
(defn load-first-sheet
  "Return the first sheet of an Excel spreadsheet as a seq of maps."
  [filename]
  (let [rows (->> (ss/load-workbook filename) ; load the file
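The "seq of maps" half of such a helper can be sketched without touching Excel at all. The sketch below assumes the sheet has already been read into a sequence of row vectors whose first row is the header. `header->keyword` and `rows->maps` are hypothetical names, the sample header and values are invented, and the real helper reads its cells through Docjure rather than from literals.

```clojure
(require '[clojure.string :as str])

(defn header->keyword
  "Turn a header cell such as \"GINI index\" into a keyword like :gini-index."
  [s]
  (-> s
      str/trim
      str/lower-case
      (str/replace #"\s+" "-")
      keyword))

(defn rows->maps
  "Treat the first row as the header and zip every later row into a map."
  [[header & rows]]
  (let [ks (map header->keyword header)]
    (map #(zipmap ks %) rows)))

;; Invented rows standing in for the cell values of a sheet.
(rows->maps [["Country" "GINI index"]
             ["A" 30.5]
             ["B" 41.0]])
```

Keeping this step separate from the file reading makes it easy to check at the REPL without a spreadsheet on disk.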
Use clojure.set's join function to combine our freshly loaded GINI spreadsheet with our previously prepared life expectancy data, which works because they are both sequences of maps that have a :country key.
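As a small illustration of that natural join, here is `clojure.set/join` applied to two invented relations that share only a `:country` key. The names `life-rows` and `gini-rows` and all values are hypothetical.

```clojure
(require '[clojure.set :as set])

;; Invented sample relations; only :country is common to both.
(def life-rows
  [{:country "A" :life-expectancy 75.0}
   {:country "B" :life-expectancy 68.0}])

(def gini-rows
  [{:country "A" :gini 42.0}
   {:country "C" :gini 35.0}])

;; join matches rows on every key the two relations have in common,
;; so only country "A" appears in the result, which is a set of merged maps.
(set/join life-rows gini-rows)
```

Countries present in only one of the two inputs ("B" and "C" here) are silently dropped, which is worth remembering when the joined row count is smaller than expected.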
Assoc a :gini key in each map, set to the World Bank's number but falling back to the CIA's estimate when the former is missing. (These kinds of small programmatic tasks are a constant feature of data wrangling.)
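A fallback like this is a natural fit for `or`. In the sketch below the key names `:world-bank-gini` and `:cia-gini` are assumptions, since the source does not show the actual column names, and `with-gini` is a hypothetical helper.

```clojure
(defn with-gini
  "Add a :gini key, preferring the World Bank figure and falling back
  to the CIA estimate when the former is nil (key names are assumed)."
  [row]
  (assoc row :gini (or (:world-bank-gini row) (:cia-gini row))))

(map with-gini
     [{:country "A" :world-bank-gini 33.1 :cia-gini 35.0}
      {:country "B" :world-bank-gini nil  :cia-gini 47.2}])
```

Rows where both sources are nil end up with `:gini nil`, so they may still need filtering before plotting.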
(... :gdp 43300, 7 more elided} {9 more elided} ... {9 more elided} 143 more elided)
Expanding the Clojure data structures makes it look like this will work for our comparisons. Let's plot the data to see a list of countries from most to least equal:
(clerk/vl
 {:data {:values expectancy-and-gini}
  :width 600
  :height 1600
  :mark {:type "point"
         :tooltip {:field :country}}
  :encoding {:x {:field :gini
                 :type :quantitative}
             :y {:field :country
                 :type :nominal
                 :sort "x"}}})
And now to have a look at whether inequality and life expectancy are correlated:
(clerk/vl
 {:data {:values expectancy-and-gini}
  :mark "rect"
  :width 700
  :height 500
  :encoding {:x {:bin {:maxbins 25}
                 :field :life-expectancy
                 :type "quantitative"}
             :y {:bin {:maxbins 25}
                 :field :gini
                 :type "quantitative"}
             :color {:aggregate "count" :type "quantitative"}}
  :config {:view {:stroke "transparent"}}})
It seems like the mass of long-lived countries are also in the lower two thirds of the inequality distribution. A little filtering shows us that the only really long-lived country above a GINI coefficient of ~50 is Hong Kong.
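The kind of filtering mentioned here might look like the following sketch. The rows are invented placeholders rather than values from the dataset, `long-lived-and-unequal` is a hypothetical helper, and the life-expectancy threshold of 80 years is an assumption about what "really long-lived" means.

```clojure
;; Invented placeholder rows with the same keys as the joined dataset.
(def sample
  [{:country "A" :gini 55.0 :life-expectancy 82.0}
   {:country "B" :gini 60.0 :life-expectancy 60.0}
   {:country "C" :gini 30.0 :life-expectancy 83.0}])

(defn long-lived-and-unequal
  "Keep rows whose GINI coefficient and life expectancy both exceed the
  given thresholds."
  [rows min-gini min-life]
  (filter #(and (> (:gini %) min-gini)
                (> (:life-expectancy %) min-life))
          rows))

(map :country (long-lived-and-unequal sample 50 80))
```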
Let's look at happiness! This time, we'll use next.jdbc to perform a SQL query on a SQLite database containing a table of countries and their relative happiness ratings. Note that we're changing the column name :country_or_region to :country using clojure.set's rename-keys function so that this table will be easy to join with our others.
(def world-happiness
  (let [_run-at #inst "2021-11-26T08:28:29.445-00:00" ; bump this to re-run the query!
(... :freedom 0.585, 7 more elided} {9 more elided} ... {9 more elided} 136 more elided)
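The renaming step on its own can be sketched as follows. `query-rows` is a hypothetical stand-in for the rows a query would return (reduced here to two keys, with Finland's score taken from the table below), so no database is needed to try it.

```clojure
(require '[clojure.set :as set])

;; Stand-in for query results; a real row has more columns.
(def query-rows
  [{:country_or_region "Finland" :score 7.769}])

;; rename-keys leaves every other key untouched and ignores rows that
;; lack the old key.
(map #(set/rename-keys % {:country_or_region :country}) query-rows)
```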
Looking at the happiness data, it appears that all the usual suspects (Nordics, Western Europeans, Canadians, and Kiwis) are living pretty good lives by their own estimation. Looking closer, we see that although the top twenty countries are all relatively prosperous, it's clear that GDP is not strongly correlated with happiness within that cohort.
(clerk/table world-happiness)
:generosity  :social_support  :freedom  :rank  :score  :perception_of_corruption  :gdp   :country        :healthy_life_expectancy
0.153        1.587            0.596     1      7.769   0.393                      1.34   Finland         0.986
0.252        1.573            0.592     2      7.6     0.41                       1.383  Denmark         0.996
0.271        1.582            0.603     3      7.554   0.341                      1.488  Norway          1.028
0.354        1.624            0.591     4      7.494   0.118                      1.38   Iceland         1.026
0.322        1.522            0.557     5      7.488   0.298                      1.396  Netherlands     0.999
0.263        1.526            0.572     6      7.48    0.343                      1.452  Switzerland     1.052
0.267        1.487            0.574     7      7.343   0.373                      1.387  Sweden          1.009
0.33         1.557            0.585     8      7.307   0.38                       1.303  New Zealand     1.026
0.285        1.505            0.584     9      7.278   0.308                      1.365  Canada          1.039
0.244        1.475            0.532     10     7.246   0.226                      1.376  Austria         1.016
0.332        1.548            0.557     11     7.228   0.29                       1.372  Australia       1.036
0.144        1.441            0.558     12     7.167   0.093                      1.034  Costa Rica      0.963
0.261        1.455            0.371     13     7.139   0.082                      1.276  Israel          1.029
0.194        1.479            0.526     14     7.09    0.316                      1.609  Luxembourg      1.012
0.348        1.538            0.45      15     7.054   0.278                      1.333  United Kingdom  0.996
0.298        1.553            0.516     16     7.021   0.31                       1.499  Ireland         0.999
0.261        1.454            0.495     17     6.985   0.265                      1.373  Germany         0.987
0.16         1.504            0.473     18     6.923   0.21                       1.356  Belgium         0.986
0.28         1.457            0.454     19     6.892   0.128                      1.433  United States   0.874
0.046        1.487            0.457     20     6.852   0.036                      1.269  Czech Republic  0.92
136 more elided
Next, we're computing a linear regression for this dataset using kixi.stats.
[... :freedom 0.574, :gdp 1.387, :generosity 0.267, :healthy_life_expectancy 1.009, :perception_of_corruption 0.373, :rank 7, :regression 1.455202442956292, :score 7.343, :social_support 1.487} {10 more elided} ... {10 more elided} 136 more elided]
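To make the regression step concrete, here is a hand-rolled ordinary least squares fit in plain Clojure. It is a stand-in for illustration, not kixi.stats's API: `mean`, `linear-fit`, `predict` and `add-regression` are hypothetical names, and which column plays x and which plays y in the notebook is not shown in the source, so both are left as parameters.

```clojure
(defn mean [xs]
  (/ (reduce + xs) (double (count xs))))

(defn linear-fit
  "Ordinary least squares for y = intercept + slope * x over [x y] pairs."
  [points]
  (let [xs    (map first points)
        ys    (map second points)
        mx    (mean xs)
        my    (mean ys)
        sxy   (reduce + (map (fn [x y] (* (- x mx) (- y my))) xs ys))
        sxx   (reduce + (map (fn [x] (let [d (- x mx)] (* d d))) xs))
        slope (/ sxy sxx)]
    {:slope slope :intercept (- my (* slope mx))}))

(defn predict
  "Evaluate a fitted line at x."
  [{:keys [slope intercept]} x]
  (+ intercept (* slope x)))

(defn add-regression
  "Fit y-key against x-key over rows, then store the fitted value for each
  row under :regression."
  [rows x-key y-key]
  (let [fit (linear-fit (map (juxt x-key y-key) rows))]
    (map #(assoc % :regression (predict fit (x-key %))) rows)))

;; Points lying exactly on y = 2x + 1 recover slope 2.0 and intercept 1.0.
(linear-fit [[0 1] [1 3] [2 5]])
```

Storing the fitted value next to the raw columns, as the `:regression` key in the output above does, lets a chart draw the points and the regression line from the same sequence of maps.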
Let's graph the relationship between happiness and GDP to get a bird's-eye view of the situation across our entire dataset. You can mouse over individual data points to get more info:
It looks, as we might have expected, like richer countries are generally happier than poor ones, though with variations and outliers. For example, Finland is in first place but has a similar GDP/capita to number 58, Japan. Perhaps even more striking, Qatar has the highest GDP/capita in the dataset, but Qataris are on average about as happy as people in El Salvador. Likewise, Botswana has five times the GDP/capita of Malawi, but its people are no happier for it. If I were forced to guess why, I might theorize that a prosperous country with all of its wealth concentrated in very few hands can still be a fairly wretched place for the average person to live.
One way to investigate this possibility is to plot the correlation between equality and happiness in the rich world. We'll use join again, but we'll first use clojure.set's project (named by analogy to SQL projection) to pluck just the :country and :score from the happiness dataset, then sort by the GDP and take the top 20 countries.
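A sketch of that pipeline on invented data follows. `happiness-rows`, `gini-rows` and `richest-with-happiness` are hypothetical names and all values are made up. Note that both inputs carry a `:gdp` key with different meanings, so projecting the happiness rows down to `:country` and `:score` also keeps the natural join from trying to match on `:gdp`; whether that is the author's reason for using project is an assumption.

```clojure
(require '[clojure.set :as set])

;; Invented rows: :gdp is an index value here ...
(def happiness-rows
  [{:country "A" :score 7.5 :gdp 1.4}
   {:country "B" :score 5.0 :gdp 0.9}
   {:country "C" :score 6.0 :gdp 1.2}])

;; ... and GDP per capita here.
(def gini-rows
  [{:country "A" :gdp 50000 :gini 27.0}
   {:country "B" :gdp 8000  :gini 45.0}
   {:country "C" :gdp 30000 :gini 33.0}])

(defn richest-with-happiness
  "Join happiness scores onto the GINI rows by :country, then keep the
  n rows with the highest :gdp."
  [n happiness ginis]
  (->> (set/join (set/project happiness [:country :score]) ginis)
       (sort-by :gdp >)
       (take n)))

(map :country (richest-with-happiness 2 happiness-rows gini-rows))
```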
This does, at least at first glance, support the notion that the happiest people, just like the longest-lived ones, tend to inhabit countries in the more equal part of the GINI distribution.
I hope this example gives you some ideas about things you'd like to investigate.