A small data science example 🔢

Exploring the world in data

One of the challenges in real data science is getting data from different sources in many different formats. In this notebook, we will explore some facts about the world using data taken from a TSV file, an Excel spreadsheet, and a database query.

Life expectancy

First, we'll read in a TSV file containing the most recent CIA World Factbook data using the meta-csv library.

(def cia-factbook
  (csv/read-csv "./datasets/cia-factbook.tsv"))
({"Airp/cap" "0.28693821" "Airports" "389000000" "Birthrate" "12.17" "Cell phones" "1100000000" "Cell/cap" "0.81139339" "Country" "China" "Education($%GDP)" nil "Exp/cap" "1630.1631" "Exports" "2.21e+12" "GDP/cap" "9800" 13 more elided}
 {"Airp/cap" "0.04961238" "Airports" "61338000" "Birthrate" "19.89" "Cell phones" "893862000" "Cell/cap" "0.72298773" "Country" "India" "Education($%GDP)" "3.2" "Exp/cap" "253.32742" "Exports" "313200000000" "GDP/cap" "4000" 13 more elided}
 {"Airp/cap" nil "Airports" nil "Birthrate" nil "Cell phones" nil "Cell/cap" nil "Country" "European Union" "Education($%GDP)" nil "Exp/cap" "4248.8308" "Exports" "2.173e+12" "GDP/cap" "34500" 13 more elided}
 {"Airp/cap" "0.76828494" "Airports" "245000000" "Birthrate" "13.42" "Cell phones" "310000000" "Cell/cap" "0.97211564" "Country" "United States" "Education($%GDP)" "5.4" "Exp/cap" "4938.9746" "Exports" "1.575e+12" "GDP/cap" "52800" 13 more elided}
 {"Airp/cap" "0.07886135" "Airports" "20000000" "Birthrate" "17.04" "Cell phones" "281960000" "Cell/cap" "1.1117874" "Country" "Indonesia" "Education($%GDP)" "2.8" "Exp/cap" "705.41482" "Exports" "178900000000" "GDP/cap" "5200" 13 more elided}
 {"Airp/cap" "0.37492946" "Airports" "75982000" "Birthrate" "14.72" "Cell phones" "248324000" "Cell/cap" "1.2253426" "Country" "Brazil" "Education($%GDP)" "5.8" "Exp/cap" "1207.9536" "Exports" "244800000000" "GDP/cap" "12100" 13 more elided}
 {"Airp/cap" "0.10414714" "Airports" "20431000" "Birthrate" "23.19" "Cell phones" "125000000" "Cell/cap" "0.6371882" "Country" "Pakistan" "Education($%GDP)" "2.1" "Exp/cap" "127.69252" "Exports" "25050000000" "GDP/cap" "3100" 13 more elided}
 {23 more elided} {23 more elided} {23 more elided} {23 more elided} {23 more elided} {23 more elided} {23 more elided} {23 more elided} {23 more elided} {23 more elided} {23 more elided} {23 more elided} {23 more elided} 217 more elided)

Expanding the results in the data viewer tells us that there are some nil values in columns of interest, and that these threw off our TSV importer, which left the numerical columns as strings instead of converting them to numbers.

We're going to post-process this table a bit with ordinary Clojure sequence functions: filter out rows that have nils in our columns of interest, keep just those columns, convert strings to numbers, and (because we're Clojurists) convert keys to keywords.

(def life-expectancy
  (->> cia-factbook
       (remove #(some nil? (map (partial get %) ["Country" "GDP/cap" "Life expectancy"])))
       (map #(sorted-map :country (str/trim (get % "Country"))
                         :gdp (read-string (get % "GDP/cap"))
                         :life-expectancy (read-string (get % "Life expectancy"))))))
({:country "China" :gdp 9800 :life-expectancy 75.15}
 {:country "India" :gdp 4000 :life-expectancy 67.8}
 {:country "European Union" :gdp 34500 :life-expectancy 80.02}
 {:country "United States" :gdp 52800 :life-expectancy 79.56}
 {:country "Indonesia" :gdp 5200 :life-expectancy 72.17}
 {:country "Brazil" :gdp 12100 :life-expectancy 73.28}
 {:country "Pakistan" :gdp 3100 :life-expectancy 67.05}
 {:country "Nigeria" :gdp 2800 :life-expectancy 52.62}
 {:country "Bangladesh" :gdp 2100 :life-expectancy 70.65}
 {:country "Russia" :gdp 18100 :life-expectancy 70.16}
 {:country "Japan" :gdp 37100 :life-expectancy 84.46}
 {:country "Mexico" :gdp 15600 :life-expectancy 75.43}
 {:country "Philippines" :gdp 4700 :life-expectancy 72.48}
 {:country "Ethiopia" :gdp 1300 :life-expectancy 60.75}
 {:country "Vietnam" :gdp 4000 :life-expectancy 72.91}
 {:country "Egypt" :gdp 6600 :life-expectancy 73.45}
 {:country "Turkey" :gdp 15300 :life-expectancy 73.29}
 {:country "Germany" :gdp 39500 :life-expectancy 80.44}
 {:country "Iran" :gdp 12800 :life-expectancy 70.89}
 {:country "Thailand" :gdp 9900 :life-expectancy 74.18}
 199 more elided)
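
The same clean-up can be tried without the TSV file. The following is a minimal sketch on two made-up rows (the countries and numbers are invented, and the helper names complete-row? and clean-row are ours, not part of the notebook). It also swaps read-string for Java's Long/parseLong and Double/parseDouble, which only ever parse numbers; that is a substitution for illustration, not what the code above does.

```clojure
(require '[clojure.string :as str])

;; Made-up rows in the shape the TSV reader produced:
;; string keys, string values, nil for missing cells.
(def sample-rows
  [{"Country" " Atlantis" "GDP/cap" "1000" "Life expectancy" "70.5"}
   {"Country" " Lilliput" "GDP/cap" nil    "Life expectancy" "65.0"}])

(defn complete-row?
  "True when none of the given columns is nil in row."
  [columns row]
  (not-any? nil? (map #(get row %) columns)))

(defn clean-row
  "Trim the country name, parse the numeric strings, use keyword keys."
  [row]
  {:country         (str/trim (get row "Country"))
   :gdp             (Long/parseLong (get row "GDP/cap"))
   :life-expectancy (Double/parseDouble (get row "Life expectancy"))})

(def cleaned
  (->> sample-rows
       (filter #(complete-row? ["Country" "GDP/cap" "Life expectancy"] %))
       (mapv clean-row)))
```

The second row is dropped because its "GDP/cap" cell is nil, so only the complete row reaches the parsing step.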

Things look pretty good in the data structure browser, but it would be easier to get an overview in tabular form. Luckily, Clerk's built-in table viewer is able to infer how to handle all of the most common configurations of rows and columns automatically.

(clerk/table life-expectancy)
:country        :gdp   :life-expectancy
China           9800   75.15
India           4000   67.8
European Union  34500  80.02
United States   52800  79.56
Indonesia       5200   72.17
Brazil          12100  73.28
Pakistan        3100   67.05
Nigeria         2800   52.62
Bangladesh      2100   70.65
Russia          18100  70.16
Japan           37100  84.46
Mexico          15600  75.43
Philippines     4700   72.48
Ethiopia        1300   60.75
Vietnam         4000   72.91
Egypt           6600   73.45
Turkey          15300  73.29
Germany         39500  80.44
Iran            12800  70.89
Thailand        9900   74.18
199 more elided
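
For reference, the "common configurations" are, as far as we understand Clerk's documentation, a seq of maps (used above), a map of seqs, and a seq of seqs. The sketch below only builds those shapes in plain Clojure from two rows of the table and does not call Clerk itself; the var names are ours.

```clojure
;; Seq of maps: the shape used above (values copied from the table).
(def rows
  [{:country "China" :gdp 9800 :life-expectancy 75.15}
   {:country "India" :gdp 4000 :life-expectancy 67.8}])

(def columns [:country :gdp :life-expectancy])

;; Map of seqs: one vector of values per column.
(def map-of-seqs
  (into {} (map (fn [k] [k (mapv k rows)])) columns))

;; Seq of seqs: a plain grid, with the header kept separately in `columns`.
(def seq-of-seqs
  (mapv (apply juxt columns) rows))
```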

We can also graph the data to see if there is any visible correlation between our two variables of interest, GDP per capita and life expectancy.

(clerk/vl
 {:data {:values life-expectancy}
  :width 700
  :height 500
  :mark {:type "point"
         ;; the keys were converted to keywords above, so the field is :country
         :tooltip {:field :country}}
  :encoding {:x {:field :gdp
                 :type :quantitative}
             :y {:field :life-expectancy
                 :type :quantitative}}})

Unsurprisingly, it seems that living in an extremely poor country has negative consequences for life expectancy. On the other hand, it looks like things start to flatten out once GDP/capita goes above $10-15k/year. Some other interesting patterns also emerge: Singapore and Japan have similar life expectancies, despite the former's GDP/capita being twice the latter's, and Qatar, the richest nation in the dataset by GDP/capita, has an average life expectancy similar to that of the Dominican Republic.
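
One way to put a rough number on "flattening out" is to compare the Pearson correlation of life expectancy with raw GDP/capita against its correlation with the logarithm of GDP/capita. This is a sketch in plain Clojure on five rows copied from the output above, not a result for the full dataset; mean and pearson are helper names we made up.

```clojure
(defn mean [xs] (/ (reduce + xs) (double (count xs))))

(defn pearson
  "Pearson correlation coefficient of two equally long numeric seqs."
  [xs ys]
  (let [mx  (mean xs)
        my  (mean ys)
        dx  (map #(- % mx) xs)
        dy  (map #(- % my) ys)
        cov (reduce + (map * dx dy))
        sx  (Math/sqrt (reduce + (map #(* % %) dx)))
        sy  (Math/sqrt (reduce + (map #(* % %) dy)))]
    (/ cov (* sx sy))))

;; Five rows copied from the life-expectancy output above.
(def sample
  [{:country "Nigeria"       :gdp 2800  :life-expectancy 52.62}
   {:country "India"         :gdp 4000  :life-expectancy 67.8}
   {:country "China"         :gdp 9800  :life-expectancy 75.15}
   {:country "Germany"       :gdp 39500 :life-expectancy 80.44}
   {:country "United States" :gdp 52800 :life-expectancy 79.56}])

;; Correlation with raw GDP/capita ...
(def r (pearson (map :gdp sample) (map :life-expectancy sample)))

;; ... and with log GDP/capita.
(def r-log (pearson (map #(Math/log (:gdp %)) sample)
                    (map :life-expectancy sample)))
```

On these five rows the log version gives the higher coefficient, which is what we would expect if each extra dollar matters less as countries get richer. Five hand-picked rows prove nothing, of course; replacing sample with life-expectancy runs the same comparison on everything.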

Inequality

Now, let's try the same experiment using information from a spreadsheet containing the Gini coefficient, a widely used measure of income inequality, for each country. We're going to use a library called Docjure that provides access to Microsoft Office file formats.

Docjure's API is a bit low-level and doesn't make the obvious tasks easy, so we're going to use this helper function to make the code below clearer. Check out the line-by-line comments to see how this function works.

(defn load-first-sheet
  "Return the first sheet of an Excel spreadsheet as a seq of maps."
  [filename]
  (let [rows (->> (ss/load-workbook filename) ; load the file
                  (ss/sheet-seq)              ; seq of sheets in the file
                  first                       ; take the first (only)
                  ss/row-seq                  ; get the rows from it
                  (mapv ss/cell-seq))         ; each row -> seq of cells
        ;; break off the headers to produce a seq of maps
        headers (mapv (comp keyword ss/read-cell) (first rows))]
    ;; map over the rows creating new maps with the headers as keys
    (mapv #(zipmap headers (map ss/read-cell %)) (rest rows))))
#object[data_science$load_first_sheet 0x1670afb1 "data_science$load_first_sheet@1670afb1"]
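
The header-splitting step at the end of that function does not depend on Docjure, so it can be tried on plain vectors. In this sketch the rows vector is hypothetical: it stands in for what the cells might look like after ss/read-cell, with two values borrowed from the Gini data used later in this notebook.

```clojure
;; Hypothetical cell values, standing in for the result of ss/read-cell.
(def rows
  [["country" "giniWB" "giniCIA"]
   ["Norway" 27.6 27.0]
   ["Hong Kong" nil 53.9]])

;; The first row holds the column names; turn them into keywords.
(def headers (mapv keyword (first rows)))

;; Every remaining row becomes a map from header to cell value.
(def sheet (mapv #(zipmap headers %) (rest rows)))
```

Empty cells simply come through as nil values, which is why the next step has to decide what to do with them.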

Now we're going to use a few lines of code to:

  1. Load the spreadsheet data.
  2. Use clojure.set's join function to combine our freshly loaded Gini spreadsheet with our previously prepared life expectancy data, which works because they are both sequences of maps that have a :country key.
  3. Assoc a :gini key in each map, preferring the World Bank's number and falling back to the CIA's estimate when it is missing. Rows with neither are dropped. (These kinds of small programmatic tasks are a constant feature of data wrangling.)

(def expectancy-and-gini
  (->> (load-first-sheet "datasets/countries-gini.xlsx")
       (join life-expectancy)
       ;; keep drops the nils that when-let returns for rows without any Gini value
       (keep #(when-let [gini (or (:giniWB %) (:giniCIA %))]
                (assoc % :gini gini)))))
({:country "Lithuania" :gdp 22600 :gini 35.7 :giniCIA 37.3 :giniWB 35.7 :life-expectancy 75.98 :pop2021 2689.862 :yearCIA 2017 :yearWB 2018}
 {:country "Turkey" :gdp 15300 :gini 41.9 :giniCIA 41.9 :giniWB 41.9 :life-expectancy 73.29 :pop2021 85042.738 :yearCIA 2018 :yearWB 2019}
 {:country "Sweden" :gdp 40900 :gini 30 :giniCIA 28.8 :giniWB 30 :life-expectancy 81.89 :pop2021 10160.169 :yearCIA 2017 :yearWB 2018}
 {:country "Norway" :gdp 55400 :gini 27.6 :giniCIA 27 :giniWB 27.6 :life-expectancy 81.6 :pop2021 5465.63 :yearCIA 2017 :yearWB 2018}
 {:country "Zimbabwe" :gdp 600 :gini 50.3 :giniCIA 44.3 :giniWB 50.3 :life-expectancy 55.68 :pop2021 15092.171 :yearCIA 2017 :yearWB 2019}
 {:country "Iran" :gdp 12800 :gini 42 :giniCIA 40.8 :giniWB 42 :life-expectancy 70.89 :pop2021 85028.759 :yearCIA 2017 :yearWB 2018}
 {:country "Yemen" :gdp 2500 :gini 36.7 :giniCIA 36.7 :giniWB 36.7 :life-expectancy 64.83 :pop2021 30490.64 :yearCIA 2014 :yearWB 2014}
 {:country "Netherlands" :gdp 43300 7 more elided}
 {9 more elided} {9 more elided} {9 more elided} {9 more elided} {9 more elided} {9 more elided} {9 more elided} {9 more elided} {9 more elided} {9 more elided} {9 more elided} {9 more elided} 143 more elided)
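
To see what the join and the fallback do in isolation, here is a small sketch on inline data. Norway and Hong Kong use values that appear in this notebook; "Atlantis" and "Lilliput" are made up to show rows that find no partner. The sample var names are ours.

```clojure
(require '[clojure.set :as set])

(def life-expectancy-sample
  [{:country "Norway"    :life-expectancy 81.6}
   {:country "Hong Kong" :life-expectancy 82.78}
   {:country "Atlantis"  :life-expectancy 70.0}])   ; made up, has no Gini row

(def gini-sample
  [{:country "Norway"    :giniWB 27.6 :giniCIA 27.0}
   {:country "Hong Kong" :giniWB nil  :giniCIA 53.9}
   {:country "Lilliput"  :giniWB nil  :giniCIA nil}]) ; made up, no life expectancy

(def joined
  (->> (set/join gini-sample life-expectancy-sample) ; natural join on :country
       (keep #(when-let [gini (or (:giniWB %) (:giniCIA %))]
                (assoc % :gini gini)))               ; World Bank first, CIA second
       (sort-by :country)))                          ; join returns a set, so order it
```

Only countries present in both inputs survive the join. Hong Kong has no World Bank figure, so its :gini comes from the CIA column.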

Expanding the Clojure data structures makes it look like this will work for our comparisons. Let's plot the data to see a list of countries from most to least equal:

(clerk/vl
 {:data {:values expectancy-and-gini}
  :width 600
  :height 1600
  :mark {:type "point"
         :tooltip {:field :country}}
  :encoding {:x {:field :gini
                 :type :quantitative}
             :y {:field :country
                 :type :nominal
                 :sort "x"}}})

And now to have a look at whether inequality and life expectancy are correlated:

(clerk/vl
 {:data {:values expectancy-and-gini}
  :mark "rect"
  :width 700
  :height 500
  :encoding {:x {:bin {:maxbins 25}
                 :field :life-expectancy
                 :type "quantitative"}
             :y {:bin {:maxbins 25}
                 :field :gini
                 :type "quantitative"}
             :color {:aggregate "count" :type "quantitative"}}
  :config {:view {:stroke "transparent"}}})

It seems like most of the long-lived countries are also in the lower two thirds of the inequality distribution. A little filtering shows us that the only really long-lived country with a Gini coefficient above ~50 is Hong Kong.

(clerk/table
 (->> (filter #(< 50 (:gini %)) expectancy-and-gini)
      (sort-by :life-expectancy)))
:yearCIA  :pop2021    :life-expectancy  :giniCIA  :gini  :gdp   :giniWB  :yearWB  :country
2014      60041.994   49.56             63        63     11500  63       2014     South Africa
2010      2015.494    49.87             50.7      50.7   1200   50.7     2010     Guinea-Bissau
2003      4919.981    51.35             43.6      56.2   700    56.2     2008     Central African Republic
2015      18920.651   51.83             57.1      57.1   1800   57.1     2015     Zambia
2015      2587.344    51.85             59.1      59.1   8200   59.1     2015     Namibia
2014      32163.047   52.6              54        54     1200   54       2014     Mozambique
2015      2397.241    54.06             53.3      53.3   16400  53.3     2015     Botswana
2018      33933.61    55.29             51.3      51.3   6300   51.3     2018     Angola
2017      15092.171   55.68             44.3      50.3   600    50.3     2019     Zimbabwe
2017      223.368     64.22             56.3      56.3   2200   56.3     2017     Sao Tome and Principe
nil       404.914     68.49             nil       53.3   8800   53.3     1999     Belize
nil       591.8       71.69             nil       57.9   12900  57.9     1999     Suriname
2018      213993.437  73.28             53.9      53.4   12100  53.4     2019     Brazil
2018      51265.844   75.25             50.4      51.3   11100  51.3     2019     Colombia
2016      184.4       77.41             51.2      51.2   13100  51.2     2016     Saint Lucia
2016      7552.81     82.78             53.9      53.9   52700  nil      nil      Hong Kong

Happiness

Let's look at happiness! This time, we'll use next.jdbc to perform a SQL query on a SQLite database containing a table of countries and their relative happiness ratings. Note that we're changing the column name :country_or_region to :country using clojure.set's rename-keys function so that this table will be easy to join with our others.

(def world-happiness
  (let [_run-at #inst "2021-11-26T08:28:29.445-00:00" ; bump this to re-run the query!
        ds (jdbc/get-datasource {:dbtype "sqlite" :dbname "./datasets/happiness.db"})]
    (->> (with-open [conn (jdbc/get-connection ds)]
           (jdbc/execute! conn ["SELECT * FROM happiness"]
                          {:return-keys true :builder-fn rs/as-unqualified-lower-maps}))
         (map #(rename-keys % {:country_or_region :country})))))
({:country "Finland" :freedom 0.596 :gdp 1.34 :generosity 0.153 :healthy_life_expectancy 0.986 :perception_of_corruption 0.393 :rank 1 :score 7.769 :social_support 1.587}
 {:country "Denmark" :freedom 0.592 :gdp 1.383 :generosity 0.252 :healthy_life_expectancy 0.996 :perception_of_corruption 0.41 :rank 2 :score 7.6 :social_support 1.573}
 {:country "Norway" :freedom 0.603 :gdp 1.488 :generosity 0.271 :healthy_life_expectancy 1.028 :perception_of_corruption 0.341 :rank 3 :score 7.554 :social_support 1.582}
 {:country "Iceland" :freedom 0.591 :gdp 1.38 :generosity 0.354 :healthy_life_expectancy 1.026 :perception_of_corruption 0.118 :rank 4 :score 7.494 :social_support 1.624}
 {:country "Netherlands" :freedom 0.557 :gdp 1.396 :generosity 0.322 :healthy_life_expectancy 0.999 :perception_of_corruption 0.298 :rank 5 :score 7.488 :social_support 1.522}
 {:country "Switzerland" :freedom 0.572 :gdp 1.452 :generosity 0.263 :healthy_life_expectancy 1.052 :perception_of_corruption 0.343 :rank 6 :score 7.48 :social_support 1.526}
 {:country "Sweden" :freedom 0.574 :gdp 1.387 :generosity 0.267 :healthy_life_expectancy 1.009 :perception_of_corruption 0.373 :rank 7 :score 7.343 :social_support 1.487}
 {:country "New Zealand" :freedom 0.585 7 more elided}
 {9 more elided} {9 more elided} {9 more elided} {9 more elided} {9 more elided} {9 more elided} {9 more elided} {9 more elided} {9 more elided} {9 more elided} {9 more elided} {9 more elided} 136 more elided)
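
The renaming step is easy to check on a single row. This sketch uses one hand-written map shaped like a query result before renaming (only three of the columns, with Finland's values from the output above); raw-row and renamed are names we chose.

```clojure
(require '[clojure.set :as set])

;; One row as it might come back from the query, before renaming.
(def raw-row {:country_or_region "Finland" :rank 1 :score 7.769})

;; rename-keys takes a map from old key to new key; other keys are untouched.
(def renamed (set/rename-keys raw-row {:country_or_region :country}))
```

Keys that are not present in the row are ignored, so the same rename map can safely be applied to every row.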

Looking at the happiness data, it appears that all the usual suspects (Nordics, Western Europeans, Canadians, and Kiwis) are living pretty good lives by their own estimation. Looking closer, we see that although the top twenty countries are all relatively prosperous, it's clear that GDP is not strongly correlated with happiness within that cohort.

(clerk/table world-happiness)
:generosity  :social_support  :freedom  :rank  :score  :perception_of_corruption  :gdp   :country        :healthy_life_expectancy
0.153        1.587            0.596     1      7.769   0.393                      1.34   Finland         0.986
0.252        1.573            0.592     2      7.6     0.41                       1.383  Denmark         0.996
0.271        1.582            0.603     3      7.554   0.341                      1.488  Norway          1.028
0.354        1.624            0.591     4      7.494   0.118                      1.38   Iceland         1.026
0.322        1.522            0.557     5      7.488   0.298                      1.396  Netherlands     0.999
0.263        1.526            0.572     6      7.48    0.343                      1.452  Switzerland     1.052
0.267        1.487            0.574     7      7.343   0.373                      1.387  Sweden          1.009
0.33         1.557            0.585     8      7.307   0.38                       1.303  New Zealand     1.026
0.285        1.505            0.584     9      7.278   0.308                      1.365  Canada          1.039
0.244        1.475            0.532     10     7.246   0.226                      1.376  Austria         1.016
0.332        1.548            0.557     11     7.228   0.29                       1.372  Australia       1.036
0.144        1.441            0.558     12     7.167   0.093                      1.034  Costa Rica      0.963
0.261        1.455            0.371     13     7.139   0.082                      1.276  Israel          1.029
0.194        1.479            0.526     14     7.09    0.316                      1.609  Luxembourg      1.012
0.348        1.538            0.45      15     7.054   0.278                      1.333  United Kingdom  0.996
0.298        1.553            0.516     16     7.021   0.31                       1.499  Ireland         0.999
0.261        1.454            0.495     17     6.985   0.265                      1.373  Germany         0.987
0.16         1.504            0.473     18     6.923   0.21                       1.356  Belgium         0.986
0.28         1.457            0.454     19     6.892   0.128                      1.433  United States   0.874
0.046        1.487            0.457     20     6.852   0.036                      1.269  Czech Republic  0.92
136 more elided

Next, we'll compute a linear regression for this dataset using kixi.stats.

^{::clerk/viewer {:transform-fn (clerk/update-val kixi-p/parameters)}}
(def linear-regression
  (transduce identity (kixi-stats/simple-linear-regression :score :gdp) world-happiness))
[-0.631189360468119 0.28413343366803906]

We'll use this linear regression to augment our dataset so each datapoint also gets a :regression value.

(def world-happiness+regression
  (mapv (fn [{:as datapoint :keys [score]}]
          (assoc datapoint :regression (kixi-p/measure linear-regression score)))
        world-happiness))
[{:country "Finland" :freedom 0.596 :gdp 1.34 :generosity 0.153 :healthy_life_expectancy 0.986 :perception_of_corruption 0.393 :rank 1 :regression 1.5762432856988764 :score 7.769 :social_support 1.587}
 {:country "Denmark" :freedom 0.592 :gdp 1.383 :generosity 0.252 :healthy_life_expectancy 0.996 :perception_of_corruption 0.41 :rank 2 :regression 1.528224735408978 :score 7.6 :social_support 1.573}
 {:country "Norway" :freedom 0.603 :gdp 1.488 :generosity 0.271 :healthy_life_expectancy 1.028 :perception_of_corruption 0.341 :rank 3 :regression 1.5151545974602483 :score 7.554 :social_support 1.582}
 {:country "Iceland" :freedom 0.591 :gdp 1.38 :generosity 0.354 :healthy_life_expectancy 1.026 :perception_of_corruption 0.118 :rank 4 :regression 1.4981065914401657 :score 7.494 :social_support 1.624}
 {:country "Netherlands" :freedom 0.557 :gdp 1.396 :generosity 0.322 :healthy_life_expectancy 0.999 :perception_of_corruption 0.298 :rank 5 :regression 1.4964017908381577 :score 7.488 :social_support 1.522}
 {:country "Switzerland" :freedom 0.572 :gdp 1.452 :generosity 0.263 :healthy_life_expectancy 1.052 :perception_of_corruption 0.343 :rank 6 :regression 1.4941287233688132 :score 7.48 :social_support 1.526}
 {:country "Sweden" :freedom 0.574 :gdp 1.387 :generosity 0.267 :healthy_life_expectancy 1.009 :perception_of_corruption 0.373 :rank 7 :regression 1.455202442956292 :score 7.343 :social_support 1.487}
 {10 more elided} {10 more elided} {10 more elided} {10 more elided} {10 more elided} {10 more elided} {10 more elided} {10 more elided} {10 more elided} {10 more elided} {10 more elided} {10 more elided} {10 more elided} 136 more elided]
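
The two numbers printed for linear-regression read as an offset and a slope, with :score as x and :gdp as y, so each :regression value is the :gdp the fitted line predicts for that country's score. Below is a plain-Clojure sketch of ordinary least squares for one predictor. It is our own illustration with made-up helper names (mean, least-squares, predict), not kixi.stats' implementation. The last form applies the coefficients printed above to Finland's score.

```clojure
(defn mean [xs] (/ (reduce + xs) (double (count xs))))

(defn least-squares
  "Return [offset slope] of the least-squares line y = offset + slope * x."
  [xs ys]
  (let [mx    (mean xs)
        my    (mean ys)
        sxy   (reduce + (map #(* (- %1 mx) (- %2 my)) xs ys))
        sxx   (reduce + (map #(let [d (- % mx)] (* d d)) xs))
        slope (/ sxy sxx)]
    [(- my (* slope mx)) slope]))

(defn predict
  "Evaluate the fitted line at x."
  [[offset slope] x]
  (+ offset (* slope x)))

;; Coefficients copied from the output above, applied to Finland's score.
(def finland-regression
  (predict [-0.631189360468119 0.28413343366803906] 7.769))
```

Up to floating-point rounding, finland-regression agrees with the :regression value shown for Finland above, which supports the offset-and-slope reading.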

Let's graph the relationship between happiness and GDP to get a bird's eye view on the situation over our entire dataset. You can mouse over individual data points to get more info:

[Interactive plot of happiness score against GDP; not rendered in this text version.]

It looks, as we might have expected, like richer countries are happier than poor ones in general, though with variations and outliers. For example, Finland is in first place but has a GDP/capita similar to that of number 58, Japan. Perhaps even more striking, Qatar has the highest GDP/capita in the dataset, but Qataris are on average about as happy as people in El Salvador. Likewise, Botswana has five times the GDP/capita of Malawi, but its people are no happier for it. If I were forced to guess why, I might theorize that a prosperous country with all of its wealth concentrated in very few hands can still be a fairly wretched place for the average person to live.

One way to investigate this possibility is to plot the correlation between equality and happiness in the rich world. We'll use join again, but we'll first use clojure.set's project (named by analogy to SQL projection) to pluck just the :country and :score from the happiness dataset, then sort by the GDP and take the top 20 countries.

(clerk/vl
 {:data {:values (->> (project world-happiness [:country :score])
                      (join expectancy-and-gini)
                      (sort-by :gdp >)
                      (take 20))}
  :width 700
  :height 500
  :mark {:type "point"
         :tooltip {:field :country}}
  :encoding {:x {:field :score
                 :type :quantitative
                 :scale {:zero false}}
             :y {:field :gini
                 :type :quantitative
                 :scale {:zero false}}}})

This does, at least at first glance, support the notion that the happiest people, just like the longest-lived ones, tend to inhabit countries in the more equal part of the Gini distribution.
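
The project-then-join step behind that last plot can be tried on a few inline rows. The values below are copied from outputs earlier in this notebook, trimmed to a handful of keys; the sample var names are ours. One likely reason the projection matters: both tables have a :gdp key with different meanings, and a natural join matches on every shared key, so leaving the happiness table's :gdp in place would prevent any rows from matching.

```clojure
(require '[clojure.set :as set])

(def happiness-sample
  [{:country "Norway" :score 7.554 :gdp 1.488 :rank 3}
   {:country "Sweden" :score 7.343 :gdp 1.387 :rank 7}])

(def gini-sample
  [{:country "Norway"   :gdp 55400 :gini 27.6}
   {:country "Sweden"   :gdp 40900 :gini 30}
   {:country "Zimbabwe" :gdp 600   :gini 50.3}])

(def richest
  (->> (set/project happiness-sample [:country :score]) ; keep only these keys
       (set/join gini-sample)                           ; now joins on :country alone
       (sort-by :gdp >)                                 ; richest first
       (take 1)))

;; Without the projection, :gdp is a shared key too and nothing matches.
(def unprojected (set/join gini-sample happiness-sample))
```

Here richest holds just the Norway row, carrying :score from the happiness data and :gdp and :gini from the other table, while unprojected is empty.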

I hope this example gives you some ideas about things you'd like to investigate.