Generated with Clerk from notebooks/data_science.clj@8a1ea54

A small data science example 🔢

Exploring the world in data

One of the challenges in real data science is getting data from different sources in many different formats. In this notebook, we will explore some facts about the world using data taken from a TSV file, an Excel spreadsheet, and a database query.

Life expectancy

First, we'll read in a TSV file containing the most recent CIA World Factbook data using the meta-csv library.

(def cia-factbook
  (csv/read-csv "./datasets/cia-factbook.tsv"))

({"Airp/cap" "0.28693821" "Airports" "389000000" "Birthrate" "12.17" "Cell phones" "1100000000" "Cell/cap" "0.81139339" "Country" "China" "Education($%GDP)" nil "Exp/cap" "1630.1631" "Exports" "2.21e+12" "GDP/cap" "9800" 13 more elided}
 {"Airp/cap" "0.04961238" "Airports" "61338000" "Birthrate" "19.89" "Cell phones" "893862000" "Cell/cap" "0.72298773" "Country" "India" "Education($%GDP)" "3.2" "Exp/cap" "253.32742" "Exports" "313200000000" "GDP/cap" "4000" 13 more elided}
 {"Airp/cap" nil "Airports" nil "Birthrate" nil "Cell phones" nil "Cell/cap" nil "Country" "European Union" "Education($%GDP)" nil "Exp/cap" "4248.8308" "Exports" "2.173e+12" "GDP/cap" "34500" 13 more elided}
 {"Airp/cap" "0.76828494" "Airports" "245000000" "Birthrate" "13.42" "Cell phones" "310000000" "Cell/cap" "0.97211564" "Country" "United States" "Education($%GDP)" "5.4" "Exp/cap" "4938.9746" "Exports" "1.575e+12" "GDP/cap" "52800" 13 more elided}
 {"Airp/cap" "0.07886135" "Airports" "20000000" "Birthrate" "17.04" "Cell phones" "281960000" "Cell/cap" "1.1117874" "Country" "Indonesia" "Education($%GDP)" "2.8" "Exp/cap" "705.41482" "Exports" "178900000000" "GDP/cap" "5200" 13 more elided}
 {"Airp/cap" "0.37492946" "Airports" "75982000" "Birthrate" "14.72" "Cell phones" "248324000" "Cell/cap" "1.2253426" "Country" "Brazil" "Education($%GDP)" "5.8" "Exp/cap" "1207.9536" "Exports" "244800000000" "GDP/cap" "12100" 13 more elided}
 {"Airp/cap" "0.10414714" "Airports" "20431000" "Birthrate" "23.19" "Cell phones" "125000000" "Cell/cap" "0.6371882" "Country" "Pakistan" "Education($%GDP)" "2.1" "Exp/cap" "127.69252" "Exports" "25050000000" "GDP/cap" "3100" 13 more elided}
 {23 more elided} {23 more elided} {23 more elided} {23 more elided} {23 more elided} {23 more elided} {23 more elided} {23 more elided} {23 more elided} {23 more elided} {23 more elided} {23 more elided} {23 more elided}
 217 more elided)

Expanding the results in the data viewer shows that some columns of interest contain nil values, and that these threw off our TSV importer, which left the numerical columns as strings instead of converting them to numbers.

We're going to post-process this table a bit with ordinary Clojure sequence functions to filter out rows that have nils in our columns of interest, select just those columns, convert strings to numbers, and, because we're Clojurists, convert keys to keywords.

(def life-expectancy
  (->> cia-factbook
       (remove #(some nil? (map (partial get %) ["Country" "GDP/cap" "Life expectancy"])))
       (map #(sorted-map :country (str/trim (get % "Country"))
                         :gdp (read-string (get % "GDP/cap"))
                         :life-expectancy (read-string (get % "Life expectancy"))))))

({:country "China" :gdp 9800 :life-expectancy 75.15}
 {:country "India" :gdp 4000 :life-expectancy 67.8}
 {:country "European Union" :gdp 34500 :life-expectancy 80.02}
 {:country "United States" :gdp 52800 :life-expectancy 79.56}
 {:country "Indonesia" :gdp 5200 :life-expectancy 72.17}
 {:country "Brazil" :gdp 12100 :life-expectancy 73.28}
 {:country "Pakistan" :gdp 3100 :life-expectancy 67.05}
 {:country "Nigeria" :gdp 2800 :life-expectancy 52.62}
 {:country "Bangladesh" :gdp 2100 :life-expectancy 70.65}
 {:country "Russia" :gdp 18100 :life-expectancy 70.16}
 {:country "Japan" :gdp 37100 :life-expectancy 84.46}
 {:country "Mexico" :gdp 15600 :life-expectancy 75.43}
 {:country "Philippines" :gdp 4700 :life-expectancy 72.48}
 {:country "Ethiopia" :gdp 1300 :life-expectancy 60.75}
 {:country "Vietnam" :gdp 4000 :life-expectancy 72.91}
 {:country "Egypt" :gdp 6600 :life-expectancy 73.45}
 {:country "Turkey" :gdp 15300 :life-expectancy 73.29}
 {:country "Germany" :gdp 39500 :life-expectancy 80.44}
 {:country "Iran" :gdp 12800 :life-expectancy 70.89}
 {:country "Thailand" :gdp 9900 :life-expectancy 74.18}
 199 more elided)
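The same cleanup can be tried on a toy input. The sketch below is only an illustration: the two sample rows are made up (the trailing space in "China " and the "Nowhere" row are assumptions), and it swaps in clojure.edn/read-string, which parses numbers without evaluating reader code, in place of the read-string call used above.

```clojure
(require '[clojure.edn :as edn]
         '[clojure.string :as str])

;; Hypothetical rows shaped like the importer's output: string keys,
;; string values, and nil where a field was missing.
(def sample-rows
  [{"Country" "China " "GDP/cap" "9800" "Life expectancy" "75.15"}
   {"Country" "Nowhere" "GDP/cap" nil "Life expectancy" "70.0"}])

(def wanted ["Country" "GDP/cap" "Life expectancy"])

(defn clean-row
  "Keep only the wanted columns, keywordize the keys, parse the numbers."
  [row]
  (sorted-map :country (str/trim (get row "Country"))
              :gdp (edn/read-string (get row "GDP/cap"))
              :life-expectancy (edn/read-string (get row "Life expectancy"))))

(def cleaned
  (->> sample-rows
       ;; drop any row with a nil in one of the wanted columns
       (remove (fn [row] (some nil? (map #(get row %) wanted))))
       (map clean-row)))
```

Only the first row survives the nil filter, and its numeric fields come back as numbers rather than strings.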

Things look pretty good in the data structure browser, but it would be easier to get an overview in tabular form. Luckily, Clerk's built-in table viewer is able to infer how to handle all of the most common configurations of rows and columns automatically.

(clerk/table life-expectancy)

:country	:gdp	:life-expectancy
China	9800	75.15
India	4000	67.8
European Union	34500	80.02
United States	52800	79.56
Indonesia	5200	72.17
Brazil	12100	73.28
Pakistan	3100	67.05
Nigeria	2800	52.62
Bangladesh	2100	70.65
Russia	18100	70.16
Japan	37100	84.46
Mexico	15600	75.43
Philippines	4700	72.48
Ethiopia	1300	60.75
Vietnam	4000	72.91
Egypt	6600	73.45
Turkey	15300	73.29
Germany	39500	80.44
Iran	12800	70.89
Thailand	9900	74.18
199 more elided

We can also graph the data to see if there is any visible correlation between our two variables of interest, GDP per capita and life expectancy.

(clerk/vl
 {:data {:values life-expectancy}
  :width 700
  :height 500
  :mark {:type "point"
         ;; the cleaned rows use the :country keyword, not the "Country" string
         :tooltip {:field :country}}
  :encoding {:x {:field :gdp
                 :type :quantitative}
             :y {:field :life-expectancy
                 :type :quantitative}}})

Unsurprisingly, it seems that living in an extremely poor country has negative consequences for life expectancy. On the other hand, it looks like things start to flatten out once GDP/capita goes above $10-15k/year. Some other interesting patterns also emerge: Singapore and Japan have similar life expectancies, despite the former's GDP/capita being twice the latter's, and Qatar, the richest nation in the dataset by GDP/capita, has an average life expectancy similar to that of the Dominican Republic.

Inequality

Now, let's try the same experiment using information from a spreadsheet containing the GINI coefficient — a widely used measure of income inequality — for each country. We're going to use a library called Docjure that provides access to Microsoft Office file formats.

Docjure's API is a bit low-level and doesn't make the obvious tasks easy, so we're going to use this helper function to make the code below clearer. Check out the line-by-line comments to see how this function works.

(defn load-first-sheet
  "Return the first sheet of an Excel spreadsheet as a seq of maps."
  [filename]
  (let [rows (->> (ss/load-workbook filename) ; load the file
                  (ss/sheet-seq)              ; seq of sheets in the file
                  first                       ; take the first (only)
                  ss/row-seq                  ; get the rows from it
                  (mapv ss/cell-seq))         ; each row -> seq of cells
        ;; break off the headers to produce a seq of maps
        headers   (mapv (comp keyword ss/read-cell) (first rows))]
    ;; map over the rows creating new maps with the headers as keys
    (mapv #(zipmap headers (map ss/read-cell %)) (rest rows))))

#object[data_science$load_first_sheet 0x7502c979 "data_science$load_first_sheet@7502c979"]
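The header-splitting step at the heart of this helper can be tried without a spreadsheet. The sketch below assumes the rows have already been read into plain vectors; rows->maps is a hypothetical name used only for this illustration.

```clojure
(defn rows->maps
  "Treat the first row as headers and turn every later row into a map."
  [rows]
  (let [headers (mapv keyword (first rows))]   ; "country" -> :country
    (mapv #(zipmap headers %) (rest rows))))   ; pair headers with cell values

;; Toy input standing in for the cell values of a sheet.
(def sample-sheet
  [["country" "giniWB"]
   ["Sweden"  30.0]
   ["Norway"  27.6]])

(rows->maps sample-sheet)
;; => [{:country "Sweden", :giniWB 30.0} {:country "Norway", :giniWB 27.6}]
```

Note that zipmap silently drops trailing cells when a row is longer than the header row, so ragged sheets may need extra care.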

Now we're going to use a few lines of code to:

Load the spreadsheet data.
Use clojure.set's join function to combine our freshly loaded GINI spreadsheet with our previously prepared life expectancy data, which works because they are both sequences of maps that have a :country key.
Assoc a :gini key into each map, using the World Bank's number when it is present and falling back to the CIA's estimate otherwise; rows with neither are dropped. (These kinds of small programmatic tasks are a constant feature of data wrangling.)
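The steps above can be sketched on a toy dataset. In the snippet below the Sweden and Hong Kong figures are taken from this notebook's output, the "Nowhere" row is hypothetical, and the var names are made up for the illustration.

```clojure
(require '[clojure.set :as set])

(def expectancy-sample
  [{:country "Sweden"    :life-expectancy 81.89}
   {:country "Hong Kong" :life-expectancy 82.78}
   {:country "Nowhere"   :life-expectancy 70.0}   ; hypothetical row
   {:country "Iran"      :life-expectancy 70.89}]) ; no GINI row below

(def gini-sample
  [{:country "Sweden"    :giniWB 30  :giniCIA 28.8}
   {:country "Hong Kong" :giniWB nil :giniCIA 53.9}
   {:country "Nowhere"   :giniWB nil :giniCIA nil}])

(def joined-sample
  (->> (set/join expectancy-sample gini-sample)   ; natural join on :country
       ;; prefer the World Bank figure, fall back to the CIA one,
       ;; and drop rows that have neither
       (keep #(when-let [gini (or (:giniWB %) (:giniCIA %))]
                (assoc % :gini gini)))))
```

Iran disappears because it has no partner row in the join, and "Nowhere" disappears because both of its GINI fields are nil. Because join returns a set, the order of the result is not guaranteed.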

(def expectancy-and-gini
  (->> (load-first-sheet "datasets/countries-gini.xlsx")
       (join life-expectancy)
       (keep #(if-let [gini (or (:giniWB %) (:giniCIA %))]
                (assoc % :gini gini)
                nil))))

({:country "Lithuania" :gdp 22600 :gini 35.7 :giniCIA 37.3 :giniWB 35.7 :life-expectancy 75.98 :pop2021 2689.862 :yearCIA 2017 :yearWB 2018}
 {:country "Turkey" :gdp 15300 :gini 41.9 :giniCIA 41.9 :giniWB 41.9 :life-expectancy 73.29 :pop2021 85042.738 :yearCIA 2018 :yearWB 2019}
 {:country "Sweden" :gdp 40900 :gini 30 :giniCIA 28.8 :giniWB 30 :life-expectancy 81.89 :pop2021 10160.169 :yearCIA 2017 :yearWB 2018}
 {:country "Norway" :gdp 55400 :gini 27.6 :giniCIA 27 :giniWB 27.6 :life-expectancy 81.6 :pop2021 5465.63 :yearCIA 2017 :yearWB 2018}
 {:country "Zimbabwe" :gdp 600 :gini 50.3 :giniCIA 44.3 :giniWB 50.3 :life-expectancy 55.68 :pop2021 15092.171 :yearCIA 2017 :yearWB 2019}
 {:country "Iran" :gdp 12800 :gini 42 :giniCIA 40.8 :giniWB 42 :life-expectancy 70.89 :pop2021 85028.759 :yearCIA 2017 :yearWB 2018}
 {:country "Yemen" :gdp 2500 :gini 36.7 :giniCIA 36.7 :giniWB 36.7 :life-expectancy 64.83 :pop2021 30490.64 :yearCIA 2014 :yearWB 2014}
 {:country "Netherlands" :gdp 43300 7 more elided}
 {9 more elided} {9 more elided} {9 more elided} {9 more elided} {9 more elided} {9 more elided} {9 more elided} {9 more elided} {9 more elided} {9 more elided} {9 more elided} {9 more elided}
 143 more elided)

Expanding the Clojure data structures makes it look like this will work for our comparisons. Let's plot the data to see a list of countries from most to least equal:

(clerk/vl
 {:data {:values expectancy-and-gini}
  :width 600
  :height 1600
  :mark {:type "point"
         :tooltip {:field :country}}
  :encoding {:x {:field :gini
                 :type :quantitative}
             :y {:field :country
                 :type :nominal
                 :sort "x"}}})

And now to have a look at whether inequality and life expectancy are correlated:

(clerk/vl
 {:data {:values expectancy-and-gini}
  :mark "rect"
  :width 700
  :height 500
  :encoding {:x {:bin {:maxbins 25}
                 :field :life-expectancy
                 :type "quantitative"}
             :y {:bin {:maxbins 25}
                 :field :gini
                 :type "quantitative"}
             :color {:aggregate "count" :type "quantitative"}}
  :config {:view {:stroke "transparent"}}})

It seems like most of the long-lived countries are also in the lower two thirds of the inequality distribution. A little filtering shows us that the only really long-lived country above a GINI coefficient of ~50 is Hong Kong.

(clerk/table
 (->> (filter #(< 50 (:gini %)) expectancy-and-gini)
      (sort-by :life-expectancy)))

:yearCIA	:pop2021	:life-expectancy	:giniCIA	:gini	:gdp	:giniWB	:yearWB	:country
2014	60041.994	49.56	63	63	11500	63	2014	South Africa
2010	2015.494	49.87	50.7	50.7	1200	50.7	2010	Guinea-Bissau
2003	4919.981	51.35	43.6	56.2	700	56.2	2008	Central African Republic
2015	18920.651	51.83	57.1	57.1	1800	57.1	2015	Zambia
2015	2587.344	51.85	59.1	59.1	8200	59.1	2015	Namibia
2014	32163.047	52.6	54	54	1200	54	2014	Mozambique
2015	2397.241	54.06	53.3	53.3	16400	53.3	2015	Botswana
2018	33933.61	55.29	51.3	51.3	6300	51.3	2018	Angola
2017	15092.171	55.68	44.3	50.3	600	50.3	2019	Zimbabwe
2017	223.368	64.22	56.3	56.3	2200	56.3	2017	Sao Tome and Principe
nil	404.914	68.49	nil	53.3	8800	53.3	1999	Belize
nil	591.8	71.69	nil	57.9	12900	57.9	1999	Suriname
2018	213993.437	73.28	53.9	53.4	12100	53.4	2019	Brazil
2018	51265.844	75.25	50.4	51.3	11100	51.3	2019	Colombia
2016	184.4	77.41	51.2	51.2	13100	51.2	2016	Saint Lucia
2016	7552.81	82.78	53.9	53.9	52700	nil	nil	Hong Kong

Happiness

Let's look at happiness! This time, we'll use next.jdbc to perform a SQL query on a SQLite database containing a table of countries and their relative happiness ratings. Note that we're changing the column name :country_or_region to :country using clojure.set's rename-keys function so that this table will be easy to join with our others.

(def world-happiness
  (let [_run-at #inst "2021-11-26T08:28:29.445-00:00" ; bump this to re-run the query!
        ds (jdbc/get-datasource {:dbtype "sqlite" :dbname "./datasets/happiness.db"})]
    (->> (with-open [conn (jdbc/get-connection ds)]
           (jdbc/execute! conn ["SELECT * FROM happiness"]
                          {:return-keys true :builder-fn rs/as-unqualified-lower-maps}))
         (map #(rename-keys % {:country_or_region :country})))))

({:country "Finland" :freedom 0.596 :gdp 1.34 :generosity 0.153 :healthy_life_expectancy 0.986 :perception_of_corruption 0.393 :rank 1 :score 7.769 :social_support 1.587}
 {:country "Denmark" :freedom 0.592 :gdp 1.383 :generosity 0.252 :healthy_life_expectancy 0.996 :perception_of_corruption 0.41 :rank 2 :score 7.6 :social_support 1.573}
 {:country "Norway" :freedom 0.603 :gdp 1.488 :generosity 0.271 :healthy_life_expectancy 1.028 :perception_of_corruption 0.341 :rank 3 :score 7.554 :social_support 1.582}
 {:country "Iceland" :freedom 0.591 :gdp 1.38 :generosity 0.354 :healthy_life_expectancy 1.026 :perception_of_corruption 0.118 :rank 4 :score 7.494 :social_support 1.624}
 {:country "Netherlands" :freedom 0.557 :gdp 1.396 :generosity 0.322 :healthy_life_expectancy 0.999 :perception_of_corruption 0.298 :rank 5 :score 7.488 :social_support 1.522}
 {:country "Switzerland" :freedom 0.572 :gdp 1.452 :generosity 0.263 :healthy_life_expectancy 1.052 :perception_of_corruption 0.343 :rank 6 :score 7.48 :social_support 1.526}
 {:country "Sweden" :freedom 0.574 :gdp 1.387 :generosity 0.267 :healthy_life_expectancy 1.009 :perception_of_corruption 0.373 :rank 7 :score 7.343 :social_support 1.487}
 {:country "New Zealand" :freedom 0.585 7 more elided}
 {9 more elided} {9 more elided} {9 more elided} {9 more elided} {9 more elided} {9 more elided} {9 more elided} {9 more elided} {9 more elided} {9 more elided} {9 more elided} {9 more elided}
 136 more elided)

Looking at the happiness data, it appears that all the usual suspects (Nordics, Western Europeans, Canadians, and Kiwis) are living pretty good lives by their own estimation. Looking closer, we see that although the top twenty countries are all relatively prosperous, it's clear that GDP is not strongly correlated with happiness within that cohort.

(clerk/table world-happiness)

:generosity	:social_support	:freedom	:rank	:score	:perception_of_corruption	:gdp	:country	:healthy_life_expectancy
0.153	1.587	0.596	1	7.769	0.393	1.34	Finland	0.986
0.252	1.573	0.592	2	7.6	0.41	1.383	Denmark	0.996
0.271	1.582	0.603	3	7.554	0.341	1.488	Norway	1.028
0.354	1.624	0.591	4	7.494	0.118	1.38	Iceland	1.026
0.322	1.522	0.557	5	7.488	0.298	1.396	Netherlands	0.999
0.263	1.526	0.572	6	7.48	0.343	1.452	Switzerland	1.052
0.267	1.487	0.574	7	7.343	0.373	1.387	Sweden	1.009
0.33	1.557	0.585	8	7.307	0.38	1.303	New Zealand	1.026
0.285	1.505	0.584	9	7.278	0.308	1.365	Canada	1.039
0.244	1.475	0.532	10	7.246	0.226	1.376	Austria	1.016
0.332	1.548	0.557	11	7.228	0.29	1.372	Australia	1.036
0.144	1.441	0.558	12	7.167	0.093	1.034	Costa Rica	0.963
0.261	1.455	0.371	13	7.139	0.082	1.276	Israel	1.029
0.194	1.479	0.526	14	7.09	0.316	1.609	Luxembourg	1.012
0.348	1.538	0.45	15	7.054	0.278	1.333	United Kingdom	0.996
0.298	1.553	0.516	16	7.021	0.31	1.499	Ireland	0.999
0.261	1.454	0.495	17	6.985	0.265	1.373	Germany	0.987
0.16	1.504	0.473	18	6.923	0.21	1.356	Belgium	0.986
0.28	1.457	0.454	19	6.892	0.128	1.433	United States	0.874
0.046	1.487	0.457	20	6.852	0.036	1.269	Czech Republic	0.92
136 more elided

Next, we're computing a linear regression for this dataset using kixi.stats.

(def linear-regression
  (transduce identity (kixi-stats/simple-linear-regression :score :gdp) world-happiness))

[-0.631189360468119 0.28413343366803906]
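The printed vector is consistent with reading it as [offset slope] for a line that predicts :gdp from :score. To make that concrete, here is a minimal plain-Clojure sketch of an ordinary least-squares fit. The names least-squares-fit and predict are hypothetical helpers written for this illustration; they are not part of kixi.stats and may not match how that library computes its result internally.

```clojure
(defn least-squares-fit
  "Fit y = offset + slope * x to a seq of [x y] pairs; returns [offset slope]."
  [points]
  (let [n      (count points)
        mean   (fn [vs] (/ (reduce + vs) n))
        x-mean (mean (map first points))
        y-mean (mean (map second points))
        ;; covariance and variance numerators (the shared 1/n cancels out)
        sxy    (reduce + (map (fn [[x y]] (* (- x x-mean) (- y y-mean))) points))
        sxx    (reduce + (map (fn [[x _]] (* (- x x-mean) (- x x-mean))) points))
        slope  (/ sxy sxx)]
    [(- y-mean (* slope x-mean)) slope]))

(defn predict
  "Evaluate the fitted line at x."
  [[offset slope] x]
  (+ offset (* slope x)))

;; Points lying exactly on y = 1 + 2x recover offset 1 and slope 2.
(least-squares-fit [[0 1] [1 3] [2 5]]) ; => [1 2]
```

Applying predict to the coefficients printed above with Finland's score of 7.769 gives roughly 1.576, in line with the :regression value the notebook reports for Finland further down.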

We'll use this linear regression to augment our dataset so each datapoint also gets a :regression value.

(def world-happiness+regression
  (mapv (fn [{:as datapoint :keys [score]}]
          (assoc datapoint :regression (kixi-p/measure linear-regression score)))
        world-happiness))

[{:country "Finland" :freedom 0.596 :gdp 1.34 :generosity 0.153 :healthy_life_expectancy 0.986 :perception_of_corruption 0.393 :rank 1 :regression 1.5762432856988764 :score 7.769 :social_support 1.587}
 {:country "Denmark" :freedom 0.592 :gdp 1.383 :generosity 0.252 :healthy_life_expectancy 0.996 :perception_of_corruption 0.41 :rank 2 :regression 1.528224735408978 :score 7.6 :social_support 1.573}
 {:country "Norway" :freedom 0.603 :gdp 1.488 :generosity 0.271 :healthy_life_expectancy 1.028 :perception_of_corruption 0.341 :rank 3 :regression 1.5151545974602483 :score 7.554 :social_support 1.582}
 {:country "Iceland" :freedom 0.591 :gdp 1.38 :generosity 0.354 :healthy_life_expectancy 1.026 :perception_of_corruption 0.118 :rank 4 :regression 1.4981065914401657 :score 7.494 :social_support 1.624}
 {:country "Netherlands" :freedom 0.557 :gdp 1.396 :generosity 0.322 :healthy_life_expectancy 0.999 :perception_of_corruption 0.298 :rank 5 :regression 1.4964017908381577 :score 7.488 :social_support 1.522}
 {:country "Switzerland" :freedom 0.572 :gdp 1.452 :generosity 0.263 :healthy_life_expectancy 1.052 :perception_of_corruption 0.343 :rank 6 :regression 1.4941287233688132 :score 7.48 :social_support 1.526}
 {:country "Sweden" :freedom 0.574 :gdp 1.387 :generosity 0.267 :healthy_life_expectancy 1.009 :perception_of_corruption 0.373 :rank 7 :regression 1.455202442956292 :score 7.343 :social_support 1.487}
 {10 more elided} {10 more elided} {10 more elided} {10 more elided} {10 more elided} {10 more elided} {10 more elided} {10 more elided} {10 more elided} {10 more elided} {10 more elided} {10 more elided} {10 more elided}
 136 more elided]

Let's graph the relationship between happiness and GDP to get a bird's eye view on the situation over our entire dataset. You can mouse over individual data points to get more info:

It looks, as we might have expected, like richer countries are happier than poor ones in general, though with variations and outliers. For example, Finland is in first place but has a GDP/capita similar to that of number 58, Japan. Perhaps even more striking, Qatar has the highest GDP/capita in the dataset, but Qataris are on average about as happy as people in El Salvador. Likewise, Botswana has five times the GDP/capita of Malawi, but its people are no happier for it. If I were forced to guess why, I might theorize that a prosperous country with all of its wealth concentrated in very few hands can still be a fairly wretched place for the average person to live.

One way to investigate this possibility is to plot the correlation between equality and happiness in the rich world. We'll use join again, but we'll first use clojure.set's project (named by analogy to SQL projection) to pluck just the :country and :score from the happiness dataset, then sort by the GDP and take the top 20 countries.

(clerk/vl
 {:data {:values (->> (project world-happiness [:country :score])
                      (join expectancy-and-gini)
                      (sort-by :gdp >)
                      (take 20))}
  :width 700
  :height 500
  :mark {:type "point"
         :tooltip {:field :country}}
  :encoding {:x {:field :score
                 :type :quantitative
                 :scale {:zero false}}
             :y {:field :gini
                 :type :quantitative
                 :scale {:zero false}}}})

This does, at least at first glance, support the notion that the happiest people — just like the longest lived ones — tend to inhabit countries in the more equal part of the GINI distribution.
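A note on the projection step used for that plot: clojure.set's join matches on every key the two relations share, and both datasets carry a :gdp key on different scales (an index around 1.4 in the happiness data, dollars per capita in the other). The sketch below, with a few values copied from this notebook and var names invented for the illustration, shows why projecting first matters.

```clojure
(require '[clojure.set :as set])

(def happiness-sample
  [{:country "Norway" :score 7.554 :gdp 1.488}
   {:country "Sweden" :score 7.343 :gdp 1.387}])

(def wealth-sample
  [{:country "Norway" :gdp 55400 :gini 27.6}
   {:country "Sweden" :gdp 40900 :gini 30}])

;; Without project, the join keys are :country AND :gdp, and no pair of
;; rows agrees on both, so nothing matches.
(def naive-join (set/join happiness-sample wealth-sample))

;; Projecting to [:country :score] leaves :country as the only shared key.
(def richest
  (->> (set/project happiness-sample [:country :score])
       (set/join wealth-sample)
       (sort-by :gdp >)   ; richest first, using the dollar-valued :gdp
       (take 1)))
```

Here naive-join is empty, while richest holds the single Norway row with its :score, :gini, and dollar-valued :gdp.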

I hope this example gives you some ideas about things you'd like to investigate.