A literate programming tutorial using my book reading habits
This page started as a collection of statistics about my book reading habits, but it is also an example of the literate programming features Org Mode provides. More specifically, it shows how you can use Org Mode to pass data between source blocks written in different languages. This makes it possible, for instance, to generate data using Ruby, plot it in Gnuplot, analyze it with datamash or R, and export the resulting page to HTML.
Org Mode is not required to achieve the same result: we could write separate scripts and then use a build tool such as Make or Rake to glue the code together or, possibly, write a monolithic script in Ruby which uses system calls to invoke other tools. Org Mode, however, lets us glue different pieces of code together in a more natural way.
You can also view a list of books by year computed from the data.
Getting data about my reading habits
I keep data about my reading habits in Calibre, where I defined a couple of metadata fields which allow me to record the start and end date of each book I read. Other information is read directly from Calibre, which keeps title, authors, and genre, and has a plugin to compute the number of pages based on the book length.
Calibre has a function to export all data as CSV. I have a function which converts it into a YAML file, which is the input file for this page. The following code reads the YAML file with the books I read and defines a couple of functions which will be useful later.
The first line declares various execution parameters. The code is written in Ruby and executed in a :session, so that values are persisted and shared among all other blocks written in the same language. We do not care about outputting the results of the code evaluation in the buffer, hence we can declare :results none.
    require 'yaml'
    require 'date'

    books = YAML::load_file "../_data/bookstream.yml", permitted_classes: [Date]

    # group an array of hashes according to the values of a key and add
    # some data, such as the number of books belonging to each group and
    # the total number of pages
    def group_and_array(key, books)
      grouped = books.group_by { |x| x[key] }
      grouped.map { |k, v| [ k, v.size, v.map { |p| p["pages"] }.inject(&:+) ] }
    end

    # convert an aoh (= array of hashes) into an array of rows
    # (e.g., to write a csv file).
    # assume all entries have the same keys, although not necessarily in
    # the same order (do not expect Hash.values to return the values in
    # the same order among all entries of the array)
    def aoh_to_a aoh
      keys = aoh[0].keys
      array = []
      aoh.each do |entry|
        array << keys.map { |key| entry[key].to_s }
      end
      array
    end
The books variable contains data which we can use to compute various statistics.
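To make the shape of the data concrete, here is a minimal sketch of what the two helpers return. The three sample records are hypothetical (the real ones come from bookstream.yml), and the helpers are restated in a slightly more compact form so that the snippet runs on its own.

```ruby
# Hypothetical sample records; the real data comes from bookstream.yml.
sample_books = [
  { "title" => "A", "category" => "crime",           "pages" => 300 },
  { "title" => "B", "category" => "science_fiction", "pages" => 250 },
  { "title" => "C", "category" => "crime",           "pages" => 200 }
]

# Restated group_and_array: one row per distinct value of key, holding
# the value, the number of books, and the total number of pages.
def group_and_array(key, books)
  books.group_by { |book| book[key] }.map do |value, group|
    [value, group.size, group.sum { |book| book["pages"] }]
  end
end

# Restated aoh_to_a: one row of strings per hash, with the columns
# ordered as in the first entry.
def aoh_to_a(aoh)
  keys = aoh.first.keys
  aoh.map { |entry| keys.map { |key| entry[key].to_s } }
end

by_genre = group_and_array("category", sample_books)
rows = aoh_to_a(sample_books)

p by_genre
# => [["crime", 2, 500], ["science_fiction", 1, 250]]
p rows.first
# => ["A", "crime", "300"]
```

Each row of by_genre is already in the form Org Mode expects for a table line, which is why the blocks below only need to prepend a header.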
What genres do I like more?
Simple answer: science fiction and crime. Is it true, however? To answer this question we use the group_and_array function we defined above, which groups book data according to a field, computes the number of books and the total number of pages per category, and returns an array of arrays.
The following piece of code, thus, groups books by genre and returns a table, which can be nicely shown by Org Mode. Notice that we attach a header ([["Genre", "Books", "Pages"]]) and a separation line [nil]. I learned the trick about the separation line here. It allows us to output column headers.
Notice also that we export both the code and the results of the execution, and that we ask Org Mode to show, as results, the value returned from executing the code, that is, an array of arrays in our case. A nice explanation of the values :results can assume is available in the Org Mode Manual.
    header = [["Genre", "Books", "Pages"]] + [nil]
    body = group_and_array("category", books)
    # remove books with no genre
    body = body.select { |x| x[0] != "" }
    # fix Genre string to improve output
    body = body.map { |x| [x[0].gsub("_", " ").capitalize, x[1], x[2]] }
    # sort it by frequency
    body = body.sort { |x, y| x[1] <=> y[1] }.reverse
    header + body
Genre | Books | Pages |
---|---|---|
Science fiction | 66 | 25336 |
Crime | 28 | 11947 |
Novel | 15 | 2313 |
Non fiction | 12 | 2545 |
History | 12 | 6819 |
Science | 11 | 2623 |
Food | 11 | 4036 |
Management | 9 | 1447 |
War biography | 8 | 5091 |
Humour | 8 | 860 |
Fiction | 8 | 1987 |
Economics | 4 | 1260 |
Sea | 4 | 1007 |
History of science | 4 | 1184 |
Biography | 3 | 946 |
Tragedy | 3 | 558 |
Comedy | 3 | 637 |
Medicine | 2 | 891 |
Kids | 1 | 40 |
The number of pages is computed by a Calibre plugin whose results I cannot check and which could not run on some of my entries, since the electronic version of the book was missing.
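The header-plus-[nil] convention used for the genre table can be imitated outside Org Mode. The sketch below is only an illustration of the idea: to_org_table is a hypothetical helper, not part of Org Mode or of the code on this page, and it turns a nil row into a horizontal rule. The two body rows are copied from the table above.

```ruby
# Render an array of rows as Org table text; a nil row becomes a rule.
# This only imitates what Org Mode does with the value returned by a block.
def to_org_table(rows)
  width = rows.compact.map(&:size).max
  rows.map do |row|
    if row.nil?
      "|" + (["---"] * width).join("+") + "|"
    else
      "| " + row.join(" | ") + " |"
    end
  end.join("\n")
end

header = [["Genre", "Books", "Pages"]] + [nil]
body = [["Science fiction", 66, 25336], ["Crime", 28, 11947]]

table_text = to_org_table(header + body)
puts table_text
# | Genre | Books | Pages |
# |---+---+---|
# | Science fiction | 66 | 25336 |
# | Crime | 28 | 11947 |
```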
How many books do I read in a year?
More precisely: of the books I read, how many books did I start reading in a given year?
Once again, we can use the group_and_array function, grouping by year. This lets us show a table, which we enrich with a textual barplot built using "-" * N, a Ruby construct to build strings of N repetitions of the given string.
    header = [["Year", "Books", "Pages", "Avg. Pages/Day", "Plot" ]] + [nil]
    body = group_and_array("started_year", books)
    # remove books with no year
    body = body.select { |x| x[0] }
    # add some stats
    body = body.map { |x| [ x[0], x[1], x[2], x[2] / 365, "-" * x[1] ] }
    # sort
    body = body.sort { |x, y| x[0] <=> y[0] }
    header + body
Year | Books | Pages | Avg. Pages/Day | Plot |
---|---|---|---|---|
2012 | 8 | 2287 | 6 | -------- |
2013 | 8 | 3646 | 9 | -------- |
2014 | 13 | 3921 | 10 | ------------- |
2015 | 19 | 5489 | 15 | ------------------- |
2016 | 5 | 2295 | 6 | ----- |
2017 | 6 | 1472 | 4 | ------ |
2018 | 9 | 3507 | 9 | --------- |
2019 | 4 | 2064 | 5 | ---- |
2020 | 6 | 3695 | 10 | ------ |
2021 | 4 | 3498 | 9 | ---- |
2022 | 5 | 2845 | 7 | ----- |
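Two details of the table above are worth spelling out: the average is a whole number because dividing two integers in Ruby drops the fraction, and the bar is plain string repetition. A small sketch, using two rows copied from the table (the extra one-decimal column is my addition, for comparison only):

```ruby
# Two rows taken from the table above: [year, books, pages].
yearly = [[2014, 13, 3921], [2015, 19, 5489]]

report = yearly.map do |year, count, pages|
  [
    year,
    count,
    pages / 365,               # Integer division, as in the table: drops the fraction
    (pages / 365.0).round(1),  # Float division, rounded to one decimal
    "-" * count                # String#* repeats the string count times
  ]
end

report.each { |row| puts row.join(" | ") }
# 2014 | 13 | 10 | 10.7 | -------------
# 2015 | 19 | 15 | 15.0 | -------------------
```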
We can plot the data using Gnuplot, passing as input the data of the table built by Ruby. This is achieved by the following source block, which takes as input the table above, through the :var barplot = books-per-year declaration. Notice that we also need to give a name to the table, with the #+NAME: books-per-year declaration.
The reset command in the Gnuplot script is rather useful, as it ensures that all settings are reset to their default values; otherwise Gnuplot will use any setting defined in previous blocks in this buffer.
    reset
    set boxwidth 0.5
    set grid ytics linestyle 0
    set style fill solid 0.20 border
    set terminal svg size 1200,800 font 'Arial,10'
    set title "Books Read"
    set xlabel "Year"
    set ylabel "Number of Books"
    plot barplot using 1:2:xtic(1) with boxes lc rgb "#0045FF" title "Books read", \
         barplot using 1:($2+0.25):2 with labels title ""
How long does it take me to read a book?
The next question is how long it takes me to read a book, in calendar days. Notice that this is different from the actual days spent reading, since calendar days do not measure effort. In some cases I stopped reading a book and got back to finish it when I was in the mood again. In other cases I would read two books in parallel, even though this is something I did more often when I was younger. The table also shows genre and rating, although I have not been very consistent in rating all the books I read.
Notice that here we use a slightly different notation for naming the output: we assign the name to the source block, rather than to its output. The effect is the same.
    # find the books which I started and ended
    read = books.select { |x| x["started"] and x["completed"] }
    header = [ ["Title", "Days", "Pages", "Avg. Pages / Day", "Genre", "Rating"] ] + [nil]
    body2 = read.map { |x|
      days = (x["completed"] - x["started"] ).to_i;
      [ x["title"],
        days,
        x["pages"],
        days != 0 ? ("%.2f" % (x["pages"] / days.to_f)) : "N/A",
        x["category"].gsub("_", " "),
        x["my_rating"] ]
    }
    body2 = body2.sort { |x, y| x[1] <=> y[1] }.reverse
    header + body2
Title | Days | Pages | Avg. Pages / Day | Genre | Rating |
---|---|---|---|---|---|
Invisible Planets | 681 | 388 | 0.57 | science fiction | 4 |
Even Dogs in the Wild | 386 | 461 | 1.19 | crime | 4 |
Inferno | 203 | 1969 | 9.70 | war biography | 5 |
Wild Swans: Three Daughters of China | 181 | 714 | 3.94 | history | 5 |
The Birth of Plenty: How the Prosperity of the Modern World was Created | 144 | 533 | 3.70 | history | 5 |
A History of the World | 122 | 841 | 6.89 | history | 5 |
The Gulag Archipelago | 122 | 553 | 4.53 | biography | 4 |
The Stand | 121 | 1595 | 13.18 | science fiction | 4 |
Apollo | 108 | 608 | 5.63 | history of science | 5 |
How Not to Be Wrong : The Power of Mathematical Thinking (9780698163843) | 98 | 558 | 5.69 | non fiction | 3 |
Dune: The Machine Crusade | 85 | 841 | 9.89 | science fiction | 3 |
Sapiens: A Brief History of Humankind | 81 | 455 | 5.62 | history | 5 |
Buying Time | 78 | 337 | 4.32 | science fiction | 4 |
Periodic Tales | 71 | 590 | 8.31 | science | 4 |
Consider the Lobster and Other Essays | 70 | 346 | 4.94 | non fiction | 5 |
The Third Plate: Field Notes on the Future of Food | 69 | 552 | 8.00 | food | 4 |
The Omnivore’s Dilemma: A Natural History of Four Meals | 64 | 491 | 7.67 | food | 4 |
The Korean War | 59 | 919 | 15.58 | war biography | 4 |
Dune: The Butlerian Jihad | 59 | 698 | 11.83 | science fiction | 3 |
The Trial | 57 | 215 | 3.77 | novel | 4 |
The Hydrogen Sonata | 56 | 604 | 10.79 | science fiction | 3 |
The Three-Body Problem (Remembrance of Earth’s Past) | 55 | 427 | 7.76 | science fiction | 4 |
21 Lessons for the 21st Century | 52 | 389 | 7.48 | non fiction | 5 |
The Forever War | 52 | 271 | 5.21 | science fiction | 5 |
The Secret Life of Groceries: The Dark Miracle of the American Supermarket | 50 | 705 | 14.10 | food | 4 |
Travels in the Interior Districts of Africa, 1795-7 | 48 | 254 | 5.29 | history | 4 |
In Search Of Schrodinger’s Cat | 45 | 309 | 6.87 | science | 4 |
The New York Trilogy | 44 | 352 | 8.00 | crime | 5 |
The Battle Of The Atlantic: The Allies’ Submarine Fight Against Hitler’s Gray Wolves Of The Sea | 44 | 337 | 7.66 | history | 4 |
Leviathan Wakes | 43 | 648 | 15.07 | science fiction | 2 |
A Memory Called Empire | 42 | 790 | 18.81 | science fiction | 3 |
An Edible History of Humanity | 41 | 257 | 6.27 | food | 4 |
Do No Harm Stories of Life, Death and Brain Surgery | 39 | 281 | 7.21 | medicine | 4 |
Extreme Ownership: How U.S. Navy SEALs Lead and Win | 36 | 297 | 8.25 | management | 5 |
Standing in Another Man’s Grave: A John Rebus Novel | 36 | 458 | 12.72 | crime | 3 |
Swallow This | 35 | 273 | 7.80 | food | 2 |
Packing for Mars | 33 | 313 | 9.48 | science | 3 |
Project Hail Mary | 31 | 854 | 27.55 | science fiction | 4 |
Command and Control: Nuclear Weapons, the Damascus Accident, and the Illusion of Safety | 31 | 958 | 30.90 | history | 3 |
In Defense of Food | 31 | 232 | 7.48 | food | 3 |
Land grabbing. Come il mercato delle terre crea il nuovo colonialismo (Indi) (Italian Edition) | 30 | 222 | 7.40 | food | 5 |
Solaris | 30 | 246 | 8.20 | science fiction | 4 |
The Naked Sun | 30 | 275 | 9.17 | science fiction | 4 |
F*** You Very Much: The surprising truth about why people are so rude | 28 | 313 | 11.18 | non fiction | 4 |
Return From The Stars | 28 | 300 | 10.71 | science fiction | 3 |
Rendezvous With Rama | 26 | 243 | 9.35 | science fiction | 5 |
The Illustrated Man | 25 | 291 | 11.64 | science fiction | 3 |
The Power of Habit: Why We Do What We Do in Life and Business | 25 | 382 | 15.28 | | 5 |
Reality Is Not What It Seems: The Journey to Quantum Gravity | 22 | 221 | 10.05 | science | 4 |
Slaughterhouse-Five (Kurt Vonnegut Series) | 21 | 180 | 8.57 | science fiction | 5 |
Sbornie sacre, sbornie profane: L’ubriachezza dal Vecchio al Nuovo Mondo (Intersezioni) (Italian Edition) | 20 | 156 | 7.80 | history | 4 |
Saints of the Shadow Bible | 20 | 478 | 23.90 | crime | 4 |
Trash | 19 | 136 | 7.16 | non fiction | 3 |
I signori del cibo. Viaggio nell’industria alimentare che sta distruggendo il pianeta (Italian Edition) | 18 | 284 | 15.78 | food | 5 |
The 5th Wave | 18 | 480 | 26.67 | science fiction | 4 |
The Martian: A Novel | 17 | 412 | 24.24 | science fiction | 5 |
The Man in the High Castle | 17 | 291 | 17.12 | science fiction | 4 |
Spillover. L’evoluzione delle epidemie (2014) | 16 | 610 | 38.12 | medicine | 4 |
Kitchen Confidential Paperback | 16 | 320 | 20.00 | food | 5 |
Tears of the Giraffe | 16 | 204 | 12.75 | fiction | 4 |
The Futurological Congress | 15 | 128 | 8.53 | science fiction | 4 |
A Briefer History of Time | 15 | 133 | 8.87 | science | 4 |
Salt Sugar Fat: How the Food Giants Hooked Us | 15 | 464 | 30.93 | food | 4 |
Okinawa: The Last Battle of World War II | 13 | 200 | 15.38 | history | 3 |
Six Easy Pieces | 12 | 164 | 13.67 | science | 5 |
Machines Like Me | 10 | 529 | 52.90 | science fiction | 4 |
L’orribile karma della formica | 10 | 436 | 43.60 | | 0 |
Pista nera | 9 | 249 | 27.67 | crime | 3 |
Il vecchio e il mare | 9 | 84 | 9.33 | novel | 4 |
Ender’s Game (The Ender Quintet) | 7 | 409 | 58.43 | science fiction | 5 |
We Are the Weather | 6 | 236 | 39.33 | food | 5 |
Micromégas | 6 | 25 | 4.17 | science fiction | 3 |
Worlds Apart: Worlds | 6 | 243 | 40.50 | science fiction | 3 |
Worlds | 6 | 262 | 43.67 | science fiction | 4 |
Deep Descent | 6 | 275 | 45.83 | sea | 5 |
Artemis. La prima città sulla luna (Italian Edition) | 5 | 387 | 77.40 | science fiction | 4 |
L’anima delle macchine: Tecnodestino, dipendenza tecnologica e uomo virtuale | 5 | 256 | 51.20 | science | 3 |
The Circle | 4 | 476 | 119.00 | fiction | 4 |
Making a Submarine Officer - A story of the USS San Francisco (SSN 711) | 3 | 295 | 98.33 | sea | 3 |
The No. 1 Ladies’ Detective Agency | 3 | 230 | 76.67 | humour | 4 |
Spaghetti robot. Il made in Italy che ci cambierà la vita | 2 | 222 | 111.00 | science | 3 |
Morte dei Marmi (Contromano) (Italian Edition) | 2 | 88 | 44.00 | humour | 3 |
Make Your Bed | 1 | 61 | 61.00 | management | 5 |
Mia nonna era un pesce | 0 | 40 | N/A | kids | 3 |
25 Things About Life | 0 | 26 | N/A | non fiction | 3 |
Sette brevi lezioni di fisica | 0 | 52 | N/A | science | 4 |
Who Moved My Cheese | 0 | 32 | N/A | management | 3 |
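The Days and Avg. Pages / Day columns rely on Date arithmetic: subtracting two Date objects yields a Rational number of days, which is why the code calls to_i before using it. A minimal sketch follows. The dates are hypothetical and were chosen to give 30 days; the page count matches the Solaris row above.

```ruby
require 'date'

# Hypothetical dates; in the real data they come from the "started" and
# "completed" fields of each book.
started   = Date.new(2021, 3, 1)
completed = Date.new(2021, 3, 31)
pages     = 246

elapsed = completed - started   # Date - Date returns a Rational number of days
days    = elapsed.to_i          # 30

# Guard against books started and finished on the same day (0 days),
# which would otherwise produce a division by zero.
rate = days.zero? ? "N/A" : format("%.2f", pages / days.to_f)

puts "#{days} days, #{rate} pages/day"
# 30 days, 8.20 pages/day
```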
The results can be summarized using different tools. Rather than diving into R and making the table above into a data frame, we use datamash and Gnuplot.
The command line utility datamash lets us perform basic operations on CSV files. Here we compute the fundamental statistics about column 2, that is, the number of days it takes to read a book:
echo "$bd" | datamash --header-out min 2 q1 2 median 2 q3 2 max 2 sstdev 2
min(field-2) | q1(field-2) | median(field-2) | q3(field-2) | max(field-2) | sstdev(field-2) |
---|---|---|---|---|---|
0 | 11 | 28 | 53.5 | 681 | 87.39278611202 |
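For readers who prefer to stay in Ruby, the same summary can be sketched by hand. Two assumptions are made here: quartiles are computed by linear interpolation between neighbouring ranks (applied to the full Days column above, this appears to reproduce the q1, median and q3 figures reported by datamash), and the standard deviation is the sample one, dividing by n - 1. The helper names are mine, and the input is only a handful of durations copied from the table, so the printed numbers differ from the datamash output.

```ruby
# Quantile with linear interpolation between neighbouring ranks
# (an assumption about how the quartiles are defined).
def quantile(values, q)
  sorted = values.sort
  pos = (sorted.size - 1) * q
  lower = sorted[pos.floor]
  upper = sorted[pos.ceil]
  lower + (upper - lower) * (pos - pos.floor)
end

# Sample standard deviation (divides by n - 1).
def sample_stddev(values)
  mean = values.sum.to_f / values.size
  Math.sqrt(values.sum { |v| (v - mean)**2 } / (values.size - 1))
end

# A few durations (in days) copied from the table above.
days = [681, 386, 203, 30, 30, 10, 1, 0]

summary = {
  min: days.min,
  q1: quantile(days, 0.25),
  median: quantile(days, 0.5),
  q3: quantile(days, 0.75),
  max: days.max,
  sstdev: sample_stddev(days)
}

summary.each { |name, value| puts "#{name}: #{value}" }
```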
Then we use Gnuplot to draw the same data with a boxplot:
    reset
    set terminal svg size 1200,800
    set grid
    set title "Books Duration"
    set style data boxplot
    set style boxplot
    set xtics ("Duration" 1)
    plot data using (1.0):2
Another interesting plot shows duration and length, to see whether there is any correlation between the two. In general we should expect longer books to take more time, but this is not necessarily the case.
    reset
    set terminal svg size 1600,1200
    set grid
    set title "Books Reading Duration and Length"
    set ylabel "Pages"
    set xlabel "Days"
    set xrange [0:150]
    set mxtics 10
    set grid mxtics mytics lc rgb("#AAAAAA")
    plot data using 2:3 with points pt 5 notitle, \
         '' using 2:($3+15):($1) with labels notitle
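The scatter plot gives a visual impression. A number can complement it, and one common choice (not something computed on this page) is the Pearson correlation coefficient between days and pages. The sketch below defines a hypothetical pearson helper and runs it on five rows copied from the table. That handful of rows is only there to make the snippet runnable, so the value it prints says nothing about the full data set.

```ruby
# Pearson correlation coefficient between two equally long numeric arrays.
def pearson(xs, ys)
  n = xs.size.to_f
  mean_x = xs.sum / n
  mean_y = ys.sum / n
  cov   = xs.zip(ys).sum { |x, y| (x - mean_x) * (y - mean_y) }
  var_x = xs.sum { |x| (x - mean_x)**2 }
  var_y = ys.sum { |y| (y - mean_y)**2 }
  cov / Math.sqrt(var_x * var_y)
end

# [days, pages] pairs copied from a few rows of the table above.
sample = [[203, 1969], [121, 1595], [59, 919], [30, 246], [6, 25]]
days, pages = sample.transpose

r = pearson(days, pages)
puts format("r = %.2f", r)
```

A value close to 1 or -1 would indicate a strong linear relationship, while a value close to 0 would indicate none.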
The last two plots are dedicated to understanding which genre I read in fewer calendar days. This is not necessarily a measure of the quality of the book, since more complex books might take more time to read but be more interesting than books read fast. On the other hand, it might indicate an increased interest in reading the book.
In general, the most natural structure for input data in Gnuplot is with each variable taking its own column. The boxplot command, however, can take a fourth argument, which is a reference to a categorical variable to use.
    reset
    set terminal svg size 1600,600
    set title "Books Reading Speed by Genre"
    set ylabel "Pages"
    set grid
    set nokey
    set style data boxplot
    set style boxplot
    set datafile missing "N/A"
    set style fill transparent solid 0.1
    plot data using (1.0):4:(0.5):5 lc variable
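The grouping that the boxplot performs on the genre column can also be sketched in Ruby, for instance to print the median reading speed per genre. The snippet assumes rows shaped as [genre, pages per day], with "N/A" marking missing values as in the table. The six rows are copied from the table above and the median helper is mine.

```ruby
# [genre, pages per day] pairs copied from a few rows of the table above;
# "N/A" marks books started and finished on the same day.
rows = [
  ["science fiction", "27.55"], ["science fiction", "8.20"],
  ["science fiction", "9.17"],  ["food", "7.80"],
  ["food", "7.48"],             ["kids", "N/A"]
]

def median(values)
  sorted = values.sort
  mid = sorted.size / 2
  sorted.size.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
end

# Drop missing values (the role played by 'set datafile missing' in the
# Gnuplot script), group by genre, and reduce each group to its median.
valid = rows.reject { |_genre, speed| speed == "N/A" }
speed_by_genre = valid
  .group_by { |genre, _speed| genre }
  .transform_values { |group| median(group.map { |_genre, speed| speed.to_f }) }

speed_by_genre.each { |genre, speed| puts format("%-16s %.2f", genre, speed) }
```

With only a few rows per genre the medians are not meaningful. The point is the reject, group_by and transform_values pipeline, which mirrors what the fourth column of the using clause asks Gnuplot to do.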