A literate programming tutorial using my book reading habits

This page started as a collection of statistics about my book reading habits, but it is also an example of the literate programming features Org Mode provides. More specifically, it shows how you can use Org Mode to pass data between source blocks written in different languages. This makes it possible, for instance, to generate data using Ruby, plot it in Gnuplot, analyze it with datamash or R, and export the resulting page to HTML.

Org Mode is not required to achieve the same result: we could write separate scripts and then use a build tool such as Make or Rake to glue the code together or, possibly, write a monolithic script in Ruby which uses system calls to invoke other tools. Org Mode, however, lets us glue different pieces of code together in a more natural way.

You can also view lists of books by year computed from the data.

Getting data about my reading habits

I keep data about my reading habits in Calibre, where I defined a couple of metadata fields which allow me to record the start and end date of each book I read. Other information is read directly from Calibre, which keeps title, authors, and genre, and has a plugin to compute the number of pages based on the book length.

Calibre has a function to export all data as CSV. I have a function which converts it into a YAML file, which is the input file for this page. The following code reads the YAML file with the books I read and defines a couple of functions which will be useful later.

The first line declares various execution parameters. The code is written in Ruby and executed in a :session, so that values are persisted and shared among all other blocks written in the same language. We do not care about outputting the results of the code evaluation in the buffer, hence we can declare :results none.

require 'yaml'
require 'date'
books  = YAML::load_file "../_data/bookstream.yml",
                         permitted_classes: [Date]

# group an array of hashes according to the values of a key and add
# some data, such as the number of books belonging to each group and
# the total number of pages
def group_and_array(key, books)
  grouped = books.group_by { |x| x[key] }
  grouped.map { |k, v| 
    [ k, v.size, v.map { |p| p["pages"] }.inject(&:+) ] 
  }
end

# convert an aoh (= array of hashes) into an array of arrays of strings.
# assume all entries have the same keys, although not necessarily in
# the same order (do not expect Hash.values to return the values in
# the same order among all entries of the array)
def aoh_to_a aoh
  keys = aoh[0].keys
  array = []
  aoh.each do |entry|
    array << keys.map { |key| entry[key].to_s } 
  end
  array
end

The books variable contains data which we can use to compute various statistics.
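To make the two helpers more concrete, here is a minimal sketch that runs them on a tiny hand-made sample. The `sample` array and its three entries are invented for illustration; only the keys (`title`, `category`, `pages`) mirror the ones used on this page. The helper definitions are repeated so that the sketch runs on its own.

```ruby
# Hypothetical sample data: three made-up entries shaped like the YAML records.
sample = [
  { "title" => "Book A", "category" => "crime",           "pages" => 100 },
  { "title" => "Book B", "category" => "science_fiction", "pages" => 250 },
  { "title" => "Book C", "category" => "crime",           "pages" => 300 }
]

# Same logic as the helpers defined above, repeated to keep the sketch standalone.
def group_and_array(key, books)
  grouped = books.group_by { |x| x[key] }
  grouped.map { |k, v| [k, v.size, v.map { |p| p["pages"] }.inject(&:+)] }
end

def aoh_to_a(aoh)
  keys = aoh[0].keys
  aoh.map { |entry| keys.map { |key| entry[key].to_s } }
end

# One row per genre: [genre, number of books, total pages]
by_genre = group_and_array("category", sample)

# One row per book, with every value converted to a string
rows = aoh_to_a(sample)

p by_genre
p rows
```

Since `group_by` keeps the order in which keys are first seen, the rows of `by_genre` follow the order of the input; any sorting has to be done afterwards.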

What genres do I like more?

Simple answer: science fiction and crime. Is it true, however? To answer this question we use the group_and_array function we defined above, which groups book data according to a field, computes the number of books and the total number of pages per category, and returns an array of arrays.

The following piece of code, thus, groups books by genre and returns a table, which can be nicely shown by Org Mode. Notice that we attach a header ([ ["Genre", "Books", "Pages"] ]) and a separation line [nil]. I learned the trick about the separation line here. It allows us to output column headers.

Notice that we export both the code and the results of the execution, and that we ask Org Mode to show, as results, the value returned from executing the code, that is, an array of arrays in our case. A nice explanation of the values :results can assume is available in the Org Mode Manual.

header = [["Genre", "Books", "Pages"]] + [nil] 

body = group_and_array("category", books)
# remove books with no genre
body = body.select { |x| x[0] != "" }
# fix Genre string to improve output
body = body.map { |x| [x[0].gsub("_", " ").capitalize, x[1], x[2]] }
# sort it by frequency
body = body.sort { |x, y| x[1] <=> y[1] }.reverse

header + body

Genre	Books	Pages
Science fiction	65	25154
Crime	28	11947
Novel	15	2313
Non fiction	12	2545
History	12	6819
Science	11	2623
Food	11	4036
Management	9	1447
War biography	8	5091
Humour	8	860
Fiction	8	1987
Economics	4	1260
Sea	4	1007
History of science	4	1184
Tragedy	3	558
Comedy	3	637
Medicine	2	891
Biography	2	553
Kids	1	40

The number of pages is computed by a Calibre plugin whose results I cannot check and which could not run on some of my entries, since the electronic version of the book was missing.
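In the tables on this page, books without a page count show up with 0 pages. If an export left the field empty (nil) instead, summing with `inject(&:+)` would raise an error. The following is a minimal sketch of a more tolerant variant; the name `group_and_array_safe` and the sample entries are hypothetical and are not part of the code used to build this page.

```ruby
# Hypothetical variant of group_and_array that tolerates missing page counts.
# nil.to_i is 0, so entries without a page count simply add nothing to the sum.
def group_and_array_safe(key, books)
  books.group_by { |x| x[key] }.map do |k, v|
    [k, v.size, v.sum { |b| b["pages"].to_i }]
  end
end

# Made-up entries: the second one has no page count.
sample = [
  { "category" => "sea",  "pages" => 275 },
  { "category" => "sea",  "pages" => nil },
  { "category" => "kids", "pages" => 40 }
]

totals = group_and_array_safe("category", sample)
p totals
```

The book without pages still counts towards the number of books in its genre, which matches what the tables above do with 0-page entries.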

How many books do I read in a year?

More precisely: of the books I read, how many books did I start reading in a given year?

Once again, we can use the group_and_array function, grouping by year. This lets us show a table, which we enrich with a textual barplot, built using "-" * N, a Ruby construct to build a string of N repetitions of the given string.

header = [["Year", "Books", "Pages", "Avg. Pages/Day", "Plot" ]] + [nil]

body = group_and_array("started_year", books)
# remove books with no year
body = body.select { |x| x[0] }
# add some stats
body = body.map { |x| [ x[0], x[1], x[2], x[2] / 365, "-" * x[1] ] }
# sort
body = body.sort { |x, y| x[0] <=> y[0] }

header + body

Year	Books	Pages	Avg. Pages/Day	Plot
2012	8	2287	6	--------
2013	8	3646	9	--------
2014	13	3921	10	-------------
2015	19	5489	15	-------------------
2016	5	2295	6	-----
2017	6	1472	4	------
2018	9	3507	9	---------
2019	4	2064	5	----
2020	6	3695	10	------
2021	4	3498	9	----
2022	8	4045	11	--------
2023	7	3331	9	-------
2024	2	460	1	--
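Two details of the table are worth spelling out: the average is an integer division, so it is truncated, and the bar is a string repeated once per book. A small sketch using the 2012 row; the variable names are chosen here for illustration only.

```ruby
pages = 2287   # pages for 2012, taken from the table above
count = 8      # books started in 2012

int_avg   = pages / 365               # Integer / Integer truncates the result
float_avg = (pages / 365.0).round(2)  # dividing by a Float keeps the decimals
bar       = "-" * count               # String#* repeats the string count times

puts int_avg
puts float_avg
puts bar
```

Dividing by 365.0 instead of 365 would be enough to show decimals in the "Avg. Pages/Day" column, should more precision be desirable.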

We can plot the data using Gnuplot, passing as input the data of the table built by Ruby. This is achieved by the following source block, which takes as input the table above, through the :var barplot = books-per-year declaration. Notice that we also need to give a name to the table, with the #+NAME: books-per-year declaration.

The reset command in the Gnuplot script is rather useful, as it ensures that all settings are reset to their default values. Without it, Gnuplot would reuse any setting defined in previous blocks in this buffer.

reset 

set boxwidth 0.5
set grid ytics linestyle 0
set style fill solid 0.20 border 

set terminal svg size 1200,800 font 'Arial,10'

set title "Books Read"
set xlabel "Year"
set ylabel "Number of Books"

plot barplot using 1:2:xtic(1) with boxes lc rgb "#0045FF" title "Books read", \
     barplot using 1:($2+0.25):2 with labels title ""

How long does it take me to read a book?

The next question is how long it takes me to read a book, in calendar days. Notice that this is different from the actual days spent reading, since calendar days do not measure effort. In some cases I stopped reading a book and got back to finish it when I was in the mood again. In other cases I would read two books in parallel, even though this is something I did more often when I was younger. The table also shows genre and rating, although I have not been very consistent in rating all the books I read.

Notice that here we use a slightly different notation for naming the output: we assign the name to the source block, rather than to its output. The effect is the same.

# find the books which I started and ended
read = books.select { |x| x["started"] and x["completed"] }

header = [
  ["Title", "Days", "Pages", "Avg. Pages / Day", "Genre", "Rating"]
] + [nil] 
body2 = read.map { |x|
  days = (x["completed"] - x["started"] ).to_i;
  [ x["title"],
    days, x["pages"],
    days != 0 ? ("%.2f" % (x["pages"] / days.to_f)) : "N/A",
    x["category"].gsub("_", " "),
    x["my_rating"] ] 
}
body2 = body2.sort { |x, y| x[1] <=> y[1] }.reverse

header + body2

Title	Days	Pages	Avg. Pages / Day	Genre	Rating
Invisible Planets	681	388	0.57	science fiction	4
Even Dogs in the Wild	386	461	1.19	crime	4
Watchmen	371	0	0.00		0
Inferno	204	1969	9.65	war biography	5
Wild Swans: Three Daughters of China	181	714	3.94	history	5
The Birth of Plenty: How the Prosperity of the Modern World was Created	144	533	3.70	history	5
A History of the World	122	841	6.89	history	5
The Gulag Archipelago	122	553	4.53	biography	4
The Stand	121	1595	13.18	science fiction	4
The elegant universe: superstrings, hidden dimensions, and the quest for the ultimate theory	118	854	7.24		5
Apollo	108	608	5.63	history of science	5
How Not to Be Wrong : The Power of Mathematical Thinking (9780698163843)	98	558	5.69	non fiction	3
Code Warriors: NSA’s Codebreakers and the Secret Intelligence War Against the Soviet Union	89	497	5.58		4
Dune: The Machine Crusade	85	841	9.89	science fiction	3
Sapiens: A Brief History of Humankind	81	455	5.62	history	5
Buying Time	78	337	4.32	science fiction	4
Periodic Tales	71	590	8.31	science	4
Consider the Lobster and Other Essays	70	346	4.94	non fiction	5
The Nutmeg’s Curse	69	703	10.19		3
The Third Plate: Field Notes on the Future of Food	69	552	8.00	food	4
The Omnivore’s Dilemma: A Natural History of Four Meals	64	491	7.67	food	4
The Korean War	59	919	15.58	war biography	4
Dune: The Butlerian Jihad	59	698	11.83	science fiction	3
The Trial	57	215	3.77	novel	4
The Hydrogen Sonata	56	604	10.79	science fiction	3
The Three-Body Problem (Remembrance of Earth’s Past)	55	427	7.76	science fiction	4
Big Bang	52	916	17.62		5
21 Lessons for the 21st Century	52	389	7.48	non fiction	5
The Forever War	52	271	5.21	science fiction	5
The Secret Life of Groceries: The Dark Miracle of the American Supermarket	50	705	14.10	food	4
A Brief History Of Time	48	410	8.54		4
Travels in the Interior Districts of Africa, 1795-7	48	254	5.29	history	4
In Search Of Schrodinger’s Cat	45	309	6.87	science	4
The New York Trilogy	44	352	8.00	crime	5
The Battle Of The Atlantic: The Allies’ Submarine Fight Against Hitler’s Gray Wolves Of The Sea	44	337	7.66	history	4
Leviathan Wakes	43	648	15.07	science fiction	2
A Memory Called Empire	42	790	18.81	science fiction	3
An Edible History of Humanity	41	257	6.27	food	4
Do No Harm Stories of Life, Death and Brain Surgery	39	281	7.21	medicine	4
Extreme Ownership: How U.S. Navy SEALs Lead and Win	36	297	8.25	management	5
Standing in Another Man’s Grave: A John Rebus Novel	36	458	12.72	crime	3
Swallow This	35	273	7.80	food	2
Packing for Mars	33	313	9.48	science	3
Project Hail Mary	31	854	27.55	science fiction	4
Command and Control: Nuclear Weapons, the Damascus Accident, and the Illusion of Safety	31	958	30.90	history	3
In Defense of Food	31	232	7.48	food	3
La fisica del diavolo (Italian Edition)	30	434	14.47		0
Land grabbing. Come il mercato delle terre crea il nuovo colonialismo (Indi) (Italian Edition)	30	222	7.40	food	5
Solaris	30	246	8.20	science fiction	4
The Naked Sun	30	275	9.17	science fiction	4
F*** You Very Much: The surprising truth about why people are so rude	28	313	11.18	non fiction	4
Return From The Stars	28	300	10.71	science fiction	3
Rendezvous With Rama	26	243	9.35	science fiction	5
The Illustrated Man	25	291	11.64	science fiction	3
The Power of Habit: Why We Do What We Do in Life and Business	25	382	15.28		5
Reality Is Not What It Seems: The Journey to Quantum Gravity	22	221	10.05	science	4
Slaughterhouse-Five (Kurt Vonnegut Series)	21	180	8.57	science fiction	5
Sbornie sacre, sbornie profane: L’ubriachezza dal Vecchio al Nuovo Mondo (Intersezioni) (Italian Edition)	20	156	7.80	history	4
Saints of the Shadow Bible	20	478	23.90	crime	4
Trash	19	136	7.16	non fiction	3
I signori del cibo. Viaggio nell’industria alimentare che sta distruggendo il pianeta (Italian Edition)	18	284	15.78	food	5
The 5th Wave	18	480	26.67	science fiction	4
The Martian: A Novel	17	412	24.24	science fiction	5
The Man in the High Castle	17	291	17.12	science fiction	4
Il paradiso maoista	16	460	28.75		4
Spillover. L’evoluzione delle epidemie (2014)	16	610	38.12	medicine	4
Kitchen Confidential Paperback	16	320	20.00	food	5
Tears of the Giraffe	16	204	12.75	fiction	4
The Futurological Congress	15	128	8.53	science fiction	4
A Briefer History of Time	15	133	8.87	science	4
Salt Sugar Fat: How the Food Giants Hooked Us	15	464	30.93	food	4
Denominazione di origine inventata: Le bugie del marketing sui prodotti tipici italiani (Italian Edition)	14	328	23.43		3
Okinawa: The Last Battle of World War II	13	200	15.38	history	3
Six Easy Pieces	11	164	14.91	science	5
Machines Like Me	10	529	52.90	science fiction	4
L’orribile karma della formica	10	436	43.60		0
Pista nera	9	249	27.67	crime	3
Il vecchio e il mare	9	84	9.33	novel	4
Ender’s Game (The Ender Quintet)	7	409	58.43	science fiction	5
We Are the Weather	6	236	39.33	food	5
Micromégas	6	25	4.17	science fiction	3
Worlds Apart: Worlds	6	243	40.50	science fiction	3
Worlds	6	262	43.67	science fiction	4
Deep Descent	6	275	45.83	sea	5
Artemis. La prima città sulla luna (Italian Edition)	5	387	77.40	science fiction	4
L’anima delle macchine: Tecnodestino, dipendenza tecnologica e uomo virtuale	5	256	51.20	science	3
The Circle	5	476	95.20	fiction	4
2001 - A Space Odyssey	4	389	97.25		0
Making a Submarine Officer - A story of the USS San Francisco (SSN 711)	3	295	98.33	sea	3
The No. 1 Ladies’ Detective Agency	3	230	76.67	humour	4
Alien Disgelo	2	0	0.00		0
Spaghetti robot. Il made in Italy che ci cambierà la vita	2	222	111.00	science	3
Morte dei Marmi (Contromano) (Italian Edition)	2	88	44.00	humour	3
Alien Volume 3 Icarus	1	0	0.00		3
Make Your Bed	1	61	61.00	management	5
Mia nonna era un pesce	0	40	N/A	kids	3
25 Things About Life	0	26	N/A	non fiction	3
Sette brevi lezioni di fisica	0	52	N/A	science	4
Who Moved My Cheese	0	32	N/A	management	3
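The computation behind the "Days" and "Avg. Pages / Day" columns can be sketched in isolation. Subtracting two Date objects returns a Rational number of days, which is why the code calls to_i, and the guard on zero days is what produces the "N/A" entries at the bottom of the table. The book below is made up for illustration.

```ruby
require 'date'

# Made-up book: the dates and the page count are illustrative only.
book = { "started"   => Date.new(2020, 1, 1),
         "completed" => Date.new(2020, 1, 31),
         "pages"     => 300 }

diff = book["completed"] - book["started"]   # Date - Date returns a Rational
days = diff.to_i
speed = days != 0 ? ("%.2f" % (book["pages"] / days.to_f)) : "N/A"

# A book started and completed on the same day: the division is skipped.
same_day = 0
guarded = same_day != 0 ? ("%.2f" % (26 / same_day.to_f)) : "N/A"

puts days
puts speed
puts guarded
```

Note that the formatted speed is a string, so any later numeric processing has to convert it back and skip the "N/A" values.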

The results can be summarized using different tools. Rather than diving into R and making the table above into a data frame, we use datamash and Gnuplot.

The command line utility datamash performs basic operations on tabular text files, such as tab-separated or CSV data. Here we compute the fundamental statistics about column 2, that is, the number of days it takes to read a book:

echo "$bd" | datamash --header-out min 2 q1 2 median 2 q3 2 max 2 sstdev 2

min(field-2)	q1(field-2)	median(field-2)	q3(field-2)	max(field-2)	sstdev(field-2)
0	10.5	30	55.5	681	88.987085010699
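For readers who want to see what these statistics mean, here is a rough Ruby sketch of the same summary. It assumes quartiles are computed by linear interpolation between the two nearest ranks; I have not verified that this is exactly the convention datamash uses, so small differences are possible. The functions `quantile` and `sstdev` and the sample durations are written for this example only.

```ruby
# Quantile by linear interpolation between the two nearest ranks.
# Assumption: this matches the convention used by datamash (not verified here).
def quantile(values, p)
  sorted = values.sort
  pos = (sorted.size - 1) * p
  lo = pos.floor
  hi = pos.ceil
  sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo)
end

# Sample standard deviation (divides by n - 1).
def sstdev(values)
  mean = values.sum.to_f / values.size
  Math.sqrt(values.sum { |v| (v - mean)**2 } / (values.size - 1))
end

days = [1, 2, 3, 4, 5]  # made-up durations, not the data of this page

summary = {
  min:    days.min,
  q1:     quantile(days, 0.25),
  median: quantile(days, 0.5),
  q3:     quantile(days, 0.75),
  max:    days.max,
  sstdev: sstdev(days)
}
p summary
```

On the real data the gap between the median (30 days) and the maximum (681 days) shows how strongly a few long-running books skew the distribution.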

Then we use Gnuplot to draw the same data with a boxplot:

reset 

set terminal svg size 1200,800
set grid
set title "Books Duration"
set style data boxplot
set style boxplot
set xtics ("Duration" 1)

plot data using (1.0):2

Another interesting plot shows duration and length, to see whether there is any correlation between the two. In general we should expect longer books to take more time, but this is not necessarily the case.

reset 

set terminal svg size 1600,1200
set grid
set title "Books Reading Duration and Length"
set ylabel "Pages"

set xlabel "Days"
set xrange [0:150]
set mxtics 10

set grid mxtics mytics lc rgb("#AAAAAA")

plot data using 2:3 with points pt 5 notitle, \
     '' using 2:($3+15):($1) with labels notitle
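The plot gives a visual impression; a single number that summarizes the relation is the Pearson correlation coefficient. The sketch below is not part of the original analysis: `pearson` is a helper written for this example and the (days, pages) pairs are made up, so the result says nothing about my actual reading data.

```ruby
# Pearson correlation coefficient between two equally long numeric arrays.
def pearson(xs, ys)
  n = xs.size.to_f
  mean_x = xs.sum / n
  mean_y = ys.sum / n
  cov   = xs.zip(ys).sum { |x, y| (x - mean_x) * (y - mean_y) }
  var_x = xs.sum { |x| (x - mean_x)**2 }
  var_y = ys.sum { |y| (y - mean_y)**2 }
  cov / Math.sqrt(var_x * var_y)
end

# Made-up (days, pages) pairs, not taken from the table above.
days  = [10, 20, 30, 40]
pages = [120, 180, 310, 390]

r = pearson(days, pages)
puts r.round(3)
```

A value close to 1 would mean that longer books consistently take more days, a value close to 0 that length says little about duration.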

The last two plots are dedicated to understanding which genres I read in fewer calendar days. This is not necessarily a measure of the quality of the book, since more complex books might take more time to read but be more interesting than books read fast. On the other hand, it might indicate an increased interest in reading the book.

In general, the most natural structure for input data in Gnuplot is with each variable taking its own column. The boxplot command, however, can take a fourth argument, which is a reference to a categorical variable to use.

reset 

set terminal svg size 1600,600

set title "Books Reading Speed by Genre"

set ylabel "Pages / Day"
set grid
set nokey
set style data boxplot
set style boxplot
set datafile missing "N/A"
set style fill transparent solid 0.1

plot data using (1.0):4:(0.5):5 lc variable
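The grouping Gnuplot performs with the categorical column can also be mimicked in Ruby, which is handy for checking the plot against numbers. This is a minimal sketch on made-up rows that follow the column order of the table above; the `median` helper is defined here for the example, and skipping "N/A" plays the same role as the set datafile missing line.

```ruby
# Made-up rows with the same column order as the table above:
# title, days, pages, average pages per day, genre, rating.
rows = [
  ["Book A", 10, 100, "10.00", "crime", 4],
  ["Book B", 20, 100, "5.00",  "crime", 3],
  ["Book C",  0,  40, "N/A",   "kids",  3],
  ["Book D",  5, 150, "30.00", "science fiction", 5]
]

# Median of a list of numbers (average of the two middle values if even).
def median(values)
  sorted = values.sort
  mid = sorted.size / 2
  sorted.size.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
end

# Drop rows without a speed, group by genre, take the median speed per genre.
valid = rows.reject { |r| r[3] == "N/A" }
speed_by_genre = valid
  .group_by { |r| r[4] }
  .transform_values { |rs| median(rs.map { |r| r[3].to_f }) }

p speed_by_genre
```

The median per genre corresponds to the line in the middle of each box in the boxplot.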