Making a book recommendation service from scratch.
I read about 30–40 books a year. Not to be morbid, but this equates to only reading about 3,000 books in my lifetime. Since there are millions of books that exist, it means that I can only consume a fraction of a fraction of a percent of them. This exceptionally limited scope of my eventual reading repertoire is my motivation to find a systematic method for acquiring good book suggestions.
I’ve noticed that there are five major problems that prevent me from continuing to use the major services like Goodreads and Amazon for book recommendations.
One of the best approaches to finding a new book to read is to look for books with a similar genre. The genre of Weaveworld is classified as “dark fantasy horror” according to Amazon, which is a good approximation of the book. And since this is a pretty specific genre it should provide good recommendations that are similar, right? Unfortunately, no.
On Amazon I can look up the top sellers for this genre, “dark fantasy horror,” and try to find a book from those. Here is what I see, looking at the top six sellers on this particular day from Amazon:
If the titles and the book covers don’t give it away (although you should not judge a book by its cover), three of the six top sellers for the “dark fantasy horror” genre have descriptions that contain phrases like “sexy”, “adult urban” or “a budding romance.” I do not judge people who like these books, but I would argue that these are in the wrong category. Even though these might have elements of “dark fantasy horror”, it seems that their main themes are adult. (You are probably thinking, this can be solved using multiple genres. You are right, read on!)
Basically the top sellers from Amazon for the genre “dark fantasy horror” contain little, if any, actual “dark fantasy horror” and it’s a waste of time to go browsing through them.
Instead of looking at top sellers, one can also simply search the genre. This will run into another problem. Here are three sequential results on the first page of searching for “dark fantasy horror” from Amazon:
Of course, one of these has “steamy sex” as a description (Problem #1), but even more problematic is that all three of these books are sequels. I would never recommend reading a sequel before reading the first book, with the exception of a few books (The Book of Merlyn, for example, is a great one to read no matter what). Currently, though, you have to trudge through a bunch of sequels because there is no way to filter them out on Amazon or Goodreads.
If we do a similar search on Goodreads, we will find another problem. Here are the top results for searching the genre “dark horror fantasy”:
These search results are also problematic in terms of providing good recommendations as four of these six books have the same author! The number of distinct authors increases very slowly as you go through the search results from Goodreads. This is not helpful because I believe you should be recommending a new author just as much as a new book. I don’t need a recommendation service to just go look up books from the same author.
I enjoy when friends give me recommendations since I have a good idea of whether (or not) we share taste in books and I can easily decide to trust them (or not). I don’t enjoy having recommendations from random people based on their web history, since you never know if someone just read some book and secretly didn’t like it but told the Internet they did.
For some special people these types of recommendations might work (e.g. if you are just starting out on a genre). But usually these types of results are derivative. Amazon will show me books from “Customers who bought this item” and Goodreads will show me “members who liked X also liked” but the results are similar. For example, here are Amazon’s:
The first two books are possibly good recommendations, until you see that both of those books are among the most popular books in the genre of all time. So if you are new to the genre they are good, but I am not, so of course I have read those before. The other books are both from the author of Weaveworld, so I do not want those suggestions (Problem #3).
I want a new recommendation from a different author.
You can certainly find a good book suggestion from Goodreads or Amazon, but it might take a long time. In fact, it seems that these sites are designed to keep you looking, because they have an incentive to keep you from getting complete information quickly. Because of this, I have given up on Goodreads and Amazon. There are other services, like whatshouldireadnext.com, but this one hardly provides similar recommendations for Weaveworld by Clive Barker, as it suggested a book about LA ghost stories and a book about Cambodian war crimes.
In light of this, I propose a simple solution.
To make a book recommendation service that actually works, I would avoid Problems #1–3 using simple filters. It would avoid Problem #5 by being quick and simple: no sign-ups, no ads, no hard-to-navigate user interface. And it would avoid Problem #4 by using probabilistic genres.
What I call a probabilistic genre could be called many things: the book’s genome, the hierarchical classification, etc. The basic idea is that each book has multiple genres with different weights or emphasis.
The probabilistic genres can be calculated using the Goodreads API. For example, Weaveworld has the following set of genres according to the Goodreads API:
<shelf name="to-read" count="6532"/>
<shelf name="fantasy" count="1017"/>
<shelf name="currently-reading" count="711"/>
<shelf name="horror" count="684"/>
<shelf name="fiction" count="280"/>
<shelf name="favorites" count="255"/>
<shelf name="owned" count="104"/>
<shelf name="clive-barker" count="97"/>
<shelf name="dark-fantasy" count="62"/>
<shelf name="default" count="61"/>
<shelf name="books-i-own" count="60"/>
<shelf name="sci-fi-fantasy" count="46"/>
<shelf name="science-fiction" count="34"/>
<shelf name="sci-fi" count="31"/>
While Weaveworld is defined by Amazon and Goodreads to be “dark fantasy horror”, according to these shelves (and ignoring non-genres) it is actually mostly fantasy (1,017 counts), horror (684 counts), with a little science fiction (~100 counts). It might be better classified as “fantasy horror fiction sci-fi.”
Just for comparison, to see the benefit of these probabilistic genres, recall the very first recommendation from Amazon in the genre “dark fantasy horror”: Agent of Enchantment by C.N. Crawford. Here are its top genres from the Goodreads API:
<shelf name="urban-fantasy" count="8"/>
<shelf name="fantasy" count="6"/>
<shelf name="paranormal" count="4"/>
<shelf name="magic" count="3"/>
<shelf name="fae" count="3"/>
<shelf name="favorites" count="3"/>
<shelf name="maybe" count="3"/>
<shelf name="netgalley" count="3"/>
<shelf name="mystery" count="2"/>
<shelf name="fairies" count="2"/>
<shelf name="adult" count="2"/>
<shelf name="romance" count="2"/>
Clearly, Agent of Enchantment would be better classified as “urban fantasy paranormal mystery adult romance” and it is definitely not similar to the “fantasy horror fiction sci-fi” of Weaveworld. In fact, it’s not even close to the genre in question, “dark fantasy horror”, because “horror” doesn’t even show up.
Anyways, I digress. These genres can be determined in a more quantified way by calculating the percentage of each genre (ignoring non-genres) so Weaveworld might be approximated as “45% fantasy, 25% horror, 15% fiction, 6% dark fantasy, 5% science fiction and 4% sci-fi fantasy”, based solely on the counts. To find a recommendation then, I just need to go through books with similar genres and calculate a metric that determines the “distance” between the two books. Once I have a list of distances, I can sort them and find the closest distances and perform the filtering I want (no similar authors, no sequels).
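As a rough illustration of the percentage calculation, here is a minimal Go sketch that turns shelf counts into genre weights. The function name `GenreWeights` and the `nonGenres` list are my own assumptions for this example, not anything provided by the Goodreads API; which shelves count as "non-genres" is a judgment call.

```go
package main

import (
	"fmt"
	"sort"
)

// nonGenres lists shelf names that describe reading status, ownership, or
// authorship rather than genre. This list is a hypothetical example.
var nonGenres = map[string]bool{
	"to-read": true, "currently-reading": true, "favorites": true,
	"owned": true, "default": true, "books-i-own": true,
	"clive-barker": true, "maybe": true, "netgalley": true,
}

// GenreWeights converts raw shelf counts into fractions that sum to 1,
// skipping every shelf listed in nonGenres.
func GenreWeights(shelves map[string]int) map[string]float64 {
	total := 0
	for name, count := range shelves {
		if !nonGenres[name] {
			total += count
		}
	}
	weights := make(map[string]float64)
	if total == 0 {
		return weights // no genre shelves at all
	}
	for name, count := range shelves {
		if !nonGenres[name] {
			weights[name] = float64(count) / float64(total)
		}
	}
	return weights
}

func main() {
	// Shelf counts for Weaveworld as quoted above.
	weaveworld := map[string]int{
		"to-read": 6532, "fantasy": 1017, "currently-reading": 711,
		"horror": 684, "fiction": 280, "favorites": 255, "owned": 104,
		"clive-barker": 97, "dark-fantasy": 62, "default": 61,
		"books-i-own": 60, "sci-fi-fantasy": 46, "science-fiction": 34,
		"sci-fi": 31,
	}
	weights := GenreWeights(weaveworld)

	// Print genres from heaviest to lightest.
	names := make([]string, 0, len(weights))
	for name := range weights {
		names = append(names, name)
	}
	sort.Slice(names, func(i, j int) bool {
		return weights[names[i]] > weights[names[j]]
	})
	for _, name := range names {
		fmt.Printf("%-16s %4.1f%%\n", name, 100*weights[name])
	}
}
```

The exact percentages depend on which shelves are treated as genres and whether overlapping shelves (such as the three science fiction variants) are merged, so the output will not necessarily match the rough figures quoted above.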
Now that I had my basic plan of attack, I decided to implement the ideas as a simple single-page app (SPA). I found a cheap domain name: booksuggestions.ninja and made a cute logo and set to work.
I first collected the data for books from the Goodreads API and the Google Books API. This took several weeks.
Realistic ratings from Google Books were determined using a Bayesian average which accounted for the average rating count and value in that genre. I specifically used genre-specific ratings because I have seen on Amazon how all the “romantic” novels have absurdly high ratings, which suggests that certain books may self-select groups of people who are predisposed to rating things a common way.
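For readers unfamiliar with the idea, here is a minimal Go sketch of a Bayesian average. It uses the common textbook form, in which the genre's average rating acts as a prior that is weighted by the genre's average rating count; the function and parameter names are assumptions for illustration, and the site's actual formula may differ.

```go
package main

import "fmt"

// BayesianAverage pulls a book's raw rating toward the genre mean.
// rating and count are the book's own average rating and number of ratings;
// genreMeanRating and genreMeanCount are the averages across the genre.
// Books with few ratings land close to the genre mean, while books with
// many ratings keep something close to their own average.
func BayesianAverage(rating float64, count int, genreMeanRating, genreMeanCount float64) float64 {
	n := float64(count)
	if n+genreMeanCount == 0 {
		return 0 // nothing to average over
	}
	return (genreMeanCount*genreMeanRating + n*rating) / (genreMeanCount + n)
}

func main() {
	// Hypothetical numbers: a 4.9-star book with only 12 ratings in a genre
	// where books average 3.8 stars over 250 ratings.
	fmt.Printf("%.2f\n", BayesianAverage(4.9, 12, 3.8, 250))
}
```

The practical effect is that a handful of enthusiastic ratings cannot push an obscure book above a well-rated book with thousands of ratings in the same genre.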
I wrote a program in Go to process the similarities. This program is pretty simple: it loads all the books into a giant map, and then for each book it traverses the map and calculates the distance from it to every other book. The distance is computed as the sum, over genres, of the absolute difference between the two books’ weights for the same genre. Once the distances are computed, it sorts the list and outputs a file containing its matches. Simplicity does not equate to speed in this case, as the matching takes about 24 days to calculate. Luckily I can use Go worker pools, so it only takes 3 days.
I wish I had checked for bugs before running this, because after running it I found I had to start over and wait another 3 days to calculate the similarities a second time.
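The brute-force matching step described above can be sketched in Go roughly as follows. This is not the original program: the `Book` and `Match` types and the function names are hypothetical, and I assume that a genre missing from one book counts as weight 0, so the distance is summed over the union of both books' genres.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"sync"
)

// Book holds a title and its genre weights (fractions that sum to about 1).
type Book struct {
	Title  string
	Genres map[string]float64
}

// Match is one candidate recommendation with its distance to the query book.
type Match struct {
	Title    string
	Distance float64
}

// Distance sums the absolute differences of genre weights.
// Assumption: a genre absent from one book is treated as weight 0.
func Distance(a, b map[string]float64) float64 {
	d := 0.0
	for genre, wa := range a {
		d += math.Abs(wa - b[genre])
	}
	for genre, wb := range b {
		if _, seen := a[genre]; !seen {
			d += math.Abs(wb)
		}
	}
	return d
}

// Closest returns every other book sorted by increasing distance to books[i].
func Closest(books []Book, i int) []Match {
	matches := make([]Match, 0, len(books))
	for j, other := range books {
		if j == i {
			continue
		}
		matches = append(matches, Match{other.Title, Distance(books[i].Genres, other.Genres)})
	}
	sort.Slice(matches, func(x, y int) bool {
		if matches[x].Distance != matches[y].Distance {
			return matches[x].Distance < matches[y].Distance
		}
		return matches[x].Title < matches[y].Title // deterministic tie-break
	})
	return matches
}

// AllMatches runs Closest for every book using a fixed pool of workers.
func AllMatches(books []Book, workers int) [][]Match {
	if workers < 1 {
		workers = 1
	}
	results := make([][]Match, len(books))
	jobs := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				results[i] = Closest(books, i) // each index is written by one worker only
			}
		}()
	}
	for i := range books {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return results
}

func main() {
	books := []Book{
		{"A", map[string]float64{"fantasy": 0.5, "horror": 0.5}},
		{"B", map[string]float64{"fantasy": 0.5, "romance": 0.5}},
		{"C", map[string]float64{"fantasy": 0.5, "horror": 0.25, "fiction": 0.25}},
	}
	for i, matches := range AllMatches(books, 2) {
		fmt.Println(books[i].Title, "->", matches)
	}
}
```

Because every book is compared against every other book, the work grows with the square of the number of books, which is why the run takes days at this scale. The worker pool only divides that work across cores. Filters such as dropping books by the same author would be applied to the sorted list afterwards.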
After determining the book-to-book similarity, I wrote another Go program to generate a SQL file for a database. The SQL file is then loaded into the database with a single shell command, which I timed as a cool trick to see how long it takes to push the data into the DB. Of course, this could have been done in Go as well but I’m lazy.
Speaking of laziness, I then wrote a Python script to generate a bash script that would generate text files that contain entries from common searches (e.g. searches for the best books from a specific genre). Gotta love that meta-programming!
I built a backend in Go using the Gin framework. The backend basically performs SQL queries in the database. The front-end is a webpage with a text box to search for books or select a genre. The communication with the backend is written in jQuery. I realized I could have used this opportunity to learn React + Redux, but I just wanted to do the front-end fast rather than well. Even so, I learned some neat things and ended up building my own little DOM history manipulation so you can use the back arrow in the app. It’s a small detail but it makes it feel much better.
I quickly realized that I would need an approximate string matching library to help me find the book that a user is searching for. This is a particularly daunting task since there are over 2.5 million books and Levenshtein would be far too slow (it can take tens of seconds!). I ended up writing my own little algorithm that is based on a bag-of-words approach, and given the right parameters it can find matches in less than 30 milliseconds! I actually made this part of the program a microservice because it eats up a lot of RAM, and I do not want to have to run multiple instances of it. The microservice and the Go library are available at github.com/schollz/closestmatch.
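To give a feel for why a bag-of-words lookup is so much faster than comparing the query against every title, here is a simplified Go sketch. It is not the algorithm used in closestmatch; the `Index` type and its methods are hypothetical, and it only handles missing or reordered words, not misspellings.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// Index maps each lowercase word to the IDs of the titles that contain it.
type Index struct {
	titles []string
	byWord map[string][]int
}

// tokenize lowercases s and splits it on anything that is not a letter or digit.
func tokenize(s string) []string {
	return strings.FieldsFunc(strings.ToLower(s), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

// NewIndex builds the word-to-titles lookup once, up front.
func NewIndex(titles []string) *Index {
	idx := &Index{titles: titles, byWord: make(map[string][]int)}
	for id, title := range titles {
		seen := make(map[string]bool)
		for _, word := range tokenize(title) {
			if !seen[word] { // index each word once per title
				seen[word] = true
				idx.byWord[word] = append(idx.byWord[word], id)
			}
		}
	}
	return idx
}

// Closest returns the title that shares the most distinct words with the query.
// Ties go to the title that was indexed first. ok is false if nothing matches.
func (idx *Index) Closest(query string) (title string, ok bool) {
	votes := make(map[int]int)
	seen := make(map[string]bool)
	for _, word := range tokenize(query) {
		if seen[word] {
			continue
		}
		seen[word] = true
		for _, id := range idx.byWord[word] {
			votes[id]++
		}
	}
	best, bestVotes := -1, 0
	for id, v := range votes {
		if v > bestVotes || (v == bestVotes && id < best) {
			best, bestVotes = id, v
		}
	}
	if best < 0 {
		return "", false
	}
	return idx.titles[best], true
}

func main() {
	idx := NewIndex([]string{
		"Weaveworld by Clive Barker",
		"The Stand by Stephen King",
		"Imajica by Clive Barker",
	})
	fmt.Println(idx.Closest("stand king"))
}
```

Each query only touches the titles that share at least one word with it, instead of scanning all 2.5 million. Splitting titles into short character sequences rather than whole words would be one way to tolerate typos as well, at the cost of a larger index, which fits the RAM concern mentioned above.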
Now that I had the technical details implemented, I went to my new website and tried to find a suggestion!
To get a baseline I asked real people what I should read next and they suggested Imajica by Clive Barker, anything by Neil Gaiman,* and The Talisman by Stephen King. *I’ve already read all of Neil Gaiman’s books, and have somehow missed out on reading any Stephen King. Still, these were pretty good recommendations from people familiar with the book I’m reading.
After getting real recommendations, I turned to the Book Suggestions Ninja and entered Weaveworld by Clive Barker. The Book Suggestions Ninja suggested a bunch of books that are similar to Weaveworld.
The top recommendation is The Library at Mount Char by Scott Hawkins. This sounds like a great recommendation: a “horrifying and hilarious” book about gods and secrets with weird and esoteric characters. The other recommendations are great too. Imajica and Night Watch are both books I’ve been interested in. The Stand is also a frequent recommendation. Further down there are great books I’ve read before, like The Lies of Locke Lamora and The Warded Man, but for the most part all of these books are really good new recommendations. Looks like I will have plenty of reading to do!
I chose two books to read: the top recommendation and the 19th suggestion, which are both listed as SO good. The top recommendation, The Library at Mount Char by Scott Hawkins, was really SO good in that I couldn’t put it down and finished it in four days. The other book is Hard Magic by Larry Correia, which has the elements of magical realism, horror, and fantasy that I liked about Weaveworld. It is impressive that the suggested books stay so close to the original, even this far down the list!
I like these recommendations, a lot. They get straight to the point and have some books that I know are good, which makes me think the rest could be good too. These recommendations literally took me 6 seconds of searching and 2 minutes of scanning the web page.
The result of this rather long article and four weeks of work and 1,600 lines of code is www.booksuggestions.ninja.
No recommendation will probably ever be better than one from your friends. However, if you carefully analyze genres, look at a diverse list of authors, and filter out sequels, you can find great recommendations. This is what my app, Book Suggestions Ninja, tries to do. Maybe it will work for you too? If it does, please let me know, tweet me @nicenovelninja (tweet me if it doesn’t work, too).
In any case, I found a new book to read after finishing Weaveworld, which is all I could really ask for.