Book recommendation system

Original data file is obtained from "https://www.kaggle.com/jealousleopard/goodreadsbooks". This does not include descriptions of the books.

Using API from GoodReads.com, description of the books is obtained where avaialable. It is saved into 'books_extensive.pkl'

Book recommendation system is implemented based on following concepts,

  • A simple search engine using 'Levenshtein Distance'
  • Content based recommender system: Using description of the books (synopsis of the book + title + first author), obtain book-book similarity coefficients,
  • Using user ratings for certain books, obtain a User-Book matrix.
  • Perform matrix factorization to reduce User-Book matrix into a lower number of latent components, and predict ratings for unseen books. Here, to predict the ratings we use book-book similarity coefficients as weights,
In [1]:
import pandas as pd
import goodreads as gr
from goodreads import client
import string
import pylab as plt
import numpy as np
import regex as re
import datetime
import time
import seaborn as sns
import fuzzywuzzy as fuzzy
from fuzzywuzzy import process

sns.set()

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation, SparsePCA, PCA, TruncatedSVD
from nltk.stem.porter import PorterStemmer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
In [2]:
%matplotlib inline
#inline
#notebook 

Some Data cleaning

  • Filtered for only English books
  • Only first author's name is kept
  • Combined rating is obtained by Log(avg_rating*total_ratings), negatives set to 0.
  • Using API, description and popular shelves of the books is pulled and stored.
  • ISBN and ISBN13 are corrected (mostly)
  • Publication_date is replaced with year only
  • Total 10544 books
In [3]:
all_books_data = pd.read_pickle(r"DataScience\Proj\Projects\books\data\goodreadsData\books_extensive.pkl")
In [4]:
all_books_data.head()
Out[4]:
bookID title authors average_rating isbn num_pages ratings_count text_reviews_count publication_date publisher combo_rating description popular_shelves similar_books isbn13
0 41865 Twilight (Twilight #1) Stephenie Meyer 3.59 0316015849 501 4597666 94265 2006 Little Brown and Company 16.619212 About three things I was absolutely positive. ... [to-read, currently-reading, young-adult, fant... [Twilight: The Complete Illustrated Movie Comp... 9780316015844
1 5907 The Hobbit or There and Back Again J.R.R. Tolkien 4.27 0618260307 366 2530894 32871 2002 Houghton Mifflin 16.195697 In a hole in the ground there lived a hobbit.... [to-read, currently-reading, fantasy, favorite... [Harry Potter and the Sorcerer's Stone (Harry ... 9780618260300
2 5 Harry Potter and the Prisoner of Azkaban (Harr... J.K. Rowling 4.56 043965548X 435 2339585 36325 2004 Scholastic Inc. 16.182807 Harry Potter's third year at Hogwarts is full ... [to-read, fantasy, favorites, currently-readin... [Harry Potter and the Cursed Child: Parts One ... 9780439655484
3 15881 Harry Potter and the Chamber of Secrets (Harry... J.K. Rowling 4.42 0439064864 341 2293963 34692 1999 Arthur A. Levine Books / Scholastic Inc. 16.131931 The Dursleys were so mean and hideous that sum... [to-read, currently-reading, fantasy, favorite... [Harry Potter and the Cursed Child: Parts One ... 9780439064866
4 2 Harry Potter and the Order of the Phoenix (Har... J.K. Rowling 4.49 0439358078 870 2153167 29221 2004 Scholastic Inc. 16.084303 There is a door at the end of a silent corrido... [to-read, fantasy, currently-reading, favorite... [Harry Potter and the Cursed Child: Parts One ... 9780439358071

Data clearning is necessary

However, for this task we will only focus on the columns = bookID, title, author, description

In [5]:
all_books_data.isna().sum()
Out[5]:
bookID                   0
title                    0
authors                  0
average_rating           0
isbn                     9
num_pages                0
ratings_count            0
text_reviews_count       0
publication_date         0
publisher                0
combo_rating             0
description            394
popular_shelves       8543
similar_books         9006
isbn13                   0
dtype: int64
In [6]:
all_books_data.columns = ['book_id', 'title', 'authors', 'average_rating', 'isbn', 'num_pages',
       'ratings_count', 'text_reviews_count', 'publication_date', 'publisher',
       'combo_rating', 'description', 'popular_shelves', 'similar_books',
       'isbn13']

Missing descriptions replace with title

In [7]:
all_books_data.description = all_books_data.description.fillna(all_books_data.title)
In [8]:
all_books_df1 = all_books_data[['book_id', 'title', 'authors', 'description']]
all_books_df1 = all_books_df1.drop_duplicates(subset=['title'], keep='first')
all_books_df1.shape
Out[8]:
(9808, 4)

Merge 'description', 'title', 'authors' --> one 'all_text' feature

In [9]:
all_books_df1['all_text'] = all_books_df1[['description', 'title', 'authors']].agg(' '.join, axis=1)

Using TF-IDF extract and vectorize key words

  • Using the NLTK corpus, drop typical stop words such as a,the, however, although etc.
  • Drop alphanumeric words as well.
  • Limit total words to 10000.
In [10]:
# Create list of word by removing numeric/alphanumeric elements
vectorizer = CountVectorizer(stop_words='english', max_features=10000)
# #tfidf = TfidfVectorizer(stop_words='english', max_features=10000)#, tokenizer=stemming_tokenizer)

word_list = []
# #Tokenize description
for i, row in all_books_df1.all_text.astype('str').iteritems():
    X_tfidf = vectorizer.fit_transform(row.split(' '))
    words = vectorizer.get_feature_names()
    word_list.append([x for x in words if not any(c.isdigit() for c in x)])
    #word_list.append(words)
    
all_books_df1['all_text'] = word_list

print("Example, list of words from first book:\n")
print(np.array(all_books_df1.all_text[0]))
Example, list of words from first book:

['absolutely' 'bite' 'blood' 'deeply' 'didn' 'dominant' 'edward'
 'extraordinarily' 'irrevocably' 'know' 'love' 'meyer' 'positive' 'second'
 'seductive' 'stephenie' 'story' 'suspenseful' 'things' 'thirsted'
 'twilight' 'unconditionally' 'vampire']

With 10000 unique words, represent each book as a vector in this 10000 dimensional space. Use 'Term-Frequency' and 'Inverse-Document-Frequency' method for scoring words in a book.

  • Term-frequency (TF) => how frequent a word occurs within that book's description.
  • Inverse-Document-Frequency (IDF) => how frequent a word occurs in different book descriptions; higher the occurance, lower the score.
  • Score F = TF*IDF
In [11]:
vectorize_feature = all_books_df1.all_text.astype('str')

tfidf = TfidfVectorizer(max_features=10000, stop_words='english' )#, tokenizer=stemming_tokenizer)
X_tfidf = tfidf.fit_transform(vectorize_feature).toarray()
tfidf_feature_names = tfidf.get_feature_names()

print("Total number of books x total words in all books")
print(X_tfidf.shape)
Total number of books x total words in all books
(9808, 10000)

Create a quick book-search method

  • Give any input keywords, either relevant to the title or author.
  • Obtain most matching title from the list using Levenshtein Distance.
In [12]:
tit_auth = all_books_df1[["title", "authors"]].agg(' '.join, axis=1).str.lower()

def getMostMatchingTitle(keywords="harry potter"):
    matout = fuzzy.process.extractBests(keywords.lower(), tit_auth, limit=1)
    idx = matout[0][2]
    matching_title = all_books_df1.title[idx]
    return matching_title

Example:

In [13]:
getMostMatchingTitle(keywords="harry potter")
Out[13]:
'Harry Potter and the Prisoner of Azkaban (Harry Potter  #3)'

Predict most similar books: Book-book similarity

The similarity coefficients are calculated by using the text description of the books.

Using the books matrix (B) of n_books x n_words (i.e. 9808 books x 10000 words), calculate the similarity coefficient between book_i and book_j as below.

$ s_{ij} $ = $ \frac{B^{T}_{i} \cdot B_{j} }{ \| F_{i}^{2} \| \cdot \| F_{j}^{2} \|}$

In [14]:
description_cosine_mat = linear_kernel(X_tfidf, X_tfidf)
cosine_books_df = pd.DataFrame(description_cosine_mat, index = all_books_df1.book_id, columns=all_books_df1.book_id)
cosine_books_df.shape
Out[14]:
(9808, 9808)
In [15]:
# Function that get book recommendations based on the cosine similarity score of book's description
def similar_books(title, n=5, bid=None):
    if bid is None:
        bid = all_books_df1.book_id[all_books_df1.title == title]
    else:
        bid = bid
    sim_scores = cosine_books_df.loc[bid].squeeze()
    sim_scores.sort_values(ascending=False, inplace=True)
    #sim_idxs = np.argsort(sim_scores) #indices of values from max to min
    sim_ids = sim_scores.index.values[1:n+1]
    #ids = []
    #for i in sim_idxs:
    #    ids.append(all_books_df1.iloc[i].book_id)
    #
    prediction = all_books_df1[["title", "authors"]][all_books_df1.book_id.isin(sim_ids)]
    return prediction

Find the books which similar to the given book. You can either specify a book or some keywords which will describe a book. For example, lets find a book that matches the given key words 'jane austin', and get the suggestion list of most similar books.

In [16]:
keywords = "jane austen"
matching_title = getMostMatchingTitle(keywords)

print("Input keywords: ", keywords)
print("================")
print("Most matching title for give keywords is:", matching_title)
print("================")
print("top n books similar to above book based on book-book similarity coefficient:")
similar_books(matching_title, 15)
Input keywords:  jane austen
================
Most matching title for give keywords is: Emma
================
top n books similar to above book based on book-book similarity coefficient:
Out[16]:
title authors
448 Song of Solomon Toni Morrison
666 Ship of Destiny (Liveship Traders #3) Robin Hobb
1107 The Complete Novels Jane Austen
2197 Castaways of the Flying Dutchman (Flying Dutch... Brian Jacques
3660 Pride and Prejudice Jane Austen
3853 Jane Austen: A Life Carol Shields
4225 Mr. Bump Roger Hargreaves
4575 Tea with Jane Austen Kim Wilson
6220 Collected Plays 1944-1961 Arthur Miller
6786 Jane Austen: The Complete Novels Jane Austen
6867 The Complete Novels of Jane Austen Vol 1: Sen... Jane Austen
7107 Three Plays of Euripides: Alcestis/Medea/The B... Euripides
7515 Three Plays: Exit the King / The Killer / Macbett Eugène Ionesco
8399 The Solace Of Leaving Early Haven Kimmel
10171 Mammoth Book Of Lesbian Short Stories (Mammoth... Emma Donoghue

Recommender system based on User-Book matrix

  • Using user ratings information for certain books, obtain a User-Book matrix.
  • Combine text description of the books, and user ratings, to form a User-Book matrix.
  • A small subset of books have ratings. Keep only those books which are rated by the users.
  • Keep only those users that have rated the books in the above subset.

Read user ratings data.

In [17]:
ratings = pd.read_csv(r"C:\Users\jbelapur\Documents\Extra_activities\DataScience\Proj\Projects\books\data\goodreadsData\kaggle\ratings.csv")
In [18]:
ratings.head()
Out[18]:
book_id user_id rating
0 1 314 5
1 1 439 3
2 1 588 5
3 1 1169 4
4 1 1185 4

Not all books have been rated. Keep only those books which are rated by the users.

In [19]:
sub_book_ids = np.intersect1d(all_books_df1.book_id.unique(), ratings.book_id.unique())
sub_book_ids.shape
Out[19]:
(2356,)

There are total 2356 books which have ratings.

In [20]:
sub_books = all_books_df1[all_books_df1.book_id.isin(sub_book_ids)]
sub_books.shape
Out[20]:
(2356, 5)

Keep those users who have rated only those books in the above list.

In [22]:
subdata_ratings = ratings[ratings.book_id.isin(sub_book_ids)]
sub_users = subdata_ratings.user_id.unique()

print("Number of books with ratings: ", subdata_ratings.book_id.nunique())
print("Number of total users who have rated at least 1 of the above books: ", subdata_ratings.user_id.nunique())
print("============================================")
print("Example, books rated by user_id 2338: ")
subdata_ratings[subdata_ratings.user_id.isin([2338])]
Number of books with ratings:  2356
Number of total users who have rated at least 1 of the above books:  42067
============================================
Example, books rated by user_id 2338: 
Out[22]:
book_id user_id rating
267713 2680 2338 1
402468 4031 2338 3
459135 4601 2338 3
496576 4979 2338 1
594482 5969 2338 5
767384 7733 2338 4
794351 8013 2338 4
884387 8954 2338 5

Content based recommendations for a user

  • Get the top 3 highly rated books by a given user.
  • Using book-book similarity coefficients, get a suggestions list for each of the boooks above.
In [23]:
def getTopRatedBookIDsForUser(uderid=2338):
    print("User ID: ", uderid)
    user_df = subdata_ratings[subdata_ratings.user_id.isin([uderid])]
    user_df = user_df.sort_values(by=['rating'], ascending=False)
    sorted_books_ids = user_df.book_id
    #sorted_books_ids = user_df.book_id[user_df.rating >= 3]
    return sorted_books_ids

def recommendedBooks(userid=2338):
    # Get books rated by the user
    sorted_books_ids = getTopRatedBookIDsForUser(uderid=userid)
    matching_titles = sub_books[['title', 'book_id']][sub_books.book_id.isin(sorted_books_ids)]
    #matching_titles = sub_books.title[sub_books.book_id.isin(sorted_books_ids)]
    print("Top 3 books liked by user: \n" )#, np.array(matching_titles[:3])[:,None])
    print(matching_titles.head(3))
    print("-----")
    prediction = pd.DataFrame()
    for title in matching_titles.title:
        prediction= prediction.append(similar_books(title, 2))
    prediction.drop_duplicates(subset=['title'], keep="first", inplace=True)
    prediction = prediction[~prediction.title.isin(matching_titles)]
    return prediction
In [24]:
recommendedBooks(2338)
User ID:  2338
Top 3 books liked by user: 

                        title  book_id
204        Gulliver's Travels     7733
865   A Man Without a Country     4979
1163               Lunar Park     4031
-----
Out[24]:
title authors
8525 Water Water Everywhere: A Splash & Giggle Bat... Julie Aigner-Clark
8908 The Annotated Gulliver's Travels Jonathan Swift
30 Slaughterhouse-Five Kurt Vonnegut Jr.
5045 Cat's Cradle/God Bless You Mr. Rosewater/Break... Kurt Vonnegut Jr.
1669 The Sea The Sea Iris Murdoch
5373 Birthday (Ring #4) Kōji Suzuki
2123 Tandia Bryce Courtenay
3211 Brighid's Quest (Partholon #5) P.C. Cast
665 The Player of Games (Culture #2) Iain M. Banks
5765 Live Like a Jesus Freak: Spend Today as If It ... D.C. Talk
2190 Dooby Dooby Moo Doreen Cronin
6536 The Karma Of Brown Folk Vijay Prashad
10315 Gary Grizzle Roger Hargreaves
10377 Bill Buzz Roger Hargreaves
780 Utopia Thomas More
5714 The United States of Europe: The New Superpowe... T.R. Reid

There are 42000 users, and 2356 books. Due to the computational limitations and memory issues, I can not process such a large matrix, therefore I will limit the number of users to 10000.

Randomly select 10000 users.

In [25]:
random_10kids = np.random.choice(subdata_ratings.user_id.unique(), 10000, replace=False)
subdata_10kusers = subdata_ratings[subdata_ratings.user_id.isin(random_10kids)]
subdata_10kusers.shape
Out[25]:
(55037, 3)
In [26]:
subdata_10kusers.head(5)
Out[26]:
book_id user_id rating
1 1 439 3
8 1 3662 4
14 1 7563 3
19 1 10335 4
25 1 13282 5

Using TF-IDF extract and vectorize key words

  • Drop typical stop words such as a,the, however, although etc.
  • Drop alphanumeric words as well.
  • Limit total words to 10000.
In [ ]:
 
In [28]:
# Create list of word by removing numeric/alphanumeric elements
vectorizer = CountVectorizer(stop_words='english', max_features=10000)
# #tfidf = TfidfVectorizer(stop_words='english', max_features=10000)#, tokenizer=stemming_tokenizer)

word_list = []
# #Tokenize description
for i, row in sub_books.all_text.astype('str').iteritems():
    b_tfidf = vectorizer.fit_transform(row.split(' '))
    words = vectorizer.get_feature_names()
    word_list.append([x for x in words if not any(c.isdigit() for c in x)])
    #word_list.append(words)
    
sub_books['all_text'] = word_list

print("Example, list of words from first book:\n")
print(np.array(sub_books.all_text.iloc[0]))
Example, list of words from first book:

['acclaim' 'adventures' 'anderson' 'baggins' 'bare' 'based' 'bilbo'
 'britain' 'children' 'classic' 'classics' 'collins' 'comfort' 'critical'
 'cruel' 'dangerous' 'dirty' 'douglas' 'dragon' 'dry' 'earth' 'eat'
 'edition' 'ends' 'filled' 'gandalf' 'gollum' 'great' 'ground' 'hero'
 'hobbit' 'hole' 'includes' 'instant' 'introduction' 'lived' 'magnificent'
 'means' 'met' 'middle' 'modern' 'nasty' 'note' 'oozy' 'page' 'paperback'
 'powerful' 'published' 'recognized' 'recounts' 'reluctant' 'ring' 'sandy'
 'sit' 'smaug' 'smell' 'spectacular' 'text' 'timeless' 'tolkien'
 'unforgettable' 'wet' 'wizard' 'world' 'worms' 'written']
D:\Conda3\envs\tf2\lib\site-packages\ipykernel_launcher.py:13: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  del sys.path[0]
In [29]:
vectorize_feature = sub_books.all_text.astype('str')

tfidf = TfidfVectorizer(max_features=10000, stop_words='english' )#, tokenizer=stemming_tokenizer)
Xb_tfidf = tfidf.fit_transform(vectorize_feature).toarray()
tfidf_feature_names = tfidf.get_feature_names()

print("Total number of books x total words in all books")
print(Xb_tfidf.shape)
Total number of books x total words in all books
(2356, 10000)

Create cosine similarity matrix

In [30]:
sb_description_cosine_mat = linear_kernel(Xb_tfidf, Xb_tfidf)
sb_cosine_books_df = pd.DataFrame(sb_description_cosine_mat, index = sub_books.book_id, columns=sub_books.book_id)
sb_cosine_books_df.shape
Out[30]:
(2356, 2356)
In [31]:
sb_cosine_books_df.head()
Out[31]:
book_id 5907 5 2 1 960 5107 34 7613 7624 890 ... 1664 8077 9337 9338 2411 7400 1302 5863 3754 3351
book_id
5907 1.000000 0.027361 0.009970 0.034994 0.029174 0.043459 0.143572 0.019706 0.002763 0.002891 ... 0.000000 0.0 0.0 0.0 0.000000 0.011903 0.009293 0.020205 0.00000 0.017870
5 0.027361 1.000000 0.211669 0.161611 0.002285 0.016046 0.036668 0.011882 0.024332 0.021185 ... 0.003032 0.0 0.0 0.0 0.006502 0.011488 0.030340 0.008582 0.00000 0.006158
2 0.009970 0.211669 1.000000 0.217476 0.000000 0.014240 0.020075 0.000000 0.059635 0.014659 ... 0.052997 0.0 0.0 0.0 0.000000 0.015286 0.011979 0.000000 0.00000 0.006986
1 0.034994 0.161611 0.217476 1.000000 0.014039 0.062855 0.024931 0.000000 0.031327 0.008674 ... 0.014775 0.0 0.0 0.0 0.011833 0.000000 0.009022 0.026955 0.02481 0.021466
960 0.029174 0.002285 0.000000 0.014039 1.000000 0.020468 0.019527 0.000000 0.002699 0.002825 ... 0.000000 0.0 0.0 0.0 0.000000 0.018633 0.000000 0.003232 0.00000 0.014889

5 rows × 2356 columns

In [ ]:
 

Create Users-Books matrix

User-Book matrix (A), each row represents a user and each column represents a book and each cell represents rating given by that user to that book. Here, $A_{ij}$ is the rating given by a user ui on book $b{j}$. $A_{ij}$ can range from 1 to 5.

In [32]:
user_mat = subdata_10kusers.pivot_table(index="user_id", columns="book_id", values="rating", aggfunc = np.mean) #fill_value=0
user_mat.shape
Out[32]:
(10000, 2356)
In [33]:
user_mat.head(5)
Out[33]:
book_id 1 2 5 8 9 10 12 13 14 18 ... 9924 9929 9931 9936 9950 9957 9961 9974 9997 10000
user_id
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
10 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
19 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
27 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
32 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 2356 columns

Fill up ratings for unseen books by a user

Most of the books have missing ratings (i.e. NaN values). The task is to predict and fill up the ratings for the unread books for a user. First, we fill up ratings using average rating by all the users i.e. mass popularity.

In [35]:
# First guess for the rating is, average rating by all the users
user_mat_filled = user_mat.replace(np.nan, user_mat.mean(axis=0, skipna=True))
user_mat_filled.head()

#df = pd.DataFrame({'a': [1,2,3,4,0,0,0,0], 'b': [2,3,4,6,0,5,3,8]}) 
#df=df.replace(0,df.mean(axis=0))
#df.head(10)
Out[35]:
book_id 1 2 5 8 9 10 12 13 14 18 ... 9924 9929 9931 9936 9950 9957 9961 9974 9997 10000
user_id
4 4.12 4.4 3.923077 3.807692 3.310345 4.363636 3.666667 4.392857 4.0 4.25 ... 3.346154 3.344828 3.6 3.68 4.416667 3.37931 3.433333 3.652174 4.62963 3.882353
10 4.12 4.4 3.923077 3.807692 3.310345 4.363636 3.666667 4.392857 4.0 4.25 ... 3.346154 3.344828 3.6 3.68 4.416667 3.37931 3.433333 3.652174 4.62963 3.882353
19 4.12 4.4 3.923077 3.807692 3.310345 4.363636 3.666667 4.392857 4.0 4.25 ... 3.346154 3.344828 3.6 3.68 4.416667 3.37931 3.433333 3.652174 4.62963 3.882353
27 4.12 4.4 3.923077 3.807692 3.310345 4.363636 3.666667 4.392857 4.0 4.25 ... 3.346154 3.344828 3.6 3.68 4.416667 3.37931 3.433333 3.652174 4.62963 3.882353
32 4.12 4.4 3.923077 3.807692 3.310345 4.363636 3.666667 4.392857 4.0 4.25 ... 3.346154 3.344828 3.6 3.68 4.416667 3.37931 3.433333 3.652174 4.62963 3.882353

5 rows × 2356 columns

Estimate ratings using book-book similarity coefficients: $ \hat{r}_{ub} = \frac{ \Sigma_{i \epsilon B_{u}} r_{ui} s_{ib} }{ \Sigma_{i \epsilon B_{u}} s_{ib}}$

Using book-book similarity coefficients as weights to predict ratings of unseen books. Where B_u is the list of books user u already rated.

In [ ]:
#idx = 47

for idx in user_mat_filled.index:
    #u = user_mat_filled.loc[idx]
    #Books rated by user u
    books_by_ui = pd.DataFrame(user_mat.loc[idx][user_mat.loc[idx] > 0])
    #unseen books for user u
    unread_by_ui = np.setdiff1d(user_mat_filled.columns, books_by_ui.index.values)
    r_values = []
    for b in unread_by_ui:
        cos_values = sb_cosine_books_df.loc[b][list(books_by_ui.index.values)]
        cos_values[cos_values == 0.0] = 0.0001
        #r_val = np.average(cos_values, weights=books_by_ui.values.flatten())
        r_val = np.average(books_by_ui.values.flatten(), weights=cos_values)
        #mean of popular rating and weighted rating
        r_val = 0.5*(r_val + user_mat_filled.loc[idx][b])
        user_mat_filled.loc[idx][b] = r_val
In [55]:
user_mat_filled.head()
Out[55]:
book_id 1 2 5 8 9 10 12 13 14 18 ... 9924 9929 9931 9936 9950 9957 9961 9974 9997 10000
user_id
4 3.41 3.80 3.71 3.54 3.08 4.30 3.54 3.90 4.08 3.83 ... 3.54 4.16 3.36 3.48 4.24 4.19 3.53 3.24 3.89 4.16
10 3.61 3.95 3.61 3.78 3.39 4.15 3.63 3.99 3.70 3.92 ... 3.14 3.67 3.12 3.29 3.89 3.24 2.88 3.32 4.18 3.65
19 3.73 3.87 3.63 3.41 3.16 3.84 3.50 3.86 3.67 3.79 ... 3.17 3.18 3.47 3.51 3.72 3.19 3.22 3.49 3.98 3.61
27 4.15 4.36 4.15 4.39 3.72 4.27 3.96 4.33 4.14 4.25 ... 3.62 3.67 3.45 3.84 4.31 3.69 3.74 3.89 4.39 4.17
32 4.56 4.70 4.46 4.40 4.16 4.68 4.33 4.70 4.50 4.62 ... 4.17 4.17 4.30 4.34 4.71 4.19 4.22 4.33 4.81 4.44

5 rows × 2356 columns

In [ ]:
 
In [ ]:
 

Matrix factorization method

  • Extract the latent factors of the User x Books matrix, such that, these latent factors capture most of the information for predicting the user ratings.
  • The User-Book matrix is decomposed into two matrices, (1) User x n_components, and (2) n_components x Books.
In [86]:
n_comps = 10

# Run truncated-SVD
modeltsvd = TruncatedSVD(n_components=n_comps)
tsvd_mat = modeltsvd.fit_transform(user_mat_filled)
tsvd_mat.shape
Out[86]:
(10000, 10)
In [87]:
modeltsvd.components_.shape
Out[87]:
(10, 2356)

Test the difference between ratings predicted by matrix factorization and the actual ratings from User x Book matrix. For a given user, plot difference of ratings for all the books. Notice the narrow spread in differences, (red line indicating $\pm \sigma$ )

In [95]:
predicted_mat = np.dot(tsvd_mat,modeltsvd.components_)
predicted_mat.shape
Out[95]:
(10000, 2356)
In [111]:
diff_ratings = user_mat_filled.iloc[0] - predicted_mat[0,:]
sig_spread = diff_ratings.std()
In [114]:
plt.figure()
plt.plot(user_mat_filled.iloc[0] - predicted_mat[0,:],'g.')
plt.axhline(5,color='k', alpha=0.5)
plt.axhline(sig_spread, color='r', alpha=0.5)
plt.axhline(0,color='k', alpha=0.5)
plt.axhline(-sig_spread, color='r')
plt.axhline(-5,color='k', alpha=0.5)
Out[114]:
<matplotlib.lines.Line2D at 0x5303e288>

Recommend top 10 books based on predicted ratings

Example, for a given user find the top recommendations.

In [125]:
predicted_mat_df = pd.DataFrame(predicted_mat, index=user_mat.index.values, columns=sub_books.book_id.values)
In [137]:
# Get recommendations for a given user using User-Book matrix
idx = 61

books_by_ui = pd.DataFrame(user_mat.loc[idx][user_mat.loc[idx] > 0])
unread_by_ui = np.setdiff1d(user_mat_filled.columns, books_by_ui.index.values)
reco_ids = predicted_mat_df.loc[idx][unread_by_ui].sort_values(ascending=False)

print("Book rated by the user: ")
sub_books[['title', 'book_id']][sub_books.book_id.isin(books_by_ui.index)]
Book rated by the user: 
Out[137]:
title book_id
2284 Burr 8722
3847 Choke 5776
3929 The Wish Giver: Three Tales of Coven Tree 3638
10394 The Birds (Methuen Drama) 2019

Books recommended for the user:

In [138]:
sub_books[['title', 'book_id']][sub_books.book_id.isin(reco_ids.index)][:5]
Out[138]:
title book_id
1 The Hobbit or There and Back Again 5907
2 Harry Potter and the Prisoner of Azkaban (Harr... 5
4 Harry Potter and the Order of the Phoenix (Har... 2
5 Harry Potter and the Half-Blood Prince (Harry ... 1
6 Angels & Demons (Robert Langdon #1) 960
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]: