ContentBased

Content-based recommendation: how the Content Based personalized recall algorithm works, with a from-scratch implementation.

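Algorithm principle

The idea behind content-based recommendation is to recommend items that are similar to the items a user liked in the past. The key is how to measure the similarity between items.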

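General workflow of content-based recommendation

  • item profile: extract features from each item to represent it (topic finding; genre classification; key frames in video, audio features, etc.);
  • user profile: learn the user's preference features from the features of the items the user liked in the past (genre; topic; time decay);
  • generate the recommendation list: using the user profile from the previous step and the features of the candidate items, recommend the set of items most relevant to this user (find the top-k genres/topics; get the best n items from each fixed genre/topic).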
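
The three steps above can be sketched on toy data. This is only an illustration under stated assumptions: the item IDs and genre weights are made up, and cosine similarity is one possible relevance measure chosen for the sketch; the implementation later in this post allocates per-category quotas instead.

```python
import math

# Hypothetical item profiles: each item is a genre-weight vector.
ITEM_PROFILES = {
    "m1": {"Action": 0.5, "Comedy": 0.5},
    "m2": {"Action": 1.0},
    "m3": {"Romance": 1.0},
    "m4": {"Comedy": 1.0},
}


def build_user_profile(liked_items, item_profiles):
    """User profile: sum of the feature vectors of the liked items."""
    profile = {}
    for itemid in liked_items:
        for feature, weight in item_profiles[itemid].items():
            profile[feature] = profile.get(feature, 0.0) + weight
    return profile


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(liked_items, item_profiles, topk=2):
    """Rank the items the user has not seen by similarity to the user profile."""
    user_profile = build_user_profile(liked_items, item_profiles)
    candidates = [i for i in item_profiles if i not in liked_items]
    ranked = sorted(
        candidates,
        key=lambda i: cosine(user_profile, item_profiles[i]),
        reverse=True,
    )
    return ranked[:topk]


# A user who liked the Action/Comedy item m1 gets m2 and m4, never m3.
print(recommend(["m1"], ITEM_PROFILES))
```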

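Advantages and disadvantages of content-based recommendation

Advantages

  • Independence between users: each user profile is built only from that user's own preferences for items and does not depend on anyone else's behavior, which is exactly the opposite of collaborative filtering (CF);
  • Explainability: to explain to a user why these items were recommended, it is enough to tell the user which attributes the items have;
  • New items can be recommended immediately.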

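Disadvantages

  • Extracting item features is usually hard; if two items end up with exactly the same extracted features, CB cannot distinguish them at all;
  • It cannot discover a user's latent interests: content-based recommendation relies only on the user's past preferences for certain items, so its recommendations will always resemble the items the user liked before;
  • It cannot recommend for new users: a new user has no preference history and therefore no user profile.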

Implementation

  • Compute the average rating of each item: the sum of all users' ratings divided by the number of ratings.
import os


def get_ave_score(input_file):
    """
    Compute the average rating of each item.
    :param input_file: user rating file
    :return: a dict, key: itemid, value: ave_score
    """
    if not os.path.exists(input_file):
        return {}
    linenum = 0
    record = {}  # itemid -> [sum of ratings, number of ratings]
    ave_score = {}
    with open(input_file) as fp:
        for line in fp:
            if linenum == 0:  # skip the header line
                linenum += 1
                continue
            item = line.strip().split(",")
            if len(item) < 4:
                continue
            userid, itemid, rating = item[0], item[1], float(item[2])
            if itemid not in record:
                record[itemid] = [0, 0]
            record[itemid][0] += rating
            record[itemid][1] += 1
    for itemid in record:
        ave_score[itemid] = round(record[itemid][0] / record[itemid][1], 3)
    return ave_score
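
To make the expected input concrete, here is a compact sketch of the same averaging step run on an in-memory sample. The column layout (`userid,itemid,rating,timestamp` with one header line) is inferred from the parsing code above; the sample rows and the helper name `average_scores` are made up for illustration.

```python
import io
from collections import defaultdict

# Hypothetical sample in the layout the parser expects.
SAMPLE = """userid,itemid,rating,timestamp
1,10,4.0,1537000000
2,10,5.0,1537100000
1,20,3.0,1537200000
"""


def average_scores(lines):
    """Average rating per item from an iterator of CSV lines with a header."""
    totals = defaultdict(lambda: [0.0, 0])  # itemid -> [rating sum, rating count]
    next(lines, None)  # skip the header line
    for line in lines:
        fields = line.strip().split(",")
        if len(fields) < 4:
            continue
        itemid, rating = fields[1], float(fields[2])
        totals[itemid][0] += rating
        totals[itemid][1] += 1
    return {itemid: round(s / n, 3) for itemid, (s, n) in totals.items()}


print(average_scores(io.StringIO(SAMPLE)))  # {'10': 4.5, '20': 3.0}
```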
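  • Compute the categories of each item, and sort the items in each category by the average rating from the previous step in descending order. Two data structures are returned:
    • item_cate: key: item, value: {category: ratio}. If an item belongs to several categories, its weight is split equally among them.
    • cate_item_sort: key: cate, value: a list of items. Each category keeps only its topk highest-rated items.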
import operator


def get_item_cate(ave_score, input_file):
    """
    Get the categories of each item, and for each category the items that
    belong to it, sorted by score in descending order.
    :param ave_score: a dict, key itemid, value rating score
    :param input_file: item info
    :return:
        a dict: key itemid, value a dict (key cate, value ratio)
        a dict: key cate, value [itemid1, itemid2, itemid3, ...]
    """
    if not os.path.exists(input_file):
        return {}, {}
    linenum = 0
    topk = 100  # number of items kept per category
    item_cate = {}
    record = {}  # cate -> {itemid: average score}
    cate_item_sort = {}
    with open(input_file, encoding="utf-8") as fp:
        for line in fp:
            if linenum == 0:  # skip the header line
                linenum += 1
                continue
            item = line.strip().split(",")
            if len(item) < 3:
                continue
            itemid = item[0]
            cate_str = item[-1]  # last column, categories separated by "|"
            cate_list = cate_str.strip().split("|")
            ratio = round(1 / len(cate_list), 3)  # equal share per category
            if itemid not in item_cate:
                item_cate[itemid] = {}
            for fix_cate in cate_list:
                item_cate[itemid][fix_cate] = ratio
    for itemid in item_cate:
        for cate in item_cate[itemid]:
            if cate not in record:
                record[cate] = {}
            # items without any rating get an average score of 0
            itemid_rating_score = ave_score.get(itemid, 0)
            record[cate][itemid] = itemid_rating_score
    for cate in record:
        if cate not in cate_item_sort:
            cate_item_sort[cate] = []
        for co in sorted(record[cate].items(), key=operator.itemgetter(1), reverse=True)[:topk]:
            cate_item_sort[cate].append(co[0])
    return item_cate, cate_item_sort
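
The equal split of `ratio` and the per-category ranking can be checked on two made-up items. The helper names `split_categories` and `rank_by_category` are hypothetical and only mirror the logic of `get_item_cate` without the file handling.

```python
def split_categories(cate_str):
    """Split a 'A|B|C' category string into equal shares, rounded to 3 places."""
    cates = cate_str.strip().split("|")
    ratio = round(1 / len(cates), 3)
    return {cate: ratio for cate in cates}


def rank_by_category(item_cate, ave_score, topk=100):
    """For each category, list its items by average score, highest first."""
    by_cate = {}
    for itemid, cates in item_cate.items():
        for cate in cates:
            by_cate.setdefault(cate, []).append(itemid)
    return {
        cate: sorted(items, key=lambda i: ave_score.get(i, 0), reverse=True)[:topk]
        for cate, items in by_cate.items()
    }


# Hypothetical items: item 10 has two categories, item 20 has one.
item_cate = {
    "10": split_categories("Action|Comedy"),
    "20": split_categories("Comedy"),
}
ave_score = {"10": 4.5, "20": 3.0}

print(item_cate["10"])                         # {'Action': 0.5, 'Comedy': 0.5}
print(rank_by_category(item_cate, ave_score))  # {'Action': ['10'], 'Comedy': ['10', '20']}
```

Because the share is rounded to three decimals, an item with three categories gets 0.333 per category, so the shares add up to 0.999 rather than exactly 1.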
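  • Compute the time decay: the closer the time of a rating is to the current time, the more weight that rating gets, because a user's preferences change over time.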
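def get_time_score(timestamp):
    """
    :param timestamp: input timestamp (seconds)
    :return: time score
    """
    fix_time_stamp = 1537799250  # fixed reference time treated as "now"
    total_sec = 24 * 60 * 60  # seconds per day
    # age of the rating in units of 100 days: the more recent the rating,
    # the smaller delta is and the larger the score
    delta = (fix_time_stamp - timestamp) / total_sec / 100
    return round(1 / (1 + delta), 3)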
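
Written as a formula, the score is 1 / (1 + d / 100), where d is the age of the rating in days. The short sketch below evaluates it for a few ages to show how quickly it decays; the function name `time_score` and the `now` parameter are assumptions added so the reference time can be passed in.

```python
FIX_TIME_STAMP = 1537799250       # same reference time as in get_time_score
SECONDS_PER_DAY = 24 * 60 * 60


def time_score(timestamp, now=FIX_TIME_STAMP):
    """Hyperbolic decay: 1 / (1 + age_in_days / 100), rounded to 3 places."""
    delta = (now - timestamp) / SECONDS_PER_DAY / 100
    return round(1 / (1 + delta), 3)


for days in (0, 100, 300, 900):
    print(days, time_score(FIX_TIME_STAMP - days * SECONDS_PER_DAY))
# 0 1.0
# 100 0.5
# 300 0.25
# 900 0.1
```

A rating made 100 days before the reference time counts half as much as one made at the reference time. Timestamps later than the reference time give a negative `delta` and a score above 1, so the reference time should be no earlier than the newest rating in the data.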
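  • Compute user preferences: using the rating file, a user's preference for a category is accumulated as rating * time_score * ratio over the items the user rated. Only ratings of at least 4.0 are counted, and the top 2 categories per user are kept and normalized so their weights sum to 1.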
def get_up(item_cate, input_file):
    """
    Build the user profile.
    :param item_cate: key itemid, value: dict (key category, value ratio)
    :param input_file: user rating file
    :return:
        a dict: key userid, value [(category, ratio), (category1, ratio1)]
    """
    if not os.path.exists(input_file):
        return {}
    linenum = 0
    score_thr = 4.0  # only ratings at or above this count as "liked"
    topk = 2  # number of categories kept per user
    record = {}  # userid -> {category: accumulated preference}
    up = {}
    with open(input_file) as fp:
        for line in fp:
            if linenum == 0:  # skip the header line
                linenum += 1
                continue
            item = line.strip().split(",")
            if len(item) < 4:
                continue
            userid, itemid, rating, timestamp = item[0], item[1], float(item[2]), int(item[3])
            if rating < score_thr:
                continue
            if itemid not in item_cate:
                continue
            time_score = get_time_score(timestamp)
            if userid not in record:
                record[userid] = {}
            for fix_cate in item_cate[itemid]:
                if fix_cate not in record[userid]:
                    record[userid][fix_cate] = 0
                record[userid][fix_cate] += rating * time_score * item_cate[itemid][fix_cate]
    for userid in record:
        if userid not in up:
            up[userid] = []
        total_score = 0
        for co in sorted(record[userid].items(), key=operator.itemgetter(1), reverse=True)[:topk]:
            up[userid].append((co[0], co[1]))
            total_score += co[1]
        # normalize so the kept categories' weights sum to 1
        for index in range(len(up[userid])):
            up[userid][index] = (up[userid][index][0], round(up[userid][index][1] / total_score, 3))
    return up
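
A small worked example of the accumulation and normalization may help. The two ratings below are made up, and `top_categories` is a hypothetical helper that mirrors only the last loop of `get_up`, with an added guard for a zero total.

```python
def top_categories(cate_scores, topk=2):
    """Keep the topk categories and normalize their scores to sum to 1."""
    top = sorted(cate_scores.items(), key=lambda kv: kv[1], reverse=True)[:topk]
    total = sum(score for _, score in top)
    if total == 0:  # guard added for this sketch
        return []
    return [(cate, round(score / total, 3)) for cate, score in top]


# Hypothetical liked ratings: (rating, time_score, {category: ratio})
liked = [
    (5.0, 0.5, {"Action": 0.5, "Comedy": 0.5}),  # older Action|Comedy item
    (4.0, 1.0, {"Comedy": 1.0}),                 # recent Comedy item
]

scores = {}
for rating, t_score, cates in liked:
    for cate, ratio in cates.items():
        scores[cate] = scores.get(cate, 0.0) + rating * t_score * ratio

print(scores)                  # {'Action': 1.25, 'Comedy': 5.25}
print(top_categories(scores))  # [('Comedy', 0.808), ('Action', 0.192)]
```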
  • Recommend based on the user profile and item profile obtained above.
def recom(cate_item_sort, up, userid, topk=10):
    """
    :param cate_item_sort: key cate, value items sorted by score, highest first
    :param up: user profile
    :param userid: the userid to recommend for
    :param topk: recom num
    :return: a dict, key userid, value [itemid1, itemid2, ...]
    """
    if userid not in up:
        return {}
    recom_result = {userid: []}
    for co in up[userid]:
        cate = co[0]
        ratio = co[1]
        # number of items taken from this category, proportional to its weight
        num = int(topk * ratio) + 1
        if cate not in cate_item_sort:
            continue
        recom_list = cate_item_sort[cate][:num]
        recom_result[userid] += recom_list
    return recom_result
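
The quota rule `int(topk * ratio) + 1` is worth tracing once, because the quotas can add up to more than `topk` and an item that belongs to two categories can appear twice. The sketch below uses a made-up profile and item lists; `dedupe_and_trim` is a hypothetical post-processing step, not part of the original `recom`.

```python
def category_quotas(user_profile, topk=10):
    """Number of items requested from each category: int(topk * ratio) + 1."""
    return [(cate, int(topk * ratio) + 1) for cate, ratio in user_profile]


def dedupe_and_trim(items, topk=10):
    """Hypothetical post-processing: drop repeats, keep order, cap at topk."""
    seen = set()
    unique = []
    for itemid in items:
        if itemid not in seen:
            seen.add(itemid)
            unique.append(itemid)
    return unique[:topk]


# Hypothetical user profile and per-category rankings.
profile = [("Comedy", 0.808), ("Action", 0.192)]
cate_item_sort = {"Comedy": ["a", "b", "c"], "Action": ["a", "d"]}

quotas = category_quotas(profile)
print(quotas)  # [('Comedy', 9), ('Action', 2)]

raw = []
for cate, num in quotas:
    raw += cate_item_sort.get(cate, [])[:num]
print(raw)                   # ['a', 'b', 'c', 'a', 'd']
print(dedupe_and_trim(raw))  # ['a', 'b', 'c', 'd']
```

Here the quotas are 9 and 2, which is 11 items for `topk=10`, and item `a` is returned for both categories. Whether to deduplicate or cap the list depends on how the recall results are consumed downstream.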