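- Item2Item recommendation is highly effective in practice
- Neural network models have strong feature abstraction ability
- Paper: ITEM2VEC: Neural Item Embedding For Collaborative Filtering
- Convert each user's behavior sequence into a "sentence" made up of items
- Train item embeddings the same way word2vec trains word embeddings
- The temporal order of the user's behavior sequence is lost
- All items in a behavior sequence carry equal weight, so the strength of each interaction is not distinguished
- Extract user behavior sequences from the logs (for example, at day granularity)
- Treat the behavior sequences as a corpus and train word2vec on it to obtain item embeddings
- Derive item-to-item similarity from the embeddings and use it for recommendation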
CBOW (continuous bag of words)
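To make the two word2vec training modes concrete, here is a minimal sketch of how one item sequence could be turned into training examples under each mode. The function names, the fixed window, and the item ids are illustrative assumptions, not part of any library; real implementations such as gensim also subsample and vary the window size. Note that the gensim call later in these notes passes `sg=1`, which selects skip-gram rather than CBOW.

```python
from typing import List, Tuple


def cbow_examples(seq: List[str], window: int) -> List[Tuple[List[str], str]]:
    """CBOW view: the surrounding items jointly predict the center item."""
    examples = []
    for i, center in enumerate(seq):
        context = seq[max(0, i - window):i] + seq[i + 1:i + 1 + window]
        if context:
            examples.append((context, center))
    return examples


def skipgram_examples(seq: List[str], window: int) -> List[Tuple[str, str]]:
    """Skip-gram view: the center item predicts each surrounding item."""
    pairs = []
    for i, center in enumerate(seq):
        context = seq[max(0, i - window):i] + seq[i + 1:i + 1 + window]
        for ctx in context:
            pairs.append((center, ctx))
    return pairs


session = ["item_a", "item_b", "item_c"]  # hypothetical item ids
print(cbow_examples(session, window=1))
print(skipgram_examples(session, window=1))
```

With a window of 1, CBOW yields one example per position (context list, center), while skip-gram yields one (center, context) pair per neighbor.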
```python
import os


def produce_train_data(input_file, out_file):
    """
    Build word2vec training sentences from user ratings.
    :param input_file: user behavior file
    :param out_file: output file
    """
    if not os.path.exists(input_file):
        return
    record = {}
    score_thr = 4  # keep only ratings at or above this threshold
    with open(input_file) as fp:
        for linenum, line in enumerate(fp):
            if linenum == 0:  # skip the header row
                continue
            item = line.strip().split(',')
            if len(item) < 4:
                continue
            userid, itemid, rating = item[0], item[1], float(item[2])
            if rating < score_thr:
                continue
            record.setdefault(userid, []).append(itemid)
    # one line per user: the space-separated item ids form a "sentence"
    with open(out_file, "w") as fw:
        for userid in record:
            fw.write(" ".join(record[userid]) + "\n")
```
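The function above appends items in file order, so the sequence reflects time only if the log happens to be sorted. One possible way to keep temporal order is sketched below. It assumes the fourth column is an integer timestamp (as in MovieLens-style rating files); the source does not name that column, so treat this as an assumption. `build_ordered_sequences` is a hypothetical helper that works on already-split rows rather than on a file.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence


def build_ordered_sequences(rows: Iterable[Sequence[str]],
                            score_thr: float = 4.0) -> Dict[str, List[str]]:
    """Group liked items per user, ordered by timestamp (assumed 4th column)."""
    events = defaultdict(list)
    for row in rows:
        if len(row) < 4:  # skip malformed rows
            continue
        userid, itemid = row[0], row[1]
        rating, timestamp = float(row[2]), int(row[3])
        if rating < score_thr:
            continue
        events[userid].append((timestamp, itemid))
    # sort each user's events by timestamp, then keep only the item ids
    return {
        userid: [itemid for _, itemid in sorted(pairs, key=lambda p: p[0])]
        for userid, pairs in events.items()
    }


rows = [
    ("u1", "i3", "5.0", "30"),
    ("u1", "i1", "4.0", "10"),
    ("u1", "i2", "2.0", "20"),  # dropped: rating below threshold
    ("u2", "i9", "4.5", "5"),
]
print(build_ordered_sequences(rows))
```

Ordering matters mostly when the word2vec window is smaller than the sequence length; with a window that covers the whole sequence, the order has little effect.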
- Feed the item sequences to a word2vec model to obtain the item embedding file
- Use gensim or the C implementation of word2vec
```python
from gensim.models import word2vec

sentences = word2vec.LineSentence("../data/train_data.txt")
# sg=1 selects skip-gram; hs=0 with negative=5 trains with negative sampling
model = word2vec.Word2Vec(sentences, sg=1, vector_size=128, window=5, sample=1e-3,
                          hs=0, negative=5, epochs=100)
model.wv.save_word2vec_format("../data/item_vec.txt", binary=False)
```
```python
import operator
import os

import numpy as np


def load_item_vec(input_file):
    """
    :param input_file: item vec file
    :return: dict key:itemid value:np.array([num1,num2...])
    """
    if not os.path.exists(input_file):
        return {}
    item_vec = {}
    with open(input_file) as fp:
        for linenum, line in enumerate(fp):
            if linenum == 0:  # first line holds vocabulary size and dimension
                continue
            item = line.strip().split()
            if len(item) < 129:  # item id plus 128 vector components
                continue
            itemid = item[0]
            if itemid == "</s>":
                continue
            item_vec[itemid] = np.array([float(ele) for ele in item[1:]])
    return item_vec


def cal_item_sim(item_vec, itemid, output_file):
    """
    :param item_vec: item embedding vector
    :param itemid: fixed itemid to calc item sim
    :param output_file: the file to store result
    """
    if itemid not in item_vec:
        return
    score = {}
    topk = 10
    fix_item_vec = item_vec[itemid]
    for tmp_itemid in item_vec:
        if tmp_itemid == itemid:
            continue
        tmp_itemvec = item_vec[tmp_itemid]
        # cosine similarity; fenmu is the denominator (product of the norms)
        fenmu = np.linalg.norm(fix_item_vec) * np.linalg.norm(tmp_itemvec)
        if fenmu == 0:
            score[tmp_itemid] = 0
        else:
            score[tmp_itemid] = round(np.dot(fix_item_vec, tmp_itemvec) / fenmu, 3)

    # output format: itemid \t simid_score;simid_score;...
    tmp_list = []
    for co in sorted(score.items(), key=operator.itemgetter(1), reverse=True)[:topk]:
        tmp_list.append(co[0] + "_" + str(co[1]))
    with open(output_file, "w") as fw:
        fw.write(itemid + "\t" + ";".join(tmp_list) + "\n")
```
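`cal_item_sim` scores the neighbors of a single fixed item. For serving recommendations, a top-k list is usually needed for every item. The sketch below shows one way to do that with only the standard library, using `heapq.nlargest` instead of a full sort. The names `cosine` and `topk_for_all` and the toy vectors are illustrative assumptions, and the loop is quadratic in the number of items, so it only suits small catalogs.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero norm."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / norm


def topk_for_all(item_vec: Dict[str, Sequence[float]],
                 topk: int = 10) -> Dict[str, List[Tuple[str, float]]]:
    """For every item, return its topk most similar other items."""
    result = {}
    for itemid, vec in item_vec.items():
        scored = (
            (cosine(vec, other_vec), other_id)
            for other_id, other_vec in item_vec.items()
            if other_id != itemid
        )
        # nlargest keeps only topk candidates instead of sorting all scores
        best = heapq.nlargest(topk, scored)
        result[itemid] = [(other_id, round(score, 3)) for score, other_id in best]
    return result


toy_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}  # toy data
print(topk_for_all(toy_vectors, topk=1))
```

Because every item is compared with every other item, this is the step that benefits most from being split across workers on a large catalog.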
- Write a shell script to chain the steps together, and use Spark or a similar framework for parallel computation