
Taking advantage of context features

In the featurization tutorial we incorporated multiple features beyond just user and movie identifiers into our models, but we haven't explored whether those features improve model accuracy.

Many factors affect whether features beyond ids are useful in a recommender model:

  1. Importance of context: if user preferences are relatively stable across contexts and time, context features may not provide much benefit. If, however, users' preferences are highly contextual, adding context will improve the model significantly. For example, day of the week may be an important feature when deciding whether to recommend a short clip or a movie: users may only have time to watch short content during the week, but can relax and enjoy a full-length movie during the weekend. Similarly, query timestamps may play an important role in modelling popularity dynamics: one movie may be highly popular around the time of its release, but decay quickly afterwards. Conversely, other movies may be evergreens that are happily watched time and time again.
  2. Data sparsity: using non-id features may be critical if data is sparse. With few observations available for a given user or item, the model may struggle with estimating a good per-user or per-item representation. To build an accurate model, other features such as item categories, descriptions, and images have to be used to help the model generalize beyond the training data. This is especially relevant in cold-start situations, where relatively little data is available on some items or users.
import os
import tempfile

import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds

import tensorflow_recommenders as tfrs

We follow the featurization tutorial and keep the user id, timestamp, and movie title features.

ratings = tfds.load("movielens/100k-ratings", split="train")
movies = tfds.load("movielens/100k-movies", split="train")

ratings = ratings.map(lambda x: {
    "movie_title": x["movie_title"],
    "user_id": x["user_id"],
    "timestamp": x["timestamp"],
})
movies = movies.map(lambda x: x["movie_title"])

We also do some housekeeping to prepare feature vocabularies.

timestamps = np.concatenate(list(ratings.map(lambda x: x["timestamp"]).batch(100)))

max_timestamp = timestamps.max()
min_timestamp = timestamps.min()

# 1,000 equally spaced bucket boundaries between the earliest and latest rating.
timestamp_buckets = np.linspace(
    min_timestamp, max_timestamp, num=1000,
)

unique_movie_titles = np.unique(np.concatenate(list(movies.batch(1000))))
unique_user_ids = np.unique(np.concatenate(list(ratings.batch(1_000).map(
    lambda x: x["user_id"]))))

Model definition

Query model

We start with the user model defined in the featurization tutorial as the first layer of our model, tasked with converting raw input examples into feature embeddings. However, we change it slightly to allow us to turn timestamp features on or off. This will allow us to more easily demonstrate the effect that timestamp features have on the model. In the code below, the use_timestamps parameter gives us control over whether we use timestamp features.

class UserModel(tf.keras.Model):

  def __init__(self, use_timestamps):
    super().__init__()

    self._use_timestamps = use_timestamps

    self.user_embedding = tf.keras.Sequential([
        tf.keras.layers.experimental.preprocessing.StringLookup(
            vocabulary=unique_user_ids, mask_token=None),
        tf.keras.layers.Embedding(len(unique_user_ids) + 1, 32),
    ])

    if use_timestamps:
      self.timestamp_embedding = tf.keras.Sequential([
          tf.keras.layers.experimental.preprocessing.Discretization(timestamp_buckets.tolist()),
          tf.keras.layers.Embedding(len(timestamp_buckets) + 1, 32),
      ])
      self.normalized_timestamp = tf.keras.layers.experimental.preprocessing.Normalization()

      self.normalized_timestamp.adapt(timestamps)

  def call(self, inputs):
    if not self._use_timestamps:
      return self.user_embedding(inputs["user_id"])

    return tf.concat([
        self.user_embedding(inputs["user_id"]),
        self.timestamp_embedding(inputs["timestamp"]),
        self.normalized_timestamp(inputs["timestamp"]),
    ], axis=1)
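With `use_timestamps=True`, `call` concatenates three pieces: the 32-wide user embedding, the 32-wide timestamp-bucket embedding, and the normalized timestamp, which we assume contributes a single extra dimension. A shape-only numpy sketch of that concatenation:

```python
import numpy as np

batch = 4
user_emb = np.zeros((batch, 32))       # StringLookup -> Embedding output
ts_bucket_emb = np.zeros((batch, 32))  # Discretization -> Embedding output
ts_norm = np.zeros((batch, 1))         # normalized timestamp, one feature

query = np.concatenate([user_emb, ts_bucket_emb, ts_norm], axis=1)
print(query.shape)  # (4, 65)
```

Because the query width changes depending on `use_timestamps`, the projection layer introduced in the combined model below is what keeps the two towers compatible.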

Note that our use of timestamp features in this tutorial interacts with our choice of training-test split in an undesirable way. Because we have split our data randomly rather than chronologically (a chronological split would ensure that events in the test set happen later than those in the training set), our model can effectively learn from the future. This is unrealistic: after all, we cannot train a model today on data from tomorrow.

This means that adding time features to the model lets it learn future interaction patterns. We do this for illustration purposes only: the MovieLens dataset itself is very dense, and unlike many real-world datasets does not benefit greatly from features beyond user ids and movie titles.

This caveat aside, real-world models may well benefit from other time-based features such as time of day or day of the week, especially if the data has strong seasonal patterns.

Candidate model

For simplicity, we'll keep the candidate model fixed. Again, we copy it from the featurization tutorial:

class MovieModel(tf.keras.Model):

  def __init__(self):
    super().__init__()

    max_tokens = 10_000

    self.title_embedding = tf.keras.Sequential([
      tf.keras.layers.experimental.preprocessing.StringLookup(
          vocabulary=unique_movie_titles, mask_token=None),
      tf.keras.layers.Embedding(len(unique_movie_titles) + 1, 32)
    ])

    self.title_vectorizer = tf.keras.layers.experimental.preprocessing.TextVectorization(
        max_tokens=max_tokens)

    self.title_text_embedding = tf.keras.Sequential([
      self.title_vectorizer,
      tf.keras.layers.Embedding(max_tokens, 32, mask_zero=True),
      tf.keras.layers.GlobalAveragePooling1D(),
    ])

    self.title_vectorizer.adapt(movies)

  def call(self, titles):
    return tf.concat([
        self.title_embedding(titles),
        self.title_text_embedding(titles),
    ], axis=1)
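The text tower turns each title into token ids, embeds every token, and averages the embeddings; `mask_zero=True` makes `GlobalAveragePooling1D` ignore padding positions. A numpy sketch of that masked average, with a made-up embedding table:

```python
import numpy as np

# One tokenized title, padded to length 4; id 0 is padding (mask_zero=True).
token_ids = np.array([5, 12, 3, 0])
emb_table = np.random.RandomState(0).randn(20, 8)  # vocab of 20, 8-dim vectors

vectors = emb_table[token_ids]       # (4, 8): one vector per token slot
mask = (token_ids != 0)[:, None]     # padding positions excluded from the mean
pooled = (vectors * mask).sum(axis=0) / mask.sum()

# Identical to averaging only the real tokens:
print(np.allclose(pooled, emb_table[[5, 12, 3]].mean(axis=0)))  # True
```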

Combined model

With both UserModel and MovieModel defined, we can put together a combined model and implement our loss and metrics logic.

Note that we also need to make sure that the query model and candidate model output embeddings of compatible size. Because we'll be varying their sizes by adding more features, the easiest way to accomplish this is to use a dense projection layer after each model:

class MovielensModel(tfrs.models.Model):

  def __init__(self, use_timestamps):
    super().__init__()
    self.query_model = tf.keras.Sequential([
      UserModel(use_timestamps),
      tf.keras.layers.Dense(32)
    ])
    self.candidate_model = tf.keras.Sequential([
      MovieModel(),
      tf.keras.layers.Dense(32)
    ])
    self.task = tfrs.tasks.Retrieval(
        metrics=tfrs.metrics.FactorizedTopK(
            candidates=movies.batch(128).map(self.candidate_model),
        ),
    )

  def compute_loss(self, features, training=False):
    # We only pass the user id and timestamp features into the query model. This
    # is to ensure that the training inputs would have the same keys as the
    # query inputs. Otherwise the discrepancy in input structure would cause an
    # error when loading the query model after saving it.
    query_embeddings = self.query_model({
        "user_id": features["user_id"],
        "timestamp": features["timestamp"],
    })
    movie_embeddings = self.candidate_model(features["movie_title"])

    return self.task(query_embeddings, movie_embeddings)
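The `Dense(32)` projections matter because the retrieval task scores each (query, candidate) pair with a dot product, which is only defined when the two embeddings have the same width. A numpy sketch with hypothetical pre-projection widths (random matrices stand in for the learned Dense weights):

```python
import numpy as np

rng = np.random.RandomState(42)
query_raw = rng.randn(2, 65)   # hypothetical concatenated query features
cand_raw = rng.randn(2, 64)    # hypothetical concatenated movie features

# A Dense(32) layer is an affine map down to a common 32-dim space.
W_q = rng.randn(65, 32)
W_c = rng.randn(64, 32)
query = query_raw @ W_q        # (2, 32)
cand = cand_raw @ W_c          # (2, 32)

scores = query @ cand.T        # (2, 2): affinity of every query/candidate pair
print(scores.shape)
```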

Experiments

Prepare the data

We first split the data into a training set and a testing set.

tf.random.set_seed(42)
shuffled = ratings.shuffle(100_000, seed=42, reshuffle_each_iteration=False)

train = shuffled.take(80_000)
test = shuffled.skip(80_000).take(20_000)

cached_train = train.shuffle(100_000).batch(2048)
cached_test = test.batch(4096).cache()
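Note that `take` and `skip` only yield disjoint train and test sets because `reshuffle_each_iteration=False` freezes a single shuffle order; with per-epoch reshuffling, the two calls would traverse different orders and the splits could overlap. In plain-Python terms:

```python
import random

data = list(range(100_000))
rng = random.Random(42)
rng.shuffle(data)        # one fixed order, like reshuffle_each_iteration=False

train = data[:80_000]    # shuffled.take(80_000)
test = data[80_000:]     # shuffled.skip(80_000).take(20_000)

print(len(train), len(test))        # 80000 20000
print(len(set(train) & set(test)))  # 0: the splits are disjoint
```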

Baseline: no timestamp features

We're ready to try out our first model: let's start with not using timestamp features to establish our baseline.

model = MovielensModel(use_timestamps=False)
model.compile(optimizer=tf.keras.optimizers.Adagrad(0.1))

model.fit(cached_train, epochs=3)

train_accuracy = model.evaluate(
    cached_train, return_dict=True)["factorized_top_k/top_100_categorical_accuracy"]
test_accuracy = model.evaluate(
    cached_test, return_dict=True)["factorized_top_k/top_100_categorical_accuracy"]

print(f"Top-100 accuracy (train): {train_accuracy:.2f}.")
print(f"Top-100 accuracy (test): {test_accuracy:.2f}.")

This gives us a baseline top-100 accuracy of around 0.2.
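Top-100 categorical accuracy measures, for each test interaction, whether the true movie appears among the model's 100 highest-scoring candidates. A numpy sketch of the computation, using random scores in place of real model outputs:

```python
import numpy as np

rng = np.random.RandomState(0)
scores = rng.randn(3, 1000)          # 3 queries scored against 1,000 candidates
true_idx = np.array([10, 500, 999])  # index of the movie actually watched

k = 100
topk = np.argsort(-scores, axis=1)[:, :k]  # the k best candidates per query
hits = [t in set(row) for t, row in zip(true_idx, topk)]
accuracy = float(np.mean(hits))
print(0.0 <= accuracy <= 1.0)  # True
```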

Capturing time dynamics with time features

Do the results change if we add time features?

model = MovielensModel(use_timestamps=True)
model.compile(optimizer=tf.keras.optimizers.Adagrad(0.1))

model.fit(cached_train, epochs=3)

train_accuracy = model.evaluate(
    cached_train, return_dict=True)["factorized_top_k/top_100_categorical_accuracy"]
test_accuracy = model.evaluate(
    cached_test, return_dict=True)["factorized_top_k/top_100_categorical_accuracy"]

print(f"Top-100 accuracy (train): {train_accuracy:.2f}.")
print(f"Top-100 accuracy (test): {test_accuracy:.2f}.")

This is quite a bit better: not only is the training accuracy much higher, but the test accuracy is also substantially improved.

Code link: https://codechina.csdn.net/csdn_codechina/enterprise_technology/-/blob/master/NLP_recommend/Taking%20advantage%20of%20context%20features.ipynb
