Learning General-Purpose User Representations through In-App Logs

Introduction

User modeling is the process of creating mathematical and computational representations of users that encode relevant characteristics (e.g., preferences, needs) for downstream applications like content recommendation, malicious behavior detection, and churn forecasting. An increasingly popular approach in user modeling leverages user behavioral logs, such as click streams on web pages and interaction logs in apps. These logs inherently take the form of sequences, similar to sentences in natural languages (e.g., English, German, Chinese), with the caveat that the tokens correspond to units of user behavior rather than words or word pieces. Given the recent prominence of foundational language models for downstream applications, exploring foundational behavioral models is an appealing direction with relatively untapped potential.

In our recent work, we offer a case study of how foundational behavioral models can play a role in various user-centric prediction tasks. Figure 1 illustrates the framework. We leverage language modeling techniques from natural language processing to learn from user behavioral logs, enabling the effective learning of general-purpose user representations. However, applying language modeling methods to use cases beyond natural language requires careful adaptation. While the approach has proven effective for search engines and e-commerce platforms, less is understood about how and when it generalizes to social platforms like Snapchat, where users interact with multiple product surfaces and multiple tasks are of interest (beyond conventional next-item prediction). Our recent work, General-Purpose User Modeling with Behavioral Logs: A Snapchat Case Study, to be presented at SIGIR 2024, aims to address these two research gaps with a case study using Snapchat data. In this work:

  1. We demonstrate the effectiveness of self-supervised language modeling techniques in learning general-purpose user representations for Snapchat through rigorous experiments and evaluations.
  2. We implement a customized Transformer-based user model with two training objectives: Masked Behavior Prediction and User Contrastive Learning. Additionally, we adopt a novel position encoding method, Attention with Linear Biases (ALiBi), to allow inference on user sequences longer than those seen at training time. We show that such customization based on data characteristics and task goals is necessary for improved learning of user representations.
  3. Lastly, we introduce three new downstream tasks unseen in previous user modeling research: Reported Account Prediction, Ad View Time Prediction, and Account Self-Deletion Prediction. These tasks concern diverse areas spanning user safety, ad engagement, and churn, three pivotal topics in user research.
Figure 1
Figure 1: Typical foundational behavioral model design: First, user behavioral sequences are used to train a user model that learns general-purpose user representations. At inference time, the model generates representations from new behavioral sequences, which are used for downstream tasks.

Modeling Choices

To guide our model design and evaluation, we define five criteria for our user model:

  1. The user model is solely trained on user behavioral data.
  2. The user model's training objective is not tied directly to specific downstream tasks.
  3. The user representations encode behavioral information.
  4. The user representations capture user-specific information that can distinguish one user from another.
  5. The user model can perform distinct downstream tasks.

Criterion 1 concerns the fact that user logs typically contain noisy events unrelated to user actions (e.g., app notifications, error reports), which should be left out. In our study, we meticulously examine all log events and curate a shortlist of events that are purely behavioral and initiated by user actions. Additionally, we take care to use a large and randomly selected sample of Snapchat users and their behavioral sequences to ensure diverse representation of the user community.
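
As a rough sketch of this curation step (the event names below are purely illustrative; the actual curated list is internal to the study):

```python
# Hypothetical allowlist of purely behavioral, user-initiated events.
# These names are illustrative assumptions, not the paper's actual event taxonomy.
USER_INITIATED_EVENTS = {"chat_send", "snap_send", "story_view", "camera_open"}

def curate_sequence(raw_events):
    """Keep only events directly triggered by user actions, dropping system
    noise such as push notifications or error reports."""
    return [e for e in raw_events if e in USER_INITIATED_EVENTS]
```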

Criteria 2, 3, and 4 concern the choice of model training objectives; Masked Behavior Prediction and User Contrastive Learning are two suitable choices. The former involves randomly masking parts of users' behavior sequences, compelling the model to predict the masked behaviors from their context and thus allowing the model to learn user behavioral information (fulfilling Criterion 3). The latter uses a contrastive loss function to maximize the distance between representations of different users and minimize the distance between representations of the same user based on behavioral sequences from different time points. Hence, the model learns user-specific information that distinguishes one user from another (fulfilling Criterion 4). Since neither objective is tied to any downstream goal, Criterion 2 is also fulfilled.
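
To make the two objectives concrete, here is a minimal NumPy sketch, not the paper's actual implementation: the masking rate, temperature, and InfoNCE-style formulation of the contrastive loss are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_behaviors(seq, mask_token="[MASK]", rate=0.15):
    """Masked Behavior Prediction input: hide a random subset of behavior
    tokens; the model is trained to recover them from the surrounding context."""
    masked, targets = list(seq), {}
    for i in range(len(seq)):
        if rng.random() < rate:
            targets[i] = seq[i]       # ground truth the model must predict
            masked[i] = mask_token
    return masked, targets

def ucl_loss(z_a, z_b, temperature=0.1):
    """User Contrastive Learning (InfoNCE-style sketch): row i of z_a and z_b
    are representations of the SAME user from two different time windows
    (a positive pair); all other rows in the batch serve as negatives."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = (z_a @ z_b.T) / temperature                  # pairwise similarities
    logits -= logits.max(axis=1, keepdims=True)           # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.diag(log_probs).mean()                     # pull positives together
```

The loss shrinks when same-user representations are close and different-user representations are far apart, which is exactly the user-distinguishability property Criterion 4 asks for.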

Criterion 5 concerns model evaluation, emphasizing that the learned user representations should generalize to different downstream tasks. To this end, we introduce three distinct downstream tasks: Reported Account Prediction, Ad View Time Prediction, and Account Self-Deletion Prediction. They concern predicting accounts/users who get reported by other users (e.g., for displaying malicious behaviors), who engage with an ad above a certain duration threshold, and who voluntarily delete their own accounts.
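
One common way to run such an evaluation, sketched here as an assumption rather than the paper's exact setup, is a linear probe: the user model stays frozen, and a small logistic-regression head is fit per downstream task on top of its representations.

```python
import numpy as np

def probe(train_z, train_y, test_z, lr=0.1, steps=500):
    """Fit a logistic-regression head on frozen user representations (train_z)
    with binary labels (train_y), then score held-out representations (test_z).
    The user model itself is never updated; only w and b are learned."""
    w = np.zeros(train_z.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(train_z @ w + b)))   # predicted probabilities
        grad = p - train_y                              # gradient of log loss
        w -= lr * train_z.T @ grad / len(train_y)
        b -= lr * grad.mean()
    return 1.0 / (1.0 + np.exp(-(test_z @ w + b)))
```

If the frozen representations carry task-relevant signal, even this simple head achieves strong scores, which is what makes probing a clean test of generalization across tasks.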

Results

Our Model

We utilize the Transformer architecture with two customized training objectives: Masked Behavior Prediction and User Contrastive Learning. Additionally, we apply Attention with Linear Biases (ALiBi).
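
ALiBi drops learned positional embeddings in favor of a penalty on attention logits that grows linearly with query-key distance; because the penalty is defined for any distance, the model can run on sequences longer than those seen at training time. A minimal sketch follows, assuming the symmetric (non-causal) variant for a bidirectional encoder and the geometric slope schedule from the ALiBi paper; the exact head configuration here is illustrative.

```python
import numpy as np

def alibi_bias(n_heads, seq_len):
    """Per-head linear attention biases, shape (n_heads, seq_len, seq_len).
    Added to attention logits before softmax; no positional embedding needed,
    so seq_len at inference can exceed seq_len at training."""
    # Geometric slope schedule: head h gets slope 2^(-8h / n_heads)
    slopes = 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)
    # |i - j|: distance between query position i and key position j
    pos = np.arange(seq_len)
    dist = np.abs(pos[:, None] - pos[None, :])
    # Symmetric (bidirectional) variant: penalize far-away tokens linearly
    return -slopes[:, None, None] * dist[None, :, :]
```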

Baseline Approaches

  • Term Frequency (TF) and Term Frequency - Inverse Document Frequency (TF-IDF).
  • Skip-Gram with Negative Sampling (SGNS), which learns a fixed vector for each user behavior by using the trained weights from a two-layer neural network that learns to predict context behaviors from a given target behavior.
  • Untrained user representations, where a fixed vector is randomly generated for each unique user behavior. This can be considered a "high-dimensional" TF approach, and has shown competitive performance.
  • Transformer Encoder (Enc) and Decoder (Dec) are our implementations of BERT and GPT-2 for user modeling, sharing the same architectures as BERT-base and GPT-2 (117M). We train them with the Masked Behavior Prediction and Next Behavior Prediction objectives, respectively.
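
For concreteness, a minimal sketch of the TF-IDF baseline over behavior tokens, treating each user's sequence as a bag of behaviors (the vocabulary construction and unsmoothed IDF here are illustrative choices, not necessarily the paper's exact setup):

```python
import numpy as np
from collections import Counter

def tf_idf_representations(user_sequences):
    """Map each user's behavior sequence to a TF-IDF vector over behavior tokens."""
    vocab = sorted({tok for seq in user_sequences for tok in seq})
    index = {tok: i for i, tok in enumerate(vocab)}
    tf = np.zeros((len(user_sequences), len(vocab)))
    for u, seq in enumerate(user_sequences):
        for tok, count in Counter(seq).items():
            tf[u, index[tok]] = count / len(seq)   # term frequency
    df = (tf > 0).sum(axis=0)                      # users containing each token
    idf = np.log(len(user_sequences) / df)         # unsmoothed IDF, for simplicity
    return tf * idf, vocab
```

The TF baseline is the same computation without the IDF weighting; both give order-free user vectors, which is why sequence-aware models have room to improve on them.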

Selected Findings

Figure 2
Figure 2: Prediction performance for three downstream tasks across time gaps.

Figure 2 shows the results of our model and baseline models on the three downstream tasks across different time gaps, demonstrating the usefulness of user representations at different levels of staleness. We highlight two observations. First, our model consistently outperforms all baselines (except at time gaps 1 and 5 in Reported Account Prediction). Second, our model can detect malicious accounts (reported by other users) and predict user account self-deletion with high AUC scores up to one week in advance. However, our model predicts ad view time less well, though still better than the baselines and chance. This is expected, as ad view time also depends on other important factors like ad content and users' cognitive states.

Table 1
Table 1: Ablation Study on Masked Behavior Prediction (MBP) objective, User Contrastive Learning (UCL) objective, and ALiBi, evaluated on 6 tasks: MBP, User Representation Similarity Analysis (URSA), User Retrieval (UR), Reported Account Prediction (RAP), Ad View Time Prediction (AVTP), and Account Self-deletion Prediction (ASP). ACC (accuracy), COS (cosine difference), MRR (Mean Reciprocal Rank), and AUC are converted to the same scale.

Table 1 shows the impact of User Contrastive Learning (UCL) and ALiBi on our model's performance. Without UCL, our model improves slightly on Masked Behavior Prediction but significantly underperforms on the other tasks. Likewise, without ALiBi, performance suffers across all evaluation tasks. These results further show that naively applying language models to user modeling should be avoided.

Conclusion

This work is a case study exploring the use of language modeling techniques for modeling user behavior on Snapchat. We show that naive application of language modeling techniques to behavioral tasks is suboptimal, and that incorporating user distinguishability into the loss function improves task performance. Moreover, we show that ALiBi lets us overcome the inference challenges posed by behavioral sequences longer than those seen during training. Our work only just scratches the surface of what's possible with foundational behavioral models, and we plan to continue exploring this area more deeply through diverse token definitions, more complex and feature-rich event sequences, and better strategies for self-supervision. If you're interested in this line of work, come find us at SIGIR 2024!