Asset Embeddings: A New Language for Markets

This paper introduces asset embeddings—vector representations of stocks learned from investor portfolios using techniques like BERT and Word2Vec. These embeddings outperform traditional characteristics in explaining return comovement, offering a new framework for understanding investor behavior.

Takeaway:
Embeddings trained on investor holdings reveal powerful latent asset characteristics that outperform traditional finance models in explaining valuations, returns, and portfolio structure.


Key Idea: What Is This Paper About?

This paper introduces asset embeddings—vector representations of stocks learned from investor portfolio holdings using transformer and language modeling techniques (like BERT/GPT). By treating investor portfolios like “sentences,” the authors extract hidden firm characteristics that explain returns, valuations, and investor behavior more effectively than traditional accounting data or risk factors.


Economic Rationale: Why Should This Work?

Portfolios reflect investor beliefs about firms’ risk, return, and fundamentals. Asset embeddings capture this rich, high-dimensional information structure.

Relevant Economic Theories and Justifications:

  • Demand System Asset Pricing: Investors reveal preferences through holdings
  • Latent Factor Models: Asset returns are driven by unobserved characteristics
  • Information Frictions: Embeddings reflect signals that traditional data miss

Why It Matters:
Observable firm characteristics explain only a slice of return variation. Embeddings can be learned from real-world investor behavior to uncover deeper, alpha-generating insights.


Data, Model, and Strategy Implementation

Data Used

  • Data Sources: CRSP, Compustat, 13F filings, mutual fund & ETF holdings
  • Time Period: 2000–2021
  • Asset Universe: US public equities

Model / Methodology

  • Type of Model: Recommender system, Word2Vec, PCA, AssetBERT (transformer)
  • Key Features:
    • Portfolios are treated like sentences (ranked stock positions)
    • AssetBERT masks and predicts holdings like BERT predicts words
    • Embeddings are trained using holdings levels, ranks, or rebalancing
    • Both asset and investor embeddings are generated
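The PCA variant of this pipeline can be sketched in a few lines: factor a (investors × assets) holdings matrix with a truncated SVD so that the right singular vectors become asset embeddings and the left singular vectors become investor embeddings. This is a minimal illustration on synthetic data, not the paper's exact implementation; all dimensions and names are assumptions.

```python
import numpy as np

# Minimal sketch: asset and investor embeddings from a synthetic
# (investors x assets) holdings matrix via truncated SVD (PCA-style).
rng = np.random.default_rng(0)
n_investors, n_assets, dim = 200, 50, 8

# Synthetic latent structure: investors tilt toward hidden asset "styles".
true_styles = rng.normal(size=(n_assets, 3))
tastes = rng.normal(size=(n_investors, 3))
noise = rng.normal(scale=0.5, size=(n_investors, n_assets))
holdings = np.maximum(tastes @ true_styles.T + noise, 0)  # no short positions

# Row-normalize to portfolio weights, demean, then keep top singular vectors.
weights = holdings / holdings.sum(axis=1, keepdims=True)
X = weights - weights.mean(axis=0)
U, S, Vt = np.linalg.svd(X, full_matrices=False)
asset_embeddings = Vt[:dim].T * S[:dim]   # (n_assets x dim)
investor_embeddings = U[:, :dim]          # (n_investors x dim)

print(asset_embeddings.shape, investor_embeddings.shape)
```

Both embedding matrices fall out of a single factorization, mirroring the paper's point that holdings data jointly characterize assets and the investors who hold them.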

Training Setup (AssetBERT):
For each investor i, the assets ai(1), ai(2), ..., ai(A) are ordered by decreasing holding size.
The model is trained to predict masked stocks in the ranked portfolio, analogous to masked language modeling in BERT.
The "sentence" Apple, IBM, Tesla, ..., Walmart
is treated like "The Fed decided to ___ rates to fight inflation": the model learns the structure of holdings the way BERT learns the structure of sentences.
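The masked-prediction setup can be sketched in plain Python: take a portfolio ranked by holding size and emit one training example per masked position. This builds a leave-one-out example for every position (a simplification: BERT masks roughly 15% of tokens at random); the tickers and helper name are illustrative, not from the paper.

```python
MASK = "[MASK]"

def make_masked_examples(portfolio):
    """Yield (masked_portfolio, position, target) for each rank in turn."""
    for pos, ticker in enumerate(portfolio):
        masked = list(portfolio)
        masked[pos] = MASK          # hide one holding, keep its rank slot
        yield masked, pos, ticker   # the model must recover the ticker

# A portfolio "sentence": tickers ranked by decreasing holding size.
portfolio = ["AAPL", "MSFT", "NVDA", "JNJ", "XOM", "WMT"]
for masked, pos, target in make_masked_examples(portfolio):
    print(masked, "-> predict", target, "at rank", pos)
```

Because position encodes holding rank, the model learns not just which stocks co-occur in portfolios but where in the ranking they tend to sit.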


Trading Strategy (Conceptual Applications)

  • Signal Generation: Use embeddings to spot over-/under-valued firms or crowding
  • Portfolio Construction: Build thematic/factor-like portfolios using embedding similarity
  • Macro Sentiment: Track how asset exposures shift with investor preferences
  • Rebalancing Frequency: Quarterly or rolling updates using holdings data
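The similarity-based portfolio construction above reduces to nearest-neighbor lookups in embedding space. A minimal sketch, using synthetic embeddings and cosine similarity (the ticker list, dimensions, and `nearest` helper are all illustrative assumptions):

```python
import numpy as np

# Sketch: rank assets by cosine similarity to a target stock's embedding,
# a building block for embedding-based thematic portfolios.
rng = np.random.default_rng(1)
tickers = ["AAPL", "MSFT", "NVDA", "JPM", "XOM", "WMT"]
E = rng.normal(size=(len(tickers), 8))  # one synthetic 8-d embedding per asset

def nearest(ticker, k=2):
    """Return the k assets with the most similar embeddings to `ticker`."""
    i = tickers.index(ticker)
    normed = E / np.linalg.norm(E, axis=1, keepdims=True)
    sims = normed @ normed[i]           # cosine similarity to the target
    order = np.argsort(-sims)           # descending similarity
    return [tickers[j] for j in order if j != i][:k]

print(nearest("AAPL"))
```

With real holdings-trained embeddings, such neighbors would be stocks held by similar investors in similar rank positions, which is why they can proxy for crowding or thematic exposure.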

Final Thought

💡 Portfolio data is the new language of markets. Embeddings are how we learn to speak it. 🧠📊


Paper Details (For Further Reading)

  • Title: Asset Embeddings
  • Authors: Xavier Gabaix, Ralph S.J. Koijen, Robert J. Richmond, Motohiro Yogo
  • Publication Year: 2023
  • Journal/Source: SSRN Preprint
  • Link: https://ssrn.com/abstract=4572831
