Sentiment & Alt Data: What's Real, What's Not

"Alternative data" is the marketing label for everything that isn't price, volume, or fundamentals: news sentiment, social media chatter, satellite imagery, credit-card flows, web-scraping signals. The promise is alpha that isn't in the public price feed. The reality is mostly noise dressed as edge. This article covers what kinds of alt data have produced documented edge, what hasn't, and how to evaluate any new alt-data claim.

The Categories

News Sentiment

Algorithmic scoring of news article tone — positive, negative, neutral — applied to specific tickers. Tetlock (2007) showed that media tone (specifically pessimism in major news outlets) predicts short-term equity returns: high-pessimism days are followed by short-term declines, then partial reversal. The effect is real but small (basis points per day at the index level) and competes against transaction costs.

For single-name strategies, news sentiment matters most around scheduled events (earnings, M&A), where it captures the market's read of the news before slower retail flow reacts.

Social Media / Retail Sentiment

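Twitter chatter, Reddit forums, retail-broker order flow, search activity. Da, Engelberg & Gao (2015) built a FEARS index from the volume of Google searches for household-concern terms such as "recession" and "unemployment" and found that it predicts short-term return reversals in equities. The effect is detectable but noisy; quality varies enormously across data providers.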

The 2021 meme-stock episode (GameStop, AMC) demonstrated that retail flow can move individual stocks dramatically in some regimes, but the regime itself is rare and hard to time.

Satellite / Geospatial

Parking lot car counts at retailers, oil tanker movements, container ship traffic, agricultural acreage. These data sources have generated headlines for predicting earnings or commodity supply, but the alpha has been competed away rapidly. Most of the documented edges are in commodity-related single-name and futures positions, with high data costs that make retail use uneconomic.

Credit Card / Payment Flows

Aggregated retail spending data sold by data providers. Useful for predicting same-store-sales surprises in retail names. Meaningful in single-name research; expensive in subscription cost; mostly available only to institutional users.

Web Scraping

Pricing pages, job postings, app downloads, review counts. The Indeed-job-postings → company-growth relationship has been documented in academic work. Most retail-accessible web-scraping signals are noisy and require very specific contexts to be useful.

The Decay Curve

Once an alt-data signal becomes widely known, the alpha decays quickly — often within 1–3 years. Signals that worked in academic backtests on 2010–2015 data often produce no edge in 2023+ live trading. Always check whether the signal's edge has been persistent in recent data.
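
One simple way to check persistence is to split a signal's strategy returns by calendar period and compare the Sharpe ratio in each. The sketch below is illustrative only: the function names (annualized_sharpe, sharpe_by_year), the 252-day annualization, the zero risk-free rate, and the sample returns are all assumptions, not part of any specific platform.

```python
import math
import statistics
from collections import defaultdict

TRADING_DAYS = 252  # assumed annualization factor for daily returns


def annualized_sharpe(daily_returns):
    """Annualized Sharpe ratio of daily returns (risk-free rate assumed 0)."""
    if len(daily_returns) < 2:
        return float("nan")
    sd = statistics.stdev(daily_returns)
    if sd == 0:
        return float("nan")
    return statistics.mean(daily_returns) / sd * math.sqrt(TRADING_DAYS)


def sharpe_by_year(dated_returns):
    """Group (year, daily_return) pairs by year and compute each year's Sharpe."""
    buckets = defaultdict(list)
    for year, ret in dated_returns:
        buckets[year].append(ret)
    return {year: annualized_sharpe(rets) for year, rets in sorted(buckets.items())}


# Made-up returns: a positive drift in 2014, none in 2023.
sample = [(2014, r) for r in (0.004, -0.001, 0.003, 0.002, -0.002, 0.005)]
sample += [(2023, r) for r in (0.002, -0.002, 0.001, -0.001, 0.002, -0.002)]

for year, value in sharpe_by_year(sample).items():
    print(year, round(value, 2))
```

A Sharpe that is healthy in the early years and near zero in the most recent ones is the decay pattern described above. With real data, each bucket needs far more observations than this toy sample for the comparison to mean anything.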

What Tends to Be Real

News sentiment around scheduled events. Reactions to earnings, FDA decisions, M&A. Real, fast, narrowly applicable.
Aggregate fear/greed indices. Persistent but small effects on broad indices. Useful as regime filters.
Specific-context geospatial. Oil tanker tracking for energy, container ship counts for retail. Real but increasingly competed.
Macroeconomic nowcasting. High-frequency credit-card and payment data can improve GDP nowcasts. Mostly relevant for macro and rates strategies.
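
As a rough illustration of the "regime filter" use mentioned for aggregate fear/greed indices, the sketch below stays flat on days that follow a high index reading. The function name apply_regime_filter, the threshold, and the data are hypothetical, and the example does not show that such a filter helps; it only shows the mechanics of applying one without using same-day information.

```python
def apply_regime_filter(strategy_returns, fear_index, threshold):
    """Stay flat on any day that follows an index reading above the threshold.

    fear_index[t] is assumed to be published at the close of day t, so it
    can only gate the position held on day t + 1. The first day has no
    prior reading and is treated as flat.
    """
    filtered = [0.0]
    for t in range(1, len(strategy_returns)):
        risk_on = fear_index[t - 1] <= threshold
        filtered.append(strategy_returns[t] if risk_on else 0.0)
    return filtered


# Made-up daily strategy returns and index readings.
strategy_returns = [0.01, 0.005, -0.02, 0.01, -0.03, 0.008]
fear_index = [20, 45, 25, 50, 30, 22]

filtered = apply_regime_filter(strategy_returns, fear_index, threshold=40)
print(filtered)
```

The one-day delay between the index reading and the position it gates is the important design choice: gating day t on the day-t reading would quietly introduce look-ahead.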

What Tends to Be Hype

"AI sentiment scores" without methodology disclosure. Black-box signals you can't audit. Often have look-ahead bias in their backtests.
Twitter sentiment for individual stocks. Most signal is noise. The big retail-meme moves are concentrated in a small number of names and rare regimes.
"Quantum AI" / "Neural network" signals from data vendors. Marketing language correlates inversely with real edge.
Vendor-supplied backtests. Always optimistic. The vendor selected the period, the universe, and the parameters. Re-test in your own framework before believing.
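
A quick screen for look-ahead bias in a black-box score is to compare how well it lines up with same-period returns versus returns one period later. The sketch below uses made-up data in which the "signal" is simply the same-day return plus noise; pearson and lag_check are hypothetical helper names, and a real check would use the vendor's timestamps rather than an assumed one-period lag.

```python
import statistics


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def lag_check(signal, returns, lag=1):
    """Return (same-period correlation, correlation after delaying the signal).

    signal[t] and returns[t] share a timestamp. A tradable signal observed
    at t can only earn returns[t + lag] or later. lag must be >= 1.
    """
    same_period = pearson(signal, returns)
    lagged = pearson(signal[:-lag], returns[lag:])
    return same_period, lagged


# Made-up example: the score is the same-day return plus a little noise,
# which is what a score built from end-of-day data would look like.
returns = [0.01, -0.02, 0.015, -0.005, 0.02, -0.01, 0.005, -0.015]
noise = [0.001, -0.001, 0.002, 0.0, -0.002, 0.001, 0.0, -0.001]
signal = [r + n for r, n in zip(returns, noise)]

same_period, lagged = lag_check(signal, returns)
print(f"same-period corr: {same_period:.2f}, lagged corr: {lagged:.2f}")
```

If nearly all of the relationship sits in the same-period number and disappears once the signal is delayed, the backtest was probably using information that was not available at trade time.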

Evaluating Any Alt-Data Claim

Mechanism. Is there a plausible economic story for why this data should predict returns? "Satellite parking-lot counts predict same-store sales which predict earnings surprises" is a story. "Reddit upvote counts predict next-month returns" needs more.
Data quality and history. How long is the historical record? Has the data construction methodology changed? How is missing data handled?
Out-of-sample. Does the signal still work after any academic paper describing it was published? Documented anomalies typically lose a large share of their edge, often half or more, in the years after publication.
Cost vs edge. How much does the data cost? What's the after-fees Sharpe of the strategy? For most retail use, alt-data subscription costs exceed the edge.
Capacity. Even if the signal works, how much capital can it absorb? Many alt-data edges are in low-capacity strategies that institutional money has already saturated.
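
The cost-versus-edge question above comes down to simple arithmetic: a fixed annual subscription is a drag on return that shrinks as capital grows. The sketch below uses hypothetical numbers and hypothetical function names (net_sharpe, breakeven_capital), and it ignores transaction costs and capacity limits.

```python
def net_sharpe(gross_annual_return, annual_vol, data_cost, capital):
    """Sharpe after subtracting the data subscription as a drag on return.

    gross_annual_return and annual_vol are fractions (0.04 = 4%);
    data_cost and capital are in the same currency.
    """
    cost_drag = data_cost / capital
    return (gross_annual_return - cost_drag) / annual_vol


def breakeven_capital(gross_annual_return, data_cost):
    """Capital at which the signal's gross profit just pays for the data."""
    return data_cost / gross_annual_return


# Hypothetical numbers, for illustration only.
gross_return = 0.04      # 4% annual return attributed to the signal
vol = 0.08               # 8% annual volatility
subscription = 20_000    # annual data cost

for capital in (100_000, 500_000, 5_000_000):
    print(capital, round(net_sharpe(gross_return, vol, subscription, capital), 2))
print("break-even capital:", round(breakeven_capital(gross_return, subscription)))
```

Under these assumed numbers the account has to be around 500,000 just to cover the subscription, and the net Sharpe only approaches the gross figure at several times that size. That is the sense in which subscription costs exceed the edge for most retail accounts.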

How QuanterLab Approaches Alt Data

The Supplementary Data module exposes a small number of well-validated alt-data sources (macro indicators, calendar events, sentiment indices) rather than offering an everything-bucket. The philosophy: a few honest signals are more useful than many marketing-driven ones. Users can build custom alt-data integrations when needed, but the platform doesn't pretend that exotic data sources are the path to alpha.

The Bottom Line

Most alt data is more interesting than profitable. The signals that have produced sustained edge are narrow, mechanism-grounded, and increasingly competed. Treat alt-data offerings the same way you'd treat a strategy backtest: skeptically, with explicit walk-forward validation, with realistic cost assumptions, and with awareness that publication and adoption typically destroy edge faster than discovery creates it.
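
For readers who want to see what "explicit walk-forward validation" looks like mechanically, here is a minimal sketch. The trading rule (pick the direction that made money in the training window) is deliberately trivial, and the function names, window sizes, and data are assumptions made for illustration, not a recommended strategy.

```python
def walk_forward_splits(n_obs, train_size, test_size):
    """Yield (train_indices, test_indices) for rolling walk-forward windows.

    Each test window starts right after its training window, so nothing
    from the test period is used when fitting.
    """
    start = 0
    while start + train_size + test_size <= n_obs:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += test_size


def walk_forward_returns(signal, returns, train_size, test_size):
    """Out-of-sample strategy returns from a deliberately simple rule.

    In each training window, pick the trade direction (+1 or -1) that would
    have made money, then apply that direction to the next test window.
    signal[t] is assumed to be known before returns[t] is earned.
    """
    oos = []
    for train, test in walk_forward_splits(len(returns), train_size, test_size):
        in_sample_pnl = sum(signal[t] * returns[t] for t in train)
        direction = 1 if in_sample_pnl >= 0 else -1
        oos.extend(direction * signal[t] * returns[t] for t in test)
    return oos


# Made-up data: the signal lines up with returns early on, then flips sign.
signal = [1, -1, 1, 1, -1, 1, -1, -1, 1, -1, 1, 1]
returns = [0.01, -0.01, 0.02, 0.01, -0.02, 0.01, 0.01, 0.01, -0.01, 0.01, -0.02, -0.01]

oos = walk_forward_returns(signal, returns, train_size=4, test_size=2)
print("out-of-sample observations:", len(oos))
print("out-of-sample total return:", round(sum(oos), 4))
```

Only the concatenated test-window returns should be used to judge the signal. The in-sample numbers from each training window are, by construction, flattering.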

Academic references

Baker, M. & Wurgler, J. (2006). Investor Sentiment and the Cross-Section of Stock Returns. Journal of Finance, 61(4), 1645–1680.
Tetlock, P. C. (2007). Giving Content to Investor Sentiment: The Role of Media in the Stock Market. Journal of Finance, 62(3), 1139–1168.
Da, Z., Engelberg, J. & Gao, P. (2015). The Sum of All FEARS: Investor Sentiment and Asset Prices. Review of Financial Studies, 28(1), 1–32.

Textbook references

López de Prado, M. (2018). Advances in Financial Machine Learning. Wiley.

Try it in QuanterLab

When evaluating any vendor-supplied alt-data signal, replicate the vendor backtest in your own framework with your costs and your universe. Most vendor-claimed Sharpes drop by 30–60% in honest re-tests.
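
The core of such a re-test is charging your own costs against the vendor's positions. The sketch below is generic Python, not QuanterLab's API: the function names, the 5 bps cost per unit of turnover, the flat starting position, and the data are all assumptions.

```python
import math
import statistics


def sharpe(returns, periods_per_year=252):
    """Annualized Sharpe ratio (risk-free rate assumed 0)."""
    return statistics.mean(returns) / statistics.stdev(returns) * math.sqrt(periods_per_year)


def net_returns(positions, asset_returns, cost_per_unit_turnover):
    """Strategy returns after charging a cost on every change in position.

    positions[t] is the position held while asset_returns[t] is earned.
    The starting position is assumed flat (0).
    """
    out = []
    prev = 0.0
    for pos, ret in zip(positions, asset_returns):
        turnover = abs(pos - prev)
        out.append(pos * ret - turnover * cost_per_unit_turnover)
        prev = pos
    return out


# Made-up daily data for a signal that flips position often.
positions = [1, -1, 1, 1, -1, 1, -1, 1]
asset_returns = [0.004, -0.003, 0.002, -0.001, -0.004, 0.003, 0.001, 0.002]

gross = net_returns(positions, asset_returns, cost_per_unit_turnover=0.0)
net = net_returns(positions, asset_returns, cost_per_unit_turnover=0.0005)  # assumed 5 bps

print("gross Sharpe:", round(sharpe(gross), 2))
print("net Sharpe:  ", round(sharpe(net), 2))
```

High-turnover signals are hit hardest, since every flip from long to short pays the cost twice. Sharpe figures from a sample this small are not meaningful in themselves; the point is the gap between the gross and net lines.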