About StockHark
What StockHark is
StockHark monitors public conversations across Reddit and select online communities to surface emerging stock trends and market sentiment. Using automated data collection, natural language processing, and structured aggregation, it produces concise metrics and alerts showing what these communities are discussing in near real time.
How it works — detailed methodology
StockHark converts noisy public conversations into structured signals via a layered pipeline. The steps below walk through each stage in order.
- Collection: we continuously collect posts and comments from a curated list of finance-focused Reddit communities using authenticated API clients and rate-limited collectors. Collections capture metadata (subreddit, author, timestamp, upvotes) alongside the raw text so we can weight signals later.
- Normalization & entity mapping: raw text is tokenized and normalized (lowercasing, punctuation cleanup, lemmatization). We map company names and shorthand to tickers using a permissive name-to-symbol map plus heuristics that consider surrounding context (e.g. 'Tesla' vs 'TSLA', or '$TSLA'). This step also filters out false positives using simple rules (e.g. common words that look like tickers) and disambiguates company names.
- Sentiment & intensity scoring: each mention is scored with an ensemble of two approaches: FinBERT (the primary model) for contextual sentiment, and a compact rule-based analyzer used for speed and as a fallback. We also compute an intensity score (how emphatic the mention is) using features such as sentiment magnitude, punctuation, and community engagement.
- Aggregation & weighting: mention-level signals are aggregated into per-symbol metrics that incorporate recency, source reliability, and author credibility. Source reliability weights reduce noise from meme-heavy communities, while recency weighting emphasizes recent surges.
- Signal detection & alerting: aggregated metrics are compared against user-configured thresholds and site-wide heuristics (e.g. Top-25 override). If the thresholds are met, alerts are queued and deduplicated using a cooldown window to prevent alert fatigue.
- Human review & safety: anomaly detection filters flag outlier signals for manual review if they show suspicious patterns (bot amplification, sudden spam spikes). Alerts also include context links so recipients can quickly verify the source posts.
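To make the entity-mapping step above more concrete, here is a minimal sketch of how cashtags, company names, and bare tickers might be resolved while rejecting common words that merely look like tickers. The maps, word lists, and the function name extract_symbols are hypothetical and only illustrate the idea; they are not StockHark's actual tables or rules.

```python
import re

# Hypothetical lookup tables, for illustration only.
NAME_TO_SYMBOL = {"tesla": "TSLA", "apple": "AAPL", "gamestop": "GME"}
KNOWN_SYMBOLS = {"TSLA", "AAPL", "GME", "IT", "ALL"}
# Valid tickers that are usually ordinary English words when written bare.
AMBIGUOUS_WORDS = {"IT", "ALL", "ARE", "FOR", "NOW"}

CASHTAG = re.compile(r"\$([A-Za-z]{1,5})\b")
WORD = re.compile(r"[A-Za-z]+")


def extract_symbols(text: str) -> set[str]:
    """Return the set of ticker symbols mentioned in a piece of text."""
    found = set()

    # A cashtag such as "$TSLA" is an explicit signal, so it is accepted
    # even when the symbol is an ambiguous word like "IT".
    for match in CASHTAG.finditer(text):
        symbol = match.group(1).upper()
        if symbol in KNOWN_SYMBOLS:
            found.add(symbol)

    for word in WORD.findall(text):
        lowered = word.lower()
        if lowered in NAME_TO_SYMBOL:
            # Company name, e.g. "Tesla" -> "TSLA".
            found.add(NAME_TO_SYMBOL[lowered])
        elif word.isupper() and word in KNOWN_SYMBOLS and word not in AMBIGUOUS_WORDS:
            # Bare uppercase ticker that is not a common English word.
            found.add(word)

    return found
```

For example, extract_symbols("Tesla is ripping, $tsla to the moon") resolves both the name and the cashtag to the single symbol TSLA, while "IT IS ALL OVER FOR GME" yields only GME because the bare words IT and ALL are treated as ordinary English.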
For privacy and safety, we only surface public posts and do not store or share private messages. The methodology balances automation with pragmatic rules to reduce false positives while surfacing early signals.
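The aggregation and alerting steps described in this section can be sketched as follows. This is an illustrative example under stated assumptions: the exponential half-life decay, the per-source weights, the default weight for unknown sources, and the names Mention, aggregate, and AlertGate are all hypothetical, since the actual parameters are not published.

```python
from dataclasses import dataclass


@dataclass
class Mention:
    symbol: str
    sentiment: float   # assumed range [-1, 1]
    source: str        # community name
    age_hours: float   # time since the post was made


# Hypothetical reliability weights; meme-heavy sources count for less.
SOURCE_WEIGHT = {"investing": 1.0, "wallstreetbets": 0.5}
DEFAULT_SOURCE_WEIGHT = 0.75
HALF_LIFE_HOURS = 6.0  # assumed: a mention loses half its weight every 6 hours


def mention_weight(m: Mention) -> float:
    """Combine recency decay with source reliability."""
    recency = 0.5 ** (m.age_hours / HALF_LIFE_HOURS)
    return recency * SOURCE_WEIGHT.get(m.source, DEFAULT_SOURCE_WEIGHT)


def aggregate(mentions: list[Mention]) -> dict[str, dict[str, float]]:
    """Roll mention-level signals up into per-symbol metrics."""
    totals: dict[str, dict[str, float]] = {}
    for m in mentions:
        w = mention_weight(m)
        t = totals.setdefault(m.symbol, {"mentions": 0, "score": 0.0, "weighted": 0.0})
        t["mentions"] += 1
        t["score"] += w                  # weighted mention volume
        t["weighted"] += w * m.sentiment
    return {
        symbol: {
            "mentions": t["mentions"],
            "score": t["score"],
            "sentiment": t["weighted"] / t["score"] if t["score"] else 0.0,
        }
        for symbol, t in totals.items()
    }


class AlertGate:
    """Fire an alert when a score meets the threshold, at most once per cooldown window."""

    def __init__(self, threshold: float, cooldown_hours: float):
        self.threshold = threshold
        self.cooldown_hours = cooldown_hours
        self._last_alert: dict[str, float] = {}

    def should_alert(self, symbol: str, score: float, now_hours: float) -> bool:
        if score < self.threshold:
            return False
        last = self._last_alert.get(symbol)
        if last is not None and now_hours - last < self.cooldown_hours:
            return False  # still cooling down, so suppress the duplicate
        self._last_alert[symbol] = now_hours
        return True
```

In this sketch a fresh mention from a higher-weight source dominates an older mention from a lower-weight one, and the gate records the time of each fired alert so that repeated threshold crossings inside the cooldown window are dropped rather than queued again.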
What you get — Dashboard & Alerts
Dashboard
Prioritized list of trending stocks with mention counts, sentiment, and quick links to the posts driving the signal. Ideal for spotting momentum early.
Responsible use & limitations
- Informational only — not investment advice.
- Sampled public conversations; not comprehensive.
- Signals are probabilistic and require corroboration.
Note: This page summarizes methodology and monitored communities at a high level. Specific model parameters and training data are proprietary.