I research quantitative signals — patterns in data that predict outcomes. One paper is in finance, one in biostatistics. The fields are different, but the work is the same: engineer features from raw data, build a model, and test whether the signal holds up on data the model has never seen.
The finance paper starts from public data: SEC Form 4 filings. When a corporate insider buys their own company's stock, it's a signal, and that much is well-documented. What's less studied is whether the context around the purchase changes the strength of that signal: the price trend leading up to it, the insider's role, the filing delay, and how many other insiders bought around the same time.
I pulled 17,237 open-market purchase filings from 2018–2024 and engineered 40+ features capturing momentum context, filing behavior, transaction clustering, and firm liquidity. I trained an XGBoost classifier to rank purchases by conviction.
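To make the feature families above concrete, here is a minimal sketch of three context features: filing delay, pre-disclosure run-up, and transaction clustering. It is illustrative only and not the pipeline from the paper. The `Purchase` record, the function names, the calendar-day delay, and the 30-day clustering window are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Purchase:
    """Hypothetical, simplified view of one open-market purchase filing."""
    ticker: str
    insider_id: str
    trade_date: date
    filing_date: date


def filing_delay_days(p: Purchase) -> int:
    """Calendar days between the trade and its public filing (assumed unit)."""
    return (p.filing_date - p.trade_date).days


def pre_disclosure_run_up(closes: list[float], lookback: int) -> float:
    """Simple return over the last `lookback` periods.

    `closes` is assumed to be ordered oldest to newest and to end on the
    last close before the filing became public.
    """
    if lookback < 1 or len(closes) <= lookback:
        raise ValueError("need more than `lookback` closing prices")
    return closes[-1] / closes[-1 - lookback] - 1.0


def cluster_size(p: Purchase, all_purchases: list[Purchase],
                 window_days: int = 30) -> int:
    """Number of *other* insiders buying the same ticker within the window."""
    others = {
        q.insider_id
        for q in all_purchases
        if q.ticker == p.ticker
        and q.insider_id != p.insider_id
        and abs((q.trade_date - p.trade_date).days) <= window_days
    }
    return len(others)
```

Each function returns one number per purchase, so the outputs can be collected into one row of a feature matrix for whatever classifier is trained afterwards.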
The key finding: context materially changes the strength of the signal. Purchases that follow a pre-disclosure run-up of more than 10% are followed by a 30-day cumulative abnormal return of 6.3%. Purchases made into price weakness are followed by 2.3%. Same event type, very different predictive power. I used the model's output to construct a dollar-neutral long-short portfolio that achieved a Sharpe ratio above 1.5 after transaction costs.
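The quantities in this paragraph can be sketched in a few lines. The example below is a hedged illustration, not the paper's methodology: it assumes a market-adjusted abnormal return (stock return minus benchmark return), equal weights within the long and short legs, a zero risk-free rate, and returns that are already net of transaction costs. All function names and the 20% quantile are hypothetical.

```python
import math
from statistics import mean, stdev


def cumulative_abnormal_return(stock_returns, benchmark_returns):
    """Sum of per-period (stock - benchmark) returns over the event window."""
    if len(stock_returns) != len(benchmark_returns):
        raise ValueError("return series must have the same length")
    return sum(s - b for s, b in zip(stock_returns, benchmark_returns))


def dollar_neutral_weights(scores, quantile=0.2):
    """Long the top-scored names and short the bottom-scored names.

    `scores` maps a name to a model score. Each leg is equal-weighted and
    sums to +1 (long) or -1 (short), so net dollar exposure is zero.
    """
    k = max(1, int(len(scores) * quantile))
    if 2 * k > len(scores):
        raise ValueError("not enough names for separate long and short legs")
    ranked = sorted(scores, key=scores.get)  # ascending by score
    weights = {name: 0.0 for name in scores}
    for name in ranked[-k:]:
        weights[name] = 1.0 / k
    for name in ranked[:k]:
        weights[name] = -1.0 / k
    return weights


def annualized_sharpe(period_returns, periods_per_year=252):
    """Mean over sample standard deviation, scaled to a yearly horizon.

    Assumes the inputs are excess returns (risk-free rate of zero) and
    already net of transaction costs.
    """
    return mean(period_returns) / stdev(period_returns) * math.sqrt(periods_per_year)
```

Because the long and short legs carry equal and opposite dollar amounts, the portfolio's return depends on the spread between high- and low-scored purchases rather than on the direction of the overall market.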
The biostatistics paper was my first research project, and it started with a practical problem: individual gene markers are unreliable across patient cohorts. A gene that performs well in one study often fails in another. I wanted to find features that were stable across populations.
I ran a random-effects meta-analysis across 6 independent gene-expression datasets. Individual gene expression levels were inconsistent, but certain ratios between genes held steady across all 6 cohorts. I built a composite ratio-based classifier on top of these stable features. On independent validation it achieved an AUC of 0.92 with 85% sensitivity, up from an AUC of 0.52 using individual genes alone.
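For readers unfamiliar with random-effects pooling, the sketch below shows one common estimator, DerSimonian-Laird. This is an assumption for illustration: the text above does not say which estimator was used. It takes one effect size per cohort (for example, the case-versus-control difference in a log gene ratio) together with its sampling variance. A between-cohort variance `tau2` near zero indicates a feature that behaves consistently across cohorts.

```python
import math


def random_effects_pool(effects, variances):
    """DerSimonian-Laird random-effects pooling.

    effects   : one effect-size estimate per cohort
    variances : the within-cohort sampling variance of each estimate
    Returns (pooled_effect, standard_error, tau2).
    """
    if len(effects) != len(variances) or len(effects) < 2:
        raise ValueError("need matching inputs from at least two cohorts")
    k = len(effects)

    # Fixed-effect (inverse-variance) estimate, needed for Cochran's Q.
    w = [1.0 / v for v in variances]
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw

    # Cochran's Q measures heterogeneity around the fixed-effect estimate.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sw - sum(wi ** 2 for wi in w) / sw

    # Method-of-moments between-cohort variance, truncated at zero.
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0

    # Re-weight each cohort by its total (within + between) variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return pooled, se, tau2
```

Running this once per candidate feature and keeping those with a large pooled effect and a small `tau2` is one plausible way to screen for cross-cohort stability.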
The lesson carried forward into everything I've done since: predictive power often lives in the relationships between features, not in any single feature on its own.