STS Top-400 — What Wins · Tian2

01 · Evidence base

Eleven winners, read in full — application and paper.

The complete materials (application + research paper) of all 11 winners, read word for word, plus one official top-scoring exemplar.

This report is built by reading the complete submissions — the full application (Tasks 1–10) and the actual research paper — of 11 confirmed "On The Table" (OTT) winners, deliberately spanning the entire OTT range: from rank #23 (18.6/20) down to rank #302 (16.5/20). The calibrated 19.33/20 gold-standard exemplar anchors the very top. Twelve independent deep-reads were synthesised against the official 4-criterion rubric and the full 463-row score sheet.

Read this caveat first

The deep-read corpus is Physics only (the category we hold files for), n = 11, and is not a random sample. Elite academics, research-magnet schools, and university-lab access are correlates and structural advantages — not proven causes. Treat what follows as a high-resolution map of what strong submissions share, not a formula.

02 · How the cut works

Top 400 by Z-score ∪ Top 350 by raw score.

About 2,471 applications → 463 "on the table" (≈ 19%). The union of two yardsticks.

"On The Table" is not a single ranked list. A project is admitted if it lands in the Top 400 by Z-score or the Top 350 by raw average, out of 2,471 applications (the largest entrant pool since 1967). The four official selection groups reconstruct from the score sheet within ±1 (ties at the "not-scored" sentinel).

Group	Definition	Official count	Reconstructed	What it means
BLUE — both	Top by Z and by raw	220	219	Unambiguously strong on both lenses.
PINK — Z only	Top 400 by Z only	127	127	Rescued by Z: strong relative to a harshly-graded category/panel.
GREEN — raw only	Top 350 by raw only	85	85	High absolute score, lower Z (lenient grading context).
YELLOW — outliers	Hand-pulled exceptions	31	32	Judgement calls outside the formula.
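
The union rule and the three formula-driven groups can be sketched in a few lines. This is a minimal illustration, not the Society's actual procedure: the function name `classify`, the toy pool, and the shrunken cut-offs are all hypothetical, ties are ignored, and the hand-pulled YELLOW group is not modelled because it sits outside the formula.

```python
def classify(projects, top_z=400, top_raw=350):
    """Label each admitted project by which top-list(s) it lands in.

    projects: dict mapping a project id to a (z_score, raw_score) pair.
    Returns a dict mapping admitted ids to "BLUE", "PINK" or "GREEN".
    Ties at the cut-off are NOT handled (an assumption of this sketch).
    """
    by_z = sorted(projects, key=lambda p: projects[p][0], reverse=True)
    by_raw = sorted(projects, key=lambda p: projects[p][1], reverse=True)
    in_z = set(by_z[:top_z])
    in_raw = set(by_raw[:top_raw])

    groups = {}
    for pid in in_z | in_raw:  # union of the two top-lists
        if pid in in_z and pid in in_raw:
            groups[pid] = "BLUE"   # top on both lenses
        elif pid in in_z:
            groups[pid] = "PINK"   # top by Z only
        else:
            groups[pid] = "GREEN"  # top by raw only
    return groups


# Hypothetical toy pool: (z_score, raw_score) per project.
pool = {
    "A": (2.1, 18.4),
    "B": (1.6, 15.9),
    "C": (0.4, 17.8),
    "D": (0.1, 15.0),
    "E": (1.9, 17.9),
}
groups = classify(pool, top_z=3, top_raw=3)
print(sorted(groups.items()))
```

In the toy pool, project B is admitted only because of its Z-score and project C only because of its raw score, which is exactly the hedging effect the union is meant to provide.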

Mean raw score: 16.68/20. Median 16.75, sd 1.12. The whole OTT tier sits in a tight ~3-point band (deciles 15.2 → 18.1).

Mean Z-score: 0.97. Range −0.23 to +2.46. Z normalises away category and evaluator severity.

"Not scored" rows: these carry a 9999 sentinel. They are ranked in, but parked pending plagiarism / COI / AI / team screening.

03 · The key insight

The two rankings barely agree.

Z rank and raw-score rank are almost uncorrelated (r ≈ 0.18).

Across all 463 projects, rank-by-Z and rank-by-raw correlate at only r ≈ 0.175 — nearly independent. A project can be #50 on one lens and #380 on the other. That low correlation is the entire reason the Society takes a union of two top-lists: it hedges against evaluator severity and category difficulty. If raw score alone were used, 127 projects would have been dropped; if Z alone, 85 would have been dropped.
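
For readers who want to reproduce this kind of figure on their own data, a correlation between two rankings can be computed as a Pearson correlation on the ranks (a Spearman coefficient). The sketch below is an assumption about the method, since the report does not state exactly how r was calculated; the helper names and the five-project sample are hypothetical, and ties are not handled.

```python
from statistics import mean


def ranks(values):
    """Rank values from 1 (highest) downward; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def rank_correlation(z_scores, raw_scores):
    """Correlation between rank-by-Z and rank-by-raw."""
    return pearson(ranks(z_scores), ranks(raw_scores))


# Hypothetical five-project sample, invented for illustration.
z_scores = [2.1, 1.6, 0.4, 0.1, 1.9]
raw_scores = [18.4, 15.9, 17.8, 15.0, 17.9]
rho = rank_correlation(z_scores, raw_scores)
print(round(rho, 3))  # 0.9 for this toy sample
```

The toy sample agrees far more closely than the real score sheet does; a value near 0.175 means knowing a project's raw rank tells you very little about its Z rank.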

Why this matters to you

Scoring carries real contextual variance — which panel and which category you're measured against meaningfully moves your standing. The rational response: maximise the things you control — rigour, independence, communication, honesty — rather than over-indexing on a "glamorous" topic.

04 · Why some score higher

The raw bar is not the same for everyone.

The raw-score bar differs widely between disciplines; the Z-score exists to cancel out that difference.

Evaluators in different fields grade harder or softer, so a raw score is not comparable across categories. The clearest illustration: Behavioral Sciences had the lowest mean raw (15.50) but the highest mean Z (+1.15), while Chemistry's high raw (17.67) came with a lower Z (+0.87). A 16.5 in a hard-graded field can be more selective than a 16.5 in a soft-graded one.

Category	OTT n	Mean raw	Mean Z
Chemistry	21	17.67	+0.87
Biochemistry	13	17.44	+0.83
Physics (deep-read corpus)	24	17.43	+0.90
Medicine & Health	64	17.25	+0.81
Cellular & Molecular Biology	45	16.85	+0.93
Comp. Bio & Bioinformatics	37	16.72	+1.00
Environmental Science	37	16.53	+0.96
Computer Science	27	16.07	+1.11
Social Sciences	10	15.83	+0.87
Behavioral Sciences	37	15.50	+1.15
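
The effect of grading severity can be illustrated with a standard z-score, computed against the scores the same panel gave to everyone else. This is a sketch under assumptions: the report does not specify the exact normalisation the Society uses, and both panels below are invented numbers, not data from the score sheet.

```python
from statistics import mean, pstdev


def z_score(raw, panel_scores):
    """Standardise one raw score against the scores given by the same panel."""
    return (raw - mean(panel_scores)) / pstdev(panel_scores)


# Hypothetical panels, invented for illustration only.
harsh_panel = [13.0, 14.0, 15.0, 16.0, 17.0]    # mean 15.0
lenient_panel = [15.5, 16.5, 17.5, 18.5, 19.5]  # mean 17.5

z_harsh = z_score(16.5, harsh_panel)      # above the harsh panel's mean
z_lenient = z_score(16.5, lenient_panel)  # below the lenient panel's mean
print(round(z_harsh, 2), round(z_lenient, 2))
```

The same raw 16.5 comes out positive against the harsh panel and negative against the lenient one, which is the sense in which an identical raw score can be more or less selective depending on context.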

Structural concentration is large

About 12 elite research high schools supply ~25% of the entire table, and New York + California together account for roughly half of all OTT projects — the places with research classes, mentor pipelines, and university-lab proximity.

Top OTT high schools

Bronx Science 28 · NC School of Sci & Math 16 · Thomas Jefferson HSST 10 · Jericho 10 · Stuyvesant 8 · Bergen Academies 8 · Montgomery Blair 7 · JFK 7 · Ossining 7 · Harker 6 · Great Neck South 6 · Herricks 5

Top states (by residence)

New York 35% · California 14% · New Jersey 6% · Texas 5% · North Carolina 5% · Virginia 4% · Massachusetts 4% · Florida 4%
Gender on the table: 51% M / 47% F — tracks the entrant pool, not a selection lever.
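
The concentration claims above can be checked with simple arithmetic from the listed counts. The snippet only re-adds numbers already given in this section (school counts, state percentages, and the 463-project table size); the variable names are illustrative.

```python
# School counts exactly as listed under "Top OTT high schools".
school_counts = {
    "Bronx Science": 28,
    "NC School of Sci & Math": 16,
    "Thomas Jefferson HSST": 10,
    "Jericho": 10,
    "Stuyvesant": 8,
    "Bergen Academies": 8,
    "Montgomery Blair": 7,
    "JFK": 7,
    "Ossining": 7,
    "Harker": 6,
    "Great Neck South": 6,
    "Herricks": 5,
}
TOTAL_OTT = 463  # size of the full table

top_school_projects = sum(school_counts.values())
school_share = top_school_projects / TOTAL_OTT

# State shares (percent) as listed under "Top states (by residence)".
ny_plus_ca = 35 + 14

print(top_school_projects, round(100 * school_share, 1), ny_plus_ca)
```

Twelve schools account for 118 of the 463 projects, a little over a quarter, and New York plus California come to 49%, consistent with "roughly half."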

05 · The mental model

Floor × Differentiators.

Clear the four-criterion floor first, then stack differentiators on top; you need both.

1 · Clear the floor

The composite is a sum of four equal 25% criteria, not a maximum — so weakness on any one caps the total. Every winner, even the #302 floor case, cleared a high bar on all four.

2 · Stack differentiators

What separated the 18+ scorers from the 16.5 floor was a small set of repeatable "moves." Floor-OTT projects had the table stakes but few differentiators; top-OTT projects stacked several.
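
The "sum, not maximum" point under "Clear the floor" can be made concrete. The sketch below assumes each criterion is rated on a 0 to 5 scale so that four of them add up to a /20 total; that scale is inferred from the "4–5/5" and "/20" figures in this report rather than confirmed by the rubric, and the sample ratings are hypothetical.

```python
CRITERIA = (
    "entry_form",
    "scientific_merit",
    "student_contribution",
    "scientific_potential",
)


def composite(ratings):
    """Sum four equally weighted ratings (assumed 0-5 each) into a /20 total."""
    return sum(ratings[name] for name in CRITERIA)


def best_possible(ratings, weak):
    """Highest total reachable if every criterion except `weak` were a perfect 5."""
    return ratings[weak] + 5.0 * (len(CRITERIA) - 1)


# Hypothetical submission: strong everywhere except student contribution.
ratings = {
    "entry_form": 4.5,
    "scientific_merit": 4.5,
    "student_contribution": 3.0,
    "scientific_potential": 4.5,
}
total = composite(ratings)
ceiling = best_possible(ratings, "student_contribution")
print(total, ceiling)  # 16.5 18.0
```

With one criterion stuck at 3.0, even perfect marks elsewhere cannot lift the total above 18.0, which is why fixing the weakest criterion usually pays more than polishing the strongest.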

06 · The floor

What every winner had.

If you cannot honestly check all four, the priority is fixing the gap — not polishing a strength.

Entry Form (academic profile) · 25%

SAT 1510–1600 (most ≥1550) or ACT 35–36; 6–8 AP 5s; genuine college coursework (multivariable calculus, linear algebra, analysis, quantum); a founded/led activity. No winner had merely "good" academics.

Scientific Merit · 25%

A project rated 4–5/5 for sophistication — graduate-level methods (PDE/spectral solvers, DFT, Monte-Carlo, real statistics), never a science-fair demonstration. Even the floor cases were graduate-level — just confirmatory.

Student Contribution · 25%

The student demonstrably did the core work (code, analysis, derivations) and the mentor corroborated it specifically — naming the student's independent contributions phase by phase, not just offering generic praise.

Scientific Potential · 25%

An accessible layperson summary with a real hook or analogy, plus a distinctive identity essay that signals a genuine intellectual stake — not a résumé in essay form.

07 · The differentiators

Six moves that move you up.

The repeatable patterns that separated 18+ from the 16.5 floor. Stack as many as your project allows — cross-validation and naming limitations are nearly free.

Cross-validate every key claim with a second independent method

The strongest rigour signal in the entire corpus. The top-scoring projects combined theory, simulation, and physical experiment so that each method provides independent confirmation. Redundancy reads as rigour — and it is the single most consistent trait of 18+ submissions.

Contribute something new — don't just run an existing tool

Add a term, a method, a generalization, or a fitting approach. The clearest distinction between high and floor OTT scorers: floor projects reproduced a result already known analytically; top projects derived, invented, or extended something the field did not have before.

A rigorous null result is a winning result

This holds when the null is framed as shrinking the field's search space. The strongest null-result submissions ruled out a competing hypothesis with multiple statistical tests and reported the finding honestly, without trying to spin a "no" into a "yes." Evaluators reward this.

Get an external validation stamp

First-authorship on a peer-reviewed paper, a published arXiv preprint, or a top-tier science fair removes the "is this really the student's work, and is it any good?" doubt entirely. ISEF Grand Awards, USAPhO Gold, and USAMO serve the same function for Entry Form.

Own — or visibly extend — the question, and bound the mentor's role

Top scorers either found the question themselves or took a seed and substantially extended it. Name exactly what the mentor gave (a dataset, stability tips, existing code) so the independent core is unmistakable. The mentor's letter must corroborate phase by phase, not just say "top 10%."

Quantify significance and repeat the number

A crisp, memorable figure threaded through the paper, the layperson summary, and the essays. "76% less power per chip." "Improve the mass error from 100 keV to 10 keV." "8.6% median error continent-wide." Evaluators remember and re-cite a number; vague claims of "significant improvement" do not survive.

08 · The bar per criterion

What a 5/5 looks like.

Criterion	Floor (≈4)	5/5 — the ceiling
Entry Form	1550+/35+, 6+ AP 5s, college math, one founded activity, solid recs.	Add national/international distinction (ISEF Grand Award, USAPhO/USAMO, National JSHS) and recommenders who independently rank you Top 1% with specific anecdotes — not "top 15%" with generic praise.
Scientific Merit	A graduate-level method, correctly executed, with one real control/validation.	A novel contribution on a significant question, cross-validated, on real or large data, with honestly quantified limitations. Ceiling-cappers: thin datasets, confirmatory results, short timelines (~7 weeks), or writing "Limitations: None."
Student Contribution	You did the core analysis/code; the mentor confirms.	You originated or substantially extended the question; you narrate specific failure-and-fix stories (rewrote the integrator, derived new boundary conditions, switched statistical test); the mentor letter corroborates phase-by-phase; help is honestly bounded. The hardest criterion to fake — and the most decisive at the top.
Scientific Potential	Clear writing, plausible trajectory.	A coherent multi-year through-line (escalating projects), external traction (invited talks, publications, peer-review service), and a distinctive intellectual identity that signals "future leader," not "good student."

09 · Red flags

What visibly cost points.

The biggest avoidable one — "Limitations: None / N/A"

Multiple lower-scored OTT winners left the limitations field empty or answered "N/A." Every real study has limitations; omitting them signals weak self-assessment and lowers the merit score. Always name 2–4 concrete limitations and why they matter.

Confirmatory dressed as discovery

Numerically reproducing a result already known analytically — including one already in the mentor's own preprint — caps Scientific Merit.

Thin data vs. big claims

Five grain shapes. Three one-hour trials. A modest N undermines strong conclusions, especially when limitations are not acknowledged.

Relative or assigned mentor

A parent as primary mentor on their own research material — even when disclosed — visibly capped perceived independence. An assigned-and-not-extended question scores below a self-originated one.

Thin or generic recommendations

A "Top 15%" letter with a copy-paste name error, or a near-empty mentor form, undercuts Student Contribution even when the science is strong.

Also observed: overclaiming impact ("revolutionize," "democratize") ahead of demonstrated results, without staging who must do what next.

10 · Playbook by access

No lab? You can still reach the top tier.

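Neither the two home-based independent entrants nor the top-scoring exemplar had a lab: they relied on theory/computation + public data + rigour.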

A · You have lab / RSI / program access

Take an unglamorous foundational task inside a real research code and own it end-to-end. Push toward a publishable or first-author output. Get the PI to specify your independent contributions in writing, phase by phase.

B · You have no lab (home / independent)

Choose a theory or computational topic where a laptop is the entire instrument — lab access then carries zero penalty. Mine public archives (Chandra, Hubble, MODIS, arXiv datasets) and make rigour your moat: justify sample selection, validate your pipeline against an established tool, test statistical assumptions, report nulls honestly. Turn "no mentor" into a documented strength through triple-verified independence.

Either way the project must be legible: a non-specialist evaluator should grasp the question, the novelty, and one quantified result within a page.

11 · Before you submit

Pre-submission scorecard.

A submission that checks most of these lands in OTT territory; checking nearly all is what 18+ looks like.

12 · Caveats & how to use

Read these limits honestly.

Physics-only, n = 11 deep corpus + one cross-domain exemplar. Patterns are directional, not a regression.
No scores for non-OTT projects exist in the source data, so "chosen vs. not" is a content inference, not a fitted model. A true driver model needs the scores for all ~2,471 entrants.
Correlation, not causation. Elite academics, magnet schools, and lab/RSI access are advantages many winners shared — but two fully home-based projects and the 19.33 exemplar prove they are not prerequisites.
The actionable core is what you control: rigour (cross-validate), a genuinely novel element, demonstrable and corroborated independence, honest limitations, and journalist-quotable communication.
Individual case analyses underlying this report are kept offline per Tian2's student-privacy policy. All named applicant details are excluded from this public page.

⁂

Part of the Tian2 Programs research library.