Data sources
CivicAlign pulls from these primary sources only:
- Congress.gov — bills, bill text, sponsors, and roll-call votes for the U.S. Senate and House.
- FEC.gov — federal campaign finance: total raised, cash on hand, top contributors. Used for U.S. Senate and U.S. House candidates only.
- State campaign finance agencies — Governor races and other state-level offices. The specific agency varies by state (e.g., Texas Ethics Commission, Florida Division of Elections). Governor finance is never sourced from the FEC because the FEC does not track state races.
- GovInfo.gov — committee reports and supplementary congressional documents.
- Member official websites and press releases — public statements used for the "said vs. voted" comparison.
We do not use Twitter/X, news aggregators, or third-party tracking sites as primary sources for factual claims.
How summaries are generated
- Ingest. New bills and roll-call votes from Congress.gov are pulled daily. Campaign finance is pulled nightly from the FEC bulk feed and from each state finance agency that publishes structured data.
- Chunk and summarize. Long bill text is broken into sections. Each section is passed through an LLM with explicit instructions to extract operative provisions, not editorialize. Output is constrained to factual restatement plus a short impact note.
- Attach source. Every summary carries the Congress.gov URL it was generated from. The citation travels with the summary — a snipped or shared summary keeps its source.
- Publish. Summaries appear under labels like "AI summary" or "AI title" with a visible link to the underlying source.
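The chunk, summarize, and attach-source steps above can be sketched roughly as follows. This is a minimal illustration, not the production pipeline: the `Summary` record, the `split_sections` helper, the `SEC. <n>.` heading pattern, and the `summarize_fn` callable (standing in for the LLM call) are all assumptions made for the example.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Summary:
    section_heading: str
    text: str
    source_url: str            # the citation is stored on the summary itself
    label: str = "AI summary"  # label shown on the published surface


def split_sections(bill_text: str) -> List[Tuple[str, str]]:
    """Split bill text at lines starting with 'SEC. <n>.' (assumed heading format)."""
    parts = re.split(r"(?m)^(?=SEC\. \d+\.)", bill_text)
    sections = []
    for part in parts:
        part = part.strip()
        if not part:
            continue
        heading, _, body = part.partition("\n")
        sections.append((heading.strip(), body.strip()))
    return sections


def summarize_bill(
    bill_text: str,
    source_url: str,
    summarize_fn: Callable[[str], str],  # hypothetical stand-in for the LLM call
) -> List[Summary]:
    """Summarize each section and attach the source URL to every result."""
    return [
        Summary(heading, summarize_fn(body), source_url)
        for heading, body in split_sections(bill_text)
    ]
```

Because the URL is a field on each `Summary` rather than on the page that renders it, a copied or shared summary still carries its citation.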
What's human-reviewed vs. AI-generated
| Surface | Human-reviewed | AI-generated |
|---|---|---|
| Bill text, vote tallies, sponsor names | Yes (verbatim from Congress.gov) | — |
| Campaign finance figures | Yes (verbatim from FEC / state agencies) | — |
| Plain-English bill summaries | — | Yes |
| One-sentence bill titles ("AI titles") | — | Yes |
| Said-vs-voted contradiction analyses | — | Yes |
| Stance scores ("strongly for / mixed / against") | — | Yes (derived from vote records) |
| Hero copy, navigation labels, FAQs | Yes | — |
Where a piece of content is AI-generated, the surface says so — "AI summary," "AI analysis," "AI title."
Update cadence
| Data | Refresh interval |
|---|---|
| New bills from Congress.gov | Daily |
| Roll-call vote records | Daily |
| AI-generated bill summaries | Generated on first ingest; regenerated only if the bill text changes |
| FEC campaign finance | Nightly (bulk import) |
| State campaign finance | Nightly, when the state publishes structured data |
| Public statements (said-vs-voted) | Ad hoc, generally weekly |
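One way to implement the "regenerated only if the bill text changes" rule in the table above is to store a fingerprint of the text each summary was generated from. The sketch below is an assumption about how that check could work; the function names and the whitespace normalization are illustrative and not taken from the CivicAlign codebase.

```python
import hashlib
from typing import Optional


def text_fingerprint(bill_text: str) -> str:
    """Hash the bill text after collapsing whitespace (normalization is an assumption)."""
    normalized = " ".join(bill_text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def needs_regeneration(stored_fingerprint: Optional[str], current_text: str) -> bool:
    """True on first ingest (no stored fingerprint) or when the text has changed."""
    return stored_fingerprint != text_fingerprint(current_text)
```

Collapsing whitespace before hashing keeps a purely cosmetic re-publication of the same text from triggering a new summary.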
How the bill-passage prediction works
Every bill on CivicAlign carries a passage-probability badge. That number comes from a two-engine system designed for transparency: a rule-based heuristic anchored to historical Congressional outcomes, and an ML.NET LightGBM classifier trained on resolved bills from previous Congresses. We take whichever produces the higher probability for each bill, with the source labeled on the tooltip.
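The selection rule described above (take the higher of the two probabilities and record which engine produced it) can be sketched in a few lines. The `Prediction` type, the source labels, and the tie-breaking choice are assumptions for illustration.

```python
from typing import NamedTuple


class Prediction(NamedTuple):
    probability: float  # value shown on the badge, in the range 0.0 to 1.0
    source: str         # engine shown in the tooltip: "heuristic" or "ml"


def combine(heuristic_p: float, ml_p: float) -> Prediction:
    """Return the higher probability with its source; ties go to the heuristic (assumed)."""
    if ml_p > heuristic_p:
        return Prediction(ml_p, "ml")
    return Prediction(heuristic_p, "heuristic")
```

For example, `combine(0.30, 0.42)` yields a badge of 0.42 labeled as coming from the ML engine.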
Base rates we calibrate against
Of bills introduced in a given Congress (roughly 10,000–15,000 over two years), only about 3–5% become law. That is the baseline every prediction starts from. The heuristic uses these published rates (from GovTrack and CRS analyses across the 113th–118th Congresses) as the base rate for each procedural stage:
| Stage | Base passage rate |
|---|---|
| Introduced | ~4% |
| In committee | ~5% |
| Reported out of committee | ~30% |
| Floor debate | ~42% |
| Passed one chamber | ~55% |
| In conference | ~75% |
| Sent to the President | ~95% |
Simple and concurrent resolutions (H.RES, S.RES, H.CON.RES, S.CON.RES) get separate, higher base rates because they do not require the President's signature.
Adjustments the heuristic layers on top
- Cosponsor momentum. Having 0–4 cosponsors costs 3 points; 50–100 adds 6 points; 200 or more adds 12 points.
- Bipartisan ratio. Cosponsors from the opposite party of the primary sponsor add up to +15 points — the single strongest non-procedural signal in real Congressional data.
- Staleness penalty. Bills with no recent action for 180+ days lose 8–15 points depending on stage.
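Putting the stage base rates and the three adjustments together, the heuristic could look something like the sketch below. The base rates and point values come from the text above; everything else is an assumption made for the example: the stage keys, treating unlisted cosponsor ranges as neutral, scaling the bipartisan bonus linearly with the opposite-party share, passing the stage-specific staleness penalty in as a parameter, and clamping the result to 0–100.

```python
# Base passage rates by procedural stage, in percentage points (from the table above).
BASE_RATE_PCT = {
    "introduced": 4,
    "in_committee": 5,
    "reported": 30,
    "floor_debate": 42,
    "passed_one_chamber": 55,
    "in_conference": 75,
    "sent_to_president": 95,
}


def cosponsor_points(n: int) -> float:
    """Cosponsor momentum. Ranges the text does not list are assumed neutral."""
    if n <= 4:
        return -3.0
    if 50 <= n <= 100:
        return 6.0
    if n >= 200:
        return 12.0
    return 0.0


def bipartisan_points(opposite_party_share: float) -> float:
    """Up to +15 points; linear scaling with the share is an assumption."""
    share = min(max(opposite_party_share, 0.0), 1.0)
    return 15.0 * share


def staleness_points(days_idle: int, stage_penalty: float = 8.0) -> float:
    """Lose 8 to 15 points after 180+ idle days; the per-stage value is supplied by the caller."""
    if days_idle < 180:
        return 0.0
    return -min(max(stage_penalty, 8.0), 15.0)


def heuristic_pct(stage: str, cosponsors: int, opposite_party_share: float,
                  days_idle: int, stage_penalty: float = 8.0) -> float:
    """Base rate plus adjustments, clamped to the 0-100 range (clamping is assumed)."""
    score = (BASE_RATE_PCT[stage]
             + cosponsor_points(cosponsors)
             + bipartisan_points(opposite_party_share)
             + staleness_points(days_idle, stage_penalty))
    return min(max(score, 0.0), 100.0)
```

Under these assumptions, a bill in committee with 60 cosponsors, half of them from the opposite party, and recent activity scores 5 + 6 + 7.5 = 18.5.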
What the ML model learns
The LightGBM classifier was trained on 31,244 resolved bills from closed Congresses. Of those, 1,728 were enacted, a 5.53% base rate that sits just above the published 3–5% range. The model learns from signals the heuristic doesn't see directly: cosponsor counts by party, withdrawal rates, committee complexity, amendment counts, related-bill counts, policy area, Congress era, sponsor party, and a 128-dimensional semantic embedding of each bill's plain-English summary. The embedding comes from text-embedding-3-small, a Matryoshka-trained OpenAI embedding model; we keep the leading 128 dims because they carry most of the variance. Stage is deliberately excluded from the ML features because it leaks the outcome in training data.
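Keeping the leading dimensions of a Matryoshka-style embedding amounts to slicing the vector. The sketch below also re-normalizes the slice to unit length, which is a common practice when truncating embeddings but is an assumption here; the function name is illustrative.

```python
import math
from typing import List, Sequence


def truncate_embedding(vec: Sequence[float], dims: int = 128) -> List[float]:
    """Keep the first `dims` components and rescale them to unit length."""
    head = list(vec[:dims])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # an all-zero vector cannot be normalized; return it unchanged
    return [x / norm for x in head]
```

Re-normalizing keeps cosine and dot-product comparisons between truncated vectors on the same scale as they were for the full-length vectors.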
Held-out test-set numbers after the most recent retrain (May 25, 2026):
| Metric | Value | What it means |
|---|---|---|
| AUC | 0.96 | Given a randomly chosen pair of bills, one that passed and one that failed, the model ranks the passed one higher about 96% of the time. |
| F1 | 0.61 | Harmonic mean of precision and recall on the positive class (becomes-law). Hard to push higher because positives are only ~5% of bills. |
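| Accuracy | 0.95 | Fraction of all predictions that match the ground truth on the held-out test set. Because roughly 94% of bills fail, predicting "fails" for every bill would already score about 0.94, so read this alongside AUC and F1. |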
Adding the summary embedding lifted AUC by 5 points and F1 by 26 points over the structural-features-only baseline. The model retrains automatically on the 1st of every month so that newly resolved bills join the training set.
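For readers who want to see how the three metrics in the table are defined, here is a small dependency-free sketch. It is not the evaluation code used for the reported numbers; the 0.5 decision threshold and the function names are assumptions.

```python
from typing import List, Sequence, Tuple


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Fraction of (passed, failed) pairs where the passed bill scores higher; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def f1_and_accuracy(labels: Sequence[int], scores: Sequence[float],
                    threshold: float = 0.5) -> Tuple[float, float]:
    """F1 on the positive class and overall accuracy at an assumed threshold."""
    preds: List[int] = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    accuracy = (tp + tn) / len(labels)
    return f1, accuracy
```

AUC depends only on how bills are ranked, while F1 and accuracy depend on the chosen threshold, which is why the three numbers can diverge on a dataset where positives are rare.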
What the prediction will not do
- It will not tell you which specific bills will pass — only how likely each one is.
- It does not model lobbying, public sentiment, current-events context, or political dealmaking that isn't visible in the structured Congress.gov record.
- It treats "rolled into omnibus" outcomes as "didn't pass" for the bill ID, even if the underlying policy made it into law.
- Predictions above 80% on a freshly introduced bill should be treated with skepticism: at that stage the base rate is ~4%, and most of the signal comes from sponsorship, which can change.
Verifying the model is calibrated
We snapshot every bill's probability daily and back-test it against actual outcomes. The admin accuracy dashboard groups snapshots by predicted probability (0–9%, 10–24%, 25–49%, 50–74%, 75–100%) and reports the actual pass rate per bucket. If the model is well-calibrated, those rates roughly match the bucket labels. If they drift, that's our cue to retrain or adjust features.
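The bucketed back-test described above can be sketched as follows. The bucket boundaries come from the text; the input shape (pairs of predicted percentage and final outcome), the handling of fractional percentages, and the function names are assumptions for illustration.

```python
from typing import Dict, Iterable, Optional, Tuple

# (inclusive lower bound, exclusive upper bound, label), in percentage points.
# Labels use ASCII hyphens; fractional values such as 9.5 fall in the lower bucket (assumed).
BUCKETS = [
    (0, 10, "0-9%"),
    (10, 25, "10-24%"),
    (25, 50, "25-49%"),
    (50, 75, "50-74%"),
    (75, 101, "75-100%"),
]


def bucket_label(predicted_pct: float) -> str:
    """Map a predicted probability in percent to its bucket label."""
    for low, high, label in BUCKETS:
        if low <= predicted_pct < high:
            return label
    raise ValueError(f"probability out of range: {predicted_pct}")


def calibration_report(
    snapshots: Iterable[Tuple[float, bool]],
) -> Dict[str, Optional[float]]:
    """Actual pass rate per bucket; None for buckets with no resolved snapshots."""
    totals = {label: [0, 0] for _, _, label in BUCKETS}  # label -> [count, passed]
    for predicted_pct, passed in snapshots:
        entry = totals[bucket_label(predicted_pct)]
        entry[0] += 1
        entry[1] += int(passed)
    return {
        label: (passed / count if count else None)
        for label, (count, passed) in totals.items()
    }
```

A well-calibrated model produces pass rates that fall inside each bucket's range, for example somewhere between 25% and 49% for the "25-49%" bucket.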
Known limitations
- State campaign finance coverage is uneven. Some state agencies do not publish structured data; for those, the UI shows "State finance source pending" rather than fabricating a number.
- Said-vs-voted requires a recorded floor vote. Statements made on bills that never reached a recorded vote can't be paired this way.
- AI summaries are statistical paraphrase, not legal interpretation. They reflect what a bill says in plain English; they do not predict how a court will interpret it.
- Stance scoring uses a recent window of roll-call history. A representative who recently changed position will show their current voting pattern, not their past one.
- We do not track state legislatures. CivicAlign covers the U.S. Senate, U.S. House, Governor races, and the federal corpus on Congress.gov.
Found a mistake?
If a fact is wrong, a citation is broken, or an analysis misses context, file a correction at /corrections. Resolved corrections are logged publicly.
Questions about the methodology? Email info@civic-align.com.
