NYC Nonprofit Search

Client-side matcher over data.csv. Search by EIN (exact) or NAME (SoftTF-IDF + Jaro–Winkler; optional phonetic recall). Results show a confidence score per item.

Search field

EIN (exact) | NAME (semantic)

Query (9-digit EIN or org name)

Max results

Advanced (candidate gen + scorer)

Top-K candidates (TF-IDF blocker)

Token match threshold (Jaro–Winkler)

Min SoftTF-IDF to show

Enable phonetic fallback (NYSIIS)

Sort: click a column header

Scorer: SoftTF-IDF (token TF-IDF × JW) + small domain boosts

EIN	NAME	CITY	STATE	ZIP	NTEE	RULING	ASSETS	INCOME	REVENUE	SoftTF-IDF	JW(full)	CONF	Hits

Search API

A same-origin API implemented via a Service Worker. It accepts GET or POST and returns JSON with the top results (default 5). Each item includes a confidence (0–1), a SoftTF-IDF score, and a Jaro–Winkler score.

Endpoints

GET /api/search?q=<query>&field=name|ein&limit=5&topk=80&tokThresh=0.85&minSoft=0&phonetic=true
POST /api/search with JSON body using the same fields.

Parameters

Param	Type	Default	Description
`q`	string	–	Query text (org name) or EIN (9 digits for `field=ein`).
`field`	string	`name`	`name` uses SoftTF-IDF+JW; `ein` is exact.
`limit`	int	5	Max items to return.
`topk`	int	80	Candidate count from TF-IDF blocker.
`tokThresh`	float	0.85	Min Jaro–Winkler for token matches in SoftTF-IDF.
`minSoft`	float	0	Min SoftTF-IDF to keep (after small boosts).
`phonetic`	bool	true	Use NYSIIS fallback when no lexical hits.
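
As an illustration of how the parameters above combine into a GET request, here is a minimal Python sketch. The helper name `build_search_url` is hypothetical, the base URL is taken from the curl examples on this page, and the client-side validation (nine ASCII digits for `field=ein`, rejecting unknown parameters) is an assumption about what the API would treat as bad input, not documented server behavior.

```python
import re
from urllib.parse import urlencode

# Documented defaults from the parameter table.
DEFAULTS = {
    "field": "name",
    "limit": 5,
    "topk": 80,
    "tokThresh": 0.85,
    "minSoft": 0,
    "phonetic": True,
}


def build_search_url(q, base="http://localhost:8000", **overrides):
    """Build a GET /api/search URL (hypothetical helper, not part of the API)."""
    q = q.strip()
    if not q:
        raise ValueError("q is required")
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    params = {**DEFAULTS, **overrides}
    if params["field"] not in ("name", "ein"):
        raise ValueError("field must be 'name' or 'ein'")
    # Assumption: an EIN query must be exactly nine ASCII digits.
    if params["field"] == "ein" and not re.fullmatch(r"[0-9]{9}", q):
        raise ValueError("EIN queries must be 9 digits")
    # Booleans are sent as lowercase strings, as in the endpoint example.
    params["phonetic"] = "true" if params["phonetic"] else "false"
    return f"{base}/api/search?" + urlencode({"q": q, **params})


print(build_search_url("habitat for humanity nyc"))
print(build_search_url("002022084", field="ein", limit=1))
```

Sending every parameter explicitly, even at its default, makes a logged request self-describing; omitting defaults would work equally well if the server applies them as documented.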

Responses

200 OK

{
  "meta": {
    "query": "alzheimer nyc",
    "field": "name",
    "limit_requested": 5,
    "limit_returned": 5,
    "candidates_considered": 80,
    "took_ms": 7,
    "thresholds": {"topk":80,"tokThresh":0.85,"minSoft":0,"phonetic":true},
    "corpus_size": 199873
  },
  "results": [
    {
      "rank": 1,
      "ein": "01XXXXXXX",
      "name": "ALZHEIMER'S ASSOCIATION GREATER NEW YORK CHAPTER",
      "city": "NEW YORK", "state": "NY", "zip": "100XX",
      "ntee": "H92",
      "assets": 12345678, "income": 2345678, "revenue": 1987654,
      "confidence": 0.964,
      "soft_tfidf": 0.951,
      "jw_full": 0.987,
      "hits": 6,
      "explain": {"numbersMatched":0,"rareTokenMatches":2}
    }
  ]
}

400 Bad Request (missing q or invalid parameters), 503 Service Unavailable (index still warming).
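
A caller might turn these responses into typed values along the following lines. This is a sketch under assumptions: the `Hit` dataclass, the two exception classes, and `parse_search_response` are hypothetical names, only a subset of the result fields is kept, and the bodies of 400 and 503 responses are not inspected because their format is not specified here.

```python
from __future__ import annotations

import json
from dataclasses import dataclass


@dataclass
class Hit:
    """Subset of one result item (field names follow the 200 OK example)."""
    rank: int
    ein: str
    name: str
    confidence: float
    soft_tfidf: float
    jw_full: float


class BadRequest(Exception):
    """Raised for 400: missing q or invalid parameters."""


class IndexWarming(Exception):
    """Raised for 503: the index is still warming; the caller may retry."""


def parse_search_response(status: int, body: str) -> list[Hit]:
    """Map an HTTP status and JSON body to a list of hits or an exception."""
    if status == 400:
        raise BadRequest("missing q or invalid parameters")
    if status == 503:
        raise IndexWarming("index still warming")
    if status != 200:
        raise RuntimeError(f"unexpected status {status}")
    payload = json.loads(body)
    return [
        Hit(
            rank=item["rank"],
            ein=item["ein"],
            name=item["name"],
            confidence=item["confidence"],
            soft_tfidf=item["soft_tfidf"],
            jw_full=item["jw_full"],
        )
        for item in payload["results"]
    ]
```

Keeping 503 as its own exception type lets a caller retry after a short delay without also retrying genuine request errors.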

Examples

# GET (name, top 5 default)
curl -G "http://localhost:8000/api/search" \
  --data-urlencode "q=habitat for humanity nyc"

# GET (ein)
curl -G "http://localhost:8000/api/search" \
  --data-urlencode "q=002022084" --data-urlencode "field=ein"

# POST (JSON)
curl -X POST "http://localhost:8000/api/search" \
  -H "Content-Type: application/json" \
  -d '{"q":"alzheimer new york","field":"name","limit":10,"topk":120,"tokThresh":0.86,"minSoft":0.1,"phonetic":true}'

Scoring & Confidence

SoftTF-IDF sums TF-IDF-weighted token matches, where a query token and a name token count as a match when their Jaro–Winkler similarity is ≥ tokThresh. We add small domain boosts for exact number matches (“PS 118”, “Chapter 12”) and for rare-token matches. confidence is a calibrated blend: 0.8×SoftTF-IDF + 0.2×JW(full), clipped to [0,1]. For exact EIN matches, confidence is 1.0.
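
The scoring described above can be sketched in Python as follows. This is not the page's actual implementation: it uses the textbook Jaro–Winkler definition (prefix scale 0.1, prefix capped at 4 characters) and the standard SoftTF-IDF formulation with unit-normalized TF-IDF vectors, omits the domain boosts, and takes a hypothetical `idf` dictionary in which unseen tokens default to a weight of 1.0.

```python
import math
from collections import Counter


def jaro(s, t):
    """Jaro similarity in [0, 1]."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit, t_hit = [False] * len(s), [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    s_seq = [c for c, hit in zip(s, s_hit) if hit]
    t_seq = [c for c, hit in zip(t, t_hit) if hit]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s, t, p=0.1):
    """Jaro similarity boosted by the shared prefix (up to 4 characters)."""
    base = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:4], t[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * p * (1 - base)


def unit_tfidf(tokens, idf):
    """L2-normalized TF-IDF weights; unseen tokens get idf 1.0 (assumption)."""
    raw = {w: count * idf.get(w, 1.0) for w, count in Counter(tokens).items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {w: v / norm for w, v in raw.items()} if norm else {}


def soft_tfidf(query_tokens, name_tokens, idf, tok_thresh=0.85):
    """Sum weight(q) * weight(closest name token) * JW over close pairs."""
    q, n = unit_tfidf(query_tokens, idf), unit_tfidf(name_tokens, idf)
    if not q or not n:
        return 0.0
    score = 0.0
    for w, w_weight in q.items():
        best = max(n, key=lambda v: jaro_winkler(w, v))
        sim = jaro_winkler(w, best)
        if sim >= tok_thresh:
            score += w_weight * n[best] * sim
    return min(score, 1.0)


def confidence(soft, jw_full):
    """0.8 x SoftTF-IDF + 0.2 x JW(full), clipped to [0, 1]."""
    return max(0.0, min(1.0, 0.8 * soft + 0.2 * jw_full))


# Hypothetical IDF weights, for illustration only.
idf = {"alzheimers": 3.0, "association": 1.2, "new": 0.8, "york": 0.8}
query = "alzheimer association new york"
name = "alzheimers association new york"
soft = soft_tfidf(query.split(), name.split(), idf)
print(round(soft, 3), round(confidence(soft, jaro_winkler(query, name)), 3))
```

Because the boosts are left out, numbers from this sketch need not match the confidence values the API returns.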