======================================================================
NER PII Benchmark — NerGuard Base × nvidia-pii
======================================================================

Tier: 2
Evaluated labels: 16
  System labels: 20
  Dataset labels: 54
  Mapping applied: True

Samples: 1000
Tokens: 135894

--- Token-Level Metrics ---
  Precision (macro/micro/weighted): 0.4455 / 0.5778 / 0.7508
  Recall    (macro/micro/weighted): 0.5244 / 0.6471 / 0.6471
  F1        (macro/micro/weighted): 0.4175 / 0.6105 / 0.6491

--- Entity-Level Metrics (seqeval) ---
  Precision: 0.5616
  Recall:    0.6619
  F1:        0.6076
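
Note: the entity-level F1 above agrees with the harmonic mean of the reported precision and recall, and the same holds for the token-level micro column. The macro column is not expected to satisfy this identity, because macro F1 is normally an average of per-tag F1 values rather than a function of macro precision and recall. A minimal sketch of the check, where the helper name f1_from_pr is illustrative and not part of the benchmark code:

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Entity-level values copied from the report.
entity_f1 = f1_from_pr(0.5616, 0.6619)
# Token-level micro values copied from the report.
micro_f1 = f1_from_pr(0.5778, 0.6471)

print(f"entity F1: {entity_f1:.4f}")  # entity F1: 0.6076
print(f"micro F1:  {micro_f1:.4f}")   # micro F1:  0.6105
```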

--- Latency ---
  Mean:   33.23 ms
  Median: 29.70 ms
  P95:    53.08 ms
  P99:    67.35 ms
  Throughput: 30.1 samples/sec
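
Note: the throughput figure matches the reciprocal of the mean latency. That is consistent with samples being processed one at a time, although the report does not state the batch size, so treat that as an assumption. A short sketch of the arithmetic, with a hypothetical helper name:

```python
def throughput_from_mean_latency(mean_latency_ms: float) -> float:
    """Samples per second, assuming strictly sequential single-sample inference."""
    return 1000.0 / mean_latency_ms


# Mean latency copied from the report.
samples_per_sec = throughput_from_mean_latency(33.23)
print(f"{samples_per_sec:.1f} samples/sec")  # 30.1 samples/sec
```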

--- Per-Tag Metrics (BIO tags, sorted by F1 descending) ---
  B-date_of_birth                P=1.0000  R=0.9695  F1=0.9845  (n=131)
  B-email                        P=0.9604  R=0.8846  F1=0.9210  (n=494)
  B-time                         P=0.8447  R=0.7861  F1=0.8144  (n=173)
  I-time                         P=0.9677  R=0.6452  F1=0.7742  (n=93)
  B-date                         P=0.7885  R=0.7486  F1=0.7680  (n=712)
  B-last_name                    P=0.9404  R=0.6262  F1=0.7518  (n=428)
  I-street_address               P=0.9964  R=0.5783  F1=0.7318  (n=479)
  I-date                         P=0.8800  R=0.6226  F1=0.7293  (n=212)
  I-phone_number                 P=0.6071  R=0.8990  F1=0.7248  (n=208)
  B-first_name                   P=0.7660  R=0.6655  F1=0.7122  (n=595)
  B-phone_number                 P=0.5260  R=0.6585  F1=0.5848  (n=246)
  B-city                         P=0.4734  R=0.7583  F1=0.5829  (n=211)
  B-age                          P=0.3465  R=0.9459  F1=0.5072  (n=37)
  B-postcode                     P=0.3439  R=0.9062  F1=0.4986  (n=96)
  I-last_name                    P=0.2500  R=1.0000  F1=0.4000  (n=1)
  I-city                         P=0.2464  R=0.7727  F1=0.3736  (n=44)
  I-postcode                     P=0.2500  R=0.5000  F1=0.3333  (n=2)
  B-street_address               P=0.3158  R=0.3279  F1=0.3217  (n=183)
  B-ssn                          P=0.1200  R=0.9231  F1=0.2124  (n=52)
  B-certificate_license_number   P=0.0932  R=0.3023  F1=0.1425  (n=86)
  B-tax_id                       P=0.0714  R=0.4545  F1=0.1235  (n=22)
  I-first_name                   P=0.0303  R=0.2000  F1=0.0526  (n=5)
  I-credit_debit_card            P=1.0000  R=0.0219  F1=0.0429  (n=274)
  B-credit_debit_card            P=0.1000  R=0.0105  F1=0.0190  (n=95)
  B-gender                       P=0.0000  R=0.0000  F1=0.0000  (n=36)
  I-certificate_license_number   P=0.0000  R=0.0000  F1=0.0000  (n=5)
  I-email                        P=0.0000  R=0.0000  F1=0.0000  (n=1)
  I-ssn                          P=0.0000  R=0.0000  F1=0.0000  (n=0)
  I-tax_id                       P=0.0000  R=0.0000  F1=0.0000  (n=1)
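
Note: the token-level macro F1 (0.4175) equals the unweighted mean of the 29 per-tag F1 values listed above, including tags with very small or zero support such as I-ssn (n=0). Rare tags therefore pull the macro figure well below the micro and weighted ones. A minimal sketch that recomputes the average from the values in this report (the variable names are illustrative):

```python
# Per-tag F1 values copied from the list above, in the same order.
per_tag_f1 = [
    0.9845, 0.9210, 0.8144, 0.7742, 0.7680, 0.7518, 0.7318, 0.7293,
    0.7248, 0.7122, 0.5848, 0.5829, 0.5072, 0.4986, 0.4000, 0.3736,
    0.3333, 0.3217, 0.2124, 0.1425, 0.1235, 0.0526, 0.0429, 0.0190,
    0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
]

# Macro averaging gives every tag equal weight, regardless of support.
macro_f1 = sum(per_tag_f1) / len(per_tag_f1)

print(len(per_tag_f1))      # 29
print(f"{macro_f1:.4f}")    # 0.4175
```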

--- Per-Length Bucket ---
  short   : F1=0.7637 (n=9)
  medium  : F1=0.4497 (n=238)
  long    : F1=0.4243 (n=753)

--- Error Summary ---
  False positives: 99
  False negatives: 500