======================================================================
NER PII Benchmark — NerGuard Hybrid V2 (gpt-4o) × nvidia-pii
======================================================================

Tier: 2
Evaluated labels: 16
  System labels: 20
  Dataset labels: 54
  Mapping applied: True

Samples: 500
Tokens: 68359

--- Token-Level Metrics ---
  Precision (macro/micro/weighted): 0.6161 / 0.7042 / 0.7763
  Recall    (macro/micro/weighted): 0.7038 / 0.7649 / 0.7649
  F1        (macro/micro/weighted): 0.6094 / 0.7333 / 0.7444

--- Entity-Level Metrics (seqeval) ---
  Precision: 0.6453
  Recall:    0.7599
  F1:        0.6980
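
Entity-level scoring counts a prediction as correct only when the whole span (type and boundaries) matches, which is why these figures sit below the token-level micro scores. The sketch below illustrates that idea in plain Python. It is a simplified stand-in, not seqeval's implementation: `extract_spans` and `entity_prf` are hypothetical helpers, the tag sequences are made up, and a stray `I-` tag is assumed to open a new span.

```python
def extract_spans(tags):
    """Return the set of (entity_type, start, end) spans in a BIO tag sequence."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel "O" closes the last span
        prefix, _, label = tag.partition("-")
        continues = prefix == "I" and label == etype
        if start is not None and not continues:
            spans.add((etype, start, i))
            start, etype = None, None
        # Assumption: an I- tag with no open span starts a new one (lenient mode).
        if prefix == "B" or (prefix == "I" and start is None):
            start, etype = i, label
    return spans


def entity_prf(gold, pred):
    """Exact-match precision, recall and F1 over entity spans."""
    g, p = extract_spans(gold), extract_spans(pred)
    tp = len(g & p)
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up example: the address span is cut one token short, so it counts as a
# miss at entity level even though 5 of 6 tokens are tagged correctly.
gold = ["B-first_name", "B-last_name", "O",
        "B-street_address", "I-street_address", "I-street_address"]
pred = ["B-first_name", "B-last_name", "O",
        "B-street_address", "I-street_address", "O"]
precision, recall, f1 = entity_prf(gold, pred)

# F1 is the harmonic mean of P and R; applied to the rounded figures reported above.
reported_f1 = 2 * 0.6453 * 0.7599 / (0.6453 + 0.7599)
```

Recomputing F1 from the rounded precision and recall gives roughly 0.698, in line with the reported value up to rounding.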

--- Latency ---
  Mean:   39.89 ms
  Median: 34.54 ms
  P95:    65.92 ms
  P99:    91.99 ms
  Throughput: 25.1 samples/sec
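
The latency figures can be summarised from per-sample timings along the following lines. This is a minimal sketch under stated assumptions: the timings are hypothetical, `percentile` uses the nearest-rank definition (the benchmark may interpolate instead), and throughput is assumed to be 1000 / mean latency, i.e. samples processed one at a time.

```python
import math
import statistics


def percentile(values, q):
    """Nearest-rank percentile: smallest value with at least q% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms):
    """Summary statistics in the same shape as the Latency section."""
    mean = statistics.fmean(latencies_ms)
    return {
        "mean_ms": mean,
        "median_ms": statistics.median(latencies_ms),
        "p95_ms": percentile(latencies_ms, 95),
        "p99_ms": percentile(latencies_ms, 99),
        # Assumption: sequential processing, so throughput is the inverse of mean latency.
        "throughput_per_sec": 1000.0 / mean,
    }


# Hypothetical per-sample timings in milliseconds (not taken from this run).
timings = [30.0, 32.0, 34.0, 35.0, 36.0, 38.0, 40.0, 45.0, 60.0, 90.0]
stats = summarize(timings)

# Consistency check against the reported mean of 39.89 ms.
reported_throughput = 1000.0 / 39.89
```

Under the sequential assumption, a mean of 39.89 ms corresponds to about 25.1 samples/sec, which agrees with the reported throughput.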

--- Per-Label Token-Level Scores (sorted by F1) ---
  B-date_of_birth                P=1.0000  R=1.0000  F1=1.0000  (n=68)
  I-last_name                    P=1.0000  R=1.0000  F1=1.0000  (n=1)
  I-credit_debit_card            P=0.9796  R=0.9474  F1=0.9632  (n=152)
  B-email                        P=0.8359  R=0.9605  F1=0.8939  (n=228)
  B-first_name                   P=0.8615  R=0.7992  F1=0.8292  (n=249)
  B-last_name                    P=0.9635  R=0.7174  F1=0.8224  (n=184)
  B-age                          P=0.6923  R=1.0000  F1=0.8182  (n=18)
  B-date                         P=0.7664  R=0.8179  F1=0.7913  (n=357)
  B-time                         P=0.8438  R=0.7200  F1=0.7770  (n=75)
  I-phone_number                 P=0.6221  R=0.9817  F1=0.7616  (n=109)
  B-credit_debit_card            P=0.6024  R=0.9615  F1=0.7407  (n=52)
  I-street_address               P=0.9937  R=0.5772  F1=0.7302  (n=272)
  B-postcode                     P=0.5824  R=0.9636  F1=0.7260  (n=55)
  B-city                         P=0.7238  R=0.6667  F1=0.6941  (n=114)
  I-time                         P=0.9355  R=0.5472  F1=0.6905  (n=53)
  B-ssn                          P=0.5179  R=0.9667  F1=0.6744  (n=30)
  I-date                         P=0.8590  R=0.5492  F1=0.6700  (n=122)
  B-tax_id                       P=0.3929  R=0.9167  F1=0.5500  (n=12)
  B-phone_number                 P=0.3412  R=1.0000  F1=0.5088  (n=115)
  I-city                         P=0.3636  R=0.6667  F1=0.4706  (n=24)
  B-street_address               P=0.2788  R=0.2788  F1=0.2788  (n=104)
  B-gender                       P=0.6667  R=0.1053  F1=0.1818  (n=19)
  B-certificate_license_number   P=0.1250  R=0.1562  F1=0.1389  (n=32)
  I-first_name                   P=0.0714  R=1.0000  F1=0.1333  (n=1)
  I-certificate_license_number   P=0.0000  R=0.0000  F1=0.0000  (n=3)
  I-postcode                     P=0.0000  R=0.0000  F1=0.0000  (n=1)
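
The macro and weighted token-level averages can be cross-checked from the table above. The sketch below assumes that "macro" is the unweighted mean over the 26 BIO labels listed and that "weighted" uses each label's n as its weight; the values are copied from the table, so results match the reported aggregates only up to rounding. The helper names `macro` and `weighted` are illustrative and not part of the benchmark code.

```python
# (label, precision, recall, f1, support) copied from the per-label table.
ROWS = [
    ("B-date_of_birth", 1.0000, 1.0000, 1.0000, 68),
    ("I-last_name", 1.0000, 1.0000, 1.0000, 1),
    ("I-credit_debit_card", 0.9796, 0.9474, 0.9632, 152),
    ("B-email", 0.8359, 0.9605, 0.8939, 228),
    ("B-first_name", 0.8615, 0.7992, 0.8292, 249),
    ("B-last_name", 0.9635, 0.7174, 0.8224, 184),
    ("B-age", 0.6923, 1.0000, 0.8182, 18),
    ("B-date", 0.7664, 0.8179, 0.7913, 357),
    ("B-time", 0.8438, 0.7200, 0.7770, 75),
    ("I-phone_number", 0.6221, 0.9817, 0.7616, 109),
    ("B-credit_debit_card", 0.6024, 0.9615, 0.7407, 52),
    ("I-street_address", 0.9937, 0.5772, 0.7302, 272),
    ("B-postcode", 0.5824, 0.9636, 0.7260, 55),
    ("B-city", 0.7238, 0.6667, 0.6941, 114),
    ("I-time", 0.9355, 0.5472, 0.6905, 53),
    ("B-ssn", 0.5179, 0.9667, 0.6744, 30),
    ("I-date", 0.8590, 0.5492, 0.6700, 122),
    ("B-tax_id", 0.3929, 0.9167, 0.5500, 12),
    ("B-phone_number", 0.3412, 1.0000, 0.5088, 115),
    ("I-city", 0.3636, 0.6667, 0.4706, 24),
    ("B-street_address", 0.2788, 0.2788, 0.2788, 104),
    ("B-gender", 0.6667, 0.1053, 0.1818, 19),
    ("B-certificate_license_number", 0.1250, 0.1562, 0.1389, 32),
    ("I-first_name", 0.0714, 1.0000, 0.1333, 1),
    ("I-certificate_license_number", 0.0000, 0.0000, 0.0000, 3),
    ("I-postcode", 0.0000, 0.0000, 0.0000, 1),
]

P, R, F1, SUPPORT = 1, 2, 3, 4  # column indices into each row


def macro(col):
    """Unweighted mean of one metric column over all labels."""
    return sum(row[col] for row in ROWS) / len(ROWS)


def weighted(col):
    """Support-weighted mean of one metric column."""
    total = sum(row[SUPPORT] for row in ROWS)
    return sum(row[col] * row[SUPPORT] for row in ROWS) / total


summary = {
    name: (macro(col), weighted(col))
    for name, col in (("precision", P), ("recall", R), ("f1", F1))
}
for name, (m, w) in summary.items():
    print(f"{name:<9} macro={m:.4f}  weighted={w:.4f}")
```

Support-weighted recall reduces to total true positives over total support, which is the definition of micro recall; that is why the micro and weighted recall figures in the Token-Level Metrics section coincide. The large gap between macro and weighted F1 comes mostly from low-support labels such as I-first_name and I-postcode, which count as much as B-date in the macro average.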

--- Per-Length Bucket ---
  short   : F1=0.7257 (n=6)
  medium  : F1=0.6228 (n=123)
  long    : F1=0.6424 (n=371)

--- Error Summary ---
  False positives: 40
  False negatives: 340