Dataset: IEEE-CIS Fraud Detection (Kaggle)

A large labelled payment-fraud dataset whose card/email/device identifiers make it the real-data companion to the identifier-graph method — and whose chain-propagated label is itself an entity-resolution lesson.

A two-table payment-transaction dataset (transaction + identity, joined on TransactionID) with a binary isFraud label and a rich set of card, address, email-domain, and device/browser identifiers. It is the closest public data to the project’s identifier-graph method — and its label, propagated across a card/account chain after the first reported fraud, is the same entity-resolution subtlety that method is about.

This is a dataset-reference page: what the resource is, what it approximates, what it cannot show. Running the identifier-graph method against it would be a separate investigation — and a natural one, already flagged in the live-site TODO. The descriptive content is the output of notebooks/eda/ieee-fraud-detection.ipynb; the framing is this page’s addition.

Why this dataset matters here

The synthetic identifier-graph experiment built rings by hand and so detected them by construction. This dataset is the real-data counterpart: actual card hashes, email domains, and device strings, with an actual fraud label, at a scale where the giant-component and edge-weighting problems are real rather than staged. Two of its properties map directly onto the method’s core concerns:

  • The identifiers are exactly the method’s edge types — card attributes, address, email domains, device/browser strings — at realistic cardinality.
  • The label is chain-propagated: once a card/account is flagged, related transactions inherit the positive label. That is entity resolution showing up in the label itself, which is the method’s whole subject. It is a feature for understanding the problem and a trap for naive per-row modelling.
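The "trap for naive per-row modelling" can be made concrete: if rows from one chain land on both sides of a train/test split, the propagated label leaks across it. Below is a minimal sketch of an entity-level split. It assumes card1 is used as a stand-in grouping key (an assumption, since the dataset has no explicit chain id), and split_bucket is a hypothetical helper name, not part of the project.

```python
import hashlib


def split_bucket(group_key, test_share=0.2):
    """Deterministically assign a whole group to 'train' or 'test'.

    Hashing the group key (rather than the row) keeps every row that
    shares the key on the same side of the split.
    """
    digest = hashlib.sha256(str(group_key).encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a fraction in [0, 1).
    fraction = int.from_bytes(digest[:8], "big") / 2**64
    return "test" if fraction < test_share else "train"


# Toy rows (made up): two transactions share a card1 value.
rows = [
    {"TransactionID": 1, "card1": 13926},
    {"TransactionID": 2, "card1": 13926},
    {"TransactionID": 3, "card1": 2755},
]
buckets = [split_bucket(row["card1"]) for row in rows]
```

Rows 1 and 2 always receive the same bucket because they share a key. Whether card1 alone is a good enough proxy for the real chain is an open question this sketch does not answer.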

Scope note: this is payment fraud, which the project admits only on its web-facing / bot-driven slice (carding, payment-flow abuse at checkout). The dataset is not bot-labelled; it is fraud-labelled. Used here for the identifier structure, not as a bot classifier.

Access

Source: Kaggle competition ieee-fraud-detection, downloaded with the Kaggle CLI (kaggle competitions download -c ieee-fraud-detection).

  • Local path: data/ieee-fraud-detection
  • File format: CSV.
  • Date inspected: 2026-05-23.
  • Files on disk: sample_submission.csv — 5.8 MB; test_identity.csv — 24.6 MB; test_transaction.csv — 584.8 MB; train_identity.csv — 25.3 MB; train_transaction.csv — 651.7 MB.

Structure

  • train_transaction.csv: 590,540 rows × 394 cols. One row = one transaction.
  • train_identity.csv: 144,233 rows × 41 cols. Identity / device features attached to a subset of transactions.
  • Join key: TransactionID (left join transaction ← identity). Identity available for 144,233 / 590,540 transactions (24.4%) — so ~76% of transactions have no identity row.
  • Temporal coverage: TransactionDT ranges from 86,400 to 15,811,131 seconds (~182.0 days) from an unspecified reference time.
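The join coverage above is simple arithmetic over the two id sets. Here is a minimal stdlib sketch, using made-up toy ids and a hypothetical helper name (identity_coverage); the notebook itself uses pandas, where a left merge on TransactionID gives the same count.

```python
def identity_coverage(transaction_ids, identity_ids):
    """Return (matched, total, share) for a left join transaction <- identity."""
    identity_set = set(identity_ids)
    total = len(transaction_ids)
    matched = sum(1 for tid in transaction_ids if tid in identity_set)
    share = matched / total if total else 0.0
    return matched, total, share


# Toy ids (made up): four transactions, one of which has an identity row.
matched, total, share = identity_coverage([1, 2, 3, 4], [3])

# The counts reported on this page reproduce the 24.4% figure.
reported_share = 144_233 / 590_540
```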

Schema

Most columns are anonymised by Vesta with no public mapping. The interpretable ones are the target (isFraud), amount (TransactionAmt), timing offset (TransactionDT), product code (ProductCD), card network and type (card4, card6), email domains (P_emaildomain, R_emaildomain), and, in the identity table, DeviceType and DeviceInfo plus id_30 (OS), id_31 (browser), and id_33 (screen resolution). Everything else in the V*, C*, D*, M*, id_*, card*, addr*, and dist* families is anonymised. Full V-column range: V1..V339 (339 columns).
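Because the families are defined by name prefix, they can be separated mechanically. The sketch below classifies column names with regular expressions and checks that the families listed in the schema add up to the 394 columns of train_transaction. The family labels and helper names are illustrative choices, not the project's.

```python
import re
from collections import Counter

# Prefix families named in the schema description; order does not matter
# because the patterns are mutually exclusive.
FAMILY_PATTERNS = [
    ("V", re.compile(r"^V\d+$")),
    ("C", re.compile(r"^C\d+$")),
    ("D", re.compile(r"^D\d+$")),
    ("M", re.compile(r"^M\d+$")),
    ("id", re.compile(r"^id_\d+$")),
    ("card", re.compile(r"^card\d+$")),
    ("addr", re.compile(r"^addr\d+$")),
    ("dist", re.compile(r"^dist\d+$")),
]


def column_family(name):
    """Return the prefix family of a column, or 'named' if it has none."""
    for family, pattern in FAMILY_PATTERNS:
        if pattern.match(name):
            return family
    return "named"


def numbered(prefix, last):
    return [f"{prefix}{i}" for i in range(1, last + 1)]


# Column names as listed in the train_transaction schema table.
transaction_columns = (
    ["TransactionID", "isFraud", "TransactionDT", "TransactionAmt", "ProductCD"]
    + numbered("card", 6) + numbered("addr", 2) + numbered("dist", 2)
    + ["P_emaildomain", "R_emaildomain"]
    + numbered("C", 14) + numbered("D", 15) + numbered("M", 9)
    + numbered("V", 339)
)
family_counts = Counter(column_family(c) for c in transaction_columns)
```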

train_transaction (non-V columns + V1..V20 of 339 V cols)

column | dtype | example | description
TransactionID | int64 | 2987000 | Unique transaction id
isFraud | int64 | 0 | Target: 1 = fraudulent
TransactionDT | int64 | 86400 | Time-delta from reference (s)
TransactionAmt | float64 | 68.5 | Transaction amount (USD)
ProductCD | object | W | Product code (categorical)
card1 | int64 | 13926 | card attribute (anonymised)
card2 | float64 |  | card attribute (anonymised)
card3 | float64 | 150.0 | card attribute (anonymised)
card4 | object | discover | card network
card5 | float64 | 142.0 | card attribute (anonymised)
card6 | object | credit | card type
addr1 | float64 | 315.0 | address region (anonymised)
addr2 | float64 | 87.0 | address country (anonymised)
dist1 | float64 | 19.0 | distance feature (anonymised)
dist2 | float64 |  | distance feature (anonymised)
P_emaildomain | object |  | Purchaser email domain
R_emaildomain | object |  | Recipient email domain
C1..C14 | float64 | 1.0 | count-type features (anonymised)
D1..D15 | float64 | 14.0 | day time-delta features (anonymised)
M1..M9 | object | T | match-type categoricals (anonymised)
V1..V20 | float64 | 1.0 | Vesta-engineered numeric features (anonymised); continues to V339

train_identity (41 columns, selected)

column | dtype | example | description
TransactionID | int64 | 2987004 | Join key to train_transaction
id_01..id_11 | float64 | 70787.0 | identity features (anonymised)
id_12..id_29 | object/float | NotFound / New | identity features (anonymised)
id_30 | object | Android 7.0 | OS string
id_31 | object | samsung browser 6.2 | browser string
id_32 | float64 | 32.0 | identity feature (anonymised)
id_33 | object | 2220x1080 | screen resolution
id_34..id_38 | object | match_status:2 / T | identity features (anonymised)
DeviceType | object | mobile | Device class (desktop / mobile)
DeviceInfo | object | SAMSUNG SM-G892A Build/NRD90M | Device info string (browser/UA-derived)

Label

Label column: isFraud in train_transaction.csv (1 = fraudulent).

isFraud | count | rate
0 | 569,877 | 0.96501
1 | 20,663 | 0.03499
Important: The label is chain-propagated (an entity-resolution lesson, not a per-row truth)

Per Kaggle, a fraud label is propagated to all transactions in a card / account / email chain after the first reported fraud. A positive label is therefore partly a chain-level proxy rather than a per-transaction confirmation: a single reported fraud paints every later transaction on the same chain positive.

This is the identifier-graph problem appearing inside the ground truth itself. It inflates the positive count, complicates per-row interpretation, and means that “predicting isFraud” is partly “predicting membership of a flagged entity chain” — which is exactly what the identifier-graph method does explicitly. Treated naively it’s a leakage hazard; treated knowingly it’s the most realistic feature of the dataset for this project’s purposes.
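To see how propagation inflates the positive count, here is a toy reading of the rule as described above. It is not Vesta's actual labelling procedure: the chain key, the reported flag, and the function name are all hypothetical, and the real dataset exposes neither a chain id nor the original reports.

```python
def propagate_chain_labels(rows):
    """Label a row 1 if its chain has a reported fraud at or before its time.

    rows: list of dicts with keys 'chain', 't' (time), 'reported' (bool).
    Returns a list of 0/1 labels aligned with rows.
    """
    # Earliest reported-fraud time per chain.
    first_report = {}
    for row in rows:
        if row["reported"]:
            chain = row["chain"]
            first_report[chain] = min(row["t"], first_report.get(chain, row["t"]))

    labels = []
    for row in rows:
        start = first_report.get(row["chain"])
        labels.append(1 if start is not None and row["t"] >= start else 0)
    return labels


# Toy data (made up): one report on chain A at t=2, nothing on chain B.
toy_rows = [
    {"chain": "A", "t": 1, "reported": False},
    {"chain": "A", "t": 2, "reported": True},
    {"chain": "A", "t": 3, "reported": False},
    {"chain": "B", "t": 1, "reported": False},
]
toy_labels = propagate_chain_labels(toy_rows)
n_reported = sum(row["reported"] for row in toy_rows)
n_positive = sum(toy_labels)
```

In the toy data a single report yields two positive rows: the reported transaction and the later one on the same chain.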

Identifier inventory

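No explicit user ID. Weak actor identifiers come from card hashes, email domains, and (when the identity table joins) device/browser strings: the exact edge types the identifier-graph method uses. The null rates for the identity-table columns (DeviceType, DeviceInfo, id_30, id_31, id_33) appear to be computed over train_identity rows only, since id_33's 0.4919 matches its train_identity rate under Missing data; over all transactions they would be much higher, because ~76% of transactions have no identity row.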

column | n_unique | null_rate | role
TransactionID | 590,540 | 0.0000 | transaction primary key
card1 | 13,553 | 0.0000 | card hash (weak account identifier)
card2 | 500 | 0.0151 | card attribute
card3 | 114 | 0.0027 | card attribute
card4 | 4 | 0.0027 | card network (visa/mc/etc.)
card5 | 119 | 0.0072 | card attribute
card6 | 4 | 0.0027 | card type (debit/credit)
addr1 | 332 | 0.1113 | billing address region
addr2 | 74 | 0.1113 | billing country
P_emaildomain | 59 | 0.1599 | purchaser email domain (weak actor signal)
R_emaildomain | 60 | 0.7675 | recipient email domain
DeviceType | 2 | 0.0237 | device class
DeviceInfo | 1,786 | 0.1773 | device fingerprint string
id_30 | 75 | 0.4622 | OS string
id_31 | 130 | 0.0274 | browser string
id_33 | 260 | 0.4919 | screen resolution

The cardinality spread previews the giant-component problem directly: card1 at 13,553 distinct values is a usefully discriminative edge, while card4 (4 values) and card6 (4 values) are near-universal attributes that would fuse everything if treated as linking edges — they are node attributes, not edges, exactly as the method’s edge-weighting section argues.
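The fusing effect is easy to reproduce on toy data. The sketch below links records that share a value in any chosen column, using a small union-find, and reports the share of records in the largest connected component. The records are made up and the function name is hypothetical; it illustrates the cardinality argument and is not the project's identifier-graph implementation.

```python
from collections import Counter


def largest_component_share(records, edge_columns):
    """Share of records in the largest component when equal values link records."""
    if not records:
        return 0.0
    parent = list(range(len(records)))

    def find(i):
        # Path halving keeps the trees shallow.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for column in edge_columns:
        first_seen = {}  # value -> index of the first record carrying it
        for idx, record in enumerate(records):
            value = record.get(column)
            if value is None:  # nulls never create a link
                continue
            if value in first_seen:
                union(first_seen[value], idx)
            else:
                first_seen[value] = idx

    sizes = Counter(find(i) for i in range(len(records)))
    return max(sizes.values()) / len(records)


# Toy records (made up): card1 is discriminative, card4 has two values.
toy_records = [
    {"card1": 100, "card4": "visa"},
    {"card1": 100, "card4": "visa"},
    {"card1": 200, "card4": "visa"},
    {"card1": 200, "card4": "mastercard"},
    {"card1": 300, "card4": "mastercard"},
    {"card1": 400, "card4": "visa"},
]
share_card1_only = largest_component_share(toy_records, ["card1"])
share_with_card4 = largest_component_share(toy_records, ["card1", "card4"])
```

With card1 alone the largest component holds two of the six records. Adding card4 as an edge merges all six into one component.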

Temporal structure

  • TransactionDT: integer seconds from an undisclosed reference. No wall-clock timestamps, no timezone.
  • Range: 86,400 to 15,811,131 seconds (182.0 days of activity).
  • D1..D15 are documented as day-granularity time-delta features relative to prior events on the same card/account — themselves entity-linked features.
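  • Diurnal patterns are recoverable only up to an unknown phase shift: TransactionDT modulo 86,400 gives a consistent second-of-day, but without the reference time it cannot be mapped to local clock hours. Not plotted here.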
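The offset arithmetic can be sketched as follows. The only inputs are the minimum and maximum TransactionDT values reported above; the helper names are illustrative.

```python
SECONDS_PER_DAY = 86_400


def day_index(transaction_dt):
    """Whole days elapsed since the (undisclosed) reference time."""
    return transaction_dt // SECONDS_PER_DAY


def second_of_day(transaction_dt):
    """Position within the day, relative to the reference, not to local time."""
    return transaction_dt % SECONDS_PER_DAY


dt_min, dt_max = 86_400, 15_811_131
span_days = (dt_max - dt_min) / SECONDS_PER_DAY
```

Here day_index(dt_min) is 1 and day_index(dt_max) is 182, consistent with the reported span of about 182.0 days. second_of_day is only meaningful for comparing transactions with each other, because the reference time, and so the true clock hour, is unknown.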

Missing data

  • train_transaction: 95,566,686 null cells overall (41.07% of cells); 212 of 394 columns have >20% missingness.
  • train_identity: 2,104,107 null cells overall (35.58%); 19 of 41 columns have >20% missingness.
  • Identity rows themselves are missing for ~76% of transactions.

Top most-null columns in train_transaction: dist2 (0.9363), D7 (0.9341), D13 (0.8951), D14 (0.8947), D12 (0.8904), D6 (0.8761), D9 (0.8731), D8 (0.8731), and a block of V-columns (V153, V149, V141, V146, V154, V162, V142) all at 0.8612.

Top most-null columns in train_identity: id_24 (0.9671), id_25 (0.9644), id_07 (0.9643), id_08 (0.9643), id_21 (0.9642), id_26/id_23/id_27/id_22 (~0.9642), then id_18 (0.6872), id_04/id_03 (0.5402), id_33 (0.4919), id_10/id_09 (0.4805).
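The missingness figures are per-column null rates plus an overall cell rate. Below is a minimal stdlib sketch of that computation on made-up rows, with None standing in for a null and null_summary as a hypothetical helper name. The last two lines re-derive the reported overall rates from the counts on this page.

```python
def null_summary(rows, columns, threshold=0.20):
    """Return (per-column null rate, overall null-cell rate, columns above threshold)."""
    n_rows = len(rows)
    null_counts = {
        col: sum(1 for row in rows if row.get(col) is None) for col in columns
    }
    null_rate = {col: count / n_rows for col, count in null_counts.items()}
    overall = sum(null_counts.values()) / (n_rows * len(columns))
    sparse = sorted(col for col, rate in null_rate.items() if rate > threshold)
    return null_rate, overall, sparse


# Toy rows (made up): column 'a' is 20% null, column 'b' is 60% null.
toy_rows = [
    {"a": 1, "b": None},
    {"a": 2, "b": None},
    {"a": None, "b": 3},
    {"a": 4, "b": 5},
    {"a": 5, "b": None},
]
toy_rate, toy_overall, toy_sparse = null_summary(toy_rows, ["a", "b"])

# Overall rates implied by the reported null-cell counts and table shapes.
transaction_overall = 95_566_686 / (590_540 * 394)
identity_overall = 2_104_107 / (144_233 * 41)
```

The threshold is strict: a column at exactly 20% null is not counted as above it.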

Quirks and observations

  • Two-file structure joined on TransactionID; only 24% of transactions have identity rows.
  • V*/C*/D*/M*/id_* columns are anonymised; no public mapping. Treat as opaque.
  • TransactionDT is seconds from an unstated reference; cannot anchor to calendar dates.
  • Labels are chain-propagated (see Label callout) — the single most important property for this project.
  • High missingness clusters in long blocks of V/D/id_ columns — typical of features only defined for certain product codes or device classes, not missing at random.

Framing distance

What real problem it approximates: identifier-linked payment fraud at realistic scale and cardinality — the real-data setting for the identifier-graph method, with the entity-resolution structure present in both the features and the label.

What it fails to represent: it is payment fraud, not bot detection; the label is chain-propagated, not per-transaction truth; most features are anonymised, so the method’s edge-weighting can’t be reasoned about semantically beyond the card/email/device columns; and it is a transaction table, not a session/clickstream, so behavioural-synchrony signals are absent.

What further evidence would be needed: a bot-labelled (not fraud-labelled) source; per-event rather than chain-propagated labels; de-anonymised identifier semantics; and session-level behavioural data to pair with the transaction-level identifiers.

What it cannot show

A reader should not read isFraud performance as bot-detection performance, nor a per-row metric as if the labels were per-row truth. The dataset shows how identifier structure and chain-propagated labels behave on real fraud data — it does not show prevalence, bot-specific behaviour, or the meaning of its anonymised features.

Reproduction

Generated by notebooks/eda/ieee-fraud-detection.ipynb, which calls openbotrisk.eda.loaders.load_ieee_meta (pandas full-read).

jupyter nbconvert --to notebook --execute --inplace \
  notebooks/eda/ieee-fraud-detection.ipynb \
  --ExecutePreprocessor.timeout=600

Loader runtime on the reference machine: 14.0 s. Both train CSVs fit in memory; no chunking needed.