Dataset: TalkingData AdTracking (Kaggle)

A large mobile ad-click dataset whose per-IP/device click timing is a natural real-data home for the behavioural-synchrony method — provided its label is read for what it is: a conversion flag, not a bot flag.

A single flat table of mobile ad clicks (184,903,890 rows in train.csv, roughly 185 million). Each row carries anonymised ip, app, device, os and channel IDs, a click_time, and an is_attributed target marking whether the click led to an app install. The pattern of interest for the behavioural-synchrony method is rapid repeated clicks per IP/device that never convert.

This is a dataset-reference page. Running the synchrony method against the click-timing structure would be a separate investigation, already flagged in the live-site TODO. The descriptive content is the output of notebooks/eda/talkingdata-adtracking.ipynb; the framing is this page’s addition.

Why this dataset matters here

Click fraud is squarely in scope (ad fraud is a listed abuse type) and is underrepresented in the project so far. More usefully, the dataset’s structure — many clicks per ip/device in tight time windows — is exactly the input the behavioural-synchrony method consumes. It is the real-data counterpart to the synthetic synchrony experiment, where coordinated clicks were generated by hand.

The catch is the label, treated below as the load-bearing caution: is_attributed is a conversion signal, so the positive class is a legitimate install, and fraud must be inferred from the absence of attribution combined with volume and timing patterns. The dataset supports synchrony-style analysis well and a supervised bot classifier badly.

Access

Source: Kaggle competition talkingdata-adtracking-fraud-detection (via kaggle competitions download).

Local path: data/talkingdata-adtracking-fraud-detection
File format: CSV.
Date inspected: 2026-05-23.
Files on disk: sample_submission.csv — 186.5 MB; test.csv — 823.3 MB; test_supplement.csv — 2.5 GB; train.csv — 7.0 GB; train_sample.csv — 3.9 MB.

Structure

train.csv: 184,903,890 rows, 8 columns. One row = one ad click.
test.csv: no label.
Temporal coverage in train: 2017-11-06 14:32:21 to 2017-11-09 16:00:00.
Single flat table; no join keys between train/test beyond click features.

Schema

column	dtype	example	description
`ip`	BIGINT	`83230`	IP address of click (anonymised integer encoding)
`app`	BIGINT	`3`	App ID for marketing (encoded)
`device`	BIGINT	`1`	Device type ID (encoded, e.g. iPhone 6 Plus, iPhone 7)
`os`	BIGINT	`13`	OS version ID (encoded)
`channel`	BIGINT	`379`	Channel ID of mobile ad publisher (encoded)
`click_time`	TIMESTAMP	`2017-11-06 14:32:21`	UTC timestamp of click
`attributed_time`	TIMESTAMP	`NaT`	UTC timestamp of app download if attributed
`is_attributed`	BIGINT	`0`	Target: 1 if app downloaded after click, else 0

All feature columns (ip, app, device, os, channel) are anonymised integer IDs with no public mapping.

Label

Label column: is_attributed (1 = click led to an app install, 0 = not).

is_attributed	count	rate
0	184,447,044	0.99753
1	456,846	0.00247
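
The rates in the table follow directly from the two counts. As a quick arithmetic check, using only the numbers above:

```python
# Counts copied from the label table above.
n_negative = 184_447_044
n_positive = 456_846

n_total = n_negative + n_positive
positive_rate = n_positive / n_total
negative_rate = n_negative / n_total

print(n_total)                  # 184903890, matching the train.csv row count
print(round(positive_rate, 5))  # 0.00247
print(round(negative_rate, 5))  # 0.99753
```

Put another way, roughly one click in 405 is followed by an install.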

This is a conversion label, not a bot label — read it that way or not at all

is_attributed = 1 means an install followed the click. The positive class is therefore a legitimate conversion, and the overwhelming negative class (99.75%) mixes ordinary non-converting clicks with fraudulent ones. There is no direct bot/human flag.

Consequences:

Not a supervised bot classifier. Training to predict is_attributed predicts conversion, not automation. Presenting such a model as bot detection would be wrong.
Fraud is inferred, not labelled. The signal of interest — many rapid clicks from one ip/device that never convert — has to be derived from the click structure, which is precisely a behavioural-synchrony / rate-anomaly analysis, not a classification against the given label.
The label is time-delayed. attributed_time is the install time, which falls after the click, so the target is only known with a lag. attributed_time itself is null for every negative by design.

This is why the dataset is filed as a synchrony companion rather than a labelled detector: its value is the click-timing structure, not its target column.
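
To make "inferred, not labelled" concrete, here is a minimal rate-anomaly sketch on invented rows. It is not the project's synchrony method. The window length, the click threshold, and the function name are all hypothetical choices for illustration. It flags IPs that produce a burst of clicks inside a short window and never convert.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Toy rows (ip, click_time, is_attributed); values are invented for illustration.
clicks = [
    (101, "2017-11-07 09:00:00", 0),
    (101, "2017-11-07 09:00:01", 0),
    (101, "2017-11-07 09:00:01", 0),
    (101, "2017-11-07 09:00:02", 0),
    (202, "2017-11-07 09:00:00", 0),
    (202, "2017-11-07 10:30:00", 1),
    (303, "2017-11-07 09:15:00", 0),
]

WINDOW = timedelta(seconds=10)  # hypothetical burst window
MIN_CLICKS = 4                  # hypothetical burst-size threshold


def burst_without_conversion(rows, window=WINDOW, min_clicks=MIN_CLICKS):
    """Return IPs with >= min_clicks inside some window and zero installs."""
    times = defaultdict(list)
    converted = set()
    for ip, ts, label in rows:
        times[ip].append(datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"))
        if label == 1:
            converted.add(ip)

    flagged = []
    for ip, stamps in times.items():
        if ip in converted:
            continue
        stamps.sort()
        # Sliding window over the sorted timestamps.
        start = 0
        for end in range(len(stamps)):
            while stamps[end] - stamps[start] > window:
                start += 1
            if end - start + 1 >= min_clicks:
                flagged.append(ip)
                break
    return sorted(flagged)


candidates = burst_without_conversion(clicks)
print(candidates)  # [101]
```

The output is a list of candidates, not of bots. Because one IP can front many devices behind carrier NAT, a flagged IP may still be ordinary high-volume traffic.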

Identifier inventory

No per-user ID. ip is the closest actor identifier but is coarse and NAT-prone (mobile carrier NAT puts many devices behind one IP). device is a device type, not a per-device fingerprint. No browser, cookie, or session-token columns exist.

column	n_unique	unique_rate	note
`app`	706	0.000004	app being advertised
`device`	3,475	0.000019	device type, not a per-device fingerprint
`os`	800	0.000004	OS version code
`channel`	202	0.000001	publisher channel
`is_attributed`	2	0.000000	label
`ip`	277,396	0.001500	weak actor identifier; shared by NAT/carrier traffic

The low ip unique-rate against 185M rows is the point: many clicks share each IP, which is what makes per-IP click-timing a usable coordination signal — and simultaneously what makes IP a weak identity signal, since mobile NAT means one IP is many devices.
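
The unique_rate column is simply n_unique divided by the train.csv row count. The short check below reproduces it from the figures in the table and derives the mean number of clicks per IP, a number this page does not otherwise report.

```python
n_rows = 184_903_890  # train.csv row count
n_unique = {
    "app": 706,
    "device": 3_475,
    "os": 800,
    "channel": 202,
    "ip": 277_396,
}

for column, n in n_unique.items():
    print(f"{column:8s} unique_rate={n / n_rows:.6f}")

mean_clicks_per_ip = n_rows / n_unique["ip"]
print(round(mean_clicks_per_ip))  # 667
```

The mean says nothing about skew. The per-IP distribution is not reported on this page, so a few very heavy IPs could account for much of that average.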

Temporal structure

click_time: UTC timestamp at second granularity. Range 2017-11-06 14:32:21 to 2017-11-09 16:00:00 (~3 days 01:27:39 of wall-clock time).
attributed_time: UTC second granularity; populated only on positive labels.
Click density follows diurnal mobile-ad patterns; not plotted here.
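
The wall-clock span quoted above can be checked from the two endpoint timestamps. The same subtraction gives an average click rate across the window.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"
first_click = datetime.strptime("2017-11-06 14:32:21", FMT)
last_click = datetime.strptime("2017-11-09 16:00:00", FMT)

span = last_click - first_click
print(span)                  # 3 days, 1:27:39
print(span.total_seconds())  # 264459.0

n_rows = 184_903_890  # train.csv row count
clicks_per_second = n_rows / span.total_seconds()
print(round(clicks_per_second))  # 699
```

At roughly 700 clicks per second on average across all IPs, many clicks share the same second, so second-granularity timestamps will contain ties.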

The second-granularity click_time per ip/device is the field the synchrony method would tokenise — the dataset’s main asset for this project.
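
This page does not describe how the synchrony method tokenises timestamps, so what follows is only one plausible scheme, sketched on invented rows. It groups clicks by (ip, device), sorts them, and maps each inter-click gap to a coarse token. The bucket edges and token names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Invented sample rows: (ip, device, click_time).
rows = [
    (7, 1, "2017-11-07 12:00:00"),
    (7, 1, "2017-11-07 12:00:00"),
    (7, 1, "2017-11-07 12:00:03"),
    (7, 1, "2017-11-07 12:02:03"),
    (9, 2, "2017-11-07 12:00:05"),
]


def gap_token(seconds):
    """Map an inter-click gap to a coarse token (bucket edges are hypothetical)."""
    if seconds == 0:
        return "same_second"
    if seconds <= 5:
        return "burst"
    if seconds <= 60:
        return "short"
    return "long"


def tokenise(rows):
    """Return {(ip, device): [token, ...]} built from sorted inter-click gaps."""
    by_actor = defaultdict(list)
    for ip, device, ts in rows:
        by_actor[(ip, device)].append(datetime.strptime(ts, FMT))

    tokens = {}
    for actor, stamps in by_actor.items():
        stamps.sort()
        gaps = [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]
        tokens[actor] = [gap_token(g) for g in gaps]
    return tokens


print(tokenise(rows))
# {(7, 1): ['same_second', 'burst', 'long'], (9, 2): []}
```

An actor with a single click yields an empty sequence, so any downstream comparison would need a minimum-activity filter.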

Missing data

Only attributed_time is meaningfully missing, and by design — it is null whenever is_attributed = 0. Feature columns are fully populated.

column	null_count	null_rate
`attributed_time`	184,447,044	0.99753
all others	0	0.00000
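
The "missing by design" claim is an invariant that can be tested row by row. This sketch runs on three invented rows in the train.csv column layout. It assumes a missing attributed_time appears as an empty CSV field.

```python
import csv
import io

# Three invented rows in the train.csv column layout.
sample = io.StringIO(
    "ip,app,device,os,channel,click_time,attributed_time,is_attributed\n"
    "83230,3,1,13,379,2017-11-06 14:32:21,,0\n"
    "17357,3,1,19,379,2017-11-06 14:33:34,,0\n"
    "35810,19,0,24,213,2017-11-06 14:34:12,2017-11-06 15:01:02,1\n"
)

violations = 0
null_attributed = 0
for row in csv.DictReader(sample):
    is_null = row["attributed_time"] == ""
    null_attributed += is_null
    # Invariant: attributed_time is empty exactly when is_attributed is 0.
    if is_null != (row["is_attributed"] == "0"):
        violations += 1

print(null_attributed, violations)  # 2 0
```

On the full file, a zero violation count together with a null count equal to the negative-class count (184,447,044) confirms that the gap is definitional.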

Quirks and observations

train.csv is ~7 GB; loaded via DuckDB streaming. Pandas would not fit it comfortably in memory.
All feature columns are anonymised integer IDs — no human-readable meaning, no PII.
attributed_time is null whenever is_attributed = 0. This is by definition, not data loss.
ip has 277,396 distinct values, but each IP can map to many devices (mobile NAT), so IP is a weak linkage signal.
test.csv lacks is_attributed / attributed_time.

Framing distance

What real problem it approximates: large-scale mobile click fraud with the per-IP/device click-timing structure the behavioural-synchrony method consumes — a real-data setting for coordinated-action detection.

What it fails to represent: no bot/human label (only conversion); anonymised integer features with no semantics; device is a coarse type, not a fingerprint; ip is NAT-prone; and there is no session/navigation structure, only isolated click events.

What further evidence would be needed: a bot-labelled source; de-anonymised or richer device/network signals; and session-level context to distinguish coordinated automation from ordinary high-volume legitimate traffic.

What it cannot show

A reader should not treat is_attributed as a bot label, nor read a conversion model as a fraud or bot detector. The dataset shows how click-timing structure behaves at scale and supports synchrony-style analysis; it does not, by itself, label or measure automated traffic.

Reproduction

Generated by notebooks/eda/talkingdata-adtracking.ipynb, which calls openbotrisk.eda.loaders.load_talkingdata_meta (DuckDB single-pass scans; no full in-memory materialisation).

jupyter nbconvert --to notebook --execute --inplace \
  notebooks/eda/talkingdata-adtracking.ipynb \
  --ExecutePreprocessor.timeout=600

Loader runtime on the reference machine: 6.2 s. DuckDB performs the row count, null counts, cardinality, label counts, and time-range queries in a single pass over train.csv, so no file is fully materialised in pandas.
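
The loader's source is not shown on this page. As a rough illustration of what "single pass" means, the sketch below collects the same five summaries while streaming rows with Python's standard csv module. It is a stand-in for the DuckDB queries, not the loader's implementation; the function name profile is hypothetical and the demo rows are invented.

```python
import csv
import io


def profile(lines):
    """One pass over CSV lines: row count, label counts, nulls, cardinality, time range."""
    stats = {
        "rows": 0,
        "labels": {"0": 0, "1": 0},
        "null_attributed_time": 0,
        "unique_ip": set(),
        "min_click_time": None,
        "max_click_time": None,
    }
    for row in csv.DictReader(lines):
        stats["rows"] += 1
        stats["labels"][row["is_attributed"]] += 1
        if row["attributed_time"] == "":
            stats["null_attributed_time"] += 1
        stats["unique_ip"].add(row["ip"])
        # "YYYY-MM-DD HH:MM:SS" strings sort chronologically, so no parsing is needed.
        ts = row["click_time"]
        if stats["min_click_time"] is None or ts < stats["min_click_time"]:
            stats["min_click_time"] = ts
        if stats["max_click_time"] is None or ts > stats["max_click_time"]:
            stats["max_click_time"] = ts
    stats["unique_ip"] = len(stats["unique_ip"])
    return stats


demo = io.StringIO(
    "ip,app,device,os,channel,click_time,attributed_time,is_attributed\n"
    "83230,3,1,13,379,2017-11-06 14:32:21,,0\n"
    "83230,3,1,13,379,2017-11-06 14:32:25,,0\n"
    "35810,19,0,24,213,2017-11-06 14:34:12,2017-11-06 15:01:02,1\n"
)
result = profile(demo)
print(result["rows"], result["unique_ip"], result["labels"])
# 3 2 {'0': 2, '1': 1}
```

The function accepts any iterable of lines, so an open file handle works as well as the in-memory demo. A pure-Python loop over a 7 GB file would be far slower than the DuckDB scan; the point here is the access pattern, not the speed.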