Dataset: TalkingData AdTracking (Kaggle)
A single large flat table of mobile ad clicks (~185 million rows), each with anonymised ip, app, device, os, channel, a click_time, and an is_attributed target marking whether the click led to an app install. It is a natural real-data setting for the behavioural-synchrony method (rapid repeated clicks per IP/device that never convert), as long as the label is read correctly: it marks conversion, not bot activity.
This is a dataset-reference page. Running the synchrony method against the click-timing structure would be a separate investigation, already flagged in the live-site TODO. The descriptive content is the output of notebooks/eda/talkingdata-adtracking.ipynb; the framing is this page’s addition.
Why this dataset matters here
Click fraud is squarely in scope (ad fraud is a listed abuse type) and is underrepresented in the project so far. More usefully, the dataset’s structure — many clicks per ip/device in tight time windows — is exactly the input the behavioural-synchrony method consumes. It is the real-data counterpart to the synthetic synchrony experiment, where coordinated clicks were generated by hand.
The catch is the label, treated below as the load-bearing caution: is_attributed is a conversion signal, so the positive class is a legitimate install, and fraud must be inferred from the absence of attribution plus volume/timing patterns. The dataset is well suited to synchrony-style analysis and poorly suited to training a supervised bot classifier.
Access
- Source: Kaggle competition talkingdata-adtracking-fraud-detection (via kaggle competitions download).
- Local path: data/talkingdata-adtracking-fraud-detection
- File format: CSV.
- Date inspected: 2026-05-23.
- Files on disk:
  - sample_submission.csv: 186.5 MB
  - test.csv: 823.3 MB
  - test_supplement.csv: 2.5 GB
  - train.csv: 7.0 GB
  - train_sample.csv: 3.9 MB
Structure
- train.csv: 184,903,890 rows, 8 columns. One row = one ad click.
- test.csv: no label.
- Temporal coverage in train: 2017-11-06 14:32:21 to 2017-11-09 16:00:00.
- Single flat table; no join keys between train/test beyond click features.
Schema
| column | dtype | example | description |
|---|---|---|---|
| ip | BIGINT | 83230 | IP address of click (anonymised integer encoding) |
| app | BIGINT | 3 | App ID for marketing (encoded) |
| device | BIGINT | 1 | Device type ID (encoded, e.g. iphone 6 plus, iphone 7) |
| os | BIGINT | 13 | OS version ID (encoded) |
| channel | BIGINT | 379 | Channel ID of mobile ad publisher (encoded) |
| click_time | TIMESTAMP | 2017-11-06 14:32:21 | UTC timestamp of click |
| attributed_time | TIMESTAMP | NaT | UTC timestamp of app download if attributed |
| is_attributed | BIGINT | 0 | Target: 1 if app downloaded after click, else 0 |
All feature columns (ip, app, device, os, channel) are anonymised integer IDs with no public mapping.
Label
Label column: is_attributed (1 = click led to an app install, 0 = not).
| is_attributed | count | rate |
|---|---|---|
| 0 | 184,447,044 | 0.99753 |
| 1 | 456,846 | 0.00247 |
is_attributed = 1 means an install followed the click. The positive class is therefore legitimate conversion, and the overwhelming negative class (99.75%) mixes ordinary non-converting clicks with fraudulent ones. There is no direct bot/human flag.
Consequences:
- Not a supervised bot classifier. Training to predict is_attributed predicts conversion, not automation. Presenting such a model as bot detection would be wrong.
- Fraud is inferred, not labelled. The signal of interest (many rapid clicks from one ip/device that never convert) has to be derived from the click structure, which is precisely a behavioural-synchrony / rate-anomaly analysis, not a classification against the given label.
- The label is time-delayed. attributed_time is the install time, after the click, so the target is only known with lag (and is null for all negatives by design).
This is why the dataset is filed as a synchrony companion rather than a labelled detector: its value is the click-timing structure, not its target column.
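To make "fraud is inferred, not labelled" concrete, the following is a minimal sketch of one possible rate-anomaly rule: flag an ip that produces a burst of clicks inside a short window and never converts. It is not the project's synchrony method. The function name, the 10-second window, and the 5-click threshold are all hypothetical choices for illustration, and the input rows are invented.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def flag_nonconverting_bursts(clicks, window=timedelta(seconds=10), min_clicks=5):
    """Return ips with >= min_clicks clicks inside some `window` span and no install.

    `clicks` is an iterable of (ip, click_time, is_attributed) tuples.
    The default thresholds are illustrative assumptions, not tuned values.
    """
    times_by_ip = defaultdict(list)
    converted = set()
    for ip, click_time, is_attributed in clicks:
        times_by_ip[ip].append(click_time)
        if is_attributed:
            converted.add(ip)

    flagged = set()
    for ip, times in times_by_ip.items():
        if ip in converted:
            continue  # at least one click converted, so the rule does not apply
        times.sort()
        start = 0
        for end in range(len(times)):
            # Advance the left edge until the span fits inside `window`.
            while times[end] - times[start] > window:
                start += 1
            if end - start + 1 >= min_clicks:
                flagged.add(ip)
                break
    return flagged


# Invented rows: ip 1 bursts without installing, ip 2 clicks slowly,
# ip 3 bursts but one click converts.
t0 = datetime(2017, 11, 6, 14, 32, 21)
clicks = (
    [(1, t0 + timedelta(seconds=i), 0) for i in range(6)]
    + [(2, t0 + timedelta(minutes=10 * i), 0) for i in range(6)]
    + [(3, t0 + timedelta(seconds=i), int(i == 5)) for i in range(6)]
)
flagged = flag_nonconverting_bursts(clicks)
print(flagged)  # {1}
```

A flag from a rule like this is still only a candidate: because of carrier NAT, a busy ip with no installs can be many legitimate devices, so the output is a starting point for inspection, not a bot label.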
Identifier inventory
No per-user ID. ip is the closest actor identifier but is coarse and NAT-prone (mobile carrier NAT puts many devices behind one IP). device is a device type, not a per-device fingerprint. No browser, cookie, or session-token columns exist.
| column | n_unique | unique_rate | note |
|---|---|---|---|
| app | 706 | 0.000004 | app being advertised |
| device | 3,475 | 0.000019 | device type, not a per-device fingerprint |
| os | 800 | 0.000004 | OS version code |
| channel | 202 | 0.000001 | publisher channel |
| is_attributed | 2 | 0.000000 | label |
| ip | 277,396 | 0.001500 | weak actor identifier; shared by NAT/carrier traffic |
The low ip unique-rate against 185M rows is the point: many clicks share each IP, which is what makes per-IP click-timing a usable coordination signal — and simultaneously what makes IP a weak identity signal, since mobile NAT means one IP is many devices.
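As a quick arithmetic check, the unique_rate column is consistent with n_unique divided by the total row count of train.csv. That definition is an assumption inferred from the numbers, not something stated by the notebook. The same figures give the mean number of clicks per ip, which is what makes per-IP timing dense enough to analyse.

```python
# Figures copied from the Structure and Identifier inventory sections.
N_ROWS = 184_903_890
n_unique = {
    "app": 706,
    "device": 3_475,
    "os": 800,
    "channel": 202,
    "is_attributed": 2,
    "ip": 277_396,
}

# Assumed definition: unique_rate = n_unique / total rows, rounded to 6 places.
unique_rate = {col: round(n / N_ROWS, 6) for col, n in n_unique.items()}

# Mean clicks per distinct ip (a mean only; the distribution is not shown here).
clicks_per_ip = N_ROWS / n_unique["ip"]

print(unique_rate["ip"])      # 0.0015
print(round(clicks_per_ip))   # 667
```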
Temporal structure
- click_time: UTC timestamp at second granularity. Range 2017-11-06 14:32:21 to 2017-11-09 16:00:00 (~3 days 01:27:39 of wall-clock time).
- attributed_time: UTC second granularity; populated only on positive labels.
- Click density follows diurnal mobile-ad patterns; not plotted here.
The second-granularity click_time per ip/device is the field the synchrony method would tokenise — the dataset’s main asset for this project.
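One plausible first step before any tokenisation is to reduce each ip/device group to its sequence of inter-click gaps. The sketch below assumes that representation; how the synchrony method actually tokenises click_time is not specified on this page, and the function name and sample rows are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def inter_click_gaps(clicks):
    """Map each (ip, device) key to the gaps, in whole seconds, between its
    consecutive clicks in time order.

    `clicks` is an iterable of (ip, device, click_time) tuples.
    """
    times = defaultdict(list)
    for ip, device, click_time in clicks:
        times[(ip, device)].append(click_time)

    gaps = {}
    for key, ts in times.items():
        ts.sort()
        gaps[key] = [int((b - a).total_seconds()) for a, b in zip(ts, ts[1:])]
    return gaps


# Invented rows in the dataset's timestamp format.
fmt = "%Y-%m-%d %H:%M:%S"
rows = [
    (83230, 1, "2017-11-06 14:32:21"),
    (83230, 1, "2017-11-06 14:32:21"),
    (83230, 1, "2017-11-06 14:32:24"),
    (17357, 1, "2017-11-06 14:33:00"),
]
gaps = inter_click_gaps((ip, dev, datetime.strptime(t, fmt)) for ip, dev, t in rows)
print(gaps[(83230, 1)])  # [0, 3]
print(gaps[(17357, 1)])  # []
```

Because click_time has second granularity, zero-second gaps are expected whenever a group clicks more than once in the same second, so any downstream representation needs to handle ties.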
Missing data
Only attributed_time is meaningfully missing, and by design — it is null whenever is_attributed = 0. Feature columns are fully populated.
| column | null_count | null_rate |
|---|---|---|
| attributed_time | 184,447,044 | 0.99753 |
| all others | 0 | 0.00000 |
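The null count for attributed_time equals the negative-class count, which is what "missing by design" predicts. A row-level version of that check can be sketched as below. It assumes a null is serialised as an empty CSV field, and the three sample rows are invented, with the third deliberately inconsistent to show a mismatch being counted.

```python
import csv
import io


def count_label_null_mismatches(rows):
    """Count rows where attributed_time presence disagrees with is_attributed.

    `rows` is an iterable of dicts with string values, as yielded by csv.DictReader.
    Assumes an empty string represents a null attributed_time.
    """
    mismatches = 0
    for row in rows:
        has_time = bool(row["attributed_time"].strip())
        is_positive = row["is_attributed"] == "1"
        if has_time != is_positive:
            mismatches += 1
    return mismatches


# Invented sample using the train.csv column names from the Schema section.
sample = io.StringIO(
    "ip,app,device,os,channel,click_time,attributed_time,is_attributed\n"
    "83230,3,1,13,379,2017-11-06 14:32:21,,0\n"
    "17357,3,1,19,379,2017-11-06 14:33:34,2017-11-06 15:01:02,1\n"
    "35810,3,1,13,379,2017-11-06 14:34:12,2017-11-06 14:40:00,0\n"
)
mismatch_count = count_label_null_mismatches(csv.DictReader(sample))
print(mismatch_count)  # 1
```

On the real file the expected result is zero; a streaming reader like this avoids loading train.csv into memory, though the notebook itself uses DuckDB for its scans.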
Quirks and observations
- train.csv is ~7 GB; loaded via DuckDB streaming. Pandas would not fit it comfortably in memory.
- All feature columns are anonymised integer IDs: no human-readable meaning, no PII.
- attributed_time null whenever is_attributed = 0; by definition, not data loss.
- IP cardinality is large but each IP can map to many devices (mobile NAT), so IP is a weak linkage signal.
- test.csv lacks is_attributed / attributed_time.
Framing distance
What real problem it approximates: large-scale mobile click fraud with the per-IP/device click-timing structure the behavioural-synchrony method consumes — a real-data setting for coordinated-action detection.
What it fails to represent: no bot/human label (only conversion); anonymised integer features with no semantics; device is a coarse type, not a fingerprint; ip is NAT-prone; and there is no session/navigation structure, only isolated click events.
What further evidence would be needed: a bot-labelled source; de-anonymised or richer device/network signals; and session-level context to distinguish coordinated automation from ordinary high-volume legitimate traffic.
What it cannot show
A reader should not treat is_attributed as a bot label, nor read a conversion model as a fraud or bot detector. The dataset shows how click-timing structure behaves at scale and supports synchrony-style analysis; it does not, by itself, label or measure automated traffic.
Reproduction
Generated by notebooks/eda/talkingdata-adtracking.ipynb, which calls openbotrisk.eda.loaders.load_talkingdata_meta (DuckDB single-pass scans; no full in-memory materialisation).
jupyter nbconvert --to notebook --execute --inplace \
  notebooks/eda/talkingdata-adtracking.ipynb \
  --ExecutePreprocessor.timeout=600

Loader runtime on the reference machine: 6.2 s. No file is fully materialised in pandas; DuckDB performs the row count, null counts, cardinality, label counts, and time-range queries in a single pass over train.csv.