Public datasets

Reference pages for the public datasets the project inspects, each documenting what the data approximates and — more importantly — what it cannot show.

The project reads public datasets the way it reads any source: for what they approximate and what they fail to represent. These pages document each dataset’s structure, labels, and, most importantly, its framing distance from the real bot/abuse problem. They are reference pages, not investigations: each describes a resource and does not run a method against it.

Every page follows the same shape — access, structure, schema, label, identifiers, temporal structure, missingness, quirks, framing distance, and an explicit “what it cannot show.” The framing-distance and cannot-show sections carry the analytical weight, because the project’s standing position is that no public dataset stands in cleanly for the real problem, and the honest move is to name the gap per source.
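
The shared page shape can be read as a checklist. The following is a minimal sketch, assuming a page's section headings are available as a list of strings; the names REQUIRED_SECTIONS and missing_sections are hypothetical and are not part of the project's tooling.

```python
# Hypothetical checklist of the sections every dataset page is expected to
# carry, in the order listed above. Not taken from the project's code.
REQUIRED_SECTIONS = [
    "access", "structure", "schema", "label", "identifiers",
    "temporal structure", "missingness", "quirks",
    "framing distance", "what it cannot show",
]


def missing_sections(headings):
    """Return the required sections absent from a page, in checklist order.

    Headings are compared case-insensitively after trimming whitespace.
    """
    present = {h.strip().lower() for h in headings}
    return [s for s in REQUIRED_SECTIONS if s not in present]
```

A page whose headings cover the full list yields an empty result; anything else names the sections still to be written, which is most useful for the framing-distance and cannot-show sections that carry the analytical weight.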

The datasets

  • Web Robot Sessions (Figshare 3477932) — labelled, session-level web traffic; the closest public data to the project’s actual unit of analysis. Strongest fit. Its labels are a heuristic, not ground truth, which is the central caution.
  • Facebook Recruiting IV: Human or Robot? (Kaggle) — actor-level auction-bid bot labelling; useful mainly as a methods read on weak labels, rare positives, temporal feature engineering, and what published Kaggle solutions converged on. Dated and low-proximity, not a benchmark.
  • IEEE-CIS Fraud Detection (Kaggle) — labelled payment-fraud transactions with card/email/device identifiers; the real-data companion to the identifier-graph method. Its chain-propagated label is itself an entity-resolution lesson.
  • TalkingData AdTracking (Kaggle) — large-scale mobile ad clicks; a natural home for the behavioural-synchrony method, provided its is_attributed target is read as a conversion flag, not a bot flag.

CTU-13 botnet netflow is documented in the repository as a scope-boundary case, but it is not linked here or rendered into the site because it sits at the network layer this project generally excludes.

A note on what these are for

These pages serve two complementary roles. As provenance, they document exactly what was inspected and how, so any later use of a dataset is traceable. As calibration anchors, the labelled ones (Web Robot Sessions especially) provide the measurable-error-rate setting that the project’s own synthetic and live experiments lack: the place where a method’s accuracy can be checked against real labels, with the standing caveat that the labels themselves are imperfect.
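
The calibration role comes down to comparing a method's flags against a dataset's labels. The following is a minimal sketch, assuming both are available as equal-length sequences of 0/1 values; the function name precision_recall is illustrative and does not describe any particular project method or dataset loader.

```python
def precision_recall(predicted, labelled):
    """Precision and recall of binary predictions against dataset labels.

    Both inputs are sequences of 0/1 values of equal length. Either metric
    is None when its denominator is zero (no positives predicted, or no
    positives labelled).
    """
    if len(predicted) != len(labelled):
        raise ValueError("predicted and labelled must have the same length")
    pairs = list(zip(predicted, labelled))
    tp = sum(1 for p, y in pairs if p == 1 and y == 1)  # flagged, labelled bot
    fp = sum(1 for p, y in pairs if p == 1 and y == 0)  # flagged, labelled human
    fn = sum(1 for p, y in pairs if p == 0 and y == 1)  # missed, labelled bot
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return precision, recall
```

Because the labels are themselves imperfect, the numbers this produces measure agreement with the dataset's labelling, not accuracy against ground truth.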