Dataset: Web Robot Sessions (Figshare 3477932)

A labelled, session-level web-traffic dataset for bot detection — the closest public data to the project’s actual unit of analysis, and the one whose labels demand the most care.

Labelled web sessions with engineered HTTP/behavioural and page-semantic features, derived from a real library OPAC’s access logs. This is the most on-remit public dataset the project holds: its unit is the web session, its label is a bot/human flag, and it ships both the engineered features and the raw request log underneath. Its central caveat is the labels — they are the original authors’ heuristic, not ground truth, and a model that merely relearns that heuristic has measured nothing.

This is a dataset-reference page: it documents what the resource is, what it approximates, and what it cannot show. It does not run a detection method against the data — that would be a separate investigation (and a strong candidate for one). The descriptive content below is the output of notebooks/eda/web-robot-sessions.ipynb; the framing around it is the addition this page makes.

Why this dataset matters here

The project’s recurring problem is that public data sits at a distance from the real bot/abuse problem, and most datasets sit far — synthetic toys, vendor blogs, fraud tables labelled by chain-propagation. This one sits closer than most: the unit of analysis is a web session, which is the project’s actual subject, and it carries a direct ROBOT label rather than a proxy. That makes it the natural calibration anchor for any behavioural-detection work — including the local red-team experiment, which has perfect ground truth but only against its own attacker model. Where the red-team shows what signals look like on traffic you generated, this dataset shows whether those signals separate bots from humans at a measurable error rate on real sessions.

The cost of that proximity is the labelling problem, treated below as the load-bearing caution.

Access

Source: Figshare dataset 3477932 (web robot detection, session-level features + raw HTTP logs).

Underlying logs are from the Aristotle University of Thessaloniki library OPAC (search.lib.auth.gr), captured in early 2018 (per timestamps in public_v2.json). The dataset corresponds to the Lagopoulos, Tsoumakas and Papadopoulos line of web-robot-detection work ("Web robot detection: A semantic approach", 2018), which uses semantic + behavioural session features.

  • Local path: data/web-robot-sessions
  • File formats: CSV (engineered session features) + JSON (raw per-request log).
  • Date inspected: 2026-05-23.
  • Files on disk: public_v2.json — 3.0 GB; semantic_features.csv — 4.1 MB; simple_features.csv — 15.2 MB.

Structure

  • simple_features.csv: 67,352 rows × 32 cols. One row = one web session. HTTP/behavioural features + binary ROBOT label.
  • semantic_features.csv: 67,352 rows × 7 cols. One row = one web session. Page-topic/semantic features + binary ROBOT label.
  • public_v2.json: ~3.0 GB. Raw per-request log as a single JSON object keyed by request ID (Elasticsearch-style). Each value is a dict with HTTP request fields. Not loaded fully; only the first 100 entries were stream-parsed for schema inspection.
  • Join key: ID links the two CSVs (67,352 / 67,352 session IDs overlap = 100%). The session ID is the parent key under which raw requests are grouped in the source logs; the raw JSON uses request-level IDs, so a direct session↔︎raw-request join would require an external mapping not included here.
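The ID-overlap check behind the join-key claim can be sketched as follows. This is a minimal stdlib illustration, not the notebook's own code: the two in-memory CSVs are tiny hypothetical stand-ins for simple_features.csv and semantic_features.csv, and the helper names are invented for the example.

```python
import csv
import io


def session_ids(csv_file):
    """Collect the ID column from an open CSV file object."""
    return {row["ID"] for row in csv.DictReader(csv_file)}


def id_overlap(ids_a, ids_b):
    """Return (shared count, fraction of the union of IDs that is shared)."""
    shared = ids_a & ids_b
    union = ids_a | ids_b
    return len(shared), (len(shared) / len(union) if union else 0.0)


# Hypothetical two-row stand-ins for the real files (values are illustrative).
simple = io.StringIO("ID,NUMBER_OF_REQUESTS,ROBOT\na1,79,1\nb2,3,0\n")
semantic = io.StringIO("ID,TOTAL_TOPICS,ROBOT\na1,242,1\nb2,5,0\n")

shared, fraction = id_overlap(session_ids(simple), session_ids(semantic))
print(shared, fraction)
```

A fraction of 1.0 on the real files would mean every session ID appears in both CSVs, which is what makes an inner join on ID lossless.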

Schema

simple_features (32 columns)

| column | dtype | example | description |
| --- | --- | --- | --- |
| ID | object | obSnwGoBCue8G08E_WCX | Session id (join key) |
| NUMBER_OF_REQUESTS | int64 | 79 | Number of HTTP requests in session |
| TOTAL_DURATION | int64 | 592 | Session duration (seconds) |
| AVERAGE_TIME | float64 | 7.5897436 | Mean inter-request interval (s) |
| STANDARD_DEVIATION | float64 | 1.8005404 | Std of inter-request interval (s) |
| REPEATED_REQUESTS | float64 | 0.0 | Fraction of repeated resource requests |
| HTTP_RESPONSE_2XX | float64 | 0.8734177 | Fraction of 2xx responses |
| HTTP_RESPONSE_3XX | float64 | 0.1265823 | Fraction of 3xx responses |
| HTTP_RESPONSE_4XX | float64 | 0.0 | Fraction of 4xx responses |
| HTTP_RESPONSE_5XX | float64 | 0.0 | Fraction of 5xx responses |
| GET_METHOD | float64 | 1.0 | Fraction of GET requests |
| POST_METHOD | float64 | 0.0 | Fraction of POST requests |
| HEAD_METHOD | float64 | 0.0 | Fraction of HEAD requests |
| OTHER_METHOD | float64 | 0.0 | Fraction of other HTTP methods |
| NIGHT | float64 | 0.0 | Fraction of requests during night hours |
| UNASSIGNED | float64 | 1.0 | Fraction of requests with unassigned referrer |
| IMAGES | float64 | 0.1012658 | Fraction of image resources |
| TOTAL_HTML | float64 | 0.8987342 | Fraction of HTML resources |
| HTML_TO_IMAGE | float64 | 0.1126761 | HTML-to-image request ratio |
| HTML_TO_CSS | float64 | 0.0 | HTML-to-CSS request ratio |
| HTML_TO_JS | float64 | 0.0 | HTML-to-JS request ratio |
| WIDTH | float64 | 44.0 | Session navigation graph width |
| DEPTH | float64 | 4.0 | Session navigation graph depth |
| STD_DEPTH | float64 | 0.4940411 | Std of navigation depth |
| CONSECUTIVE | float64 | 0.1012658 | Fraction of consecutive sequential requests |
| DATA | float64 | 1555089.0 | Total bytes transferred |
| PPI | float64 | 27183337.3 | Pages-per-interval (request rate proxy) |
| SF_REFERRER | float64 | 0.0 | Same-frame referrer fraction |
| SF_FILETYPE | float64 | 0.2051282 | Same-frame filetype fraction |
| MAX_BARRAGE | int64 | 1 | Max burst size (consecutive rapid requests) |
| PENALTY | int64 | 0 | Heuristic penalty score |
| ROBOT | int64 | 1 | Target: 1 = bot, 0 = human |

semantic_features (7 columns)

| column | dtype | example | description |
| --- | --- | --- | --- |
| ID | object | obSnwGoBCue8G08E_WCX | Session id (join key) |
| TOTAL_TOPICS | int64 | 242 | Total page topics visited |
| UNIQUE_TOPICS | int64 | 500 | Distinct page topics visited |
| PAGE_SIMILARITY | float64 | 2.0661157 | Mean pairwise page-content similarity |
| PAGE_VARIANCE | float64 | 92.2595556 | Variance of page-content vectors |
| BOOLEAN_PAGE_VARIANCE | float64 | 0.1654137 | Binary indicator of nontrivial page variance |
| ROBOT | int64 | 1 | Target: 1 = bot, 0 = human |

public_v2.json (per-entry schema, from first 100 entries)

| field | type | example | description |
| --- | --- | --- | --- |
| referrer | str | http://search.lib.auth.gr/Record/68b03… | HTTP Referer header (URL or -) |
| request | str | search.lib.auth.gr:80 66.249.34457 - - [01/Mar/2018… | Full raw access-log request line |
| method | str | GET | HTTP method |
| resource | str | /AJAX/d780f3cf8bf4e286eb6dec2f372f6d78… | Requested URL path |
| bytes | str | 491 | Response size in bytes (string) |
| response | str | 200 | HTTP status code (string) |
| ip | str | 66.249.34457 | Client IP (final octet digit-jumbled in source) |
| useragent | str | Mozilla/5.0 (compatible; Googlebot/2.1; …) | Client User-Agent string |
| timestamp | str | 2018-02-28T22:00:01.000Z | ISO-8601 UTC timestamp |
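A rough sketch of how the head of public_v2.json might be stream-parsed without loading the file. It assumes the layout described on this page (one top-level object, one entry per line) and, as a further assumption, that the opening and closing braces sit on their own lines. The request IDs and the "-" value for bytes are hypothetical, and the helper names are invented for illustration; this is not the loader's actual code.

```python
import io
import json
from itertools import islice


def iter_entries(lines):
    """Yield (request_id, record) pairs from a one-entry-per-line JSON object."""
    for line in lines:
        text = line.strip().rstrip(",")
        if text in ("", "{", "}"):
            continue  # skip the braces that open and close the top-level object
        # Assumed line shape: "<request id>": {...}
        ((request_id, record),) = json.loads("{" + text + "}").items()
        yield request_id, record


def normalise(record):
    """Cast string-typed numerics to int and map the "-" referrer sentinel to None."""
    out = dict(record)
    out["bytes"] = int(record["bytes"]) if record["bytes"].isdigit() else None
    out["response"] = int(record["response"]) if record["response"].isdigit() else None
    out["referrer"] = None if record["referrer"] == "-" else record["referrer"]
    return out


# Hypothetical two-entry stand-in for the head of the real file.
sample = io.StringIO(
    "{\n"
    '"req-1": {"referrer": "-", "method": "GET", "bytes": "491", "response": "200"},\n'
    '"req-2": {"referrer": "http://search.lib.auth.gr/", "method": "GET", "bytes": "-", "response": "304"}\n'
    "}\n"
)

first_entries = [(rid, normalise(rec)) for rid, rec in islice(iter_entries(sample), 100)]
print(first_entries[0])
```

Because islice stops after the requested number of entries, the rest of the file is never read. A pretty-printed or minified copy of the file would violate the one-entry-per-line assumption and need an incremental JSON parser instead.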

Label

Label column: ROBOT in both CSVs (1 = bot, 0 = human).

| ROBOT | count | rate |
| --- | --- | --- |
| 0 | 53,858 | 0.79965 |
| 1 | 13,494 | 0.20035 |

Label agreement between the two CSVs on shared ID: 1.0000 (both files carry the same source session classification, so this confirms consistency between the files, not correctness of the label). Class imbalance is roughly 4:1 human:bot: mild, and manageable without resampling.
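The class rates follow directly from the counts, and they also fix the accuracy floor: a rule that predicts "human" for every session already scores 0.79965, so accuracy on this label should be read against that floor. A short worked check, using only the counts reported above:

```python
# ROBOT value -> number of sessions, as reported in the label table.
counts = {0: 53_858, 1: 13_494}

total = sum(counts.values())
rates = {label: n / total for label, n in counts.items()}
human_to_bot = counts[0] / counts[1]

# Accuracy of always predicting the majority class (human).
majority_baseline = max(rates.values())

print(total, round(rates[0], 5), round(rates[1], 5), round(human_to_bot, 2))
```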

Important: the labels are a heuristic, not ground truth. This is the load-bearing caution.

Per the Figshare/paper description, the ROBOT flag was assigned in the source dataset by heuristic plus manual review of session user-agents and behaviour, and is session-level rather than request-level. It is not an independently verified bot/human determination — no such thing exists for real traffic at this scale.

Two consequences for any modelling:

  1. Circularity risk. A classifier trained on these features to predict ROBOT may simply relearn the user-agent / behaviour heuristic that produced the label. If so, its apparent accuracy measures agreement with that heuristic, not detection of bots. The honest experiment holds out UA-derived signal, or explicitly benchmarks a UA-baseline against the behavioural model to measure the marginal signal over the labelling rule.
  2. Ceiling, not truth. Error rates measured against this label are error rates against the heuristic — useful calibration of “do these behavioural features agree with the standard labelling,” not proof of catching real bots.

A specific, sharper version of the risk lives in the data: the PENALTY column looks like a heuristic bot-score, and if it fed the ROBOT labelling it would leak the label directly. Verify and likely drop it before using it as a feature.
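One rough way to probe the leak is to ask how well a single threshold on PENALTY alone reproduces ROBOT, and compare that with the majority-class floor of 0.79965. The sketch below is an assumption-laden illustration, not the dataset authors' procedure: the function name is invented, and the toy values stand in for the real columns. A score near 1.0 on the real data would be a strong hint of leakage, though not proof of how the label was built.

```python
def best_threshold_accuracy(values, labels):
    """Best accuracy of any rule 'predict 1 when value >= t', or its inverse."""
    pairs = list(zip(values, labels))
    n = len(pairs)
    best = 0.0
    for t in sorted(set(values)):
        correct = sum((v >= t) == bool(y) for v, y in pairs)
        # The inverse rule gets exactly the complementary count right.
        best = max(best, correct / n, (n - correct) / n)
    return best


# Toy stand-ins for the PENALTY and ROBOT columns (illustrative only).
leaky_penalty = [0, 0, 1, 2]
uninformative_penalty = [0, 1, 0, 1]
robot = [0, 0, 1, 1]

print(best_threshold_accuracy(leaky_penalty, robot))
print(best_threshold_accuracy(uninformative_penalty, robot))
```

The same function can be run over each behavioural feature in turn, which gives a crude ranking of how much of the label any single column recovers on its own.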

Identifier inventory

The CSVs expose only the session ID; per-session actor attributes are absent. Actor-level signals (IP, User-Agent) live in the raw public_v2.json log at the request level.

| column | source | n_unique (in scope) | role |
| --- | --- | --- | --- |
| ID | both CSVs | 67,352 | session primary key (Elasticsearch-style id) |
| ip | JSON (per request) | 9 (in 100-sample) | client IP (obfuscated, weak actor id) |
| useragent | JSON (per request) | 7 (in 100-sample) | UA string (weak actor/bot signal) |
| referrer | JSON (per request) | n/a | referring URL |

The source obfuscates the final octet of each IPv4 address by digit-jumbling (e.g. 66.249.34457), so IPs cannot be geolocated or joined to external lists. The bot user-agents in the 100-request sample are search-engine and research crawlers (Googlebot, BUbiNG, ICC-Crawler). This previews a framing point below: the bot class here appears to be dominated by indexing crawlers, not adversarial automation.
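The practical effect of the obfuscation is easy to demonstrate: the jumbled strings are not valid IPv4 addresses, so any IP-aware tooling rejects them and the field can only serve as an opaque string key. A minimal check with the standard library (the well-formed comparison address is an arbitrary example, not taken from the dataset):

```python
import ipaddress


def is_valid_ipv4(text):
    """Return True only if text parses as a dotted-quad IPv4 address."""
    try:
        ipaddress.IPv4Address(text)
        return True
    except ValueError:
        return False


print(is_valid_ipv4("66.249.34457"))  # obfuscated form from the sample
print(is_valid_ipv4("66.249.64.1"))   # arbitrary well-formed address for contrast
```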

Temporal structure

The CSVs hold only aggregated session-level temporal features (TOTAL_DURATION, AVERAGE_TIME, STANDARD_DEVIATION, NIGHT, MAX_BARRAGE); no wall-clock session start/end timestamps are exposed. Wall-clock timestamps live only in public_v2.json at the per-request level.

  • Format: ISO-8601 UTC string, e.g. 2018-02-28T22:00:01.000Z (millisecond precision).
  • Sample range (100 requests from the head of the file): 2018-02-28 22:00:01 to 2018-02-28 22:00:25 UTC.
  • The raw access-log line inside request also carries the original local timestamp with a +0200 offset (Athens winter time), consistent with the Greek source host.
  • The full-file temporal range cannot be reported without scanning the 3 GB JSON, which is out of scope for this bounded EDA.
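The relationship between the UTC timestamp field and the local time in the raw log line can be checked with a short conversion. The sketch assumes a fixed +0200 offset, as seen in the sample log line; that holds for winter timestamps only, and requests after the spring daylight-saving change would need a real time-zone database.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed +0200 offset, valid for the winter timestamps in the sample.
ATHENS_WINTER = timezone(timedelta(hours=2))


def parse_utc(ts):
    """Parse the dataset's 'YYYY-MM-DDTHH:MM:SS.mmmZ' strings as aware UTC datetimes."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)


utc = parse_utc("2018-02-28T22:00:01.000Z")
local = utc.astimezone(ATHENS_WINTER)
print(local.isoformat())
```

The first sampled request therefore falls just after local midnight on 1 March 2018, which matches the 01/Mar/2018 date visible in the raw request line.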

Missing data

  • simple_features: 43,221 null cells overall (2.0054% of cells); 3 of 32 columns have any nulls.
  • semantic_features: 26,328 null cells overall (5.5843%); 3 of 7 columns have any nulls.
  • public_v2.json: dense in the 100-row sample; referrer is the literal string - when absent (Apache convention), not JSON null, so a full-file null check must look for - sentinels.

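Columns with any nulls in simple_features (fraction of rows null): STANDARD_DEVIATION (0.2139), SF_FILETYPE (0.2139), SF_REFERRER (0.2139). Columns with any nulls in semantic_features (fraction of rows null): BOOLEAN_PAGE_VARIANCE (0.1303), PAGE_VARIANCE (0.1303), PAGE_SIMILARITY (0.1303). The identical fraction across the three columns in each file suggests that the same rows are affected, which is worth confirming before imputing.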
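A minimal sketch of how per-column null fractions might be computed, followed by a consistency check of the reported figures. It assumes an empty CSV cell is what counts as null (pandas also treats strings such as "NA" as missing, which this sketch does not); the function name and the four-row sample are hypothetical.

```python
import csv
import io


def null_fractions(csv_file):
    """Fraction of rows in which each column is empty (treated here as null)."""
    reader = csv.DictReader(csv_file)
    nulls = {name: 0 for name in reader.fieldnames}
    rows = 0
    for row in reader:
        rows += 1
        for name in reader.fieldnames:
            if row[name] in ("", None):
                nulls[name] += 1
    return {name: (count / rows if rows else 0.0) for name, count in nulls.items()}


# Hypothetical four-row stand-in for simple_features.csv.
sample = io.StringIO("ID,STANDARD_DEVIATION,SF_REFERRER\na,1.8,0.0\nb,,\nc,0.5,0.1\nd,,\n")
fractions = null_fractions(sample)
print(fractions)

# Consistency check on the reported figures: three columns at the stated
# per-column fraction should reproduce the overall null-cell counts.
rows_total = 67_352
simple_null_cells = 3 * round(0.2139 * rows_total)
semantic_null_cells = 3 * round(0.1303 * rows_total)
print(simple_null_cells, semantic_null_cells)
```

The reconstructed totals match the 43,221 and 26,328 null cells reported above, so the per-column and overall figures are mutually consistent.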

Quirks and observations

  • Three-file layout: two session-feature CSVs (engineered) + one 3 GB raw JSON log. The CSVs are pre-computed features; modelling can use them directly.
  • Both CSVs have identical row counts and 100% ID overlap; two feature blocks for the same session table, inner-joinable on ID.
  • The raw JSON is a single top-level object rather than NDJSON. Streaming parse works only because each entry happens to be on its own line; any reformatter would break naive line-parsers.
  • Raw-log IDs are per-request, not per-session; there is no in-file mapping from a session ID to its constituent request IDs.
  • Client IPs are partially obfuscated (final octet digit-shuffled), so they cannot be geolocated.
  • referrer uses "-" for missing values instead of JSON null.
  • PENALTY looks like a heuristic bot-score; if it was used to derive ROBOT it leaks the label. Verify before using as a feature.
  • The source host is a single library OPAC, so traffic patterns and bot mix are domain-specific: search-engine and research crawlers (Googlebot, ICC-Crawler) dominate the bot user-agents seen in the sample.

Framing distance

What real problem it approximates: behavioural bot detection on real web sessions — the project’s actual unit of analysis — with engineered features close to what a defender would compute from access logs.

What it fails to represent: the labels are a heuristic, not verified ground truth; the traffic is a single 2018 library OPAC, not a transactional commercial site under adversarial bot pressure; the bots are predominantly search-engine and research indexing crawlers, not sophisticated targeted automation; and the obfuscated IP plus absent TLS/fingerprint signals mean the network and device layers are invisible.

What further evidence would be needed: independently verified labels (unavailable for real traffic); a transactional site with scarce-resource flows; adversarial/targeted bot traffic rather than crawlers; and the raw network/TLS signals the engineered features omit.

What it cannot show

A reader should not conclude that a behavioural model’s accuracy here transfers to (a) sophisticated adversaries, (b) commercial transactional sites, or (c) the detection of bots in general rather than agreement with one labelling heuristic on crawler-dominated traffic. It calibrates behavioural-feature separability against a standard heuristic — a real and useful thing, and a bounded one.

Reproduction

Generated by notebooks/eda/web-robot-sessions.ipynb, which calls openbotrisk.eda.loaders.load_web_robot_meta (pandas full-read for the two CSVs; manual line-streaming for the first 100 entries of public_v2.json).

```shell
jupyter nbconvert --to notebook --execute --inplace \
  notebooks/eda/web-robot-sessions.ipynb \
  --ExecutePreprocessor.timeout=300
```

Loader runtime on the reference machine: 0.2 s. The two CSVs fit in memory; the 3 GB JSON is never fully materialised — only the first 100 entries are read.