Dataset: Web Robot Sessions (Figshare 3477932)
Labelled web sessions with engineered HTTP/behavioural and page-semantic features, derived from a real library OPAC’s access logs. This is the most on-remit public dataset the project holds: its unit is the web session, its label is a bot/human flag, and it ships both the engineered features and the raw request log underneath. Its central caveat is the labels — they are the original authors’ heuristic, not ground truth, and a model that merely relearns that heuristic has measured nothing.
This is a dataset-reference page: it documents what the resource is, what it approximates, and what it cannot show. It does not run a detection method against the data — that would be a separate investigation (and a strong candidate for one). The descriptive content below is the output of notebooks/eda/web-robot-sessions.ipynb; the framing around it is the addition this page makes.
Why this dataset matters here
The project’s recurring problem is that public data sits at a distance from the real bot/abuse problem, and most datasets sit far — synthetic toys, vendor blogs, fraud tables labelled by chain-propagation. This one sits closer than most: the unit of analysis is a web session, which is the project’s actual subject, and it carries a direct ROBOT label rather than a proxy. That makes it the natural calibration anchor for any behavioural-detection work — including the local red-team experiment, which has perfect ground truth but only against its own attacker model. Where the red-team shows what signals look like on traffic you generated, this dataset shows whether those signals separate bots from humans at a measurable error rate on real sessions.
The cost of that proximity is the labelling problem, treated below as the load-bearing caution.
Access
Source: Figshare dataset 3477932 (web robot detection, session-level features + raw HTTP logs).
Underlying logs are from the Aristotle University of Thessaloniki library OPAC (search.lib.auth.gr), captured in early 2018 (per timestamps in public_v2.json). The dataset corresponds to the Lagopoulos and Tsoumakas line of web-robot-detection work, which combines semantic and behavioural session features.
- Local path: data/web-robot-sessions
- File formats: CSV (engineered session features) + JSON (raw per-request log).
- Date inspected: 2026-05-23.
- Files on disk: public_v2.json (3.0 GB); semantic_features.csv (4.1 MB); simple_features.csv (15.2 MB).
Structure
- simple_features.csv: 67,352 rows × 32 cols. One row = one web session. HTTP/behavioural features + binary ROBOT label.
- semantic_features.csv: 67,352 rows × 7 cols. One row = one web session. Page-topic/semantic features + binary ROBOT label.
- public_v2.json: ~3.0 GB. Raw per-request log as a single JSON object keyed by request ID (Elasticsearch-style). Each value is a dict with HTTP request fields. Not loaded fully; only the first 100 entries were stream-parsed for schema inspection.
- Join key: ID links the two CSVs (67,352 / 67,352 session IDs overlap = 100%). The session ID is the parent key under which raw requests are grouped in the source logs; the raw JSON uses request-level IDs, so a direct session↔raw-request join would require an external mapping not included here.
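The 100% ID overlap can be re-checked without loading the feature columns. The following is a minimal, stdlib-only sketch; the helper names read_ids and overlap_report are hypothetical and are not part of the project's loader, which uses pandas.

```python
import csv
from typing import Iterable, Set


def read_ids(path: str, id_column: str = "ID") -> Set[str]:
    """Collect the session IDs from one feature CSV."""
    with open(path, newline="", encoding="utf-8") as handle:
        return {row[id_column] for row in csv.DictReader(handle)}


def overlap_report(left: Iterable[str], right: Iterable[str]) -> dict:
    """Summarise how two collections of session IDs overlap."""
    left_ids, right_ids = set(left), set(right)
    shared = left_ids & right_ids
    return {
        "shared": len(shared),
        "only_left": len(left_ids - right_ids),
        "only_right": len(right_ids - left_ids),
        "overlap_of_left": len(shared) / len(left_ids) if left_ids else 0.0,
    }


# Example call, assuming the local path given under Access:
# report = overlap_report(
#     read_ids("data/web-robot-sessions/simple_features.csv"),
#     read_ids("data/web-robot-sessions/semantic_features.csv"),
# )
```

If the figures above hold, the report should show 67,352 shared IDs and zero IDs on either side alone.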
Schema
simple_features (32 columns)
| column | dtype | example | description |
|---|---|---|---|
| ID | object | obSnwGoBCue8G08E_WCX | Session id (join key) |
| NUMBER_OF_REQUESTS | int64 | 79 | Number of HTTP requests in session |
| TOTAL_DURATION | int64 | 592 | Session duration (seconds) |
| AVERAGE_TIME | float64 | 7.5897436 | Mean inter-request interval (s) |
| STANDARD_DEVIATION | float64 | 1.8005404 | Std of inter-request interval (s) |
| REPEATED_REQUESTS | float64 | 0.0 | Fraction of repeated resource requests |
| HTTP_RESPONSE_2XX | float64 | 0.8734177 | Fraction of 2xx responses |
| HTTP_RESPONSE_3XX | float64 | 0.1265823 | Fraction of 3xx responses |
| HTTP_RESPONSE_4XX | float64 | 0.0 | Fraction of 4xx responses |
| HTTP_RESPONSE_5XX | float64 | 0.0 | Fraction of 5xx responses |
| GET_METHOD | float64 | 1.0 | Fraction of GET requests |
| POST_METHOD | float64 | 0.0 | Fraction of POST requests |
| HEAD_METHOD | float64 | 0.0 | Fraction of HEAD requests |
| OTHER_METHOD | float64 | 0.0 | Fraction of other HTTP methods |
| NIGHT | float64 | 0.0 | Fraction of requests during night hours |
| UNASSIGNED | float64 | 1.0 | Fraction of requests with unassigned referrer |
| IMAGES | float64 | 0.1012658 | Fraction of image resources |
| TOTAL_HTML | float64 | 0.8987342 | Fraction of HTML resources |
| HTML_TO_IMAGE | float64 | 0.1126761 | HTML-to-image request ratio |
| HTML_TO_CSS | float64 | 0.0 | HTML-to-CSS request ratio |
| HTML_TO_JS | float64 | 0.0 | HTML-to-JS request ratio |
| WIDTH | float64 | 44.0 | Session navigation graph width |
| DEPTH | float64 | 4.0 | Session navigation graph depth |
| STD_DEPTH | float64 | 0.4940411 | Std of navigation depth |
| CONSECUTIVE | float64 | 0.1012658 | Fraction of consecutive sequential requests |
| DATA | float64 | 1555089.0 | Total bytes transferred |
| PPI | float64 | 27183337.3 | Pages-per-interval (request rate proxy) |
| SF_REFERRER | float64 | 0.0 | Same-frame referrer fraction |
| SF_FILETYPE | float64 | 0.2051282 | Same-frame filetype fraction |
| MAX_BARRAGE | int64 | 1 | Max burst size (consecutive rapid requests) |
| PENALTY | int64 | 0 | Heuristic penalty score |
| ROBOT | int64 | 1 | Target: 1 = bot, 0 = human |
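The example values in the schema above are internally consistent, and checking that consistency is a cheap way to confirm what a column means. The sketch below copies the example row by hand and tests five relationships. The relationships themselves are assumptions inferred from this single row, not definitions taken from the source.

```python
import math

# Values copied from the example column of the simple_features schema.
row = {
    "NUMBER_OF_REQUESTS": 79,
    "TOTAL_DURATION": 592,
    "AVERAGE_TIME": 7.5897436,
    "HTTP_RESPONSE_2XX": 0.8734177,
    "HTTP_RESPONSE_3XX": 0.1265823,
    "HTTP_RESPONSE_4XX": 0.0,
    "HTTP_RESPONSE_5XX": 0.0,
    "GET_METHOD": 1.0,
    "POST_METHOD": 0.0,
    "HEAD_METHOD": 0.0,
    "OTHER_METHOD": 0.0,
    "IMAGES": 0.1012658,
    "TOTAL_HTML": 0.8987342,
    "HTML_TO_IMAGE": 0.1126761,
}


def close(a: float, b: float, tol: float = 1e-6) -> bool:
    """Compare two floats up to the rounding used in the schema examples."""
    return math.isclose(a, b, abs_tol=tol)


checks = {
    "status fractions sum to 1": close(
        sum(row[f"HTTP_RESPONSE_{c}XX"] for c in "2345"), 1.0
    ),
    "method fractions sum to 1": close(
        sum(row[f"{m}_METHOD"] for m in ("GET", "POST", "HEAD", "OTHER")), 1.0
    ),
    "IMAGES + TOTAL_HTML = 1": close(row["IMAGES"] + row["TOTAL_HTML"], 1.0),
    "AVERAGE_TIME = TOTAL_DURATION / (NUMBER_OF_REQUESTS - 1)": close(
        row["AVERAGE_TIME"], row["TOTAL_DURATION"] / (row["NUMBER_OF_REQUESTS"] - 1)
    ),
    "HTML_TO_IMAGE = IMAGES / TOTAL_HTML": close(
        row["HTML_TO_IMAGE"], row["IMAGES"] / row["TOTAL_HTML"]
    ),
}

for name, ok in checks.items():
    print(f"{'ok  ' if ok else 'FAIL'} {name}")
```

All five checks pass on the example row. The last one suggests that, at least for this session, HTML_TO_IMAGE equals image requests per HTML request. Whether that holds across the file would need the same check run over every row.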
semantic_features (7 columns)
| column | dtype | example | description |
|---|---|---|---|
| ID | object | obSnwGoBCue8G08E_WCX | Session id (join key) |
| TOTAL_TOPICS | int64 | 242 | Total page topics visited |
| UNIQUE_TOPICS | int64 | 500 | Distinct page topics visited |
| PAGE_SIMILARITY | float64 | 2.0661157 | Mean pairwise page-content similarity |
| PAGE_VARIANCE | float64 | 92.2595556 | Variance of page-content vectors |
| BOOLEAN_PAGE_VARIANCE | float64 | 0.1654137 | Binary indicator of nontrivial page variance |
| ROBOT | int64 | 1 | Target: 1 = bot, 0 = human |
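In the example row above, UNIQUE_TOPICS (500) is larger than TOTAL_TOPICS (242), which does not fit the column descriptions as written. The two counts also reproduce the PAGE_SIMILARITY example when divided. The sketch below counts such rows and recomputes that ratio. It is an observation on one row under the stated descriptions, and it does not establish how the source defines these columns. The function name is hypothetical.

```python
from typing import Iterable, Mapping


def count_unique_exceeds_total(rows: Iterable[Mapping[str, str]]) -> int:
    """Count rows where UNIQUE_TOPICS is larger than TOTAL_TOPICS."""
    return sum(
        1
        for row in rows
        if int(row["UNIQUE_TOPICS"]) > int(row["TOTAL_TOPICS"])
    )


# Example row copied from the semantic_features schema (values as CSV strings).
example = {
    "TOTAL_TOPICS": "242",
    "UNIQUE_TOPICS": "500",
    "PAGE_SIMILARITY": "2.0661157",
}

flagged = count_unique_exceeds_total([example])
ratio = int(example["UNIQUE_TOPICS"]) / int(example["TOTAL_TOPICS"])
print(flagged, round(ratio, 7))
```

Passing csv.DictReader rows from semantic_features.csv to count_unique_exceeds_total would show whether the pattern is a one-off or systematic. If it is systematic, the column descriptions should be checked against the source documentation.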
public_v2.json (per-entry schema, from first 100 entries)
| field | type | example | description |
|---|---|---|---|
| referrer | str | http://search.lib.auth.gr/Record/68b03… | HTTP Referer header (URL or -) |
| request | str | search.lib.auth.gr:80 66.249.34457 - - [01/Mar/2018… | Full raw access-log request line |
| method | str | GET | HTTP method |
| resource | str | /AJAX/d780f3cf8bf4e286eb6dec2f372f6d78… | Requested URL path |
| bytes | str | 491 | Response size in bytes (string) |
| response | str | 200 | HTTP status code (string) |
| ip | str | 66.249.34457 | Client IP (final octet digit-jumbled in source) |
| useragent | str | Mozilla/5.0 (compatible; Googlebot/2.1; …) | Client User-Agent string |
| timestamp | str | 2018-02-28T22:00:01.000Z | ISO-8601 UTC timestamp |
Label
Label column: ROBOT in both CSVs (1 = bot, 0 = human).
| ROBOT | count | rate |
|---|---|---|
| 0 | 53,858 | 0.79965 |
| 1 | 13,494 | 0.20035 |
Label agreement between the two CSVs on shared ID: 1.0000 (both files carry the same session-level classification, so this agreement says nothing about whether the labels are correct). Class imbalance is roughly 4:1 human:bot, which is mild and manageable without resampling.
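The class counts also fix the accuracy floor that any model has to beat. A short worked calculation, using only the counts from the table above:

```python
# ROBOT value -> number of sessions, copied from the label table.
counts = {0: 53_858, 1: 13_494}

total = sum(counts.values())
rates = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial classifier that always predicts "human".
majority_baseline = max(rates.values())

# Human sessions per bot session.
human_to_bot = counts[0] / counts[1]

print(total, round(rates[1], 5), round(majority_baseline, 5), round(human_to_bot, 2))
```

A model reporting about 80% accuracy on this data has therefore matched the always-human rule and nothing more, so per-class precision and recall are the more informative numbers.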
Per the Figshare/paper description, the ROBOT flag was assigned in the source dataset by heuristic plus manual review of session user-agents and behaviour, and is session-level rather than request-level. It is not an independently verified bot/human determination — no such thing exists for real traffic at this scale.
Two consequences for any modelling:
- Circularity risk. A classifier trained on these features to predict ROBOT may simply relearn the user-agent / behaviour heuristic that produced the label. If so, its apparent accuracy measures agreement with that heuristic, not detection of bots. The honest experiment holds out UA-derived signal, or explicitly benchmarks a UA-baseline against the behavioural model to measure the marginal signal over the labelling rule.
- Ceiling, not truth. Error rates measured against this label are error rates against the heuristic: a useful calibration of “do these behavioural features agree with the standard labelling,” not proof of catching real bots.
A specific, sharper version of the risk lives in the data: the PENALTY column looks like a heuristic bot-score, and if it fed the ROBOT labelling it would leak the label directly. Verify and likely drop it before using it as a feature.
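One way to run that verification is to ask how well PENALTY alone reproduces ROBOT. The sketch below searches for the single threshold whose rule "PENALTY at or above the threshold means bot" agrees most often with the label. The function name is hypothetical, and the rule assumes that higher penalties point towards bots, which the source does not confirm.

```python
from typing import Sequence, Tuple


def best_threshold_agreement(
    penalty: Sequence[float], robot: Sequence[int]
) -> Tuple[float, float]:
    """Return (threshold, agreement) for the best `penalty >= threshold -> bot` rule."""
    if len(penalty) != len(robot) or not penalty:
        raise ValueError("penalty and robot must be non-empty and equally long")
    best_threshold, best_agreement = float("nan"), -1.0
    # Every observed penalty value is a candidate threshold.
    for threshold in sorted(set(penalty)):
        hits = sum((p >= threshold) == bool(r) for p, r in zip(penalty, robot))
        agreement = hits / len(robot)
        if agreement > best_agreement:
            best_threshold, best_agreement = threshold, agreement
    return best_threshold, best_agreement


# Toy data for illustration only, not drawn from the dataset.
toy_penalty = [0, 0, 1, 3, 5]
toy_robot = [0, 0, 0, 1, 1]
print(best_threshold_agreement(toy_penalty, toy_robot))
```

Agreement far above the always-human rate of roughly 0.80 would suggest that PENALTY encodes the labelling rule and should be dropped. Agreement close to that rate would not clear the column, since a leak could still be non-monotonic or combined with other inputs.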
Identifier inventory
The CSVs expose only the session ID; per-session actor attributes are absent. Actor-level signals (IP, User-Agent) live in the raw public_v2.json log at the request level.
| column | source | n_unique (in scope) | role |
|---|---|---|---|
| ID | both CSVs | 67,352 | session primary key (Elasticsearch-style id) |
| ip | JSON (per request) | 9 (in 100-sample) | client IP (obfuscated, weak actor id) |
| useragent | JSON (per request) | 7 (in 100-sample) | UA string (weak actor/bot signal) |
| referrer | JSON (per request) | n/a | referring URL |
The source obfuscates the final octet of each IPv4 address by digit-jumbling (e.g. 66.249.34457), so IPs cannot be geolocated or joined to external lists. The bot user-agents in the 100-request sample are indexing crawlers (Googlebot, BUbiNG, ICC-Crawler), which previews a framing point below: on this evidence the bot class is dominated by indexing crawlers, not adversarial automation.
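The practical effect of the obfuscation is that the ip field no longer parses as an address. The sketch below shows this with the standard ipaddress module and derives a coarse grouping key from the leading octets. Both helper names are hypothetical, and the choice of two octets is an assumption based on the single example above, where only the first two dot-separated parts look intact.

```python
import ipaddress


def is_valid_ipv4(text: str) -> bool:
    """True when the string parses as a dotted-quad IPv4 address."""
    try:
        ipaddress.IPv4Address(text)
        return True
    except ValueError:
        return False


def coarse_prefix(text: str, octets: int = 2) -> str:
    """Keep only the leading dot-separated parts as a weak actor key."""
    return ".".join(text.split(".")[:octets])


sample_ip = "66.249.34457"  # example value from the raw-log schema
print(is_valid_ipv4(sample_ip), coarse_prefix(sample_ip))
```

A prefix key of this kind groups many unrelated clients together, so it is suitable for rough clustering only and not for identifying actors.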
Temporal structure
The CSVs hold only aggregated session-level temporal features (TOTAL_DURATION, AVERAGE_TIME, STANDARD_DEVIATION, NIGHT, MAX_BARRAGE); no wall-clock session start/end timestamps are exposed. Wall-clock timestamps live only in public_v2.json at the per-request level.
- Format: ISO-8601 UTC string, e.g. 2018-02-28T22:00:01.000Z (millisecond precision).
- Sample range (100 requests from the head of the file): 2018-02-28 22:00:01 to 2018-02-28 22:00:25 UTC.
- The raw access-log line inside request also carries the original local timestamp with a +0200 offset, consistent with the Athens source in winter time.
- The full-file temporal range cannot be reported without scanning the 3 GB JSON, which is out of scope for this bounded EDA.
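The two timestamp representations can be reconciled directly. In the schema example, the JSON timestamp is 2018-02-28T22:00:01.000Z while the raw request line starts with 01/Mar/2018, and the two agree once the +0200 offset is applied. A minimal sketch follows. The fixed offset is an assumption that holds for winter dates only, since Athens moves to +0300 in summer.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed +02:00 offset, valid only outside daylight-saving time.
ATHENS_WINTER = timezone(timedelta(hours=2))


def parse_utc(stamp: str) -> datetime:
    """Parse the millisecond-precision `...Z` timestamps used in public_v2.json."""
    parsed = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


utc = parse_utc("2018-02-28T22:00:01.000Z")
local = utc.astimezone(ATHENS_WINTER)
print(local.isoformat())
```

Features such as NIGHT depend on local hours, so any recomputation from the raw log should state which clock it uses.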
Missing data
- simple_features: 43,221 null cells overall (2.0054% of cells); 3 of 32 columns have any nulls.
- semantic_features: 26,328 null cells overall (5.5843%); 3 of 7 columns have any nulls.
- public_v2.json: dense in the 100-row sample; referrer is the literal string - when absent (Apache convention), not JSON null, so a full-file null check must look for - sentinels.
Columns with any nulls in simple_features, as a fraction of rows: STANDARD_DEVIATION (0.2139), SF_FILETYPE (0.2139), SF_REFERRER (0.2139). Columns with any nulls in semantic_features, as a fraction of rows: BOOLEAN_PAGE_VARIANCE (0.1303), PAGE_VARIANCE (0.1303), PAGE_SIMILARITY (0.1303).
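Per-column null fractions like those above can be recomputed with a sentinel-aware counter, which also covers the "-" convention in the raw log. This is a stdlib-only sketch with a hypothetical function name. It assumes that nulls in the CSVs are serialised as empty fields, which should be confirmed against the files.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping, Optional, Sequence


def null_fractions(
    rows: Iterable[Mapping[str, Optional[str]]],
    sentinels: Sequence[str] = ("",),
) -> Dict[str, float]:
    """Fraction of rows per column whose value is None or a sentinel string."""
    missing: Counter = Counter()
    seen: Counter = Counter()
    for row in rows:
        for column, value in row.items():
            seen[column] += 1
            if value is None or value.strip() in sentinels:
                missing[column] += 1
    return {column: missing[column] / seen[column] for column in seen}


# CSV usage: null_fractions(csv.DictReader(handle))
# Raw-log usage: null_fractions(entries, sentinels=("", "-"))
toy = [{"referrer": "-"}, {"referrer": "http://search.lib.auth.gr/"}]
print(null_fractions(toy, sentinels=("", "-")))
```

The three affected columns in each CSV share one null fraction, so a natural follow-up is to check whether the nulls fall on the same rows.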
Quirks and observations
- Three-file layout: two session-feature CSVs (engineered) + one 3 GB raw JSON log. The CSVs are pre-computed features; modelling can use them directly.
- Both CSVs have identical row counts and 100% ID overlap; two feature blocks for the same session table, inner-joinable on ID.
- The raw JSON is a single top-level object rather than NDJSON. Streaming parse works only because each entry happens to be on its own line; any reformatter would break naive line-parsers.
- Raw-log IDs are per-request, not per-session; there is no in-file mapping from a session ID to its constituent request IDs.
- Client IPs are partially obfuscated (final octet digit-shuffled), so they cannot be geolocated.
- referrer uses "-" for missing values instead of JSON null.
- PENALTY looks like a heuristic bot-score; if it was used to derive ROBOT it leaks the label. Verify before using as a feature.
- The source host is a single library OPAC, so traffic patterns and bot mix are domain-specific: indexing crawlers (Googlebot, ICC-Crawler) dominate the bot class in the inspected sample.
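The line-streaming read mentioned in the quirks above can be sketched as follows. It relies on the layout assumption stated there (a lone opening brace, then one "request_id": {...} pair per line, then a lone closing brace) and fails on any other formatting. The function names are hypothetical and this is not the project loader's actual code.

```python
import json
from itertools import islice
from typing import Dict, Iterable, Iterator, List, Tuple

Entry = Tuple[str, Dict[str, str]]


def iter_entries(lines: Iterable[str]) -> Iterator[Entry]:
    """Yield (request_id, fields) pairs from a one-entry-per-line JSON object."""
    for raw in lines:
        line = raw.strip().rstrip(",")
        if line in ("", "{", "}"):
            continue  # skip blank lines and the outer braces
        # Wrap the bare `"id": {...}` pair so it parses as a one-key object.
        for request_id, fields in json.loads("{" + line + "}").items():
            yield request_id, fields


def head(path: str, n: int = 100) -> List[Entry]:
    """Read only the first n entries, never materialising the whole file."""
    with open(path, encoding="utf-8") as handle:
        return list(islice(iter_entries(handle), n))


toy_lines = [
    "{\n",
    '"a1": {"method": "GET", "referrer": "-"},\n',
    '"a2": {"method": "POST", "referrer": "http://search.lib.auth.gr/"}\n',
    "}\n",
]
print(list(iter_entries(toy_lines)))
```

For a full-file scan, an incremental JSON parser would be the more robust choice, since it does not depend on where the line breaks fall.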
Framing distance
What real problem it approximates: behavioural bot detection on real web sessions — the project’s actual unit of analysis — with engineered features close to what a defender would compute from access logs.
What it fails to represent: the labels are a heuristic, not verified ground truth; the traffic is a single 2018 library OPAC, not a transactional commercial site under adversarial bot pressure; the bots are predominantly academic indexing crawlers, not sophisticated targeted automation; and the obfuscated IP plus absent TLS/fingerprint signals mean the network and device layers are invisible.
What further evidence would be needed: independently verified labels (unavailable for real traffic); a transactional site with scarce-resource flows; adversarial/targeted bot traffic rather than crawlers; and the raw network/TLS signals the engineered features omit.
What it cannot show
A reader should not conclude that a behavioural model’s accuracy here transfers to (a) sophisticated adversaries, (b) commercial transactional sites, or (c) the detection of bots in general rather than agreement with one labelling heuristic on crawler-dominated traffic. It calibrates behavioural-feature separability against a standard heuristic — a real and useful thing, and a bounded one.
Reproduction
Generated by notebooks/eda/web-robot-sessions.ipynb, which calls openbotrisk.eda.loaders.load_web_robot_meta (pandas full-read for the two CSVs; manual line-streaming for the first 100 entries of public_v2.json).

    jupyter nbconvert --to notebook --execute --inplace \
      notebooks/eda/web-robot-sessions.ipynb \
      --ExecutePreprocessor.timeout=300

Loader runtime on the reference machine: 0.2 s. The two CSVs fit in memory; the 3 GB JSON is never fully materialised, and only the first 100 entries are read.