Goals of the paper:
- measuring how widely paywalls have been adopted
- present an automated system for detecting whether a site uses a paywall
The web is increasingly moving away from “open” models to “paywalled” models (see paywalls), to keep the user engaged with the platform.
People are annoyed by ad-based websites because:
- Google and Facebook have a monopoly on the advertising industry (70% of revenues)
- Ad-based funding systems suffer from significant and increasing rates of fraud
- privacy concerns
Paywall circumvention
We find that all observed paywalls are trivial to circumvent. Well-known techniques includes:
- emptying the cookie jar (75% effective)
- enabling Incognito/Private Mode (sufficient to bypass most paywalls)
- changing the screen size dimensions
- hiding the user’s IP address
- changing the user agent string
- using an ad blocker extension
- enabling “Reader Mode”
- using the Pocket web service5
- blocking HTTP requests for popular paywall libraries
Paywall detection
Our model consists of two components:
- a crawling component that visits a subset of pages on a site, records information about each page’s execution, and extracts some ML features
- text features: the presence of specific keywords is checked (“subscribe”, “sign up”, “remaining”, translated in 87 languages)
- structural features: if the website has a RSS (RDF Site Summary) or atom feed, the number of text nodes increases after cookies are cleaned, etc
- visual features: how many text nodes are obscured before and after cleaning cookies, number of text nodes, number of nodes that have z-index styles (to detect pop-ups)
- a classifier, that uses the extracted features to predict if the site uses a paywall
- a random forest is used from SciKit-Learn python library
- the final accuracy was of 77%