Scrapinghub Reference Documentation

General Web Page Information

If you requested any kind of extraction (e.g. article or job posting) and the extraction succeeds, the query result contains a webPage field alongside the type-specific field (e.g. article or jobPosting):

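import requests

# Request article extraction; '[api key]' is a placeholder for your API key.
response = requests.post(
    'https://autoextract.scrapinghub.com/v1/extract',
    auth=('[api key]', ''),
    json=[{'url': 'http://example.com/article',
           'pageType': 'article'}])
# The response body is a JSON list; parse it once and take the first result.
result = response.json()[0]
print(result['article'])
print(result['webPage'])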

The following fields are available for web page extraction:

Name: inLanguages

Type: List of dictionaries with code field

Description: The list of languages used on the page, ordered from the most prominently used language to the least used. code denotes the IETF BCP 47 language tag. If the language is not detected, the field is omitted.

All fields are optional. Fields without a valid value (null or empty array) are excluded from extraction results.
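
Because every field is optional, client code should not assume that webPage or inLanguages is present in a result. The following is a minimal sketch of defensive access, assuming result is one parsed element of the JSON response list; the helper name language_codes is hypothetical and not part of the API:

```python
def language_codes(result):
    """Return the language codes from one extraction result, most prominent first.

    `result` is assumed to be one parsed element of the JSON response list.
    A missing or empty `webPage` / `inLanguages` yields an empty list.
    """
    web_page = result.get('webPage') or {}
    languages = web_page.get('inLanguages') or []
    # Skip any entry that lacks a `code` key rather than raising KeyError.
    return [entry['code'] for entry in languages if 'code' in entry]


# Languages detected: the codes are returned in the order given.
print(language_codes({'webPage': {'inLanguages': [{'code': 'en'}, {'code': 'es'}]}}))
# No webPage field at all: an empty list is returned.
print(language_codes({'article': {}}))
```

Returning an empty list instead of raising keeps calling code simple when the language is not detected.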

Below is an example response with all web page fields present, for a request that asked for article extraction. Article fields are omitted in this example; the webPage field would be present for other kinds of extraction as well:

[
  {
    "article": {
    },
    "webPage": {
      "inLanguages": [
        {"code": "en"},
        {"code": "es"}
      ]
    },
    "query": {
      "id": "1564747029122-9e02a1868d70b7a3",
      "domain": "example.com",
      "userQuery": {
        "pageType": "article",
        "url": "https://example.com/article"
      }
    }
  }
]
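
When several results are processed together, the query.userQuery.url value shown in the response above can be used to relate detected languages back to the requested URL. The sketch below works on a compacted copy of the example response; the function name languages_by_url and the SAMPLE_RESPONSE constant are hypothetical and used only for illustration:

```python
import json

# Compacted copy of the example response from this page.
SAMPLE_RESPONSE = '''
[
  {
    "article": {},
    "webPage": {"inLanguages": [{"code": "en"}, {"code": "es"}]},
    "query": {
      "id": "1564747029122-9e02a1868d70b7a3",
      "domain": "example.com",
      "userQuery": {"pageType": "article", "url": "https://example.com/article"}
    }
  }
]
'''


def languages_by_url(results):
    """Map each requested URL to the language codes detected on that page."""
    mapping = {}
    for result in results:
        url = result.get('query', {}).get('userQuery', {}).get('url')
        if url is None:
            # Without the requested URL there is nothing to key the entry on.
            continue
        languages = result.get('webPage', {}).get('inLanguages', [])
        mapping[url] = [entry['code'] for entry in languages]
    return mapping


print(languages_by_url(json.loads(SAMPLE_RESPONSE)))
```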