Scrapinghub Reference Documentation

Crawlera Fetch API (experimental)

Warning

The Crawlera Fetch API is considered experimental. Use it only if you’re getting assistance from a Scrapinghub engineer. This documentation is subject to change without prior notice.

Please note that Crawlera had an old Fetch API that is no longer supported. This is an entirely new API with different parameters.

To use the Fetch API, you will need a Crawlera API key with Browser Execution functionality enabled, even if you don’t use the render and screenshot parameters. Otherwise you will get a 401 Unauthorized response.

We are currently testing the Fetch API under a limited private beta. Please fill out this form if you’re interested in trying it out. We don’t have a date for the public release yet.

The Crawlera Fetch API allows you to download web pages using an HTTP API, instead of a Proxy API. It provides server-side browser execution capabilities, and better browser emulation than requests processed through the standard proxy API.

Authentication is done through standard HTTP auth, using your Crawlera API key as the user name and an empty password.

Here is an example of a working request (replace API_KEY with your API key):

curl -u <API_KEY>: http://fetch.crawlera.com:8010/fetch/v2 -d url=https://toscrape.com

The examples in this documentation are provided as commands to execute in a terminal. You will need the curl and jq command line tools. curl often comes installed with your operating system, while jq needs to be downloaded.

You can download jq at: https://stedolan.github.io/jq/download/

Request Endpoint & Parameters

  • Endpoint: http://fetch.crawlera.com:8010/fetch/v2

  • Method: POST

  • Parameter values should be URL encoded.
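URL encoding matters in particular when the target URL contains a query string of its own, since its ?, =, and & characters would otherwise be parsed as part of the Fetch API request. A minimal sketch using jq's @uri filter (jq is already a stated dependency of these docs); the sample search URL is illustrative, and the actual request line is left commented out because it needs a real API key:

```shell
# Percent-encode a target URL before passing it as the url parameter.
# jq's @uri filter encodes everything outside the unreserved character set.
target='https://toscrape.com/search?q=scraping books'
encoded=$(printf '%s' "$target" | jq -sRr @uri)
echo "$encoded"

# Equivalent, letting curl do the encoding (replace <API_KEY> with your key):
# curl -u <API_KEY>: http://fetch.crawlera.com:8010/fetch/v2 \
#      --data-urlencode "url=$target"
```

curl's --data-urlencode does the same encoding on the fly, which avoids the intermediate variable when you don't need the encoded value for anything else.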

  • url (required): The URL to fetch. Example: https://toscrape.com

  • region (optional, default: auto): The region to route the request through, specified as a country code. If auto or omitted, Crawlera will pick the best region to route the request through based on the target website. Example: es

  • render (optional, default: false): Pass true to render the URL in a browser. Example: true

  • screenshot (optional, default: false): Pass true to return the screenshot field in the response, with a screenshot of the page. Implies render=true. Example: true

Example:

curl -u <API_KEEY>: http://fetch.crawlera.com:8010/fetch/v2 -d url=https://toscrape.com/

Response Format

The status code of the Fetch API response is always 200 (regardless of the response from the target website) unless there is a problem with the Fetch API itself.
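Because the API's own status code carries no information about the fetch outcome, error handling has to inspect the JSON body rather than the HTTP status. A minimal sketch against a hand-written sample response (the field values here are illustrative, not captured from the live API):

```shell
# The Fetch API replies 200 even for banned or failed fetches, so check
# the crawlera_status field in the JSON body, not the HTTP status code.
response='{"crawlera_status":"ban","original_status":503}'
status=$(printf '%s' "$response" | jq -r '.crawlera_status')
if [ "$status" = "success" ]; then
  echo "fetched; website replied $(printf '%s' "$response" | jq -r '.original_status')"
else
  echo "fetch failed: $status"
fi
```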

The response size limit is 10 MB.

The API response is a UTF-8 encoded JSON object with the following fields:

  • url (String): The URL of the page fetched.

  • body (String): The body of the response, encoded using body_encoding.

  • body_encoding (String): The encoding used for the body of the response. Either plain or base64.

  • headers (Object): The HTTP headers of the response.

  • original_status (Integer): The HTTP status of the response received from the website.

  • crawlera_status (String): Crawlera status, one of:

      • success: successful request (counts towards the monthly quota)

      • ban: the request was banned after trying multiple proxies

      • fail: another error (not a ban) prevented fulfilling the request; see crawlera_error.

  • screenshot (String): A screenshot of the page, encoded in base64.

Example Fetch API response:

{
  "url": "https://toscrape.com",
  "screenshot": "",
  "original_status": 200,
  "headers": {
    "server": "nginx/1.14.0 (Ubuntu)",
    "date": "Mon, 25 May 2020 17:40:16 GMT",
    "content-type": "text/html",
    "last-modified": "Wed, 29 Jun 2016 21:51:37 GMT",
    "x-upstream": "toscrape-sites-master_web",
    "transfer-encoding": "chunked"
  },
  "crawlera_status": "success",
  "body_encoding": "plain",
  "body": "...HTML of the response goes here..."
}
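Consumers of the response should branch on body_encoding before using body, since some pages come back base64-encoded rather than plain. A sketch against a hand-written sample response (the base64 payload here is just an encoded <html></html> stub, not live API output):

```shell
# Decode the body field according to body_encoding (plain or base64).
response='{"body_encoding":"base64","body":"PGh0bWw+PC9odG1sPg=="}'
encoding=$(printf '%s' "$response" | jq -r '.body_encoding')
if [ "$encoding" = "base64" ]; then
  body=$(printf '%s' "$response" | jq -r '.body' | base64 -d)
else
  body=$(printf '%s' "$response" | jq -r '.body')
fi
echo "$body"
```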

Use Cases

Fetch the HTML of a page rendered in a browser

To run this example you will need:

  • a Crawlera API key with Browser Execution enabled

  • the curl and jq command line utilities

Example:

curl -u <API_KEY>: http://fetch.crawlera.com:8010/fetch/v2 -d url=https://toscrape.com/ -d render=true | jq '.body' -r > page.html

Fetching a screenshot

To run this example you will need:

  • a Crawlera API key with Browser Execution enabled

  • the curl, jq and base64 command line utilities

Example:

curl -u <API_KEY>: http://fetch.crawlera.com:8010/fetch/v2 -d url=https://toscrape.com/ -d screenshot=true -d render=true | jq '.screenshot' -r | base64 -d > image.jpg

Scrapy Middleware For Crawlera Fetch API

There is an official Scrapy downloader middleware for downloading pages through the Crawlera Fetch API:

https://github.com/scrapy-plugins/scrapy-crawlera-fetch

Installation and usage instructions can be found in the README of the project.