Scrapinghub Reference Documentation

Crawlera Proxy API

Note

Check also the Crawlera Help Center for general guides and articles.

Proxy API

Crawlera works with a standard HTTP web proxy API, where you only need an API key for authentication. This is the standard way to perform a request via Crawlera:

curl -vx proxy.crawlera.com:8010 -U <API key>: http://httpbin.org/ip

Crawlera Sessions

See Crawlera Sessions

Request Headers

Crawlera supports multiple HTTP headers to control its behaviour.

Not all headers are available in every plan; the chart below shows which headers are available in each plan:

Header                     Basic   Advanced   Enterprise

X-Crawlera-Timeout           ✔        ✔           ✔
X-Crawlera-JobId             ✔        ✔           ✔
X-Crawlera-Max-Retries       ✔        ✔           ✔
X-Crawlera-No-Bancheck       ✔        ✔           ✔
X-Crawlera-Profile           ✔ ¹      ✔ ¹         ✔
X-Crawlera-Profile-Pass                           ✔
X-Crawlera-Cookies           ✔ ²      ✔ ²         ✔
X-Crawlera-Session                    ✔           ✔

¹ Basic and Advanced users may send desktop or mobile values only.

² Defaults to the discard value.

X-Crawlera-Profile

Available on Basic, Advanced, C50, C100, C200 and Enterprise plans.

This header is a replacement for the X-Crawlera-UA header, with slightly different behaviour: X-Crawlera-UA only sets the User-Agent header, but X-Crawlera-Profile applies the set of headers actually used by the browser. For example, all modern browsers set the Accept-Language and Accept-Encoding headers, and some browsers also set the DNT and Upgrade-Insecure-Requests headers.

This header is intended to replace the legacy X-Crawlera-UA header, so if you pass both X-Crawlera-UA and X-Crawlera-Profile, the latter supersedes X-Crawlera-UA.

Example:

X-Crawlera-UA: desktop
X-Crawlera-Profile: pass

Crawlera won’t respect the X-Crawlera-UA setting here because X-Crawlera-Profile is set.

Supported values for this header are:

  • pass - do not use any browser profile; use the User-Agent provided by the client (available on Enterprise plans only)

  • desktop - use a random desktop browser profile, ignoring the client's User-Agent header

  • mobile - use a random mobile browser profile, ignoring the client's User-Agent header

By default, no profile is used and Crawlera falls back to processing the X-Crawlera-UA header. If an unsupported value is passed in the X-Crawlera-Profile header, Crawlera replies with a 540 Bad Header Value error.
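For instance, following the same curl pattern as the Proxy API example above, you can request a random desktop profile (httpbin.org/headers is used here purely as an illustration, since it echoes back the headers that were actually sent):

curl -vx proxy.crawlera.com:8010 -U <API key>: http://httpbin.org/headers \
     -H "X-Crawlera-Profile: desktop"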

X-Crawlera-Profile-Pass

Available on C50, C100, C200 and Enterprise plans.

Crawlera profiles already provide correct default values for the headers sent by the mimicked browser. If you want to supply your own value for one of those headers, use the complementary X-Crawlera-Profile-Pass header. Its value is the name of the header you want to set yourself; in that case, Crawlera won't override your value. You can pass several header names, separated by commas.

Example

Suppose you want to use your own browser locale (de_DE) instead of the default en_US. In that case, put Accept-Language as the value of X-Crawlera-Profile-Pass and provide de_DE as the value of the Accept-Language header:

X-Crawlera-Profile: desktop
X-Crawlera-Profile-Pass: Accept-Language
Accept-Language: de_DE
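As a full curl invocation, following the Proxy API example above (httpbin.org/headers simply echoes the headers it receives), this would look like:

curl -vx proxy.crawlera.com:8010 -U <API key>: http://httpbin.org/headers \
     -H "X-Crawlera-Profile: desktop" \
     -H "X-Crawlera-Profile-Pass: Accept-Language" \
     -H "Accept-Language: de_DE"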

X-Crawlera-No-Bancheck

Available on Basic, Advanced, C50, C100, C200 and Enterprise plans.

This header instructs Crawlera not to check responses against its ban rules and to pass any received response on to the client. The presence of this header, with any value, is treated as a flag to disable ban checks.

Example:

X-Crawlera-No-Bancheck: 1

X-Crawlera-Cookies

Available on Enterprise plan and the old C50, C100 and C200 plans.

This header controls Crawlera's internal cookie tracking. On Crawlera Basic and Advanced plans, its value defaults to discard.

Supported values for this header are:

  • enable - internal cookies override cookies from the request

  • disable - cookies from the request override internal cookies

  • discard - all cookies are discarded. This is the default behaviour on Crawlera Basic & Advanced plans.

Example:

X-Crawlera-Cookies: disable
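For instance, with disable you can send your own cookies and they will override Crawlera's internal ones. A minimal curl sketch (the sessionid cookie is just an illustrative value, and httpbin.org/cookies merely echoes the cookies it receives):

curl -vx proxy.crawlera.com:8010 -U <API key>: http://httpbin.org/cookies \
     -H "X-Crawlera-Cookies: disable" \
     -H "Cookie: sessionid=example"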

X-Crawlera-Session

Available on Advanced, C10, C50, C100, C200 and Enterprise plans.

This header instructs Crawlera to use sessions, which tie requests to a particular outgoing IP until that IP gets banned.

Example:

X-Crawlera-Session: create

When the create value is passed, Crawlera creates a new session, whose ID is returned in the response header of the same name. All subsequent requests should use that returned session ID to prevent the outgoing IP from switching randomly between requests. Crawlera sessions currently have a maximum lifetime of 30 minutes. See Crawlera Sessions for information on the maximum number of sessions.
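A minimal shell sketch of this flow, using the proxy endpoint from the examples above (-D - dumps the response headers so the returned session ID can be captured):

# Create a session and capture the ID returned in the X-Crawlera-Session response header
SESSION=$(curl -sx proxy.crawlera.com:8010 -U <API key>: -D - -o /dev/null \
     http://httpbin.org/ip -H "X-Crawlera-Session: create" \
     | grep -i '^x-crawlera-session:' | cut -d' ' -f2 | tr -d '\r')

# Reuse the returned ID so subsequent requests keep the same outgoing IP
curl -vx proxy.crawlera.com:8010 -U <API key>: http://httpbin.org/ip \
     -H "X-Crawlera-Session: $SESSION"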

X-Crawlera-JobId

This header sets the job ID for the request (useful for tracking requests in the Crawlera logs).

Example:

X-Crawlera-JobId: 999

X-Crawlera-Max-Retries

This header limits the number of retries performed by Crawlera.

Example:

X-Crawlera-Max-Retries: 1

Passing 1 in this header instructs Crawlera to perform at most one attempt. The default number of attempts is 5, which is also the maximum allowed value; the minimum is 0. Passing 0 or 1 in this header has the same effect: a single attempt to execute the request.

X-Crawlera-Timeout

This header sets Crawlera’s timeout, in milliseconds, for receiving a response from the target website. The value must be between 30,000 and 180,000 milliseconds; values outside that range are clamped to the nearest limit.

Example:

X-Crawlera-Timeout: 40000

The example above sets the response timeout to 40,000 milliseconds. In the case of a streaming response, each chunk has 40,000 milliseconds to be received. If no response is received within 40,000 milliseconds, a 504 response is returned. If not specified, the timeout defaults to 30,000 milliseconds.

[Deprecated] X-Crawlera-Use-Https

Previously, performing HTTPS requests required the http variant of the URL plus the X-Crawlera-Use-Https header with the value 1, as in the following example:

curl -x proxy.crawlera.com:8010 -U <API key>: http://twitter.com -H x-crawlera-use-https:1

Now you can use the https URL directly and remove the X-Crawlera-Use-Https header, like this:

curl -x proxy.crawlera.com:8010 -U <API key>: https://twitter.com

If you don’t use curl with Crawlera, check the rest of the documentation and update your scripts so that you can continue using Crawlera without issues. Also, some programming languages will ask for the certificate file crawlera-ca.crt; you can install the certificate on your system or set it explicitly in the script.
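For example, with curl you can point to the certificate file using the --cacert option (assuming crawlera-ca.crt is in the current directory):

curl -x proxy.crawlera.com:8010 -U <API key>: --cacert crawlera-ca.crt https://twitter.com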

Response Headers

X-Crawlera-Next-Request-In

This header is returned when the response delay reaches the soft limit (120 seconds) and contains the calculated delay value. If the user ignores this header, the hard limit (1000 seconds) may be reached, after which Crawlera returns HTTP status code 503 with the delay value in the Retry-After header.

X-Crawlera-Debug

This header activates tracking of additional debug values, which are returned in response headers. At the moment, only the request-time and ua values are supported; use a comma as a separator. For example, to start tracking request time, send:

X-Crawlera-Debug: request-time

or, to track both request time and User-Agent send:

X-Crawlera-Debug: request-time,ua

The request-time option makes Crawlera report in a response header the request time (in seconds) of the last request retry, i.e. the time between Crawlera sending the request to an outgoing IP and Crawlera receiving response headers from that outgoing IP:

X-Crawlera-Debug-Request-Time: 1.112218

The ua option returns information about the actual User-Agent that was applied to the last request (useful, for instance, for finding the reasons behind redirects from a target website):

X-Crawlera-Debug-UA: Mozilla/5.0 (Windows; U; Windows NT 6.1; zh-CN) AppleWebKit/533+ (KHTML, like Gecko)
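You can inspect these debug headers with curl by dumping the response headers, following the Proxy API example above (-D - prints the headers to stdout and -o /dev/null discards the body):

curl -sx proxy.crawlera.com:8010 -U <API key>: -D - -o /dev/null \
     http://httpbin.org/ip -H "X-Crawlera-Debug: request-time,ua"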

X-Crawlera-Error

This header is returned when an error condition is met, identifying the particular Crawlera error behind a 4xx or 5xx HTTP status code. The error message is sent in the response body.

Example:

X-Crawlera-Error: user_session_limit

Note

Returned errors are internal to Crawlera and are subject to change at any time, so should not be relied on.

Using Crawlera with Scrapy Cloud

To employ Crawlera in Scrapy Cloud projects, use the Crawlera addon. Go to Settings > Addons > Crawlera to activate it.

Settings

CRAWLERA_URL

proxy URL (default: http://proxy.crawlera.com:8010)

CRAWLERA_ENABLED

tick the checkbox to enable Crawlera

CRAWLERA_APIKEY

Crawlera API key

CRAWLERA_MAXBANS

number of bans to ignore before closing the spider (default: 20)

CRAWLERA_DOWNLOAD_TIMEOUT

timeout for requests, in seconds (default: 190)

Using Crawlera from different languages

Check out our Knowledge Base for examples of using Crawlera with different programming languages.

Working with HTTPS

See Crawlera with HTTPS in our Knowledge Base

Working with Cookies

See Crawlera and Cookies in our Knowledge Base