Scrapinghub API Reference

Crawlera API

Note

Check also the Help Center for general guides and articles.

Proxy API

Crawlera works with a standard HTTP web proxy API, where you only need an API key for authentication. This is the standard way to perform a request via Crawlera:

curl -vx proxy.crawlera.com:8010 -U <API key>: http://httpbin.org/ip

Errors

When an error occurs, Crawlera sends a response containing an X-Crawlera-Error header and an error message in the body.

Note

These errors are internal to Crawlera and are subject to change at any time, so they should not be relied on and should only be used for debugging.

X-Crawlera-Error           Response Code   Error Message
bad_session_id             400             Incorrect session ID
user_session_limit         400             Session limit exceeded
bad_auth                   407
too_many_conns             429             Too many connections*
header_auth                470             Unauthorized Crawlera header
                           500             Unexpected error
nxdomain                   502             Error looking up domain
econnrefused               502             Connection refused
econnreset                 502             Connection reset
socket_closed_remotely     502             Server closed socket connection
send_failed                502             Send failed
noslaves                   503             No available proxies
slavebanned                503             Website crawl ban
serverbusy                 503             Server busy: too many outstanding requests
timeout                    504             Timeout from upstream server
msgtimeout                 504             Timeout processing HTTP stream
domain_forbidden           523             Domain forbidden. Please contact help@scrapinghub.com
bad_header                 540             Bad header value for <some_header>

* Crawlera limits the number of concurrent connections based on your Crawlera plan. See: Crawlera pricing table for more information on plans.
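
For example, a client can inspect the X-Crawlera-Error response header to decide how to handle a failed request. Below is a minimal sketch using the Python requests library; the handling policy shown is an assumption for illustration, not behaviour prescribed by Crawlera.

import requests
from requests.auth import HTTPProxyAuth

proxies = {"http": "http://proxy.crawlera.com:8010"}
proxy_auth = HTTPProxyAuth("<API key>", "")

r = requests.get("http://httpbin.org/ip", proxies=proxies, auth=proxy_auth)

error = r.headers.get("X-Crawlera-Error")
if error == "noslaves":
    # No proxies are available right now; back off before retrying.
    print("No available proxies, retry later")
elif error is not None:
    # Any other Crawlera error: log it for debugging only.
    print("Crawlera error:", error, r.status_code, r.text)
else:
    print(r.text)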

Sessions and Request Limits

Sessions

Sessions allow reusing the same slave for every request. Sessions expire 30 minutes after their last use, and Crawlera limits the number of concurrent sessions to 100 for C10 plans and 5000 for all other plans.

Sessions are managed using the X-Crawlera-Session header. To create a new session send:

X-Crawlera-Session: create

Crawlera will respond with the session ID in the same header:

X-Crawlera-Session: <session ID>

From then onward, subsequent requests can be made through the same slave by sending the session ID in the request header:

X-Crawlera-Session: <session ID>

Another way to create sessions is using the /sessions endpoint:

curl -u <API key>: proxy.crawlera.com:8010/sessions -X POST

This will also return a session ID, which you can pass to future requests in the X-Crawlera-Session header as before. This is helpful when you can’t create the session with the first request and read the ID back for the next one.

If an incorrect session ID is sent, Crawlera responds with a bad_session_id error.
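
A minimal sketch of this header-based flow using the Python requests library (the target URL is illustrative):

import requests
from requests.auth import HTTPProxyAuth

proxies = {"http": "http://proxy.crawlera.com:8010"}
proxy_auth = HTTPProxyAuth("<API key>", "")

# First request: ask Crawlera to create a session.
r = requests.get("http://httpbin.org/ip", proxies=proxies, auth=proxy_auth,
                 headers={"X-Crawlera-Session": "create"})
session_id = r.headers["X-Crawlera-Session"]

# Subsequent requests: reuse the same slave by sending the session ID back.
r = requests.get("http://httpbin.org/ip", proxies=proxies, auth=proxy_auth,
                 headers={"X-Crawlera-Session": session_id})
print(r.text)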

List sessions

Issue a GET request to the /sessions endpoint to list your sessions. The endpoint returns a JSON document in which each key is a session ID and the associated value is a slave.

Example:

curl -u <API key>: proxy.crawlera.com:8010/sessions
{"1836172": "<SLAVE1>", "1691272": "<SLAVE2>"}

Delete a session

Issue a DELETE request to /sessions/<session ID> in order to delete a session.

Example:

curl -u <API key>: proxy.crawlera.com:8010/sessions/1836172 -X DELETE

Request Limits

Crawlera’s default request limit is 5 requests per second (rps) for each website. There is a default delay of 200 ms between requests and a default delay of 1 second between requests through the same slave. These delays can differ for more popular domains. If the requests-per-second limit is exceeded, further requests will be delayed for up to 15 minutes, and each request made after exceeding the limit increases the request delay. If the request delay reaches the soft limit (120 seconds), each subsequent response will contain an X-Crawlera-Next-Request-In header with the calculated delay as the value.
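
A minimal client-side pacing sketch in Python that stays below the default limit and honours the calculated delay when it is present (treating the X-Crawlera-Next-Request-In value as milliseconds is an assumption; check the values your account actually receives):

import time

import requests
from requests.auth import HTTPProxyAuth

proxies = {"http": "http://proxy.crawlera.com:8010"}
proxy_auth = HTTPProxyAuth("<API key>", "")

urls = ["http://httpbin.org/ip"] * 3  # illustrative target URLs

for url in urls:
    r = requests.get(url, proxies=proxies, auth=proxy_auth)
    delay = r.headers.get("X-Crawlera-Next-Request-In")
    if delay is not None:
        # Crawlera asked us to slow down; unit assumed to be milliseconds.
        time.sleep(float(delay) / 1000.0)
    else:
        time.sleep(0.2)  # stay at or below 5 requests per second per website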

Request Headers

Crawlera supports multiple HTTP headers to control its behaviour.

Not all headers are available in every plan. Here is a chart of the headers available in each plan (C10, C50, etc.):

Header                    C10   C50   C100   C200   Enterprise
X-Crawlera-UA             no    yes   yes    yes    yes
X-Crawlera-No-Bancheck    no    yes   yes    yes    yes
X-Crawlera-Cookies        yes   yes   yes    yes    yes
X-Crawlera-Timeout        yes   yes   yes    yes    yes
X-Crawlera-Session        yes   yes   yes    yes    yes
X-Crawlera-JobId          yes   yes   yes    yes    yes
X-Crawlera-Max-Retries    yes   yes   yes    yes    yes

X-Crawlera-UA

Only available on C50, C100, C200 and Enterprise plans.

This header controls Crawlera User-Agent behaviour. The supported values are:

  • pass - pass the User-Agent as it comes on the client request
  • desktop - use a random desktop browser User-Agent
  • mobile - use a random mobile browser User-Agent

If X-Crawlera-UA isn’t specified, it will default to desktop. If an unsupported value is passed in the X-Crawlera-UA header, Crawlera replies with a 540 Bad Header Value.
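
Example:

X-Crawlera-UA: desktop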

More User-Agent types will be supported in the future (chrome, firefox) and added to the list above.

X-Crawlera-No-Bancheck

Only available on C50, C100, C200 and Enterprise plans.

This header instructs Crawlera not to check responses against its ban rules and to pass any received response on to the client. The presence of this header (with any value) acts as a flag to disable ban checks.

Example:

X-Crawlera-No-Bancheck: 1

X-Crawlera-Cookies

This header allows you to disable the internal cookie tracking performed by Crawlera.

Example:

X-Crawlera-Cookies: disable

X-Crawlera-Session

Warning

An experimental beta feature.

This header instructs Crawlera to use sessions, which tie requests to a particular slave until it gets banned.

Example:

X-Crawlera-Session: create

When the create value is passed, Crawlera creates a new session, whose ID is returned in the response header of the same name. All subsequent requests should use that session ID to prevent random slave switching between requests. Crawlera sessions currently have a maximum lifetime of 30 minutes. See Sessions and Request Limits for information on the maximum number of sessions.

X-Crawlera-JobId

This header sets the job ID for the request (useful for tracking requests in the Crawlera logs).

Example:

X-Crawlera-JobId: 999

X-Crawlera-Max-Retries

This header limits the number of retries performed by Crawlera.

Example:

X-Crawlera-Max-Retries: 1

Passing 1 in the header instructs Crawlera to perform at most 1 retry. The default number of retries is 5 (which is also the maximum allowed value; the minimum is 0).

X-Crawlera-Timeout

This header sets Crawlera’s timeout in milliseconds for receiving a response from the target website. The timeout must be between 30,000 and 180,000 milliseconds; values outside this range are rounded to the nearest limit.

Example:

X-Crawlera-Timeout: 40000

The example above sets the response timeout to 40,000 milliseconds. In the case of a streaming response, each chunk has 40,000 milliseconds to be received. If no response is received within 40,000 milliseconds, a 504 response will be returned. If not specified, the timeout defaults to 30,000 milliseconds.

[Deprecated] X-Crawlera-Use-Https

Previously, HTTPS requests had to use the http variant of the URL plus the X-Crawlera-Use-Https header set to 1, as in the following example:

curl -x proxy.crawlera.com:8010 -U <API key>: http://twitter.com -H x-crawlera-use-https:1

Now you can use the https URL directly and remove the X-Crawlera-Use-Https header, like this:

curl -x proxy.crawlera.com:8010 -U <API key>: https://twitter.com

If you don’t use curl with Crawlera, check the rest of the documentation and update your scripts in order to continue using Crawlera without issues. Also, some programming languages will ask for the certificate file crawlera-ca.crt; you can install the certificate on your system or set it explicitly in the script.

Response Headers

X-Crawlera-Next-Request-In

This header is returned when the response delay reaches the soft limit (120 seconds) and contains the calculated delay value. If the client ignores this header, the hard limit (1000 seconds) may be reached, after which Crawlera returns HTTP status code 503 with the delay value in the Retry-After header.
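
A short sketch of how a client might react to these response headers with the Python requests library (the backoff policy and the millisecond unit assumed for X-Crawlera-Next-Request-In are illustrative assumptions, not behaviour prescribed by Crawlera):

import time

import requests
from requests.auth import HTTPProxyAuth

proxies = {"http": "http://proxy.crawlera.com:8010"}
proxy_auth = HTTPProxyAuth("<API key>", "")

r = requests.get("http://httpbin.org/ip", proxies=proxies, auth=proxy_auth)

if r.status_code == 503 and "Retry-After" in r.headers:
    # Hard limit reached: wait the number of seconds suggested by Crawlera.
    time.sleep(int(r.headers["Retry-After"]))
elif "X-Crawlera-Next-Request-In" in r.headers:
    # Soft limit reached: delay the next request (millisecond unit assumed).
    time.sleep(float(r.headers["X-Crawlera-Next-Request-In"]) / 1000.0)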

X-Crawlera-Debug

This header activates tracking of additional debug values, which are returned in the response headers. At the moment only the request-time and ua values are supported; use a comma as a separator. For example, to start tracking request time send:

X-Crawlera-Debug: request-time

or, to track both request time and User-Agent send:

X-Crawlera-Debug: request-time,ua

The request-time option makes Crawlera return, in a response header, the request time (in seconds) of the last request retry (i.e. the time between Crawlera sending the request to a slave and Crawlera receiving the response headers from that slave):

X-Crawlera-Debug-Request-Time: 1.112218

The ua option allows you to obtain information about the actual User-Agent which has been applied to the last request (useful for finding the reasons behind redirects from a target website, for instance):

X-Crawlera-Debug-UA: Mozilla/5.0 (Windows; U; Windows NT 6.1; zh-CN) AppleWebKit/533+ (KHTML, like Gecko)

X-Crawlera-Error

This header is returned when an error condition is met, stating a particular Crawlera error behind HTTP status codes (4xx or 5xx). The error message is sent in the response body.

Example:

X-Crawlera-Error: user_session_limit

Note

Returned errors are internal to Crawlera and are subject to change at any time, so they should not be relied on.

Using Crawlera with Scrapy Cloud

To employ Crawlera in Scrapy Cloud projects, the Crawlera addon is used. Go to Settings > Addons > Crawlera to activate it.

Settings

CRAWLERA_URL                 Proxy URL (default: http://proxy.crawlera.com:8010)
CRAWLERA_ENABLED             Tick the checkbox to enable Crawlera
CRAWLERA_APIKEY              Crawlera API key
CRAWLERA_MAXBANS             Number of bans to ignore before closing the spider (default: 20)
CRAWLERA_DOWNLOAD_TIMEOUT    Timeout for requests (default: 190)
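
If you prefer to configure the same settings in a Scrapy project directly, a minimal settings.py sketch is shown below. It assumes the scrapy-crawlera downloader middleware is installed in the project, which is an assumption and not part of the addon itself.

# settings.py -- sketch, assuming the scrapy-crawlera middleware is installed
DOWNLOADER_MIDDLEWARES = {
    'scrapy_crawlera.CrawleraMiddleware': 610,
}

CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = '<API key>'
CRAWLERA_URL = 'http://proxy.crawlera.com:8010'  # default proxy URL
CRAWLERA_MAXBANS = 20                            # bans to ignore before closing the spider
CRAWLERA_DOWNLOAD_TIMEOUT = 190                  # request timeout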

Using Crawlera with Selenium and Polipo

Since it’s not so trivial to set up proxy authentication in Selenium, a popular option is to employ Polipo as a proxy. Update the Polipo configuration file /etc/polipo/config to include the Crawlera credentials (if the file is not present, copy and rename config.sample found in the Polipo source folder):

parentProxy = "proxy.crawlera.com:8010"
parentAuthCredentials = "<API key>:"

For password safety reasons this content is displayed as (hidden) in the Polipo web interface manager. The next step is to specify Polipo proxy details in the Selenium automation script, e.g. for Python and Firefox:

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.proxy import *

polipo_proxy = "localhost:8123"

proxy = Proxy({
    'proxyType': ProxyType.MANUAL,
    'httpProxy': polipo_proxy,
    'ftpProxy' : polipo_proxy,
    'sslProxy' : polipo_proxy,
    'noProxy'  : ''
})

driver = webdriver.Firefox(proxy=proxy)
driver.get("http://scrapinghub.com")
assert "Scrapinghub" in driver.title
elem = driver.find_element_by_class_name("portia")
actions = ActionChains(driver)
actions.click(on_element=elem)
actions.perform()
print "Clicked on Portia!"
driver.close()

Using Crawlera with CasperJS, PhantomJS and SpookyJS

To use a session-wide Crawlera proxy with PhantomJS or CasperJS, provide the --proxy=proxy.crawlera.com:8010 and --proxy-auth=<API key>: arguments to PhantomJS (CasperJS passes these arguments through to PhantomJS).

Example:

casperjs|phantomjs --proxy="proxy.crawlera.com:8010" --proxy-auth="<API KEY>:''" yourscript.js

When making HTTPS requests, the URLs should be wrapped in a Fetch API call.

Example:

phantomjs --ssl-protocol=any phantomjs/examples/rasterize.js http://<API KEY>:@proxy.crawlera.com:8010/fetch?url=https://twitter.com twitter.jpg

SpookyJS allows you to spawn multiple instances of CasperJS suites, so proxy and proxy-auth arguments should be provided when creating a Spooky object.

Example:

var spooky = new Spooky({
    child: {
        proxy: 'proxy.crawlera.com:8010',
        'proxy-auth': '<API key>:'
        /* ... */
    },
    /* ... */
});

If it’s preferred that Crawlera operate only on specific URLs, they should be wrapped according to the Fetch API.

Example in CasperJS:

var casper = require('casper').create();
casper.start();
// always encode url components !
var url_to_crawl = encodeURIComponent('http://wtfismyip.com/text'); // results in http%3A%2F%2Fwtfismyip.com%2Ftext
// You can either
// Authenticate session wide:
casper.setHttpAuth('<API key>', '');
casper.open('http://proxy.crawlera.com:8010/fetch?url=' + url_to_crawl);
// or incorporate authentication into the url:
casper.open('http://<API key>:@proxy.crawlera.com:8010/fetch?url=' + url_to_crawl);

casper.then(function(response) {
    this.echo(response.url);
    this.echo(response.status);
    this.debugHTML();  // print page source
});
casper.run();

Using Crawlera with Splash

You can use Splash with Crawlera to render JavaScript and proxy all requests issued from Splash. This can be necessary if your crawler makes heavy use of Splash and the target website throttles or blocks requests from Splash.

How to do it?

You need to send your requests to Splash, and Splash must proxy its requests via Crawlera.

This is best achieved by using the Splash /execute endpoint. You need to create a Lua script that tells Splash to use the proxy for its requests. Splash provides the splash:on_request callback function, which can be used for this purpose.

function main(splash)
    local host = "proxy.crawlera.com"
    local port = 8010
    local user = "<API key>"
    local password = ""
    local session_header = "X-Crawlera-Session"
    local session_id = "create"

    splash:on_request(function (request)
        request:set_header("X-Crawlera-UA", "desktop")
        request:set_header(session_header, session_id)
        request:set_proxy{host, port, username=user, password=password}
    end)

    splash:on_response_headers(function (response)
        if response.headers[session_header] ~= nil then
            session_id = response.headers[session_header]
        end
    end)

    splash:go(splash.args.url)
    return splash:png()
end

The previous example renders a page as a PNG image and returns the binary content in the HTTP response. The /execute endpoint reads the automation script from the lua_source parameter (a string containing the full script).

Example (using python requests library):


# coding: utf-8
import requests

splash_server = 'http://192.168.99.100:8050'
url = 'https://twitter.com'  # the page to render via Splash (illustrative target)

with open('crawlera-splash.lua') as lua:
    lua_source = ''.join(lua.readlines())
    splash_url = '{}/execute'.format(splash_server)
    r = requests.post(
            splash_url,
            json={
                'lua_source': lua_source,
                'url': url,
            },
            timeout=100,
    )

    fp = open("crawlera-splash.png", "wb")
    fp.write(r.content)
    fp.close()

Note: in the previous Python script, Splash was running at 192.168.99.100, the default IP address of the Docker container.

Using Crawlera from Different Languages

Warning

Some HTTP client libraries, including Apache HttpComponents Client and .NET, don’t send authentication headers by default. This can result in doubled requests, so pre-emptive authentication should be enabled where this is the case.

In the following examples we’ll be making HTTPS requests to https://twitter.com through Crawlera. It is assumed that the Crawlera certificate has been installed, since the CONNECT method will be employed.

Python

Making use of Requests HTTP Proxy Authentication:

import requests
from requests.auth import HTTPProxyAuth

url = "https://twitter.com"
proxy_host = "proxy.crawlera.com"
proxy_port = "8010"
proxy_auth = HTTPProxyAuth("<API KEY>", "")
proxies = {"https": "https://{}:{}/".format(proxy_host, proxy_port)}

r = requests.get(url, proxies=proxies, auth=proxy_auth,
                 verify='/path/to/crawlera-ca.crt')

print("""
Requesting [{}]
through proxy [{}]

Request Headers:
{}

Response Time: {}
Response Code: {}
Response Headers:
{}
""".format(url, proxy_host, r.request.headers, r.elapsed.total_seconds(),
           r.status_code, r.headers, r.text))

PHP

Making use of PHP binding for libcurl library:

<?php

$ch = curl_init();

$url = 'https://twitter.com/';
$proxy = 'proxy.crawlera.com:8010';
$proxy_auth = '<API KEY>:';

curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_PROXY, $proxy);
curl_setopt($ch, CURLOPT_PROXYUSERPWD, $proxy_auth);
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);
curl_setopt($ch, CURLOPT_TIMEOUT, 30);
curl_setopt($ch, CURLOPT_CAINFO, '/path/to/crawlera-ca.crt');

$scraped_page = curl_exec($ch);
curl_close($ch);
echo $scraped_page;

?>

Making use of Guzzle, a PHP HTTP client, in the context of Symfony framework:

<?php

namespace AppBundle\Controller;

use GuzzleHttp\Client;
use Symfony\Bundle\FrameworkBundle\Controller\Controller;
use Sensio\Bundle\FrameworkExtraBundle\Configuration\Route;
use Symfony\Component\HttpFoundation\Response;

class CrawleraController extends Controller
{
    /**
     * @Route("/crawlera", name="crawlera")
     */
    
    public function crawlAction()
    {
        $url = 'https://twitter.com';
        $client = new Client(['base_uri' => $url]);
        $crawler = $client->get($url, ['proxy' => 'http://<API KEY>:@proxy.crawlera.com:8010'])->getBody();

        return new Response(
            '<html><body> '.$crawler.' </body></html>'
        );
    }
}

Ruby

Making use of curb, a Ruby binding for libcurl:

require 'curb'

url = "https://twitter.com"
proxy = "proxy.crawlera.com:8010"
proxy_auth = "<API KEY>:"

c = Curl::Easy.new(url) do |curl|
  curl.proxypwd = proxy_auth
  curl.proxy_url = proxy
  curl.verbose = true
end

c.perform
puts c.body_str

Making use of typhoeus, another Ruby binding for libcurl:

require 'typhoeus'

url = "https://twitter.com"
proxy_host = "proxy.crawlera.com:8010"
proxy_auth = "<API KEY>:"

request = Typhoeus::Request.new(
  url,
  method: :get,
  proxy: proxy_host,
  proxyuserpwd: proxy_auth
)

request.run
print "Response Code: "
puts request.response.code
print "Response Time: "
puts request.response.total_time
print "Response Headers: "
puts request.response.headers
print "Response Body: "
puts request.response.body

Making use of mechanize, a Ruby library for automated web interaction. Don’t forget to load the certificate file crawlera-ca.crt by setting the environment variable: export SSL_CERT_FILE=/path/to/crawlera-ca.crt

require 'rubygems'
require 'mechanize'

url = "https://twitter.com"
proxy_host = "proxy.crawlera.com"
proxy_api_key = "<API KEY>"

agent = Mechanize.new
agent.set_proxy proxy_host, 8010, proxy_api_key, ''

res = agent.get(url)
puts res.body

Node.js

Making use of request, an HTTP client:

var express = require('express');
var request = require('request');
var fs = require('fs');
var app = express();

app.get('/', function(req, res) {

    var options = {
        url: 'https://twitter.com',
        ca: fs.readFileSync("/path/to/crawlera-ca.crt"),
        requestCert: true,
        rejectUnauthorized: true
    };

    var new_req = request.defaults({
        'proxy': 'http://<API KEY>:@proxy.crawlera.com:8010'
    });

    function callback(error, response, body) {
        if (!error && response.statusCode == 200) {
            console.log(response.headers);
            console.log(body);
        }
        else{
            console.log(error, response, body);
        }
    }

    new_req(options, callback);

});

var server = app.listen(3000, function() {

    var host = server.address().address;
    var port = server.address().port;
    console.log('App listening at http://%s:%s', host, port);

});

Java

Note

Because of HTTPCLIENT-1649 you should use version 4.5 of HttpComponents Client or later.

Extending an example published at The Apache HttpComponents™ project website and inserting Crawlera details:

import java.io.File;
import javax.net.ssl.SSLContext;
import org.apache.http.HttpHeaders;
import org.apache.http.HttpHost;
import org.apache.http.auth.AuthScope;
import org.apache.http.auth.UsernamePasswordCredentials;
import org.apache.http.client.AuthCache;
import org.apache.http.client.CredentialsProvider;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.protocol.HttpClientContext;
import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
import org.apache.http.conn.ssl.TrustSelfSignedStrategy;
import org.apache.http.impl.auth.BasicScheme;
import org.apache.http.impl.client.BasicAuthCache;
import org.apache.http.impl.client.BasicCredentialsProvider;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicHeader;
import org.apache.http.ssl.SSLContexts;
import org.apache.http.util.EntityUtils;

public class ClientProxyAuthentication {

    public static void main(String[] args) throws Exception {
        
        // Trust own CA and all self-signed certs
        SSLContext sslcontext = SSLContexts.custom()
                .loadTrustMaterial(new File("/path/to/jre/lib/security/cacerts"),
                                   "changeit".toCharArray(),
                                   new TrustSelfSignedStrategy())
                .build();

        // Allow TLSv1 protocol only
        SSLConnectionSocketFactory sslsf = new SSLConnectionSocketFactory(
                sslcontext, new String[] {"TLSv1"},
                null,
                SSLConnectionSocketFactory.getDefaultHostnameVerifier());
        
        CredentialsProvider credsProvider = new BasicCredentialsProvider();
        credsProvider.setCredentials(
                new AuthScope("proxy.crawlera.com", 8010),
                new UsernamePasswordCredentials("<API KEY>", ""));
        
        try (CloseableHttpClient httpclient = HttpClients.custom()
                .setDefaultCredentialsProvider(credsProvider)
                .setSSLSocketFactory(sslsf)
                .build())
        {
            HttpHost target = new HttpHost("twitter.com", 443, "https");
            HttpHost proxy = new HttpHost("proxy.crawlera.com", 8010);

            AuthCache authCache = new BasicAuthCache();

            BasicScheme basicAuth = new BasicScheme();
            basicAuth.processChallenge(
                    new BasicHeader(HttpHeaders.PROXY_AUTHENTICATE,
                                    "Basic realm=\"Crawlera\""));
            authCache.put(proxy, basicAuth);

            HttpClientContext ctx = HttpClientContext.create();
            ctx.setAuthCache(authCache);

            RequestConfig config = RequestConfig.custom()
                .setProxy(proxy)
                .build();
            
            HttpGet httpget = new HttpGet("/");
            httpget.setConfig(config);

            System.out.println("Executing request " + httpget.getRequestLine() +
                " to " + target + " via " + proxy);

            try (CloseableHttpResponse response = httpclient.execute(
                target, httpget, ctx))
            {
                System.out.println("----------------------------------------");
                System.out.println(response.getStatusLine());
                System.out.println("----------------------------------------");
                System.out.println(EntityUtils.toString(response.getEntity()));
                EntityUtils.consume(response.getEntity());
            }
        }
    }
}

crawlera-ca.crt should be added to the keystore, for instance with keytool:

keytool -import -file /path/to/crawlera-ca.crt -storepass changeit -keystore $JAVA_HOME/jre/lib/security/cacerts -alias crawleracert

C#

using System;
using System.IO;
using System.Net;

namespace ProxyRequest
{
    class MainClass
    {
        public static void Main (string[] args)
        {
            var myProxy = new WebProxy("http://proxy.crawlera.com:8010");
            myProxy.Credentials = new NetworkCredential("<API KEY>", "");

            HttpWebRequest request = (HttpWebRequest)WebRequest.Create("https://twitter.com");
            request.Proxy = myProxy;
            request.PreAuthenticate = true;

            WebResponse response = request.GetResponse();
            Console.WriteLine("Response Status: " 
                + ((HttpWebResponse)response).StatusDescription);
            Console.WriteLine("\nResponse Headers:\n" 
                + ((HttpWebResponse)response).Headers);

            Stream dataStream = response.GetResponseStream();
            var reader = new StreamReader(dataStream);
            string responseFromServer = reader.ReadToEnd();
            Console.WriteLine("Response Body:\n" + responseFromServer);
            reader.Close();

            response.Close();
        }
    }
}

Fetch API

Warning

The Fetch API is deprecated and will be removed soon. Use the standard proxy API instead.

Crawlera’s fetch API lets you request URLs as an alternative to Crawlera’s proxy interface.

Fields

Note

Field values should always be encoded.

Field      Required   Description                               Example
url        yes        URL to fetch                              http://www.food.com/
headers    no         Headers to send in the outgoing request   header1:value1;header2:value2

Basic example:

curl -u <API key>: http://proxy.crawlera.com:8010/fetch?url=https://twitter.com

Headers example:

curl -u <API key>: 'http://proxy.crawlera.com:8010/fetch?url=http%3A//www.food.com&headers=Header1%3AVal1%3BHeader2%3AVal2'
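
While the deprecated endpoint remains available, here is a small Python sketch of building an encoded fetch URL (the target URL and header values are illustrative):

from urllib.parse import quote

import requests

api_key = "<API key>"
target = "http://www.food.com/"
headers = "Header1:Val1;Header2:Val2"

# Encode the field values before placing them in the query string.
fetch_url = "http://proxy.crawlera.com:8010/fetch?url={}&headers={}".format(
    quote(target, safe=""), quote(headers, safe=""))

r = requests.get(fetch_url, auth=(api_key, ""))
print(r.status_code)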