Scrapinghub API Reference

Scrapy Cloud Write Entrypoint

Scrapy Cloud Write Entrypoint is a write-only interface to Scrapy Cloud storage. Its main purpose is to make it easy to write crawlers and scripts compatible with Scrapy Cloud in different programming languages using custom Docker images.

Jobs in Scrapy Cloud run inside Docker containers. When a Job container is started, a named pipe is created at the location stored in the SHUB_FIFO_PATH environment variable. To interface with Scrapy Cloud storage, your crawler has to open this named pipe and write messages to it, following the simple text-based protocol described below.

Protocol

Each message is a line of ASCII characters terminated by a newline character. A message consists of the following parts:

  • a 3-character command (one of “ITM”, “LOG”, “REQ”, “STA”, or “FIN”),
  • followed by a space character,
  • then followed by a payload as a JSON object,
  • and a final newline character \n.

This is what an example log message looks like:

LOG {"time": 1485269941065, "level": 20, "message": "Some log message"}

This example and all the following examples omit the trailing newline character because it’s a non-printable character. This is how you would write the above example message in Python:

pipe.write('LOG {"time": 1485269941065, "level": 20, "message": "Some log message"}\n')
pipe.flush()
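
The message layout above can also be produced by a small helper instead of hand-written strings. This is a minimal sketch, not part of any official client: format_message and VALID_COMMANDS are hypothetical names introduced for illustration, and the payload is serialized with Python's standard json module, which emits the object on a single line when no indent is given.

```python
import json

# Hypothetical helper names; only the five commands come from the protocol.
VALID_COMMANDS = {"ITM", "LOG", "REQ", "STA", "FIN"}


def format_message(command, payload):
    """Return one protocol line: command, space, JSON object, newline."""
    if command not in VALID_COMMANDS:
        raise ValueError("unknown command: %r" % (command,))
    if not isinstance(payload, dict):
        raise TypeError("payload must be a dict (serialized as a JSON object)")
    return "%s %s\n" % (command, json.dumps(payload))


line = format_message(
    "LOG", {"time": 1485269941065, "level": 20, "message": "Some log message"}
)
```

The resulting line can be passed to pipe.write() as in the example above.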

Newline characters are used as message separators, so make sure that the serialized JSON payload doesn’t contain newline characters between key/value pairs and that newline characters inside strings, in both keys and values, are properly escaped, i.e. written as an actual \ (reverse solidus, backslash) followed by n. Here’s an example of two consecutive log messages that carry multiline messages in the payload:

LOG {"time": 1485269941065, "level": 20, "message": "First multiline message. Line 1\nLine 2"}
LOG {"time": 1485269941066, "level": 30, "message": "Second multiline message. Line 1\nLine 2"}

In Python this will look like this:

pipe.write('LOG {"time": 1485269941065, "level": 20, "message": "First multiline message. Line 1\\nLine 2"}\n')
pipe.write('LOG {"time": 1485269941066, "level": 30, "message": "Second multiline message. Line 1\\nLine 2"}\n')
pipe.flush()
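
Escaping newlines by hand is easy to get wrong. As a hedged alternative, the sketch below lets json.dumps do the escaping: a real newline inside a Python string is serialized as the two characters \ and n, so the only actual newline left in the line is the message terminator. The variable names are illustrative only.

```python
import json

# A Python string containing a real newline character.
message = "First multiline message. Line 1\nLine 2"

# json.dumps escapes the newline, so the payload stays on one line.
payload = json.dumps({"time": 1485269941065, "level": 20, "message": message})
line = "LOG " + payload + "\n"
```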

Unicode characters in the JSON object MUST be escaped using the standard JSON \u four-hex-digits syntax, e.g. the item {"ключ": "значение"} should look like this:

ITM {"\u043a\u043b\u044e\u0447": "\u0437\u043d\u0430\u0447\u0435\u043d\u0438\u0435"}
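
In Python, this escaping is what json.dumps does by default (ensure_ascii=True). The sketch below passes the flag explicitly to make the intent visible; the variable names are illustrative.

```python
import json

item = {"ключ": "значение"}

# ensure_ascii=True (the default) escapes every non-ASCII character as \uXXXX.
line = "ITM " + json.dumps(item, ensure_ascii=True) + "\n"
```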

The total size of the message MUST NOT exceed 1 MiB. For messages that exceed this size, an error is logged instead.
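
A crawler can check the size before writing so that oversized messages are handled on its own terms. This is a minimal sketch under one assumption that the text above does not spell out: the limit is taken to apply to the whole encoded line, including the command prefix and the trailing newline. fits_size_limit and MAX_MESSAGE_BYTES are hypothetical names.

```python
import json

MAX_MESSAGE_BYTES = 1024 * 1024  # 1 MiB


def fits_size_limit(line):
    """Return True if the encoded line is within the 1 MiB limit.

    Assumption: the limit covers the full line (command, payload, newline).
    """
    return len(line.encode("ascii")) <= MAX_MESSAGE_BYTES


small = "ITM " + json.dumps({"key": "value"}) + "\n"
large = "ITM " + json.dumps({"key": "x" * MAX_MESSAGE_BYTES}) + "\n"
```

Here fits_size_limit(small) is true and fits_size_limit(large) is false; what to do with an oversized item (split it, drop it, or log a warning) is left to the crawler.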

ITM command

The ITM command writes a single item into Scrapy Cloud storage. The ITM payload has no predefined schema.

Example:

ITM {"key": "value"}

To support very simple scripts the Scrapy Cloud Write Entrypoint allows sending plain JSON objects as items, i.e. without the 3-character command and space prefix. The following two messages are valid and equivalent:

ITM {"key": "value"}
{"key": "value"}
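
To illustrate the equivalence, the sketch below parses both forms into the same (command, payload) pair. It is only an illustration of the rule described above and not the parser Scrapy Cloud uses; parse_line is a hypothetical name.

```python
import json


def parse_line(line):
    """Split a protocol line into (command, payload).

    Illustrative only: a line that starts with "{" is treated as a bare item.
    """
    line = line.rstrip("\n")
    if line.startswith("{"):
        return "ITM", json.loads(line)
    command, _, payload = line.partition(" ")
    return command, json.loads(payload)


with_prefix = parse_line('ITM {"key": "value"}\n')
without_prefix = parse_line('{"key": "value"}\n')
```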

LOG command

The LOG command writes a single log message into Scrapy Cloud storage. The schema for the LOG payload is described in Log object.

Example:

LOG {"level": 20, "message": "Some log message"}

REQ command

The REQ command writes a single request into Scrapy Cloud storage. The schema for the REQ payload is described in Request object.

Example:

REQ {"url": "http://example.com", "method": "GET", "status": 200, "rs": 10, "duration": 20}

STA command

STA stands for stats and is used to populate the job stats page and to create graphs on the job details page.

Field   Description                                        Required
time    UNIX timestamp of the message, in milliseconds.    No
stats   JSON object with arbitrary keys and values.        Yes

If the following keys are present in the STA payload, their values are used to populate the Scheduled Requests graph on the job details page:

  • scheduler/enqueued
  • scheduler/dequeued

The key names above were picked for compatibility with Scrapy stats.

Example:

STA {"time": 1485269941065, "stats": {"key": 0, "key2": 20.5, "scheduler/enqueued": 20, "scheduler/dequeued": 15}}
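
A stats message can be built from a plain dictionary of counters. This is a minimal sketch: stats_message and its time_ms parameter are hypothetical names, and the timestamp defaults to the current time in milliseconds.

```python
import json
import time


def stats_message(stats, time_ms=None):
    """Build an STA line from a dict of stats.

    time_ms is the UNIX timestamp in milliseconds; defaults to now.
    """
    if time_ms is None:
        time_ms = int(time.time() * 1000)
    return "STA " + json.dumps({"time": time_ms, "stats": stats}) + "\n"


line = stats_message(
    {"key": 0, "key2": 20.5, "scheduler/enqueued": 20, "scheduler/dequeued": 15},
    time_ms=1485269941065,
)
```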

FIN command

The FIN command is used to set the outcome of a crawler execution, once it’s finished.

Field     Description                                                 Required
outcome   String with custom outcome message, limited to 255 chars.   Yes

Example:

FIN {"outcome": "finished"}
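
Because the outcome is limited to 255 characters, a crawler that builds the outcome dynamically may want to enforce the limit itself. The sketch below truncates on the client side; that is an assumption made for illustration, since the text above does not say how longer values are handled. fin_message and MAX_OUTCOME_LENGTH are hypothetical names.

```python
import json

MAX_OUTCOME_LENGTH = 255  # limit stated in the field table


def fin_message(outcome):
    """Build a FIN line, truncating the outcome to 255 characters.

    Truncation is a client-side choice made in this sketch.
    """
    outcome = str(outcome)[:MAX_OUTCOME_LENGTH]
    return "FIN " + json.dumps({"outcome": outcome}) + "\n"


line = fin_message("finished")
```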

Printing to stdout and stderr

The output printed by a job in Scrapy Cloud is automatically converted into log messages. Lines printed to stdout are converted into INFO level log messages. Lines printed to stderr are converted into ERROR level log messages. For example, if the script prints Hello, world to stdout, the resulting LOG command will look like this:

LOG {"time": 1485269941065, "level": 20, "message": "Hello, world"}

There’s very basic support for multiline standard output: if some output consists of multiple lines where the first line starts with a non-space character and subsequent lines start with a space character, it is treated as a single log entry. For example, the following traceback in stderr:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'e' is not defined

will produce the following log messages:

LOG {"time": 1485269941065, "level": 40, "message": "Traceback (most recent call last):\n  File \"<stdin>\", line 1, in <module>"}
LOG {"time": 1485269941066, "level": 40, "message": "NameError: name 'e' is not defined"}
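
The grouping rule can be sketched in a few lines. This is only an illustration of the rule as described above, not the code Scrapy Cloud runs; group_output_lines is a hypothetical name.

```python
def group_output_lines(lines):
    """Group output lines into log entries.

    A line that starts with a space is appended to the previous entry;
    any other line starts a new entry.
    """
    entries = []
    for line in lines:
        if line.startswith(" ") and entries:
            entries[-1] += "\n" + line
        else:
            entries.append(line)
    return entries


traceback_lines = [
    "Traceback (most recent call last):",
    '  File "<stdin>", line 1, in <module>',
    "NameError: name 'e' is not defined",
]
entries = group_output_lines(traceback_lines)
```

For the traceback above this yields two entries, matching the two log messages shown: the last line starts with a non-space character, so it begins a new entry.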

Resulting log messages are subject to the 1 MiB limit, which means that output longer than 1023 KiB is likely to cause errors.

Warning

Even though you can write log messages by printing them to stdout and stderr, we recommend using the named pipe and LOG messages instead. Due to the way data is sent between processes, the order of messages coming from different sources (named pipe, stdout, stderr) cannot be maintained. Using the named pipe exclusively gives the best performance and guarantees that messages are received in exactly the order they were sent.
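
For Python crawlers, one way to follow this recommendation is to route the standard logging module through the pipe. The sketch below is a hedged example, not an official integration: PipeLogHandler is a hypothetical class, and it assumes the protocol's level numbers coincide with Python's logging levels, which is consistent with the examples above (20 for INFO, 40 for ERROR).

```python
import io
import json
import logging


class PipeLogHandler(logging.Handler):
    """Write each log record to the pipe as a LOG message (hypothetical class)."""

    def __init__(self, pipe):
        super().__init__()
        self.pipe = pipe

    def emit(self, record):
        try:
            payload = {
                "time": int(record.created * 1000),  # milliseconds
                "level": record.levelno,  # assumed to match protocol levels
                "message": record.getMessage(),
            }
            self.pipe.write("LOG " + json.dumps(payload) + "\n")
            self.pipe.flush()
        except Exception:
            self.handleError(record)


# Demonstration with an in-memory stream standing in for the named pipe.
demo_pipe = io.StringIO()
demo_logger = logging.getLogger("entrypoint_demo")
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False
demo_logger.addHandler(PipeLogHandler(demo_pipe))
demo_logger.error("Something failed\nsecond line")
```

In a real job, the handler would be given the file object opened on the path from SHUB_FIFO_PATH instead of the in-memory stream.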

How to build a compatible crawler

Scripts or non-Scrapy spiders have to be deployed as custom Docker images.

Each spider needs to follow this pattern:

  1. Get the path to the named pipe mentioned earlier from the SHUB_FIFO_PATH environment variable.

  2. Open the named pipe for writing. For example, in Python:

    import os
    
    path = os.environ['SHUB_FIFO_PATH']
    pipe = open(path, 'w')
    
  3. Write messages to the pipe. To send a message immediately, flush the stream; otherwise the message may remain in the file buffer inside the crawler process. Flushing is not always required, because the buffer is flushed once enough data is written or when the file object is closed (the exact behavior depends on the programming language you use):

    # write item
    pipe.write('ITM {"a": "b"}\n')
    pipe.flush()
    # ...
    # write request
    pipe.write('REQ {"time": 1484337369817, "url": "http://example.com", "method": "GET", "status": 200, "rs": 10, "duration": 20}\n')
    pipe.flush()
    # ...
    # write log entry
    pipe.write('LOG {"time": 1484337369817, "level": 20, "message": "Some log message"}\n')
    pipe.flush()
    # ...
    # write stats
    pipe.write('STA {"time": 1485269941065, "stats": {"key": 0, "key2": 20.5}}\n')
    pipe.flush()
    # ...
    # set outcome
    pipe.write('FIN {"outcome": "finished"}\n')
    pipe.flush()
    
  4. Close the named pipe when the crawl is finished:

    pipe.close()
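
The four steps above can be combined so that the pipe is always closed, even if the crawl raises an exception. This is a minimal sketch under stated assumptions: open_entrypoint, send and run_job are hypothetical names, and the optional path argument exists only so the code can be tried outside a job container.

```python
import json
import os
from contextlib import contextmanager


@contextmanager
def open_entrypoint(path=None):
    """Open the named pipe for writing and close it on exit."""
    if path is None:
        path = os.environ["SHUB_FIFO_PATH"]  # steps 1 and 2
    pipe = open(path, "w")
    try:
        yield pipe
    finally:
        pipe.close()  # step 4


def send(pipe, command, payload):
    """Write one message and flush it immediately (step 3)."""
    pipe.write("%s %s\n" % (command, json.dumps(payload)))
    pipe.flush()


def run_job(path=None):
    """Example job: write one item, then set the outcome."""
    with open_entrypoint(path) as pipe:
        send(pipe, "ITM", {"a": "b"})
        send(pipe, "FIN", {"outcome": "finished"})
```

Inside a job container, calling run_job() with no argument reads the pipe location from SHUB_FIFO_PATH; passing a regular file path is a convenient way to inspect the generated messages locally.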
    

Note

scrapinghub-entrypoint-scrapy uses Scrapy Cloud Write Entrypoint; check its code if you need an example.