Scrapinghub Reference Documentation

Product List Extraction (beta)

If you request a product list extraction and the extraction succeeds, the productList field is available in the query result:

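from autoextract.sync import request_raw

# One dictionary per page to extract; pageType selects product list extraction.
query = [{
    'url': 'http://books.toscrape.com/',
    'pageType': 'productList'
}]
# Replace '[api key]' with your own API key.
results = request_raw(query, api_key='[api key]')
# The first result corresponds to the single query above.
print(results[0]['productList'])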
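
Because productList is only available when the extraction succeeds, it can help to guard the lookup instead of indexing directly. The following is a minimal sketch, not part of the client library: the helper name get_product_list and the sample_results data are hypothetical, and the sample only imitates the shape of a query result.

```python
def get_product_list(result):
    """Return the productList dict from one query result, or None if absent."""
    # productList is only present when the extraction succeeded.
    return result.get('productList')


# Hypothetical results, shaped like the query results described in this section.
sample_results = [
    {'productList': {'url': 'http://books.toscrape.com/', 'products': []}},
    {'query': {'userQuery': {'url': 'http://example.com/broken'}}},
]

for result in sample_results:
    product_list = get_product_list(result)
    if product_list is None:
        print('no productList in this result')
    else:
        print(product_list['url'])
```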

The following fields are available for productList:

| Name | Type | Description |
|---|---|---|
| url | String | URL of a page where products were extracted. |
| products | List of dictionaries | List of products, individual fields described below. |
| breadcrumbs | List of dictionaries with name and link optional string fields | A list of breadcrumbs (a specific navigation element) with optional name and URL. |
| paginationNext | Dictionary with url and text string fields. | Next page in the list (with respect to the current page) if pagination is present. url is the URL of the next page. It is a required field. text is the text corresponding to the link as it appears on site. It is optional. |
| paginationPrevious | Dictionary with url and text string fields. | Previous page in the list (with respect to the current page) if pagination is present. url is the URL of the previous page. It is a required field. text is the text corresponding to the link as it appears on site. It is optional. |

The url field is required.
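
The pagination and breadcrumb fields described above can be read defensively, since most of them are optional. This is a rough sketch under stated assumptions: the helper names next_page_url and breadcrumb_trail and the sample dictionary are hypothetical, and paginationNext is assumed to be absent when no next page was found.

```python
def next_page_url(product_list):
    """Return the URL of the next page, or None when there is none."""
    # Assumption: paginationNext is absent when no next page was found.
    pagination = product_list.get('paginationNext')
    if pagination is None:
        return None
    # url is required inside paginationNext; text is optional.
    return pagination['url']


def breadcrumb_trail(product_list):
    """Join the breadcrumb names that are present into one string."""
    crumbs = product_list.get('breadcrumbs', [])
    # name is optional, so skip breadcrumbs that do not carry one.
    return ' > '.join(c['name'] for c in crumbs if c.get('name'))


# Hypothetical productList value used only for illustration.
sample = {
    'url': 'http://example.com/product-list-page-3',
    'breadcrumbs': [
        {'name': 'Home', 'link': 'http://example.com'},
        {'link': 'http://example.com/books'},
        {'name': 'Fiction'},
    ],
    'paginationNext': {'url': 'http://example.com/product-list-page-4'},
}

print(next_page_url(sample))
print(breadcrumb_trail(sample))
```

A crawler could call next_page_url on each result and submit the returned URL as a new productList query until the function returns None.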

Each product in the products field has the following fields:

| Name | Type | Description |
|---|---|---|
| name | String | The name of the product. |
| offers | List of dictionaries with price, currency, regularPrice and availability string fields | Offers of the product. All fields are optional, but currency is present only if price is also present. price is a string containing a valid number (with a dot as the decimal separator); it is the price the customer has to pay after discounts or special offers. currency is the currency as given on the web site, without extra normalization (for example, both “$” and “USD” are possible currencies). regularPrice is the price before any discount or special offer; it is present only when it differs from price. availability is the product availability; currently it is either "InStock" or "OutOfStock". "InStock" covers the following cases: in stock, limited availability, pre-sale (the item is available for ordering and delivery before general availability), pre-order (the item is available for pre-order but will be delivered when generally available), and in-store-only (the item is available only at physical locations). "OutOfStock" covers the following cases: out of stock, discontinued, and sold out. |
| sku | String | Stock Keeping Unit identifier for the product, assigned by the seller. |
| brand | String | Brand or manufacturer of the product. |
| mainImage | String | A URL or data URL value of the main image of the product. |
| images | List of strings | A list of URL or data URL values of all images of the product (may include the main image). |
| description | String | Description of the product. |
| aggregateRating | Dictionary with ratingValue and bestRating float fields and a reviewCount int field | ratingValue is the average rating value. bestRating is the best possible rating value. reviewCount is the number of reviews or ratings for the product. All fields are optional, but at least one of reviewCount or ratingValue is present. |
| probability | Float | Probability that the extracted item is a single product listing. |
| url | String | URL of the main product page for this listing. |

All fields are optional, except for probability. Fields without a valid value (null or empty array) are excluded from extraction results.
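
Since prices arrive as strings and regularPrice is only present when it differs from price, a consumer usually has to normalize each offer before comparing values. The sketch below shows one possible way to do that; the function name summarize_offer and the keys of the returned dictionary are hypothetical choices, not part of the API.

```python
from decimal import Decimal


def summarize_offer(offer):
    """Convert the string prices of one offer into Decimal values."""
    price = offer.get('price')
    if price is None:
        # Every offer field is optional, so there may be no price at all.
        return None
    price = Decimal(price)
    # regularPrice is only present when it differs from price,
    # so fall back to price when it is missing.
    regular = Decimal(offer.get('regularPrice', price))
    return {
        'price': price,
        'regular_price': regular,
        'discount': regular - price,
        # currency is not normalized: it may be "$", "USD", and so on.
        'currency': offer.get('currency'),
        'availability': offer.get('availability'),
    }


discounted = summarize_offer({
    'price': '42',
    'currency': 'USD',
    'availability': 'InStock',
    'regularPrice': '60',
})
print(discounted['discount'])
```

Decimal is used instead of float so that prices such as "19.99" keep their exact decimal value.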

Below is an example response with all product list fields present:

[
  {
    "productList":{
      "url":"http://example.com/product-list-page-3",
      "breadcrumbs":[
        {
          "name":"Home",
          "link":"http://example.com"
        }
      ],
      "paginationNext":{
        "text":"Next Page",
        "url":"http://example.com/product-list-page-4"
      },
      "paginationPrevious":{
        "text":"Previous Page",
        "url":"http://example.com/product-list-page-2"
      },
      "products":[
        {
          "name":"Product 1",
          "url":"http://example.com/product1",
          "offers":[
            {
              "price":"42",
              "currency":"USD",
              "availability":"InStock",
              "regularPrice":"60"
            }
          ],
          "sku":"product sku",
          "brand":"product1 brand",
          "mainImage":"http://example.com/image.png",
          "images":[
            "http://example.com/image.png"
          ],
          "description":"product1 description",
          "aggregateRating":{
            "ratingValue":4.5,
            "bestRating":5.0,
            "reviewCount":31
          },
          "probability":0.95
        },
        {
          "name":"Product 2",
          "url":"http://example.com/product2",
          "offers":[
            {
              "price":"72",
              "currency":"USD",
              "availability":"OutOfStock"
            }
          ],
          "sku":"product2 sku",
          "brand":"product2 brand",
          "mainImage":"http://example.com/image2.png",
          "images":[
            "http://example.com/image2.png"
          ],
          "description":"product2 description",
          "aggregateRating":{
            "ratingValue":1.5,
            "bestRating":5.0,
            "reviewCount":85
          },
          "probability":0.90
        }
      ]
    },
    "webPage": {
      "inLanguages": [
        {"code": "en"},
        {"code": "es"}
      ]
    },
    "query":{
      "id":"1564747029122-9e02a1868d70b7a2",
      "domain":"example.com",
      "userQuery":{
        "pageType":"productList",
        "url":"https://example.com/product-list-page"
      }
    },
    "algorithmVersion": "20.8.1"
  }
]
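
A response like the one above can be filtered by probability and availability once it has been parsed. The following sketch uses a trimmed copy of the example response; the function name in_stock_names and the default threshold of 0.5 are assumptions made for illustration, not recommended values.

```python
import json

# Trimmed copy of the example response, keeping only the fields used here.
response_text = '''
[
  {
    "productList": {
      "url": "http://example.com/product-list-page-3",
      "products": [
        {"name": "Product 1",
         "offers": [{"price": "42", "currency": "USD",
                     "availability": "InStock", "regularPrice": "60"}],
         "probability": 0.95},
        {"name": "Product 2",
         "offers": [{"price": "72", "currency": "USD",
                     "availability": "OutOfStock"}],
         "probability": 0.90}
      ]
    }
  }
]
'''


def in_stock_names(results, min_probability=0.5):
    """Return names of sufficiently probable products with an in-stock offer."""
    names = []
    for result in results:
        # productList and products may both be missing from a result.
        product_list = result.get('productList', {})
        for product in product_list.get('products', []):
            # probability is the only field every product is guaranteed to have.
            if product['probability'] < min_probability:
                continue
            offers = product.get('offers', [])
            if any(o.get('availability') == 'InStock' for o in offers):
                names.append(product.get('name'))
    return names


results = json.loads(response_text)
print(in_stock_names(results))
```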