Warning

Zyte Automatic Extraction will be discontinued starting April 30th, 2024. It is replaced by Zyte API. See Migrating from Automatic Extraction to Zyte API.

Product Extraction

Product extraction supports pages that contain a single product. It extracts many fields, such as product name, brand, price, availability, and SKU.

This supports use-cases such as price monitoring, product intelligence, product analytics and many others.

Related page types are Product List Extraction which supports pages with multiple products, Review Extraction which supports reviews on single product pages, Real Estate Extraction which supports pages with a single real estate item, and Vehicle Extraction which supports pages with a single vehicle item.

Request example

If you request a product extraction, and the extraction succeeds, then the product field is available in the query result:

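from autoextract.sync import request_raw

# Each query is a dictionary with the page URL and the page type to extract.
query = [{
    'url': 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
    'pageType': 'product'
}]
# Replace '[api key]' with your own API key.
results = request_raw(query, api_key='[api key]')
# On success, the extracted data is under the 'product' key of the result.
print(results[0]['product'])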
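
When several product pages need to be extracted, the query list can be built programmatically. The following is a minimal sketch, assuming each query needs only the url and pageType keys shown above. build_product_queries is a hypothetical helper (not part of the autoextract library), the second URL is a placeholder, and any limits on batch size are not addressed here.

```python
def build_product_queries(urls):
    """Return one product-extraction query dictionary per URL.

    Hypothetical helper for illustration; not part of the autoextract library.
    """
    return [{'url': url, 'pageType': 'product'} for url in urls]


queries = build_product_queries([
    'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
    'https://example.com/product',  # placeholder URL
])
```

The resulting list has the same shape as the query variable in the request example.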

Available fields

The following fields are available for product:

name: string

The name of the product.

offers: list of dictionaries

Product offers. Each offer may contain price, currency, regularPrice and availability string fields. All fields are optional, but currency is present only if price is also present.

  • price is a string containing a valid number (a dot is used as the decimal separator). It is the price a customer has to pay after discounts or special offers.

  • currency is the currency as given on the website, without extra normalization (for example, both “$” and “USD” are possible currencies). It is present only if price is also present.

  • regularPrice is the price before any discount or special offer. It is present only when the price is different from regularPrice.

  • availability is the product availability, as a string. Allowed values:

    • "InStock" - includes limited availability, presale, preorder, and in-store only.

    • "OutOfStock" - includes discontinued and sold out.

Example:

[
  {
    "price": "42",
    "regularPrice": "45.00",
    "currency": "USD",
    "availability": "InStock"
  }
]
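
Because price and regularPrice are strings, they need to be converted before doing arithmetic. The sketch below shows one possible way to compute a discount percentage with Python's decimal module. discount_percent is a hypothetical helper, and returning None for missing fields is an assumption about how a caller might want to handle optional values.

```python
from decimal import Decimal


def discount_percent(offer):
    """Return the discount as a percentage, or None if it cannot be computed.

    Hypothetical helper; both price fields are optional in an offer.
    """
    price = offer.get('price')
    regular = offer.get('regularPrice')
    if price is None or regular is None:
        return None
    price, regular = Decimal(price), Decimal(regular)
    if regular == 0:
        return None
    return (regular - price) / regular * 100


offer = {
    "price": "42",
    "regularPrice": "45.00",
    "currency": "USD",
    "availability": "InStock",
}
discount = discount_percent(offer)
```

Decimal is used instead of float here so that prices such as "45.00" are not subject to binary rounding.
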
sku: string

Stock Keeping Unit identifier for the product assigned by the seller.

mpn: string

Manufacturer part number identifier for the product. It is issued by the manufacturer and is the same across different websites for a given product.

gtin: list of dictionaries with type and value string fields

Standardized GTIN product identifier which is unique for a product across different sellers. It includes the following types: isbn10, isbn13, issn, ean13, upc, ismn, gtin8, gtin14.

gtin14 corresponds to former names EAN/UCC-14, SCC-14, DUN-14, UPC Case Code, UPC Shipping Container Code.

ean13 also includes the JAN (Japanese Article Number). Example:

[{"type": "isbn13", "value": "9781933624341"}]
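
Since gtin is a list of type/value pairs, looking up a particular identifier takes a small amount of code. The sketch below indexes the list by type and, as an optional sanity check, verifies the check digit of a 13-digit identifier using the standard alternating 1/3 weighting. Both function names are hypothetical, and the check assumes the value contains digits only (no hyphens or spaces).

```python
def gtin_by_type(gtins):
    """Index a gtin list by its type field (hypothetical helper)."""
    return {item['type']: item['value'] for item in gtins}


def has_valid_gtin13_check_digit(value):
    """Check the final digit of a 13-digit identifier (EAN-13, ISBN-13)."""
    if len(value) != 13 or not all(ch in '0123456789' for ch in value):
        return False
    digits = [int(ch) for ch in value]
    # Weights alternate 1, 3, 1, 3, ... over the first twelve digits.
    total = sum(d * (3 if i % 2 else 1) for i, d in enumerate(digits[:12]))
    return (10 - total % 10) % 10 == digits[12]


gtins = [{"type": "isbn13", "value": "9781933624341"}]
isbn13 = gtin_by_type(gtins).get('isbn13')
is_valid = has_valid_gtin13_check_digit(isbn13)
```
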
brand: string

Brand or manufacturer of the product.

breadcrumbs: list of dictionaries with name and link optional string fields

A list of breadcrumbs (a specific navigation element) with optional name and URL. Example:

[
  {"name": "Foo", "link": "http://example.com/foo"},
  {"name": "Bar", "link": "http://example.com/foo/bar"},
  {"name": "Baz"}
]
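
A common use of breadcrumbs is to build a category path. The following sketch joins the available names and skips entries without a name, since both name and link are optional. breadcrumb_path and the " > " separator are illustrative choices, not part of the API.

```python
def breadcrumb_path(breadcrumbs, separator=' > '):
    """Join breadcrumb names into one string, skipping unnamed entries.

    Hypothetical helper for illustration.
    """
    names = [crumb['name'] for crumb in breadcrumbs if crumb.get('name')]
    return separator.join(names)


breadcrumbs = [
    {"name": "Foo", "link": "http://example.com/foo"},
    {"name": "Bar", "link": "http://example.com/foo/bar"},
    {"name": "Baz"},
]
path = breadcrumb_path(breadcrumbs)
```
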
mainImage: string

A URL or data URL value of the main image of the product.

images: list of strings

A list of URL or data URL values of all images of the product (may include the main image).

description: string

Description of the product.

descriptionHtml: string

Simplified HTML of the description, including sub-headings, image captions and embedded content.

aggregateRating: dictionary

Aggregate information about the product rating and reviews.

  • ratingValue is the average rating value, as a float.

  • bestRating is the best possible rating value, as a float.

  • reviewCount is the number of reviews or ratings for the product, as an integer.

Example - 4.5 out of 5, based on 12 reviews:

{
  "ratingValue": 4.5,
  "bestRating": 5,
  "reviewCount": 12
}

All fields are optional, but at least one of reviewCount or ratingValue must be present.
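
Sites use different rating scales, so comparing ratings across sites usually requires normalization. The sketch below scales ratingValue by bestRating to a value between 0 and 1 and returns None when either field is missing. normalized_rating is a hypothetical helper, and treating a missing bestRating as "cannot normalize" is an assumption.

```python
def normalized_rating(aggregate_rating):
    """Return ratingValue / bestRating, or None if either is unavailable.

    Hypothetical helper; all aggregateRating fields are optional.
    """
    rating = aggregate_rating.get('ratingValue')
    best = aggregate_rating.get('bestRating')
    if rating is None or not best:
        return None
    return rating / best


score = normalized_rating({"ratingValue": 4.5, "bestRating": 5, "reviewCount": 12})
```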

color: string

Color of the product.

size: string

A standardized size of a product, specified through a simple textual string (for example “XL”, “32Wx34L”). A single product dimension (height, width) is not considered as the size.

style: string

Style of the product. It may be referred to as pattern or finish on the product page. Example values: “Polka dots”, “Striped”, “Nickel finish with Translucent glass”, etc.

additionalProperty: list of dictionaries with name and value fields

A list of product properties or characteristics.

  • name field contains the property name,

  • value field contains the property value.

Example:

[
  {"name": "color", "value": "blue"},
  {"name": "brand", "value": "McBrand"},
  {"name": "best for", "value": "special events"}
]
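
For lookups by property name, the list can be converted to a dictionary. This is a minimal sketch with a hypothetical helper name. If the same name appears more than once, the last value wins, which may or may not be what a given application needs.

```python
def properties_to_dict(additional_property):
    """Map property names to values (hypothetical helper).

    Later entries overwrite earlier ones with the same name.
    """
    return {prop['name']: prop['value'] for prop in additional_property}


properties = properties_to_dict([
    {"name": "color", "value": "blue"},
    {"name": "brand", "value": "McBrand"},
    {"name": "best for", "value": "special events"},
])
```
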
hasVariants: list of Product

A list of product variants, using the same Product schema. It represents extra information available about the variants of a product. All variants are included in this array, including the variant shown on the page. If a field of a variant is empty, it means either that the value is the same as in the top-level product or that the extraction API did not manage to extract it.
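
One way to work with variants is to fill in their missing fields from the top-level product. The sketch below does this with a hypothetical expand_variants helper. Note the assumption it makes: a missing field is treated as "same as the top-level product", although it may also mean the value could not be extracted, so inherited values should be treated as a best guess.

```python
def expand_variants(product):
    """Return one dictionary per variant, inheriting top-level fields.

    Hypothetical helper. Fields present on a variant take precedence;
    missing ones fall back to the top-level product, which is a guess.
    """
    base = {key: value for key, value in product.items() if key != 'hasVariants'}
    return [{**base, **variant} for variant in product.get('hasVariants', [])]


product = {
    "name": "Product name",
    "color": "product color",
    "hasVariants": [
        {"offers": [{"price": "42", "availability": "InStock"}], "color": "red"},
        {"offers": [{"price": "45", "availability": "OutOfStock"}], "color": "black"},
    ],
}
variants = expand_variants(product)
```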

probability: float

Probability that the requested page is a single product page.
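
The probability field can be used to discard results for pages that are unlikely to be single product pages. The following sketch filters a list of product dictionaries by a cut-off. The helper name and the 0.5 threshold are assumptions chosen for illustration; a suitable value depends on the data.

```python
MIN_PROBABILITY = 0.5  # assumed cut-off, not a recommended value


def keep_confident(products, min_probability=MIN_PROBABILITY):
    """Keep products whose probability reaches the cut-off (hypothetical helper)."""
    return [p for p in products if p['probability'] >= min_probability]


confident = keep_confident([
    {"url": "https://example.com/product", "probability": 0.95},
    {"url": "https://example.com/about", "probability": 0.1},
])
```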

canonicalUrl: string

Canonical URL of the product, if available.

url: string

URL of the page from which this product was extracted.

All fields are optional, except for url and probability.

Fields without a valid value (null or empty array) are excluded from extraction results.
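
Because fields without a value are left out of the result, client code should not index optional keys directly. The sketch below builds a flat summary that uses None for anything missing. summarize_product is a hypothetical helper, and taking only the first offer is a simplification.

```python
def summarize_product(product):
    """Build a flat summary, using None for fields that were not extracted.

    Hypothetical helper; only url and probability are guaranteed to exist.
    """
    offers = product.get('offers') or [{}]
    first_offer = offers[0]  # simplification: ignore any further offers
    return {
        'url': product['url'],
        'name': product.get('name'),
        'brand': product.get('brand'),
        'price': first_offer.get('price'),
        'currency': first_offer.get('currency'),
    }


summary = summarize_product({"url": "https://example.com/product", "probability": 0.95})
```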

Response example

Below is an example response with all product fields present:

[
  {
    "product": {
      "name": "Product name",
      "offers": [
        {
          "price": "42",
          "regularPrice": "45.00",
          "currency": "USD",
          "availability": "InStock"
        }
      ],
      "sku": "product sku",
      "mpn": "product mpn",
      "gtin": [
        {
          "type": "ean13",
          "value": "978-3-16-148410-0"
        }
      ],
      "brand": "product brand",
      "breadcrumbs": [
        {
          "name": "Level 1",
          "link": "http://example.com"
        }
      ],
      "mainImage": "http://example.com/image.png",
      "images": [
        "http://example.com/image.png"
      ],
      "description": "product description",
      "descriptionHtml": "<article>HTML description for Product ...",
      "aggregateRating": {
        "ratingValue": 4.5,
        "bestRating": 5.0,
        "reviewCount": 31
      },
      "color": "product color",
      "size": "product size",
      "style": "product style",
      "additionalProperty": [
        {
          "name": "property 1",
          "value": "value of property 1"
        }
      ],
      "hasVariants": [
        {
          "offers": [{"price": "42", "availability": "InStock"}],
          "color": "red",
          "sku": "pruqkj-r"
        },
        {
          "offers": [{"price": "45", "availability": "OutOfStock"}],
          "color": "black",
          "sku": "pruqkj-b"
        }
      ],
      "probability": 0.95,
      "canonicalUrl": "https://example.com/product/",
      "url": "https://example.com/product"
    },
    "webPage": {
      "inLanguages": [
        {"code": "en"},
        {"code": "es"}
      ]
    },
    "query": {
      "id": "1564747029122-9e02a1868d70b7a2",
      "domain": "example.com",
      "userQuery": {
        "pageType": "product",
        "url": "https://example.com/product"
      }
    },
    "algorithmVersion": "20.8.1"
  }
]
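
Each result carries the original query, which makes it possible to match products to the URLs that were requested. The sketch below indexes a response of the shape shown above by the requested URL and skips results without a product field. products_by_requested_url is a hypothetical helper, and the sample data is a trimmed copy of the example response.

```python
def products_by_requested_url(results):
    """Map each requested URL to its extracted product (hypothetical helper).

    Results without a 'product' key are skipped.
    """
    return {
        result['query']['userQuery']['url']: result['product']
        for result in results
        if 'product' in result
    }


sample_results = [
    {
        "product": {"name": "Product name", "probability": 0.95,
                    "url": "https://example.com/product"},
        "query": {"userQuery": {"pageType": "product",
                                "url": "https://example.com/product"}},
    },
]
by_url = products_by_requested_url(sample_results)
```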