Scrapinghub Reference Documentation

Article Extraction

If you request an article extraction and it succeeds, the article field is available in the query result:

import requests

response = requests.post(
    'https://autoextract.scrapinghub.com/v1/extract',
    auth=('[api key]', ''),
    json=[{'url': 'http://example.com/article?id=24',
           'pageType': 'article'}])
print(response.json()[0]['article'])
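
Because the article field is only available when the extraction succeeds, client code may want to check for it before use. The following is a minimal sketch, assuming the response body has already been decoded into a list of result dictionaries as in the request above. The helper name get_article is hypothetical and not part of the API, and the sample results are offline stand-ins rather than real responses.

```python
def get_article(results, index=0):
    """Return the article dict of one query result, or None if it is absent.

    `results` is the decoded JSON list returned by the extract endpoint.
    `get_article` is an illustrative helper, not part of the API.
    """
    if not 0 <= index < len(results):
        return None
    # The 'article' key is only present when the extraction succeeded.
    return results[index].get('article')


# Offline stand-ins for response.json(), so no request is made here.
succeeded = [{'article': {'headline': 'Article headline'}}]
failed = [{'query': {'userQuery': {'pageType': 'article'}}}]

print(get_article(succeeded))
print(get_article(failed))
```

Returning None instead of raising keeps the calling code free to decide how to handle a result without an article.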

The following fields are available for articles:

Name

Type

Description

headline

String

Article headline or title.

datePublished

String

Publication date, ISO-formatted with a ‘T’ separator; may include a timezone. If the actual publication date is not found, the dateModified value is used instead.

datePublishedRaw

String

The same date before parsing, as it appeared on the website.

dateModified

String

The date when the article was most recently modified, ISO-formatted with a ‘T’ separator; may include a timezone.

dateModifiedRaw

String

The same date before parsing, as it appeared on the website.

author

String

Author (or authors) of the article.

authorsList

List of strings

All authors of the article, split into separate strings. For example, the author value might be "Alice and Bob" with an authorsList value of ["Alice", "Bob"]; for a single author, the author value might be "Alice Johnes" with an authorsList value of ["Alice Johnes"].

inLanguage

String

Language of the article, as an ISO 639-1 language code.

breadcrumbs

List of dictionaries with optional name and link string fields

A list of breadcrumbs (a specific navigation element) with optional name and URL.

mainImage

String

A URL or data URL value of the main image of the article.

images

List of strings

A list of URL or data URL values of all images of the article (may include the main image).

description

String

A short summary of the article, human-provided if available, or auto-generated.

articleBody

String

Text of the article, including sub-headings, with newline separators.

articleBodyHtml

String

Simplified HTML of the article, including sub-headings, image captions and embedded content (videos, tweets, etc.). See the Format of articleBodyHtml Field section below for a detailed description.

articleBodyRaw

String

HTML of the article body as seen in the source page.

videoUrls

List of strings

A list of URLs of all videos inside the article body.

audioUrls

List of strings

A list of URLs of all audio files inside the article body.

probability

Float

Probability that this is a single article page.

canonicalUrl

String

Canonical URL of the article, if available.

url

String

URL of the page from which this article was extracted.

All fields are optional, except for url and probability. The articleBodyRaw field will not be returned if you pass "articleBodyRaw": false as a query parameter. Fields without a valid value (null or empty array) are excluded from extraction results.
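
Since only url and probability are guaranteed, reading the other fields with a default avoids errors when a field has been excluded. The sketch below illustrates this together with the articleBodyRaw opt-out. The helper names build_query and summarize, and the choice of fields to summarize, are assumptions made for illustration only.

```python
def build_query(url, include_raw_body=True):
    """Build one article query; optionally opt out of articleBodyRaw."""
    query = {'url': url, 'pageType': 'article'}
    if not include_raw_body:
        # Passing "articleBodyRaw": false asks for that field to be omitted.
        query['articleBodyRaw'] = False
    return query


def summarize(article):
    """Pick a few fields, falling back to defaults for excluded ones."""
    return {
        'url': article['url'],                  # always present
        'probability': article['probability'],  # always present
        'headline': article.get('headline', ''),
        'authors': article.get('authorsList', []),
        'images': article.get('images', []),
    }


print(build_query('http://example.com/article?id=24', include_raw_body=False))

# An article with only the two mandatory fields.
sparse = {'url': 'https://example.com/article?id=24', 'probability': 0.95}
print(summarize(sparse))
```

The dictionary returned by build_query can be placed in the json list of the request shown at the top of this section.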

Below is an example response with all article fields present:

[
  {
    "article": {
      "headline": "Article headline",
      "datePublished": "2019-06-19T00:00:00",
      "datePublishedRaw": "June 19, 2019",
      "dateModified": "2019-06-21T00:00:00",
      "dateModifiedRaw": "June 21, 2019",
      "author": "Article author",
      "authorsList": [
        "Article author"
      ],
      "inLanguage": "en",
      "breadcrumbs": [
        {
          "name": "Level 1",
          "link": "http://example.com"
        }
      ],
      "mainImage": "http://example.com/image.png",
      "images": [
        "http://example.com/image.png"
      ],
      "description": "Article summary",
      "articleBody": "Article body ...",
      "articleBodyHtml": "<article><p>Article body ... </p> ... </article>",
      "articleBodyRaw": "<div id=\"an-article\">Article body ...",
      "videoUrls": [
        "https://example.com/video.mp4"
      ],
      "audioUrls": [
        "https://example.com/audio.mp3"
      ],
      "probability": 0.95,
      "canonicalUrl": "https://example.com/article/article-about-something",
      "url": "https://example.com/article?id=24"
    },
    "webPage": {
      "inLanguages": [
        {"code": "en"},
        {"code": "es"}
      ]
    },
    "query": {
      "id": "1564747029122-9e02a1868d70b7a3",
      "domain": "example.com",
      "userQuery": {
        "pageType": "article",
        "url": "http://example.com/article?id=24"
      }
    },
    "algorithmVersion": "20.8.1"
  }
]
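
The date and probability values in a response like the one above can be turned into native types for further processing. This is a minimal sketch using only the standard library. The 0.5 cut-off is an arbitrary value chosen for illustration, and datetime.fromisoformat may reject some timezone notations (such as a trailing "Z") on Python versions before 3.11, so treat the parsing step as an assumption about the exact date format.

```python
from datetime import datetime

# A subset of the article fields from the example response.
article = {
    'datePublished': '2019-06-19T00:00:00',
    'dateModified': '2019-06-21T00:00:00',
    'probability': 0.95,
}

MIN_PROBABILITY = 0.5  # arbitrary cut-off chosen for this sketch


def parse_date(value):
    """Parse an ISO-formatted date string, or return None if it is missing."""
    return datetime.fromisoformat(value) if value else None


published = parse_date(article.get('datePublished'))
modified = parse_date(article.get('dateModified'))

if article['probability'] >= MIN_PROBABILITY and published and modified:
    # Number of whole days between publication and the last modification.
    print((modified - published).days)
```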

Format of articleBodyHtml Field

The articleBodyHtml field in article extractions contains a normalized and simplified HTML version of the article body. This makes it straightforward to apply your own CSS styles to this HTML so that the final look and feel matches the rest of your app.

The normalized HTML also allows for automated HTML processing which is consistent across websites. For example:

  • To get all images with their captions, you can run the //figure XPath and then ./img and ./figcaption on each result

  • h tags are normalized, making the article hierarchy easy to determine

  • Tables and lists can be extracted cleanly

  • Links are absolute

  • Only semantic HTML tags are returned; generic div and span elements are not included
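
The image-and-caption lookup from the list above can also be approximated without an XPath library. The sketch below swaps in the standard library html.parser module for the //figure XPath. The class name FigureCollector and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class FigureCollector(HTMLParser):
    """Collect image URL and caption pairs from <figure> elements."""

    def __init__(self):
        super().__init__()
        self.figures = []         # list of {'src': ..., 'caption': ...}
        self._current = None      # figure being read, if any
        self._in_caption = False

    def handle_starttag(self, tag, attrs):
        if tag == 'figure':
            self._current = {'src': None, 'caption': ''}
        elif self._current is not None and tag == 'img':
            self._current['src'] = dict(attrs).get('src')
        elif self._current is not None and tag == 'figcaption':
            self._in_caption = True

    def handle_data(self, data):
        if self._in_caption and self._current is not None:
            self._current['caption'] += data

    def handle_endtag(self, tag):
        if tag == 'figcaption':
            self._in_caption = False
        elif tag == 'figure' and self._current is not None:
            # Keep only figures that actually contain an image.
            if self._current['src'] is not None:
                self.figures.append(self._current)
            self._current = None
            self._in_caption = False


html = ('<article><p>Intro</p>'
        '<figure><img src="https://example.com/a.png" alt="A">'
        '<figcaption>First image</figcaption></figure>'
        '<figure><img src="https://example.com/b.png"></figure>'
        '</article>')

collector = FigureCollector()
collector.feed(html)
for figure in collector.figures:
    print(figure['src'], repr(figure['caption']))
```

Figures that hold other media, such as an iframe, are skipped here because the sketch only looks for img elements.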

The supported tags and attributes are normalized as follows:

Content Type

Normalization

Supported Elements/Attributes

Sectioning

All content is enclosed in a root article tag. Headings are normalized so that they always start with h2.

article (root only), h2, h3, h4, h5, h6, aside

Text

Paragraphs are enclosed in p tags. Tables, lists, definition lists and block quotes are supported.

p, table, tbody, thead, tfoot, th, tr, td, ul, ol, li, dl, dt, dd, blockquote

Inline text

The b tag is translated to strong, and the i tag is translated to em.

a, br, strong, em, s, sup, sub, del, ins, u, cite

Pre-formatted text

None

pre, code

Multimedia elements

Multimedia elements are generally enclosed in a figure element. Captions for these elements are included within the figcaption tag when available. If multimedia elements appear inline within paragraphs, they are kept as is (without an enclosing figure element).

figure, figcaption, img, video, audio, iframe, embed, object, source

Supported attributes

Tag attributes not in the supported list are filtered out of the output.

data-*, alt, cite, colspan, datetime, dir, href, label, rowspan, src, srcset, sizes, start, title, type, value, vspace
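
Because headings are normalized to always start at h2, the nesting depth of a section can be read directly from the tag name. The sketch below builds a simple outline with the standard library html.parser module. The class name HeadingOutline and the sample HTML are assumptions for illustration.

```python
from html.parser import HTMLParser

HEADINGS = {'h2', 'h3', 'h4', 'h5', 'h6'}


class HeadingOutline(HTMLParser):
    """Collect (depth, text) pairs; h2 has depth 0, h3 depth 1, and so on."""

    def __init__(self):
        super().__init__()
        self.outline = []
        self._depth = None   # depth of the heading being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self._depth = int(tag[1]) - 2
            self._text = []

    def handle_data(self, data):
        if self._depth is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in HEADINGS and self._depth is not None:
            self.outline.append((self._depth, ''.join(self._text).strip()))
            self._depth = None


parser = HeadingOutline()
parser.feed('<article><h2>Intro</h2><p>Text</p>'
            '<h3>Details</h3><p>More</p><h2>Summary</h2></article>')

# Print the outline, indenting each heading by its depth.
for depth, text in parser.outline:
    print('  ' * depth + text)
```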

Social media content

Content from social media platforms (Twitter, etc.) will be rendered properly if the correct JavaScript files from the platform are included. The currently supported platforms and the script to include for each are as follows:

Platform

Script file

Twitter

<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

Instagram

<script async src="//www.instagram.com/embed.js"></script>

Facebook

<div id="fb-root"></div>
<script async defer src="https://connect.facebook.net/en_US/sdk.js#xfbml=1&version=v3.2"></script>
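
One way to use the scripts above is to wrap articleBodyHtml in a small page that includes them. The following is a sketch under stated assumptions: the helper name build_page, the page skeleton, and the placement of the scripts after the article are illustrative choices, while the script snippets themselves are copied from the table above.

```python
# Script snippets copied from the table of supported platforms.
PLATFORM_SCRIPTS = {
    'twitter': ('<script async src="https://platform.twitter.com/widgets.js" '
                'charset="utf-8"></script>'),
    'instagram': '<script async src="//www.instagram.com/embed.js"></script>',
    'facebook': ('<div id="fb-root"></div>\n'
                 '<script async defer src="https://connect.facebook.net/'
                 'en_US/sdk.js#xfbml=1&version=v3.2"></script>'),
}


def build_page(article_body_html, platforms=(), css=''):
    """Wrap articleBodyHtml in a minimal page with the chosen platform scripts."""
    scripts = '\n'.join(PLATFORM_SCRIPTS[name] for name in platforms)
    return (
        '<!DOCTYPE html>\n<html>\n<head>\n<meta charset="utf-8">\n'
        f'<style>{css}</style>\n</head>\n<body>\n'
        f'{article_body_html}\n{scripts}\n</body>\n</html>'
    )


page = build_page('<article><p>Article body ...</p></article>',
                  platforms=['twitter'],
                  css='article { max-width: 40em; }')
print(page)
```

The css argument is where your own styles for the normalized tags would go.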

Example articleBodyHtml response

<article>

<p>The range of use cases for web data extraction is rapidly increasing and with it the necessary investment. Plus the number of websites continues to grow rapidly and is expected to exceed 2 billion by 2020.</p>

<p>Presented by <a href="https://scrapinghub.com/">Scrapinghub</a>, the first Web Data Extraction Summit will be held in Dublin, Ireland on 17th September 2019. This is the first-ever event dedicated to web data and extraction and will be graced by over 100 CEOs, Founders, Data Scientists and Engineers.</p>

<figure><iframe src="https://play.vidyard.com/7hJbbWtiNgipRiYHhTCDf6?v=4.2.13&amp;viral_sharing=0&amp;embed_button=0&amp;hide_playlist=1&amp;color=FFFFFF&amp;playlist_color=FFFFFF&amp;play_button_color=2A2A2A&amp;gdpr_enabled=1&amp;type=inline&amp;new_player_ui=1&amp;vydata%5Butk%5D=d057931dfb8520abe024ef4b2f68d0ad&amp;vydata%5Bportal_id%5D=4367560&amp;vydata%5Bcontent_type%5D=blog-post&amp;vydata%5Bcanonical_url%5D=https%3A%2F%2Fblog.scrapinghub.com%2Fthe-first-web-data-extraction-summit&amp;vydata%5Bpage_id%5D=12510333185&amp;vydata%5Bcontent_page_id%5D=12510333185&amp;vydata%5Blegacy_page_id%5D=12510333185&amp;vydata%5Bcontent_folder_id%5D=null&amp;vydata%5Bcontent_group_id%5D=5623735666&amp;vydata%5Bab_test_id%5D=null&amp;vydata%5Blanguage_code%5D=null&amp;disable_popouts=1" title="Video"></iframe></figure>

<p>With a promising line-up of talks and discussions accompanied by interesting conversations and networking sessions with fellow data enthusiasts, followed by food and drinks at the magnificent Guinness Storehouse, there are no reasons to miss this event. What’s more, we are also giving out free swag! You will get your own Extract Summit T-shirts on the day!</p>

<figure><img src="https://blog.scrapinghub.com/hubfs/Extract-Summit-Emails-images-tee-aug2019-v1.gif" alt="Extract-Summit-Emails-images-tee-aug2019-v1"></figure>

</article>
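
As an illustration of consistent processing, the links and embedded media of a body like the one above can be collected with the standard library. This is a sketch only: the class name LinkCollector and the helper is_absolute are hypothetical, and the input is a shortened excerpt of the example rather than the full response.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags and src values of embedded media."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.media = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'a' and attrs.get('href'):
            self.links.append(attrs['href'])
        elif tag in ('img', 'iframe', 'video', 'audio') and attrs.get('src'):
            self.media.append(attrs['src'])


def is_absolute(url):
    """Return True if the URL has both a scheme and a host."""
    parts = urlparse(url)
    return bool(parts.scheme and parts.netloc)


# Shortened excerpt of the example articleBodyHtml response.
body = ('<article>'
        '<p>Presented by <a href="https://scrapinghub.com/">Scrapinghub</a>.</p>'
        '<figure><img src="https://blog.scrapinghub.com/hubfs/'
        'Extract-Summit-Emails-images-tee-aug2019-v1.gif" '
        'alt="Extract-Summit-Emails-images-tee-aug2019-v1"></figure>'
        '</article>')

collector = LinkCollector()
collector.feed(body)
print(collector.links)
print(collector.media)
print(all(is_absolute(link) for link in collector.links))
```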