Warning

Zyte Automatic Extraction will be discontinued starting April 30th, 2024. It is replaced by Zyte API. See Migrating from Automatic Extraction to Zyte API.

Forum Post Extraction

Forum post extraction supports pages that contain multiple posts from an internet forum thread, that is, a page where a specific topic is discussed. The API supports both “old-school” forums and more modern discussion platforms. The API response contains a list of all posts on the page, including each post's text and publication date.

This supports use cases such as media monitoring, analytics, brand monitoring, mention tracking, and sentiment analysis, among others.

A related page type is Comment Extraction, which supports comments under a single blog post or article.

Request example

If you request forum post extraction and the extraction succeeds, the forumPosts field is available in the query result:

from autoextract.sync import request_raw

query = [{
    'url': 'https://example.com/forum-post-page',
    'pageType': 'forumPosts'
}]
results = request_raw(query, api_key='[api key]')
print(results[0]['forumPosts'])
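
Because forumPosts is only present when extraction succeeds, it can be safer to read it defensively rather than index it directly. The following is a minimal sketch under stated assumptions: get_forum_posts is a hypothetical helper, not part of the autoextract client, and sample_results is stand-in data shaped like the list returned by request_raw.

```python
def get_forum_posts(result):
    """Return the list of posts from one query result, or [] if absent.

    `result` is assumed to be a single dict from the list returned by
    request_raw. This helper is illustrative, not part of autoextract.
    """
    forum_posts = result.get("forumPosts") or {}
    return forum_posts.get("posts") or []


# Stand-in results (assumed shape): one with forumPosts, one without.
sample_results = [
    {"forumPosts": {"url": "https://example.com/forum-post-page",
                    "posts": [{"text": "Hello", "probability": 0.9}]}},
    {"query": {"userQuery": {"pageType": "forumPosts"}}},
]
for result in sample_results:
    print(len(get_forum_posts(result)))
```

With this approach, a result that lacks forumPosts yields an empty list instead of raising KeyError.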

Available fields

Top-level

The following fields are available for forumPosts:

url: string

URL of the page from which the posts were extracted.

topic: dictionary with name string field

A dictionary with the name of the topic that is discussed on the page. Example:

{"name": "How do you cook rice?"}

posts: list of dictionaries

List of posts; fields are described below.

The url field is required.

Individual posts

Each post in the posts field has the following fields available:

text: string: Text of the post.
datePublished: string: Post date. ISO-formatted with ‘T’ separator, may contain a timezone.
datePublishedRaw: string: Same as datePublished, but before parsing/normalization, i.e. as it appeared on the site.
replyCount: integer: Number of replies received by the post.
upvoteCount: integer: Number of up-votes received by the post.
probability: float: Probability that this is a post.

Posts refer to the topic extracted from the same page.

All fields are optional, except for probability. Fields without a valid value (null or empty array) are excluded from extraction results.
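
Since only probability is guaranteed to be present, code that consumes posts should not assume the other keys exist. The sketch below shows one possible way to handle this: it keeps posts at or above a probability threshold and fills absent optional fields with None. The function name confident_posts, the 0.5 threshold, and the sample posts are assumptions made for illustration.

```python
def confident_posts(posts, min_probability=0.5):
    """Keep posts at or above the threshold; absent optional fields become None.

    The threshold default is an arbitrary illustrative choice.
    """
    optional = ("text", "datePublished", "datePublishedRaw",
                "replyCount", "upvoteCount")
    kept = []
    for post in posts:
        # probability is the only field documented as always present.
        if post["probability"] < min_probability:
            continue
        row = {name: post.get(name) for name in optional}
        row["probability"] = post["probability"]
        kept.append(row)
    return kept


# Illustrative sample data, not real extraction output.
posts = [
    {"text": "Finland is often considered the best for it.",
     "upvoteCount": 12, "probability": 0.95},
    {"text": "Depends on the person", "probability": 0.30},
]
for row in confident_posts(posts):
    print(row["text"], row["upvoteCount"], row["replyCount"])
```

Using dict.get for the optional fields avoids KeyError when a field has been excluded from the results.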

Response example

Below is an example response with all forum post fields present:

[
  {
    "forumPosts": {
      "url": "https://example.com/forum-topic-1",
      "topic": {
        "name": "Which is the best country to work in?"
      },
      "posts": [
        {
          "text": "Finland is often considered the best for it.",
          "datePublished": "2020-01-30T00:00:00",
          "datePublishedRaw": "Jan 30, 2020",
          "upvoteCount": 12,
          "replyCount": 1,
          "probability": 0.95
        },
        {
          "text": "Switzerland has good work life balance.",
          "upvoteCount": 2,
          "probability": 0.80
        },
        {
          "text": "Depends on the person",
          "replyCount": 1,
          "probability": 0.80
        }
      ]
    },
    "webPage": {
      "inLanguages": [
        {"code": "en"},
        {"code": "es"}
      ]
    },
    "query": {
      "id": "1564747029122-9e02a1868d70b7a3",
      "domain": "example.com",
      "userQuery": {
        "pageType": "forumPosts",
        "url": "https://example.com/forum-topic-1"
      }
    },
    "algorithmVersion": "20.8.1"
  }
]
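
As a rough illustration of consuming such a response, the sketch below loads a trimmed copy of the JSON above, converts datePublished to a datetime where present, and sums the up-votes. The helper parse_date is hypothetical. It relies on the standard library's datetime.fromisoformat, which accepts the 'T'-separated form shown here; a trailing "Z" timezone is only accepted from Python 3.11 onward.

```python
import json
from datetime import datetime

# Trimmed copy of the response above; only the fields used here are kept.
raw = """
[{"forumPosts": {
    "url": "https://example.com/forum-topic-1",
    "topic": {"name": "Which is the best country to work in?"},
    "posts": [
      {"text": "Finland is often considered the best for it.",
       "datePublished": "2020-01-30T00:00:00", "upvoteCount": 12,
       "replyCount": 1, "probability": 0.95},
      {"text": "Switzerland has good work life balance.",
       "upvoteCount": 2, "probability": 0.80}
    ]}}]
"""


def parse_date(post):
    """Return datePublished as a datetime, or None when the field is absent."""
    value = post.get("datePublished")
    return datetime.fromisoformat(value) if value else None


forum_posts = json.loads(raw)[0]["forumPosts"]
topic_name = forum_posts.get("topic", {}).get("name")
dates = [parse_date(post) for post in forum_posts["posts"]]
# Posts without upvoteCount contribute 0 to the total.
total_upvotes = sum(post.get("upvoteCount", 0) for post in forum_posts["posts"])
print(topic_name, dates[0], total_upvotes)
```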