AutoExtract Integrations¶
Note
In all of the examples, you will need to replace the string ‘[api key]’ with your unique key.
Using cURL¶
Here is an example of how to query the AutoExtract API for the product page type using cURL:
curl --verbose \
--user [api key]: \
--header 'Content-Type: application/json' \
--data '[{"url": "http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html", "pageType": "product"}]' \
--max-time 605 \
--compressed \
https://autoextract.scrapinghub.com/v1/extract
Using requests Python library¶
Here is a simple example of how to query AutoExtract in Python with the requests library. However, we recommend using the scrapinghub-autoextract client for this.
import requests

response = requests.post(
    'https://autoextract.scrapinghub.com/v1/extract',
    auth=('[api key]', ''),
    json=[{'url': 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html', 'pageType': 'product'}])
results = response.json()
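The parsed response is a list with one item per query. As a hypothetical illustration of how the extracted data might be accessed (the field names below, such as 'product' and 'name', are assumptions based on the product extraction schema, not output produced by this page):

```python
# Hypothetical response item shaped like a product extraction result;
# the field names here are assumptions, not guaranteed by this snippet.
sample_results = [{
    'query': {'userQuery': {
        'url': 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
        'pageType': 'product',
    }},
    'product': {'name': 'A Light in the Attic', 'price': '51.77'},
}]

# Iterate over the results and pull out the extracted product fields.
for item in sample_results:
    product = item.get('product', {})
    print(product.get('name'), product.get('price'))
```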
Using scrapinghub-autoextract¶
If you want to query AutoExtract from the command line or from Python, consider the scrapinghub-autoextract client library, which makes using the API easier.
The package provides a command-line utility, an asyncio-based library, and a simple synchronous wrapper.
Here is an example of how to query the product page type using the client from the command line:
python -m autoextract \
urls.txt \
--api-key [api key] \
--page-type product \
--output res.jl
where urls.txt is a text file with one URL per line, and res.jl is the output JSON-lines file where results will be written.
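In a JSON-lines file, each line is a standalone JSON document, so the output can be processed one result at a time. A minimal sketch of reading such a file back (the tiny sample file written here stands in for real AutoExtract output):

```python
import json

# Each line of a .jl (JSON-lines) file is one JSON document,
# corresponding to one query result.
def read_jsonlines(path):
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Write a tiny sample JSON-lines file and read it back.
with open('res.jl', 'w') as f:
    f.write('{"n": 1}\n{"n": 2}\n')

results = list(read_jsonlines('res.jl'))
```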
If you prefer to use Python, then querying for the product page type synchronously is as simple as follows:
from autoextract.sync import request_raw

query = [{
    'url': 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
    'pageType': 'product'
}]
results = request_raw(query, api_key='[api key]')
where request_raw returns the results as a list of dictionaries.
It is also possible to query AutoExtract asynchronously using an asyncio event loop:
from autoextract.aio import request_raw

async def foo():
    query = [{
        'url': 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
        'pageType': 'product'
    }]
    results = await request_raw(query)
    # ...
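A coroutine like the one above still has to be driven by an event loop, typically with asyncio.run(). The following network-free sketch shows the same pattern; fake_request_raw below is a stand-in of my own for the real request_raw call, so the snippet runs without an API key:

```python
import asyncio

# Stand-in for autoextract.aio.request_raw so this sketch runs offline;
# a real call would be: results = await request_raw(query, api_key='[api key]')
async def fake_request_raw(query):
    return [{'query': q} for q in query]

async def main():
    query = [{
        'url': 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
        'pageType': 'product'
    }]
    return await fake_request_raw(query)

# asyncio.run() creates the event loop, runs the coroutine, and closes the loop.
results = asyncio.run(main())
```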
More detailed information about usage, installation, and the package in general can be found in the scrapinghub-autoextract documentation.
Using scrapy-autoextract¶
If you want to integrate AutoExtract queries into your Scrapy spider, consider scrapy-autoextract. It allows consuming the AutoExtract API through a Scrapy downloader middleware or through Page Object providers.
To learn more about the library, please check the scrapy-autoextract documentation.
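For the middleware-based integration, the setup looks roughly like the following sketch; the setting names and middleware path shown here should be verified against the scrapy-autoextract documentation before use:

```python
# settings.py -- sketch of a middleware-based setup; verify the setting
# names and the middleware path against the scrapy-autoextract documentation.
DOWNLOADER_MIDDLEWARES = {
    'scrapy_autoextract.AutoExtractMiddleware': 543,
}
AUTOEXTRACT_USER = '[api key]'       # your AutoExtract API key
AUTOEXTRACT_PAGE_TYPE = 'product'    # page type to request for matching URLs
```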
Using JavaScript with Node.js¶
Here is an example of how to use AutoExtract in JavaScript with Node.js:
const https = require('https');

const data = JSON.stringify([{
  'url': 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html',
  'pageType': 'product',
}]);

const options = {
  host: 'autoextract.scrapinghub.com',
  path: '/v1/extract',
  headers: {
    'Authorization': 'Basic ' + Buffer.from('[api key]:').toString('base64'),
    'Content-Type': 'application/json',
    // Use the byte length, not the string length, in case of multi-byte characters
    'Content-Length': Buffer.byteLength(data)
  },
  method: 'POST',
};

const req = https.request(options, res => {
  console.log(`statusCode: ${res.statusCode}`)
  res.on('data', d => {
    process.stdout.write(d)
  })
});

req.on('error', error => {
  console.error(error)
});

req.write(data);
req.end();
Using PHP with cURL¶
Here is an example of how to use AutoExtract in PHP with the cURL library:
<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://autoextract.scrapinghub.com/v1/extract');
curl_setopt($ch, CURLOPT_USERPWD, '[api key]:');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, True);
curl_setopt($ch, CURLOPT_TIMEOUT_MS, 605000);
curl_setopt($ch, CURLOPT_POSTFIELDS, '[{"url": "http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html", "pageType": "product"}]');
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: application/json'));
// $output contains the result
$output = curl_exec($ch);
curl_close($ch);
?>
Using Java¶
Here is an example of how to use AutoExtract in Java, requesting a single product extraction,
to be placed in Main.java
file:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Main {

    private static final String API_KEY_ENVIRONMENT_VARIABLE_NAME = "AUTOEXTRACT_API_KEY";
    private static final String AUTOEXTRACT_PAGE_TYPE = "product";
    private static final String AUTOEXTRACT_URL = "https://autoextract.scrapinghub.com/v1/extract";

    public static void main(String[] args) {
        if (args.length == 0) {
            System.err.println("No URL specified");
            System.exit(1);
        }
        try {
            URL url = new URL(args[0]);
            String apiKey = System.getenv(API_KEY_ENVIRONMENT_VARIABLE_NAME);
            if (apiKey == null) {
                System.err.println(String.format(
                    "No API key specified, please set environment variable %s",
                    API_KEY_ENVIRONMENT_VARIABLE_NAME));
                System.exit(1);
            }
            String extractedData = fetchExtractedData(url, apiKey);
            System.out.println(extractedData);
        } catch (MalformedURLException e) {
            System.err.println("Invalid URL");
            System.exit(1);
        } catch (IOException e) {
            System.err.println(String.format("Something went wrong: %s", e.getMessage()));
            System.exit(1);
        }
    }

    private static String fetchExtractedData(URL url, String apiKey) throws IOException {
        URL autoExtractUrl = new URL(AUTOEXTRACT_URL);
        HttpURLConnection connection = (HttpURLConnection) autoExtractUrl.openConnection();
        connection.setRequestMethod("POST");
        connection.setRequestProperty(
            "Authorization",
            String.format("Basic %s", Base64.getEncoder().encodeToString(
                String.format("%s:", apiKey).getBytes(StandardCharsets.UTF_8))));
        connection.setRequestProperty("Content-Type", "application/json");
        String payload = String.format(
            "[{\"url\": \"%s\", \"pageType\": \"%s\"}]", url, AUTOEXTRACT_PAGE_TYPE);
        connection.setDoOutput(true);
        // Use the byte length of the payload, not the character count
        byte[] payloadBytes = payload.getBytes(StandardCharsets.UTF_8);
        connection.setRequestProperty("Content-Length", Integer.toString(payloadBytes.length));
        connection.getOutputStream().write(payloadBytes);
        StringBuilder response = new StringBuilder();
        BufferedReader in = new BufferedReader(new InputStreamReader(connection.getInputStream()));
        String inLine;
        while ((inLine = in.readLine()) != null) {
            response.append(inLine);
        }
        in.close();
        return response.toString();
    }
}
This needs the API key in the AUTOEXTRACT_API_KEY environment variable; example usage:
$ javac Main.java
$ AUTOEXTRACT_API_KEY="your-key-here" java Main 'http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'