This endpoint extracts the content of an article from a given URL. It can make the request with or without a proxy and can return the raw HTML of the page. It is useful for extracting structured or unstructured content from web pages such as blog posts, articles, and news stories.

Request Body Parameters:
  • url (string, required): The URL of the article or web page whose content should be extracted.

  • Example: "url": "https://www.boulama.com/blog/posts/the-power-of-habit:-what-i-learned-why-you-should-read-it.html"

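  • proxy (boolean or object, optional): Controls whether the request is routed through a proxy. Pass true for automatic proxy selection, an object with a location for location-based selection, or an object with server, username, and password to use your own proxy. If omitted or set to false, the request is made directly.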

  • Example with automatic proxy selection:
"proxy": true
  • Example with location-based proxy selection:
"proxy": {
  "location": "us"
}
  • Example with user-provided proxy details:
"proxy": {
  "server": "198.51.100.1",
  "username": "proxyuser",
  "password": "proxypass"
}
  • raw_html (boolean, optional): If true, the response includes the raw HTML of the page, in addition to or instead of the structured article data. Defaults to false.
  • Example: "raw_html": true
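
The parameters above can be combined into a single JSON request body. The following Python sketch shows one way to assemble it. The helper name build_request_body and its validation rules are hypothetical and not part of the service; the sketch only mirrors the parameter shapes shown in the examples.

```python
import json


def build_request_body(url, proxy=None, raw_html=False):
    """Assemble a request body from the documented parameters.

    Hypothetical helper: the name and the validation rules are
    illustrative and are not defined by the service.
    """
    if not isinstance(url, str) or not url:
        raise ValueError("url is required and must be a non-empty string")

    body = {"url": url}

    # proxy is optional: leave it out to send the request directly.
    if proxy is not None:
        if not isinstance(proxy, (bool, dict)):
            raise TypeError("proxy must be a boolean or an object")
        body["proxy"] = proxy

    # raw_html defaults to false, so it is only sent when enabled.
    if raw_html:
        body["raw_html"] = True

    return body


# Location-based proxy selection combined with raw HTML output.
example_body = build_request_body(
    "https://www.boulama.com/blog/posts/the-power-of-habit:-what-i-learned-why-you-should-read-it.html",
    proxy={"location": "us"},
    raw_html=True,
)
print(json.dumps(example_body, indent=2))
```

The printed JSON can be sent as the request body with any HTTP client.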

Response:

The response contains the extracted article content, such as the title, text, metadata, or raw HTML, depending on the request options.

  • Status Code: 200 (OK) on success.
  • Response Body:
  • If raw_html is true, the response includes the full HTML of the page.
  • If raw_html is false, the response contains the parsed article content (such as title, author, and body text) in a structured format (e.g., JSON).
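
The two response cases above might be handled as in the following Python sketch. It assumes the body has already been decoded from JSON into a dict. The key names "title", "author", "text", and "html" are hypothetical, since the exact response schema is not specified here; adjust them to match the actual response.

```python
def summarize_article(status_code, payload):
    """Return a short summary of a decoded response body.

    Assumptions: payload is a dict, and the keys "title", "author",
    "text", and "html" are hypothetical placeholders for the real schema.
    """
    # 200 (OK) is the documented success status.
    if status_code != 200:
        raise RuntimeError(f"extraction failed with status {status_code}")

    return {
        "title": payload.get("title"),
        "author": payload.get("author"),
        # Missing or null text counts as zero length.
        "text_length": len(payload.get("text") or ""),
        # True only when raw HTML was returned and is non-empty.
        "has_raw_html": bool(payload.get("html")),
    }
```

Using .get for every field keeps the sketch tolerant of responses that omit some fields, for example when only raw HTML is returned.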

Notes:

  • Proxy: If proxy is omitted or set to false, the request is made directly. A proxy may be necessary for restricted or geo-blocked content.
  • Rate Limiting: Frequent requests may be rate limited. Make sure your application handles rate-limit errors gracefully, for example by retrying after a delay.
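
One common way to handle rate limiting gracefully is to retry with exponential backoff, sketched below in Python. This is an illustration under stated assumptions, not the service's prescribed behavior: it assumes rate limiting is signalled with HTTP status 429, which is not stated here, and the function name call_with_backoff and its parameters are hypothetical. The send argument is any zero-argument callable that performs the request and returns a (status_code, body) pair.

```python
import time


def call_with_backoff(send, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a status other than 429.

    Assumption: the service reports rate limiting as HTTP 429.
    send must return a (status_code, body) tuple.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    for attempt in range(max_attempts):
        status, body = send()
        if status != 429:
            return status, body
        # Wait before the next attempt, doubling the delay each time.
        if attempt < max_attempts - 1:
            sleep(base_delay * (2 ** attempt))

    # Every attempt was rate limited; return the last response as-is.
    return status, body
```

Passing sleep as a parameter makes the waiting behavior easy to replace in tests, and the caller still sees the final status if every attempt is rate limited.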