If you can't connect AMPize to your database and you don't have a REST API, you can still scrape and normalize content from an existing website. In this guide, you'll learn how to crawl websites with AMPize.
By plugging in a website as a datasource, you can extract structured items from it. The site is spidered from a starting (seed) URL, and every page matching the extraction template you defined is processed.
You can optionally set crawl jobs to repeat automatically.
During crawls, only new items are extracted. An existing item is updated when a visitor displays it on the AMP site, and it is deleted if its source page no longer exists.
Name of your datasource. Cannot contain spaces, special characters or numbers. The datasource name cannot be modified afterwards.
Starting or seed URL from which your site will be spidered.
URL of a classic detail page of your site, in which a complete article is displayed.
Type of content for this source, used to describe your items through JSON-LD markup (see SEO guide).
User agent used by the web spiders. This can be useful to let spiders access content that is usually blocked, for example by a paywall.
Only URLs containing at least one of the matching strings will be crawled. A page is said to be crawled when it is evaluated for additional links to follow, or for links to be processed.
Only URLs containing at least one of the matching strings will be processed. A web page is said to be processed if it matches the extraction template.
URLs containing at least one of the matching strings will not be crawled.
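The three URL filters above can be sketched as plain substring checks. This is only an illustration of the described behaviour, not AMPize's actual code: the function names are hypothetical, and two points are assumptions the guide does not state (an empty pattern list means "no restriction", and exclusion patterns win over crawl patterns).

```python
def should_crawl(url, crawl_patterns, exclude_patterns):
    """Return True if the URL may be evaluated for further links."""
    # Assumption: an exclusion match always wins over a crawl match.
    if any(pattern in url for pattern in exclude_patterns):
        return False
    # Assumption: an empty pattern list means every URL is allowed.
    return not crawl_patterns or any(pattern in url for pattern in crawl_patterns)


def should_process(url, process_patterns):
    """Return True if the URL may be checked against the extraction template."""
    return not process_patterns or any(pattern in url for pattern in process_patterns)
```

With crawl patterns `["/news/"]` and exclusion patterns `["/tag/"]`, a URL such as `https://example.com/news/tag/sport` would be skipped, while `https://example.com/news/2021/my-article` would be followed.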
If enabled, crawls will respect robots.txt policies. When turned off, robots.txt will be ignored.
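To give an idea of what respecting robots.txt means in practice, here is a minimal sketch based on Python's standard `urllib.robotparser` module. The robots.txt content, the `ExampleBot` user agent and the `respect_robots` flag are all made up for the example; how AMPize evaluates the policies internally is not documented here.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, parsed from a string so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Parse robots.txt rules from a string."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser, user_agent, url, respect_robots=True):
    """Mirror the setting: when it is off, robots.txt is ignored entirely."""
    if not respect_robots:
        return True
    return parser.can_fetch(user_agent, url)
```

Calling `is_allowed(build_parser(ROBOTS_TXT), "ExampleBot", url)` rejects URLs under `/private/` unless `respect_robots` is set to `False`.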
Max pages to process
Limit the number of processed pages
Limit the number of extracted items
The maximum depth the crawl is allowed to reach on your site. The depth of a page is the number of clicks needed to reach it from the starting URL.
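The depth limit can be pictured as a breadth-first traversal that stops following links once the limit is reached. The sketch below works on an in-memory link graph instead of real HTTP requests; the `links` dictionary, the function name and the convention that the starting URL has depth 0 are assumptions made for the illustration.

```python
from collections import deque


def crawl_with_max_depth(links, seed, max_depth):
    """Return (url, depth) pairs in breadth-first order, never deeper than max_depth.

    `links` maps each URL to the list of URLs it links to.
    """
    seen = {seed}
    queue = deque([(seed, 0)])  # assumption: the starting URL is at depth 0
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth == max_depth:
            continue  # links found on this page would exceed the limit
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return visited
```

Because the traversal is breadth-first, each page is recorded with its smallest click distance from the starting URL.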
Target Crawl Concurrency
Average number of requests sent in parallel to your website.
If enabled, crawl jobs will repeat automatically. Each round will fully re-spider the site from the starting URL, and process pages according to your settings.
Repeat frequency (in minutes)
The crawl will repeat every x minutes.
You can choose to be notified by email at the conclusion of each crawl.
The extraction template will help you define the structure of your data. You can add fields to the structure from:
- CSS selectors matching specific elements in the page
- markup vocabularies (JSON-LD or microdata)
To add a field to your structure, just click on an element of the page and select the attribute (text, html, src, href, alt...) you want to be used for this field. You can use several attributes for the same element if you want (for example, for an image you can add an 'image' field for the 'src' attribute and a 'caption' field for the 'alt' attribute).
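The image example above (one element, two attributes, two fields) could look roughly like this outside the interface. The sketch uses Python's standard `html.parser` and matches the first `img` tag by name only, which is a simplified stand-in for a real CSS selector; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class ImageFieldExtractor(HTMLParser):
    """Collect an 'image' field (src) and a 'caption' field (alt) from the first img tag."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # Tag-name matching only: a stand-in for a full CSS selector.
        if tag == "img" and "image" not in self.fields:
            attributes = dict(attrs)
            self.fields["image"] = attributes.get("src")
            self.fields["caption"] = attributes.get("alt")


def extract_image_fields(html):
    parser = ImageFieldExtractor()
    parser.feed(html)
    return parser.fields
```

Feeding it the HTML of a detail page returns a dictionary with the two fields, or an empty dictionary if the page contains no image.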
You can also manually edit the CSS selector if you cannot reach an element using the interface. For each field, the following properties are available:
- Field name (mandatory)
- Format (for 'text' attributes only): 'text' or 'date'. If 'date', the value will be converted to a timestamp
- Required: if checked, items without this field won't be scraped
- Schema.org output property: select the corresponding property for the generated JSON-LD markup (see SEO guide)
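As an illustration of the 'date' format, the following sketch converts a text value to a Unix timestamp. The guide does not say which date formats are recognized, so ISO 8601 input is assumed here, and values without a time zone are treated as UTC; both choices, like the function name, are assumptions.

```python
from datetime import datetime, timezone


def to_timestamp(value):
    """Convert an ISO 8601 date string to a Unix timestamp in seconds."""
    parsed = datetime.fromisoformat(value.strip())
    if parsed.tzinfo is None:
        # Assumption: values without an explicit time zone are in UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())
```

A value that cannot be parsed raises `ValueError`, which a caller could use to treat the field as missing.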
Once added, a field cannot be modified: you must delete it and add it again.
If the page you choose as a template contains JSON-LD or microdata markup, a button labelled 'Add a field from markup' will let you add a field based on one of these properties often used for Article items:
As with fields based on CSS selectors, you must give the field a name, and you can select a schema.org output property if you want it to be used in the generated JSON-LD markup (see SEO guide).
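To show how schema.org output properties relate to the generated JSON-LD markup, here is a small sketch that maps extracted fields to properties and serializes the result. The field names, the mapping and the function name are hypothetical, and the exact markup AMPize generates may differ.

```python
import json

# Hypothetical mapping from field names to schema.org properties, standing in
# for the "Schema.org output property" selected for each field.
FIELD_TO_PROPERTY = {
    "title": "headline",
    "image": "image",
    "published": "datePublished",
}


def build_json_ld(item, content_type="Article", mapping=FIELD_TO_PROPERTY):
    """Serialize an extracted item as a JSON-LD document."""
    document = {"@context": "https://schema.org", "@type": content_type}
    for field, schema_property in mapping.items():
        if item.get(field) is not None:
            document[schema_property] = item[field]
    return json.dumps(document, indent=2)
```

Fields without a mapped property, or without a value, are simply left out of the markup.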
Manage list queries
If you want to display article lists on your AMP site that perfectly match the corresponding list on your source site, you can manage list queries.
Such a list query is defined by:
- the name of your choice
- the page URL from the source site
- the CSS selector that matches the div tag surrounding the list of articles on this page
Once you have defined a list query, you can use it for any of your AMP sites in the Query Builder. To mimic the list of your source site, you just have to apply a filter (yourQueryListName_tag = TRUE) and a sort property (order by: yourQueryListName_order, direction: Ascendant) on your query.
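The filter and sort described above amount to the following logic, shown here on plain Python dictionaries. Only the `_tag` and `_order` suffixes come from this guide; representing items as dictionaries and the function name are assumptions made for the example.

```python
def mimic_source_list(items, query_name):
    """Keep the items tagged by the list query and sort them in ascending order."""
    tag_key = f"{query_name}_tag"
    order_key = f"{query_name}_order"
    tagged = [item for item in items if item.get(tag_key) is True]
    return sorted(tagged, key=lambda item: item[order_key])


# Usage with a hypothetical list query named "homepage".
items = [
    {"title": "Second", "homepage_tag": True, "homepage_order": 2},
    {"title": "Not listed"},
    {"title": "First", "homepage_tag": True, "homepage_order": 1},
]
ordered_titles = [item["title"] for item in mimic_source_list(items, "homepage")]
```

Items that do not appear in the source list carry no tag and are therefore filtered out before sorting.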
For each site source, the following actions are available:
To run/stop the crawl manually.
Crawl specific URLs
If you want an item to be extracted without having to wait for the next round, you can crawl specific URLs.
To delete all the items already extracted. The settings of your source and the extraction template are unaffected.
To delete the whole source (items, settings, extraction template).