A Node-RED node that implements scrape-it functionality.
To install, use the Manage Palette option in the Node-RED editor menu.
A node that scrapes the HTML in msg.payload into a JSON object. The transformation is defined by the mapping property.
You can either define the mapping as a JSON object directly in the node, or pass it to the scrape node as the msg.mapping property.
If you need to specify functions (e.g. for the how or convert fields), you have to pass the mapping via the input, since JSON cannot contain functions.
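The limitation can be seen in plain JavaScript: serializing a mapping to JSON silently drops function-valued properties. A minimal sketch, with an illustrative mapping and variable names that are not part of the node itself:

```javascript
// Illustrative mapping with a function-valued `convert` field.
const mapping = {
  createdAt: {
    selector: '.date',
    convert: x => new Date(x)
  }
};

// JSON.stringify omits properties whose value is a function,
// so the function is gone after a round trip through JSON.
const roundTripped = JSON.parse(JSON.stringify(mapping));

console.log(typeof mapping.createdAt.convert);      // "function"
console.log(typeof roundTripped.createdAt.convert); // "undefined"
console.log(roundTripped.createdAt.selector);       // ".date"
```

A mapping that uses how or convert as functions therefore has to reach the node as a live object on the message rather than as JSON text.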
The mapping property is an object containing the scraping information.
If you want to scrape a list, you have to use the listItem selector:
- listItem (String): The list item selector.
- data (Object): The fields to include in the list objects:
  - <fieldName> (Object|String): The selector or an object containing:
    - selector (String): The selector.
    - convert (Function): An optional function to change the value.
    - how (Function|String): A function or function name to access the value.
    - attr (String): If provided, the value will be taken based on the attribute name.
    - trim (Boolean): If false, the value will not be trimmed (default: true).
    - closest (String): If provided, returns the first ancestor of the given element.
    - eq (Number): If provided, it will select the nth element.
    - texteq (Number): If provided, it will select the nth direct text child. Deep text child selection is not possible yet. Overwrites the how key.
    - listItem (Object): An object, keeping the recursive schema of the listItem object. This can be used to create nested lists.
For the format of the selector, please refer to the Selectors section of the Cheerio library documentation.
{
  "title": ".header h1",
  "desc": ".header h2",
  "avatar": {
    "selector": ".header img",
    "attr": "src"
  }
}
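The mapping above uses only plain selectors and attr. As a rough sketch of how some of the other documented options might be combined, the following object adds eq, trim, and a nested list. All selectors and field names here are hypothetical and need to be adapted to the page being scraped. Because it contains no functions, it can be serialized and pasted into the node as JSON.

```javascript
// Hypothetical mapping; every selector below is a placeholder.
const mapping = {
  // Plain selector string.
  title: '.header h1',
  // Take the value from an attribute instead of the element text.
  avatar: { selector: '.header img', attr: 'src' },
  // Pick one element out of several matches by index.
  secondLink: { selector: '.nav a', eq: 1 },
  // Keep surrounding whitespace instead of trimming it.
  rawIntro: { selector: '.intro', trim: false },
  // A list with a nested list inside each item.
  articles: {
    listItem: '.article',
    data: {
      title: 'a.article-title',
      tags: { listItem: '.tags > span' }
    }
  }
};

// No functions are involved, so the mapping is valid JSON.
const asJson = JSON.stringify(mapping, null, 2);
console.log(asJson);
```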
First, extend your flow with a function node directly in front of the scrape node.
Inside the function node, add a mapping property that contains your mapping (functions work as well), and leave the JSON mapping in the scrape node empty:
return {
  ...msg, // leave the message untouched
  mapping: { // and add the mapping
    articles: {
      listItem: '.article',
      data: {
        // Get the article date and convert it into a Date object
        createdAt: {
          selector: '.date',
          convert: x => new Date(x)
        },
        // Get the title
        title: 'a.article-title',
        // Nested list
        tags: {
          listItem: '.tags > span'
        }
      }
    }
  }
};
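To check the function node body outside of Node-RED, it can be wrapped in an ordinary function and called with a sample message. This is only a sketch: addMapping and the sample message are hypothetical names introduced for illustration, and no scraping takes place here.

```javascript
// Hypothetical wrapper around the function node body shown above.
function addMapping(msg) {
  return {
    ...msg, // copy all existing message properties
    mapping: {
      articles: {
        listItem: '.article',
        data: {
          createdAt: { selector: '.date', convert: x => new Date(x) },
          title: 'a.article-title',
          tags: { listItem: '.tags > span' }
        }
      }
    }
  };
}

// Sample incoming message (made up for this check).
const incoming = { payload: '<div class="article"></div>', topic: 'news' };
const outgoing = addMapping(incoming);

// Existing properties are preserved and the original message is not modified.
console.log(outgoing.topic);        // "news"
console.log('mapping' in incoming); // false

// The convert function is still a live function on the outgoing message.
const convert = outgoing.mapping.articles.data.createdAt.convert;
console.log(convert('2020-01-02').toISOString()); // "2020-01-02T00:00:00.000Z"
```

The spread creates a shallow copy, so payload and any other properties arrive at the scrape node unchanged alongside the new mapping property.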