Extract Site Metadata

Cleans and extracts a web resource's metadata.

Metadata extraction fields currently supported:

Name	Data Type
author	array (jsonb)
canonical_url	string
copyright	string
date (publish date)	date
description	text
favicon	text
image (primary/og image)	text
jsonld (structured data)	object (jsonb)
keywords	array (jsonb)
lang	string
locale	string
origin	string
publisher	string
site_name	string
tags	array (jsonb)
title	string
type	string
truncated_text	text
status	string
videos	array (jsonb)
links	array (jsonb)
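
As a rough illustration of how the field list above might be used in code, the sketch below copies the names and data types from the table into a lookup object and filters an arbitrary object down to those fields. `SUPPORTED_FIELDS` and `pickSupportedFields` are hypothetical names introduced here; they are not part of the package.

```javascript
// Field names and data types copied from the table above.
// This lookup is illustrative only and is not exported by the package.
const SUPPORTED_FIELDS = {
  author: 'array (jsonb)',
  canonical_url: 'string',
  copyright: 'string',
  date: 'date',
  description: 'text',
  favicon: 'text',
  image: 'text',
  jsonld: 'object (jsonb)',
  keywords: 'array (jsonb)',
  lang: 'string',
  locale: 'string',
  origin: 'string',
  publisher: 'string',
  site_name: 'string',
  tags: 'array (jsonb)',
  title: 'string',
  type: 'string',
  truncated_text: 'text',
  status: 'string',
  videos: 'array (jsonb)',
  links: 'array (jsonb)'
};

// Hypothetical helper: keep only the keys that appear in the table.
const pickSupportedFields = (metadata) =>
  Object.fromEntries(
    Object.entries(metadata).filter(([key]) =>
      Object.prototype.hasOwnProperty.call(SUPPORTED_FIELDS, key)
    )
  );
```

A helper like this could be handy before persisting results, since it drops any keys that are not in the documented list.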

Install

NPM:

$ npm install extract-site-metadata --save

Yarn:

$ yarn add extract-site-metadata

Usage

Pass the raw markup of a webpage to get the extracted metadata fields.

From an .html file:

import fs from 'fs';
import path from 'path';
import extractSiteMetadata from 'extract-site-metadata';

const getMetadataFromFile = (filename) => {
  const filepath = path.resolve(__dirname, `../data/${filename}.html`);
  const markup = fs.readFileSync(filepath).toString();
  // the second argument is the site's origin; feel free to use localhost for testing
  const metadata = extractSiteMetadata(markup, 'YOUR_SITE_ORIGIN_HERE');
  return metadata;
};

getMetadataFromFile('example');

From a server request:

import axios from 'axios';
import extractSiteMetadata from 'extract-site-metadata';

const processSite = async (url) => {
  return axios.get(url)
    .then(res => {
      const { headers } = res;
      const contentType = headers['content-type'];
      if (contentType.includes('text/html')) {
        return {
          body: res.data,
          url
        };
      }
    })
    .catch(err => {
      console.log(err);
    });
};

processSite('https://www.cnbc.com/guide/personal-finance-101-the-complete-guide-to-managing-your-money/')
  .then((data) => {
    // data is undefined when the response is not HTML or the request failed
    ...
  });
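
The request example stops short of the extraction step. One way to bridge the two examples, sketched under assumptions: the helper below takes the `{ body, url }` object resolved by `processSite` and hands the markup to an extractor function. `extractFromResponse` is a hypothetical name, and the `(markup, origin)` call signature is assumed from the file-based example rather than confirmed by the package documentation. The extractor is passed in as an argument so the sketch does not depend on how the package is imported.

```javascript
// Hypothetical helper, not part of the package.
// `extract` stands in for the package's default export and is assumed
// to accept (markup, origin), as in the file-based example.
const extractFromResponse = (result, extract) => {
  // processSite resolves to undefined for non-HTML responses
  // and for failed requests, so guard against a missing body.
  if (!result || typeof result.body !== 'string') {
    return null;
  }
  // Derive the origin (scheme plus host) from the requested URL.
  const { origin } = new URL(result.url);
  return extract(result.body, origin);
};

// Possible usage with the functions from the examples above:
//   processSite(url).then((data) => extractFromResponse(data, extractSiteMetadata));
```

Returning `null` for a missing body keeps the caller's handling explicit instead of letting the extractor fail on `undefined` input.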

Development

Run: git clone https://github.com/sc10ntech/extract-site-metadata.git
Change into the project directory and install dependencies: cd extract-site-metadata && npm i

Credits & Disclaimer

extract-site-metadata was inspired by, and tries to be the spiritual successor to, node-unfluff.
