Docs
As a library, SoupJS provides three classes/functions to the user - the Tag
class, the Soup
class, and the scrape()
function.
// Exports
module.exports = {
Tag: Tag,
Soup: Soup,
scrape: scrape
};
The library is designed with a structure similar to the fetch
API, but simplified.
Installation
Obviously, to use the library, you have to install it first. Once you have Node and NPM(which comes with Node when you install it), you can simply run:
npm i soupjs-lib
Code Notation
The documentation below uses a certain code notation you may or may not be familiar with. For variables and properties, the notation for them is <type of variable/property> <variable/property name>
. As for functions, the notation for them is <return type> <function name>(<type of parameter><parameter name> = <parameter's default value, if any>, ...)
.
Under the Hood
To use SoupJS, it might be useful to find out how it works under the hood. Basically, you start by passing the URL of the website you want to scrape to scrape()
. You can also pass in a file by adding an extra parameter to the function, discussed below.
Depending on whether you’re getting an online file or a local file, SoupJS will either use JavaScript’s fetch
API(provided by the filler library node-fetch
) or Node’s fs
(i.e., filesystem) module to read the file.
Once it does that, the function will create a new Soup
object for the contents of the file. This object basically uses the contents of the file to create a new, virtual Document Object Model that you can search through.
To search through the file, Soup
provides two methods. The first method finds the first occurrence of a HTML element matching a specific CSS selector, and the latter method finds all occurrences of elements that match a CSS selector.
When these element(s) are found, a Tag
object is created for each one of them. The Tag
class provides you with information about the element, in a simplified way. Like the Soup
class, it also provides two methods for searching, but within the scope of the object.
scrape()
Soup scrape(string location, string filetype="url")
Main function provided. scrape()
is asynchronous, and takes in two parameters: location
, or the location of the file, and filetype
, which represents where the file is coming from.
The location can either be a URL(https://…), or the path of a local file(./file.html).
By default, the second parameter is “url”, in which case the location is a URL. However, if you would like to scrape a local file, you need to pass in “file”.
Soup
Soup(string content)
The Soup
class takes a string content
and creates a DOM based off of it. It has the following methods:
Tag find(string selector)
array findAll(string selector, int limit=null)
Both of them take a parameter selector
, which represents the CSS selector being used to filter the document. findAll()
, however, takes an extra optional parameter limit
which represents the maximum number of elements to be returned. If the number of elements matching the selector is larger than limit
, the elements returned will be capped at limit
. Otherwise, all elements will be returned. By default, this is set to null
so all elements are returned - for large files, using limit
is suggested so that overflow errors don’t occur.
This class also has a couple of extra properties that can be used.
string htmlContent
object document
string text
These should be generally self-explanatory. htmlContent
is the HTML of content
, and text
is the text of content
. document
stores the actual virtual DOM, meaning you can directly access it if find()
or findAll()
can’t be used for your purposes.
Tag
Tag(Element element)
The Tag
class takes a DOM element, normally passed to it by the Soup
class, and converts it to a easier to read and understand format. Once a Tag
object is created, it will provide some useful properties:
string tag
string text
object attrs
string htmlContent
object document
tag
provides the tag name of the element(e.g., “p” for paragraph tags). text
provides the inner text of the element, while htmlContent
provides the inner HTML of the element.
attrs
contains all the attributes of the element, and is a object, or dictonary, with a key-and-value structure. For multi-valued attributes like classes, the Tag
class will automatically split up the values into a array.
Finally, document
stores a virtual DOM, within the scope of the element. You can access it just like you can with the document
property in the Soup
class, if the methods described below can’t be used for your purposes.
Like the Soup
class, Tag
also provides two methods for filtering elements.
string getAttr(string attr)
Tag find(string selector)
array findAll(string selector, int limit=null)
However, it also provides a extra method, getAttr()
. This method takes the name of an attribute and returns what the element has for it. If the element doesn’t have the provided attribute, it will return undefined
.