Promise Based Scraper in Node.js

Promises/A+ in JavaScript

I love automation. And I love scraping. I ended up writing a quick Node.js script to scrape Magento’s Certification Directory and put together a list of certified Australian and Melburnian Magento developers and specialists.

It was a lot of fun because I tried a new technique based on promises. The promise library I used is Bluebird, which is Promises/A+ compliant.

Without further ado, here is the code:

The code goes through each address in the urls array, and each page it visits returns a promise. A chain of .then() calls walks through each row of developer information, creates a JSON object per developer, and stores it in MongoDB.

If a ‘next page’ button is found, it will go to the next page before visiting the next URL in the urls array.
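The flow described above can be sketched roughly as follows. This is a minimal illustration, not the original script: it uses native promises instead of Bluebird so that it runs without any dependencies, and fakePages, fetchPage, scrapeFrom and scrapeAll are hypothetical names standing in for the real HTTP request and HTML parsing.

```javascript
// Hypothetical in-memory "site": each URL maps to its rows and an optional next-page link.
const fakePages = {
  '/au?page=1': { rows: ['Alice', 'Bob'], next: '/au?page=2' },
  '/au?page=2': { rows: ['Carol'], next: null },
  '/melbourne': { rows: ['Dave'], next: null },
};

// Stand-in for the real HTTP request and HTML parsing; resolves asynchronously.
function fetchPage(url) {
  return new Promise((resolve, reject) => {
    setTimeout(() => {
      const page = fakePages[url];
      if (page) resolve(page);
      else reject(new Error('No such page: ' + url));
    }, 0);
  });
}

// Visit one URL, collect its rows, then follow "next page" links until there are none.
function scrapeFrom(url, collected) {
  return fetchPage(url).then((page) => {
    page.rows.forEach((name) => collected.push({ name: name, source: url }));
    return page.next ? scrapeFrom(page.next, collected) : collected;
  });
}

// Chain the start URLs so each one begins only after the previous one,
// including all of its pagination, has finished.
function scrapeAll(urls) {
  const collected = [];
  return urls
    .reduce((chain, url) => chain.then(() => scrapeFrom(url, collected)), Promise.resolve())
    .then(() => collected);
}

scrapeAll(['/au?page=1', '/melbourne'])
  .then((developers) => console.log(developers.map((d) => d.name).join(', ')))
  .catch((err) => console.error('Scrape failed:', err.message));
// prints "Alice, Bob, Carol, Dave"
```

Because scrapeFrom returns the promise of the next page from inside .then(), the outer chain waits for the whole pagination run before it moves on to the next entry in the array. A single .catch() at the end picks up a failure from any page.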

In my opinion, the coolest part of the code is:

…which pretty much turns the callback-based native MongoDB driver for Node.js into a promise-based one. It runs a bunch of upserts as promises, and only when all of those promises resolve does it move on to the next .then().
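Here is a minimal sketch of that idea: wrap a callback-style API so it returns promises, then wait for a batch of upserts before continuing. It makes a few assumptions: it uses Node's built-in util.promisify rather than Bluebird, and FakeCollection, upsertAsync and saveDevelopers are hypothetical names, with the in-memory FakeCollection standing in for the real MongoDB driver.

```javascript
const { promisify } = require('util');

// Hypothetical stand-in for a callback-style collection; the real driver is not used here.
class FakeCollection {
  constructor() {
    this.docs = new Map();
  }

  // Callback-style upsert: insert the document, or overwrite the one with the same key.
  upsert(key, doc, callback) {
    setImmediate(() => {
      this.docs.set(key, doc);
      callback(null, { key: key });
    });
  }
}

const collection = new FakeCollection();

// promisify turns an (args..., callback) function into one that returns a promise.
const upsertAsync = promisify(collection.upsert.bind(collection));

// Start one upsert per developer and resolve only once all of them have finished.
function saveDevelopers(developers) {
  return Promise.all(developers.map((dev) => upsertAsync(dev.name, dev)));
}

saveDevelopers([
  { name: 'Alice', city: 'Melbourne' },
  { name: 'Bob', city: 'Sydney' },
  { name: 'Alice', city: 'Brisbane' }, // same key, so this updates instead of duplicating
])
  .then((results) => {
    console.log(results.length + ' upserts done, ' + collection.docs.size + ' documents stored');
  })
  .catch((err) => console.error('Save failed:', err.message));
// prints "3 upserts done, 2 documents stored"
```

Returning the Promise.all(...) promise from inside a .then() is what makes the next step in a chain wait for every write to finish.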

You can see the result of this code here or view it as a GitHub Gist.

Comments

  • Ben P

    December 1, 2015 at 7:43 pm

    Francis, this looks very clear. Thanks for posting it.

    I’m wondering what your take would be on the problem of the link-following request failing or timing out? No need for code, of course, but would you mind sharing how you would probably modify the scraper to handle this?

    Thanks again!

  • Francis Kim

    December 1, 2015 at 8:19 pm

    Hi Ben,

    I haven’t deliberately made it fail to test this, but .catch() should catch any errors, and you could also reject the promise when the HTTP call errors.

  • Vince

    May 3, 2017 at 3:21 pm

    doesnt pull anything into the DB
