Build a Data Mining Automation Bot Using Node.js & Your Browser

The ingredients:

  • Linux server running Node.js + Express (might work on Windows?)
  • MySQL installed
  • Chrome with Tampermonkey (alternatively Firefox and Greasemonkey)

You might wonder, why MySQL? I just happened to be more used to it at the time I coded this bot. If I were to do this again, I would no doubt use MongoDB for this exercise.

Fun automation with Node.js and your browser!

At the end of this exercise, you will have your very own ‘bot’, capable of collecting any kind of information you want from the internet and storing it in your own database. Sounds like a great way to get sued for copyright infringement! Let’s get started!

Below is the Express app. Save it as express.js and start it with nodejs express.js (or node express.js, depending on how Node.js is installed). Replace the connection username, password and database with your own.
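
Here is a minimal sketch of what such an app could look like; it is not the original listing. It assumes the express and mysql npm packages, a quotes table with quote and author columns, a POST /quotes route on port 3000, and credentials passed in through QUOTES_DB_USER and QUOTES_DB_PASSWORD environment variables. All of those names, as well as the toRow and createApp helpers, are placeholders for illustration.

```javascript
// express.js: receives scraped quotes and stores them in MySQL (sketch).
// Assumed setup: `npm install express mysql`, plus a table such as
//   CREATE TABLE quotes (id INT AUTO_INCREMENT PRIMARY KEY,
//                        quote TEXT NOT NULL, author VARCHAR(255) NOT NULL);

// Turn a request body into a [quote, author] row, or null if it is unusable.
function toRow(body) {
  if (!body || typeof body.quote !== 'string' || typeof body.author !== 'string') {
    return null;
  }
  const quote = body.quote.trim();
  const author = body.author.trim();
  return quote && author ? [quote, author] : null;
}

// Build the Express app around any object with a mysql-style query() method.
function createApp(db) {
  const express = require('express');
  const app = express();
  app.use(express.json()); // parse JSON request bodies (Express 4.16+)

  // Hypothetical route: POST /quotes with {"quote": "...", "author": "..."}.
  app.post('/quotes', (req, res) => {
    const row = toRow(req.body);
    if (!row) {
      return res.status(400).json({ error: 'quote and author are required' });
    }
    // The ? placeholders let the driver escape the values.
    db.query('INSERT INTO quotes (quote, author) VALUES (?, ?)', row, (err) => {
      if (err) return res.status(500).json({ error: 'database error' });
      res.status(201).json({ saved: true });
    });
  });
  return app;
}

// Start listening only when database credentials are provided.
if (process.env.QUOTES_DB_USER) {
  const mysql = require('mysql');
  const pool = mysql.createPool({
    host: 'localhost',
    user: process.env.QUOTES_DB_USER,
    password: process.env.QUOTES_DB_PASSWORD,
    database: 'quotes',
  });
  createApp(pool).listen(3000, () => console.log('Listening on port 3000'));
}
```

With the variables set, something like QUOTES_DB_USER=me QUOTES_DB_PASSWORD=secret nodejs express.js starts the server. Without them the file loads but does not listen, which makes toRow easy to try out on its own.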

Can you guess what this app will do? You can see the database is called quotes. This app will collect famous quotes by famous people from a quotes site! You can’t really call it stealing, as these quotes aren’t the property of anybody or any site 😉 But do bear in mind that this is for educational purposes only.

Note that there is no authentication/authorisation in this code.

Now for the browser automation!
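
A userscript along these lines would do the job. Treat it as a sketch under assumptions rather than the script I actually ran: the CSS selectors, the http://localhost:3000/quotes endpoint and the helper names extractQuotes and sendQuote are all hypothetical, and the @match pattern uses the placeholder domain.

```javascript
// ==UserScript==
// @name         Quote collector (sketch)
// @match        http://foobarquotes.com/*
// @grant        GM_xmlhttpRequest
// @connect      localhost
// ==/UserScript==

// Hypothetical selectors: adjust them to the markup of the site you are reading.
const QUOTE_SELECTOR = '.quote';
const TEXT_SELECTOR = '.text';
const AUTHOR_SELECTOR = '.author';
const ENDPOINT = 'http://localhost:3000/quotes'; // assumed Express endpoint

// Collect {quote, author} pairs from anything that offers querySelectorAll.
function extractQuotes(root) {
  const results = [];
  for (const el of root.querySelectorAll(QUOTE_SELECTOR)) {
    const text = el.querySelector(TEXT_SELECTOR);
    const author = el.querySelector(AUTHOR_SELECTOR);
    // Skip entries that are missing either part.
    if (text && author) {
      results.push({
        quote: text.textContent.trim(),
        author: author.textContent.trim(),
      });
    }
  }
  return results;
}

// POST one quote as JSON. GM_xmlhttpRequest is not bound by the same-origin
// policy, so the page's domain and the endpoint's domain may differ.
function sendQuote(item) {
  GM_xmlhttpRequest({
    method: 'POST',
    url: ENDPOINT,
    headers: { 'Content-Type': 'application/json' },
    data: JSON.stringify(item),
    onerror: () => console.error('Could not send quote', item),
  });
}

// Only run inside a browser page where the userscript API is available.
if (typeof document !== 'undefined' && typeof GM_xmlhttpRequest === 'function') {
  extractQuotes(document).forEach(sendQuote);
}
```

Keeping extractQuotes separate from the sending step means you can paste it into the browser console and check the selectors before any request goes out.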

I have obscured the actual domain name I used for obvious reasons. The code above automates Chrome to parse through foobarquotes.com’s quotes and sends them to your Express endpoint powered by Node.js, which will in turn save each quote to your MySQL database 🙂 Check it out.

You can see at the end, there is a bit of ‘humanness’ added: the browser stays on a page for a random 35-65 seconds, then moves on to the next page to collect another set of quotes. I was able to collect over 20,000 quotes this way with minimal effort, but I must admit I have not taken this project any further. It was fun regardless. Enjoy coding!
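
The random wait can be sketched roughly like this. The 35 and 65 second bounds come from the description above; the a.next selector and the function names humanDelayMs and goToNextPageLater are hypothetical.

```javascript
// Pick a random wait between minSeconds and maxSeconds, in milliseconds.
// `random` is injectable so the function can be checked with fixed values.
function humanDelayMs(minSeconds = 35, maxSeconds = 65, random = Math.random) {
  return Math.round((minSeconds + random() * (maxSeconds - minSeconds)) * 1000);
}

// Wait a human-looking while, then follow the "next page" link if there is one.
// Returns false when no link is found, which is where the crawl stops.
function goToNextPageLater(doc, schedule = setTimeout) {
  const next = doc.querySelector('a.next'); // hypothetical selector
  if (!next) return false;
  schedule(() => { doc.location.href = next.href; }, humanDelayMs());
  return true;
}

// Only run inside a browser page.
if (typeof document !== 'undefined') {
  goToNextPageLater(document);
}
```

Because the userscript runs again on every page load, scheduling a single navigation per page is enough to keep the crawl going until the last page.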

Edit: Just got a comment on Facebook asking why I didn’t do it all in Node.js – in retrospect, I could have – using something like Cheerio but it won’t be as cool or unique as this idea 😛 I might do another post on how to do a similar thing with just Node.js + MongoDB later.

Edit: MySQL apparently now supports JSON so this in fact is completely feasible.

As of MySQL 5.7.8, MySQL supports a native JSON data type

Comments

  • JOO LEE

    July 2, 2015 at 1:51 pm

    Nice post! I see lots of potential of this project in the finance context.
    Usefulness might be exponentially multiplied if this gets linked up with some kind of text analytics engine that can interpret the scraped text into trading signals. Thanks heaps.

  • franciskim

    July 3, 2015 at 10:21 am

    ‘Usefulness’ is a relative term 🙂 But yes, I agree with you – I’ve also got another project which involves a lot of server-to-server scraping and text analysis including sentiment. These types of things would be good for automated trades, it’s just a matter of application.

  • franciskim

    July 3, 2015 at 10:22 am

    and a big thank you for the first comment on my blog 🙂

  • eldy

    April 7, 2016 at 6:00 pm

    why would u call this a bot? this is just normal scraping, isn’t?

  • Francis Kim

    April 7, 2016 at 6:05 pm

    I guess so, but it is kind of unique in a sense that it uses Greasemonkey/Tampermonkey to automate the browser. No real emphasis on the word bot here.

  • eldy

    April 7, 2016 at 6:02 pm

    Using mysql in this super simple project is the right choice, since you don’t have to declare models and schema like in mongodb!

  • Jon A

    August 9, 2016 at 6:02 am

    Hi Frances, do you have a blog that has similar posts like this incredible share?

  • Francis Kim

    August 10, 2016 at 7:03 pm

    Hi Jon,

    No – not that I know of. But please do let me know if you come across one!

    Francis

  • Richard Maurer

    October 30, 2016 at 9:00 pm

    Had a great laugh when you said that the bot is a great way to get sued. However, great inputs to get started with my own bot – thanks! Why would you prefer MongoDB over MySQL – just curious.

  • Francis Kim

    October 30, 2016 at 10:32 pm

    Hi Richard, just because it feels more native to JavaScript. But MySQL does support JSON now as well.

  • great site

    December 16, 2016 at 5:58 pm

    It’s a remarkable post for all the web visitors; they will get advantage from it I am sure.

  • Amir

    February 3, 2017 at 7:57 pm

    Hi. Thank you so much for this amazing post. I wonder if it works on Telegram or not.

  • Francis Kim

    February 5, 2017 at 12:50 am

    Hey Amir, you can do anything you’d like with code 🙂
