Build a Data Mining Automation Bot Using Node.js & Your Browser

The ingredients:

  • Linux server running Node.js + Express (might work on Windows?)
  • MySQL installed
  • Chrome with Tampermonkey (alternatively Firefox and Greasemonkey)

You might wonder, why MySQL? I just happened to be more used to it at the time I coded this bot. If I were to do this again, I would no doubt use MongoDB for this exercise.

Fun automation with Node.js and your browser!

At the end of this exercise, you will have your very own ‘bot’, capable of collecting any kind of information you want from the internet, into your own database. Sounds like a great way to get sued for Copyright infringement! Let’s get started!

Below is the Express app. You can save it as express.js and start it with nodejs express.js. Replace the connection username, password and database with your own.

Can you guess what this app will do? You can see the database is called quotes. This app will collect famous quotes from famous people! (off a quotes site). You can’t really call it stealing, as these quotes aren’t the property of anybody or any site 😉 But do bear in mind that this is for educational purposes only.

Note that there is no authentication/authorisation in this code.
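
For a rough idea of what the database side of such an endpoint might involve, here is a minimal sketch. It assumes a table called quotes with quote and author columns (my guess, the post only names the database), and toInsert is a made-up helper name. The ? placeholders follow the style accepted by the mysql npm package's connection.query(sql, values, callback).

```javascript
// Hypothetical helper: validates a posted quote and builds a parameterised INSERT.
// Table and column names (quotes, quote, author) are assumptions, not from the post.
function toInsert(body) {
  const b = body || {};
  const quote = typeof b.quote === 'string' ? b.quote.trim() : '';
  const author = typeof b.author === 'string' ? b.author.trim() : '';
  if (!quote || !author) {
    return null; // reject incomplete submissions
  }
  return {
    sql: 'INSERT INTO quotes (quote, author) VALUES (?, ?)',
    values: [quote, author],
  };
}

// Inside an Express route handler this could be used roughly as:
//   const stmt = toInsert(req.body);
//   if (!stmt) return res.status(400).end();
//   connection.query(stmt.sql, stmt.values, (err) => res.status(err ? 500 : 200).end());
```

Passing the values separately, rather than concatenating them into the SQL string, lets the driver handle escaping of whatever text the browser sends.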

Now for the browser automation!

I have obscured the actual domain name I used for obvious reasons. The code above will automate your Chrome to parse through foobarquotes.com’s quotes and send them to your Express endpoint powered by Node.js, which will in turn save each quote to your MySQL database 🙂 Check it out.
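
For a rough idea of what the scraping step in a userscript like this might involve, here is a minimal sketch. It assumes each quote sits in an element matching .quote with .text and .author children. These selectors are placeholders, since the real site's markup is not shown. collectQuotes is a made-up name, and the endpoint URL in the comment is an assumption too.

```javascript
// Hypothetical selectors: '.quote', '.text' and '.author' are placeholders,
// since the real site's markup is not shown in the post.
function collectQuotes(doc) {
  const found = [];
  for (const node of doc.querySelectorAll('.quote')) {
    const text = node.querySelector('.text');
    const author = node.querySelector('.author');
    if (text && author) {
      found.push({
        quote: text.textContent.trim(),
        author: author.textContent.trim(),
      });
    }
  }
  return found;
}

// In a Tampermonkey script (with "@grant GM_xmlhttpRequest" in the header)
// this might be wired up as follows; the endpoint URL is an assumption:
//   for (const q of collectQuotes(document)) {
//     GM_xmlhttpRequest({
//       method: 'POST',
//       url: 'http://localhost:3000/quotes',
//       headers: { 'Content-Type': 'application/json' },
//       data: JSON.stringify(q),
//     });
//   }
```

Taking the document as a parameter keeps the function easy to try out against a fake object outside the browser.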

You can see at the end, there is a bit of ‘humanness’ added, where the browser will stay on a page for 35-65 seconds (random) then move on to the next page to collect another set of quotes. I was able to collect over 20,000 quotes this way with minimal effort, but I must admit I have not taken this project any further. It was fun regardless. Enjoy coding!
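
The random 35-65 second wait could be sketched along these lines. This is only an illustration: randomDelayMs and nextUrl are made-up names, and the whole-second granularity is my assumption.

```javascript
// Returns a whole number of seconds between minSec and maxSec (inclusive),
// expressed in milliseconds. `rng` defaults to Math.random and is injectable
// so the function can be tested with a fixed value.
function randomDelayMs(minSec, maxSec, rng = Math.random) {
  const seconds = minSec + Math.floor(rng() * (maxSec - minSec + 1));
  return seconds * 1000;
}

// In the userscript, moving on to the next page could then look like
// (nextUrl is hypothetical):
//   setTimeout(() => { window.location.href = nextUrl; }, randomDelayMs(35, 65));
```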

Edit: Just got a comment on Facebook asking why I didn’t do it all in Node.js – in retrospect, I could have – using something like Cheerio but it won’t be as cool or unique as this idea 😛 I might do another post on how to do a similar thing with just Node.js + MongoDB later.

Edit: MySQL apparently now supports JSON so this in fact is completely feasible.

As of MySQL 5.7.8, MySQL supports a native JSON data type

Comments

  • JOO LEE

    July 2, 2015 at 1:51 pm

    Nice post! I see lots of potential of this project in the finance context.
    Usefulness might be exponentially multiplied if this gets linked up with some kind of text analytics engine that can interpret the scraped text into trading signals. Thanks heaps.

  • franciskim

    July 3, 2015 at 10:21 am

    ‘Usefulness’ is a relative term 🙂 But yes, I agree with you – I’ve also got another project which involves a lot of server-to-server scraping and text analysis including sentiment. These types of things would be good for automated trades, it’s just a matter of application.

  • franciskim

    July 3, 2015 at 10:22 am

    and a big thank you for the first comment on my blog 🙂

  • eldy

    April 7, 2016 at 6:00 pm

    why would u call this a bot? this is just normal scraping, isn’t?

  • Francis Kim

    April 7, 2016 at 6:05 pm

    I guess so, but it is kind of unique in a sense that it uses Greasemonkey/Tampermonkey to automate the browser. No real emphasis on the word bot here.

  • eldy

    April 7, 2016 at 6:02 pm

    Using mysql in this super simple project is the right choice, since you don’t have to declare models and schema like in mongodb!

  • Jon A

    August 9, 2016 at 6:02 am

    Hi Frances, do you have a blog that has similar posts like this incredible share?

  • Francis Kim

    August 10, 2016 at 7:03 pm

    Hi Jon,

    No – not that I know of. But please do let me know if you come across one!

    Francis

  • Richard Maurer

    October 30, 2016 at 9:00 pm

    Had a great laugh when you said that the bot is a great way to get sued. However, great inputs to get started with my own bot – thanks! Why would you prefer MongoDB over MySQL – just curious.

  • Francis Kim

    October 30, 2016 at 10:32 pm

    Hi Richard, just because it feels more native to JavaScript. But MySQL does support JSON now as well.

  • great site

    December 16, 2016 at 5:58 pm

    It’s a remarkable post for all the web visitors; they will get advantage from it I am sure.

  • Amir

    February 3, 2017 at 7:57 pm

    Hi. Thank you so much for this amazing post. I wonder if it works on Telegram or not.

  • Francis Kim

    February 5, 2017 at 12:50 am

    Hey Amir, you can do anything you’d like with code 🙂

  • correspond

    December 8, 2017 at 11:02 am

    Do you have any video of that? I’d like to find out some additional information.
