Build a Data Mining Automation Bot Using Node.js & Your Browser
- Published 23rd Jun 2015
Last edited 10th Jun 2017
The ingredients:
- Linux server running Node.js + Express (might work on Windows?)
- Node modules: mysql, body-parser
- MySQL installed
- Chrome with Tampermonkey (alternatively Firefox and Greasemonkey)
You might wonder: why MySQL? I just happened to be more familiar with it at the time I coded this bot. If I were to do this again, I would no doubt use MongoDB for this exercise.

Fun automation with Node.js and your browser!
At the end of this exercise, you will have your very own 'bot', capable of collecting any kind of information you want from the internet into your own database. Sounds like a great way to get sued for copyright infringement! Let's get started!
Below is the Express app. Save it as express.js and run it with node express.js. Obviously, replace the connection username, password and database with your own.
```javascript
var express = require('express');
var mysql = require('mysql');
var bodyParser = require('body-parser');

var connection = mysql.createConnection({
  host     : 'localhost',
  user     : 'root',
  password : 'password',
  database : 'quotes'
});

var app = express();
var jsonParser = bodyParser.json();
var urlencodedParser = bodyParser.urlencoded({ extended: false });

// Add CORS headers
app.use(function (req, res, next) {
  // Website you wish to allow to connect
  res.setHeader('Access-Control-Allow-Origin', '*');
  // Request methods you wish to allow
  res.setHeader('Access-Control-Allow-Methods', 'GET, POST, OPTIONS, PUT, PATCH, DELETE');
  // Request headers you wish to allow
  res.setHeader('Access-Control-Allow-Headers', 'X-Requested-With,content-type');
  // Set to 'true' if you need the website to include cookies in the requests
  // sent to the API (e.g. in case you use sessions)
  res.setHeader('Access-Control-Allow-Credentials', 'true');
  // Pass to next layer of middleware
  next();
});

connection.connect(function (err) {
  if (!err) {
    console.log('Database is connected ...\n');
  } else {
    console.log('Error connecting to database ...\n');
  }
});

app.post('/insert', urlencodedParser, function (req, res) {
  var quote = {
    quote: req.body.quote,
    author: req.body.author,
    categories: req.body.categories
  };
  console.log(quote);
  connection.query('INSERT INTO quotes SET ?', quote, function (err, result) {
    if (err) {
      console.log(err);
      res.end(JSON.stringify({ status: 500, error: 'insert failed' }));
      return;
    }
    res.end(JSON.stringify({ status: 200, success: result }));
  });
});

app.listen(3000);
```
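The app assumes the quotes table already exists. A minimal schema matching the three fields the endpoint inserts might look like this (the column names come from the code above; the types are my assumption):

```sql
CREATE TABLE quotes (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  quote TEXT,
  author VARCHAR(255),
  categories VARCHAR(255)
);
```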
Can you guess what this app will do? You can see the database is called quotes. This app will collect famous quotes from famous people (off a quotes site)! You can't really call it stealing, as these quotes aren't the property of anybody or any site 🙂 But do bear in mind that this is for educational purposes only.
Note that there is no authentication/authorisation in this code, so anyone who discovers the endpoint can insert rows into your database.
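A minimal sketch of what a shared-secret check could look like. The `x-api-token` header name and the secret are placeholders I made up, and it is written as a plain function so it works as Express middleware but can also run without Express:

```javascript
// Returns middleware that rejects requests lacking the shared secret.
// 'x-api-token' is a made-up header name; pick your own.
function requireToken(secret) {
  return function (req, res, next) {
    if (req.headers && req.headers['x-api-token'] === secret) {
      // token matches: hand off to the next handler
      next();
    } else {
      // reject everything else with a 401
      res.statusCode = 401;
      res.end(JSON.stringify({ status: 401, error: 'unauthorised' }));
    }
  };
}
```

It would slot in as `app.post('/insert', requireToken('change-me'), urlencodedParser, ...)`, with the userscript sending a matching `X-API-Token` header in its AJAX call.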
Now for the browser automation!
```javascript
// ==UserScript==
// @name         foobarquotes.com parsing
// @namespace    http://your.homepage/
// @version      0.1
// @description  enter something useful
// @author       You
// @match        http://www.foobarquotes.com/quotes*
// @grant        none
// @require      https://code.jquery.com/jquery-2.1.3.min.js
// ==/UserScript==

function getRandomInt(min, max) {
  return Math.floor(Math.random() * (max - min + 1)) + min;
}

// wait for the full page, including assets, to load
jQuery(window).on('load', function () {

  // parse through each quote and post it via AJAX
  jQuery('#quotes .quote').each(function () {

    // create the object to send
    var postData = {
      quote: jQuery(this).find('.quote-link').text().trim(),
      author: jQuery(this).find('.quote-author').text().trim(),
      categories: jQuery(this).find('.category').text().trim()
    };

    // post to the Express/Node.js endpoint
    jQuery.ajax({
      url: 'http://[insert your server IP]:3000/insert',
      type: 'POST',
      data: postData,
      async: false,
      success: function (data, textStatus, jqXHR) {},
      error: function (jqXHR, textStatus, errorThrown) {}
    });
  });

  // pause on the page for a random 35-65 seconds before moving to the next page
  setTimeout(function () {
    var link = jQuery('.pagination').find('li.active').next().find('a').attr('href');
    if (link) {
      window.location = link;
    }
  }, getRandomInt(35000, 65000));
});
```
I have obscured the actual domain name I used for obvious reasons. The code above will automate your Chrome to parse through foobarquotes.com's quotes, and it will send them to your Express endpoint powered by Node.js, which will, in turn, save each quote to your MySQL database 🙂 Check it out.

You can see at the end there is a bit of 'humanness' added, where the browser will stay on a page for a random 35-65 seconds before moving on to the next page to collect another set of quotes. I was able to collect over 20,000 quotes this way with minimal effort, though I must admit I haven't taken this project any further. It was fun regardless. Enjoy coding!
Edit: Just got a comment on Facebook asking why I didn't do it all in Node.js. In retrospect, I could have, using something like Cheerio, but it wouldn't be as cool or unique as this idea 🙂 I might do another post on how to do a similar thing with just Node.js + MongoDB later.
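For the curious, the parsing half of that server-side version could look something like this. A real implementation would fetch each page over HTTP and use Cheerio's jQuery-style selectors; in this dependency-free sketch a regex stands in for the selectors, and the markup shape is my own assumption:

```javascript
// Extracts { quote, author } pairs from an HTML string.
// The class names mirror the ones the userscript above targets,
// but the markup here is hypothetical.
function extractQuotes(html) {
  var quotes = [];
  var re = /<span class="quote-link">([\s\S]*?)<\/span>[\s\S]*?<span class="quote-author">([\s\S]*?)<\/span>/g;
  var m;
  while ((m = re.exec(html)) !== null) {
    quotes.push({ quote: m[1].trim(), author: m[2].trim() });
  }
  return quotes;
}
```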
Edit: MySQL apparently now supports JSON, so this is in fact completely feasible:
"As of MySQL 5.7.8, MySQL supports a native JSON data type"
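A quick hypothetical sketch of what that could look like, with the categories stored as a native JSON column (the table and data here are made up):

```sql
-- MySQL 5.7.8+: categories as a native JSON column
CREATE TABLE quotes_json (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  quote TEXT,
  author VARCHAR(255),
  categories JSON
);

INSERT INTO quotes_json (quote, author, categories)
VALUES ('An example quote.', 'Anonymous', '["life", "wisdom"]');

-- find all quotes tagged "wisdom"
SELECT quote, author FROM quotes_json
WHERE JSON_CONTAINS(categories, '"wisdom"');
```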
Comments
JOO LEE
July 2, 2015 at 1:51 pm
Nice post! I see lots of potential for this project in the finance context. Usefulness might be exponentially multiplied if this gets linked up with some kind of text analytics engine that can interpret the scraped text into trading signals. Thanks heaps.
franciskim
July 3, 2015 at 10:21 am
'Usefulness' is a relative term 🙂 But yes, I agree with you. I've also got another project which involves a lot of server-to-server scraping and text analysis, including sentiment. These types of things would be good for automated trades; it's just a matter of application.
franciskim
July 3, 2015 at 10:22 am
And a big thank you for the first comment on my blog 🙂
eldy
April 7, 2016 at 6:00 pm
Why would you call this a bot? This is just normal scraping, isn't it?
Francis Kim
April 7, 2016 at 6:05 pm
I guess so, but it is kind of unique in the sense that it uses Greasemonkey/Tampermonkey to automate the browser. No real emphasis on the word 'bot' here.
eldy
April 7, 2016 at 6:02 pm
Using MySQL in this super simple project is the right choice, since you don't have to declare models and schemas like in MongoDB!
Jon A
August 9, 2016 at 6:02 am
Hi Francis, do you have a blog that has similar posts like this incredible share?
Francis Kim
August 10, 2016 at 7:03 pm
Hi Jon,
No, not that I know of. But please do let me know if you come across one!
Francis
Richard Maurer
October 30, 2016 at 9:00 pm
Had a great laugh when you said that the bot is a great way to get sued. However, great input to get started with my own bot, thanks! Why would you prefer MongoDB over MySQL? Just curious.
Francis Kim
October 30, 2016 at 10:32 pm
Hi Richard, just because it feels more native to JavaScript. But MySQL does support JSON now as well.
Amir
February 3, 2017 at 7:57 pm
Hi. Thank you so much for this amazing post. I wonder if it works on Telegram or not.
Francis Kim
February 5, 2017 at 12:50 am
Hey Amir, you can do anything you'd like with code 🙂