I Don’t Need No Stinking API – Web Scraping in 2016 and Beyond
- Published 24th Aug 2016
Last edited 10th Jun 2017
Social media APIs and their rate limits have not been nice to me recently, especially Instagram. Who needs it anyway?
Sites are getting increasingly smart about detecting scraping / data mining attempts. AngelList even detects PhantomJS (I haven't seen other sites do this). But if you're automating the exact actions a human would take in a browser, can this really be blocked?
First off, in terms of concurrency or the amount of horsepower you get for your hard-earned $$$ – Selenium sucks. It's simply not built for what you'd consider 'scraping'. But with sites being built with more and more smarts these days, the only truly reliable way to mine data off the internets is to use browser automation.
My stack is pretty much all JavaScript. There goes a few readers 😑😆 – WebdriverIO, Node.js and a bunch of NPM packages, including the likes of antigate (thanks to Troy Hunt – Breaking CAPTCHA with automated humans) – but I'm sure most of my techniques can be applied to any flavour of the Selenium 2 driver. It just happens that I find JavaScript optimal for browser automation.
Purpose of This Post
I’m going to share everything that I’ve learnt to date from my recent love affair with Selenium automation/scraping/crawling. The purpose of this post is to illustrate some of the techniques I’ve created which I haven’t seen published anywhere else – as a broader, applicable idea to be shared around and discussed by the webdev community. Also, I need that Kobe Bryant money.
Scraping? Isn’t it Illegal?
Taken from my Hacker News comment
On the ethics side, I don’t scrape large amounts of data – eg. giving clients lead gen (x leads for y dollars) – in fact, I have never done a scraping job and don’t intend to do those jobs for profit.
For me it’s purely for personal use and my little side projects. I don’t even like the word scraping because it comes loaded with so many negative connotations (which sparked this whole comment thread) – and for good reason: it’s reflective of the demand in the market. People want cheap leads to spam, and that’s a bad use of technology.
Generally I tend to focus more on words and phrases like ‘automation’ and ‘scripting a bot’. I’m just automating my life – writing a bot to replace what I would otherwise have to do daily, like looking on Facebook for gifs and videos then manually posting them to my site. Would I spend an hour each and every day doing this? No, I’m far lazier than that.
Who is anyone to tell me what I can and can’t automate in my life?
Let’s get to it.
Faking Human Delays
It’s definitely good practice to add these human-like, random pauses in some places just to be extra safe:
```javascript
const getRandomInt = (min, max) => {
  return Math.floor(Math.random() * (max - min + 1)) + min
}

browser
  .init()
  // do stuff
  .pause(getRandomInt(2000, 5000))
  // do more stuff
```
Parsing Data jQuery Style with Cheerio
Below is a snippet from a function that gets videos from a Facebook page:
```javascript
const getVideos = (url) => {
  browser
    .url(url)
    .pause(15000)
    .getTitle()
    .then((title) => {
      if (!argv.production) console.log(`Title: ${title}`)
    })
    .getSource()
    .then((source) => {
      $ = cheerio.load(source)
      $('div.userContentWrapper[role="article"]').each((i, e) => {
        // parse stuff jQuery style here & maybe save it somewhere
        // wheeeeeee
      })
    })
}
```
I also use a similar method, plus some regex, to parse RSS feeds that can’t be fetched with command-line, cURL-like scripts.
```javascript
fastFeed.parse(data, (err, feed) => {
  if (err) {
    console.error('Error with fastFeed.parse() - trying via Selenium')
    console.error(err)
    browser
      .url(rssUrl)
      .pause(10000)
      .getSource()
      .then((source) => {
        // getSource() hands back the feed XML entity-encoded inside an HTML
        // wrapper, so decode the entities and strip the wrapper tags first
        source = source.replace(/&lt;/g, '<').replace(/&gt;/g, '>').replace(/&amp;/g, '&')
        source = source
          .replace(/(<.?html([^>]+)*>)/ig, '')
          .replace(/(<.?head([^>]+)*>)/ig, '')
          .replace(/(<.?body([^>]+)*>)/ig, '')
          .replace(/(<.?pre([^>]+)*>)/ig, '')
        if (debug) console.log(source)
        fastFeed.parse(source, (err, feed) => {
          // let's go further up the pyramid of doom!
          // ༼ノಠل͟ಠ༽ノ ︵ ┻━┻
        })
      })
  }
})
```
It actually works pretty well – I’ve tested with multiple sources.
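The decode-and-strip step above is plain string work, so it can be sketched in isolation. This is a hypothetical, standalone version of those two replacements – the sample input is made up, not a real feed:

```javascript
// Decode the entities Selenium's page source escapes, in the same order as above
const decodeEntities = (s) =>
  s.replace(/&lt;/g, '<').replace(/&gt;/g, '>').replace(/&amp;/g, '&')

// Strip the html/head/body/pre wrapper the browser adds around raw XML
const stripWrapper = (s) =>
  s.replace(/<\/?(html|head|body|pre)([^>]*)>/gi, '')

const raw = '<pre>&lt;rss&gt;&lt;channel&gt;&lt;/channel&gt;&lt;/rss&gt;</pre>'
console.log(stripWrapper(decodeEntities(raw))) // → <rss><channel></channel></rss>
```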
Injecting JavaScript
If you get to my level, injecting client-side JavaScript becomes commonplace.
```javascript
browser
  // du sum stufs lel
  .execute(() => {
    let person = prompt("Please enter your name", "Harry Potter")
    if (person != null) {
      alert(`Hello ${person}! How are you today?`)
    }
  })
```
By the way, this is a totally impractical example (in case you hadn’t noticed). Check out the following sections for real uses.
Beating CAPTCHA
GIFLY.co had not updated for over 48 hours and I wondered why. My script, which pulls animated gifs from various Facebook pages, was being hit with the captcha screen 😮
Cracking Facebook’s captcha was actually pretty easy. It took me exactly 15 minutes to accomplish this. I’m sure there are ways to do this internally but with Antigate providing an NPM package and with costs so low, it was a no-brainer for me.
```javascript
const Antigate = require('antigate')
let ag = new Antigate('booo00000haaaaahaaahahaaaaaa')

browser
  .url(url)
  .pause(5000)
  .getTitle()
  .then((title) => {
    if (!argv.production) console.log(`Title: ${title}`)
    if (title == 'Security Check Required') {
      browser
        .execute(() => {
          // injectz0r the stuffs necessary
          function convertImageToCanvas(image) {
            var canvas = document.createElement("canvas")
            canvas.width = image.width
            canvas.height = image.height
            canvas.getContext("2d").drawImage(image, 0, 0)
            return canvas
          }
          // give me a png with base64 encoding
          return convertImageToCanvas(document.querySelector('#captcha img[src*=captcha]')).toDataURL()
        })
        .then((result) => {
          // apparently antigate doesn't like the first part
          let image = result.value.replace('data:image/png;base64,', '')
          ag.process(image, (error, text, id) => {
            if (error) {
              throw error
            } else {
              console.log(`Captcha is ${text}`)
              browser
                .setValue('#captcha_response', text)
                .click('#captcha_submit')
                .pause(15000)
                .emit('good') // continue to do stuffs
            }
          })
        })
    }
  })
```
So injecting JavaScript has become super-handy here. I’m converting an image to a canvas, then calling .toDataURL() to get a Base64-encoded PNG to send to the Antigate endpoint. The conversion function was stolen from a site I steal a lot of things from – shouts to David Walsh. This solves the Facebook captcha, enters the value, then clicks submit.
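The data-URL handling is worth a closer look, since it trips people up. A standalone sketch of the stripping step – the payload here is just the start of a real PNG header, and the Buffer step is an extra I use for debugging, not something Antigate needs:

```javascript
// toDataURL() returns 'data:image/png;base64,<payload>'; solver services
// generally want only the raw Base64 payload after the comma.
const dataUrl = 'data:image/png;base64,iVBORw0KGgoAAAANSUhEUg=='
const payload = dataUrl.replace('data:image/png;base64,', '')

// If you want the actual bytes (e.g. to write a debug copy of the captcha):
const bytes = Buffer.from(payload, 'base64')
console.log(payload.slice(0, 8)) // → iVBORw0K
```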
Catching AJAX Errors
Why would you want to catch client-side AJAX errors? Because reasons. For example, I automated unfollowing everyone on Instagram and found that even through their website (not via the API) there is some kind of rate limit.
```javascript
browser
  // ... go to somewhere on Instagram
  .execute(() => {
    fkErrorz = []
    jQuery(document).ajaxError(function (e, request, settings) {
      fkErrorz.push(e)
    })
  })

// unfollow some people, below runs in a loop
browser
  .click('.unfollowButton')
  .execute(() => {
    return fkErrorz.length
  })
  .then((result) => {
    let errorsCount = parseInt(result.value)
    console.log('AJAX errors: ' + errorsCount)
    if (errorsCount > 2) {
      console.log('Exiting process due to AJAX errors')
      process.exit() // let's get the hell outta here!!
    }
  })
```
Because each follow/unfollow fires an AJAX call, and being rate-limited surfaces as an AJAX error, I inject an error handler that records every failure in a global array. I read that array’s length after each unfollow and terminate the script once I hit 3 errors.
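Stripped of the browser round-trips, the bail-out rule is just a counter and a threshold. A toy, self-contained version – the error objects here are fabricated stand-ins for what jQuery's ajaxError handler would receive:

```javascript
// Collect errors as they happen; abort once we've seen more than 2.
const fkErrorz = []
const recordError = (e) => fkErrorz.push(e)
const shouldAbort = () => fkErrorz.length > 2

recordError({ status: 429 })
recordError({ status: 429 })
console.log(shouldAbort()) // → false
recordError({ status: 429 })
console.log(shouldAbort()) // → true
```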
Intercepting AJAX Data
While scraping/crawling/spidering Instagram, I ran into a problem: a tag page did not give me the post date in the DOM. I really needed this data for IQta.gs and couldn’t afford to visit every post, as I’m parsing about 200 photos each run.
What I did find, though, is that there is a date variable stored in the post object the browser receives. Heck, I ended up not even using this variable, but this is what I came up with:
```javascript
browser
  .url(url) // an Instagram tag page eg. https://www.instagram.com/explore/tags/coding/
  .pause(10000)
  .getTitle()
  .then((title) => {
    if (!argv.production) console.log(`Title: ${title}`)
  })
  .execute(() => {
    // override AJAX prototype & hijack data
    iqtags = [];
    (function (send) {
      XMLHttpRequest.prototype.send = function () {
        this.addEventListener('readystatechange', function () {
          if (this.responseURL == 'https://www.instagram.com/query/' && this.readyState == 4) {
            let response = JSON.parse(this.response)
            iqtags = iqtags.concat(response.media.nodes)
          }
        }, false)
        send.apply(this, arguments)
      }
    })(XMLHttpRequest.prototype.send)
  })
  // do some secret awesome stuffs (actually, I'm just scrolling to trigger lazy loading to get moar data)
  .execute(() => {
    return iqtags
  })
  .then((result) => {
    let nodes = result.value
    if (!argv.production) console.log(`Received ${nodes.length} images`)
    let hashtags = []
    nodes.forEach((n) => {
      // use regex to get hashtags from captions
    })
    if (argv.debug > 1) console.log(hashtags)
  })
```
It’s getting a little late in Melbourne.
Other Smarts
So I run all of this in a Docker container running on AWS. I’ve pretty much made my Instagram crawlers fault-tolerant with some bash scripting (goes to check if it is running now)
```
==> iqtags.grid2.log <==
[2016-08-23 15:01:40][LOG] There are 1022840 images for chanbaek
[2016-08-23 15:01:41][LOG] chanbaek { ok: 1, nModified: 0, n: 1, upserted: [ { index: 0, _id: 57bc654f1007e86f09f70b49 } ] }
[2016-08-23 15:01:51][LOG] Getting random item from queue
[2016-08-23 15:01:51][LOG] Aggregating related tags from a random hashtag in db
[2016-08-23 15:01:51][LOG] Hashtag #spiritualfreedom doesn't exist in db
[2016-08-23 15:01:51][LOG] Navigating to https://www.instagram.com/explore/tags/spiritualfreedom
[2016-08-23 15:02:05][LOG] Title: #spiritualfreedom • Instagram photos and videos

==> iqtags.grid3.log <==
[2016-08-23 15:00:39][LOG] Navigating to https://www.instagram.com/explore/tags/artist
[2016-08-23 15:00:56][LOG] Title: #artist • Instagram photos and videos
[2016-08-23 15:01:37][LOG] Received 185 images
[2016-08-23 15:01:37][LOG] There are 40114945 images for artist
[2016-08-23 15:01:37][LOG] artist { ok: 1, nModified: 1, n: 1 }
[2016-08-23 15:01:47][LOG] Getting random item from queue
[2016-08-23 15:01:47][LOG] Aggregating related tags from a random hashtag in db
[2016-08-23 15:01:47][LOG] Hashtag #bornfree doesn't exist in db
[2016-08-23 15:01:47][LOG] Navigating to https://www.instagram.com/explore/tags/bornfree
[2016-08-23 15:02:01][LOG] Title: #bornfree • Instagram photos and videos
[2016-08-23 15:02:44][LOG] Received 183 images
[2016-08-23 15:02:44][LOG] There are 90195 images for bornfree
[2016-08-23 15:02:45][LOG] bornfree { ok: 1, nModified: 0, n: 1, upserted: [ { index: 0, _id: 57bc658f1007e86f09f70b4c } ] }
```
Yes, it seems all 6 IQta.gs crawlers are running fine 🙂 I’ve run into some issues with Docker where an image becomes unusable – I have no idea why and didn’t spend time on the root cause – but my bash script detects the inactivity, then completely removes the Selenium grid and starts it again from a fresh image.
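That watchdog script isn't published, but a minimal sketch of the idea might look like this – the log path, container name and image are all made up for illustration:

```shell
#!/bin/sh
# Hypothetical watchdog: if the crawler log has gone quiet, assume the grid
# is wedged, remove the container, and start again from a fresh image.

LOG=/var/log/fk/iqtags.crawler/iqtags.grid2.log

# true (exit 0) if the file was last modified more than $2 minutes ago
is_stale() {
  test -n "$(find "$1" -mmin +"$2" 2>/dev/null)"
}

if is_stale "$LOG" 10; then
  docker rm -f selenium-grid
  docker run -d --name selenium-grid selenium/standalone-firefox
fi
```

Run it from cron every few minutes and the grid never stays dead for long.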
Random Closing Thoughts
I had this heading written down before I started writing this post and I have forgotten the random thoughts I had back then. Oh well, maybe it will come back tomorrow.
Ah, a special mention goes out to Hartley Brody and his post, which was a very popular article on Hacker News in 2012/2013 – it inspired me to write this.
For those of you wondering what the hell 'browser' is:
```javascript
const webdriverio = require('webdriverio')
let options

if (argv.chrome) {
  options = {
    desiredCapabilities: {
      browserName: 'chrome',
      chromeOptions: {
        prefs: {
          'profile.default_content_setting_values.notifications': 2
        }
      }
    },
    port: port,
    logOutput: '/var/log/fk/iqtags.crawler/'
  }
} else {
  options = {
    desiredCapabilities: {
      browserName: 'firefox'
    },
    port: port,
    logOutput: '/var/log/fk/iqtags.crawler/'
  }
}

const browser = webdriverio.remote(options)
```
And argv comes from yargs.
Thanks to my long time friend In-Ho @ Google for proofreading! You can check out some of his music production here.
Comments
VividSoftwareSolutions
August 24, 2016 at 8:31 amFrancis, very impressive. I like the Catching Ajax errors part.
Thanks a ton for all the useful snippets!
Francis Kim
August 24, 2016 at 9:48 amThank you, glad you find it useful!
Kevin
August 24, 2016 at 9:27 amJust curious, how would you make money with scraping??
I’m a JS developer for profession, and have to setup Selenium + mocha / nightwatch frequently.
So wonder if there is some way to make an extra bug 🙂
Francis Kim
August 24, 2016 at 9:47 amNot sure to be honest lol. Most scraping jobs seem to be kind of low quality jobs wanting x leads for y dollars – that’s certainly not what I am after. But it has served me well for personal use. I think the money will be in creating a bot that allows smart automation rather than just doing the numbers.
chejazi
August 24, 2016 at 1:22 pmIf you can create sites with verticals that curate content automatically and better than a general purpose sharing platform, then you could make advertising revenue.
Francis Kim
August 25, 2016 at 12:31 pmThat’s what I’m trying to do but getting the site known doesn’t seem easy!
Nick Sweeting
August 24, 2016 at 10:17 amLead-gen jobs are all over the place, but they don’t pay very well unless you can scrape multiple sources and cross-reference to improve lead quality (which usually takes some data-science experience).
Cupcake89
August 24, 2016 at 11:19 amYour logo and specially it’s colors resembles speedof.me. Haha
Francis Kim
August 24, 2016 at 12:27 pmWell, colours maybe haha.
zack
August 24, 2016 at 12:24 pmIt’s not scraping, it’s an HTML feed!
Francis Kim
August 24, 2016 at 12:27 pmHaha, I like your thinking 🙂
Ronald
August 24, 2016 at 5:31 pmGreat article and awesome work with GIFLY.co! I happen to have published a small scraper lib on npm for those looking to do some quick scraping of listing-type sites: https://www.npmjs.com/package/cerealscraper
Keep up the good work 😉
Christian
August 24, 2016 at 6:49 pmNice article! You can avoid the pyramid of doom by creating custom commands (http://webdriver.io/guide/usage/customcommands.html). That will help you to have a better structured code.
Francis Kim
August 24, 2016 at 6:56 pmChristian the man himself! Thanks for that, I shall start using that when the pyramid gets too high.
Nick
August 25, 2016 at 6:46 amHi! Ever do any scraping side projects? I need a few existing ones upgraded.
deejbee
August 24, 2016 at 7:27 pmI have similar interests in automation and I made an auto-login chrome extension that logs into financial sites like banks automatically. It automates the “type characters 3, 5 & 9 from your password” type challenges, multipage logins and clicks etc.
https://chrome.google.com/webstore/detail/uyp-free-blasts-through-m/kblhemffnhfhjianafepcclpakocicgg
Anon
August 24, 2016 at 9:10 pmThere are so many different ways to scrape/mine data from the internet.
How do you compare these custom builds vs the latest startups trying to solve the “scraping” problem for the rest?
Have you looked into tools like;
Import.io
Mozenda
ScrapingHub
And other frameworks for scraping/parsing data?
Scrapy for example?
Pro Tip: You could also apply machine learning to your current setup to analyze, classify and filter your data before reposting it on your “curated” websites.
MonkeyLearn is a great example of an easy to use machine learning API
Marc
August 24, 2016 at 9:19 pmGlassdoor seems to be detecting Phantom.js as well. Did the “human” delay help here?
Douglas Muth
August 25, 2016 at 4:07 amBeing a bit of a newbie to Docker, I’d be interested in seeing an article on how to get these crawlers up and running in Docker, how you store/check output from the crawlers, etc. Thanks!
John Rowling
August 25, 2016 at 4:40 amGreat article Francis. These snippets are life-saver. Thanks for sharing.
Tom
August 26, 2016 at 12:32 amScrapy for example?
Single threaded…
RobotFramework is actually very interesting one for big projects (i.e. connecting multiple services together or using its selenium driver and injecting any JS (like JQuery) to do the job, I personally like tools like:
CasperJS
PhantomJS
but these two are so-so when it comes to captcha pooling. Usually this can be sorted out with making a screenshot of whole page and extracting coords of captcha image, then cut out. Once you have this you can use antigate (if i understood idea good), deathbycaptcha, decaptcher or any other decaptcher type service.
anything that supports headless
nick3499
August 26, 2016 at 5:54 am[Rhetorical] How could this scraping cat-and-mouse game ever become illegal? Internet exploits publicized data. Sysops can swipe at scraper gnats, and add comments to their TOS to discourage, but nothing more.
Virtual reality headsets draw user minds deeper into technoscapes, but Tron-like WAN awareness is impossible. No matter how often mindset-morphing mediascapes convince users they are nothing more than fleshy machines (just buzz to stimulate elite interest, and shift favor toward subservience).
Humans will always transcend machines, just as your own instructions adapt to human-sysop strats. And wield the modules of high-level tools to thwart human-sysop foes. Throwing down a staff which transforms into Python.
Ville
August 26, 2016 at 11:10 pmI’m using https://www.npmjs.com/package/dollar for simple scraping stuff. Works like charm 🙂
John C.
August 27, 2016 at 2:23 am“My stack looks like, pretty much all JavaScript”
*vomit*, *close tab*…
Juliano
August 28, 2016 at 3:47 amFrancis, great stuff! I actually working on writing an “HTML Feed” like yours, in fact, that’s how I ran into your article. I need to have a task that checks for changes weekly, and if there are changes to a particular element on DOM or particular site, update automatically, and create a JSON each time it changes. I’m not as comfortable as you are with all these tools but I can pick up. Recommend any other resources besides this great article?
Stefan Smiljkovic
December 7, 2016 at 6:07 pmGreat post Francis.
We are using power of nightmare.js for web scraping and automation.
Btw, I shared your post on my directory https://links.vanila.io/posts/Ykj8MDAY54A4Q7QZa/i-don-t-need-no-stinking-api-web-scraping-in-2016-and-beyond
Stas
May 4, 2017 at 9:10 pmFrancis, would your method be successful for AngelList scraping ?
What are the key things to consider when attempting it ?
Thanks !
Francis Kim
May 6, 2017 at 12:06 amHey Stas, yeah – it should be considered more as ‘web automation’ versus your everyday scraping. Anything a human can do and collect, a Selenium based script / bot can do. Some key things to consider are giving it plenty of pauses, and coding it to recover from failures.
Reza
May 28, 2017 at 3:46 amDude!
You are the automation guru! How long does it take for a newbie wanna be a programmer to get to your level? What are the requirements to become a pro in automation? What steps I need to take? I really love the comcept of automation and would love to have a shot.
Rowan miller
July 6, 2017 at 8:37 amGreat post