I can't do that for you dave: Undefined is not a function
LeadDev London · London, UK
by Asim Hussain · 31 May 2019
This is a ten-minute lightning tour, titled after HAL 9000’s famous line in “2001: A Space Odyssey”, of what machine learning actually looks like when you’re a JavaScript developer rather than a maths PhD. I run through three AI-powered web apps live, and the whole point is how little stands between you and building this stuff yourself.
AI-generated summary of my talk
Jump into the talk
- 0:16 The title, AI JavaScript meetup, aijs.rocks
- 1:18 TheMojifier — emotion to emoji
- 2:21 The Face API: skip the deep dive, check for an API
- 3:22 In-browser classification with TensorFlow.js + MobileNet
- 6:23 Computer Vision API: captions as alt-text
- 8:24 pix2pix and generative models — outline to cat
- 9:24 Where this is heading
A meetup, and a website of the weird
I co-run an AI JavaScript meetup. We started it early in 2018 because we’d noticed JavaScript and machine learning beginning to overlap in genuinely interesting ways. At the end of every meetup someone would wander up with a link to some AI-powered JavaScript site, and after a year of that we collected them all into one place: aijs.rocks. So for the talk I picked three of those apps and walked through, very briefly, what’s actually going on under the hood. The title is the HAL-9000 line, mangled — if you got the reference, you’re old.
TheMojifier, and the lesson hiding inside it
The first one I wrote myself: TheMojifier. Give it an image, it finds every face, works out the emotion on each one, picks the matching emoji and pastes it back over the face. It works with multiple faces, it works on memes, you can add it to your own Slack workspace. The first demo is a photo of my son — that, apparently, is exactly how he emojifies.
The interesting question is how it reads emotion at all. I do run workshops teaching people to train a neural network to detect emotion, but the honest answer, the one people don’t expect, is: there’s an API for that. Microsoft’s Face API takes a posted image and returns every face it finds, each with a set of emotion scores: anger, contempt, disgust, fear, happiness, neutral, sadness, surprise. (A warning for the bearded among you: you can apparently never be 100% happy. I’m sorry.) That’s the first real lesson of the talk: before you go deep-diving into machine learning, check whether someone has already commoditised the hard part behind a single request.
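To make "a single request" concrete, here is a minimal sketch of the idea, not TheMojifier's actual code. The endpoint path, query parameter, header name and response shape are assumptions based on how the Face API was documented around the time of the talk; the emoji table and the function names are hypothetical.

```typescript
// Per-face emotion scores, each between 0 and 1 (assumed shape).
type EmotionScores = Record<string, number>;

// Assumed response shape: one entry per detected face.
interface DetectedFace {
  faceAttributes: { emotion: EmotionScores };
}

// Hypothetical emoji table; the real app's mapping is not shown in the talk.
const EMOJI: Record<string, string> = {
  anger: "😠",
  contempt: "😒",
  disgust: "🤢",
  fear: "😨",
  happiness: "😄",
  neutral: "😐",
  sadness: "😢",
  surprise: "😮",
};

// Describe the single POST request. Path, query parameter and header name
// are assumptions; check them against the current API reference.
function buildDetectRequest(endpoint: string, key: string, imageUrl: string) {
  return {
    url: `${endpoint}/face/v1.0/detect?returnFaceAttributes=emotion`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "Ocp-Apim-Subscription-Key": key,
      },
      body: JSON.stringify({ url: imageUrl }),
    },
  };
}

// Return the name of the highest-scoring emotion ("neutral" if there are no scores).
function dominantEmotion(scores: EmotionScores): string {
  let best = "neutral";
  let bestScore = -Infinity;
  for (const name of Object.keys(scores)) {
    if (scores[name] > bestScore) {
      best = name;
      bestScore = scores[name];
    }
  }
  return best;
}

// Pick one emoji per detected face, falling back to a neutral face.
function emojisForFaces(faces: DetectedFace[]): string[] {
  return faces.map(
    (face) => EMOJI[dominantEmotion(face.faceAttributes.emotion)] || "😐"
  );
}
```

In a browser, the object returned by `buildDetectRequest` could be handed to `fetch(request.url, request.init)` and the parsed JSON passed to `emojisForFaces`. Pasting the emoji back over each face would also need the face rectangles from the response, which this sketch leaves out.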
Running the model in your browser
The second app is a simple CodePen: click around, it searches Unsplash for an image, and the percentages in the corner are its guess at what’s inside — terrier, puppy, and so on. The thing worth noticing is that the only network call is fetching the image. The recognition happens in JavaScript, in the browser.
That’s TensorFlow.js. Plain TensorFlow runs heavy numerical computation across GPUs and CPUs, with a core written in C++. Early in 2018 the TensorFlow team shipped TensorFlow.js, the whole thing rewritten from scratch in JavaScript, and the lovely part is that it’s the only dependency you need. No extra install. You can train models from your own data, or load pre-trained ones. This demo uses MobileNet, a model that recognises one of 1,000 things in an image and has been optimised to run on mobile, and it gives you that whole classifier in about four lines of code.
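A rough sketch of what those few lines can look like, assuming the `@tensorflow-models/mobilenet` package, whose `load()` resolves to a model with a `classify(image, topk)` method. The library is not imported here so the sketch stays dependency-free; the `ImageClassifier` interface, `formatPredictions` and `labelImage` are hypothetical names introduced for illustration, and this is not the CodePen's actual source.

```typescript
// One prediction, in the shape MobileNet's classify() resolves with.
interface Prediction {
  className: string;
  probability: number;
}

// Minimal stand-in for the loaded model, so the sketch compiles on its own.
interface ImageClassifier<Image> {
  classify(image: Image, topk?: number): Promise<Prediction[]>;
}

// Turn predictions into overlay lines such as "terrier 62.3%", best guess first.
function formatPredictions(predictions: Prediction[]): string[] {
  return predictions
    .slice() // copy, so the caller's array is not reordered
    .sort((a, b) => b.probability - a.probability)
    .map((p) => `${p.className} ${(p.probability * 100).toFixed(1)}%`);
}

// Classify an image entirely in the browser and return the top three labels.
// With the real packages this would be called roughly as:
//   const model = await mobilenet.load();
//   const lines = await labelImage(model, imgElement);
function labelImage<Image>(
  model: ImageClassifier<Image>,
  image: Image
): Promise<string[]> {
  return model.classify(image, 3).then(formatPredictions);
}
```

Keeping the model behind a small interface is only a convenience for this sketch: it lets the formatting be exercised with a stub, while the real model is downloaded once and then runs locally, which is why the demo's only network call is the image fetch.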
The catch is size. A model small enough for mobile only knows so much. If you want to really understand what’s in an image, you need a much bigger model trained on much more data, and again you can either build that yourself or just call an API. Microsoft’s Computer Vision API is one option, and the bit I love is that it returns a caption: a plain, human-readable sentence describing the image. My friend Sarah built a demo wiring that straight into alt-text, the descriptions screen readers read aloud for people who can’t see the image. Twitter, being Twitter, was quick to point out where it fails: it got a star-filled sky wrong, and a couple of others about half right. Fifty per cent is a pass mark in my book.
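Because the captions are sometimes wrong, one reasonable pattern is to use a caption as alt-text only when the service is fairly confident in it. The sketch below assumes the response shape of the Computer Vision "describe" operation as documented around 2019 (a `description` object holding `tags` and `captions`, each caption with `text` and `confidence`). The 0.5 threshold and the function name are hypothetical, and this is not the code from Sarah's demo.

```typescript
// Assumed caption shape: a sentence plus a confidence between 0 and 1.
interface Caption {
  text: string;
  confidence: number;
}

// Assumed response shape for the describe operation.
interface DescribeResponse {
  description: { tags: string[]; captions: Caption[] };
}

// Return the most confident caption, or null when none clears the threshold,
// so the caller can decide what to do instead of shipping a bad description.
function altTextFromDescription(
  response: DescribeResponse,
  minConfidence: number = 0.5 // hypothetical cut-off, tune for your content
): string | null {
  let best: Caption | undefined;
  for (const caption of response.description.captions) {
    if (!best || caption.confidence > best.confidence) {
      best = caption;
    }
  }
  if (!best || best.confidence < minConfidence) {
    return null;
  }
  return best.text;
}
```

Returning `null` rather than an empty string is a deliberate choice in this sketch: an empty `alt` tells a screen reader the image is decorative, which is a different statement from "no trustworthy caption was available".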
Drawing the rest of the cat
The last one is my favourite. Running in the browser, you draw a rough outline of a cat and it generates the rest — a photorealistic cat — using pix2pix, a generative neural network that takes your outline as input and paints the image. It doesn’t have to be cats; it works on human faces too, at which point the room stops laughing. From there the inputs get stranger: a segmented depth image driving a generated dancer (a related vid2vid model, not yet in the browser), and even plain text as the input with the images as output.
Which is where I left it. How far off, really, is someone typing “build me an e-commerce app on this reference architecture for a company in Japan with this many customers” and getting it? In 2019 my answer was: not that far off. That’s the whole talk — ten minutes, no maths PhD required, and an npm install or a single API call is enough to start.