asim.dev

The Future of Machine Learning & Javascript

ng-conf 2019 · Salt Lake City, USA

by Asim Hussain · 30 April 2019


I gave this talk at ng-conf 2019 to teach machine learning from first principles by walking through a couple of demos from aijs.rocks, a site my friend Elena and I built to collect JavaScript-powered AI applications. My promise to the room: by the end you’ll see that a neural network is “just multiplication and tuning numbers” — nothing magical — and you’ll know exactly how the cat-drawing demo works.

AI-generated summary of my talk

Jump into the talk

  1. 0:07 Intro and aijs.rocks
  2. 3:08 The 'puppies' demo: image classification in the browser
  3. 5:08 What a neuron actually is
  4. 8:10 Stacking neurons and back-propagation
  5. 10:11 TensorFlow.js and MobileNet
  6. 12:13 Cloud APIs and auto-generated alt text
  7. 16:13 GANs and the cat-outline demo
  8. 19:20 Text-to-image, and how far away are we?

Where these demos come from

My friend Elena and I both fell into machine learning at the start of the year and launched a meetup in London — the AI JS meetup. People kept sending us links to JavaScript-powered AI applications, so we built a place to put them: aijs.rocks, a collection of AI-powered JavaScript apps. Because they’re JavaScript, you can always click in and play with them in the browser, and usually you can view source and see exactly how they were built. It’s a good place to get inspired and to learn a little machine learning. I picked two of those demos to take apart on stage.

The first was built by an attendee called Oliver, off the back of a workshop I’d run. It runs in a CodePen: do an Unsplash search for, say, a puppy, and in the bottom-left corner it tries to guess what’s in the image. The crucial, slightly crazy part is that the detection is all happening in the browser, in JavaScript. Ten years ago we couldn’t reliably tell a cat from a dog; now we’re doing this client-side.

A neuron is just multiplication

That demo is built on TensorFlow, which Google open-sourced in 2015. It’s a library for running mathematical computations across hundreds of servers, GPUs and CPUs, and it’s often used to build neural networks. So what is a neural network? It has a basis in biology. A neuron has dendrites coming in and an axon going out; if enough electricity flows into the dendrites, the body fires some out of the axon. That’s it.

If you coded one up in JavaScript you’d reach back to your graph-theory days: a node, some edges coming in, some going out. You pump in some features as “electricity” — maybe the day of the year and the temperature in Celsius (the correct metric, and let’s not start on Europe). Each edge has a weight. You multiply each input by its weight, sum them, and pass the result into an activation function — just a function — and whatever it returns is the electricity going back out. It’s a bit of multiplication, a little addition, not much else.
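Here’s a minimal sketch of that single neuron in plain JavaScript. The sigmoid activation, the feature values and the weights are my own illustrative assumptions; the talk only says the activation is “just a function”.

```javascript
// A common choice of activation function: squashes any number into 0..1.
// (Assumed for illustration; any function could be plugged in instead.)
function sigmoid(x) {
  return 1 / (1 + Math.exp(-x));
}

// One neuron: multiply each input by its weight, add the results together,
// then pass the total through the activation function.
function neuron(inputs, weights, activation = sigmoid) {
  if (inputs.length !== weights.length) {
    throw new Error("each input needs exactly one weight");
  }
  let sum = 0;
  for (let i = 0; i < inputs.length; i++) {
    sum += inputs[i] * weights[i];
  }
  return activation(sum);
}

// Illustrative values only: day of the year and temperature in Celsius.
const features = [120, 18];
const weights = [0.01, -0.05];
const output = neuron(features, weights);
console.log(output); // the "electricity" going back out
```

Swapping the third argument for a different function changes how the neuron “fires” without touching the multiply-and-add part.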

Stack them, then tune the numbers

A neural network is just a bunch of these combined: an input layer, an output layer, a couple of hidden layers, and edges connecting every node in one layer to every node in the next. You initialise the weight on every edge with a random number, Math.random(), and at first the network is useless. Say you want it to tell whether a face is happy or sad: you feed in features like the positions of the eyes, nose and mouth, do all that multiplication, and out comes a number. If 0 means unhappy and 10 means happy, and the network says 3 for an image a human has labelled “happy”, we know it’s wrong. Of course it is: we seeded it with random values.
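As a rough sketch, a randomly seeded network like that could look as follows. The layer sizes, the sigmoid activation, the four made-up face features and the shift of Math.random() to allow negative weights are all assumptions of mine, not details from the talk.

```javascript
function sigmoid(x) {
  return 1 / (1 + Math.exp(-x));
}

// One layer is a list of neurons; each neuron is a list of weights,
// one weight per incoming edge. Every weight starts out random.
function randomLayer(inputCount, neuronCount) {
  return Array.from({ length: neuronCount }, () =>
    // Shifted to -0.5..0.5 so weights can be negative (an assumption).
    Array.from({ length: inputCount }, () => Math.random() - 0.5)
  );
}

// Feed the inputs through every layer in turn: the outputs of one layer
// become the inputs of the next.
function forward(layers, inputs) {
  return layers.reduce(
    (activations, layer) =>
      layer.map((weights) =>
        sigmoid(weights.reduce((sum, w, i) => sum + w * activations[i], 0))
      ),
    inputs
  );
}

// 4 inputs -> hidden layer of 3 -> hidden layer of 3 -> 1 output.
const network = [randomLayer(4, 3), randomLayer(3, 3), randomLayer(3, 1)];

// Hypothetical, normalised face features (eye, nose and mouth positions).
const faceFeatures = [0.31, 0.69, 0.5, 0.8];

const [raw] = forward(network, faceFeatures);
const happiness = raw * 10; // rescale 0..1 to the 0..10 scale
console.log(happiness.toFixed(1)); // differs on every run: the weights are random
```

Run it a few times and the score jumps around, which is the point: until the weights are tuned, the answer is noise.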

The secret is back-propagation. You tell the network how wrong it is, and the algorithm works out how to adjust the edges so that next time it does better. That’s the whole game: a lot of multiplication, and tuning the numbers. Nothing magical.
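To make “tell it how wrong it is, then adjust” concrete, here is a sketch that tunes a single neuron with gradient descent. This is deliberately not full back-propagation: a real network applies the same idea layer by layer, working backwards from the output. The training data, the squared-error measure, the starting weights and the learning rate are all assumptions for illustration.

```javascript
function sigmoid(x) {
  return 1 / (1 + Math.exp(-x));
}

function predict(weights, inputs) {
  return sigmoid(inputs.reduce((sum, x, i) => sum + x * weights[i], 0));
}

// One tuning step for one labelled example. With squared error
// E = (prediction - target)^2 / 2 and a sigmoid activation, the gradient for
// weight i is (prediction - target) * prediction * (1 - prediction) * inputs[i].
function tune(weights, inputs, target, learningRate) {
  const prediction = predict(weights, inputs);
  const delta = (prediction - target) * prediction * (1 - prediction);
  return weights.map((w, i) => w - learningRate * delta * inputs[i]);
}

// Hypothetical data: [mouth curvature, constant 1 acting as a bias input].
// Target 1 means "happy", 0 means "sad".
const examples = [
  { inputs: [0.9, 1], target: 1 },
  { inputs: [0.7, 1], target: 1 },
  { inputs: [-0.8, 1], target: 0 },
  { inputs: [-0.6, 1], target: 0 },
];

function totalError(weights) {
  return examples.reduce(
    (sum, { inputs, target }) => sum + (predict(weights, inputs) - target) ** 2 / 2,
    0
  );
}

// A fixed "random" starting point so the run is repeatable.
let tunedWeights = [-0.5, 0.2];
const errorBefore = totalError(tunedWeights);

for (let epoch = 0; epoch < 1000; epoch++) {
  for (const { inputs, target } of examples) {
    tunedWeights = tune(tunedWeights, inputs, target, 0.5);
  }
}

const errorAfter = totalError(tunedWeights);
console.log(errorBefore.toFixed(3), errorAfter.toFixed(3));
```

The only thing that changes between the useless neuron and the useful one is the numbers in the weights array.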

From MobileNet to cloud APIs

Last year Google announced TensorFlow.js. I’d assumed it was bindings — that you’d need TensorFlow installed underneath — but it’s a complete rewrite in JavaScript. That matters because now all you need to do ML in the browser is one dependency, or a single script tag at the bottom of the page if you code the way I like to. You can train models, or — much more often — load pre-trained ones, like searching npm for a module instead of writing it yourself.

Oliver’s demo used MobileNet, a very simple network trained to identify about a thousand things. It’s not very good, but the trick is in the name: it’s small enough to load realistically into a browser, in about four lines. That’s the trade-off — size for capability. If you really want to know what’s inside an image, the models that can name tens of thousands of things are far too large to download. That’s where the cloud providers come in. Microsoft’s is called Computer Vision: pass it an image, get back what’s inside.
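Those few lines look roughly like the sketch below. It assumes the @tensorflow-models/mobilenet package (which needs TensorFlow.js loaded alongside it) and uses its load() and classify() calls; the wrapper function and its name are mine, and the module is passed in as a parameter so the sketch doesn’t assume how your page loads it.

```javascript
// `mobilenet` is the module from @tensorflow-models/mobilenet.
// `imageElement` is an <img>, <canvas> or <video> element from the page.
async function guessWhatIsInImage(mobilenet, imageElement) {
  // Download the pre-trained model (this is the "small enough" part).
  const model = await mobilenet.load();

  // Returns a list of guesses shaped like { className, probability }.
  const predictions = await model.classify(imageElement);

  // Pick the most probable guess explicitly rather than relying on ordering.
  const best = predictions.reduce((top, p) =>
    p.probability > top.probability ? p : top
  );
  return `${best.className} (${Math.round(best.probability * 100)}%)`;
}
```

With the two script tags on the page, the call would be something like guessWhatIsInImage(mobilenet, document.querySelector("img")).then(console.log).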

My friend and colleague Sarah Drasner had a lovely idea for it. We’re all supposed to give images alt text for screen readers — so could the API generate that alt text automatically? What boggles my mind is that it doesn’t just say “a photo of a person”; it gives a human-sounding description, and for one photo it correctly knew the subject was Thomas Edison. As ever, the supportive folk of Twitter were quick to point out where it failed — a “star-filled sky” that wasn’t, animals it couldn’t be sure were alive. Even the big models get it wrong. By my marking, 50% is a pass.
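A sketch of the alt-text step, under assumptions: I’m assuming the describe-style response carries a list of captions, each with a text and a confidence between 0 and 1, and I’ve left out the HTTP call itself. The function name and the null fallback are hypothetical; the 0.5 threshold is just my 50% pass mark from above.

```javascript
// Assumed response shape:
// { description: { captions: [{ text: "...", confidence: 0.0-1.0 }, ...] } }
function altTextFrom(response, minConfidence = 0.5) {
  const captions =
    (response && response.description && response.description.captions) || [];

  // Keep the caption the service is most confident about.
  const best = captions.reduce(
    (top, caption) =>
      top === null || caption.confidence > top.confidence ? caption : top,
    null
  );

  // Below the pass mark, return null so a human writes the alt text instead.
  return best !== null && best.confidence >= minConfidence ? best.text : null;
}

const exampleResponse = {
  description: {
    captions: [{ text: "a dog sitting on grass", confidence: 0.92 }],
  },
};
console.log(altTextFrom(exampleResponse));
```

Returning null rather than an empty string is deliberate: an empty alt attribute tells screen readers the image is decorative, which isn’t what a low-confidence guess means.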

GANs: two networks fighting

My favourite demo is image-to-image: draw the outline of a cat and it fills in the rest, live in the browser. It’s made by Zaid, a student in Saudi Arabia; I certainly wasn’t doing this at university. It uses a generative adversarial network (GAN), a technique first published in 2014 and so only a few years old at the time, in which two neural networks compete.

You have a generator and a discriminator. The generator takes cat outlines and tries to produce cat images — badly at first, because it’s seeded randomly. You mix its fakes in with real cat images and hand the pile to the discriminator, whose job is to tell real from fake. When the discriminator gets it wrong, you tune its numbers; when it gets it right, that tells the generator it’s not convincing enough, so the generator improves. They keep fighting — discriminator gets better, generator gets better — until the discriminator can’t tell anymore. Then you throw the discriminator away, export the generator as a JSON model, load it in TensorFlow.js, and run it in the browser.
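The fight can be sketched with a toy GAN that works on single numbers instead of images. Everything here is an assumption made to keep it small: the “real” data are numbers near 4, the generator and discriminator each have just two tunable numbers, and the gradients are written out by hand. It is not how the cat demo was trained, only the same loop in miniature.

```javascript
function sigmoid(x) {
  return 1 / (1 + Math.exp(-x));
}

// Generator: turns random noise z into a fake sample.
const generate = (g, z) => g.a * z + g.b;
// Discriminator: the probability (0..1) that sample x is real.
const discriminate = (d, x) => sigmoid(d.w * x + d.c);

// One round of the fight, on one real sample and one noise value.
function trainStep({ g, d }, real, z, lr) {
  const fake = generate(g, z);

  // 1. Tune the discriminator: push D(real) towards 1 and D(fake) towards 0.
  //    Loss: -log D(real) - log(1 - D(fake)).
  const pReal = discriminate(d, real);
  const pFake = discriminate(d, fake);
  const newD = {
    w: d.w - lr * (-(1 - pReal) * real + pFake * fake),
    c: d.c - lr * (-(1 - pReal) + pFake),
  };

  // 2. Tune the generator: push D(fake) towards 1, to fool the updated
  //    discriminator. Loss: -log D(fake).
  const slope = -(1 - discriminate(newD, fake)) * newD.w; // d(loss)/d(fake)
  const newG = {
    a: g.a - lr * slope * z,
    b: g.b - lr * slope,
  };

  return { g: newG, d: newD };
}

// "Real" data: numbers clustered around 4, standing in for real cat photos.
const realSample = () => 4 + (Math.random() - 0.5);
const noise = () => Math.random() * 2 - 1;

let model = { g: { a: 1, b: 0 }, d: { w: 0, c: 0 } };
for (let step = 0; step < 500; step++) {
  model = trainStep(model, realSample(), noise(), 0.05);
}

// Throw the discriminator away and keep only the generator.
const generator = model.g;
console.log(generate(generator, noise()));
```

In the real demo the generator is a far larger network that maps outlines to images, but the shape of the loop is the same: discriminator step, generator step, repeat.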

The inputs don’t have to be outlines. They can be segmented images, or — the part that lands in silence every time — plain text. Describe a flower in words and the network paints it. So I’ll leave the room with the question I keep asking: ten years ago we couldn’t tell a cat from a dog in an image; how far away are we, really, from someone typing “build me an e-commerce application with four pages and PayPal for payments” and getting it?