Sometime around 2013 and 2014, deep learning was going through a revolution that required pretty much everyone to reset their expectations as to how things worked, and leveled the playing field for what people were doing with computer vision.
At least that's the philosophy that Pinterest engineer Andrew Zhai and his team have taken, because around that time he and a few others began working on an internal moonlight project to build computer vision models within Pinterest. Machine learning tools and techniques had been around for some time, but thanks to revelations in how deep learning worked and the increasing use of GPUs, the company was able to take a fresh look at computer vision and see how it would work in the context of Pinterest.
"From a computer vision perspective we have a lot of images where visual search makes sense," Zhai said. "There's this product/data-set fit. Users that come to Pinterest, they're often in this visual discovery experience mode. We were in the right place at the right time where the technology was in the middle of a revolution, and we had our data set, and we're very focused on iterating as quickly as we can and get user feedback as fast as we can."
The end result was Lens, a product Pinterest launched earlier this month that allows users to point their camera at an object in the real world and get back search results from Pinterest. While a semi-beta was launched last year, Lens was the result of years of scrapped prototypes and product experimentation that eventually produced something that would hopefully turn the world into a collection of pins that are searchable through your camera, creative lead Albert Pereta said.
When a user looks at something through Lens, Pinterest's visual detection kicks in and determines what objects are in the photo. Pinterest's technology can then frame the image around, say, a chair, and use that to run a query using Pinterest's existing search technology. It uses certain heuristics, like a confidence score for what kind of object it is, and the object's context, such as whether it is the dominant object, the largest one, the one most in focus or something along those lines. Zhai said part of the priority was leveraging as much of Pinterest's existing technology, like search, as possible to build its visual search products.
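The selection heuristics described above could be sketched roughly as follows. This is only an illustration, not Pinterest's implementation: the `Detection` fields, the `sharpness` focus measure, the `pick_query_object` function and the weights are all hypothetical, chosen to show how a confidence score, relative size and focus might be combined to decide which object to frame.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """One detected object (hypothetical structure for illustration)."""
    label: str
    confidence: float   # detector score in [0, 1]
    box: tuple          # (x, y, width, height) in pixels
    sharpness: float    # assumed focus measure in [0, 1]


def area_fraction(det, image_w, image_h):
    """Share of the image covered by the detection's bounding box."""
    _, _, w, h = det.box
    return (w * h) / (image_w * image_h)


def pick_query_object(detections, image_w, image_h, weights=(0.5, 0.3, 0.2)):
    """Return the detection with the best weighted score, or None.

    The weights (confidence, size, focus) are assumptions, not known values.
    """
    if not detections:
        return None
    w_conf, w_area, w_sharp = weights

    def score(det):
        return (w_conf * det.confidence
                + w_area * area_fraction(det, image_w, image_h)
                + w_sharp * det.sharpness)

    return max(detections, key=score)


detections = [
    Detection("chair", 0.90, (100, 100, 400, 400), 0.8),
    Detection("lamp", 0.95, (0, 0, 50, 50), 0.2),
]
best = pick_query_object(detections, image_w=1000, image_h=1000)
print(best.label, best.box)  # the crop around this box would feed the search query
```

In this sketch the large, in-focus chair is preferred over the small, blurry lamp even though the lamp has a slightly higher detector confidence, which is the kind of trade-off the heuristics are meant to capture.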
Pinterest had collected a lot of data from users who initially cropped objects in their images in order to search for them, drawing bounding boxes for their searches. The company had positive feedback loops to determine if those searches were correct: if users engaged with results for a chair, then it was probably a chair. With that, the company had lots of ways to initially train these deep learning algorithms in order to shift the process over to camera photos and try to do the same thing. All that paid off later, as the initially janky projects gave the company the critical data set to build something more robust.
Pinterest's goal was to emulate the service's core user experience: that sort of putzing around and discovering new products or concepts on Pinterest. Just getting the literal results you might expect from a Google visual search wasn't enough to extend the Pinterest experience beyond its typical search with keywords and concepts to what you're doing with your camera. There are other ways to get to that result, like literally reading the label on a bottle or asking someone what kind of shoes they are wearing.
"If I'm in my kitchen and have an avocado in front of me, if we point at that and we return a million photos of avocados, that's close to as useless as you can get," Pereta said. "When someone tags an avocado on Pinterest, what they expect is to wander about. It can go from cooking a recipe to health benefits and growing one in a garden. You know the related pins, you don't quite understand why they're there but sometimes they feel like exactly what you want to see."
One of the biggest challenges Pinterest faced was figuring out how to jump from user-generated content like low-quality photos to results that included more professional, high-quality photography. It was easy to map from low-quality photos, like ones that are blurry or without great lighting, to other low-quality photos, visual search engineering manager Dmitry Kislyuk said. That's primarily what the results were returning in the first demos that the team was working on, so the team had to figure out how to get to higher-quality results. The two kinds of photos each clustered on their own, so the company had to force them to deliver the same semantic results and bucket them together.
Collectively, these pieces make a strong argument that Pinterest is trying to be a leader in visual search. That's largely been considered one of Pinterest's biggest strengths. Because of its large data set that lends itself so neatly to products, each part of an image can easily be broken out into searches for other products. These searches existed early on at Pinterest, but only in limited form, and users couldn't figure out what to do with them. In the past few years, they've started to mature more and more. The pitch is part of what's made Pinterest attractive to advertisers, though it needs to ensure it makes the jump from a curiosity baked into an innovation budget to a mainstay product alongside Facebook (and soon potentially Snapchat).
A lot of the success and origins of Pinterest's modern visual search dovetail almost perfectly with the rise of GPU usage for deep learning. The processors had existed for a long time, but GPUs are great at running processes in parallel, such as rendering pixels on a screen, and doing it very quickly. CPUs have to be more versatile, but GPUs were specialized at running these kinds of processes in parallel, enabling the actual mathematics that's happening in the background to execute faster. (This revolution has also rewarded NVIDIA, one of the largest GPU makers in the world, by more than tripling its stock price in the past year and turning it into a critical component in the future of deep learning and autonomous driving.)
"Methods for deep learning existed for 10 or 20 years, but it was this one paper around 2013 and 2014 that showed when you provided those methods on a GPU you can get amazing accuracy and results," Zhai said. "It's really because of the GPU itself, without that this revolution probably wouldn't happen. GPUs only care about these specific things like matrix multiplication, and you can do it really fast."
The actual process is a careful dance between what happens on the phone and what happens online, in order to build a more seamless user experience. For example, when a user looks at something through their phone, the annotations for Lens are returned quickly while the company finishes doing the image search on the back-end. Cutting the latency the user perceives in this way helps smooth out the experience and makes it feel more real-time. That will be important going forward as Pinterest begins to expand internationally and has to start grappling with problems like high-latency areas, potentially moving more operations to the phone.
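The split between fast annotations and a slower back-end search might look something like the sketch below. It is an assumption-laden illustration rather than Pinterest's architecture: `annotate`, `visual_search`, `lens_request` and the simulated delays are all hypothetical stand-ins. The point is only that the slow search starts immediately, while the quick annotations are shown to the user as soon as they arrive.

```python
import asyncio


async def annotate(image):
    """Hypothetical fast step: return labels for the objects in view."""
    await asyncio.sleep(0.01)   # simulated short delay
    return ["chair"]


async def visual_search(image):
    """Hypothetical slow step: full image search on the back-end."""
    await asyncio.sleep(0.05)   # simulated longer delay
    return ["pin-1", "pin-2"]


async def lens_request(image, on_update):
    """Start the slow search right away, but surface annotations first."""
    search_task = asyncio.create_task(visual_search(image))
    annotations = await annotate(image)
    on_update("annotations", annotations)   # user sees something quickly
    results = await search_task
    on_update("results", results)           # full results arrive afterwards
    return annotations, results


events = []
annotations, results = asyncio.run(
    lens_request(b"", lambda kind, payload: events.append(kind))
)
print(events)
```

Because the search task is created before the annotations are awaited, the two steps overlap, so the total wait is roughly that of the slower step rather than the sum of both.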
Pinterest's results were partially the result of a lot of new learnings, and partially luck that everyone's teams had to scrap and re-learn all their approaches to deep learning. Beyond that, Pinterest has billions of images, largely high-quality ones that lend themselves to being naturally searchable, an archive of data that other companies or academics might not have. The whole "move fast, break things" approach kind of fits with Pinterest, which was trying to get versions in front of users in order to figure out what worked best, because the team (of fewer than a dozen) felt like it was inventing new user behavior.
There are plenty of other attempts by other companies to weaponize this technology into something commercial, with startups like Clarifai raising a lot of capital and building metadata-driven visual search that it makes available for retailers and businesses. Google is always a looming beast with its vast amount of data, though whether that translates into a commercial product is another story. Pinterest, meanwhile, hopes that its focus on returning related ideas rather than direct one-to-one image results, and the tech behind it, is something that'll continue to differentiate it going forward.
"We're trying to use camera to turn your world into Pinterest," Pereta said. "It's not that we're creating some completely new experience to a user. It feels like when we nailed it, it's when you feel like the entire world is made of pins. That thing, I take a photo of that chair, it's not just that chair's similar styles but also it in context. If you were to find that chair on Pinterest, that's exactly what you'd expect to find. That wandering, that discovering. When we do a really good job with camera, it's gonna feel like the world is made of pins."