Our past Technical Chair discussed Jake Mannix's upcoming talk, Smarter Search With Spark-Solr, at MLconf Seattle, scheduled for May 20th.
As an ML executive, you have to deal with many different algorithms, systems, and platforms. How do you deal with this technical complexity?
JM) Typically: aim for simplicity and tried-and-true effectiveness w.r.t. algorithms, ease of modification and extensibility for systems, and a balance between openness, stability, and scalability for platforms.
Is it worth buying, building in-house, or using open-source ML?
JM) Yes…. LoL. Ok, seriously now: depending on your organization and its needs, all three of these can be the right choice. If you have a team of more than 3-5 experienced data scientists / engineers with graduate training in ML, building some in-house can work. If you’re an org without a lot of direct internal ML training, try out-of-the-box OSS ML and see what it can get you (esp. after engaging with the community to see where the rough edges are, and whether they’re surmountable). If you’re a large organization with at least one internal ML expert who can vet closed-source vendors, then buying ML products can be viable (but I would still warn against going with anything *too* “black-box”; I translate that in my head as “black magic”, a.k.a. hard to tell apart from ‘snake oil’).
How do you introduce new algorithms into production? Do you proactively follow the literature, or do you develop/modify algorithms based on your own result analysis?
JM) The literature, when academic, is rarely applicable in production any time soon. When it comes out of serious industry work, it’s worth looking at to see if you have the requisite inputs (i.e. if Google has a paper showing what you can do with 100M images with labeled tags plus some collaborative filter over billions of clicks, well, it doesn’t really matter to most of us: we don’t have those things). Always trying to “keep up with the latest” developments tends to be useful, IMO, only for the broadest view of the field: see the next question, for example.
Everybody is thrilled with deep learning. We have seen great results in image/video/translation. Have you seen the same revolution in your domain?
JM) I’ve heard inklings of DL being useful for text, and while I’ve seen people play around with e.g. word2vec-style models, I haven’t heard of a single production usage of it. Yes, perhaps for machine translation, but even there, I don’t see it *replacing* all the older techniques just yet.
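As a concrete illustration of the kind of experimentation mentioned above (not something from the interview itself), here is a minimal sketch of training a word2vec-style model with Spark MLlib’s Word2Vec, in keeping with the Spark/Solr theme of the talk. The toy corpus, app name, and parameter values are purely hypothetical.

// Illustrative sketch only: training word2vec-style embeddings with Spark MLlib.
// Corpus and parameters are toy values chosen for demonstration.
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.sql.SparkSession

object Word2VecSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("word2vec-sketch").getOrCreate()
    import spark.implicits._

    // Toy corpus: each row is a pre-tokenized document.
    val docs = Seq(
      "solr makes full text search easy".split(" "),
      "spark makes distributed computation easy".split(" "),
      "search and recommendations both need good text features".split(" ")
    ).map(Tuple1.apply).toDF("tokens")

    // Fit a small embedding model; vectorSize and minCount are illustrative.
    val word2vec = new Word2Vec()
      .setInputCol("tokens")
      .setOutputCol("features")
      .setVectorSize(50)
      .setMinCount(1)
    val model = word2vec.fit(docs)

    // Inspect nearest neighbors in embedding space, e.g. terms similar to "search".
    model.findSynonyms("search", 3).show()

    spark.stop()
  }
}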
Is there new upcoming research that you think will change the game in the way we search and understand documents?
JM) Syntactic parsing of text documents, when achievable at scale, certainly has the capability of significantly improving search – see for example the effect that Google Knowledge Vault cards have on factual queries, or keep an eye out for advancements in the Allen Institute for AI’s Semantic Scholar ( https://www.semanticscholar.org ).
What is your opinion about knowledge base technology (DeepDive, Knowledge Vault, etc.)? How close are we to offering a Watson out of a directory of text data?
JM) Most of this technology is both a) fantastic, and b) completely locked behind closed doors. None of it is terribly open, and I’m not sure I see that changing any time soon.
Jake Mannix, Lead Data Engineer, Lucidworks
Living at the intersection of search, recommender systems, and applied machine learning, with an eye for horizontal scalability and distributed systems. Currently Lead Data Engineer in the Office of the CTO at Lucidworks, doing research and development of data-driven applications on Lucene/Solr and Spark.
Previously built out LinkedIn’s search engine and was a founding member of the Recommender Systems team there; after that, at Twitter, built their user/account search system and led that team before creating the Personalization and Interest Modeling team, focused on text classification and graph-based authority and influence propagation.
Apache Mahout committer, PMC Member (and former PMC Chair).
In a past life, studied algebraic topology and particle cosmology.