Thomas Levi, POF on How Online Dating is Improving Matching through Big Data

We discuss Big Data use cases at Plenty of Fish, insights from text mining of user profiles, using topic modeling for developing user archetypes, challenges and more.

By Anmol Rajpurohit (@hey_anmol), July 29, 2014.

Thomas Levi started out with a doctorate in Theoretical Physics and String Theory from the University of Pennsylvania in 2006. His postdoctoral studies in cosmology and string theory, where he wrote 19 papers garnering 650+ citations, then took him to NYU and finally UBC. In 2012, he decided to move into industry, and took on the role of Senior Data Scientist at POF. Thomas has been involved in diverse projects such as behavior analysis, social network analysis, scam detection, bot detection, matching algorithms, topic modelling and semantic analysis.

Here is my interview with him:

Anmol Rajpurohit: Q1. What does PlentyOfFish (POF) do? What are the most important use cases of Big Data at PlentyOfFish?

Thomas Levi: PlentyOfFish is the world’s largest online dating site with over 80 million registered users.

Every day, 3.6 million unique users log into the site and send between 20K and 30K messages per minute. But what we’re most proud of is all of the relationships that are created as a result of the site.

We use data for a lot of things at PlentyOfFish, both internally and externally. We collect data on successful couples that have used the site and use that information to train and update our matching algorithms via neural networks among other things. Internally, we use data for diverse projects like scam detection, predicting user on-boarding/churn, user behavior as well as more advanced social network analysis to understand how users message and cluster together on the site.

AR: Q2. What is your approach towards text mining the user profiles and messages? What kind of insights does it provide?

TL: One of the most interesting things about working for PlentyOfFish is getting to work with a huge amount of data on real people. People behave very differently when they know they are answering questions on a survey than when they are behaving naturally or describing themselves. In the case of the interests users put on their profiles, which is what I focused on for this project, there’s no system to game and no reason not to simply write what you’re actually interested in. This stands in sharp contrast to BuzzFeed-like questionnaires, where you might be trying to tailor your answers to get certain results. In our case, we can see how users and interests cluster together by things like location, gender or age. You can start to ask questions like: what do people do for fun in Texas vs. California? What group of people is most romantic? Nerdy? Hipster? Sometimes those results can surprise you.

AR: Q3. How do you use Topic Modeling and LDA to develop user archetypes? How does this help improve the matching algorithms?

TL: Topic modeling via LDA (latent Dirichlet allocation) was originally developed to cluster and self-categorize text documents, such as articles, in a way that a human being would. That idea applies just as well to users on a dating or social media site.

We’re using it as a form of feature reduction. For example, one user might list “skiing” as an interest, while others might enter things like “snowboarding” or “ski touring”. A human looking at that list would likely conclude that all of those people are into mountain-based winter sports or outdoor sports. LDA allows us to take all of the things a user can write for their interests and group them naturally into a much smaller number of categories, which can be used to search and match on.
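To make the feature-reduction idea concrete, here is a minimal sketch of LDA fitted with collapsed Gibbs sampling, a Monte Carlo method, using only the Python standard library. It is an illustration, not POF's implementation: the function name `lda_gibbs`, the toy interest lists, the hyperparameters `alpha` and `beta`, and the choice of two topics are all assumptions made for this example.

```python
import random

def lda_gibbs(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over lists of interest tokens.

    Hypothetical helper for illustration; hyperparameters are arbitrary.
    Returns (vocab, topic_word_counts, per_doc_topic_mixtures).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]        # tokens per (doc, topic)
    topic_word = [[0] * V for _ in range(num_topics)]   # tokens per (topic, word)
    topic_total = [0] * num_topics                      # tokens per topic
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(zs)

    # Resample each token's topic conditioned on all other assignments.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][wid] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta) / (topic_total[t] + V * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wid] += 1
                topic_total[k] += 1

    # Smoothed topic proportions per document: the reduced feature vector.
    mixtures = []
    for d, doc in enumerate(docs):
        denom = len(doc) + num_topics * alpha
        mixtures.append([(doc_topic[d][t] + alpha) / denom for t in range(num_topics)])
    return vocab, topic_word, mixtures

# Made-up interest lists; each "document" is one user's interests.
users = [
    ["skiing", "snowboarding", "hiking"],
    ["snowboarding", "ski touring", "skiing"],
    ["hiking", "skiing", "ski touring"],
    ["netflix", "movies", "video games"],
    ["video games", "netflix", "books"],
    ["books", "movies", "netflix"],
    ["skiing", "netflix", "books", "hiking"],
]
vocab, topic_word, mixtures = lda_gibbs(users, num_topics=2)
for interests, mix in zip(users, mixtures):
    print(interests, [round(p, 2) for p in mix])
```

Each user goes from a free-text list drawn from a nine-word vocabulary to a vector of two topic weights that sum to one. The exact weights depend on the seed and the toy data, so treat the printed numbers as illustrative only.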

There are a few properties of LDA that make it a particularly good choice for finding user archetypes. The first is that the topics, and which words have a high lift in them, are determined organically; that is, the actual users on the site determine them. This eliminates any bias that might be introduced. If every user who listed “snowboarding” also listed “puppies”, then those would be very likely to occur high in a topic together (spoiler alert: they don’t). The second property is that LDA is a mixture model. What that means is that each user does not end up as just one thing, but can be a mix of various topics with different weights. If you think about how people actually are, that’s a good description. For example, I enjoy outdoor sports, but also nerdy TV, books and video games. Simply listing that I’m into one of those things doesn’t capture me as a whole. This model correctly labels me as a mix of all of these things.
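As a rough illustration of what "high lift" means here, the sketch below computes lift as P(word | topic) divided by P(word) across all topics, so a value above 1 marks a word that is over-represented in a topic. The counts and the topic labels (`winter_sports`, `screen_time`) are invented for the example; real LDA topics are unlabeled, and this is only one common definition of lift, not necessarily the one POF uses.

```python
def word_lift(topic_word_counts):
    """Return lift[topic][word] = P(word | topic) / P(word overall)."""
    overall = {}
    for counts in topic_word_counts.values():
        for word, n in counts.items():
            overall[word] = overall.get(word, 0) + n
    grand_total = sum(overall.values())

    lift = {}
    for topic, counts in topic_word_counts.items():
        topic_total = sum(counts.values())
        lift[topic] = {
            word: (n / topic_total) / (overall[word] / grand_total)
            for word, n in counts.items()
        }
    return lift

# Hypothetical token counts per topic, standing in for a fitted model.
counts = {
    "winter_sports": {"skiing": 40, "snowboarding": 30, "movies": 10},
    "screen_time": {"movies": 50, "netflix": 60, "skiing": 10},
}
lift = word_lift(counts)
for topic, words in lift.items():
    ranked = sorted(words.items(), key=lambda kv: kv[1], reverse=True)
    print(topic, [(w, round(v, 2)) for w, v in ranked])
```

With these made-up counts, "skiing" has a lift of 2.0 in the first topic and well below 1 in the second, which is the sense in which a word "belongs" to a topic.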

We’re not using this model live yet, as we’re in the process of discussing implementations for it. It can be used as the basis for a matching algorithm on its own, as a factor in some of our other matching algorithms, or as a way of showing users similar to the one someone is viewing. I believe its strongest potential is in search, as we can allow our members to enter whatever they want their potential match to be interested in and show them thematic matches, e.g. typing in “skiing and Netflix” will get you matches interested in outdoor sports and TV/movies. If your readers want to see it, they should let us (or me) know.
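One way such a thematic search could work is sketched below, under stated assumptions: the `TOPIC_WORD` table stands in for a fitted model, the topic labels and user mixtures are made up, and the query-to-topic step is a simple heuristic (summing word probabilities per topic) rather than proper LDA inference. Users are then ranked by cosine similarity between their topic mixture and the query's.

```python
import math

# Hypothetical P(word | topic) values, standing in for a fitted LDA model.
TOPIC_WORD = {
    "outdoor_sports": {"skiing": 0.4, "snowboarding": 0.3, "hiking": 0.3},
    "tv_movies": {"netflix": 0.5, "movies": 0.4, "documentaries": 0.1},
    "food": {"cooking": 0.6, "wine": 0.4},
}
TOPICS = list(TOPIC_WORD)

def query_mixture(words, smoothing=1e-3):
    """Heuristically turn query words into a normalized topic-weight vector."""
    weights = [sum(TOPIC_WORD[t].get(w, smoothing) for w in words) for t in TOPICS]
    total = sum(weights)
    return [x / total for x in weights]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(words, user_mixtures, top_n=2):
    """Return the top_n user names whose topic mixture best matches the query."""
    q = query_mixture(words)
    ranked = sorted(user_mixtures, key=lambda u: cosine(q, user_mixtures[u]), reverse=True)
    return ranked[:top_n]

# Made-up users with weights over (outdoor_sports, tv_movies, food).
users = {
    "ana": [0.55, 0.40, 0.05],  # e.g. listed snowboarding and movies
    "ben": [0.05, 0.15, 0.80],  # mostly food
    "cat": [0.90, 0.05, 0.05],  # outdoors only
}
print(search(["skiing", "netflix"], users))
```

The point of matching on topic weights rather than raw words is that a user who never typed "skiing" or "Netflix" can still rank first if their interests fall in the same themes.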

AR: Q4. What are the biggest challenges in harnessing the tremendous power of Big Data available in the form of user content and activities on PlentyOfFish?

TL: Our challenges fall into three rough categories. The first is the technical challenge of data gathering and what we now call data engineering. Building out a system to measure every action users can take on a site at our scale is an extremely complex task. It’s made even more so because, on a dating site, users can interact with each other, so, much like on Facebook, network and social graph effects add another layer of complexity. Once that system is built, we have to store the data, and store it in a way that a Data Scientist like myself can query and work with. I work closely with the team building out this system, and I’ve done a lot of work making custom packages in things like R to handle some of the data.

The second set of challenges is about business goals. With all of that data, and all of the things we can do, how do I make the most impact? What’s the highest value? I’m currently the only Data Scientist on the research team (though our research developers do a lot of data science too), so choosing projects is not always an easy task.

The third challenge is educational or cultural. Coming from an academic science background, I have a lot of experience with, and faith in, the scientific method, rigorous statistics and a culture of testing and experimentation. That’s not necessarily the norm in the business world. A large component of my job is moving the culture as much as I can in that direction so we’re always making educated, informed and data-driven decisions.

We discuss interesting research on the state of romance in the US, how PlentyOfFish is managing competition, a personal journey from String Theory to Data Science, career advice and more.

By Anmol Rajpurohit (@hey_anmol), July 30, 2014.

Here is the second and final part of my interview with him:

Anmol Rajpurohit: Q5. How does PlentyOfFish differentiate itself from the competition such as Match, OkCupid and Lavalife?

Thomas Levi:

It’s free: This has always been one of our strongest differentiators, particularly in an industry in which some of our biggest competitors follow a subscription model. We’ve been able to acquire users at a faster rate because we offer a free service.

It offers selection: We’re the largest online dating site in the world, which means a large selection of potential matches for our users. Even when you filter your search down by location (if you live in a small town), or by religion (if you want to meet someone who shares your religious beliefs), you’ll still find a huge selection of users.

It’s always generated revenue: As one of the Google AdSense pioneers, we were the first to break the $1M/month milestone with Google. Years later, we built a proprietary self-serve online ad platform that is widely used by affiliate marketers and advertisers.

It’s a privately held company: Having achieved this level of success with no investment, Markus Frind remains the sole shareholder and helms one of the few independently held dating sites in the world. Today, unlike at any of its competitors, the founding CEO continues to lead the company’s day-to-day operations. With just 70 employees, PlentyOfFish is still in “startup” mode: the team can iterate quickly and effectively, and it continues to dominate in a space where the top competitors have TV marketing budgets exceeding 250 million dollars.

AR: Q6. A few months ago, PlentyOfFish released an interesting study on “Where do the most romantic US singles live?” on the occasion of Valentine’s Day. What were the key attributes in identifying the romance level of singles? Was it based on their profiles, on their activities on PlentyOfFish, or both?

TL: This study was based on the LDA interests algorithm mentioned above. When the model was being built out, we noticed that there were categories focused around romantic interests, e.g. “candlelight dinners” and “long walks on the beach”. Another researcher I work with had the clever idea that we could average over people’s archetypes based on location to determine which states had the highest membership in those categories. The result is that study. In the future we could look at a host of other things, including which states are good “matches” for each other or where a person should live based on their interests.
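The averaging step described in this answer can be sketched in a few lines. Everything below is hypothetical: the state labels are placeholders, the topic mixtures are made-up numbers, and treating index 0 as the "romantic" topic is an assumption for the example. It does not reproduce the study's data or results.

```python
from collections import defaultdict

def average_topic_by_state(users, topic_index):
    """Mean weight of one topic across the users in each state."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for state, mixture in users:
        totals[state] += mixture[topic_index]
        counts[state] += 1
    return {state: totals[state] / counts[state] for state in totals}

ROMANTIC = 0  # assumed position of the "romantic interests" topic

# (state, topic mixture) pairs; all values are invented for illustration.
users = [
    ("state_a", [0.50, 0.30, 0.20]),
    ("state_a", [0.30, 0.50, 0.20]),
    ("state_b", [0.10, 0.60, 0.30]),
    ("state_b", [0.20, 0.20, 0.60]),
    ("state_c", [0.25, 0.25, 0.50]),
]
averages = average_topic_by_state(users, ROMANTIC)
ranking = sorted(averages, key=averages.get, reverse=True)
print(ranking)
```

Because every user carries a weight for every topic, the same function can rank locations by any other archetype simply by changing `topic_index`.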

AR: Q7. You have a very interesting background. After your PhD in string theory and cosmology, how did you end up in a Data Scientist role at an online dating site? Do you see any connection between Data Science and String Theory?

TL: A large chunk of it was negative things having to do with academia and string theory. The job market for full-time, tenure-track faculty positions at elite research universities is pretty grim these days. In string theory there have been a few years where only two or three people were hired in that specialty in all of North America. The alternative, and where I was career-wise, was to continue doing postdoctoral fellowships, which are quite low-paying, two- to three-year contract positions. The work-life balance for me was way out of whack, and I found myself wanting more control over where I lived and some stability. It’s frowned on to talk about money, but I won’t pretend that wasn’t a factor. I’ve eaten a lot of ramen in my life.

I also found myself wanting to work on something more concrete and connected to the everyday world. String theory is amazing, and I loved spending my time thinking about how the universe began and the building blocks of space and time. That said, it’s not likely to produce a testable result in my lifetime (never say never though!). I wanted something where I could actually see the impact of my work on regular people. I thought a bit about finance, but tech just appealed to me more. I liked the idea of creating something, and in the case of PlentyOfFish, being able to look at how people interact and meet each other is incredibly interesting to me. I’m also a closet romantic, so it sounds corny, but I really like bringing people together, especially ones who otherwise wouldn’t have met.

There are a lot of connections between data science and string theory, though not at the obvious level. Yes, both involve math and analysis, but I haven’t gotten to the stage of doing algebraic topology to understand dating quite yet. I think the approach to asking questions and solving problems is very similar. For nearly every problem I’ve solved at PlentyOfFish, I had to go off and learn a lot of things. For the interest matching we’re talking about here, I taught myself LDA and some Monte Carlo techniques to solve it. In addition, I didn’t just hit on LDA; I spent a couple of months trying out different approaches and exploring various options, learning as I went. That has a lot in common with how I worked in physics. When I decided to move more towards cosmology, I had to teach myself modern cosmology and inflationary theory, and rapidly come to grips with the current state of the art. I also had to boil large questions down to something I could actually write equations for and make progress on. The same is true in data science.

AR: Q8. What is the best advice you have got in your career?

TL: The best advice I’ve received is to constantly make sure you’re happy doing what you’re doing. I decided at 16 to be a theoretical physicist. I waited until I was in my 30s to question if that was still what I wanted and whether it was making me happy.

It’s never too late to pursue what you’re passionate about or make a change, just be prepared to work for it.

Reading that over, it sounds a bit cliche, but I really do believe it.

AR: Q9. On a personal note, are there any good books that you’ve been reading lately and would like to recommend?

TL: On the technical side, I’m a big fan of “All of Statistics” by Larry Wasserman. It gives a great crash course in statistical thinking, with proofs and lots of examples for all of the key concepts. I also really like David Barber’s “Bayesian Reasoning and Machine Learning”, as I can’t emphasize enough how important conditional probability and Bayesian statistics are for my job. I couldn’t make a list of books for a Data Scientist and not throw in Tom Mitchell’s “Machine Learning”. It’s a classic. On the less technical side, I think every aspiring Data Scientist and just about anyone in business should read Nate Silver’s “The Signal and the Noise”. It’s a popular-audience-level book on how to understand probability, how to think about it, and what goes wrong if you don’t. It’s also a pretty entertaining read.

On the just-for-fun side of things, Patrick Rothfuss’s Kingkiller Chronicle is possibly the best series I’ve read, and the first book, “The Name of the Wind”, might just be the greatest novel I’ve ever read. I could say similar things about Ernie Cline’s “Ready Player One”.

Right now, I’m reading Michael Lewis’ new book “Flash Boys” about high frequency trading, and his earlier effort “The Big Short” is another great read.