Episode 7: Ask A Data Scientist About AI at Work
Rob May and Brooke Torres sit down with two members of Talla's Data Science team, Daniel Shank and Dhairya Dalal. This episode covers interesting trends in Data Science, how the Data Science team interacts with other departments, and ends with an "In Plain English" segment to explain data science terms in their simplest form.
Rob May, CEO and Co-Founder, Talla
Brooke Torres, Director of Marketing, Talla
Dhairya Dalal, Data Scientist, Talla
Daniel Shank, Data Scientist, Talla
Brooke Torres: Hi, everyone. Welcome to AI at Work where we talk about what's happening in the artificial intelligence space, the world of work, and how the two are transforming each other. I am Brooke Torres, and I'm one of your hosts for this podcast. I'm joined here with Rob May and a couple members of our data science team here at Talla.
Rob May: If you're new, I'm the CEO and co-founder of Talla, and co-host for AI at Work podcast. This podcast targets people who are looking to deploy AI in their organizations and interested in the trends of how AI affects work. So today, I am here with Dhairya Dalal and Daniel Shank, two data scientists here at Talla who've been hard at work putting AI into our own product. They're going to talk about some of the more interesting things that data scientists do, how data scientists interact with other departments, and some cool stuff like that. Welcome, guys.
Daniel Shank: Thanks for having us.
Dhairya Dalal: Yeah, thanks. Glad to be here.
RM: Let's get started by talking a little bit about your backgrounds. Talla is a relatively young company. Daniel, you were one of the first employees, and you've been here for a little while now. What did you guys do before Talla?
DS: Before Talla, I was actually working at an accelerator in Boston called Techstars. I gave on-demand data science help to the companies that went through there, so I was able to look at a lot of different kinds of data and do a lot of small projects.
DD: I've got a bit of a diverse background. Prior to Talla, I was with the Allen Institute for Brain Science, where I was helping them develop initiatives for building a knowledge graph for neuroscience research and question answering. Prior to that, I was with the Allen Institute for Artificial Intelligence working on their dialogue research, and before that, I was working with the Office of Institutional Research at Harvard, where I was providing strategic decision-making support for senior leaders across the university using quantitative analysis.
RM: People come to data science from a lot of different educational backgrounds. It's actually quite varied. Can each of you cover where you went to school and what you studied?
DS: I went to the University of Chicago and studied economics. My bachelors is actually in economics. It's actually a very mathematical discipline if you've been to a school that does that kind of thing.
It turns out when I was looking for jobs afterwards that a lot of the stuff I learned would just be turned on its head, and you just get machine learning. I mean, you do all this optimization and statistical reasoning, and you realize that, instead of trying to model decision-making from an economic standpoint, you can model it from the point of view of deciding what categories things belong to, and it's actually pretty similar. There's a lot of people from different disciplines. Dhairya, what did you study?
DD: I've got a weird background. I studied at the University of Rochester, where I was actually a double major. I studied creative writing and also created my own major of political science, philosophy, and economics, modeled on Oxford's PPE, and then I'm finishing up my master's at Harvard with a focus on reinforcement learning and dialogue.
RM: Awesome. You're going to have a chance to talk about reinforcement learning towards the end of the podcast.
BT: Let's talk a little more broadly about AI. It's this hot thing. What are some of the misconceptions that people tend to have about AI in general?
DD: I think one of the things that people tend to think about AI is that it's a monolithic thing, right? It's this one master or a singular way of thinking about solving any and all the world's problems. So people think of IBM Watson. That can basically do anything. Realistically, AI in production will be a mixture of engineering, data science pipelines, ensembles of methods, and computer science techniques. You may have 2, 20, or many different models that are classifying, predicting, and filtering information that eventually get coalesced into a singular business action.
RM: AI has made a lot of progress over the last five years, particularly with the resurgence of deep learning, despite being a sort of 50- or 60-year-old technology. But why is this so much better over the last few years? What happened?
DS: I think the major one is just that computers got a lot better, right? They got better in a couple of ways. One, we have a lot more storage capacity, so we can really use more of the information that people are producing every second of every day. The other is that we got better at doing particular kinds of computations. Without going into too much detail, the kind of math that's used in a lot of machine learning is done very well by graphics processing units, or GPUs, and those got a lot cheaper and a lot better.
When we realized that we can just take things that we're putting in video cards and turn them around and use them for machine-learning, we suddenly had the ability to take all of this data and actually do something with it. This sort of led to technical advances to the point where even without those advances in technology, I'd say that we're actually better at AI than we were five years ago. We wouldn't have gotten to the place we are now without that increase in technology and resources.
RM: It's really interesting because despite all the progress, a lot of the news we hear is still very high-level.
BT: In fact, a lot of the research that we see is very high-level. Obviously, it's gotten much better, but why is it, Daniel or Dhairya, that it isn't making it into products that we use today?
DD: That's a great question. I think a lot of research has two main issues. One is that research questions are very narrowly defined. They focus on very curated and tailored data sets, and a lot of the advances that you see are marginal improvements on those data sets. The second thing is that it often is very inefficient to apply some of these models at large scale.
What you end up having is something that's great for answering questions, but when you try training on a data set that's several thousand gigabytes of data, all of a sudden you go from taking a few hours of training to maybe a year to train a model. A lot of the advances for these methods haven't really caught up to the scale of the production use cases, and the kinds of data sets that we use to train on aren't really available in the business use cases that we would want to provide value for.
RM: One of the early theories behind Talla and part of the reason that we were founded as a company was this idea that there's a lot of work that you could automate in organizations if you could capture the data that's in people's heads as they do things, right? Because we don't have data sets to train these models on so many types of tasks that we'd like to do. And it's really interesting because things that you would think would be really easy, like how do you identify a question? Oh, you look for a question mark. Turned out to be a lot harder than you would think.
DS: Yeah. I'd definitely say that a lot of the time you see some sort of article that says, oh, AI can now read books and answer questions about them, and then you get really excited and you actually go and you pull down their GitHub repo and you start to try to get to work. Then you realize that, oh, actually, it works for that particular collection of articles or news articles. It works great on certain kinds of news articles and can answer questions about those, but in order to get it to work on something you care about, like in our case more business-related things, you would need to actually make your own data set, which is really expensive and very time consuming. It's been a little bit disappointing sometimes when you see some sort of magic thing on Wired and then you try to do it yourself and you realize it does exactly what it says in the article. You assume that it can do other stuff but it does only the thing that you see in the article. It's kind of funny.
DD: I think the other thing that we are picking up on is that the models do a really good job overfitting to that particular domain, to the data that they're trained on. Oftentimes, if you're thinking about this in a business context, they end up missing out on the institutional policy you have within a business, which varies from organization to organization. The model does have adaptability and generalizability issues when you try to, say, go from Wikipedia data to very specific IT or HR or even other verticals that you have in your organization.
RM: Let's talk a little bit about how data science is different than engineering. I was an engineering major and engineering is all about being taught to take a system and break it down into smaller parts until you understand how to solve the smallest part and then you start building it back up at every level. If I go to the engineering team and I ask them something about sharding a database or scaling some system, they can usually give me a reasonably accurate estimate of how long that might take. I don't get the same thing when I come into the data science room. It's actually very different. Why is that?
DS: I've actually spent a lot of time thinking about this because it's very frustrating. It's not just people coming to talk to us. We spend a lot of time banging our fists on the whiteboards for this reason. I think it's because when you're doing data science work and doing work with machine-learning, you hit a lot of really hard walls. So there's a common joke in engineering, which is that you say something is going to take one day, you try making it work, and you realize it's going to take three days.
In data science, it kind of goes like that except often you have something you want to do, you spend a day on it, and then you just realize, oh, the thing that I wanted to do didn't work. Then you don't know how much longer it's going to take. Basically, you only know that there's a long list of things you can try that you're pretty sure if you had infinite time to try them, it would work eventually. Once you've failed at one approach, it can give you a hint but often doesn't really tell you what do you do next? You just have to try again.
DD: One of the other challenges is that trying to reconcile the product's expectation for what the data science should be with what the data science itself is can be kind of hard. It is for two reasons, really. It's hard to conceptualize and capture all the surrounding information, the surrounding use cases, that are going into a product use case that someone might have coming to the data science team. Then it's hard to capture those in a meaningful way so that you deliver precisely that thing, because most of the models tend to be very stochastic.
You're not going to be getting direct and exact answers back. You will be getting answers within a probability range. Because there's a probability range, you will sometimes get things that don't make sense or that seem very idiosyncratic. A lot of the work that the data scientists do is setting expectations that align the product's vision of the feature with what the data science capabilities are.
RM: Is this something that you guys think about? Do you find yourself in a lot of situations where there are product features that you're asked to build around that you just can't deliver on? Maybe in order to work, they need a model that's 94% accurate and sometimes, given the data that you have and the current techniques, you can't hit 94%.
DS: Oh yeah, totally. All the time. I remember in the early days they'd ask can you have a machine-learning system that can just write questions for you. This is something we've always wanted to be able to do and it turns out that it works, but it sort of works. There's no 100% here. Everything always works 80% of the time or something like this. Then you really want it to get to 90 it turns out, or more like 95 for you just to be happy. And it just doesn't seem to be really plausible with the current technology.
DD: Oftentimes, I think the back-and-forth is, how do you constrain the problem so that you can actually deliver realistic results and provide value immediately? The jump from 85% to 95% might be very hard, but the jump from 85% to, say, 87% might be doable. If you prune out the extraneous cases that the model isn't capturing, you might be able to just deliver a more useful feature. Sometimes it's just negotiating where we draw the line in terms of the expectations of what should be delivered.
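The trade-off described here, answering fewer cases in order to be right more often, can be sketched in a few lines of Python. The confidence scores and outcomes below are invented for illustration, and `accuracy_at_threshold` is a hypothetical helper, not anything from Talla's product.

```python
# Hypothetical (confidence, was_correct) pairs for ten model predictions.
predictions = [
    (0.99, True), (0.95, True), (0.91, True), (0.88, True), (0.84, False),
    (0.80, True), (0.72, True), (0.65, False), (0.58, True), (0.51, False),
]


def accuracy_at_threshold(preds, threshold):
    """Return (coverage, accuracy) when only predictions whose confidence
    is at least `threshold` are shown to the user."""
    kept = [correct for confidence, correct in preds if confidence >= threshold]
    if not kept:
        return 0.0, None  # nothing answered, so accuracy is undefined
    coverage = len(kept) / len(preds)
    accuracy = sum(kept) / len(kept)
    return coverage, accuracy


for threshold in (0.5, 0.7, 0.85):
    coverage, accuracy = accuracy_at_threshold(predictions, threshold)
    print(f"threshold={threshold:.2f} coverage={coverage:.0%} accuracy={accuracy:.0%}")
```

On this toy data, raising the threshold shrinks the share of cases the feature answers while the accuracy on the answered cases goes up, which is the kind of line-drawing negotiation described above.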
RM: One of the interesting things we learned early on in this company was you can use a lot of UX and UI design to control user expectations, set some rails and some constraints to help improve the quality of the output by improving the quality of the input.
DS: Totally. It's sort of like a magician's trick. They ask you to pick a card, any card, but they've guided you down a path where they know exactly what card you're going to pick anyway. So you think they're a total genius.
DD: There's data science in science as well. A lot of the challenge going into doing prediction is that you're looking at a very large search space. If you can do smart things to reduce that search space down to something more manageable, your model is improving its accuracy.
BT: Switching gears a little bit. For our listeners, obviously there's a lot of uncertainty built into deploying data science models into products. How would you think about building out a team and managing it in data science, particularly given some of the differences between that and engineering?
DS: That's a really hard question. There are different sets of skills. You talk about the data munging or data processing abilities and we ask a lot of questions about that. But in terms of mindset? For data scientists, there's been more of this idea of you want to see how they-- I mean, it's the same for engineers, but particularly for data scientists, you give them a problem and you want to see how they set it up. Basically the expertise is all in understanding what models are good for particular kinds of problems.
It's very much like you want to see which tool they pull out of their bag of tricks, whereas engineering is more results-oriented. I give you a laptop, you write the code, if it gets the right answer, you're 90% of the way there. They'll argue about how your loops are inefficient, but it's still pretty good. For a data scientist, you can always get some kind of answer. So you've really got to grade them on, did you choose a model that was a good fit for this situation? It seems kind of fuzzy. It gets hard when you're like, “Yeah, that would have worked but I really just don't agree with your choice.” You have to make a call based on that.
DD: I'm all for a contrasting perspective, so I don't think there's a need to differentiate between, say, a data scientist and an engineer. I think a lot of times there's a lot of overlap that comes into play. So there's this mythical idea of the full-stack data scientist, right? Someone who has the ability to not only write and define models but also take those models, put them into production, and get them operationalized and scaled really well. You're seeing a lot of people coming out with these skill sets and getting better and better at it, especially because a lot of the data scientists who are coming out of PhD programs are working essentially on applying their newer models at scale.
A lot of this skill set coincides with the skills you'd want in production, but I'd say that to build a data science team, you really want three main things. One is the ability to communicate really well. Being able to take the complexities and the nuances of the models and explain them to a lay audience on the business side, the product side, as well as to the engineering side. Saying, hey, here are the things we need to do to be able to scale this to production quality. So there is a communication aspect.
The second aspect is just being able to go in and write the code in the environment that you have. You're seeing more and more data scientists coming out with really strong Python skills and, say, other programming skills in systems languages, so they're able to just write the code and put it into production immediately, or at least put up prototypes for the engineering team to then work with and put into production. Then the last thing I think is useful for building a data science team is having specific domain expertise. I think generalists are awesome and they're great to get up and running, but when you start wanting to build out specific product features, you want someone who has deep NLP experience or deep computer vision experience or deep optimization experience, because those specialties will help take a general product feature and deliver that differentiating value for whatever you're building.
RM: It kind of ties back to a couple of themes that we've talked about: how engineering is different than data science, how much of data science is research-y, how much of the news is about research and doesn't make it into products. It actually ties back to an experiment that we did here when Google's Neural Turing Machine paper came out, which was when our data science team started working to implement that in a production system. Correct me if I'm wrong, and you actually gave a talk a couple of times on this, Daniel. You guys can find the YouTube video if you want to look for it, but I think the conclusion was we couldn't get it to work. You want to talk a little bit about that?
DS: I will amend that. I will say we got it to work in the way that they wanted it to work for the paper. All the results in the paper, we can do that. But for production? It was like, no way. This is my turn and Dhairya's turn to complain about research code. One of the things about adapting research is that it barely works. We were talking about this before, but really, it barely works.
Under certain situations, running on a particular computer, it's going to do what you want. But then you move to someone else's computer and even small things, like the libraries you're using to do calculations having small differences out at the decimal places, are going to make a big difference. And that's just not acceptable for production, right? Especially with a lot of methods that are sensitive like that. You just can't rely on being able to have someone on hand. Every time it broke, you basically had to have a data scientist go in and mess around with it for half a day or something like that. It's not really reasonable. Dhairya, if you have any input on that.
DD: Research really is meant to prove out, experiment out, the concept that you're researching and writing about. So usually, scientists don't have in mind the idea that their models are going to be used in a production setting. That's usually a large part of the challenge you run into applying research to production. But the other thing is that usually you run into more domain-specific issues. The data sets they're using work because the models are basically optimizing for particular distributions or particular properties of the data set that might not exist in the data set you have in production or in your business use case. The models sometimes just have a hard time generalizing because they're not really good domain fits.
RM: Let's pause there and go into a section that we're going to call 'In Plain English', where we're going to take some terms that you might have heard in data science and we're going to try to explain them as best we can without using too much data science terminology. Which means if you are a data scientist and you are listening, please give us a little bit of creative license here to gloss over some of the nuances here and to try to convey the main points to the rest of our listeners. So, number one, talk about structured versus unstructured data.
DD: Structured data I think of as basically tabular data. It's the stuff you see in databases. Essentially, each column has a type, and you know what it is. It's an integer or numeric or a date, or whatever it is. Unstructured data is essentially just free-form text data that exists everywhere. Also, it doesn't have to necessarily be text. It could be information that's captured in an image or an infographic, it could be in a PDF, it can be in a PowerPoint. But basically, it's information that exists in the wild in its raw and unstructured format. Yeah. That's pretty much the difference.
RM: Gotcha. Would you say most of the techniques that we're using at Talla today are dealing with structured data or unstructured data?
DD: Bit of both, right? A large challenge of working with unstructured data is that your techniques have to be very exploratory. If you want to do some more structured prediction with it, you want to take your unstructured data and figure out a way to coerce it into a structured format. If you look at Talla's knowledge base, for example, we take the free-form text that you put into our editor, but behind the scenes we are actually imposing constraints and adding levels of annotation and other information so that we can take your text data and then also do more structured predictions and more structured things with it. Yeah, it's a mixture of both.
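As a rough illustration of coercing unstructured text into a structured format, the Python sketch below pulls typed rows out of a free-form note. The note, the `Deadline` record, and the regular expression are all assumptions made up for this example; they are not how Talla's knowledge base works.

```python
import re
from dataclasses import dataclass
from datetime import date


@dataclass
class Deadline:
    """One structured row: every field has a known type."""
    task: str
    due: date


# Unstructured input: free-form text as someone might type it (invented).
note = "Submit expense reports by 2024-03-15. Renew the VPN certificate by 2024-04-01."

# Hypothetical pattern: a capitalized phrase, the word "by", then an ISO date.
PATTERN = re.compile(r"(?P<task>[A-Z][A-Za-z ]+?) by (?P<due>\d{4}-\d{2}-\d{2})")


def extract_deadlines(text):
    """Turn free-form text into a list of typed Deadline rows."""
    return [
        Deadline(task=m.group("task"), due=date.fromisoformat(m.group("due")))
        for m in PATTERN.finditer(text)
    ]


for row in extract_deadlines(note):
    print(row)
```

A hand-written pattern like this only covers sentences phrased exactly this way, which is part of why working with unstructured data tends to be exploratory.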
BT: Let's go through generative versus discriminative models.
DS: That's actually one of my favorite questions. A discriminative model answers questions about data. You can show it a data point and it will answer some question for you. Classic example-- Is this a muffin or is this a dog? Okay, maybe it's not that classic but you can find it on the internet. And the generative model is actually really different. Instead of answering a question about data, it simulates the data itself. So you show it a data set and it will actually produce more examples that might have been drawn from the data set but weren't. You can use the generative model to answer questions, too. It takes a little bit more time.
You just have to generate enough data points until you can figure out the statistics on how many images that looked like this were dogs and how many of them were muffins. It's sort of an in-the-Matrix sort of view of, if I could just read you this question over and over again, I can see what the answer is and then give you a good answer. But the problem is that they often take a lot of time and data to train, because really, you're answering all possible questions about the data set, basically.
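The contrast can be made concrete with a toy Python sketch. It assumes each image has already been reduced to a single invented number, uses a learned cut-off as the discriminative model, and uses one Gaussian per class as the generative model; these are simplifications chosen for illustration, not the models discussed in the episode.

```python
import random
import statistics

# Hypothetical one-number "feature" per image (invented values).
muffins = [1.8, 2.1, 2.4, 1.9, 2.6, 2.2]
dogs = [7.5, 8.2, 7.9, 8.4, 7.7, 8.1]


def fit_threshold(low_class, high_class):
    """Discriminative: learn only the boundary that best separates the labels."""
    values = sorted(low_class + high_class)
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(t):
        return sum(x >= t for x in low_class) + sum(x < t for x in high_class)

    return min(candidates, key=errors)


def fit_gaussians(muffin_data, dog_data):
    """Generative: learn how each class's data is distributed."""
    return {
        "muffin": (statistics.mean(muffin_data), statistics.stdev(muffin_data)),
        "dog": (statistics.mean(dog_data), statistics.stdev(dog_data)),
    }


def classify_by_simulation(model, x, rng, n=2000, window=0.5):
    """Answer a question with the generative model by simulating data and
    counting how many simulated points near x came from each class.
    Drawing the same n per class assumes both classes are equally common."""
    counts = {}
    for label, (mu, sigma) in model.items():
        samples = (rng.gauss(mu, sigma) for _ in range(n))
        counts[label] = sum(abs(s - x) <= window for s in samples)
    return max(counts, key=counts.get)


threshold = fit_threshold(muffins, dogs)
model = fit_gaussians(muffins, dogs)
rng = random.Random(0)
print("discriminative:", "dog" if 2.5 >= threshold else "muffin")
print("generative:", classify_by_simulation(model, 2.5, rng))
```

The discriminative model stores a single cut-off, while the generative one stores enough to produce new muffin-like and dog-like numbers and has to simulate thousands of them to answer the same question, which mirrors the extra time mentioned above.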
BT: What about bugs versus errors?
DS: Basically, a bug is when you say that for software reasons, the model wasn't presented the actual data the user gave or the output was mangled or you had some other engineering failure. Not that it's the engineers' fault. A lot of that code is written by data scientists. But you have some sort of reason external to the model that causes this bug. But an error is when you say, well, the model gave you an answer, it’s just not the right answer.
This is a very unhelpful thing a data scientist might tell you. You say, what's wrong with the model? And they say, well, there's nothing wrong with the model, it's just wrong. Right? That's very common. For whatever reason, you don't have enough data, your problem is difficult, it's noisy, you're just off. You made a guess and you were wrong. You made a bet, you bet wrong. And that can be harder to fix than a bug. A bug usually has one point of failure or several key points of failure you can fix. When the model just has this higher rate of error, you have to go back to the drawing board. Figure out ways to improve your approach. It's a little bit more of a longer-term solution.
RM: Let's talk a little bit about reinforcement learning. What is that? How is it different from deep-learning and machine-learning?
DD: That's a great question, I think. When we think about deep-learning and machine-learning, a lot of that has to do with providing the algorithm lots and lots of labeled examples. Telling it, here are the examples of what I think is a cat or here are examples of what I think is a muffin, and having the algorithm try to figure out what the patterns are in the underlying data that allow it to get to that conclusion.
Reinforcement learning is a little bit different. The idea is that the model's training itself on something called a reward function. What it does is it says, hey, I have an idea of what is right based on this function. That's going to tell me: if it matches my example, I'll give it a positive value. If it doesn't match it, I'll give it a penalty or negative value. What the reinforcement agent does is it starts randomly walking through the space of actions, trying to understand which set of these actions will get it to the correct output. That's probably the thousand-foot overview. Essentially, the key takeaway is that with reinforcement learning, you have an agent that teaches itself how to learn a task and how to do a task.
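A minimal Python sketch of that loop follows, using tabular Q-learning as one common way to do it; the episode does not name a specific algorithm, so that choice is an assumption. The task (walk from position 0 to position 4), the reward values, and the learning settings are all invented for illustration.

```python
import random

GOAL = 4            # hypothetical task: walk from position 0 to position 4
ACTIONS = (-1, 1)   # step left or step right


def reward(state):
    """Reward function: a positive value at the goal, a small penalty elsewhere."""
    return 1.0 if state == GOAL else -0.1


def train(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Tabular Q-learning where the agent explores by acting at random."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(GOAL) for a in ACTIONS}
    for _episode in range(episodes):
        state = 0
        for _step in range(50):
            action = rng.choice(ACTIONS)              # random walk through actions
            nxt = min(max(state + action, 0), GOAL)   # stay inside the corridor
            best_next = 0.0 if nxt == GOAL else max(q[(nxt, a)] for a in ACTIONS)
            target = reward(nxt) + gamma * best_next
            q[(state, action)] += alpha * (target - q[(state, action)])
            state = nxt
            if state == GOAL:
                break
    return q


def greedy_policy(q):
    """For each non-goal position, pick the action with the highest learned value."""
    return [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(GOAL)]


print(greedy_policy(train()))
```

Nobody labels the correct move for any position here. The agent only sees rewards and penalties, and the list printed at the end is the step it has learned to prefer at each position.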
BT: We're going to have one more, which is, Daniel, do you want to talk about what overfitting is?
DS: Overfitting is this idea that you show a model a particular data set and it only has that data set to look at. It will learn how to answer questions that it sees in a way that doesn't generalize or doesn't apply to data that it hasn't seen. So it's really only a problem if you don't have all the data, which is almost always the situation you find yourself in. A good example of this, the classic example of this, is a model that just memorizes everything that it's seen.
If you give a model five data points, each with a particular answer, like you are predicting corporate earnings from geographic location and you've given it five different states, then it memorizes what the earnings are for each state. But then you give it a state that wasn't in the data set and it will predict one of the numbers that it saw. It won't actually interpolate or make any guesses. It can be a pretty big problem. What you wrestle with all the time in machine learning is trying to get your model to generalize, to learn something more universal from the data that it sees that applies to everything else.
As a note, this isn't always a problem if it turns out what you're trying to do is really easy. It's not actually this easy, but if you're trying to detect questions and you have a model that just learns to look for a question mark, well, that's really not totally accurate. But if you actually show it data from the real world, it's going to not do too badly. You might call it overfitting, but it's also found a simple rule that works for it. It could do better, but it's sticking with the answer that it's got.
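The memorizing model in this example can be written out directly. In the Python sketch below, the states, the earnings figures, and the `Memorizer` class are all made up for illustration.

```python
# Invented training data: earnings by state, in arbitrary units.
train = {"MA": 120.0, "NY": 150.0, "CA": 200.0, "TX": 90.0, "WA": 110.0}
# A state the model never saw during training.
held_out = {"OR": 105.0}


class Memorizer:
    """A 'model' that just stores every training example it was shown."""

    def fit(self, examples):
        self.table = dict(examples)
        # For anything unseen, it can only repeat a number it already saw.
        self.fallback = next(iter(self.table.values()))
        return self

    def predict(self, state):
        return self.table.get(state, self.fallback)


def mean_abs_error(model, examples):
    """Average absolute gap between predictions and true values."""
    return sum(abs(model.predict(k) - v) for k, v in examples.items()) / len(examples)


model = Memorizer().fit(train)
print("error on training states:", mean_abs_error(model, train))
print("error on unseen state:", mean_abs_error(model, held_out))
```

The error on the training states is zero because every answer was memorized, while the unseen state gets whatever number the table happens to fall back on. That gap between seen and unseen data is what overfitting looks like in practice.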
RM: If you do that on chat data, for example, then you miss those questions where somebody just types, "seriously?" question mark. It's not really a question. It kind of is but...
RM: Well, thank you guys for that. If you're listening and you have data science terms or AI terms that you would like us to cover for the next time we have one of these In Plain English segments, please send them to firstname.lastname@example.org. We'll queue those up and when we have a few, we'll do another one of these with our data scientists. Also, if you're in the market for a knowledge base you should definitely check out Talla's Smart Knowledge Base. It's really designed for sales, success, and support teams. Any team that's going to be customer facing. It is packed with all kinds of cool AI goodness built by some of the guys who are on the podcast today. We hope you'll check it out. Thank you guys for joining us and we'll see everybody next week.
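The question-mark rule and its blind spots are easy to show in a few lines of Python. The function name `looks_like_question` and the sample chat messages below are invented for illustration.

```python
def looks_like_question(message: str) -> bool:
    """The simple rule: a message is a question if it ends with a question mark."""
    return message.strip().endswith("?")


# Invented chat messages.
samples = [
    "How do I reset my password?",               # a real question, caught
    "seriously?",                                # flagged, though arguably not a question
    "I was wondering how to reset my password",  # a real request, missed
]

for text in samples:
    print(f"{looks_like_question(text)!s:5} {text}")
```

The rule handles the ordinary case well, which is why it does not do too badly on real data, but it has no way to tell a reaction like "seriously?" from a genuine question, and it misses requests phrased without a question mark.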