Episode 40: All About NLP with Steve Cohen
In this episode of AI at Work, you will learn all about NLP with Steve Cohen, COO and Co-Founder at Basis Technology.
Rob May, CEO and Co-Founder,
Steve Cohen, COO and Co-Founder,
Rob May: Hi everyone, and welcome to the latest edition of AI At Work. I'm Rob May, the co-founder and CEO at Talla. And my guest today on the program is Steve Cohen. He is the chief operating officer and co-founder at Basis Technology. Steve, welcome to the program and why don't you tell us a little bit about Basis?
Steve Cohen: Sure, thanks. Well, thanks for having me here today. So Basis is in the business, primarily, of selling or licensing natural language processing capabilities and components. We sell a lot to the OEM space, with some amount of end-user or enterprise-user sales.
RM: Very cool. Tell me the founding story of Basis. How did you get interested in NLP?
SC: The founding story of Basis takes more time than I think you really want for your podcast. I'll try to edit it down a lot. We actually started as a pure services company. So Carl Hoffman is our co-founder and CEO of the company. He had done a lot of work in software and had grown up in Japan. So, we knew a lot about the Japanese language.
I had also spent a year living in Japan in college, and graduated with a degree in electrical engineering, which of course meant that I was going to do software. And so we both wound up working, using this combination of language and software in a particular way. This was 24 years ago, so circa 1995.
That way was software internationalization. It was pure services, and we were internationalizing various companies' software so that it could work with those so-called double-byte characters (a term that's inappropriate now, and that I hate, but it still exists), meaning Japanese characters.
We migrated to do Chinese and Korean a bit, and then Unicode. You might say that Unicode put us out of business, although it took a long time. But the transition to becoming a software and linguistics technology firm happened when one of our customers, a company called Lycos, one of the first search engines that was really focused on international support, had us do some work to support Unicode. When they were done and the Unicode support worked, they said, great. Now we want to use that. We want to have a Japanese language index.
The problem is that Japanese doesn't put spaces between words. It's a fundamental problem of Japanese for processing by a computer. We happened to have a capability. We had some software that did that. That got us, circa 1997, into the business of producing linguistic components. I hope that was short enough.
I could tell the longer version of that too.
RM: That's pretty interesting. So you've been in NLP for a very, very long time compared to most people, and the structure has changed quite a bit. Why don't you talk a little bit about how NLP is different from AI?
SC: Without opening up that can of worms, which is the question of what AI is and what exactly constitutes AI, I have a really simple answer to this. NLP is one of the capabilities contained within the broad area of what AI is. And so what is-- I'll go back. What is AI?
AI is computers doing things that we thought only people could do. Very broadly, it's the cognitive capabilities that we thought only people had. The implementation, of course, is just really sophisticated pattern matching, but let's not go there right now, right? And natural language processing is one of the things that we expect people to be able to do, and we're impressed when computers are able to do it.
RM: Talk a little bit about how you did things circa 1997. Did you have to design a lot of things by hand, you know, grammars and parsers and whatever, versus how it's done today? And have the team structure and the types of people you hire changed? Do you still hire a lot of linguists?
SC: That's a really good question, because that exactly has changed over the years. Natural language processing, or what the US government likes to call human language technology-- they're kind of interchangeable terms-- changed dramatically in that 20 year, in that two decade period.
Because before we had a lot of, what we now call, AI tools or machine learning or statistical modeling, the belief through the late '90's, or mid to late '90's, was the way to do it was hire experts who understood language, exactly hiring linguists, or people familiar with the language, have them construct dictionaries of data and rules describing the relationships between words and build up complicated rules-based architectures. And we did that.
Our first capability, like that Japanese segmenter I described, had a dictionary of Japanese words and some metadata about the words. It would basically scan through the text looking to see if it had a possible match for any of those words, and it had a bunch of rules to decide which match was good and which was the wrong match.
That was classic rules-based NLP. That is almost completely gone now, here in 2019, replaced by what is to my mind still rules-based NLP, where rules-based in 2019 means a machine figuring out what those rules are without the use of a linguist to decide what they should be. That's machine learning.
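Illustration (not from the episode): the dictionary-and-rules segmenter described above can be sketched as a greedy longest-match scan. The tiny lexicon, the function name `segment`, and the single rule used here (prefer the longest dictionary match) are assumptions made for this example, not Basis Technology's actual implementation.

```python
# Toy lexicon (hypothetical); a real dictionary holds far more entries plus metadata.
LEXICON = {"東京", "東京都", "京都", "都", "に", "住む"}

def segment(text, lexicon=LEXICON):
    """Split unspaced text into words by greedy longest dictionary match."""
    max_len = max(len(word) for word in lexicon)
    words = []
    i = 0
    while i < len(text):
        # Rule: try the longest candidate first, then progressively shorter ones.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in lexicon:
                break
        else:
            # No dictionary entry starts here: emit the single character as-is.
            candidate = text[i]
        words.append(candidate)
        i += len(candidate)
    return words

print(segment("東京都に住む"))
```

The input contains several overlapping dictionary words (東京, 東京都, 京都, 都), which is exactly the kind of ambiguity the hand-written rules had to resolve. A single longest-match rule is the simplest possible choice and will pick the wrong split in many real sentences.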
RM: So you've had to reinvent your core technologies-- what, like five times in the history of the company?
SC: That's almost exactly five times. You hit the number right on the head. And so to your other question, how has that changed our hiring through the years, it is very-- we hire what a lot of companies hire nowadays, we hire people familiar with the machine learning tools. We do need linguists, because in order to figure out what are the features, and to assemble the right training data, you really do need linguists. You need them in a different way, and you need fewer of them.
What you really need are hard-core software engineers and data scientists, and that tends to be what we hire nowadays.
RM: Interesting. Now the application that most people are going to know is machine translation. So I'm going to type a German sentence into Google Translate or some similar tool, and I'm going to get a French sentence out. What are some of the other end-user applications that people use NLP for?
SC: A quick comment on that: we still to this day find many people who think that the only thing to do with text that's not in English is to turn it into English and then do something with it, which I find a little bit sad. We have been doing in situ, or native-language, analytics since day one. And for the most part, it's correct.
Certainly if the audience isn't native English speaking, there's no point in doing MT. So what are these other capabilities? Well, what I like to describe as the simplest one is figuring out what language the text is in. We have that capability. There are a lot of other similar tools out there, including open source, that can do a really good job of just taking a block of text (typically you need at least 20 or 30 characters, but often there's a lot more) and saying what the language is.
Seems really simple, but it's a fundamental precursor to some of the downstream processing, because with any of these tools, you have a specific model or a data set per language. And so, if you don't know what the language is, you don't know which model to apply, and at worst, you're going to get bad results. At best, you're going to apply multiple models and spend a lot of processing time analyzing it. That's the simple one.
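Illustration (not from the episode): a minimal sketch of language identification used as a routing step in front of per-language models. The stopword lists, the `MODELS` table, and the function names are hypothetical. Production identifiers typically rely on character n-gram statistics rather than the simple script and stopword checks used here.

```python
import unicodedata

# Hypothetical stopword lists, kept tiny for the example.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to"},
    "de": {"der", "und", "ist", "das", "nicht"},
    "fr": {"le", "et", "est", "les", "des"},
}

def identify_language(text):
    """Guess a language code from script first, then from stopword overlap."""
    for ch in text:
        name = unicodedata.name(ch, "")
        # Kana is treated as a strong signal for Japanese, Hangul for Korean.
        if name.startswith(("HIRAGANA", "KATAKANA")):
            return "ja"
        if name.startswith("HANGUL"):
            return "ko"
    tokens = set(text.lower().split())
    scores = {lang: len(tokens & words) for lang, words in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

# One model per language; the strings stand in for real model objects.
MODELS = {"en": "english-model", "de": "german-model", "ja": "japanese-model"}

def pick_model(text):
    """Route text to the model for its detected language, or None if there is none."""
    return MODELS.get(identify_language(text))

print(pick_model("Das ist nicht der Hund und die Katze"))
```

Returning `None` for an unknown or unsupported language makes the failure explicit, instead of silently applying the wrong model or running every model in turn.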
What we find a lot of companies are interested in nowadays is extraction. And I'm using-- I've used that word very carefully, because there's different types of extraction or identification that people want. The most common one is entities, or often called named entity recognition, or named entity extraction.
This is finding the names of things that have names in text, typically people, organizations, locations, companies-- very common. There's a lot of use cases for that. But the extraction or identification could also be sentiment. It could also be events-- something happened-- or relationships. So and so is someone else's boss.
RM: We were talking a little bit before the show started here about the fact that, historically, if you want to do sentiment analysis versus extraction versus machine translation, you might use different types of approaches.
For example, there's a lot of people that got very excited about question answering: oh, you can use these sequence-to-sequence models, and you just train them up on a bunch of questions and answers, and it's all great and easy. I don't know that anybody has actually gotten it to work in any kind of production system.
Where is this going? Are we moving towards a world where one technique is more likely to work for all of these things, or do you see those staying similar for the time being and needing slightly different technical approaches, different types of people, different types of neural network architectures? What's the state of the world there?
SC: Well, I don't know about different types of people, but certainly different types of architectures. Different training data is also relevant. So we haven't seen that any single architecture works. And we have several different approaches, ranging from MEMMs to neural nets in our technology, and find that, again, there's no silver bullet. There is no panacea that solves all the problems.
For instance, our recent experiments with neural nets have shown us some improvement in certain languages that may be due to the data. With some languages we get a notable improvement in accuracy on, say, entity extraction, but at a significant cost of being 1/10 the speed. So we are prepared to work with our customers to deploy whichever one is more appropriate. You get pretty good accuracy much faster, or much better accuracy much slower.
We're going to continue to work on all these technologies. Will something come along in the future that is the silver bullet? I'm not going to say no. That would be foolish. But we find a lot of different approaches are required and it really depends upon the specific problem you're trying to solve.
RM: You guys have worked in a lot of national security-- I've heard some of those stories-- finance, and technology. Are there really interesting NLP stories you can share?
SC: There's one interesting story, and it's interesting because it affects something that you know about, though maybe you didn't know why you have this problem. And the problem is carrying liquids greater than three ounces on planes. You may or may not know the story of a liquid bomb plot.
And that goes back to, if I recall correctly, 2006, when a group of terrorists in the UK, in the London area I believe, decided that they were going to bring down seven planes traveling over the Atlantic on their way to the US. And the authorities in the UK had heard something. They got some indication of this and immediately set about looking at intercepted signals, along with their counterparts in the US.
This is where NLP comes into play. By using natural language processing to analyze the intercepted information, they were able to discover the shape of the plot, find out who was involved, and prevent it. This is a good story. You read in the papers about all the things that go wrong. Occasionally, things go right, and this was one of those cases.
They were discovered. The bomb plot didn't happen. The seven planes did not crash into the Atlantic. And the net effect is that, because of NLP, you now can't carry more than three ounces of anything on a plane, but for a good reason.
RM: Yes, interesting. And was Basis Technology involved in that?
SC: Our technology was used on both sides for that. That's about as deep as I can go.
RM: Gotcha. You know, I think your average sort of vice president of whatever department, who's at a big company today, who's thinking about AI, is probably thinking a lot more about things like predictive analytics, and they, I think, have maybe a sort of business intelligence kind of view of AI. Do you think your average company has lots of applications for NLP that they haven't explored, if they were to sort of dig in and take a look?
SC: Every single day we discover yet another unique twist on how companies could benefit from data analytics that includes finding information in text. I'm putting it into that broader context. Yes, NLP absolutely is part of what they can do.
Where does this come about? Pretty much any business, any size, now can benefit, has a capability and the need to benefit from using data analytics. The data is out there. Everyone wants to use it. The question is how do they?
For example, I mentioned sentiment much earlier. That's a very common, easy-to-understand capability: to be able to understand the sentiment of people, either who are your customers or who might be your customers or who are in your space, and use that as a tool to understand their interest or lack of interest in a particular product or a particular brand. So that's what I would call the first-order type of analytics.
If you have more information or more time available, you can go deeper and try to construct some sort of, say, a knowledge graph. That's one of the things a lot of our customers do: by building a knowledge graph, a connected graph of the entities that they deal with, their customers or their suppliers, they try to understand where their risks are. So supply-chain risk is something you might not have thought of as a deployment use case for natural language processing, but it's there because a lot of the documents are written documents. Often, they are printed documents. They have to be OCRed.
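Illustration (not from the episode): a minimal sketch of how extracted relationships could be assembled into a graph and queried for supply-chain exposure. The company names, the `buys_from` relation, and both function names are hypothetical. A real pipeline would produce the triples from entity and relationship extraction over OCRed documents.

```python
from collections import defaultdict, deque

# Hypothetical (subject, relation, object) triples from an extraction step.
TRIPLES = [
    ("Acme Corp", "buys_from", "Bolt Supply"),
    ("Bolt Supply", "buys_from", "Ore Mining Ltd"),
    ("Acme Corp", "buys_from", "Paint Co"),
]

def build_graph(triples):
    """Map each entity to the set of entities it depends on."""
    graph = defaultdict(set)
    for subject, _relation, obj in triples:
        graph[subject].add(obj)
    return graph

def upstream_dependencies(graph, start):
    """Return every entity reachable from `start`, using breadth-first search."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen

graph = build_graph(TRIPLES)
print(sorted(upstream_dependencies(graph, "Acme Corp")))
```

The traversal surfaces indirect suppliers such as the hypothetical "Ore Mining Ltd", which never appears in a document alongside "Acme Corp" but still represents exposure.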
RM: Pretty interesting. So, where do you think this technology is going? Like are there other-- are we on the cusp of major, major breakthroughs? Are we inching forward slowly, but surely?
We talked earlier about machine vision: people started working on it in the 1950's, and by 2010, we were like 65% accurate at identifying objects in images, or whatever. Then, in the 2010-2012 time frame, it just took off and got really, really good. And by 2015, basic image classification was as good as humans. There are still many machine vision problems to solve, but basic image classification has really done well. Where are we for NLP? Are we near a breakthrough or not?
SC: I don't think it's as far along. That's a funny thing to say, because you think of images as being, in many ways, more complex and having more dimensions than text would. There's certainly a lot more values available than number of characters, even if you go to the Unicode Chinese characters. There's a lot of information.
Yet, the accuracy across the board for some things that people are pretty good at, like I talked about event detection, they're not there yet. For a lot of the more complex NLP tasks, we're pretty happy if we can get accuracy into the mid 70's or low 80's. Some things are much higher. But certainly, a number of them aren't. There's a way to go.
What we obviously tell our customers in the business is that once you get above a certain threshold, it starts to be useful, even if it's not perfect. We try to set expectations, and we find a lot of customers get it now. I was describing just how many new use cases we keep finding, and the fact that there's so much data available means that if you at least have some notion of confidence on the tagging, on the analytics, then you can have enough information to make a decision.
RM: When you talk about your customers, how are you interfacing to organizations, and what changes are you seeing there? Do most of these people have a head of machine learning, or a head of NLP? Do they have an NLP person? Or, are you just dealing with the head of finance, the head of engineering? Is it all over the map, or is it starting to consolidate?
SC: That's a good question. It's kind of still all over the map. We have different classes of customers with different levels of-- I don't want to say sophistication, because they're all smart in what they do, but maybe sophistication within this area. In a lot of our OEM sales, we're dealing with companies that are very savvy, have their own NLP people or very strong data science teams, and are looking to accelerate what they might be able to do themselves by licensing from us capabilities that we've already built out and have specialized in.
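Illustration (not from the episode): one simple way to act on a confidence score is to accept tags above a threshold and route the rest to a person. The record layout, the 0.8 threshold, the scores, and the name `triage` are assumptions made for this example.

```python
def triage(tags, threshold=0.8):
    """Split tagged items into auto-accepted and needs-review lists."""
    accepted = [t for t in tags if t["confidence"] >= threshold]
    review = [t for t in tags if t["confidence"] < threshold]
    return accepted, review

# Hypothetical extraction output with made-up confidence scores.
tags = [
    {"text": "Lycos", "type": "ORGANIZATION", "confidence": 0.93},
    {"text": "Basis", "type": "ORGANIZATION", "confidence": 0.55},
]
accepted, review = triage(tags)
print(len(accepted), len(review))
```

Where the threshold sits is a business decision: lowering it accepts more tags automatically at the price of more errors.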
There's a lot of companies that aren't in the OEM category, that are the enterprise users, like say, in the finance area, where they may have data scientists, but those data scientists have focused on building models around their structured information. And so we can come in and have an intelligent conversation with the data scientists, but they don't have the NLP in-house, and they, again, just want us to be able to solve those problems.
You asked the question, how has it changed over the years? It's changed and it hasn't changed. 20 years ago, 15 years ago, we also had a wide variety of types of customers. The population of types has changed, but in both cases, 15 years ago and now, there's still the wide variety.
RM: I think a lot of companies are trying to figure out where AI lives in their companies. In the guests we have on here, we see a lot of different configurations: some people embedded in business lines and product teams, some people have a centralized AI data science unit. Lots of people are trying to figure it out.
SC: I will say that our focus now is on talking about the business problems. We like to say we're an AI company. I believe that's absolutely true. But we're not an AI-first company. We're not coming in saying, we're going to use AI to solve your problems. We come in saying, we can discover key information in data you have. Oh, by the way, we use AI.
RM: That makes a lot of sense. Is there any technical problem that maybe you guys aren't working on that you wish somebody would solve to make your life easier for the NLP stuff that you do, something upstream or downstream?
SC: Yes, there's an upstream problem that I believe we have encountered constantly over almost the entire 20 years, or a large part of it, which is dealing with scanned documents and PDFs. I don't know if you've ever looked at the structure of a PDF document. It was designed to make it easy to print. It was not designed to make it easy to understand what's going on inside and reverse engineer the structure, do scanned document segmentation, and what have you.
Given that there's still a lot of paper, I talked about supply chain. Every once in a while, someone comes to us with shipping manifests as a problem they'd like to do some analytics on. And our answer is, if you could get good segmentation and OCR of the shipping manifest, we could do a good job. We have yet to find that problem solved.
RM: Really interesting.
SC: How many years ago were people talking about, we're moving away, and we'll have a paperless office? It just hasn't happened.
RM: We still do it, right? I still print. Like when I go to board planes, I print out my plane tickets, because I'm like, what if my phone dies?
SC: I'm just now switching away from thinking that way. It's tough. My daughter's trying to teach me to use the Apple wallet.
RM: Yeah, but you can get so dependent on that, and if it shuts down, you're entirely screwed. That's really interesting. We dealt with that a lot. When people sign on to Talla, the biggest challenge we have is their support documentation. If it's in Zendesk or Salesforce or someplace that's HTML, we can just suck it in. That's super easy, and then we can do our own NLP on that. When it's in PDFs, it may or may not be easy. It depends on many other factors.
SC: It's an absolute mess. And there's other areas that I'm blanking on. There's a couple that have bothered me, where no one's come up with some sort of JSON or just straight-out XML sort of format to manage. I find that in finance, there's some really good formats in use, but there's things that aren't.
For instance, the shipping manifest I just described. Why is it that the international shipping agencies haven't declared a standard JSON format?
RM: I worked at a company that was an early Bluetooth standards bearer, and that was a long, hard process. Interesting. And so, as somebody who has worked on a bunch of areas that have touched AI for a long time, how do you feel about some of the concerns of the last just two or three years, really of this deep-learning wave around AI ethics, bias in models, killer robots, and all that kind of stuff? Do you raise your head up and pay attention to any of this? Do you think any of that's a real concern, or are you just sort of heads down, doing your natural language processing?
SC: A little bit of both. I do have thoughts, and I'm on a panel about AI ethics in a couple of weeks, so I should probably start thinking more about it. The answer is with almost any new technology, in my humble opinion, they're all true. All these concerns are both valid and invalid at the same time. There are aspects of this technology that we believe can be beneficial to things like privacy and personal security.
For instance, there's a number of processes in finance where people read your documents, and do you feel good knowing that someone is eyeballing all of your personal financial documents? Might you feel better thinking that actually it's a computer which is only going to forget-- really will forget things, assuming it's programmed honestly, will forget things, and only kick out problems when it sees them, as opposed to maybe they keep a little pad of paper and write down things and sell them to their friends?
You can look at this idea that a computer is scanning your information from both sides. That said, as the AI is doing a much better job in a lot of realms of emulating, or perhaps even implementing human cognitive tasks, we really should think about what's left for humans to do once those cognitive tasks are replaced by computers.
In the past, we've certainly said, and Marc Andreessen was all over this while he was still on Twitter-- and we're saying, remember when cars came along and people said oh, this is terrible. Horses were put out of business and that wasn't really the problem, because there was always something new for those people to move to. Well, what's left? Have we climbed to, are we climbing to the topmost rung?
RM: Yeah, it's interesting, because I always get that question on panels, too: is AI going to take all our jobs? I think you definitely have a decade or two. At least for the next 10 years, I think AI's really going to be a massive augmentation tool for humans. We're going to see a lot of productivity. It'll displace some people.
But, yeah, at some point, AI is better at everything. And I wonder what will happen at that point. To go back to your horse example, 200 years ago, I guess everybody had a horse. Now you have to be pretty wealthy to own and keep a horse.
SC: It's kind of switched.
RM: Similar to 200 years ago, everybody made their own clothes, made their own furniture. Now it's all mass-produced, and it's cheap. And so, if you can afford a hand-sewn shirt, or something, we'll just increase the artisan movement, because humans are always interested in their status versus other humans.
When the machines do everything, it'll be a sign of status that you can still have humans do stuff for you, maybe. I don't know. It's one way the world could work out.
SC: That's a positive spin, and I think you're right. Again, I believe there is a valid concern that we have a lot of people in this world nowadays, and what are they going to do? Because if people don't have something to do, they get bored, and they start fighting with each other, and doing bad things, because they're otherwise bored. That's as deep as I'll go into sociology.
But we don't have an answer for that. The positive science fiction writers of the '50s posited a future where machines would do all the work and we wouldn't have to work anymore. So why did George Jetson have a job? They had robots.
RM: There's a really interesting book I recommend to people sometimes. I don't read a lot of science fiction, but there's a book called Manna, M-A-N-N-A, and it looks at the emergence of AI through two lenses. It's like the United States and Australia, I think are the countries, and they kind of go in different directions with it, and they end up in different places. I won't spoil it. You can read the book.
But one of the things that's interesting is in that world, you have a bunch of people who opt into living in the real world, and a bunch of people who just live virtually in a world where they can be whoever they want to be and do whatever they want to do all the time, and they're very plugged into that. And the world segments into those two types of people.
SC: So people elect to be in the matrix.
RM: Basically, yeah.
SC: Some people like the matrix, even though they know it's made up, and it's not real.
RM: This is true. Making air quotes here on real. We can have a whole podcast show debating what real means, but it's not a philosophy program. Good. Well, so with that, we'll wrap up. Steve Cohen, thanks for being on. If people want to learn more about Basis, what's the URL for your website?
SC: BasisTech.com. Take a look.
RM: Great. And if there are guests you'd like us to have on the podcast, or questions you'd like us to ask, topics you want us to cover, please send those to podcast at Talla.com. We'll see you all next week. Thanks for listening.
Subscribe to AI at Work on iTunes, Google Play, Stitcher, SoundCloud or Spotify and share with your network! If you have feedback or questions, we'd love to hear from you at firstname.lastname@example.org or tweet at us @talllainc.