Recorded 8th October, 2024
Rich Hale, CTO, ActiveNav | Colin Fowle, MD, Blue Car Technologies | Daniel Ibrahim, Strategic Account Manager, Ascertus
_____
In this roundtable webinar we discussed:
*ActiveNav analysis
Transcription:
David Simpson
Very good afternoon to you and welcome to the Ascertus webinar today, where we're going to be looking at dark data. I'm joined by some guest speakers today who will introduce themselves in a second.
Just a couple of housekeeping bits. As we go through the webinar today, please do use the Q&A section of the Teams platform.
Ask whatever you want, and we'll address questions towards the end. You can make your question private if you don't want to broadcast your name. Feel free to ask as many questions as you want.
There will be some audience participation and I think we'll kick off with that in just a minute where we are going to be asking a couple of questions of you, so please do take the opportunity to take part in those poll questions.
Just to introduce myself first.
My name is David Simpson.
I'm a strategic account manager here at Ascertus.
Let me firstly introduce Daniel Ibrahim. Daniel.
Daniel Ibrahim
Thank you very much, Dave.
Good afternoon, everybody.
My name is Danny Ibrahim.
I'm also a strategic account manager at Ascertus, and I've been in legal for the last 20 years or so, looking after everything from computer forensics all the way through to document management.
So look forward to the discussion today.
David Simpson
Thanks, Daniel. Our first external speaker that I'd like to introduce is Rich Hale. Rich.
Rich Hale
Thanks Dave.
Yeah. Hello, everybody.
My name is Rich Hale.
I am Chief technology officer at ActiveNav. ActiveNav is a global cloud software provider for unstructured data discovery used for compliance and governance across a wide range of industries.
And we have a range of experience in big and small problems, the largest being dealing with the breach at Equifax and the subsequent clean-up.
I've been engaged in this domain, this activity of managing unstructured data for about 20 years now.
And I'm really looking forward to the chat today and the questions that come from it.
David Simpson
Thanks, Rich. And last but not least, someone that I think a lot of you might know already, but I'll allow you to introduce yourself, Mr. Fowle.
Colin Fowle
Thank you, David. Thanks for that.
My name is Colin Fowle. I'm the managing director of Blue Car Technologies.
We're a legal software technology company. I've been working in legal for probably the last 25 years.
We work with all the major DM vendors, enterprise legal applications.
We build software, bespoke applications, as well as providing support services and doing an awful lot of work with data across DM platforms as well.
That's me. Thank you.
David Simpson
Thanks Colin, thanks gentlemen.
So my job for the day, for everyone that's on the call, is to try and keep some order and decorum in our conversation.
That may be easier said than done, but let's see how this pans out.
I mentioned we're going to be asking you guys some questions, so I think we're going to kick off with the first one of those.
We're here really today to talk about dark data.
Just so that you know a bit of the agenda: we'll be asking you a question in a minute, so have a think about it. We're going to be asking how you perceive dark data and whether you think there is an issue in the organisation that you represent. We'll then go on to discuss the basics of dark data.
What is it?
What does it mean to you?
Then we'll look at the nature and shape and size of the dark data problem within our industry.
I think everyone would appreciate that there is a problem.
I think it's at varying levels, and we'll dive deeper into that as we go through.
And then towards the end, we'll be looking at what we perceive is the fix to that dark data problem.
And looking at business objectives and desired outcomes. As I said, we will have time for questions at the end of the session, so please do keep them coming in. The session is being recorded, by the way, and you'll receive a link to that recording in due course.
Let's kick off, so hopefully this is going to run smoothly.
You should start to see a poll on your screen.
So do you perceive there to be a dark data issue within your organisation?
So we'll give you a few seconds just to vote on that.
Hopefully I think everyone probably thinks there is to varying levels.
Otherwise, I'm not sure how much you'll get out of the day, or maybe you can teach us a thing or two.
So hopefully we've got a lot of responses to that.
Mr. Bateman, if you want to put the slide deck up when you're ready.
OK, perfect.
Let's delve into what we think are the basics around dark data, and I think the first question that I want to put to both Rich and Colin is what is dark data?
I think there's use of the word in the industry and I think it will mean a million things to a million people. So Rich, I'm going to come to you first.
In your opinion, what is dark data?
Rich Hale
My opinion.
Thanks. Thanks Dave.
I mean, dark data is the term we've been using for some time now to cover a bunch of different angles here, and we've used it because it captures what we think should be the imperative of the situation.
We focus on unstructured data because that is the darkest.
Unstructured data isn't in columns. Isn't in rows. When you look at it, when you open up any given repository for your unstructured data, you don't know what's going to be there necessarily.
In fact, in our experience, and there's so much of it, it's dark because there could be anything in there.
And within hours of its creation, it's usually forgotten about by whoever created it, because there's so much of it. You can see some stats here.
And it's dark because for anyone that thinks about it, the implications really are pretty scary.
And I know we're going to come on to that in a minute.
And so, yeah, dark data really is the mountains of unstructured data: Word documents, emails, client data stored in file shares, stored in organizations.
That's increasing every single day.
David Simpson
Thanks Rich.
And then I'm gonna ask exactly the same question for Colin.
Colin Fowle
Thanks Dave.
I agree with Rich, he's exactly right.
There's so much information being generated, you know, especially unstructured data, where it's so much harder to look at the relationships between the information. And I know we're going to come onto a bit about, you know, how it's generated and why there's so much of it, in systems as well that perhaps you don't even think about. You know, door access systems, those kinds of things, are generating information about people's movements.
And if you don't know you've got it, you can't analyze it, and you don't know what the risks are.
David Simpson
Yeah. And I think it's really interesting actually that you know when we talk to firms, Daniel, you've spoken to a lot of firms about this particular subject.
And it's kind of that everyone knows they've got a problem, but not really knowing to what degree they've got that problem.
Dan?
Daniel Ibrahim
Yeah, I mean, I literally had a large, a very, very large firm, and the global CIO said, "I know I've got a problem, but if I identify that problem I then have to fix that problem."
And for him, he just said that it's such a scary thing, because every single client that engages with their organisation has to provide two forms of ID that are probably saved in three to four different locations.
Then you add in all of the communications that they have with these clients.
And it's potentially extremely scary, and that's just the structured part. If you then start worrying about unstructured and the dark data and you add that, it just compounds and compounds the potential issue.
David Simpson
Yeah, I think for sure that unstructured part that bit below the surface is the bit that you know you don't know what you don't know until someone tells you about it. And then as you say then you've got a problem to fix.
But let's talk about risks, of course.
Rich Hale
Sorry, I was just gonna add here.
I think a big part of this problem is that in everyone's day-to-day work, they are not well served necessarily by the electronic systems they use to do their work, and they use coping mechanisms. And so we could ask ourselves, why does it get forgotten about? And we all know that we all have that moment.
And so you just save things over here and you forget about it and you send something, a copy, to somebody else. And so it's through the diligent work of everybody in the chain trying to do their best.
But it just creates build-up because, you know, no one really faces up to the fact that they're not well served necessarily by the IT they've got.
And this has been happening for a long time, and so it's built up. This liability has built up for so much time.
I think that sense of people doing their best and not being willing to look, because they worry about what they'll find, is a part of what makes things the darkest.
Daniel Ibrahim
And you know, on that point specifically, I have several clients who went for the pessimistic security model, locked everything down, but it then became this coping mechanism.
Oh, OK, we'll store in personal folders. We'll work on it until it's finished. Then we'll put it into the right system.
Human error being what human error is, it never got found.
So there's huge amounts even in a structured system where there's still dark data.
It's an incredible concern to think about.
Colin Fowle
People do just enough to get the job done, I think, sometimes. Unless there's a good regime of compliance and data security around it, that's where the problems start to occur.
David Simpson
So look, let's come on and talk about the risks.
And Rich, I'll come to you first on this.
We've got a lot of people on today's call, but I'll ask you the question why should firms care?
Rich Hale
It's really interesting. Vendors, and I count my organisation as part of that landscape, have been guilty, I think, of selling on the basis of compliance and regulation in a very abstract way for some time. And let's be honest, firms and organisations haven't really been held to account by those regulations.
Perhaps until really recently.
I think what's really interesting, I've always thought, let's take privacy for example.
There are many other data regulations, but this one's a hot topic lately I guess.
I always think of stage one and stage two of privacy from a customer perspective, where stage one is all of the shouting about GDPR.
It's going to be huge and people do just enough to check a box and it all goes quiet.
I think we've entered stage two, where things are beginning to become clearer.
Our recent experience, one of the reasons why we entered the legal market, is that clients are now asking their vendors, their service providers, their advisors, to demonstrate compliance, increasingly, and, you know, that could be through a range of different means.
It could be from simple license agreements or contracts right the way through to more specifics of outside counsel guidelines and the like.
And then further, they're beginning to ask our customers to attest. And what's interesting about that is the individuals that are attesting have been kind of running with a sort of top down, my finger in the air, best professional judgment on the problem while knowing what goes on day-to-day and so evidence based attestation is beginning to come onto the scene.
We've recently heard of several customers being asked by insurers to say something more significant about how they're protecting sensitive data.
And so that's the kind of the flow through of regulation into client requirements.
Then there's simple regulatory compliance itself.
There can be domain specific as well as general regulatory compliance for different practice groups and different industries.
And then governments are beginning to act. It's interesting.
I'm based in the US at the moment.
I mentioned the Equifax breach earlier.
They were stopped from doing business, or at least threatened with being stopped from doing business, by Congress.
And then lastly, we have reputational damage and the cost that can come from recovering from a breach. And the bottom line is, when a threat actor gets inside the walls of an organization, and they will, we know that to be the case now, it takes up to 200, 250 days to find them, so they have all that time to fish around for whatever they want. And if data is left lying on the shelf, so to speak, it can be easily lost or breached and result in reputational and other costs.
Not least the actual cost of recovering from the breach and dealing with the insurance cases that run from it.
And so I think these risks have been held as a liability for some time.
And if I was to summarize, I think customers and clients are beginning to hold their vendors to account in a way they haven't in the past.
Daniel Ibrahim
Well, I tell you, this is the first time ever where I had a client call me and he said I couldn't win a piece of business because I couldn't prove we were compliant.
Now that's a big thing.
Normally it is well prepared in advance of any of these concerns happening.
But this was the first time I ever had a client call and say "give me something that's going to fix this problem."
Colin Fowle
I definitely echo what Richard said there about the landscape. As a vendor, we get asked more and more about our information security status, you know, are you compliant with particular standards. Can you demonstrate you've got the policies for this, that and the other, which ultimately reflect on your ability to give confidence and mitigate any risk that there might be around the issues with losing personal data or breaches or our ability to operate.
You have to be more rigorous and more stringent, but actually going through some of these processes really does help you understand what your data landscape is, how you can map that information and how you can see what you've got. And really, I think that is one of the first steps towards understanding the scale of the problem that you may or may not have with that data.
David Simpson
So interestingly, what you said Colin is going to bring me on nicely to Rich. I'll come back to you in just a second.
You talk about scale, and I want to ask you what you think the scale of the problem is now. I guess that, you know, everybody's scale of the problem is proportional, but what do you think the scale of the problem is? I'll come to you first, Rich.
Rich Hale
I think there's so many ways of answering this question that I'll try in a couple of ways. The first thing I always say is the scale's been growing ever since we first connected a couple of PCs together and did a bit of shared storage.
That's where it all started.
And for me, that was way back in perhaps 92.
Perhaps a little earlier, and so that number's been growing ever since. You can find lots of different references on how fast that data has been growing.
And my point being, at the beginning, no one was checking, and 20-odd years later, still no one's checking and looking.
And so the growth has been at the pace of business. We can talk about doubling. We can talk about some sort of non-linear, exponential growth.
You start with the small amount and it grows in a non linear way.
We know there's an enormous amount.
That's my first answer.
The second one would be all organisations have the problem, all of them, and the only exceptions I make are those that have embarked on a journey to address this deliberately.
And even those organisations are still in the journey.
And then let's talk about the actual stats.
You know, with most firms we engage with, you can start with a small firm or organization, with perhaps 50 terabytes of data.
That's about 50 million objects. Even a thousand or a couple of thousand objects is almost impossible for any individual to deal with.
And so, you know, millions of objects get into the realms of not even thinkable.
And that's just the small organizations. We're up into the petabyte ranges now, which are billions of objects, billions of files.
And that just means humans can't deal with it.
And so once you get to this situation, the scale is bigger than one can handle by simple traditional means.
And I think that's kind of why we sit here today, to say how do you get a grip of this and take a chance to sort of face up to that fact.
So that's my angle on the scale anyway, Dave.
Daniel Ibrahim
And look, I actually had a client contact me recently because they have to keep 20 years' worth of data and, you know, 20 years ago you'd share 10, 15 files, 10, 15 documents. Now the amount of information that is shared to run a matter is obscene.
And the volume and the size and the scope: video files, audio files, big, big documents, and anything can appear within those big, big documents.
And yes, some is stored in a structured way, but some isn't.
And fundamentally, clients have to deal with just obscene volumes of information.
And yes, it costs them more in storage.
And yes, it is a potential issue, but the deeper concern is what's in the database or what isn't. And that's typically what I'm seeing a lot more with clients these days.
Sorry Colin. To you.
Colin Fowle
I was picking up on Rich's point about, you know, these millions, billions of objects that are being stored that will get you at some point.
With the kind of democratization of software development and low-code, no-code platforms, people, you know, tech-savvy lawyers or whatever, are stitching together workflows and integrations. You've got to connect to this and that.
From this platform you're sharing maybe a PDF that's been generated off the back of something and pushed into something else.
And all those objects are going into systems like Salesforce, HubSpot. Maybe you've got electronic signature platforms, you've got your own matter management systems.
All this information is being pushed around, both structured and unstructured. Actually tracking down those 50 million, hundred million objects in itself becomes a massive challenge as well.
Rich Hale
I think that makes the point I was going on with earlier, Colin. I'm sure we'll come to this later.
I talked about people coping.
I think there's also people innovating, and there's an interesting balance to strike between how one supports and enables that innovation while protecting the organization. And I think one needs this balance.
A layered system where you have the pace of innovation and then you have supporting systems that, frankly, have the innovators' back.
How does one innovate while having the business's back and putting your fingers in the air? And these processes create dark data because they're ephemeral.
They stand up and they sit down.
They are disbanded and restarted, and we don't sweep up the bits behind ourselves, and that's the core of the issue we see all the time.
David Simpson
I think if I can just comment, I've just received the results of the poll that we took earlier on.
And unsurprisingly, 0% of our audience answered "No: everything's under control, I don't have a problem." That doesn't shock us at all, right?
But 31%, when asked "Do you think there is a data problem?", simply answered "I don't know", and I think that's the problem that we're here to talk about and hopefully help some people out with in the future.
But the reality is no one knows and the scale of the problem that they've got is unknown.
Some know it's huge, some have no idea.
Daniel Ibrahim
I have to say I'm very reassured to see that 15% of you are on your way to actually remedying that.
That's higher than we initially thought, so congratulations to those that are starting their process.
David Simpson
So as a kind of rather broad question, are we experiencing a data epidemic? And, I think as a sub-question to that, is it only set to get worse?
I'll come to you first, Colin, on this one.
Colin Fowle
Yeah, I mean unequivocally, yes.
Absolutely.
As I said earlier, you know, there are more and more, you know, integrations, SaaS systems popping up for sharing data left, right and centre, and Rich makes a good point about people not cleaning these things up properly.
You know, the number of times we've done reviews with prospects looking at an integration between System X and System Y. And, you know, "What did you have previously?" "Oh, we had this bit of tech over here." "What's that doing?"
"Oh well, you know, it's kind of not working anymore." "OK.
What's that database doing there?"
"Oh, that was from that system."
Well, you know, maybe you should probably get rid of that WhatsApp file share with, you know, 20,000 documents in.
So it is only getting worse.
It is only getting worse.
Daniel Ibrahim
And to be fair, it is being compounded by what happens within organisations, where you've got risk and compliance teams saying, "Hey, let's delete everything. We don't want to have any risk," and then you have the fee-earning population saying, "Well, I don't know when I might need to use that again, so let's just keep it just in case," and you've got this constant battle between these two groups.
So to do anything with data that they know about is hard.
But then there's this whole thing underneath the surface that we need to tackle and deal with as well, that no one truly knows that much about.
So I find it fascinating when I speak with the risk and compliance teams within law firms and other professional services organisations, and they're saying, "We would delete everything today that's seven years old. Gone, right?" But you then speak to a fee earner and they say, "No, I could use that again.
There were some good precedents in there." And I said, "Well, why don't you create a precedent library and move everything over to that and then box everything off, redact, move on?"
But even that is not a straightforward process within most organisations.
Rich Hale
Yeah, I think that captures a phrase I've been using for a long, long time, which is the pace of business.
Not supporting the business's fee earners adequately, Daniel, drives to the "I do it my way" versus "I do it the business's way".
And precedents, by definition, are not really a personal thing.
They have to be the firm's thing, and I think one has to sort of get on top of that somehow.
You use the term epidemic, Dave. You know that's a negative term.
And for sure it's negative.
We can sort of turn it on its head and say, what about the missed opportunity as a result of that? And so why?
Why can one not create a precedent library so easily anymore?
Why is that so hard?
What opportunities are lost because you have discrete silos of data wherever they are, dark or otherwise?
And so these things all come into the mix.
Daniel Ibrahim
Very much so.
In fact, I'm working with a large organization at the moment that to avoid having some of these issues, they simply don't allow people to store certain information within a document management system.
And they've kept the technology as a 1998-based system, and they literally don't allow things like video files and audio files and all these things to be stored. And they said to the clients, you know, "Please don't send them to us, we will not be using them as part of the case," which I find fascinating, that they're using old tech to prevent the capability to have any dark data.
So that's their way of handling the journey.
Rich Hale
Even new technology, I think, is much harder than anybody would have been led to believe.
Right, I happened to be on a call with one of the largest law firms in the US and their AI team, and it's fascinating how there are so many brains in the room, yet they still have some basic issues with data preparation and the like.
And it was the dark data that was stopping them at the foundational basics, you know.
Daniel Ibrahim
Absolutely.
David Simpson
OK, I want to keep things moving along and get your thoughts on risk, and the associated risk of not understanding your data, whether it be from a compliance, GDPR breach or financial perspective.
I think it was touched upon earlier on, but I would like to delve into that in a bit more detail if we can.
Rich, I'll come to you first on this one.
Rich Hale
Yeah, it was interesting.
I was thinking to myself just now, actually, David, we kind of already talked a little bit about it, but I think there's a subtle angle on this as we move along the story line here, which is we talked about the risk in dark data earlier.
We're now talking perhaps about the risk of not understanding that data, and I think that just shifted, just moves the argument on a little bit. And so I'd start fundamentally by saying you cannot protect what you don't understand and as we look at maturity of the provisions that organisations take, we've learned that for data breaches, for example, it's not if but when.
And so the error one makes is assuming that because one has the best data loss prevention protection, the best digital rights... well, I'll come on to that and perhaps talk about digital rights.
I mean the best ethical walls, the best provisions for walls around the organization.
The bottom line is they will fail.
You only have to get it wrong once for these things to break and that's not supposed to be me sitting here as a harbinger of doom.
What I'm saying is that if we work with the opening assumption that the organization will be breached, the question is how do you protect?
How do you ensure that you just don't leave stuff lying around on the floor, metaphorically speaking, for people to sweep up when they're inside the organization? And this is a double-edged sword here.
So we talked about using the data for the benefit of the business, and that's the two sides of the game. If it's lying around on the floor, a threat actor can sweep it up and take it away and do something with it.
It costs more to protect the broad surface area of systems, and so if your data's spread everywhere, it costs much, much more to protect it.
And it makes it hard to leverage the data. We mentioned earlier the regulations are getting tougher. I think customers and staff are becoming more aware as a result, so they're not willing to look away anymore.
Governance teams are saying, "I'm not going to sign off anymore on this attestation without actually knowing the answer.
I can't do that.
It's not appropriate for me to do that. And as I said earlier, I think outside counsel guidelines are becoming more stringent.
I've been surprised by the number of GCs that are leading or sponsoring compliance and unstructured data efforts in their businesses, in their firms, just lately, based upon their concern about the scaling risk. And so, I think, in firms particularly, the fact that customers choose not to give their business is perhaps the biggest risk that's out there, but we've already talked about it and we'll talk about that again in a minute.
The lost opportunity that comes from competitive edge is also part of a picture here as well. So there's a wide range of risks involved here.
Colin Fowle
Yeah, I think it's interesting. And one thing, you talk about leaving data lying around to be swept up: obviously having a good compliance regime in place, and data health, cleanliness, records management, those kinds of systems in place, reduces your liability if you do have a breach.
In terms of risk, you know, you can show that you've been doing all the right things. You're going to reduce your storage costs, because if you haven't got all this stuff lying around, well, you're not paying for storage.
One thing that came to me from a discussion with somebody else, you might realize that net zero is quite a big thing that's being pushed at the moment through the government and through some of the ISO standards.
And actually, if you're using less storage, you're using less electricity, and you're thereby reducing your carbon footprint by not having this stuff lying around.
So again, this is something that can help organisations show that they're actually being healthy with their data and their information.
David Simpson
I want to pick on one particular issue that I guess causes everybody a problem, but with unstructured data probably causes a bigger problem and that is around data subject access requests.
I think I can talk from personal experience of several times in my career where it's come up and there was a Dave Simpson shaped hole in the wall. Because everyone wants to run away from it.
These DSARs are problematic, time consuming, costly, inefficient. But how do you see us tackling these over time? And so I want to pick on a kind of real case, a real-life use case.
Rich, I'll come to you first on this one.
Rich Hale
I talked about phase one phase two earlier.
In phase one, I sort of broadly describe it as organisations beginning to get a sense of minimal management required for the privacy process.
I think what's interesting now is organisations actually looking at what the regulations really mean, and GDPR is one of the examples of an output
that becomes evident if you really think about what the regulations really mean and if you think about the data you actually hold. The question is, what's your total obligation here? And so the problem with DSAR response is that organisations don't understand their data and they haven't spent time organising it.
Moreover, they've allowed it to run broadly across a huge data space, so the simple act of responding to a DSAR then becomes hugely time consuming just in terms of staff. If one considers whether regulations get their teeth or not, and, when they get their teeth, the 30-day response time attached to some of these requests, the question is, can you achieve them?
I think it comes down to how organizations actually demonstrate their compliance and how they actually respond to fulfill these requests. When one talks about dark data, it can be pretty taxing, verging towards impossible sometimes, for some organizations to do that.
And I think the question for us all is when do those regulations begin to grow their teeth?
Daniel Ibrahim
Yeah, I mean, I couldn't agree more.
David Simpson
Yeah, Daniel and Colin, I will come to you on this, but I think audience, this is our second poll.
So do have a think about this.
It should have come up on your screens: how problematic is the process of complying with DSARs?
So do feel free to go ahead and answer that.
But yeah, Daniel, I'll come to you.
Daniel Ibrahim
Yeah. So what I was going to say is this is probably the most loathed activity within IT, compliance and any team that this touches. You know, they have to go through a rigorous discovery process to find stuff that they don't know that they may or may not have.
Once they found it, it has to be reviewed by the right people to make sure that they're allowed to show things.
There has to be a redaction phase, and then finally you hand it over.
It can take months for most organisations. My average client has thirty of these a year.
And you know, you almost have to have a full-time staff member just to respond to data subject access requests, or people saying, "I want the right to be forgotten. Go remove everything from the system," and then they can come back later and say, "Did you really remove everything?" And if there's teeth behind that, saying, "Well, you didn't," now what?
I think that will be a fundamental problem that professional services organisations will meet, and something that we'll have to help them address.
Colin Fowle
Yeah. I agree with both Rich and Daniel.
If you don't know what's there, you can't find it, you can't give it to people, and then you can't prove that you've done it anyway, you know.
It's a nightmare.
I can imagine it keeps people awake at night sometimes.
David Simpson
OK.
So we've painted a pretty dark picture of the world.
But let's talk about how we fix the data problem, so I'll come to Colin first on this one and then Rich shortly afterwards.
But what does a solution look like?
And that's a difficult question to answer, I know, but.
FAQ
Dark data, often underutilised, refers to unprocessed information that organisations collect but don't use. It can offer valuable insights for business decisions, risk management, customer understanding, operational efficiency, and AI training. However, challenges include storage costs, compliance risks, and data quality. In legal tech, dark data can help improve document management, compliance, and predictive analytics. Properly harnessed, dark data can unlock hidden value but requires careful management to avoid potential issues.
Studies on the volume of dark data vary. The Economist quotes 70-90% and IBM estimates around 80% of data typically qualifies as dark data. This encompasses vast amounts of unstructured information such as emails, documents, audio, video files, and system logs that are collected but not analysed or actively used for decision-making. While exact figures vary depending on industry and organisation size, the proportion of dark data is significant across sectors, highlighting the challenge and opportunity in harnessing this untapped resource.