Join Paul Ducklin and SolCyber CTO David Emerson as they talk about the human element in cybersecurity in our podcast TALES FROM THE SOC.
There’s an ever-increasing cost in collecting more and more data just in case. But is there any real value in doing things that way?
Co-hosts Duck and David offer their usual mix of gentle humor and sound advice to stop you getting lost on Butter Mountain, or drowning in the Data Lake.
Find Tales from the SOC on Apple Podcasts, Audible, Spotify, Podbean, or via our RSS feed if you use your own audio app.
Or download this episode as an MP3 file and listen offline in any audio or video player.
[FX: PHONE DIALS]
[FX: PHONE RINGS, PICKS UP]
ETHEREAL VOICE. Hello, caller.
Get ready for “Tales from the SOC.”
[FX: DRAMATIC CHORD]
DUCK. Hello, everybody.
I am Paul Ducklin, and this is the TALES FROM THE SOC podcast.
As always, I am joined by David Emerson, CTO and Head of Operations at SolCyber.
David, welcome back.
DAVID. Good to be here.
DUCK. David, I would like to use a recent article that we published on solcyber.com/blog as the background for what we’re going to talk about, and that is the article entitled “Draining the Data Lakes: How much data collection is too much?”
At least in Western Europe, the idea, say, of a “wine lake” is a pejorative term.
It’s when you produce too much of a liquid commodity, so much so that you can’t sell it all.
So it’s a very mixed metaphor, the idea of a data lake.
So the article explores, “Is it OK to collect more and more data from more and more sources in case it’s useful later?”
Or should we be going the opposite way, trying to reduce the amount we collect and being smarter about what we do with it up front rather than at the back end, if that makes sense?
DAVID. I was waiting for you to mention the “butter mountain.”
DUCK. [LAUGHS] You have butter mountains, oil lakes, or wine lakes, or whatever.
The butter mountain rather boggles the mind, doesn’t it?
DAVID. It’s disgusting.
It’s completely disgusting!
DUCK. When summer comes around, it’s going to get ugly…
But then an oil lake, even if it’s vegetable oil, is not something that you would particularly want to fall into, is it?
DAVID. It probably has a coastal smegma of some kind.
That’s what I’m imagining.
DUCK. [MOCK HORROR] I don’t know whether you’re allowed to say that word in a family-friendly podcast, are you?
DAVID. Oh, come on!
DUCK. [LAUGHS]
DAVID. Butter mountain…
For the same reasons that a butter mountain is a little repugnant, I think that it’s a general illness of excess.
And most of what we see in data, which is to say the excess of data, comes down simply to inexpensive hardware, right?
So hardware has become cheaper – that’s storage, that’s compute.
The costs of structuring data post-collection, that is to say data after you’ve collected it, have come down.
And that in turn lowers the barrier to collection.
That’s a really good thing for architects.
Architects are less burdened up front by design.
They don’t have to think about the schema of a database as much; they don’t have to think about the data points that they’re collecting as much.
That design itself, coming from a product organization, is frequently unburdened by the knowledge of how that product will actually be used in the field.
And so all this is to say that we end up collecting more data, cheaper than ever before…
…but perhaps not cheaply in the aggregate.
And then we worry about how it’ll be used, or whether it ought to be collected at all, much later in the product lifecycle.
DUCK. When it can be quite hard to remember where it went.
And you also have the problem of what happens if the company that collected it in good faith gets bought up by a company that’s in a different jurisdiction, with different regulations.
DAVID. That was the concern around 23andMe!
DUCK. Absolutely, yes, that’s a good example.
DAVID. We all gave our DNA to 23andMe.
That wasn’t the thing you were thinking of when you were swabbing your cheek.
It probably wasn’t the thing they were thinking of when they were collecting your entire genome and sequencing it.
It’s not that it couldn’t have been foreseen, but there was no incentive to do so, because we are now in an era of very inexpensive compute resources.
DUCK. Yes, it’s almost as though the rule is, “If we can, then that’s an excuse that means we should.”
DAVID. It’s the same discussion that we had in the last podcast, I think it was, about intentional design.
DUCK. Indeed.
DAVID. Methods like MapReduce, and graph databases – these are structures that now are common…
…they’re practically templates for producing an application, and they do not incentivize the structuring of data in advance.
It’s still a good idea, but you don’t absolutely have to do it.
And so you can template your way right into a data lake, or a butter mountain of data, if you’d like to.
DUCK. [LAUGHING] My mind’s boggling now.
I guess the other part of this is the fact that we now have enough bandwidth to make this possible.
DAVID. Oh, it’s a non-issue.
Your default AWS instances have gigabit ethernet, which doesn’t even sound that fast.
10 gigabit networks are the norm now, and they’re not even expensive.
So, across the spectrum of infrastructure, it is inexpensive.
But that isn’t the same thing as a good idea.
DUCK. Exactly.
The ransomware crooks, as everybody knows, now do what’s often called by the jargon term of “double extortion.”
They don’t just scramble your data in place on all the disks of all your clients and servers on the whole network.
They steal all your data first.
The reason that ransomware was invented 40 years ago to scramble data in place was precisely so that there was no need to upload anything.
They were using the victims’ computers as a place to store the stuff that they had “stolen” without needing any bandwidth at all to upload it.
When it became obvious that most organizations now have enough bandwidth to allow all the trophy data in the entire organization to be uploaded overnight…
…well, why not do both?
It wasn’t really an innovation, it was just that suddenly it became possible, so why not do both?
DAVID. Yes, you hear about breaches in the gigabytes, hundreds of gigabytes, maybe terabytes.
And the fact of the matter is that those are uploadable amounts now.
DUCK. Easily, yes.
DAVID. On my home network, it would not be a challenge to upload a 200 gigabyte file and exfiltrate what would have been previously quite large amounts of data.
DUCK. So how come when crooks do it, it’s deeply dangerous, and everyone goes, “Oh, no, they exfiltrated all this data!”
Yet at the same time, that same organization might voluntarily be sharing huge amounts of stuff with all sorts of vendors who are milking data for all sorts of products because the data might be useful later.
But unless and until it is useful, it’s just a risk, sitting there on somebody else’s computer.
DAVID. It is.
At the very least, it’s an unnecessary burden.
And, if you’re being honest, you wouldn’t have needed to satisfy some of the controls that you will now need to satisfy in order to protect that data, which are probably themselves more expensive.
You know, you get into the realm of GDPR, or HIPAA, or PCI.
With some of these data points…
…it might be cheap to keep all credit card numbers forever, but it’s not a good idea because now you’ve got controls that will cost you far more than that storage and that compute.
And so that total cost, and that risk, is just one of the many parts of risk assessment that humans are particularly bad at.
You know, at SolCyber, we intentionally do not collect data points that are not security scoped, that are not security related.
That is something that we’re occasionally talked out of by our customers, if they want a particular data point that does contain more risk than we would intrinsically want to assume.
But then it’s a case-by-case basis; it’s field-by-field; it’s intentional.
It’s an intentional design activity.
And by default, we don’t want your PCI controls in our environment either, right?
We don’t want you sending your credit card numbers.
So that is an architecture that we’ve undertaken to somewhat simplify, in a manner which protects us and protects our customers from being over-harvesters of data, over-collectors of data in ways that could be harmless today, but have unforeseen consequences or unforeseen costs down the road.
DUCK. As you mentioned earlier, although it is now possible to store huge amounts of data cheaply, it’s never without cost.
There’s an environmental cost in the amount of energy required.
There’s a data storage cost in the fact that although disks are cheaper, they still don’t cost $0 each.
And the fact that you can just collect more and more doesn’t make it free of charge, and doesn’t make it, if you like, consequence-free.
DAVID. Correct.
Yes, these design decisions – if they are decisions at all, or maybe the lack thereof – do have costs.
And those costs may not be direct.
They may be indirect costs.
In some cases, those costs are abstracted from us in ways that are inappropriate and will eventually change, I believe – and I think that’s the environmental point you’re making.
DUCK. Yes.
DAVID. But at the end of the day, intentional design pays dividends.
Those dividends may not be enough to offset the costs of intentional design in all cases, but you should consider whether a design can be made more intentional, and whether you’re incentivized to do so.
And I think, in more cases than people believe, there *is* an incentive to intentional design as a risk mitigation.
DUCK. Well, you only have to go back a few years to see that, I think, both Google and Facebook have had this problem.
They had been collecting data in what they genuinely considered to be anomalous situations, for example where customer logins were failing.
And then somebody realized, “Oh, no! In the log data, we’ve actually been writing cleartext passwords some of the time.”
Well, bless their hearts, both those organizations then turned themselves in to the regulator.
They were actually unable to say immediately what the effects were.
They didn’t quite know where the passwords had gone because they didn’t know they were collecting the data in the first place.
So they then had this enormous cost of trying to find out, “Well, how many times did we collect this plaintext password data?”
“And where on earth could it be in amongst these multiple data lakes that we have in multiple data centers in multiple countries?”
So you can end up with a real comedy of errors there, can’t you?
DAVID. And it’s cultural too.
You know, their hearts, which may be blessed, are only blessed in certain circumstances, right… in this case.
DUCK. [LOUD LAUGHTER]
DAVID. Google turned themselves in for that.
Who knows what they haven’t turned themselves in for, in other areas of their data lake, or butter mountain, or whatever.
DUCK. Yes!
Or stuff that they’re collecting because they actually want to, that isn’t quite as explosive as plaintext passwords, which you’re not *allowed* to collect.
But maybe they’re things that people wouldn’t naturally realize were being collected.
In the modern era, sudden changes in privacy settings…
…which could mean, “Hey, unless you opt out – and we’re only mentioning this now that the world has cottoned onto it – we’re going to leech all the posts you ever made on our network to train our AI systems.”
Shouldn’t the internet really strictly be opt-in, and not opt-out?
No matter how inconvenient that might be for vendors, including cybersecurity vendors?
Shouldn’t we err on the side of caution, not on the side of, “Hey, this could be exciting someday”?
DAVID. Enforcing that is going to be hard.
I can’t indict the culture of an advertising company, fundamentally.
I think that, if you don’t like it, don’t store your data at Google.
Personally, I don’t necessarily believe that it’s entirely the burden of the data-storing entity, which would be Google in this case, to do no harm in all circumstances, some of which have not been contemplated.
I think there is a burden on the individual who’s willingly giving up their data.
If a service is free, consider what the product is: it’s probably *you* in today’s internet.
And there’s an excellent concept, if you’ve ever heard Cory Doctorow speak about it.
[PAUSE] This is perhaps also not family-friendly, but… the ‘enshittification’ of the internet.
DUCK. [LAUGHS] I try to avoid using that word, but sometimes you just can’t help it.
DAVID. [LAUGHS] You just can’t!
DUCK. Nobody who hears the word for the first time can fail to understand what it means.
DAVID. No, people know exactly what!
There’s a burden on the individual not to relinquish their data to some of these data hoarders.
And that is something that I think is too often left out of the discussion.
We’re getting free Gmail.
Why?
We’re getting free social networks.
Why?
It’s because of advertising.
DUCK. Yes.
DAVID. And that’s the reason that, personally, I buy my mail service from Fastmail – they are not an advertiser.
It costs me $4 a month, or something.
But at the end of the day, I don’t feel as much of a product, personally.
So I do think that there is a bit of burden on the individual here.
I just think that we’ve been so indoctrinated to presume that the internet is free, and to not question whether or not we are actually the product that makes it free.
I’m more OK with that than pretending that regulation will solve every single data-field problem, because there are so many data fields we may not have yet contemplated collecting.
So I don’t believe that regulation alone works.
I think user education at this point is almost as much a factor in the safety of an individual’s data as regulation around the collection of that data.
And I also think that there are simply bad actors in the field.
You know, there are plenty of advertisers that would not think twice about collecting a data field under the assumption that they will never be caught.
Even though it may not be totally kosher under GDPR, or even though they may not be storing it in a way that is entirely sane, they probably won’t be caught.
They probably know that, and it may be a calculated risk.
I would promote the notion that users consider the storage of their data, and the transmission of their data, when deciding whether they want to consume a service.
That I think is really what you need to be doing.
If you’re onboarding with SolCyber, you should be skeptical that we want to collect more data than we need to provide a cybersecurity service.
If you’re onboarding with an email provider, you should be skeptical if they want to advertise to you and they want to tailor ads to you.
If you’re installing Windows now, you should be skeptical that it wants to take screenshots of your activity every 30 seconds so that Copilot can… I don’t even know, do what?
These are the kinds of things that we need to build a culture of skepticism around.
And without that, there’s no regulation that’s going to intrinsically save us.
DUCK. In fact, one of the tips that I put at the end of the data lakes article is to suggest that people actually need to take the trouble to review very carefully, even though this sounds like a little bit of a burden…
…review carefully the default settings of any service they sign up for.
Because, with the best will in the world, there’s no way that regulators could demand that every possible service offer a consistent, one-size-fits-all user interface for choosing all the options, because it’ll be different for an email service compared to a blogging service compared to a website service.
DAVID. There’s not even agreement over what we should do for data in an uncontested field that you *must* collect.
If you are creating the initial user in Debian, it says, “OK, what do you want your username to be? What do you want your password to be?”
If you enter a terrible password, it’ll call you out.
It’ll say, “That’s a terrible password. Maybe you should pick something more complicated.”
And if you enter the same terrible password in the OpenBSD installer, it will say nothing.
Well, OpenBSD markets that as, “Guess what? We’re not reading your password. We’re not ingesting any information about your password.”
“And that’s why we don’t call out that you’ve just put in a horrible password.”
And, of course, I’m sure Debian probably believe they’re saving the users from themselves by, at the very least, analyzing what the bit-depth of that password was.
Two totally different ways to look at it.
I’m not sure I agree with one or the other, but, at the end of the day, they market it in wildly different manners.
We’re talking about *one field*!
DUCK. Yes.
DAVID. Imagine multiplying this by the billions of data points that are collected when you simply browse the internet to buy some deodorant.
I mean, it’s really not an easy thing to regulate.
So I don’t believe regulation is the way.
I think intentional design, and care on the part of the consumer, is the way.
DUCK. And if you’re faced with a service that asks you voluntarily to supply information you’re uncomfortable with, simply don’t put it in.
If they’re admitting, “We do not need to collect this; there is no regulatory reason,” then simply don’t fill those fields in!
There’s no need to be friendly about it.
On the other hand, if you have a service that’s saying, “We’ll let you create an account, but it’s compulsory for you to give us this, that, and the other information”…
…my recommendation would be not to make up fake information about yourself that you then have to remember in order to pass their checks.
Because then you might be in violation of their terms and conditions, which could cause you trouble down the line.
Instead, maybe consider looking around elsewhere for a company that will let you create a similar service without being quite as intrusive.
In other words, loosely speaking, “Vote with your checkbook.”
DAVID. Yes, absolutely.
DUCK. I think what you’re saying, David, is it would be very, very difficult for us to build a uniform set of regulations that say, for every online site and service, “This far and no further.”
Because we don’t really know what kind of sites and services are going to show up in the next one, five, ten years.
Innovation is important, even if sometimes it heads us off in a dangerous or a wasteful direction.
DAVID. We don’t know what technologies are going to evolve.
Facial recognition is common now.
Some people have been thinking about that forever, right?
Some people have hidden their face from the internet.
Other people have not thought about it and now perhaps wish they did, because you can just as easily toss my face into a search engine and find all of the places I appear online.
That isn’t a problem for me, but it is a problem for some people.
Quantum computing is another example.
I think in some ways that’s a bit like worrying about nuclear war.
It’s not to say that it’s not a concern.
There are some systems you want to harden against the possibility of nuclear war, but there are many systems that you just need to admit aren’t going to make it.
Bringing it back to the fields that are collected, if it were possible in some kind of quantum future that analysis of, let’s say, your mouse movements were enough to identify you as an individual across multiple sites…
…what are you going to do about that?
That may be something that you can’t possibly harden against.
I certainly don’t think it’s appropriate for regulators to start worrying about something like that in such a highly theoretical context.
So probably start thinking about it as an individual, but that’s much more about what you give up.
DUCK. Yes.
DAVID. Think about the data before you enter it into the system.
Think about your face before you upload it.
Is it a problem for you if someone links your identity across multiple facets of your life?
That’s where you have to protect yourself, and also where I think people need to be sensible about our ability to predict the future.
Shore up the things that need to survive nuclear war, or quantum computing.
But you’re probably not going to be able to save everything, and there’s some stuff that you’re just ultimately going to have to admit is in the hands of the future.
DUCK. And, quantum computing or no quantum computing, there are plenty of other ways for crooks to get hold of your passwords right now, which is how they’re getting into networks today.
So don’t let these terrible fears about what might happen in the future discourage you from doing the basics right now.
DAVID. Precisely.
That’s the “anti-aircraft battery on the 16th floor but the front door’s unlocked” problem.
DUCK. Yes. [LAUGHS]
DAVID. The reality is that people are not taking foundational steps today.
DUCK. Indeed!
If someone spent a hundred billion dollars on a quantum computer that can crack 17 messages a day that were completely intractable before…
…that is not going to stop them trying to get in through the front door, if they can do that, because it will still be faster and cheaper.
So David, maybe I can finish up simply by reading out what I wrote at the end of the data lakes article, where I said:
“As we’ve said several times before, a little online caution goes a long way.
But that’s not all.
A bit more caution goes even further.”
Don’t leave your cybersecurity to everybody else, and definitely don’t wait for the regulators to “solve” the problem for you.
Get the basics right now, and then you have more time to worry about the additional things that might be a problem.
DAVID. I completely agree.
DUCK. David, once again, thank you so much for your time.
Thanks to everybody who tuned in and listened.
If you would like to get in touch with SolCyber, you can email us: amos@solcyber.com.
That’s Amos the Armadillo, the metaphorical cybersecurity mascot of SolCyber.
And if you would like more content that will help the community as a whole, including your own friends and family, please head to solcyber.com/blog.
Thanks for listening, and until next time, stay secure.
Catch up now, or subscribe to find out about new episodes as soon as they come out.