Draining the Data Lakes: How much data collection is too much?

Paul Ducklin
11/27/2024

Answering a rhetorical question

It’s midnight.

Do you know where your personal data is?

At first, this sounds like a rhetorical question – a literary device that doesn’t expect an answer, but is asked just to make a point.

The worrying truth, though, is that for most of us, the answer is an uncertain mixture of, “Yes. But also no.”

The “Yes” in our answer is because most of us, or at least very many of us, are sufficiently concerned about cybersecurity, privacy and online safety that we take at least some precautions to stay out of the reach of cybercriminals and online scammers.

Some of those practical precautions include:

  • Picking proper passwords. A password manager can help with this. Guessable passwords, or the same password on multiple accounts, can make it easy for attackers to break in and raid your accounts and personal data.
  • Patching promptly. Even major vendors such as Microsoft, Apple and Google regularly find and fix serious security holes in their software. Falling behind on patches typically leaves the criminals ahead.
  • Protecting against malware and phishing. Cybercriminals persistently pitch rogue software or fake websites, using a range of believable lures such as warning you about made-up threats, sending you bogus ‘limited-time offers’, or pretending to be people you know and trust.
  • Turning on MFA. Multi-factor authentication, or MFA for short, is a small extra hassle that means you can’t log into your account with a password alone. Typically, a code that changes every time is generated by an app on your phone, or received as a text message.
  • Being aware before you share. Many apps and online services, such as social media or shopping sites, urge you to let them access data such as your contact list, or to save a copy of your payment card data for next time. But the less you let them keep, the less of your data they have to lose in a breach.
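
For the curious, the “code that changes every time” in the MFA tip above is typically a time-based one-time password, or TOTP, as described in RFC 6238. Here is a minimal sketch of how an authenticator app might compute it, assuming the common defaults of HMAC-SHA1, 30-second time steps and six digits. The function names hotp and totp are ours, chosen for illustration, and the secret shown is the RFC’s published test key, not something to use on a real account.

```python
import hashlib
import hmac
import struct
import time
from typing import Optional


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """Counter-based one-time password, following RFC 4226."""
    # The counter is hashed as an 8-byte big-endian integer.
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation: the low 4 bits of the last byte choose an offset.
    offset = mac[-1] & 0x0F
    # Read 4 bytes at that offset and clear the top bit, leaving 31 bits.
    number = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    # Keep only the last few decimal digits, padded with leading zeros.
    return str(number % 10 ** digits).zfill(digits)


def totp(secret: bytes, now: Optional[float] = None,
         step: int = 30, digits: int = 6) -> str:
    """Time-based one-time password: HOTP keyed on the current time step."""
    if now is None:
        now = time.time()
    return hotp(secret, int(now // step), digits)


if __name__ == "__main__":
    # Test key from the RFC appendix; real secrets come from the service
    # you are enrolling with, usually via a QR code.
    demo_secret = b"12345678901234567890"
    print("Current code:", totp(demo_secret))
```

Because both your phone and the server derive the code from a shared secret plus the current time, a stolen password on its own isn’t enough to log in, and a code that an attacker intercepts stops being useful within a short time.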

Draining the Data Lakes: How much data collection is too much? - SolCyber

What if we want to share data?

Make no mistake, the precautions above work well, because they provide active protection against intrusions that could lead to the theft of personal data, or against needless oversharing that gives away data that can’t reliably be recalled or recovered later.

But what about personal data that we share because we don’t foresee the risks of doing so, or that we share because we trust the service we’re sharing it with today, or that we don’t even realize we’re sharing?

And what about personal data that we share because we have little choice but to do so?

This is surprisingly common, either as a matter of law, such as know-your-customer rules (KYC) enforced by banks, or as a side-effect of circumstances where we aren’t willing to face the risk of refusing, such as missing out on a job offer or a rental agreement by withholding information that we know other applicants are handing over regardless.

At this point, we start wandering into the “But also no” part of our earlier answer, because we can’t always pinpoint exactly where our personal data has ended up.

Let’s start with data that many people share out of choice, without necessarily thinking what might happen to it, or what it might be used for.

The most obvious example is probably social media and other “lifestyle sharing” sites that may end up providing attackers with a detailed timeline of data points about you that didn’t feel risky to share when you handed out each individual nugget of information.

Fitness apps and services such as Strava and Fitbit, for example, record not merely your movements, such as how vigorous you’ve been or how many kilometers you’ve cycled, but also your location and perhaps other health-related data such as heart rate and power delivery.

This could build a very public and searchable history of your lifestyle and activities.

Combined with your social media posts, especially if you like to include information about your friends and family in your feed, you might be giving away details that are surprisingly informative to so-called social engineers.

That is the deliberately creepy term for cybercriminals who set out to masquerade as you, or to convince other people that they know you well, in order to trick them into doing things they would normally reject out of hand, such as resetting your passwords or swapping your cellphone number to a new SIM card.

What if we don’t know we’re sharing?

Clearly, you can limit your distribution of lifestyle data by posting sparingly and thoughtfully on social media, or by limiting how freely you upload data into the cloud from devices such as GPS trackers or bicycle computers.

But what about detailed data of this sort that you are collecting, and perhaps sharing in real time, without even realizing?

We recently published a challenging article entitled “Is your car spying on you? When safety and data privacy collide.”

In this article, aimed at addressing the vital question, “What sort of data collection is ‘fair and reasonable’ when it’s your car doing the collecting?”, we enumerated just some of the data that your car might be collecting and sharing, perhaps via a built-in cellular network connection of its own.

Here’s a greatly abbreviated version of that list:

  • How fast you were driving, moment by moment.
  • Your precise location, moment by moment.
  • Data from cameras and microphones inside the car.
  • Data from cameras showing what’s outside the car.
  • When you locked and unlocked the vehicle.
  • When you and others got into or out of the vehicle.

Worryingly, there’s even more to it than the list above. As the Mozilla Foundation put it in an article published about a year ago:

“There’s probably no other product that can collect as much information about what you do, where you go, what you say, and even how you move your body (‘gestures’) than your car. And that’s an opportunity that ever-industrious car-makers aren’t letting go to waste. Buckle up. From your philosophical beliefs to recordings of your voice, your car can collect a whole lotta information about you.”

Of course, the data that gets collected and uploaded isn’t all that matters: there’s the thorny problem of who gets to see it, including data brokers and third parties who might get a chance to buy it up, mine it, and perhaps to sell it on again.

What if the sharing rules change?

Confoundingly, the issue of which third parties might eventually end up with your data isn’t as clear-cut as you might expect.

You may be able to get precise and detailed lists from your current service providers that specify what the industry euphemistically calls “data sharing partners.”

But what if the company to which you entrusted personal information today gets sold on tomorrow, and ends up in the hands of an entirely different business entity, perhaps even in a different legal jurisdiction?

A fascinating, if worrying, example of this problem is unfolding right now, and concerns online genetic testing company 23andMe, which holds perhaps the most personal information of all about some 15 million customers: their DNA data.

The company’s value declined dramatically from 2021 to 2023, and has continued to decline since the data of about 14,000 users was accessed illegally in October 2023.

As a result, the company may not be able to keep going in its current form, and might therefore be sold.

Reporter James Purtill of Australia’s ABC wryly noted that he had, innocently enough but perhaps without thinking things through, given his own brother a gift of a 23andMe test back in 2016:

[I]n 2020, there was a story about US police tracking down a murderer through members of his family who had done at-home genetic testing. This effectively meant I had [‘turned in’] our entire extended family for past and future crimes. […]

And now comes the big one. Having been valued at $US6 billion in 2021, 23andMe is on the verge of bankruptcy. Its CEO, Ms Wojcicki, is considering selling the company, which means the DNA of its 15 million customers would be up for sale, too.

And, as Purtill reports, even if customers rush right now to activate the option to delete their data from the 23andMe site for fear of it being sold on:

23andMe says it holds onto the data for three years before deleting to comply with “legal obligations”. Until then, your genetic data remains its property.

Three years is a long time. The company could change its privacy policy (a clause gives the company the right to do this at any time), which might affect how the data is used, or the data could be sold with the company. Depending on the buyer, it may be hard to keep tabs on what data gets deleted.

What if there were no rules to start with?

Finally, there’s the thorny problem of data that’s personal to you, but shared of necessity and then sought out and commercialized by someone else under the guise of serving the public good.

That’s a tricky issue that we covered in a two-part article earlier this year, following the discovery of what can only be described as a megabreach, even though the company concerned was tiny: NPD, short for National Public Data.

NPD, apparently largely the work of one man named Sal Verini of Florida, offered paid access to an online search engine allegedly based entirely on data that was found openly on public-facing servers.

This scraped-up data was converted into a giant, indexed database of court records and other background information of likely interest to law enforcement and a wide range of other interested parties:

NPD’s website proclaims that its services “are currently used by investigators, background check websites, data resellers, mobile apps, applications and more,” and invites you to “[j]oin now and enjoy quality data with low fees and no monthly minimums.”

The business justifies its name by explaining that it gives “access to the greatest level of public information retrieval available on the Internet,” describing itself as “a public records data provider specializing in background checks and fraud prevention [that obtains] information from various public record databases, court records, state and national databases and other repositories nationwide.”

Loosely put, it runs a web-based subscription service that provides an API (application programming interface) for paying customers to do their own searches against personal data supposedly scraped from locations where it has already appeared online. The presumption seems to be that scraping the data makes the collected information fair game for commercialization.

Even if you didn’t follow this story at the time, the title of the article tells you what happened: NPD was comprehensively breached, and the attackers got away with an indexed, ready-to-use collection of personal information covering about 300 million people, including 270 million SSNs (social security numbers) and 30 million email addresses.

Of these, one estimate suggested that as many as 50% of the victims lost enough combined data to put them at risk of identity theft; the rest were nevertheless at increased risk of stalkers, scammers, spammers and more.

All of this was based on information that the victims almost certainly didn’t intend to publicize themselves, and that they never consented to have commercialized for any purpose at all, let alone to have some unknown third party make money from it by peddling it for background checks and the like.

For the record: NPD was formally asked to appear before the US Congress to explain itself following the breach, notably because it tried to sweep the issue under the carpet, and to keep its victims in the dark. Apparently, the company recently filed for bankruptcy, its business having collapsed. What will ultimately happen to the databases and indexes it created is unclear. However, even if NPD’s data is destroyed, the attackers who breached the company still have their copies of it.

What to do?

  • Follow the precautions listed at the top of the article. They will help to protect you from a wide range of security threats, including personal data breaches.
  • Avoid signing up for services that directly expose other people by means of the data you hand over. DNA-related services are an obvious example, as mentioned above: aim to keep the most privacy-conscious person in your family happy, and avoid putting them indirectly at a risk they are unwilling to take themselves.
  • Be wary of voluntarily disclosing more than is strictly needed. If a web form has optional fields, don’t feel compelled to fill them in “for completeness” or to be friendly. At the other end of the scale, if a website insists on being trusted with personal data that is clearly not needed for it to deliver its service, try going elsewhere.
  • Read data sharing terms and conditions carefully before agreeing. Learn not only how to download your data and to delete it from the service when you want to leave, but also where else your data may be going, and how long it will take after you request its removal before it actually gets deleted.
  • Consider switching to offline working when privacy is an issue. Cloud services are convenient but aren’t always necessary. For example: you could review or map your fitness data in a local app that doesn’t upload it anywhere, and you could transcribe locally recorded audio using standalone software. (Many AI tools are open source and are available in offline packages that work 100% locally, although they may be somewhat harder to set up and use this way.)
  • Carefully review the default settings of every service. Opt out of any you aren’t 100% sure about. When you receive updates to the terms and conditions of a service you use, be sure to check them carefully because new data sharing options (e.g. scraping your data for AI training) are often activated by default, forcing you to opt out instead of inviting you to opt in.
  • Campaign for an opt-in internet if you can. After all, if a new data sharing feature is so useful that you won’t want to live without it, then it doesn’t need to be turned on by default. If it’s truly that good, you will be willing to turn it on for yourself!
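
To illustrate the offline-working tip above: working out how far you rode doesn’t need a cloud service at all. Here is a rough sketch, assuming you have already exported your track as a list of (latitude, longitude) pairs in degrees. The function names and the sample coordinates are made up for illustration, and the haversine formula used here treats the Earth as a perfect sphere, so the result is an approximation.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def track_length_km(points):
    """Total length of a track given as a list of (lat, lon) tuples."""
    # Add up the distance between each point and the one that follows it.
    return sum(haversine_km(*p, *q) for p, q in zip(points, points[1:]))


if __name__ == "__main__":
    # Hypothetical sample track; in practice you would read these points
    # from a file exported by your own device.
    ride = [(51.5000, -0.1200), (51.5050, -0.1100), (51.5100, -0.1000)]
    print(f"Distance: {track_length_km(ride):.2f} km")
```

Nothing in this script touches the network, so your route stays on your own computer unless you decide otherwise.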

As we’ve said several times before: A little caution goes a long way.

But that’s not all: A bit more caution goes even further!


Why not ask how SolCyber can help you do cybersecurity in the most human-friendly way? Don’t get stuck behind an ever-expanding convoy of security tools that leave you at the whim of policies and procedures that are dictated by the tools, even though they don’t suit your IT team, your colleagues, or your customers!

More About Duck


Paul Ducklin is a respected expert with more than 30 years of experience as a programmer, reverser, researcher and educator in the cybersecurity industry. Duck, as he is known, is also a globally respected writer, presenter and podcaster with an unmatched knack for explaining even the most complex technical issues in plain English. Read, learn, enjoy!
