Three myths about a data science project

Edited by Fedar Kapytsin
Published: July 29, 2024. Updated: August 08, 2024

And how to dispel them.

When our clients ask us about data science, they always have 3 main concerns:

  • Risky ROI;
  • High costs;
  • Security.

And all these fears are valid, but only when a DS project is done wrong. Let’s find out why data science fails, and how to make it right.

Myth #1: Data science brings little value

TLDR. If done wrong, data science is useless. Hire data scientists with a strong math background and industry experience. Both are equally important.

Are they really data scientists?

If you don’t do background checks on your data scientists, it’s easy to hire underqualified staff. This problem often comes with outsourcing.

Data scientist is a very well-paid role, so there are hundreds of applications for any job opening.

Yet most of them are underqualified. Even among hundreds of applicants, it's hard to find educated talent, and most don't have a relevant data background.

Most applicants have certificates from a 3-month boot camp. They know boiled-down basics, but not much more. Often, these applicants don’t have a STEM degree.

But data science is a real science. To be good at it, one needs strong statistical and math knowledge. Typically, this means a degree in a quantitative field and a Master's in statistics.

The good outsourcing companies hire qualified data scientists. The cheap ones hire everyone else.

To avoid issues, hire talent from a company you can trust and do some background research on the data scientists who will work with you.

Looking for data scientists?

Hire senior talent from Europe. Our data scientists have 5+ years of experience and MSc / PhD degrees.

Get industry experts, not just data experts

Now, let’s say you have expert data scientists. Is that enough to yield great results? No.

Without industry knowledge, it’s all worthless because data scientists will look for the wrong cues.

Every data science project begins with business questions. Without industry knowledge, the data scientist can’t ask the right questions. They will not understand why things are done in certain ways. Basically, they will dig in the wrong place.

It’s like starting a gold mine in Midtown Manhattan. If there’s no gold to begin with, it doesn’t matter how great your miners are.

Data science = math + computer science + domain knowledge. Each part is equally important:

What is data science?

Can you have an intermediary between a data scientist and a client? Yes, but even in the best-case scenario it will waste a lot of time. With an intermediary, you have to pay for the extra communication hours. That is budget wasted on both the data scientists and the intermediary.

A software company that specializes in your domain will bring much better results than a purely data science company. Enough domain knowledge beats any model fitting because you know how things work and why.

Myth #2: Data science is too expensive

TLDR. True, data science is not cheap. To cut costs, you can outsource data talent.

Most projects are smaller than you think

Software development is expensive. When you build a huge software system with web and mobile applications, it can take up to a year.

One would think that data science costs even more. After all, it’s trendy and hyped.

But the truth is that most data science projects take only 2-3 months to complete. And you don't need a large team: even 2 specialists can do the job if they know your domain.

Not every company needs data science. But if you need it, the investment pays off fast.

Real-life example

In just 3 months, we built a 2-in-1 tool for a major retail company. It does behavior analysis and forecasts the sales for 3 million users. The result: 7% increase in conversions.

AI-based behavior analysis & sales forecast for retail

Outsource data talent

The average data scientist's salary in the US is around $120,000 per year. In Lithuania (EU), it's around €3,200 a month, or about $42,000 a year.

Sure, the development company gets a cut for operating costs, but even then outsourcing is much cheaper than in-house development.
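To make the comparison concrete, here is a rough back-of-the-envelope sketch. The two salary figures come from the paragraph above. The exchange rate (1.09 USD per EUR) and the 30% vendor markup are illustrative assumptions, not quoted rates, so treat the result as an order of magnitude rather than a price.

```python
# Back-of-the-envelope salary comparison.
# Salary figures are taken from the article; the exchange rate and the
# vendor markup are ASSUMPTIONS used only for illustration.

US_SALARY_USD = 120_000          # average US data scientist, per year
LT_SALARY_EUR_MONTHLY = 3_200    # Lithuania, per month
USD_PER_EUR = 1.09               # assumed exchange rate
VENDOR_MARKUP = 0.30             # hypothetical cut for the vendor's operating costs


def yearly_usd(monthly_eur: float, usd_per_eur: float = USD_PER_EUR) -> float:
    """Convert a monthly EUR salary into a yearly USD amount."""
    return monthly_eur * 12 * usd_per_eur


def outsourced_cost(yearly_salary_usd: float, markup: float = VENDOR_MARKUP) -> float:
    """Yearly cost to the client once the vendor's markup is added."""
    return yearly_salary_usd * (1 + markup)


lt_salary_usd = yearly_usd(LT_SALARY_EUR_MONTHLY)
lt_cost_usd = outsourced_cost(lt_salary_usd)
savings_share = 1 - lt_cost_usd / US_SALARY_USD

print(f"Lithuanian salary:  ${lt_salary_usd:,.0f} per year")
print(f"With vendor markup: ${lt_cost_usd:,.0f} per year")
print(f"Saving vs. US salary: {savings_share:.0%}")
```

The US figure here is salary only. In-house hiring, training, and overhead costs are left out, so the real gap is likely wider than this sketch suggests.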

Here are the downsides of outsourcing. They are typical of lower-end vendors, but even higher-end companies may have some of them.

  • Less control. If you choose not to manage the employees directly, you lose some control over the data science processes and outcomes.
  • Communication challenges. Sometimes, the outsourced talent may speak poor English. Also, expect time zone differences, though most vendors provide several overlapping hours daily.

But there are plenty of advantages as well:

  • Cost savings. Apart from the salary difference, you cut down expenses related to hiring, training, and maintaining full-time staff.
  • Access to expertise. While some vendors offer low-quality talent, outsourcing is also a great way to find employees with rare skills and advanced knowledge.
  • Scalability. Scale resources up or down quickly based on project needs.
  • Focus on core business. Enables your team to focus on core business functions while experts handle data science tasks.

Want to know more? Find out how much your DS project costs

Take 2 minutes to fill out our data science calculator. Plus, we offer a free PoC.

Myth #3: Data will leak

M.A.G.I.C.A: minimize, anonymize, guard (encrypt), isolate, comply, audit

TLDR. Data can leak. But this is true for any software, not just data science. To stay safe, make sure that your vendor takes all safety precautions: anonymize, minimize, and encrypt the data.

Data science needs quality data. But it's the confidential data that yields the most insight. No wonder clients feel insecure sharing it.

Yet, all software development deals with this issue. Here are the rules to keep your data safe:

Check your data science vendor

Sure, you will sign the NDA. But how can you be sure that the DS vendor will follow through?

First, make sure it's not a fly-by-night company. How long have they been around? If it's a young trendy company, there is always a chance that they will disappear with all your data. So see if they have a reputation to keep up.

Check their location. Ideally, the company should have an office in your country to sign local contracts. It's also a good sign if they are based in the US or EU, where privacy is protected by law and you can take the company to court. If the vendor is located at the far end of the globe, it becomes much harder to protect yourself.

So make sure you can trust the vendor. Partner up with reputable companies from countries with strong legal protections.

Minimize and anonymize data

The best way to secure data is to minimize the possible impact of data leaks. That’s why it’s a good idea to remove any personally identifiable information (PII): names, phone numbers, addresses, logins, etc. When there’s nothing important to steal, it becomes less tempting.

The rule of thumb is: do you have to have that PII? If not, get rid of it. How to do that?

  • Anonymize whenever you can. If names are not that important for your data science project, replace them with pseudonyms.
  • Mask data. Replace sensitive info with placeholder characters. That's why banks replace credit card numbers with asterisks (e.g., **** **** **** 1234).
  • Generalize precise data. Instead of storing the exact age of your customer “34”, store the age range “30-40”.
  • Aggregate data to a higher level. Instead of storing individual transaction amounts, store the total amount spent per day.
  • Add noise (perturb). Change salary information just a bit: from $50,000 to $50,123.

All of these methods reduce what an attacker can learn if the data leaks. But the more you anonymize, the less detail is left for analysis. It's a fine balance, so discuss it with your data scientists.
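The five techniques from the list above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a production anonymization pipeline. The function names, the secret key, and the sample record are all hypothetical. Pseudonyms are generated here with a keyed hash, which is one option among several, and the simple random noise shown is not a formal privacy guarantee.

```python
from __future__ import annotations

import hashlib
import hmac
import random
from collections import defaultdict

# Hypothetical key: in practice, load it from a secret store, not from source code.
SECRET_KEY = b"replace-with-a-real-secret"


def pseudonymize(name: str, key: bytes = SECRET_KEY) -> str:
    """Replace a name with a stable pseudonym so records can still be linked."""
    digest = hmac.new(key, name.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"user_{digest[:12]}"


def mask_card(number: str) -> str:
    """Keep only the last four digits of a card number."""
    digits = [c for c in number if c.isdigit()]
    return "**** **** **** " + "".join(digits[-4:])


def generalize_age(age: int, width: int = 10) -> str:
    """Turn an exact age into a range, e.g. 34 -> '30-40' (upper bound exclusive)."""
    low = (age // width) * width
    return f"{low}-{low + width}"


def aggregate_daily(transactions: list[tuple[str, float]]) -> dict[str, float]:
    """Sum individual (date, amount) transactions into one total per day."""
    totals: dict[str, float] = defaultdict(float)
    for day, amount in transactions:
        totals[day] += amount
    return dict(totals)


def add_noise(value: float, scale: float = 0.005,
              rng: random.Random | None = None) -> float:
    """Shift a value by a random amount of at most +/- scale * value."""
    rng = rng or random.Random()
    return value + value * rng.uniform(-scale, scale)


# Hypothetical customer record, before and after anonymization.
record = {"name": "Jane Doe", "card": "4111 1111 1111 1234", "age": 34, "salary": 50_000}
safe = {
    "user": pseudonymize(record["name"]),
    "card": mask_card(record["card"]),
    "age_range": generalize_age(record["age"]),
    "salary": round(add_noise(record["salary"])),
}
print(safe)
```

The `safe` record keeps enough signal for most modeling tasks (a stable user key, an age band, an approximate salary) while the name, full card number, and exact values never leave your system.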

Encrypt everything, including backups

Make sure that no one can read your data even if it gets stolen.

With encryption, your data is transformed into a secure format that can only be accessed with the correct key. Use strong, current standards: AES for data at rest and TLS (the successor to the now-deprecated SSL) for data in transit.

Back up data regularly to protect against data loss from system failures, cyberattacks, or disasters. Don't forget to encrypt the backups as well. A common mistake is to encrypt current data while leaving backups vulnerable. Both should be encrypted, in transit and at rest.
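For the in-transit half, here is a small sketch of a client-side setup using Python's standard `ssl` module. It assumes your data is moved over a TLS connection (TLS being the modern replacement for SSL) and refuses outdated protocol versions. The helper name `make_tls_context` is hypothetical. Encryption at rest, such as AES, is usually handled by a dedicated library or by your storage system, so it is not shown here.

```python
import ssl


def make_tls_context() -> ssl.SSLContext:
    """Build a client-side TLS context that rejects outdated protocol versions."""
    # The default context already verifies server certificates and hostnames.
    context = ssl.create_default_context()
    # Refuse anything older than TLS 1.2 (SSLv3, TLS 1.0, TLS 1.1).
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


context = make_tls_context()
print("Certificate required:", context.verify_mode == ssl.CERT_REQUIRED)
print("Hostname checked:", context.check_hostname)
print("Minimum version:", context.minimum_version.name)
```

A context like this can be passed to standard-library clients, for example `urllib.request.urlopen(url, context=context)`, so that every transfer of data or backups goes through the same checked settings.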

Isolate your systems

Don't keep all your eggs in one basket. When your systems are isolated, a perpetrator who breaks into one of them still won't reach most of the data.

There are many ways to isolate data, but here are the main options:

  • Complete isolation. Completely disconnect your systems, keeping your data physically in separate locations.
  • Network isolation. Segment your network with VLANs or separate subnets.
  • Virtual isolation. Use virtual machines or containers to create separate, isolated environments.
  • Data access isolation. Control access to data through permissions and access controls.
  • Application isolation. Isolate applications to manage security and separate different tasks.
  • Logical isolation. Separate data within databases or storage systems using schemas or tables.

Data isolation is crucial for security, but it also helps with performance and resource management. A well-isolated DS solution can run faster and may cut down cloud computing costs.
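Of the options above, data access isolation is the easiest to illustrate in code. The sketch below is a deliberately simplified permission check: each role may read only the datasets it is explicitly granted. The role names, dataset names, and the in-memory `storage` dictionary are hypothetical. A real system would rely on the access controls of its database or cloud platform.

```python
from __future__ import annotations

# Hypothetical roles and dataset names, used only for illustration.
ROLE_PERMISSIONS: dict[str, frozenset[str]] = {
    "data_scientist": frozenset({"sales_anonymized", "web_events"}),
    "billing_admin": frozenset({"payments_raw"}),
}


class AccessDenied(PermissionError):
    """Raised when a role asks for a dataset it is not allowed to read."""


def can_read(role: str, dataset: str) -> bool:
    """Return True only if the dataset is explicitly granted to the role."""
    return dataset in ROLE_PERMISSIONS.get(role, frozenset())


def read_dataset(role: str, dataset: str, storage: dict[str, list[dict]]) -> list[dict]:
    """Return the dataset's rows, or raise AccessDenied for an unauthorized role."""
    if not can_read(role, dataset):
        raise AccessDenied(f"{role!r} may not read {dataset!r}")
    return storage[dataset]


# Hypothetical in-memory storage standing in for real databases.
storage = {
    "sales_anonymized": [{"user": "user_ab12", "total": 120.0}],
    "payments_raw": [{"card": "4111 1111 1111 1234"}],
}

print(read_dataset("data_scientist", "sales_anonymized", storage))
try:
    read_dataset("data_scientist", "payments_raw", storage)
except AccessDenied as error:
    print("Blocked:", error)
```

The design choice here is deny by default: an unknown role or an unlisted dataset is refused, so a compromised data science account cannot reach raw payment data.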

Comply with regulations

It's not just that you want to keep your data secure. In many cases, you have to. Especially if you're launching globally.

There are no federal AI compliance requirements in the US so far, but there are proposed bills like the Algorithmic Accountability Act.

So if you’re in the US market today, you need to worry about local laws:

  • NYC Local Law 144. Requires companies to audit automated recruitment tools for bias before use.
  • Colorado SB21-169. Prohibits insurance companies from using algorithms that discriminate based on race, color, national origin, religion, sex, sexual orientation, disability, or gender.
  • Connecticut SB 1103. Mandates state agencies to inventory and assess AI systems to ensure they do not discriminate or give unfair advantage.

Many other markets have specific regulations:

The EU Artificial Intelligence Act entered into force on 1 August 2024. Just like GDPR, it will likely affect countries far from the EU. The act breaks down systems into 4 risk categories:

  • Unacceptable risk. Bans social scoring, biometric classification, and public facial recognition, with some law enforcement exceptions.
  • High risk. Regular assessments by EU regulators for products in critical areas like medical devices, cars, education, and law enforcement.
  • General purpose & generative AI. Must be transparent and safe. Advanced models, like GPT-4, are audited, and all AI content must be disclosed and legal.
  • Limited risk. Minimal transparency requirements; AI use must be disclosed.

Other countries are already rolling out new regulations as well: Canada, China, Japan, Brazil, with more countries to come.

Learn more about AI & DS compliance

Audit regularly

Finally, have your solution checked regularly.

Some security checks can be automated. Tools like Amazon GuardDuty monitor your environment, detect threats, and alert you to suspicious activity.

But it’s a good practice to have manual audits regularly. This way you ensure that security policies and controls are effective and up to date.

Takeaway

In conclusion, while data science projects may seem daunting due to common myths about their expense, security, and return on investment, these concerns can be effectively addressed with the right approach.

Demystifying these myths not only paves the way for successful data science initiatives but also enables companies to leverage data-driven insights to drive growth and innovation. Don’t let misconceptions hold you back – embrace data science with confidence and watch your business thrive.
And if you have more myths to bust, talk to our DS consultants.
