Data Science 101: What It Is, What Data Scientists Do, and Real World Examples
What if I told you that data science could be used in almost every industry you can think of, if not all of them? Healthcare? Check. Finance? Check. The government? Check. If there’s data to be collected, data science can and should be used. Artificial intelligence expert Andrew Ng (founder of DeepLearning.AI) seems to agree. Data science is a process that uses AI principles, and Ng believes that “the shift to data-centric AI is the most important shift businesses need to make today to take full advantage of artificial intelligence.” To put it plainly, data and data science aren’t going anywhere, especially with the continued rise of AI.
So, what is data science? Keep reading to learn more about data science, its process, what data scientists do, and how they bring their applications into the real world.
Let’s dive in!
Table of Contents
- What is Data Science?
- The Data Science Life Cycle
- What is a Data Scientist and What Do They Do?
- Data Science – In the Real World
- The Future of Data Science
What is Data Science?
Data science is the study of data to organize, analyze, and interpret information. Companies and organizations hire data scientists to uncover relevant insights about their business to help guide their strategy. They use data to find patterns and make predictions that can influence decisions like product planning, marketing, trend forecasting, and much more.
At first glance, the term “data science” might seem one-dimensional. It’s data and nothing else, right? Wrong. Data science is a multidisciplinary approach that combines different practices — mathematics, statistics, artificial intelligence, computer science, and advanced analytics techniques like machine learning algorithms, deep learning, and predictive modeling.
The Oxford Learner’s Dictionaries describes data as “facts or information, especially when examined and used to find out things or to make decisions.” When you think — really think — about what data is, it’s easy to come to the conclusion that the world runs on data and that we use the process of data science to make even the smallest decisions.
Why is Data Science Important?
Imagine you’re on the hunt for a new smartphone. You’re weighing the options of upgrading to a newer model that came out a year ago or buying the latest version. Research leads you to the cost of each phone, its features, and tech specifications. You keep track of all the information in a document that you periodically update so you can compare the two phones. You take some time to consider which is best for you before deciding to go with the latest model. To you, all you did was pick a new phone. To me, you used data science to make your decision.
Choosing a new phone seems inconsequential when you compare it to the potential that data science has in industries like healthcare, travel, cybersecurity, and more. Women’s health apps use data science to track menstrual cycles and ovulation schedules, which can help women who are trying to get pregnant. Data science is also at the heart of air traffic control, where traffic flow and weather data are used to suggest flight routes, predict traffic congestion, and reduce delays.
Data is everywhere, and data science is what we use to make sense of it.
The Data Science Life Cycle
To get to the point in the data science process where you can use data to make predictions and decisions, you have to go through the data science life cycle. The number of steps differs depending on who you ask, but if you ask me, the cycle can be neatly tied into five steps: collection (data capturing), warehousing (data maintenance), mining (data processing), exploration and confirmation (data analysis), and reporting (data communication).
Step 1. Data Collection
The data science life cycle begins with data collection. It starts with identifying sources (databases, APIs, online platforms, surveys, etc.) and pulling structured and unstructured data from those sources. Think of customer information (names, addresses, credit card numbers) as structured data and pictures and videos as unstructured data. The data acquisition process includes data entry, but it also calls for data extraction techniques like web scraping, a method of extracting data from websites. Depending on the source, data scientists need different tools for extraction. For example, SQL queries are ideal for databases, while Python scripts are more general-purpose and can extract data from databases, APIs, websites, and CSV files.
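To make the collection step a little more concrete, here is a minimal sketch that pulls rows from two kinds of sources: CSV text and a database queried with SQL. The customer data, the `customers` table, and the function names are all made up for illustration, and an in-memory SQLite database stands in for a real one.

```python
import csv
import io
import sqlite3

# Hypothetical CSV export; in practice this would come from a file or an API response.
CSV_TEXT = """name,city,plan
Ada,London,pro
Grace,New York,free
Linus,Helsinki,pro
"""


def rows_from_csv(text):
    """Parse CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def rows_from_database(connection):
    """Run a SQL query and return each row as a dictionary."""
    connection.row_factory = sqlite3.Row
    cursor = connection.execute("SELECT name, city, plan FROM customers ORDER BY name")
    return [dict(row) for row in cursor]


# Build a throwaway in-memory database so the example is self-contained.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE customers (name TEXT, city TEXT, plan TEXT)")
connection.execute("INSERT INTO customers VALUES ('Alan', 'Manchester', 'free')")

# Both sources end up in the same shape: a list of dictionaries.
collected = rows_from_csv(CSV_TEXT) + rows_from_database(connection)
print(len(collected), "rows collected")
```

Getting every source into one common shape early on tends to make the later steps simpler, though real projects usually need far more error handling than this.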
Step 2. Data Warehousing
The purpose of a warehouse is to store products, equipment, etc. In the data science life cycle, once you’ve collected and extracted data, it needs to be stored and maintained in a data warehouse. Before it makes its way to the warehouse, the data needs to be cleaned and integrated. Data could come in dozens of different formats: text (TXT, HTML, XML), numeric, multimedia (JPEG, PNG, MP3, MP4), tabular (CSV, Excel), and more. Cleaning removes errors like duplicates and missing values, while ETL (extract, transform, load) jobs “combine data from multiple sources into a single, consistent data set for loading into a data warehouse.”
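Here is a rough sketch of what a tiny cleaning-and-ETL job might look like. Everything in it is an assumption made for illustration: the two sources (`crm_rows` and `billing_rows`), their field names, and the in-memory SQLite table that stands in for a real data warehouse.

```python
import sqlite3

# Two hypothetical sources that describe customers in different shapes.
crm_rows = [
    {"email": " Ada@Example.com ", "signup": "2023-01-15"},
    {"email": "grace@example.com", "signup": "2023-02-01"},
    {"email": "", "signup": "2023-02-03"},  # missing email: dropped during cleaning
]
billing_rows = [
    {"Email": "GRACE@example.com", "SignupDate": "2023-02-01"},  # duplicate of a CRM row
    {"Email": "linus@example.com", "SignupDate": "2023-03-09"},
]


def transform(crm, billing):
    """Normalize both sources to one schema, drop bad rows, and remove duplicates."""
    unified = [(row["email"], row["signup"]) for row in crm]
    unified += [(row["Email"], row["SignupDate"]) for row in billing]
    cleaned = {}
    for email, signup in unified:
        email = email.strip().lower()
        if email:  # skip rows with no usable key
            cleaned.setdefault(email, signup)  # keep the first occurrence only
    return sorted(cleaned.items())


def load(rows):
    """Load cleaned rows into an in-memory SQLite table standing in for a warehouse."""
    warehouse = sqlite3.connect(":memory:")
    warehouse.execute("CREATE TABLE customers (email TEXT PRIMARY KEY, signup TEXT)")
    warehouse.executemany("INSERT INTO customers VALUES (?, ?)", rows)
    return warehouse


warehouse = load(transform(crm_rows, billing_rows))
count = warehouse.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
print(count, "clean rows loaded")
```

The extract step is skipped here because the rows are hard-coded; in a real job it would read from the sources identified during collection.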
Step 3. Data Mining (or Processing)
Data mining is the process of sifting through large data sets to identify patterns and relationships. This stage requires statistics, data analytics, and machine learning techniques like classification, which organizes labeled data into categories, and clustering, which groups unlabeled data into “clusters” based on how similar the data points are. This grouping allows the data to be further analyzed during the next step of the data science life cycle.
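To show what grouping unlabeled data can look like, here is a bare-bones sketch of k-means, one common clustering algorithm, on one-dimensional data. The spending figures and starting centers are invented, and a real project would more likely reach for a library than a hand-rolled loop like this one.

```python
def k_means_1d(values, centers, iterations=10):
    """Assign each number to its nearest center, then move each center to its group's mean."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for value in values:
            nearest = min(range(len(centers)), key=lambda i: abs(value - centers[i]))
            clusters[nearest].append(value)
        # Keep the old center if a cluster ends up empty.
        centers = [
            sum(cluster) / len(cluster) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical monthly spend per customer, with no labels attached.
monthly_spend = [12, 15, 14, 90, 95, 88, 11, 92]
centers, clusters = k_means_1d(monthly_spend, centers=[0.0, 100.0])
print(centers)
print(clusters)
```

Nobody told the code which customers are low spenders and which are high spenders; the two groups fall out of the data itself, which is the whole idea behind clustering.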
Step 4. Data Exploration and Confirmation
The data analytics and machine learning needed during processing make a reappearance during the data analysis phase. Alongside classification, this phase leans on regression, another supervised machine learning technique, which models the relationship between variables in order to predict a numeric value. Combined with predictive and qualitative analytics, data scientists analyze the data to explore patterns, make predictions, and guide decision-making.
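As a small illustration of regression, the sketch below fits a straight line to a handful of points using ordinary least squares and then uses that line to make a prediction. The ad spend and sales numbers are invented, and `fit_line` is a hypothetical helper written for this example, not part of any library.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input variable: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data that follows sales = 2 * ad_spend + 1 exactly.
ad_spend = [1, 2, 3, 4, 5]
sales = [3, 5, 7, 9, 11]

slope, intercept = fit_line(ad_spend, sales)
predicted = slope * 6 + intercept  # predict sales for an ad spend of 6
print(slope, intercept, predicted)
```

Real data is never this tidy, so an actual analysis would also check how well the line fits before trusting its predictions.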
Step 5. Data Reporting
The last step in the data science life cycle is for data scientists to communicate their findings. There are multiple ways to go about presenting this information, but the most common is through reports and data visualization. Data visualization is the representation of data through visual graphics—charts, graphs, maps, dashboards, etc. This makes it possible for stakeholders—who likely have no idea about the data science process—to understand the information and transform the data insights into actionable business decisions.
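Data scientists typically build visuals with dedicated tools, but the core idea of turning numbers into a picture can be sketched in a few lines. The example below draws a plain-text bar chart; the regional sales figures and the `text_bar_chart` helper are made up for illustration.

```python
def text_bar_chart(totals, width=20):
    """Render a dict of label -> value as horizontal bars scaled to the largest value."""
    largest = max(totals.values())
    lines = []
    for label, value in totals.items():
        bar = "#" * round(value / largest * width)
        lines.append(f"{label:<8}{bar} {value}")
    return "\n".join(lines)


# Hypothetical sales totals by region.
sales_by_region = {"North": 120, "South": 60, "West": 30}
chart = text_bar_chart(sales_by_region)
print(chart)
```

Even in this crude form, a stakeholder can see at a glance that North outsells the other regions, which is harder to spot in a raw table of numbers.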
What is a Data Scientist and What Do They Do?
I bet you didn’t know that a career in data science is all math, computer engineering, AI, and... sex appeal? It is if you ask Harvard Business Review, which named data scientist the sexiest job of the 21st century. What’s even sexier? The median annual salary for data scientists is $103,500.
As previously mentioned, data science is a multidisciplinary approach that uses the principles of multiple practices and boils them down into one job. So what is a data scientist and what do they do? Harvard Business Review thought this was a question best answered by data scientists, and I happen to agree. Here’s how they described the job of a data scientist:
“First, data scientists lay a solid data foundation in order to perform robust analytics. Then they use online experiments, among other methods, to achieve sustainable growth. Finally, they build machine learning pipelines and personalized data products to better understand their business and customers and to make better decisions. In other words, in tech, data science is about infrastructure, testing, machine learning for decision making, and data products.”
You could consider this the general scope of data science jobs, but their day-to-day responsibilities might look like:
- Identifying valuable and relevant data sources
- Collecting structured and unstructured data
- Performing tests to assess and improve the data collection process
- Building predictive models and machine learning algorithms
- Analyzing data to identify trends and patterns
- Communicating insights and recommendations to stakeholders based on data analysis
Skills and Tools of a Data Scientist
For a successful career, data scientists need well-rounded and highly technical skills. The data science skills, frameworks, and technologies needed to work with large quantities of data include:
- Mathematical skills (linear algebra, linear regression, and statistical analysis)
- Programming languages (Python, R)
- Big data technologies (Apache Hadoop and Apache Spark)
- Experience in data mining
- Database management and tools (SQL, NoSQL, Microsoft Excel)
- Machine learning techniques (linear regression, logistic regression, decision trees, etc.)
- Experience working with APIs
- Data visualization (Tableau, Google Charts, Microsoft Power BI, D3.js)
- Soft skills: Problem-solving, attention to detail, communication
Data scientists aren’t the only tech professionals who need these skills and qualifications. Other data science careers with tons of overlap include data analysts, data engineers, machine learning scientists, data architects, and business intelligence developers.
Data Science – In the Real World
Data science is an everyday process, whether you’re a data scientist or not. Remember our earlier example of choosing a smartphone. This is data science on a small scale. When data scientists go through the data science life cycle, they’re making waves in healthcare, finance, and the government.
Here are some real-life examples of data science to consider:
Healthcare
Data science is used for tracking and preventing diseases, developing new medications, tracking menstrual cycles, developing vaccines, and one of its biggest uses—medical imaging. Imagine a dancer comes into a doctor’s office with a hairline fracture in their foot. These small fractures can be difficult to see with the human eye. Data science makes it possible to develop technologies that can scan these images and pinpoint even the smallest irregularities.
Finance
In the finance industry, data science can be used to analyze the market, detect fraud, forecast financial trends, allocate loans, and manage risks. If you apply for a loan, financial institutions go through a long process for approval. It may take minutes (or even seconds) on your end, but they automate their systems to go through tons of data to identify risks. For example, credit card companies may use data science to look into your financial background and even your social media (yes, these programs may be sifting through your Facebook statuses from 10 years ago) to figure out whether you’re trustworthy and likely to make payments on your account.
Government
The government uses data science in matters like law enforcement, national defense, and detecting tax evasion. One of the more interesting uses of data science is its place in emergency response. In emergencies, data science provides real-time analytics (location, population density, time of day, weather conditions, and more) to help the government optimize its resources, communicate with the public, and mitigate further risks, if necessary. In the event of a big storm, this might look like the government using data science technologies that pull data from the news and social media to plan and prioritize which areas need disaster relief the most.
The Future of Data Science
Data science is a process that lets you take large data sets and make sense of them. Data scientists use their skills (programming languages, machine learning techniques, and data mining experience) and data science tools (big data technology, databases, and data visualization programs) to analyze data and make predictions so businesses can improve their products and services. And while this sounds good for businesses, it’s also good for the greater population. Companies are becoming more data-driven, and when data science is used in finance, transportation, and healthcare, it has the potential to be life-changing and life-saving.
Do you think data science is sexy now? I mean, what’s sexier than a field that can positively impact everyone’s lives? And sure, the six-figure median salary doesn’t hurt either. If you’re interested in a career in data science, the time to start is now. With Skillcrush’s Break Into Tech program, you can learn the fundamentals of web development that will help you prepare for a successful future in data science.