
Convert Unstructured Data to Structured Data Using Python
Unstructured data is data that does not follow any specific data model or format. It can come in different forms such as text, images, audio, and video. Converting unstructured data to structured data is an important task in data analysis, because structured data is easier to query, analyse, and extract insights from. Python provides various libraries and tools for this conversion.
In this article, we will explore how to convert unstructured biometric attendance data (employee punch records) into a structured format using Python, allowing for more meaningful analysis and interpretation of the data.
There are different approaches to converting unstructured data into structured data in Python. In this article, we will discuss the following two:
Regular Expressions (Regex): This approach involves using regular expressions to extract structured data from unstructured text. Regex patterns can be defined to match specific patterns in the unstructured text and extract the relevant information.
Data Wrangling Libraries: Data wrangling libraries such as pandas can be used to clean and transform unstructured data into a structured format. These libraries provide functions to perform operations such as data cleaning, normalisation, and transformation.
Using Regular Expressions
Consider the code shown below.
Example
import re
import pandas as pd

# sample unstructured text data
text_data = """
Employee ID: 1234
Name: John Doe
Department: Sales
Punch Time: 8:30 AM
Employee ID: 2345
Name: Jane Smith
Department: Marketing
Punch Time: 9:00 AM
"""

# define regular expression patterns to extract data
id_pattern = re.compile(r'Employee ID: (\d+)')
name_pattern = re.compile(r'Name: (.+)')
dept_pattern = re.compile(r'Department: (.+)')
time_pattern = re.compile(r'Punch Time: (.+)')

# create empty lists to store extracted data
ids = []
names = []
depts = []
times = []

# iterate through each line of the text data
for line in text_data.split('\n'):
    # check if the line matches any of the regular expression patterns
    if id_pattern.match(line):
        ids.append(id_pattern.match(line).group(1))
    elif name_pattern.match(line):
        names.append(name_pattern.match(line).group(1))
    elif dept_pattern.match(line):
        depts.append(dept_pattern.match(line).group(1))
    elif time_pattern.match(line):
        times.append(time_pattern.match(line).group(1))

# create a dataframe using the extracted data
data = {'Employee ID': ids, 'Name': names, 'Department': depts, 'Punch Time': times}
df = pd.DataFrame(data)

# print the dataframe
print(df)
Explanation
First, we define the unstructured text data as a multiline string.
Next, we define regular expression patterns to extract the relevant data from the text. We use the re module in Python for this.
We create empty lists to store the extracted data.
We iterate through each line of the text data and check if it matches any of the regular expression patterns. If it does, we extract the relevant data and append it to the corresponding list.
Finally, we create a Pandas dataframe using the extracted data and print it.
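One detail of the loop is worth spelling out: a compiled pattern's match() method only succeeds when the pattern matches at the very start of the string, so a line with leading whitespace would be skipped silently. The following minimal sketch shows the difference from search() and one way to guard against it. The indented input line is a hypothetical example, not part of the sample data.

```python
import re

id_pattern = re.compile(r'Employee ID: (\d+)')

# Hypothetical input line with leading whitespace (for example, from an indented block)
indented_line = "    Employee ID: 1234"

# match() anchors at position 0, so the leading spaces prevent a match
match_result = id_pattern.match(indented_line)

# search() scans the whole string and finds the pattern anywhere in it
search_result = id_pattern.search(indented_line)

# Stripping the line first lets match() work as intended
stripped_result = id_pattern.match(indented_line.strip())

print(match_result)              # None
print(search_result.group(1))    # 1234
print(stripped_result.group(1))  # 1234
```

If the source text may be indented, calling line.strip() before matching is a small change that keeps the rest of the loop unchanged.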
Output
  Employee ID        Name Department Punch Time
0        1234    John Doe      Sales    8:30 AM
1        2345  Jane Smith  Marketing    9:00 AM
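As a possible variation on the line-by-line approach, the four fields of one employee can be captured together with a single pattern that uses named groups. This is a sketch under the assumption that every record lists its fields in the same order as the sample text. The names record_pattern and records are illustrative and do not come from the example above.

```python
import re

# Same sample text as in the example above
text_data = """
Employee ID: 1234
Name: John Doe
Department: Sales
Punch Time: 8:30 AM
Employee ID: 2345
Name: Jane Smith
Department: Marketing
Punch Time: 9:00 AM
"""

# One pattern per record; each named group becomes a key in the result.
# '.' does not match a newline, so each '.+' stops at the end of its line,
# and '\s+' consumes the line break before the next field.
record_pattern = re.compile(
    r'Employee ID: (?P<employee_id>\d+)\s+'
    r'Name: (?P<name>.+)\s+'
    r'Department: (?P<department>.+)\s+'
    r'Punch Time: (?P<punch_time>.+)'
)

# finditer() yields one match object per complete record
records = [m.groupdict() for m in record_pattern.finditer(text_data)]

for record in records:
    print(record)
```

Because each record is captured as a whole, the fields of one employee stay together. A list of dictionaries like this can also be passed directly to pd.DataFrame to obtain a table with one column per key.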
Using Pandas Library
Suppose we have raw punch records stored in a file named unstructured_data.csv, with one row per punch event, that look like this.
employee_id,date,time,type
1001,2022-01-01,09:01:22,Punch-In
1001,2022-01-01,12:35:10,Punch-Out
1002,2022-01-01,08:58:30,Punch-In
1002,2022-01-01,17:03:45,Punch-Out
1001,2022-01-02,09:12:43,Punch-In
1001,2022-01-02,12:37:22,Punch-Out
1002,2022-01-02,08:55:10,Punch-In
1002,2022-01-02,17:00:15,Punch-Out
Example
import pandas as pd

# Load the raw punch records (columns: employee_id, date, time, type)
unstructured_data = pd.read_csv("unstructured_data.csv")

# Combine the 'date' and 'time' columns into a single timestamp
unstructured_data['timestamp'] = pd.to_datetime(unstructured_data['date'] + ' ' + unstructured_data['time'])

# Pivot the table to get the 'Punch-In' and 'Punch-Out' timestamp for each employee on each date
# (passing a list as index requires pandas 1.1 or later)
structured_data = unstructured_data.pivot(index=['employee_id', 'date'], columns='type', values='timestamp').reset_index()

# Rename column names
structured_data = structured_data.rename(columns={"Punch-In": "punch_in", "Punch-Out": "punch_out"})

# Calculate the time worked by subtracting 'punch_in' from 'punch_out'
structured_data['hours_worked'] = structured_data['punch_out'] - structured_data['punch_in']

# Keep only the time of day in the punch columns for display
structured_data['punch_in'] = structured_data['punch_in'].dt.time
structured_data['punch_out'] = structured_data['punch_out'].dt.time

# Print the structured data
print(structured_data)
Output
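type  employee_id        date  punch_in punch_out hours_worked
0            1001  2022-01-01  09:01:22  12:35:10     03:33:48
1            1001  2022-01-02  09:12:43  12:37:22     03:24:39
2            1002  2022-01-01  08:58:30  17:03:45     08:05:15
3            1002  2022-01-02  08:55:10  17:00:15     08:05:05
Depending on the pandas version, the hours_worked values may be printed with a leading day count, for example 0 days 03:33:48.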
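For reporting or payroll calculations, a duration is often easier to work with as a plain number of hours. The sketch below rebuilds two rows of the table above by hand and converts the duration with the .dt.total_seconds() accessor. The column name hours_decimal is an assumption introduced here for illustration.

```python
import pandas as pd

# Two rows copied from the output above, with punch times as full timestamps
structured_data = pd.DataFrame({
    'employee_id': [1001, 1002],
    'punch_in': pd.to_datetime(['2022-01-01 09:01:22', '2022-01-01 08:58:30']),
    'punch_out': pd.to_datetime(['2022-01-01 12:35:10', '2022-01-01 17:03:45']),
})

# Subtracting two datetime columns gives a timedelta column
structured_data['hours_worked'] = structured_data['punch_out'] - structured_data['punch_in']

# Convert each timedelta to a float number of hours (seconds / 3600)
structured_data['hours_decimal'] = structured_data['hours_worked'].dt.total_seconds() / 3600

print(structured_data[['employee_id', 'hours_decimal']])
```

A numeric column like this can be summed or averaged per employee with the usual groupby operations.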
Conclusion
In conclusion, unstructured data can be difficult to analyse and interpret. However, with Python and approaches such as regular expressions and data wrangling libraries like pandas, it is possible to convert unstructured data into a structured format that is ready for analysis.