As Data Architects, we all face the same challenge at the first step of the ETL process – identifying the correct data types of the files being ingested. If the data types are identified correctly during ingestion, the end-to-end data pipeline will execute without any type conversion errors.
There are various file formats for data – csv, json, txt, tsv, etc. – but the most popular are csv and json. In this blog we will pick the “csv” format, which is easy to understand.
Compared to the data dictionary feature in database tables (Oracle / MS SQL / Postgres), csv files lack data type information, so all the attributes are usually treated as strings.
In this blog, we will analyze a csv file containing USA house listings from 2001 to 2020. The size of the file is 100MB. The csv file can be downloaded as a zip archive here.
The csv file has these attributes –
- slno
- list_year
- date_recorded
- town
- address
- assessed_value
- sale_amount
- sales_ratio
- property_type
- residential_type
- nonuse_code
- remarks
- opm_remarks
- location
Databricks
When ingesting csv files into Databricks, there are two ways to determine the data types of the attributes in the csv file –
- Automatic – inferSchema set to True
- Manual – User defined Schema with correct data types
1. Automatic – inferSchema
As per the official documentation, when reading the csv file via spark.read.csv and loading it into a dataframe, there is an option to automatically analyze and detect the data types using the inferSchema option.
One option required alongside inferSchema is to set header to ‘true’, so that Databricks uses the first row of the file as the attribute names. This matters for inference: a column may hold integer data while its attribute name in the header row is a string, and treating the header row as data would skew the inferred type.
Below is the code in Databricks –
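A minimal sketch of the read with inferSchema enabled, assuming the same mount path used later in the user-defined schema example –

sales_df = (
    spark.read
    .option("header", True)       # use the first row of the file as attribute names
    .option("inferSchema", True)  # scan the data and guess each column's type
    .csv("/mnt/realestate/real_estate_sales.csv")
)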
Let’s look at the schema and the data types recommended by inferSchema using sales_df.printSchema() –
root
|-- slno: string (nullable = true)
|-- list_year: string (nullable = true)
|-- date_recorded: string (nullable = true)
|-- town: string (nullable = true)
|-- address: string (nullable = true)
|-- assessed_value: string (nullable = true)
|-- sale_amount: string (nullable = true)
|-- sales_ratio: string (nullable = true)
|-- property_type: string (nullable = true)
|-- residential_type: string (nullable = true)
|-- nonuse_code: string (nullable = true)
|-- remarks: string (nullable = true)
|-- opm_remarks: string (nullable = true)
|-- location: string (nullable = true)
From the above schema output, all the attributes default to “string“, even though some of them, like list_year, are integers and date_recorded is a date.
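When the columns come back as strings, the fix downstream is to cast them one by one. A rough sketch of what that looks like – the column names are from the file above, and the date pattern is an assumption about this dataset –

from pyspark.sql.functions import col, to_date

typed_df = (
    sales_df
    .withColumn("list_year", col("list_year").cast("int"))
    .withColumn("assessed_value", col("assessed_value").cast("double"))
    # the date pattern below is an assumed format for this dataset
    .withColumn("date_recorded", to_date(col("date_recorded"), "MM/dd/yyyy"))
)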
Also, if you have noticed, there are 2 jobs/stages to analyze the data types, one of which loads the entire 100MB file into memory for the analysis.
The total execution time is 7.07 seconds.
So there are a few disadvantages to the automatic detection of data types –
- long execution time.
- the entire dataset is loaded into memory for analysis.
- two jobs/stages are needed for the analysis.
2. Manual – User defined Schema
In the manual approach, the data types are specified by the user / developer. The types come from the pyspark.sql.types module: the schema of the csv is defined as a StructType, with a StructField assigning a data type to each attribute.
The types are first imported using
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType
Then each attribute is assigned a data type such as IntegerType, StringType or DoubleType; the third argument of StructField is the nullable flag.
sales_schema = StructType(
    fields=[
        StructField("slno", IntegerType(), False),
        StructField("list_year", IntegerType(), False),
        StructField("date_recorded", StringType(), False),
        StructField("town", StringType(), False),
        StructField("address", StringType(), False),
        StructField("assessed_value", DoubleType(), False),
        StructField("sale_amount", DoubleType(), False),
        StructField("sales_ratio", DoubleType(), False),
        StructField("property_type", StringType(), False),
        StructField("residential_type", StringType(), False),
        StructField("nonuse_code", StringType(), False),
        StructField("remarks", StringType(), False),
        StructField("opm_remarks", StringType(), False),
        StructField("location", StringType(), False)
    ]
)
Finally, the csv file is read in Spark with this schema –
sales_df = spark.read.option("header",True).schema(sales_schema).csv("/mnt/realestate/real_estate_sales.csv")
As per the command execution time above, it takes only 0.40 seconds to read the csv file with the specified attribute types.
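A simple way to reproduce the comparison in a notebook, assuming both reads run against the same file and sales_schema is defined as above –

import time

start = time.time()
inferred_df = (spark.read.option("header", True)
               .option("inferSchema", True)
               .csv("/mnt/realestate/real_estate_sales.csv"))
print(f"inferSchema read: {time.time() - start:.2f}s")

start = time.time()
manual_df = (spark.read.option("header", True)
             .schema(sales_schema)
             .csv("/mnt/realestate/real_estate_sales.csv"))
print(f"user-defined schema read: {time.time() - start:.2f}s")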
The printSchema() output also reflects the same data types that were provided when loading the csv file.
Conclusion –
Comparing inferSchema with a user-defined schema shows a roughly 94% decrease in processing time (0.40 seconds vs 7.07 seconds) when the user-defined schema is used to assign the attribute types.
Even though it takes more effort up front to explicitly set the attribute names and their data types using the schema option, the end-to-end ETL data pipeline will run successfully without any data type issues, and the csv file loads in far less time, which translates into better memory and CPU usage.