Apache Spark Foundation Course - Spark Data Types and Metadata


Welcome back to Learning Journal. This video is the last item in the Spark SQL series. In this concluding video, I will talk about following things.

  1. Spark SQL data types
  2. Spark SQL Metadata
  3. Spark SQL functions and user-defined functions.

Let's start with the Spark SQL data types.

Spark SQL data types

As on date, if you Google for the Spark SQL data types, you won't be able to find a suitable document with the list of SQL data types and appropriate information about them. I mean, I was expecting something like Hive data type document. However, I couldn't find anything similar for Apache Spark SQL. The best thing that I saw was the documentation for org.apache.spark.sql.types package. So, let's look at the same. The package documentshows the list of the Spark SQL data types. Those are all Scala classes. Most of them also have a companion object. And that is the best place to look for the supported data types. However, you cannot use them in SQL with thenames given in the documentation. But you can find the SQL compatible name for those classes. Most of them have a static method to give you the equivalent name for SQL. You can try following examples.

                                
    ShortType.sql
    //gives you SMALLINT. 
    StringType.sql
    //gives you STRING, 
    DateType.sql
    //gives you DATE
    TimestampType.sql
    //gives you TIMESTAMP    
                         

You also have a VarcharType. Try following examples.

                                
    VarcharType(10).sql
    // gives you VARCHAR(10)
    DecimalType(6,2).sql
    //gives DECIMAL(6,2)    
                         

Spark SQL Metadata Catalog

The next item on my list is the metadata. You already know what the Metadata means. Right? Let me quickly refresh your knowledge.
If you have been using traditional databases like Oracle or any other RDBMS like MySQL and PostgreSQL, you must have used the data dictionary or the metadata. In most of the databases, the dictionary is a read-only set of tables that provide information about the database objects. A typical data dictionary contains information as listed below.

  1. Object definitions (tables, views, functions)
  2. Column names and data types
  3. Partitions, location, and other statistics

The data dictionary is primarily used by administrators and the developers when they want to get some details about the database objects. I have also seen some operations teams using metadata information for developing a bunch of scripts to automate their day to day monitoring and investigation work. But all that is not the real purpose of the Metadata store. The data dictionary views are the by-product. The real purpose of metadata store is much more fundamental.

Why do we need Metadata?

An SQL compiler can't work without a metadata store. When you write a SELECT * FROM TABLE1, the compiler doesn't know what the TABLE1 is? Where is the data stored? How to read and interpret it? All those answers are hidden in the metadata store. And that's the real purpose of the metadata store.
So, an SQL compatible database needs a metadata store, and hence the Spark also needs one. The creators of Spark decided to reuse the Hive Meta store instead of developing a new one. That means Spark comes with a bundled Hive Meta store.
However, you might already have a Hive cluster with a functioning Hive Meta store. Now you want to bring the Spark into your ecosystem. And that requirement causes two types of challenges.

  1. The version of your existing Hive Meta store might be different than the one that comes with your Spark.
  2. You might want to use a central Metadata store across your different clusters. The central store allows you to access your tables from all the clusters.

The first point is clear. I don't think there is any confusion on the first point. However, if you haven't seen complex cluster deployments in a cloud environment, you might have some doubts about the second point. Let's try to understand that.

Central Metadata store for Apache Spark

Assume I already have a cluster setup in a cloud environment. I have already integrated a bunch of technologies to design this cluster.I have a YARN cluster with Apache Hive and Spark as well. I also have a Hive Metadata store. The Hive Metadata store is common for Spark and Hive. However, I am storing my data files in S3 bucket, and I have created Spark SQL Tables using those data files.
Now I have a new requirement. I want to execute a machine learning model using the data that I already have. The model needs a lot of CPU and GPU resources, and I don't have that kind of machines in this existing cluster. So, I decided to spin a new Spark cluster with high-end CPU and GPU resources. Since I am keeping my data in the S3 bucket, I can easily access those data files from this new cluster as well. However, the table definitions on those data files are stored in my existing Meta store. The default meta store of the new cluster doesn't have those definitions.
If I try to execute the SELECT * FROM MYTABLE, I will get an error as the table does not exists. How would you fix that?
There are two ways.

  1. Execute all the Create Table scripts in this cluster, and I guess that's a foolish thing to do.
  2. The other alternative is to reconfigure the new cluster to point to the earlier Metadata store.

Even if you have 10K table definitions and hundreds of user-defined functions, all of them will be available to you in this new cluster without re-executing those DDL in this new cluster.
In a large organization, the need for spinning a new cluster keeps arriving that forces you to plan for a central Metadata store.
The configuration and set up a central meta store is not within the scope of this tutorial. So, we leave that part and move on to the methods of accessing the metadata using Spark.
Once you understand the importance of metadata, the next question is how to access it?

How to access Spark Metadata?

There are two methods to access the metadata.

  1. SQL
  2. Spark Catalog API

You are already familiar with the SQL method. It is all about using the DESCRIBE and SHOW commands. Both commands work with database, table, partition and the functions. The SHOW command is to get a list of the objects. Here are some examples.

                                
    SHOW DATABASES;
    SHOW TABLES IN mysparkdb;
    SHOW CREATE TABLE mysparkdb.surveys;
    SHOW COLUMNS FROM mysparkdb.surveys;
    SHOW COLUMNS IN mysparkdb.surveys;      
                         

Similarly, the DESCRIBE command is to get the details about an object. Here are some examples.

                                
    DESCRIBE DATABASE mysparkdb;
    DESCRIBE TABLE mysparkdb.surveys;
    DESCRIBE TABLE EXTENDED mysparkdb.surveys;      
                         

Most of these are documented in Databricks Spark SQL User guide.

Apache Spark Catalog API

The second method to access the metadata information is using API.As on date, you can see a list of five classes in the org.apache.spark.sql.catalog package . You should be able to find an equivalent API for all the SHOW and DESCRIBE commands in the catalog package.
Let's look at some examples.


                                
    spark.catalog.listDatabases
    // same as SHOW DATABASES
    //This API gives you a dataset for the list of all databases. You can display the list using the show method.
    spark.catalog.listDatabases.show
    //You can collect it back to the master node as a Scala Array.
    val dbs = spark.catalog.listDatabases.collect
    //Then you can loop through the array and apply a function on each element. Let's apply the println.
    dbs.foreach(println)        
                         

The point that I want to make is this.
The API gives you the metadata as a distributed collection.You can convert it into a local array or list. And that offers a lot of flexibility in the hand of a programmer.
Great! Let's try to get the details of a column.

                                
    spark.catalog.listColumns("mysparkdb","surveys").show                                  
                         

Well, that shows a lot of things. I am looking to get a list of column name and the data type.And I also want to sort the list by the column name. That's an easy thing to do.

                                
    spark.catalog.listColumns("mysparkdb","surveys")
                 .foreach(x=>println(x.name + "-->" + x.dataType.toUpperCase ))     
                         

Great! I leave you there. You can look at the documentation and explore the variety of those catalog APIs. In fact, you can do anything that you want to do with the metadata.

Spark SQL Functions

The next Item on my list is the SQL functions. If you have been working with SQL, you already know that the SQL function comes in three flavours.

  1. Standard functions
  2. Aggregate functions
  3. Table functions

What is Spark UDF?

The standard functions are the most straightforward. These functions transform values from a single row within a table, and they produce a single corresponding output value per row.For example, an UPPER function returns an uppercase version of the input string.If the desired functionality is not available, you can define your custom function, and we call them a user-defined function or UDF.
Implementing such UDFs is a straightforward thing. I will show you some examples.

What is Spark UDAF?

The next category is for the aggregate function.These functions act on multiple rows at once, but they return a single value as a result.Such functions typically work together with the GROUP BY statement.For example, the SUM function returns the total of the input value for all the rows in the given group. If you do not have a required functionality, you can implement a custom aggregate function as well, and we call then a user-defined aggregate function or UDAF. Creating a UDAF is little more complicated. But nothing to worry, I will show you an example for this as well.

What is Spark UDTF?

The final category is the table functions.These are like a standard function that accepts values from a single row, but they produce multiple corresponding output values per row. For example, an explode function that takes an array and returns each element of the array as an independent value.
So, these functions are known for producing multiple rows against each row from the table.
Apache Spark as on date doesn't have direct support for implementing such custom functions.
However, Hive supported creating such table functions, and we call them as user-defined table functions of UDTF.
As you already know that Spark SQL maintains a healthy relationship with Hive, So, it allows you to import and use all types of Hive functions to Spark SQL. What does that mean?
That means, if you already have a Java or Scala implementation for a bunch of Hive UDFs, you don't have to re-implement them for Spark SQL.You can package them in a Jar and use them in Spark SQL.
I am expecting that the future releases will start supporting UDTF implementation.
However, if you need a UDTF as on date, you can implement it in Hive and use it in Spark.
I will create examples for Spark UDF, UDAF, and Hive UDTF and explain the details in the next video as those details require a separate focus. However, if you want to get a list of all available Spark SQL functions, you can refer to Spark SQL Documentation. Alternatively, you can directly execute SHOW FUNCTIONS command.If you want to get some more details about the function, you can DESCRIBE a specific function.
Great. That's it for this session. See you again in the next video.
Thank you for watching Learning Journal. Keep learning and keep growing.


You will also like:


Pure Function benefits

Pure Functions are used heavily in functional programming. Learn Why?

Learning Journal

Apache Spark Introduction

What is Apache Spark and how it works? Learn Spark Architecture.

Learning Journal

Referential Transparency

Referential Transparency is an easy method to verify the purity of a function.

Learning Journal

Scala Variable length arguments

How do you create a variable length argument in Scala? Why would you need it?

Learning Journal

Free virtual machines

Get upto six free VMs in Google Cloud and learn Bigdata.

Learning Journal