
Export Parquet Files with Column Names with Spaces

In this blog post, I’ll show you a design pattern to export parquet files with column names that contain spaces. You can use Azure Data Factory or Azure Synapse Analytics to overcome this problem, and I will use dynamic T-SQL to build the queries.

Introduction 

To begin, one of the limitations when exporting data to parquet files in Azure Synapse Analytics or Azure Data Factory is that you can’t export tables that have columns with blank spaces in their names.

One of my readers, Marcus, asked me about how to do this recently, so I thought I’d write about it. (Leave me a comment if you ever have any questions, and your answer may turn into a helpful blog post!) 

For example, the following table cannot be exported with column headers because the names contain spaces.

Table that cannot be exported

The solution is to rename the columns when querying the data, while keeping the development effort to a minimum.

If you try to export the table as-is, you get the following error:

"message": "Failure happened on 'Sink' side. ErrorCode=ParquetInvalidColumnName,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=The column name is invalid. Column name cannot contain these character:[,;{}()\\n\\t=],Source=Microsoft.DataTransfer.Common,'", 
Error message

Prerequisites (recommended)

First, have a look at this blog post on how to export parquet files in Azure to understand the solution. 

Prepare the Dynamic T-SQL Code 

In this example, I’ll use Azure SQL Database as the source system. If you are using a different Relational Database Management System (RDBMS), the design pattern will be the same; just change the dynamic queries.

The main idea is to generate a dynamic query that modifies the column names to remove the spaces. If you haven’t worked with Dynamic SQL before, I suggest learning about it. I learned this in my early years from my mentor Enrique Viamontes.  

In T-SQL, you can use: 

DECLARE @s VARCHAR(MAX)
DECLARE @tablename VARCHAR(500) = 'DimAccount'

SELECT @s = ISNULL(@s + ', ', '') + '[' + c.name + ']' + ' as [' + REPLACE(c.name, ' ', '') + '] '
FROM sys.all_columns c
JOIN sys.tables t ON c.object_id = t.object_id
WHERE t.name = @tablename

SELECT 'select ' + @s + ' from ' + @tablename AS Query
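As an illustration, suppose DimAccount has two columns with spaces in their names, [Account Key] and [Account Description] (hypothetical names, used only for this example). The script above would then return a query along these lines, ready to be used as the source of the copy activity:

select [Account Key] as [AccountKey], [Account Description] as [AccountDescription] from DimAccount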

Export Parquet Files with Column Names with Spaces 

Now, let’s include the code in an integration pipeline (Azure Data Factory or Synapse Analytics) using a Lookup Activity. 

In your ForEachTable, add a Lookup activity as follows and select the Query option.

Export parquet files with columns with spaces

Paste the following query: 

@concat('
DECLARE @s VARCHAR(MAX)
DECLARE @tablename VARCHAR(500) = ''', item().Table_Name ,'''

SELECT @s = ISNULL(@s + '', '','''') + ''['' + c.name + '']'' + '' as [''+replace(c.name,'' '','''')+''] ''
FROM sys.all_columns c join sys.tables t
ON c.object_id = t.object_id
WHERE t.name = @tablename

SELECT ''select '' + @s + '' from '' + @tablename as Query')
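If you are wondering how the doubled single quotes resolve, at runtime the expression renders the same T-SQL we prepared earlier, with the table name injected as a literal. For an item whose Table_Name is DimAccount (the table from the earlier example; your ForEach items will differ), the first lines become:

DECLARE @s VARCHAR(MAX)
DECLARE @tablename VARCHAR(500) = 'DimAccount'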

Now, modify the copy activity source query. 

Modify the copy activity query

Paste the following code. 

@activity('RemoveBlankSpacesColumnNames').output.firstRow.Query 

Now, the copy activity will export the data correctly because the column names are renamed in the source query.

Columns are renamed correctly
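The error message at the start of this post lists other characters that parquet rejects in column names ([,;{}()\n\t=]). The same design pattern covers them by nesting REPLACE calls inside the dynamic query. As a minimal sketch (handling spaces, semicolons, and commas only; extend the nesting for the remaining characters), the SELECT @s line would become:

SELECT @s = ISNULL(@s + ', ', '')
          + '[' + c.name + '] as ['
          + REPLACE(REPLACE(REPLACE(c.name, ' ', ''), ';', ''), ',', '')
          + '] '
FROM sys.all_columns c
JOIN sys.tables t ON c.object_id = t.object_id
WHERE t.name = @tablename

Remember to escape the single quotes (by doubling them) if you embed this variation in the @concat() expression of the Lookup activity.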

Summary  

Today you have learned how to overcome one of the limitations when working with parquet files: column names that contain spaces.

What’s Next?      

In upcoming blog posts, we’ll continue to explore Azure Data Service features.          

Please follow Tech Talk Corner on Twitter for blog updates, virtual presentations, and more!


12 Responses
  • Kasper Lindblad Nielsen
    09 . 05 . 2021

    This was extremely helpful thank you so much David! The example easily extends to other parquet special characters by just nesting the replace function. I will have to look into these dynamic queries it seems like a very powerful tool!

    • David Alzamendi
      12 . 05 . 2021

      Thank you for your feedback Kasper, dynamic queries are great in combination with Azure Data Factory!

    • Ajay
      14 . 06 . 2021

      Hi Kasper,

      I am new to query languages and ADF. Can I know how to extend this to other parquet special characters like [,;{}()\n\t=], which can cause errors when converting to parquet?

  • Sid
    10 . 08 . 2021

    Hi David,

    Is there a way to port this solution to MySQL source?

    Regards,
    Sid

  • prateek gandhi
    24 . 11 . 2021

    Hi SID,

    Did you get any solution for mysql source?

    Regards
    Prateek

  • Zac
    23 . 02 . 2022

    This is excellent in theory, but when trying to implement it in practice, it doesn’t seem to work. The comma introduced in the below line ends up placing a comma at the end of the list of columns, before the from clause in the final product:

    SELECT @s = ISNULL(@s + '', '','''') + ''['' + c.name + '']'' + '' as [''+replace(c.name,'' '','''')+''] ''

    As such, the operation fails citing a syntax error near “from”, because there is a comma before it. Can you help me understand what I’m missing? Thanks!

    • Zac
      24 . 02 . 2022

      Sorry, I figured it out. The value for @s in my use case needed to be longer than 500 characters, so it was truncating the column list until I expanded the length of that variable. The solution works great, thanks!

      • David Alzamendi
        01 . 04 . 2022

        Thank you for the feedback Zac, I don’t even quite remember why I selected VARCHAR(500) (I guess just as an example). I changed it to MAX.

  • Luke
    23 . 03 . 2022

    Hi David, Thank you for taking the time to share your solution. It is an interesting way to dynamically change the query per table in a ForEach. Very helpful. I did need to increase the size of @s due to truncation of larger views and did a simple modification to integrate using the table’s schema.

    Interestingly we have some curated views for tabular models that have some quoted columns with spaces that we are caching to parquet files and reintroducing the spaces back when consuming the parquet files out through serverless Synapse, by leveraging the spaces in the original view’s metadata but removing spaces from the added with() block in the external view. Your solution was helpful in that. We used sys.views instead of sys.tables but I left it as sys.tables to be as close as possible to your original. Thanks again!

    @concat('
    DECLARE @s VARCHAR(MAX)
    DECLARE @tablename VARCHAR(500) = ''', item().Table_Name ,'''
    DECLARE @schemaname VARCHAR(500) = ''', item().Schema_Name ,'''
    DECLARE @fulltablename VARCHAR(1000) = ''', item().Schema_Name,'.',item().Table_Name ,'''

    SELECT @s = ISNULL(@s + '', '','''') + ''['' + c.name + '']'' + '' as [''+replace(c.name,'' '','''')+''] ''
    FROM sys.all_columns c
    join sys.tables t ON c.object_id = t.object_id
    join sys.schemas s ON t.schema_id = s.schema_id
    WHERE s.name = @schemaname AND t.name = @tablename

    SELECT ''select '' + @s + '' from '' + @fulltablename as Query')

    • David Alzamendi
      01 . 04 . 2022

      Amazing contribution Luke! I might review the blog post and include your contribution soon!
