Set of CSVs as a database for Power BI

What if part of your reporting database is a set of CSV files?

Apart from possible problems with different number of columns, data types, delimiters, encoding etc., you have to care about performance.

According to my practice, large number of files kills productivity. It is better to firstly combine CSV / TXT files into one, then use Power Query to load it.

Experiment with reasonable number of files

In this scenario I just read and load all files into one table

Set of files One BIG file
40 files 1 file
584 MB 598 MB
1 min 15 sec 53 sec

With transformation like “Change Data Type”, Remove some columns, Replace Values:

Set of files One BIG file
1 min 29 sec 1 min 4 sec

As we may see, simple transformation only slightly increased load time.

Let’s double number of files:

Set of files One BIG file
80 files 1 file
1.14 GB 1.2 GB
2 min 45 sec 1 min 40 sec

Difference: 55 sec. Significant!

Seems, like it is always better to combine files into one.

How to combine CSV files into one

Locate your folder with CSV/TXT files in File Explorer, hold SHIFT and use right-click:

clip_image001[4]

Use command

copy *.csv result.csv

clip_image002[4]

All files in the folder will be combined into one.

If you have headers in each file, filter them out later in Power Query.

If you want to save resulting file to another folder – I recommend to read more about COPY command here and here.

Experiment with large number or files

Around a year ago or so, I had an experience of working with ~2000 CSV files.

I don’t have those files now, but I remember that it took for me around 30 min to load all those files to Power BI Desktop.

Then I combined them into one and load time dropped to 5 min!

Out of curiosity, I decided to check current version of Power BI Desktop and confirm “theory of big file”.

I generated fake files and compared load time (read and load, without transformation)

Set of files One BIG file
679 files 1 file
2.4 GB 2.5 GB
5 min 19 sec 3 min

We may see significant difference! Data Model gets the same information, but we release workstation resources on 2 min earlier.

And this is just a “read and load”. No transformations, no relationships in data model.

CSV vs XSLX

You may read here why I do not consider XLSB as a source for Power Query.

Just a few days ago, Ken Puls published very interesting results of his experiment with csv and xl files. Must read if your reporting database is set of files.

Although, XLSX is slower, it still can be useful in simple scenarios.

When I plan architecture of reporting solution, I keep in mind following matrix

Small data volume Large data volume
Small number of files XSLX or CSV if you sure in data quality CSV
Large number of files (>24) CSV – > combine into one big CSV CSV – > combine into one big CSV

Of course, there should be a balance between number of files and data volume. At some point it is better to go away from XLSX.

Pay attention, that with CSV and TXT you have to care about encoding, delimiters, possible “carriage return” symbols, quotes etc. If you have no control over CSV creation and it comes from untrusted source – be ready for unpleasant surprises.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s