Hey all, we had a good discussion around V Order in the recent monthly call. Now that I also have answers to some open questions I had, I'm sharing all my learnings here with you. [Long post ahead]
What is V Order?
- V Order is an optimization for Parquet files that is specific to Fabric.
- As a Delta table holds Parquet files underneath, it is applied to those Parquet files as well.
Idea of V Order:
- It applies some additional sorting and compression to the Parquet files while WRITING (taking ~15% more time), which makes READs much faster (up to 50%) for the Fabric engines (Spark, SQL, Power BI).
Key points:
- Any Parquet file that you "write" in Fabric (not copy, upload, or shortcut) gets the V Order optimization applied by default.
- For example, if you write a Parquet file using a Copy data activity in a data pipeline, the resulting Parquet file will be V ordered.
- If you write a Parquet file using a Spark notebook, the resulting file will be V ordered as well.
- The same holds true in both of the above examples when the format is Delta.
How to disable it: (Screenshot attached)
- For Spark notebooks, you can use a spark.conf command to set the V Order configuration to false.
- For data pipelines, you can open the file format settings and untick the V Order option. (You only get this option if the file format is Parquet.)
- Dataflows only write Delta tables, and I couldn't find any option to disable V Order there.
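For the notebook case, here is a minimal sketch of what the spark.conf command could look like. Big assumption: the post doesn't name the config key, and the key names below are the ones I recall from the Fabric docs (`spark.sql.parquet.vorder.default` in newer runtimes, `spark.sql.parquet.vorder.enabled` in older ones), so please verify them for your runtime. The helper name `disable_vorder` is just for illustration.

```python
# ASSUMPTION: these key names are recalled from the Fabric docs, not taken
# from this post. Check the documentation for your Fabric runtime version.
VORDER_CONF_KEYS = (
    "spark.sql.parquet.vorder.default",  # assumed key for newer runtimes
    "spark.sql.parquet.vorder.enabled",  # assumed key for older runtimes
)


def disable_vorder(spark, keys=VORDER_CONF_KEYS):
    """Set each candidate V Order session config to "false".

    `spark` is the SparkSession that a Fabric notebook already provides.
    Returns a dict of the keys and the values read back from the session.
    """
    applied = {}
    for key in keys:
        spark.conf.set(key, "false")       # session-level setting only
        applied[key] = spark.conf.get(key)  # read back to confirm
    return applied
```

In a notebook cell you would call `disable_vorder(spark)` before the write. It only affects the current Spark session, not the pipeline or dataflow settings.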
How to check if a parquet file is v ordered or not? (Screenshot attached)
- V ordered Parquet files look no different from normal Parquet files. The only difference is in the metadata of the Parquet file.
- You can read the metadata of the Parquet file using code.
- Or you can use a Parquet viewer to open and read the file directly.
- You will NOT find the highlighted key "com.microsoft.parquet.vorder.enabled" in the metadata of a normal Parquet file.
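For the "using code" option, a rough sketch with pyarrow (my choice of library, any Parquet reader that exposes the footer key-value metadata should work). It only checks whether the key is present, as described above, and does not interpret its value. The function names and the sample path are made up for illustration.

```python
# Metadata key mentioned above; pyarrow returns footer keys as bytes.
VORDER_KEY = b"com.microsoft.parquet.vorder.enabled"


def read_kv_metadata(path):
    """Return the key-value metadata from a Parquet file footer as a dict."""
    import pyarrow.parquet as pq  # imported here so the check below has no dependency

    # FileMetaData.metadata is a dict of bytes -> bytes, or None if absent.
    return pq.read_metadata(path).metadata or {}


def is_v_ordered(kv_metadata):
    """True if the V Order key is present in the given metadata dict."""
    # Accept both str and bytes keys, since viewers/tools differ.
    keys = {k.encode() if isinstance(k, str) else k for k in kv_metadata}
    return VORDER_KEY in keys
```

Usage would be something like `is_v_ordered(read_kv_metadata("/lakehouse/default/Files/sample.parquet"))`, where the path is a hypothetical example.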
PS: These are just based on my own findings, so please correct me if anything is inaccurate 😊