Using CSV files in batch processing applications has many advantages, most prominently interoperability between programming languages and tools. Data integrity, however, is one of its weaker points: the format has no way to declare data types or any metadata beyond assigning names to data fields using a header.
The simple metadata format proposed in this article can help to mitigate this disadvantage.
A Case for CSV
First of all, why would anyone use such a simple plain text format? After all, it's just a semi-structured collection of data sets.
Sure, CSV isn't exactly the most sophisticated data format available, but it has many advantages that make up for this flaw:
- CSV is simple and well-understood.
- Libraries are available for many programming languages, and where none exists, one can easily be written.
- It can be analysed in a text editor.
- Samples of a file can be loaded into a spreadsheet application.
- Files can be processed using standard Unix utilities.
- It is easy to split CSV files into individual, self-contained chunks.
See The Pragmatic Programmer, "The Power of Plain Text" (chapter 3) for a more detailed discussion.
Unfortunately, there are disadvantages, too. Parsing the file format is usually a bit more expensive than for carefully designed binary formats (though it's much cheaper than parsing XML). And since the ASCII representation of numbers can easily be three times as big as the usual binary representation, CSV files tend to be rather large.
Fortunately, streaming compressors like gzip typically reduce files to about 20% of their original size, even at the lowest compression level. Compression saves disk space and network bandwidth, but it comes at a slightly increased processing cost. In many scenarios, however, the added overhead is negligible compared to the cost of the actual processing.
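To illustrate, here is a minimal sketch in Python of writing and reading a compressed CSV chunk as a stream. The file name, field names, and rows are made up for illustration; they are not part of any fixed format.

```python
import csv
import gzip

rows = [
    ["USER_ID", "NAME", "AGE", "COUNTRY"],
    ["u1", "Alice", "34", "DE"],
    ["u2", "Bob", "0", "?"],
]

# compresslevel=1 is the cheapest setting; it still shrinks typical
# CSV data considerably while keeping CPU overhead low.
with gzip.open("data_1.csv.gz", "wt", encoding="utf-8",
               compresslevel=1, newline="") as f:
    csv.writer(f, delimiter="\t").writerows(rows)

# Decompression is streamed, too: no temporary uncompressed file needed.
with gzip.open("data_1.csv.gz", "rt", encoding="utf-8", newline="") as f:
    header = next(csv.reader(f, delimiter="\t"))

print(header)  # ['USER_ID', 'NAME', 'AGE', 'COUNTRY']
```

Because gzip streams, compression and decompression compose naturally with record-at-a-time batch processing.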
Using a Simple Metadata Format
There are basically three places where metadata can reside:
- Inside the data file.
- As a file next to the actual data files.
- In a remote metadata repository.
Embedding metadata in the data file would mean abandoning the CSV format along with the advantages listed above. A remote metadata repository may be useful, but testing is a lot easier when you aren't coupled to network resources, so a file next to the data seems to be the way to go.
The metadata format proposed here uses XML because it is human readable, well-understood and has excellent tool support (that should sound familiar). The format serves several purposes:
- It declares the overall representation of CSV files (field separator, whether compression is used etc.).
- It lists the set of files that make up the entire collection of data sets.
- It defines the data fields and their respective types.
Here's an example of how it looks (see the Relax NG Schema for the semi-formal definition):
<?xml version="1.0" encoding="UTF-8"?>
<meta xmlns="http://mafr.de/ns/meta-0.1#">
  <file-location format="csv" compression="gzip" separator="\t">
    <chunk location="data_1.csv.gz" size="123456"/>
    <chunk location="data_2.csv.gz" size="34354"/>
    <!-- more chunk declarations ... -->
    <chunk location="data_N.csv.gz" size="7890123"/>
  </file-location>
  <data-fields>
    <string-field name="USER_ID"/>
    <string-field name="NAME"/>
    <continuous-field name="AGE" missing="0"/>
    <categorical-field name="COUNTRY" missing="?">
      <category value="DE"/>
      <category value="UK"/>
      <category value="OTHERS"/>
    </categorical-field>
  </data-fields>
</meta>
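Reading this format requires nothing beyond a standard XML library. The following sketch uses Python's `xml.etree.ElementTree` on an abbreviated copy of the example above; in practice you'd load the metadata file from disk, and the exact variable names are of course just illustration.

```python
import xml.etree.ElementTree as ET

# Abbreviated inline copy of the metadata example.
META = """\
<meta xmlns="http://mafr.de/ns/meta-0.1#">
  <file-location format="csv" compression="gzip" separator="\\t">
    <chunk location="data_1.csv.gz" size="123456"/>
    <chunk location="data_2.csv.gz" size="34354"/>
  </file-location>
  <data-fields>
    <string-field name="USER_ID"/>
    <continuous-field name="AGE" missing="0"/>
  </data-fields>
</meta>"""

NS = {"m": "http://mafr.de/ns/meta-0.1#"}
root = ET.fromstring(META)

loc = root.find("m:file-location", NS)
chunks = [(c.get("location"), int(c.get("size")))
          for c in loc.findall("m:chunk", NS)]

# The separator is stored escaped ("\t" as two characters); the
# application maps it to the real tab character.
separator = {"\\t": "\t"}.get(loc.get("separator"), loc.get("separator"))

# Field order matters, so keep the declaration order from the file.
fields = [f.get("name") for f in root.find("m:data-fields", NS)]
```

Note that all lookups are namespace-qualified; a consumer that ignores the namespace will silently find nothing.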
The data fields have to be listed in the order they appear in the CSV files. Reading applications may choose to accept any order though. Field names have to be unique and may not contain the field separator for obvious reasons.
Depending on policy, applications can either ignore unknown elements (or attributes) or flag an error. There's also the option of declaring optional parts of the format in a different XML namespace. Some applications can use those elements while others may ignore them safely.
There are several ways of extending XML schemas, either by explicitly allowing unvalidated content (usually in a different namespace) or by including the basic schema from another, more specialized schema that extends definitions as necessary. Updating the base schema can be done via namespace and/or schema versioning, but since this isn't entirely trivial I'll leave it for a future article.
A Note on Container Formats
When you split your files into multiple chunks for parallel processing, you end up with lots of files. To avoid confusion and to simplify transfer between systems you might be tempted to use a container format that packages all your data into a single file (using tar, for example). This may work in some cases, but if your data files are large and created in parallel, the creation of the container is a long and I/O-intensive operation. In batch operations this causes a significant overhead that you have to subtract from your time window.
A compromise is to use a natural but cheap container format: a file system directory. It may only be a "virtual" container, but combined with a proper delivery protocol, it's still useful.
The Delivery Protocol
If you're handing collections of CSV files over from one system to another, make sure you follow a simple protocol: copy the data files first and the metadata file last. The delivery isn't complete until the metadata file is present. The best approach is to take advantage of the atomic renaming feature many file systems provide (see the rename(2) syscall on Unix): copy the metadata to a temporary file, then rename it. That way, the receiver will never try to read an incomplete file.
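The protocol can be sketched in a few lines of Python. The function name, the `meta.xml` file name, and the `.part` suffix are assumptions for the sake of the example, not part of the format:

```python
import os
import shutil

def deliver(src_dir, dst_dir, data_files, meta_file="meta.xml"):
    """Copy a chunked CSV collection: data files first, metadata last.

    The metadata file is copied under a temporary name and then renamed
    atomically (rename(2) on Unix), so a receiver polling dst_dir for
    meta.xml never reads a half-copied file.
    """
    for name in data_files:
        shutil.copy(os.path.join(src_dir, name),
                    os.path.join(dst_dir, name))
    tmp_path = os.path.join(dst_dir, meta_file + ".part")
    shutil.copy(os.path.join(src_dir, meta_file), tmp_path)
    # Atomic when source and target are on the same file system.
    os.replace(tmp_path, os.path.join(dst_dir, meta_file))
```

The receiver's side is symmetric: poll for the metadata file, and only once it appears start reading the data files it lists.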
The Verification Process
With the information contained in the metadata file, the CSV files can be verified as thoroughly as required. The most basic check only makes sure that all of the listed files are present and have the declared file sizes. A simple consistency check would additionally parse the headers to see that all data fields are there; this also detects whether the field separator is correct.
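These two cheap checks can be combined in a few lines. The function signature below is illustrative (a real implementation would take a parsed metadata object), and it assumes gzip-compressed chunks as declared in the example metadata:

```python
import csv
import gzip
import os

def verify(directory, chunks, field_names, separator="\t"):
    """Basic verification driven by the metadata: every declared chunk
    exists, has the declared size, and its header lists the declared
    fields in order. Returns a list of problem descriptions."""
    errors = []
    for location, size in chunks:
        path = os.path.join(directory, location)
        if not os.path.isfile(path):
            errors.append(f"missing chunk: {location}")
            continue
        if os.path.getsize(path) != size:
            errors.append(f"size mismatch: {location}")
        # Reading only the header line keeps this check cheap even
        # for large chunks.
        with gzip.open(path, "rt", encoding="utf-8", newline="") as f:
            header = next(csv.reader(f, delimiter=separator))
        if header != field_names:
            errors.append(f"header mismatch: {location}")
    return errors
```

A wrong separator shows up here as a header mismatch, because the header would then parse as a single oversized field.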
The most thorough check would then go through all of the files and make sure the data fields match their declarations. Since this is extremely expensive, it should usually be a side effect of regular processing rather than an up-front operation.