A Metadata Format For CSV Files

Using CSV files in batch processing applications has many advantages, most prominently interoperability between programming languages and tools. One of its weaker points, though, is data integrity: the format has no way to declare data types or any metadata beyond assigning names to data fields in a header line.

The simple metadata format proposed in this article can help to mitigate this disadvantage.

A Case for CSV

First of all, why would anyone use such a simple plain text format? After all, it's just a semi-structured collection of data sets.

Sure, CSV isn’t exactly the most sophisticated data format available, but it has many advantages that make up for this flaw:

  • CSV is simple and well-understood.
  • Libraries are available for many programming languages, and where they aren't, one can easily be written.
  • It can be analysed in a text editor.
  • Samples of a file can be loaded into a spreadsheet application.
  • Files can be processed using standard Unix utilities.
  • It is easy to split CSV files into individual, self-contained chunks.

See The Pragmatic Programmer, "The Power of Plain Text" (chapter 3) for a more detailed discussion.

Unfortunately, there are disadvantages, too. Parsing the file format is usually a bit more expensive than for carefully designed binary formats (though it’s much cheaper than parsing XML). And since the ASCII representation of numbers can easily be three times as big as the usual binary representation, CSV files tend to be rather large.

Fortunately, streaming compressors like gzip typically reduce files to about 20% of their original size, even at the lowest compression level. Compression saves disk space and network bandwidth, but comes at a slightly increased processing cost. In many scenarios, however, the added overhead is negligible compared to the cost of the actual processing.
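For illustration, a gzip-compressed, tab-separated chunk can be streamed record by record with nothing but Python's standard library (the file name is a placeholder, borrowed from the example further below):

import csv
import gzip

# Decompress on the fly; the chunk never has to be unpacked to disk.
with gzip.open("data_1.csv.gz", mode="rt", encoding="utf-8", newline="") as stream:
    reader = csv.reader(stream, delimiter="\t")
    header = next(reader)              # field names from the header line
    for row in reader:
        record = dict(zip(header, row))
        # ... actual processing goes here ...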

Using a Simple Metadata Format

There are basically three places where metadata can reside:

  • Inside the data file.
  • As a file next to the actual data files.
  • In a remote metadata repository.

If the data file format is extended to include metadata, we’d have to abandon the CSV format together with its advantages listed above. A remote metadata repository may be useful, but testing is a lot easier if you aren’t coupled to network resources, so a file next to the data seems to be the way to go.

The metadata format proposed here uses XML because it is human-readable, well-understood, and has excellent tool support (that should sound familiar). The format serves several purposes:

  • It declares the overall representation of CSV files (field separator, whether compression is used etc.).
  • It lists the set of files that make up the entire collection of data sets.
  • It defines the data fields and their respective types.

Here’s an example of how it looks (see the RELAX NG schema for the semi-formal definition):

<?xml version="1.0" encoding="UTF-8"?>

<meta xmlns="http://mafr.de/ns/meta-0.1#">
  <file-location format="csv" compression="gzip" separator="\t">
    <chunk location="data_1.csv.gz" size="123456"/>
    <chunk location="data_2.csv.gz" size="34354"/>
    <!-- more chunk declarations ... -->
    <chunk location="data_N.csv.gz" size="7890123"/>
  </file-location>

  <data-fields>
    <string-field name="USER_ID"/>
    <string-field name="NAME"/>
    <continuous-field name="AGE" missing="0"/>
    <categorical-field name="COUNTRY" missing="?">
      <category value="DE"/>
      <category value="UK"/>
      <category value="OTHERS"/>
    </categorical-field>
  </data-fields>
</meta>
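
Reading such a metadata file is straightforward with any XML library. Here's a rough sketch using Python's xml.etree.ElementTree (the namespace and element names are taken from the example above; the file name meta.xml is just an assumption):

import xml.etree.ElementTree as ET

NS = {"m": "http://mafr.de/ns/meta-0.1#"}

root = ET.parse("meta.xml").getroot()   # assumed file name

location = root.find("m:file-location", NS)
# Note: the separator is stored as the escape sequence "\t" and may
# need unescaping before it is handed to a CSV parser.
separator = location.get("separator")
chunks = [(c.get("location"), int(c.get("size")))
          for c in location.findall("m:chunk", NS)]

fields = []
for field in root.find("m:data-fields", NS):
    # "{http://mafr.de/ns/meta-0.1#}string-field" -> "string-field"
    kind = field.tag.split("}", 1)[-1]
    fields.append((field.get("name"), kind, field.get("missing")))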

This format is pretty much self-explanatory, except for the data field declarations, which use data mining terminology. The continuous type is for floating point numeric values, while the categorical field can be compared to enums in programming languages like C. The optional missing attribute defines how an unknown (aka NULL) value is represented. Of course, different and/or more data types with arbitrary restrictions could be defined.
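
To make the semantics concrete, here's a minimal sketch of how a single raw value could be decoded according to its declaration (the function name and signature are made up; only the field types and the missing marker come from the format):

def decode_value(raw, field_type, missing=None, categories=None):
    # The declared missing marker maps to None (i.e. NULL).
    if missing is not None and raw == missing:
        return None
    if field_type == "string-field":
        return raw
    if field_type == "continuous-field":
        return float(raw)                  # floating point numeric value
    if field_type == "categorical-field":
        if categories is not None and raw not in categories:
            raise ValueError("unknown category: %r" % raw)
        return raw
    raise ValueError("unknown field type: %r" % field_type)

With the declarations above, decode_value("0", "continuous-field", missing="0") yields None, while decode_value("42.5", "continuous-field", missing="0") yields 42.5.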

The data fields have to be listed in the order they appear in the CSV files. Reading applications may choose to accept any order though. Field names have to be unique and may not contain the field separator for obvious reasons.

Depending on policy, applications can either ignore unknown elements (or attributes) or flag an error. There’s also the option of declaring optional parts of the format in a different XML namespace. Some applications can use those elements while others may ignore them safely.

There are several ways of extending XML schemas, either by explicitly allowing unvalidated content (usually in a different namespace) or by including the basic schema from another, more specialized schema that extends definitions as necessary. Updating the base schema can be done via namespace and/or schema versioning, but since this isn’t entirely trivial I’ll leave it for a future article.

A Note on Container Formats

When you split your files into multiple chunks for parallel processing, you end up with lots of files. To avoid confusion and to simplify transfer between systems you might be tempted to use a container format that packages all your data into a single file (using tar, for example). This may work in some cases, but if your data files are large and created in parallel, the creation of the container is a long and I/O-intensive operation. In batch operations this causes a significant overhead that you have to subtract from your time window.

A compromise is to use a natural but cheap container format: a file system directory. It may only be a "virtual" container, but combined with a proper delivery protocol, it's still useful.

The Delivery Protocol

If you’re handing collections of CSV files over from one system to another, make sure you follow a simple protocol: copy the data files first and the metadata file last. The delivery isn’t complete until the metadata file is there. The best approach is to take advantage of the atomic renaming feature many file systems provide (see the rename(2) syscall on Unix): copy the metadata to a temporary file and then rename it. That way the receiver will never try to read an incomplete file.
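
Here's a minimal sketch of the sender's side, assuming the files are copied into a shared directory and the receiver only looks for a file called meta.xml (both assumptions, of course):

import os
import shutil

def deliver(data_files, metadata_file, target_dir):
    # Copy the (potentially large) data files first.
    for path in data_files:
        shutil.copy(path, target_dir)
    # Copy the metadata under a temporary name, then rename it atomically.
    # rename(2) is atomic as long as source and target are on the same file system.
    tmp_path = os.path.join(target_dir, "meta.xml.tmp")
    shutil.copy(metadata_file, tmp_path)
    os.rename(tmp_path, os.path.join(target_dir, "meta.xml"))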

The Verification Process

With the information contained in the metadata file, it is easy to verify the CSV files as much as required. The most basic check would only make sure that all of the listed files are there and have the declared file sizes. A simple consistency check would parse the headers to see if all data fields are there. This would also detect if the field separator is correct.

The most thorough check would then go through all of the files and make sure the data fields match their declarations. Since this is extremely expensive, it should usually be a side effect of regular processing rather than an up-front operation.
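
The two cheap checks could look roughly like this in Python (chunks and field_names come from the metadata file as shown earlier; gzip compression and an exact header match are assumed):

import csv
import gzip
import os

def verify_sizes(chunks):
    # chunks is a list of (location, declared_size) pairs.
    for location, size in chunks:
        if not os.path.isfile(location):
            raise RuntimeError("missing chunk: %s" % location)
        if os.path.getsize(location) != size:
            raise RuntimeError("size mismatch: %s" % location)

def verify_headers(chunks, field_names, separator="\t"):
    # field_names is the list of declared names in metadata order.
    for location, _size in chunks:
        with gzip.open(location, mode="rt", encoding="utf-8", newline="") as stream:
            header = next(csv.reader(stream, delimiter=separator))
        if header != field_names:
            raise RuntimeError("header mismatch in %s: %s" % (location, header))

An application that accepts data fields in any order would compare sets instead of lists.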


3 Responses to A Metadata Format For CSV Files

  1. grismar2015 says:

    Hey, your article is pretty old, but it came up in my searches for a metadata format to describe .csv and other file formats suitable for use on the web. I’m wondering what you ended up with and whether you’re aware of anything getting close to a usable standard?

    I’ve seen Google’s DSPL (https://developers.google.com/public-data/docs/developer_guide) and initiatives like SDF from OKF (http://data.okfn.org/vision), but apart from their own use, these standards appear to get little use and most people are inventing their own alternatives.

    I’m getting a strong Colossal Cave-vibe: “you are in a maze of twisty little passages, all alike.” But if you found a way out, I’d love to hear about it.

  2. Matthias says:

    Well, we couldn’t convince the team producing the data to use a metadata format at all, so we settled on CSV with headers and an informal description. Internally, we used a format pretty similar to the one described in the article, but without the file-location element. That simple format is still used today as far as I know, even after the data processing pipeline has been ported to Hadoop (still with CSV, but wrapped in Hadoop Sequence Files).

    Thanks for the pointer to DSPL, that looks promising! For Big Data applications, ZIP files are hardly practical, but the XML metadata format is quite nice. I haven’t found anything better, but I admit I haven’t done much research afterwards (except for Weka’s ARFF, which is a niche solution).

  3. grismar2015 says:

    Ok, thanks – I may have a little more luck getting my team to accept a standard, since they need the system we’re developing to publish their data and I’m looking to set some rules for accepting data. And even though CSV is ugly and easy to mess up, it sure is efficient, so CSV with some JSON or perhaps XML metadata format should do.
