CSV File Quirks
CSV files are supposed to be the standardized way to share data, right? Unfortunately, a “standard” CSV file is anything but. Let’s go over some of the CSV file quirks that any supporting application will need to consider.
What do you mean, quirks? Isn’t CSV a standard?
The general idea of having comma-separated values seems simple, but in actual implementation it is riddled with questions. In fact, there was a standard created, called RFC 4180, that attempted to remove some of the ambiguities (but wasn’t successful in removing all of them). Plus, there are many, many variations of CSV floating around in the wild.
So while, yes, there is now a standard defined, there is not really a standard in practice.
Considerations When Dealing with Delimited Files
Below are just some of the considerations to make when dealing with this data-interchange format:
- Delimiters – Delimiters are needed to separate both fields and records. Typically in a CSV (comma-separated values) file, the field delimiter is a comma and the record delimiter is \r\n. However, many systems and locales use a different field delimiter, like a semicolon, tab (TSV: tab-separated values), pipe, or other character.
- Newlines – Are the newlines \n (LF) only or \r\n (CRLF)?
- Newlines in fields – If your field genuinely contains a newline as part of the data, how do you distinguish this from a new record? Typically a field is quoted so that any newlines within it are included as part of the single field.
- Quotes in fields – Now that quotes have special meaning, how do we put literal quotes in our fields? Typically by escaping them, either by doubling the quote ("") or by preceding it with a backslash (\").
- Control characters – There may be hidden characters within the document.
- Character sets – Is the document in UTF-8, ASCII, or something else?
- Header optionality – Is the first record in the file a header row containing the column names, or is it data?
- Sparse data – How are blank fields handled?
- Mismatched field length/numbers – What if the header has 5 columns defined but a record has only 3, or maybe 8?
- Comments or blank lines – Are blank lines records or to be ignored? And did somebody add comments to their CSV with lines beginning with # (or some other character)?
- Headers or trailers – Some data files have metadata encoded on lines above or below the actual data in the file.
- Spaces – Should whitespace around field values be ignored or treated as part of the field?
Some of these are defined in RFC 4180 and some aren’t; regardless, these are challenges to tackle if you are going to be developing your own CSV uploader.
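To make a few of these quirks concrete, here is a minimal sketch using Python’s standard csv module. The sample data and file name are made up for illustration; the point is that the delimiter, the quote character, how quotes are escaped, the line endings, and the encoding all have to be stated explicitly (or guessed) before a file can be parsed reliably.

```python
import csv
import io

# An in-memory sample that exercises several quirks at once: CRLF record
# delimiters, a quoted field containing a newline, and a doubled quote ("")
# used to escape a literal quote inside a quoted field.
sample = 'name,comment\r\n"Ada","said ""hi""\r\nthen left"\r\n'

reader = csv.reader(
    io.StringIO(sample, newline=""),  # newline="" lets the csv module handle \r\n itself
    delimiter=",",                    # swap in ";", "\t", or "|" for other dialects
    quotechar='"',
    doublequote=True,                 # "" inside a quoted field means a literal quote
)
for row in reader:
    print(row)
# ['name', 'comment']
# ['Ada', 'said "hi"\r\nthen left']

# Reading from disk: the csv docs recommend newline="", and encoding="utf-8-sig"
# quietly strips a UTF-8 byte-order mark if one is present ("people.csv" is a
# hypothetical file name).
with open("people.csv", newline="", encoding="utf-8-sig") as f:
    for row in csv.reader(f, delimiter=","):
        print(row)
```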
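Dialect options only go so far: blank lines, comment lines, ragged records, stray whitespace, and an optional header row are policy decisions your own code has to make. Below is a sketch of one such policy; the comment prefix, the padding of short rows, and the trimming of whitespace are assumptions, not anything a standard mandates.

```python
import csv

def read_rows(path, comment_prefix="#", expected_width=None):
    """Tolerant row reader; every choice below is a policy decision, not a rule."""
    with open(path, newline="", encoding="utf-8-sig") as f:
        reader = csv.reader(f, skipinitialspace=True)   # drop spaces right after delimiters
        for row in reader:
            if not row:                                  # blank line: skip rather than treat as a record
                continue
            if row[0].lstrip().startswith(comment_prefix):  # crude comment-line detection
                continue
            row = [field.strip() for field in row]       # trim surrounding whitespace in each field
            if expected_width is not None:
                if len(row) < expected_width:            # short record: pad with empty fields
                    row += [""] * (expected_width - len(row))
                elif len(row) > expected_width:          # long record: truncate (rejecting is another option)
                    row = row[:expected_width]
            yield row

# Whether the first record is a header can only be guessed; csv.Sniffer offers
# a heuristic ("data.csv" is a hypothetical file name).
with open("data.csv", newline="", encoding="utf-8-sig") as f:
    first_row_is_header = csv.Sniffer().has_header(f.read(4096))
```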
Testing Your CSV Uploader Implementation
One way to test your CSV uploader implementation for CSV file quirks is to run the csv-spectrum files through it. csv-spectrum is described as “A bunch of different CSV files to serve as an acid test for CSV parsing libraries.” A good implementation will handle all of these cases.
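As a rough sketch of what that might look like, the loop below runs a local clone of csv-spectrum through Python’s csv module and compares each parse against the repository’s expected output. It assumes the repo is laid out with csvs/ and json/ folders, where each csvs/&lt;name&gt;.csv has a matching json/&lt;name&gt;.json containing the expected rows as a list of objects keyed by header name; verify the actual layout before relying on this.

```python
import csv
import json
from pathlib import Path

SPECTRUM = Path("csv-spectrum")  # hypothetical path to a local clone of the repo

for csv_path in sorted((SPECTRUM / "csvs").glob("*.csv")):
    expected_path = SPECTRUM / "json" / (csv_path.stem + ".json")
    expected = json.loads(expected_path.read_text())   # assumed: a list of {header: value} rows
    with open(csv_path, newline="", encoding="utf-8-sig") as f:
        actual = list(csv.DictReader(f))                # parse using the header row as field names
    print(f"{csv_path.name}: {'ok' if actual == expected else 'MISMATCH'}")
```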
Instead of trying to build an application to handle these quirks yourself, consider trying Delimited for free, which will give you an uploader that handles cases like these in just five minutes.