Use CSVY Format for Data Storage
I have used fread() and fwrite() from the data.table package for years. Recently, I noticed that a change introduced in version 1.11.0 broke some old code. The change was:
Numeric data that has been quoted is now detected and read as numeric.
Previously, quoted numbers were read as character values. Since 1.11.0, fread() detects them as numeric despite the quotes.
The old code still runs, but the data no longer arrives in the expected form. For example, an ID column with values such as “0001, 0002, 0003, …” may be read as “1, 2, 3, …”. That does not trigger an immediate error. The failure appears later, when downstream code tries to treat the ID column as character data.
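A minimal sketch of the problem, assuming a recent data.table and a made-up three-row CSV (the column names id and value are only for illustration). Declaring the column type by hand through colClasses is one way to keep the leading zeros, though it has to be repeated wherever the file is read:

```r
library(data.table)

# Hypothetical sample data: a quoted ID column with leading zeros.
csv_text <- 'id,value\n"0001",10\n"0002",20\n"0003",30\n'

# Default behaviour: the quoted IDs are detected as numbers,
# so the leading zeros are not preserved.
dt <- fread(text = csv_text)
class(dt$id)

# Declaring the type explicitly keeps the IDs as character strings.
dt_chr <- fread(text = csv_text, colClasses = list(character = "id"))
class(dt_chr$id)
dt_chr$id
```

The explicit colClasses works, but the type information lives in the reading code rather than in the file itself.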
At first, I was annoyed by the change. After thinking about it, though, I realized that the root problem is not data.table. The deeper issue is that CSV files do not include metadata about column types. To read CSV files reliably, we need a way to store column definitions alongside the data. That solution already exists.
CSVY adds YAML front matter to CSV files. Along with other descriptive information, the YAML front matter can include column definitions like this:
schema:
  fields:
  - name: x
    type: numeric
  - name: y
    type: character
  - name: z
    type: POSIXct
With this information saved inside the file, the data can be read back with the intended column types.
Using CSVY with data.table is straightforward. Since version 1.12.4, both fread() and fwrite() support the yaml argument. Use fwrite(..., yaml = TRUE) to save a CSVY file, then use fread(..., yaml = TRUE) to load it. This gives CSV files a practical, long-term place to store column definitions.
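Here is a small round-trip sketch, assuming data.table 1.12.4 or later and that the yaml package is installed (as far as I know, the yaml argument depends on it). The table contents and the temporary file are made up for illustration:

```r
library(data.table)  # yaml = TRUE also needs the 'yaml' package installed

# Hypothetical data with an ID column that must stay character.
dt <- data.table(
  id = c("0001", "0002", "0003"),
  value = c(1.5, 2.5, 3.5)
)

# Write the data with a YAML header describing each column.
path <- tempfile(fileext = ".csvy")
fwrite(dt, path, yaml = TRUE)

# Print the file to inspect the YAML front matter and the CSV body.
cat(readLines(path), sep = "\n")

# Read it back; the column types are taken from the YAML header.
dt2 <- fread(path, yaml = TRUE)
str(dt2)
```

If the header is picked up as expected, the id column should come back as character with its leading zeros intact, without any colClasses in the reading code.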