Why build your own CSV parser (or maybe not)
I will discuss various properties of CSV parsers, compare several parsers from NuGet and also provide my custom implementation.
This is based on work I did in the past for a customer in finance. We were building a calculation engine, and importing CSV files was one of the main tasks there. I evaluated a lot of CSV parsers but eventually we decided to roll our own implementation because of our unique requirements. As is usually the case, these special requirements were not that important after all, but of course we learned that only after the implementation was finished. In the end we replaced the core of our custom solution with TextFieldParser, which had better handling of quoted fields. Let me now share what I learned about parsing CSVs.
What is a CSV and how to parse it
The CSV format is defined by RFC 4180 as follows:

```
file        = [header CRLF] record *(CRLF record) [CRLF]
header      = name *(COMMA name)
record      = field *(COMMA field)
name        = field
field       = (escaped / non-escaped)
escaped     = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
non-escaped = *TEXTDATA
COMMA       = %x2C
CR          = %x0D
DQUOTE      = %x22
LF          = %x0A
CRLF        = CR LF
TEXTDATA    = %x20-21 / %x23-2B / %x2D-7E
```
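To make the grammar concrete, here is a small file that is valid under this definition; the first field of the second line is an escaped field containing a doubled quote, a comma and an embedded line break:

```
name,comment
"said ""hi"", then
continued",row one
plain,row two
```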
Let us discuss a few interesting points.
Comma as sole field separator
This is usually not true in practice, as people will send you files which can have pretty much anything as the separator. Granted, comma, pipe and semicolon are the most common.
We actually had a requirement originally that the separator can be more than one character. This has never been used.
CR+LF as line ending
Again, this requirement gets ignored very often. You can be sure that nobody will care whether the file they are sending you has Windows, Unix or Mac line endings, or a mixture thereof. You just have to deal with it. Ideally we should be able to parse any reasonable combination of CR and LF, much like the StreamReader.ReadLine method does.
In our implementation we were very specific and allowed the user to specify the line endings explicitly. This was never popular, for good reasons: instead of solving the problem we just dumped it on the user. In the end we defaulted it to Windows+Unix+Mac and never touched it again.
Double quotes to escape fields
A quoted field must begin and end with a double quote. A quote inside a field is escaped by doubling it. A quoted field can contain commas (the field separator) and new lines (the line separator). This is very important as you can get all kinds of characters in text fields.
Note that it is invalid for a quote to be included in unquoted text data: abc "efg" hij. In theory this should fail as an undefined state, but in real life you want to import as much as possible, so this case would just be ignored and the quote parsed as a normal character.
I have to say that implementing a custom quote character is not much of a problem. Nevertheless, I have never seen anything other than the double quote. Occasionally people try to escape quotes using a backslash, but that’s just wrong.
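The quoting rules above can be seen in action with the TextFieldParser mentioned earlier. A minimal sketch (the input string is illustrative):

```csharp
using System;
using System.IO;
using Microsoft.VisualBasic.FileIO; // home of TextFieldParser

class QuotingDemo
{
    static void Main()
    {
        // The middle field is quoted and contains the separator, an escaped
        // (doubled) quote and a line break - all legal per RFC 4180.
        string csv = "a,\"x,\"\"y\"\"\nz\",b";
        using (var parser = new TextFieldParser(new StringReader(csv)))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            parser.HasFieldsEnclosedInQuotes = true;
            string[] fields = parser.ReadFields();
            // fields[1] now holds the comma, the single unescaped quote
            // and the line break as plain characters.
            Console.WriteLine(fields.Length); // 3 fields
        }
    }
}
```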
There are no comments or empty lines!
In most implementations you are allowed to use single-line comments: if a line starts with a hash (#), or a custom character of your choice, it is skipped.
Likewise empty lines are not part of the grammar above. But many implementations still allow you to skip empty lines automatically.
Comments are useful and occasionally used. On the other hand, I never saw files with empty lines in a production environment.
Occasionally there is an option to skip the first N lines. This may be useful if somebody sends you garbage at the beginning of the file. On the other hand, if there is garbage it is much better to use comments to skip it.
Some of the implementations allow you to trim white space around fields automatically. While this may be useful, I feel it is not the responsibility of the parser but rather of some layer above. If this functionality is provided, it should be optional.
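As a sketch of why comment and empty-line skipping belongs inside (or in front of) the parser, here is a naive pre-filter over raw lines; note that it would break quoted fields containing line breaks, which is exactly why a real parser has to handle this itself (names are illustrative):

```csharp
using System.Collections.Generic;
using System.IO;

static class CsvPreFilter
{
    // Yields only the lines that should be parsed: comment lines and
    // empty lines are skipped. Caveat: ReadLine splits on every line
    // break, including those inside quoted fields.
    public static IEnumerable<string> DataLines(TextReader reader, char comment = '#')
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (line.Length == 0) continue;   // skip empty lines
            if (line[0] == comment) continue; // skip comment lines
            yield return line;
        }
    }
}
```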
Some of the parsers try to do more than just give you the lines and fields: they try to convert strings into types specified by the user. Some implementers feel that loading data directly into POCO entities is the only correct way of parsing CSV. I certainly think these features have their target audience, but in my case I either disabled this functionality where possible or excluded the parser from my evaluation altogether, as this didn’t fit my needs.
Reading and writing data
There are implementations that parse the whole file into an array, a list or a DataTable and return the whole content to you in one piece. This is just fine for small files but becomes a blocker when you need to import gigabytes of data. I was only looking at implementations that could read the file line by line.
Some libraries provide the functionality to write CSV data. I didn’t focus on this part as it was not important to my use case.
The following table lists some of the interesting features of each parser. Empty fields mean I didn’t find the respective information.
| Parser | Access | Delimiter | Line endings | Quote | EOL inside quote | Comment | Trim white space |
|---|---|---|---|---|---|---|---|
| Microsoft.VisualBasic.FileIO.TextFieldParser | Read line | multiple strings | | " | yes | multiple strings | optional |
| Nuget: Cinchoo ETL 220.127.116.11 (article on CP) | Enumerator | single string | single string | single char | yes | multiple strings | yes |
| Nuget: Csv 1.0.11 | Enumerable | single char | StreamReader.ReadLine | " | no | no | optional |
| Nuget: CsvHelper 3.0.0-beta7 | Read line | single string | windows, unix | single char | yes | single char | optional |
| Nuget: CsvToolkit 0.13.0 | Enumerable | single char | windows, unix | single char | yes | no | optional |
| Nuget: DevLib.Csv 18.104.22.16810 | Enumerable | single char | StreamReader.ReadLine | single char | no | no | no |
| Nuget: LibCsv4Net 22.214.171.1242 | Enumerable | single char | single string | single char | yes | no | no |
| Nuget: LumenWorksCsvReader 3.9.1 | Enumerable | single char | windows, unix | single char | yes | single char | optional |
| Nuget: Net.Code.Csv 1.0.3 | Data reader | single char | | single char | yes | single char | optional |
| Nuget: Nortal.Utilities.Csv 0.9.2 | Read line | single char | single string | single char | yes | no | no |
| Nuget: Uncomplicated.Csv 1.5.2 | Read line | single char | windows | single char | yes | no | no |
| My implementation | Read line | single char | CR, LF, CR+LF, LF+CR normalized into CR+LF | " | yes | # | no |
| CodeProject: C# – Light and Fast CSV Parser | Enumerable | single char | windows, unix | single char | yes | no | yes |
| CodeProject: C# CSV File and String Reader Classes | Read line | single char | | single char | yes | no | no |
Explanation of data access:
- Read line: There is a method which reads a single line in the form of a string array (or equivalent) and returns null when the end of file is reached.
- Enumerable: The parser returns an enumerable of lines (string arrays or equivalent).
- Data reader: Parser implements IDataReader interface.
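The three access styles roughly correspond to the following shapes (hypothetical interfaces, shown only to make the terms concrete):

```csharp
using System.Collections.Generic;
using System.Data;

// "Read line": pull one record at a time; null signals end of file.
interface IReadLineParser
{
    string[] ReadLine();
}

// "Enumerable": the parser hands you a lazy sequence of records.
interface IEnumerableParser
{
    IEnumerable<string[]> ReadRecords();
}

// "Data reader": the parser implements the standard ADO.NET reader,
// so it can be fed directly to consumers such as SqlBulkCopy.
interface IDataReaderParser : IDataReader
{
}
```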
Using the code
To replicate the results shown below, use the SimpleCsvReader.Demo project, which is a simple console application. There are three modes.
First you need to generate a random test file. You specify the number of lines, i.e. the size of the file, in the second parameter.
SimpleCsvReader.Demo.exe /gen 10000
When the file is generated, you can parse it using the various parsers and record the time. The second parameter specifies how many times to repeat the measurement; averages of the results are then calculated.
SimpleCsvReader.Demo.exe /run 3
Finally, there is a third option to verify that the parsed data matches the generated data. This is done by comparing SHA256 hashes of the data written to and read from the file.
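The verification idea can be sketched like this: hash the same logical content on the writing and on the reading side and compare the digests (the joiner characters are arbitrary, as long as both sides agree):

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

static class ContentHash
{
    // Computes a SHA256 digest over a sequence of parsed records.
    public static string Compute(IEnumerable<string[]> records)
    {
        var sb = new StringBuilder();
        foreach (var record in records)
        {
            // \u0001 and \u0002 cannot appear in the generated data,
            // so the concatenation is unambiguous.
            sb.Append(string.Join("\u0001", record)).Append('\u0002');
        }
        using (var sha = SHA256.Create())
        {
            byte[] hash = sha.ComputeHash(Encoding.UTF8.GetBytes(sb.ToString()));
            return BitConverter.ToString(hash).Replace("-", "");
        }
    }
}
```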
The table below shows the average time in seconds to parse a file with the given number of rows. The sample files didn’t contain any comments, but they did contain quoted fields with line breaks. Please see also the notes below the table.
| Parser | | | | | |
|---|---|---|---|---|---|
| Nuget: Cinchoo ETL 126.96.36.199 (3) | 0.71 | 6.80 | 67.72 | 678.22 | 6890.05 |
| Nuget: Csv 1.0.11 (1)(3) | 0.66 | 6.51 | 64.45 | 662.48 | 6642.64 |
| Nuget: CsvHelper 3.0.0-beta7 | 0.06 | 0.56 | 5.60 | 55.69 | 584.74 |
| Nuget: CsvToolkit 0.13.0 (3) | 0.12 | 1.18 | 11.68 | 118.92 | 1221.08 |
| Nuget: DevLib.Csv 188.8.131.5210 (1) | 0.04 | 0.35 | 3.51 | 34.95 | 374.55 |
| Nuget: LibCsv4Net 184.108.40.2062 | 0.25 | 2.46 | 24.50 | 244.65 | 2507.25 |
| Nuget: LumenWorksCsvReader 3.9.1 | 0.02 | 0.21 | 2.11 | 20.96 | 229.66 |
| Nuget: Net.Code.Csv 1.0.3 (3) | 0.05 | 0.52 | 5.16 | 51.30 | 541.96 |
| Nuget: Nortal.Utilities.Csv 0.9.2 | 0.06 | 0.59 | 5.89 | 59.02 | 613.04 |
| Nuget: Uncomplicated.Csv 1.5.2 | 0.04 | 0.34 | 3.43 | 33.95 | 365.23 |
| CodeProject: C# – Light and Fast CSV Parser | 0.06 | 0.56 | 5.53 | 56.26 | 585.87 |
| CodeProject: C# CSV File and String Reader Classes (3) | 0.05 | 0.51 | 5.12 | 52.02 | 545.95 |
1. Doesn’t respect line breaks inside quotes.
2. My implementation.
3. Additional overhead because the output needs to be adapted to a collection of strings.
The following graph shows the relative speed of each parser compared to the VB TextFieldParser, which is used as the reference. Parsers slower than the reference are not shown.
LumenWorks is the best. With my own implementation I was able to get under 40% of the reference time. CsvHelper beta7 performance is about 30% better than the last stable version.
First I need to say that what I’m showing here is not the implementation I wrote for my former employer, as that is proprietary work. What I show here is my own implementation done for research purposes.
The code described in this section is located in SimpleCsvReader.Lib in the attached project.
A CSV parser can be implemented as a finite-state machine. Parsing one line can be represented by the following diagram. Each transition is labeled with the input that triggers it, followed by a slash and the action performed. This is a so-called Mealy machine.
Start Line is the start state and End Line is the final state. The various line endings are pre-processed so the state machine doesn’t have to deal with them directly; a line ending is therefore represented by the single token EOL. EOF is end of file. ELSE means any other character. When there is no more data in the input, null is returned. If the line is empty, an empty string array is returned.
Append char means adding the current char to the current field. End field takes the current field, adds it to the current line and clears the current field so parsing can continue. Keep char means that the input is not advanced to the next char, so the same char is processed again in the next state.
Please note that there is no transition from Quoted Field on EOF: this throws an exception, as the quoted field is not closed. Also, no text can follow Double Quote: this again raises an exception. On the other hand, we do allow a quote inside Regular Field.
The underlying file is read char by char using the TextReader.Read method, which returns the next character as an integer, or -1 when the end of file is reached. To allow simple handling of line endings, we wrap this and return -2 in case of a line end.
This class has another important responsibility: tracking the current position in the input file. Not only the absolute position of the current character, but also the logical line and column number. This is useful when reporting errors to the user.
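A minimal sketch of such a reading wrapper (the class and member names are mine, not the article’s actual code): it collapses line endings into a single EOL token, returns -1 at end of file and -2 at end of line, and tracks the position for error reporting.

```csharp
using System.IO;

class TrackingReader
{
    public const int Eof = -1;
    public const int Eol = -2;

    private readonly TextReader _inner;

    public long Position { get; private set; } // absolute char offset
    public int Line { get; private set; } = 1; // logical line number
    public int Column { get; private set; } = 1;

    public TrackingReader(TextReader inner) { _inner = inner; }

    public int Read()
    {
        int c = _inner.Read();
        if (c == -1) return Eof;
        Position++;
        if (c == '\r' || c == '\n')
        {
            // Consume the second half of a CR+LF pair (an LF+CR pair
            // would need the symmetric check).
            if (c == '\r' && _inner.Peek() == '\n') { _inner.Read(); Position++; }
            Line++;
            Column = 1;
            return Eol;
        }
        Column++;
        return c;
    }
}
```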
Here the line is built field by field and char by char. The current field is represented by a StringBuilder, the current line by a list of strings. Chars are added to the current field; line ends inside quoted fields are automatically translated to Environment.NewLine, so the original value is lost. When a field ends, the string is appended to the current line and the current field’s StringBuilder is cleared. When the line ends, the current line is returned to the caller.
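Sketched as a class, the accumulation described above might look like this (illustrative names):

```csharp
using System;
using System.Collections.Generic;
using System.Text;

class LineBuilder
{
    private readonly StringBuilder _field = new StringBuilder();
    private readonly List<string> _line = new List<string>();

    public void AppendChar(char c) => _field.Append(c);

    // A line break inside a quoted field; the original ending is lost.
    public void AppendLineBreak() => _field.Append(Environment.NewLine);

    public void EndField()
    {
        _line.Add(_field.ToString());
        _field.Clear();
    }

    public string[] EndLine()
    {
        string[] result = _line.ToArray();
        _line.Clear();
        return result;
    }
}
```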
Context and state classes
I chose to implement the parser using the State design pattern. This has the benefit of having the transition logic nicely encapsulated; it is a direct translation of the diagram above. The Context class serves as the main hub, facilitating the transitions and executing the related actions, while the actual logic deciding the next state and what to do lives in the respective state classes.
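The skeleton of that arrangement could look like this (a sketch with made-up names; the real classes also cover all the actions from the diagram):

```csharp
// Each state inspects one input token (a char, EOL or EOF) and returns
// the next state; side effects are requested through the context.
interface ICsvState
{
    ICsvState Handle(CsvContext context, int input);
}

class CsvContext
{
    private ICsvState _current;

    public CsvContext(ICsvState startState) { _current = startState; }

    // Feed one token through the machine.
    public void Step(int input) { _current = _current.Handle(this, input); }

    // Actions the states can trigger:
    public void AppendChar(char c) { /* append to the current field */ }
    public void EndField() { /* push the field onto the current line */ }
}
```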
This is where it all comes together. The CsvParser class is the main interface to the parser; the ReadLine method resides here. If we ever needed more parameters than the delimiter, this is the place where they should be processed and validated. We also catch the parsing exceptions here and inject line and column numbers to help with problem solving.
My parser was working fine and the unit tests were passing, but I was a bit disappointed with the performance. I thought my parser would be lightning fast! And although I had surpassed the reference implementation by more than a half, there were others who were better. And that, of course, I couldn’t let go.
If you look at the code, I didn’t change that much. Mainly I merged everything into one class and made it private. I also implemented the state machine using a simple switch statement, with the states as an enum. With all of this I was able to squeeze an additional 5% improvement out of it.
By rewriting the state machine I got rid of the virtual methods. This in turn allowed the JIT compiler to inline far more than before. I guess this is the main reason for the performance gain.
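A compressed sketch of what the enum-and-switch shape looks like; this is not the actual code, just an illustration of the same machine with the states as an enum (EOL is assumed to be pre-normalized into the token -2, EOF is -1):

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class SwitchCsv
{
    enum State { StartField, RegularField, QuotedField, QuoteInQuoted }

    const int Eof = -1, Eol = -2;

    // 'read' returns the next token: a char, Eol or Eof.
    public static string[] ReadLine(Func<int> read, char delimiter = ',')
    {
        var line = new List<string>();
        var field = new StringBuilder();
        var state = State.StartField;
        void EndField() { line.Add(field.ToString()); field.Clear(); }

        while (true)
        {
            int c = read();
            switch (state)
            {
                case State.StartField:
                    if (c == Eof) { if (line.Count == 0) return null; EndField(); return line.ToArray(); }
                    else if (c == Eol) { if (line.Count == 0) return Array.Empty<string>(); EndField(); return line.ToArray(); }
                    else if (c == '"') state = State.QuotedField;
                    else if (c == delimiter) EndField(); // empty field
                    else { field.Append((char)c); state = State.RegularField; }
                    break;
                case State.RegularField:
                    if (c == Eof || c == Eol) { EndField(); return line.ToArray(); }
                    else if (c == delimiter) { EndField(); state = State.StartField; }
                    else field.Append((char)c); // a quote is allowed here
                    break;
                case State.QuotedField:
                    if (c == Eof) throw new FormatException("Quoted field is not closed.");
                    else if (c == '"') state = State.QuoteInQuoted;
                    else field.Append(c == Eol ? '\n' : (char)c); // EOL kept inside quotes
                    break;
                case State.QuoteInQuoted:
                    if (c == '"') { field.Append('"'); state = State.QuotedField; } // escaped ""
                    else if (c == delimiter) { EndField(); state = State.StartField; }
                    else if (c == Eof || c == Eol) { EndField(); return line.ToArray(); }
                    else throw new FormatException("Text after closing quote.");
                    break;
            }
        }
    }
}
```

With no virtual calls left, the JIT can see the whole loop at once, which is consistent with the inlining gain described above.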
Points of Interest
If you think that CSV files are obsolete then you are probably right; there are better ways to exchange information. But is the format dead? Not by far.
Can we further improve the performance?
I spent some time with a profiler looking at the performance, and as you have probably guessed, most of the time is spent reading the file itself: no surprise there. The second bottleneck was data conversion. There is definitely room for improvement there.
Also, there is a lot of buffering and copying between buffers going on. FileStream reads bytes into a buffer. StreamReader uses an encoding to get chars from the bytes and puts those into its own buffer. You then create strings out of these chars. Conversions take in strings, but their internal implementations work with char arrays. SqlBulkCopy always calls the GetValue method, which returns object, so your numbers and dates get boxed.
With some hacking you can save some memory allocations but I doubt it’s worth the effort.
Is it worth writing your own CSV parser?
Not really. Just pick one from NuGet and be done with it. You can always replace it with something better if you need to.
Why didn’t we use some existing library? Well, one reason was the ridiculous requirements. Another reason was the conversion layer that turns strings into numbers and dates. You would be surprised how difficult it is to get this right, considering you need to parse data in different formats and cultures. This is why we kept the conversion layer and only replaced the parser.
Why didn’t we use some existing solution like SSIS? Now that is a very good question, and some day I just might write another article about that.
- 16 March 2017 – Initial release.
- 17 March 2017 – CsvParser 0.5.2 removed as it parses data upfront.
- 20 March 2017 – new parsers added to the comparison:
- Two parsers from CodeProject articles
- Cinchoo ETL 220.127.116.11
- CsvHelper updated to 3.0.0-beta7