I haven’t posted in four years! My SparkChess project demands a lot of dedication on both the frontend and the server, and I never found the time to prepare a proper post. Still, as I’m winding down for Christmas and New Year, I want to finally share some stuff.
The Task
I wanted to generate some chess-related statistics (see? chess stuff again) and for this I got some official data from FIDE (The World Chess Federation). I expected a CSV or something, but instead I got a fixed-width text file. It’s the first time I’ve come across this format in 25 years!
If you’re reading this, you probably arrived here looking for a converter yourself, but just in case you don’t know what a fixed-width text file is, I’ll explain. Unlike CSV, this format doesn’t use a separator to differentiate between fields; instead, it pads each field to a fixed column width. Have a look at a hypothetical example:
```
ID    Name                Ti Age
7     Fabiano Caruana     GM 26
1234  Magnus Carlsen      GM 28
629   Garry Kasparov      GM 55
83933 Viswanathan Anand   GM 49
```
In the example above we have 4 columns: ID (6 chars), name (20 chars), title (2 chars) and age (3 chars). In a way, it’s easier to process than other formats: if you just want one column, extracting it is trivial. Still, I find it quite inefficient overall, especially as an exchange format. (FIDE also provides the data as XML, but I ultimately needed it as an SQL table and I wasn’t looking forward to parsing a 1 GB XML file.)
Converting to CSV is simple enough:
- Read the input file line by line
- Extract the desired columns
- Trim the extra spaces
- Glue the columns with a comma or tab
- Write to output file.
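The steps above can be sketched in a few lines of Python, using the column ranges from the hypothetical example and in-memory streams instead of real files (the names here are illustrative, not from the original code):

```python
import io

# hypothetical column ranges as (start, length) pairs, matching the
# example above; the title column is skipped on purpose
RANGES = ((0, 6), (6, 20), (29, 3))

def convert(infile, outfile, sep=","):
    """Read fixed-width lines, extract the configured columns, write CSV."""
    for line in infile:
        # slice each column out of the line and trim the padding spaces
        parts = [line[start:start + length].strip() for start, length in RANGES]
        outfile.write(sep.join(parts) + "\n")

# usage with in-memory streams instead of real files
src = io.StringIO("7     Fabiano Caruana     GM 26\n")
dst = io.StringIO()
convert(src, dst)
print(dst.getvalue())  # 7,Fabiano Caruana,26
```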
I wrote the C version first and then I created the other versions as an exercise. I also imposed a few limitations on myself (in descending order of importance):
- No additional frameworks or libraries (this is especially for Node; I’m sure there’s a node module somewhere that does just this conversion, but where would be the fun in that?)
- Simplest implementation. This is not a showcase of language features or cleverness. Simple, easy-to-understand code is more useful.
- Fastest way, within reason. For each language I tried to use the fastest approach, but I did not benchmark every possible decision (e.g. buffer size, string concatenation strategies, etc.).
A couple of notes:
- I’m deliberately skipping a column, to show how it can be done. All examples copy only 3 columns. My real data is more complex.
- I’m using comma as a separator, without enclosing strings in quotes. This may cause problems with real-world data, so of course you may want to either use tab as separator or enclose the strings.
- The parsers work with UTF-8.
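As an illustration of the quoting option, Python’s standard csv module (other languages have similar libraries) quotes only the fields that need it:

```python
import csv
import io

out = io.StringIO()
writer = csv.writer(out)  # default QUOTE_MINIMAL: quote only when needed
writer.writerow(["629", "Kasparov, Garry", "55"])  # the name contains a comma
print(out.getvalue())  # 629,"Kasparov, Garry",55
```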
C
As I mentioned, this was the first version I wrote:
```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define LINE_SIZE 256
#define BUFFER_SIZE 8192
#define INFILE "in.txt"
#define OUTFILE "out.csv"

// column ranges as {start, length}
size_t const RANGES[][2] = {{0, 6}, {6, 20}, {29, 3}};
// number of columns to process
#define COLS (sizeof RANGES/sizeof RANGES[0])

int trimcpy(char *destination, const char *source, size_t len)
{
    // trim spaces from the end - we only care about the space char
    while (len > 0 && source[len-1] == ' ')
        len--;
    memcpy(destination, source, len);
    destination[len] = '\0';
    return len;
}

int main(void)
{
    FILE *rfp;
    FILE *wfp;
    char line[LINE_SIZE];
    char out[BUFFER_SIZE];
    out[0] = '\0';

    rfp = fopen(INFILE, "r");
    if (rfp == NULL) {
        perror(INFILE);
        exit(EXIT_FAILURE);
    }
    wfp = fopen(OUTFILE, "w");
    if (wfp == NULL) {
        perror(OUTFILE);
        exit(EXIT_FAILURE);
    }

    int p = 0;
    // fgets is 4x faster than getline!
    while (fgets(line, LINE_SIZE, rfp) != NULL) {
        // flush the buffer if a full output line (32 chars + '\0') might not fit
        if (p + 33 > BUFFER_SIZE) {
            fputs(out, wfp);
            p = 0;
        }
        // go through the columns
        for (int i = 0; i < COLS; i++) {
            p += trimcpy(out+p, line+RANGES[i][0], RANGES[i][1]);
            p += trimcpy(out+p, i < COLS-1 ? "," : "\n", 1);
        }
    }
    // write any remaining data in buffer
    fputs(out, wfp);

    fclose(rfp);
    fclose(wfp);
    return 0;
}
```
For this I created a function that both copies a string from source to destination and trims trailing spaces; the way it’s used also acts like a substring. So it’s basically three-in-one, like a `concat(out, trim(substr(source, start, len)))`. The `trimcpy()` function starts by walking the source string backwards from the given len until it finds a non-space character. Since I know exactly what my data looks like, I can ignore other whitespace characters like tabs, form feeds and so on. Then it copies the source to the destination (the offset is provided when calling the function) and adds the null terminator. Finally, the function returns the number of characters copied.
Instead of writing each line, I’m using an 8 KB buffer: I simply append data to that out buffer. When the remaining space might not hold another full output line, the buffer is flushed and the position within the buffer is reset.
With minimal effort, this can be made into a command line utility.
Java 8
```java
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

public class Main {
    private static final String IN_FILE = "in.txt";
    private static final String OUT_FILE = "out.csv";
    private static final int[][] RANGES = {{0, 6}, {6, 20}, {29, 3}};

    public static void main(String[] args) {
        try {
            BufferedWriter outFile = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(OUT_FILE), StandardCharsets.UTF_8));
            Path path = new File(IN_FILE).toPath();
            Stream<String> lines = Files.lines(path, StandardCharsets.US_ASCII);
            lines.forEach(line -> {
                List<String> out = new ArrayList<>();
                for (int[] r : RANGES) {
                    try {
                        out.add(line.substring(r[0], r[0] + r[1]).trim());
                    } catch (StringIndexOutOfBoundsException ex) {
                        // short line: skip this column
                    }
                }
                try {
                    outFile.write(String.join(",", out).concat("\n"));
                } catch (IOException ex) {
                    System.out.println("Can't write to " + OUT_FILE);
                    System.exit(-1);
                }
            });
            outFile.close();
        } catch (IOException ex) {
            System.out.println("Can't read from " + IN_FILE);
            System.exit(-1);
        }
    }
}
```
The Java version is quite similar in structure, except of course we don’t need a function to copy strings. `String.join()` has been available since Java 8, so it shouldn’t be a problem. Note that in Java I’m not using my own buffer, since `BufferedWriter` already buffers with a default size of 8 KB. Still, I didn’t see any speed difference between buffered and unbuffered writes.
PHP
```php
<?php
const IN_FILE = 'in.txt';
const OUT_FILE = 'out.csv';
const BUFFER_LINES = 400;
const RANGES = [[0, 6], [6, 20], [29, 3]];

$rfp = fopen(IN_FILE, 'r') or die('Can\'t read from ' . IN_FILE);
$wfp = fopen(OUT_FILE, 'w') or die('Can\'t write to ' . OUT_FILE);

$buffered_lines = 0;
$buffer = "";

while ($line = stream_get_line($rfp, 192, "\n")) {
    $parts = [];
    foreach (RANGES as $range)
        $parts[] = trim(substr($line, $range[0], $range[1]));
    $buffer .= implode(",", $parts) . "\n";
    if (++$buffered_lines == BUFFER_LINES) {
        fwrite($wfp, $buffer);
        $buffered_lines = 0;
        $buffer = "";
    }
}
fwrite($wfp, $buffer);

fclose($rfp);
fclose($wfp);
```
PHP was a bit strange. Writes are buffered by default (again, with an 8 KB buffer), but I saw a massive speed difference when using my own buffer. This buffer is similar to the C version: lines are appended to a string, but instead of keeping track of the string length, I’m keeping track of the number of lines. For my data, 400 lines take about 8 KB as well.
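A quick sanity check of that estimate, using a row shaped like the hypothetical example data (roughly 20 bytes per output line):

```python
# a typical output row from the hypothetical example data: 21 bytes
row = "7,Fabiano Caruana,26\n"
print(len(row) * 400)  # 8400 bytes, i.e. roughly 8 KB for 400 lines
```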
Node JS
```javascript
const readline = require('readline'),
      fs = require('fs');

const IN_FILE = 'in.txt',
      OUT_FILE = 'out.csv',
      BUFFER_LINES = 200;
const RANGES = [[0, 6], [6, 20], [29, 3]];

const instream = fs.createReadStream(IN_FILE),
      outstream = fs.createWriteStream(OUT_FILE),
      rl = readline.createInterface({input: instream});

let buffer = '',
    bufferedLines = 0;

instream.on('error', (e) => { console.error(e.message); });
outstream.on('error', (e) => { console.error(e.message); });

rl.on('line', (line) => {
    const parts = [];
    for (let range of RANGES)
        parts.push(line.substr(range[0], range[1]).trim());
    buffer += parts.join(",") + "\n";
    if (++bufferedLines == BUFFER_LINES) {
        outstream.write(buffer);
        bufferedLines = 0;
        buffer = "";
    }
});

rl.on('close', () => {
    outstream.write(buffer);
    outstream.end();
});
```
While I love JavaScript (I code primarily in it), I’m really not a fan of everything being asynchronous and event-driven. This is a simple example, but nested callbacks can become tiresome. I’m again using a buffer that gets written at once instead of writing line by line. Node has cork()/uncork() for streams, but I think it would complicate the code even more.
Python 3
```python
IN_FILE = "in.txt"
OUT_FILE = "out.csv"
RANGES = ((0, 6), (6, 20), (29, 3))

try:
    rfp = open(IN_FILE, 'r')
except IOError:
    print("Could not read from", IN_FILE)
    raise SystemExit

try:
    wfp = open(OUT_FILE, 'w')
except IOError:
    print("Could not write to", OUT_FILE)
    raise SystemExit

for line in rfp:
    parts = []
    for rng in RANGES:
        parts.append(line[rng[0]:rng[0]+rng[1]].strip())
    wfp.write(",".join(parts) + "\n")

rfp.close()
wfp.close()
```
Python gets the award for the most concise code. Note that I’m not doing any buffering myself. Based on my tests, the speed is the same, which makes sense since it’s supposed to have its own internal buffer as well.
Benchmarking
Just for fun, I decided to test each implementation against a big 1.3 GB file with some 9 million rows.
Now, I must emphasize that you should not read too much into this, as it’s not a comprehensive test. Some languages are better at some tasks than others, and this only tests reading, writing and some string manipulation.
I tested the code on a laptop running Windows 10 with an Intel i7-6700K CPU at 4GHz, 32 GB RAM and a Samsung 860 EVO SSD.
The languages and compilers used are:
- C: MinGW gcc 6.3.0
- Java 1.8.0_191
- PHP 7.3.0
- Node 11.5.0
- Python 3.7.1
I ran each program 5 times and averaged the results. I timed execution with the free ptime utility.
As I alluded to above, I tried each version with and without additional buffering.
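For reference, the run-several-times-and-average procedure can be sketched cross-platform in Python (ptime is Windows-only); the command here is a trivial placeholder, not one of the actual converters:

```python
import subprocess
import sys
import time

def average_runtime(cmd, runs=5):
    """Run a command several times and return the mean wall-clock time."""
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True)  # raises if the command fails
        total += time.perf_counter() - start
    return total / runs

# placeholder command; in the real benchmark each converter would go here
mean = average_runtime([sys.executable, "-c", "pass"], runs=3)
print("average: %.3fs" % mean)
```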
Version | Without Buffer | With Buffer | Diff (vs Java) |
---|---|---|---|
C | 8.42s | 7.87s | 17% slower |
Java | 6.73s | 6.71s | (fastest) |
PHP | 54.21s | 9.75s | 45% slower |
Node | 43.13s | 9.22s | 37% slower |
Python | 32.50s | 32.58s | 485% slower |
The results are quite surprising. Of course I expected C to be fast, but I did not expect Java to be 17% faster than it! You can see how much both Node and PHP benefit from the manual line buffer. They have similar performance, but I still found it surprising that Node is actually faster than PHP. I was also surprised by Python: it’s so much slower than the others that it’s not even in the same league. I made another test, just reading the data without writing it back, but the results were similar.
Conclusions
The “Java is slow” myth persists from the bygone era of Java applets. Modern Java with its JIT is incredibly fast, as can be seen from this test and from my own experience. It won’t be faster than C for all operations, but it is comparable, and when you add the ease of development, it’s very tempting on the server side. I still dislike its verbosity (just look how complicated it is to write UTF-8!).
Note this is not a server-side test as I was not testing concurrent requests, load and so on. You can read an interesting benchmark of PHP vs Java vs Node here.