Programming-related articles, from quick tips to tutorials on a variety of topics for webmasters and developers alike. PHP, C#, Python, Java, Flash, MySQL, MSSQL and more are covered.

Making a Fixed-Width Text File to CSV Converter in C, Java, PHP, Javascript and Python

I haven’t posted in four years! My SparkChess project demands a lot of dedication, on the frontend as well as the server, and I just never got the time to prepare a proper post. Still, as I’m winding down for Christmas and New Year, I want to finally share some stuff.

The Task

I wanted to generate some chess-related statistics (see? chess stuff again) and for this I got some official data from FIDE (the World Chess Federation). I expected a CSV or something similar, but instead I got a fixed-width text file. It’s the first time I’ve come across this format in 25 years!

If you’re reading this, you probably arrived here looking for a converter yourself, but just in case you don’t know what a fixed-width text file is, I’ll explain. Unlike CSV, this format doesn’t use a separator to differentiate between fields; instead, it pads each field to a fixed column width. Have a look at a hypothetical example:

ID    Name                Ti Age
7     Fabiano Caruana     GM 26
1234  Magnus Carlsen      GM 28
629   Garry Kasparov      GM 55
83933 Viswanathan Anand   GM 49

In the example above we have 4 columns – ID (6 chars), name (20 chars), title (2 chars) and age (3 chars). In a way, it’s easier to process than other formats, because parsing is trivial if you just want one column. Still, I find it quite inefficient overall, especially as an exchange format. (FIDE also provides the data as XML, but I ultimately needed it as an SQL table and I wasn’t looking forward to parsing a 1 GB XML file.)
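
To make this concrete, here’s a quick sketch in Python (one of the languages covered below) that grabs just the name column from a row of the example data – a plain slice does it, no separator scanning required:

line = "1234  Magnus Carlsen      GM 28"
# each field sits at a fixed offset: ID at 0-5, name at 6-25 (matching the example above)
name = line[6:26].rstrip()
print(name)  # Magnus Carlsen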

Converting to CSV is simple enough:

  1. Read the input file line by line
  2. Extract the desired columns
  3. Trim the extra spaces
  4. Glue the columns with a comma or tab
  5. Write to output file.

I wrote the C version first and then I created the other versions as an exercise. I also imposed a few limitations on myself (in descending order of importance):

  • No additional frameworks or libraries (this is especially for Node; I’m sure there’s a Node module somewhere that does just this conversion, but where would be the fun in that?)
  • Simplest implementation. This is not a showcase of language features or cleverness. Simple, easy-to-understand code is more useful.
  • Fastest way, within reason. For each language I tried to use the fastest approach, but I did not benchmark every possible decision (e.g. buffer size, string concatenation strategies, etc.)

A couple of notes:

  • I’m deliberately skipping a column to show how it can be done. All examples copy only 3 columns; my real data is more complex.
  • I’m using a comma as the separator, without enclosing strings in quotes. This may cause problems with real-world data, so you may want to either use tab as the separator or enclose the strings in quotes – see the sketch right after this list.
  • The parsers work with UTF-8.
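
Here’s what such a quoting helper could look like – a minimal sketch in Python (csv_escape is just a hypothetical name; the same idea ports to any of the languages below):

def csv_escape(field):
    # quote a field that contains a comma, quote or newline,
    # doubling any embedded quotes (RFC 4180 style)
    if any(c in field for c in ',"\n'):
        return '"' + field.replace('"', '""') + '"'
    return field

My FIDE data didn’t need this, so the versions below skip quoting entirely.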

C

As I mentioned, this was the first version I wrote:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

#define LINE_SIZE 256
#define BUFFER_SIZE 8192

#define INFILE "in.txt"
#define OUTFILE "out.csv"

size_t const RANGES[][2] = {{0, 6}, {6, 20}, {29, 3}};

// number of columns to process
#define COLS (sizeof RANGES/sizeof RANGES[0])

int trimcpy(char *destination, const char *source, size_t len) {
    // trim spaces from the end - we only care about the space char
    // (check len first to avoid reading before the start of the string)
    while (len > 0 && source[len-1] == ' ')
        len--;

    memcpy(destination, source, len);
    destination[len] = '\0';

    return len;
}

int main(void) {
    FILE *rfp;
    FILE *wfp;

    char line[LINE_SIZE];
    char out[BUFFER_SIZE];
    out[0] = '\0';

    rfp = fopen(INFILE, "r");
    if (rfp == NULL) {
        perror(INFILE);
        exit(EXIT_FAILURE);
    }


    wfp = fopen(OUTFILE, "w");
    if (wfp == NULL) {
        perror(OUTFILE);
        exit(EXIT_FAILURE);
    }

    int p = 0;

    // fgets is 4x faster than getline!
    while (fgets(line, LINE_SIZE, rfp) != NULL) {
        // flush the buffer if another full line of output might not fit
        if (p + LINE_SIZE > BUFFER_SIZE) {
            fputs(out, wfp);
            p = 0;
        }

        // go through the columns
        for (int i=0; i<COLS; i++) {
            p += trimcpy(out+p, line+RANGES[i][0], RANGES[i][1]);
            p += trimcpy(out+p, i<COLS-1 ? "," : "\n", 1);
        }
    }

    // write any remaining data in buffer
    fputs(out, wfp);

    fclose(rfp);
    fclose(wfp);

    return 0;
}

For this I created a function that both copies a string from source to destination and trims trailing spaces, and the way it’s used also acts like a substring. So it’s basically three-in-one, like a concat(out, trim(substr(source, start, len))). The trimcpy() function starts by looking at the source string backwards from the given len until it finds a non-space character. Since I know exactly what my data looks like, I can ignore other whitespace characters like tabs, form feeds and so on. Then it copies the source to the destination (the destination offset is provided when calling the function) until all remaining characters are copied. Finally, the null terminator is added and the function returns the number of characters copied.

Instead of writing each line as it’s processed, I’m using an 8 KB buffer and simply appending data to the out buffer. When the buffer is almost full (the space remaining might not fit another line of output), it is written out and the position within the buffer is reset.

With minimal effort, this can be made into a command line utility.

Java 8

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

public class Main {

    private static final String IN_FILE = "in.txt";
    private static final String OUT_FILE = "out.csv";

    private static final int[][] RANGES = {{0, 6}, {6, 20}, {29, 3}};
    
    public static void main(String[] args) {
        try {
            BufferedWriter outFile = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(OUT_FILE), StandardCharsets.UTF_8));

            Path path = new File(IN_FILE).toPath();
            Stream<String> lines = Files.lines(path, StandardCharsets.US_ASCII);
            lines.forEach(line -> {
                List<String> out = new ArrayList<>();

                for (int[] r : RANGES) {
                    try {
                        out.add(line.substring(r[0], r[0]+r[1]).trim());
                    }
                    catch(StringIndexOutOfBoundsException ex) {
                        // line shorter than expected - skip the missing column
                    }
                }
                
                try {
                    outFile.write(String.join(",", out).concat("\n"));
                }
                catch (IOException ex) {
                    System.out.println("Can't write to " + OUT_FILE);
                    System.exit(-1);
                }
            });
            outFile.close();
        }
        catch (IOException ex) {
            System.out.println("Can't read from " +  IN_FILE);
            System.exit(-1);
        }
    }
}

The Java version is quite similar in structure, except of course we don’t need a function to copy strings. String.join() has been available since Java 8, so it shouldn’t be a problem. Note how in Java I’m not using my own buffer, since BufferedWriter already buffers with a default size of 8 KB. Still, I didn’t see any speed difference between buffered and unbuffered writes.

PHP

<?php
const IN_FILE = 'in.txt';
const OUT_FILE = 'out.csv';
const BUFFER_LINES = 400;
const RANGES = [[0, 6], [6, 20], [29, 3]];

$rfp = fopen(IN_FILE, 'r') or die('Can\'t read from ' . IN_FILE);
$wfp = fopen(OUT_FILE, 'w') or die('Can\'t write to ' . OUT_FILE);

$buffered_lines = 0;
$buffer = "";

while (($line = stream_get_line($rfp, 192, "\n")) !== false) {
    $parts = [];
    foreach(RANGES as $range)
        $parts[] = trim(substr($line, $range[0], $range[1]));

    $buffer .= implode(",", $parts)."\n";

    if (++$buffered_lines == BUFFER_LINES) {
        fwrite($wfp, $buffer);
        $buffered_lines = 0;
        $buffer = "";
    }
}

fwrite($wfp, $buffer);
fclose($rfp);
fclose($wfp);

PHP was a bit strange. Writes are buffered by default (again, with an 8 KB buffer), but I saw a massive speed difference when using my own buffer. This buffer is similar to the C version’s: lines are appended to a string, but instead of keeping track of the string length, I’m keeping track of the number of lines. For my data, 400 lines take about 8 KB as well.

Node JS

const readline = require('readline'),
      fs       = require('fs');

const IN_FILE = 'in.txt',
      OUT_FILE = 'out.csv',
      BUFFER_LINES = 200,
      RANGES = [[0, 6], [6, 20], [29, 3]];

const instream  = fs.createReadStream(IN_FILE),
      outstream = fs.createWriteStream(OUT_FILE),
      rl        = readline.createInterface({input: instream});

let buffer = '',
    bufferedLines = 0;

instream.on('error', (e)=>{console.error(e.message);});
outstream.on('error', (e)=>{console.error(e.message);});

rl.on('line', (line) => {
    let parts = [];
    for (let range of RANGES)
        parts.push(line.substr(range[0], range[1]).trim());
  
    buffer += parts.join(",")+"\n";
  
    if (++bufferedLines == BUFFER_LINES)
    {
        outstream.write(buffer);
        bufferedLines = 0;
        buffer = "";
    }
});

rl.on('close', ()=>{
  outstream.write(buffer);
  outstream.close();
});

While I love Javascript (I code primarily in it), I’m really not a fan of everything being asynchronous and event-driven. This is a simple example, but nested callbacks can become tiresome. I’m again using a buffer that gets written periodically instead of writing line by line. Node has cork()/uncork() for streams, but I think it would complicate the code even more.

Python 3

IN_FILE = "in.txt"
OUT_FILE = "out.csv"

RANGES = ((0, 6), (6, 20), (29, 3))

try:
    rfp = open(IN_FILE, 'r')
except IOError:
    print("Could not read from", IN_FILE)
    raise SystemExit

try:
    wfp = open(OUT_FILE, 'w')
except IOError:
    print("Could not write to", OUT_FILE)
    raise SystemExit

for line in rfp:
    parts = []
    for rng in RANGES:
        parts.append(line[rng[0]:rng[0]+rng[1]].strip())

    wfp.write(",".join(parts)+"\n")
    
rfp.close()
wfp.close()

Python gets the award for the most concise code. Note that I’m not doing any buffering myself. Based on my tests, the speed is the same either way, which makes sense since Python maintains its own internal write buffer.
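
If you’re curious, you can check (and even tune) that internal buffer yourself – a quick sketch, assuming CPython’s defaults:

import io

print(io.DEFAULT_BUFFER_SIZE)  # typically 8192 bytes

# a larger buffer can be requested when opening the file
wfp = open(OUT_FILE, 'w', buffering=1024*1024)

Since writes are already buffered by default, adding a manual line buffer like in the PHP and Node versions brings nothing here – which is exactly what my tests showed.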

Benchmarking

Just for fun, I decided to test each implementation against a big 1.3 GB file with some 9 million rows.

Now, I must emphasize that you should not read too much into this, as it’s not a comprehensive test. Some languages are better at some tasks than others, and this only tests reading, writing and some string manipulation.

I tested the code on a laptop running Windows 10 with an Intel i7-6700K CPU at 4GHz, 32 GB RAM and a Samsung 860 EVO SSD.

The languages and compilers used are:

  • C: MinGW gcc 6.3.0
  • Java 1.8.0_191
  • PHP 7.3.0
  • Node 11.5.0
  • Python 3.7.1

I ran each version 5 times and averaged the results. I timed execution with the free ptime utility.

As I alluded to above, I tried each version with and without the additional buffering.

Version   Without Buffer   With Buffer   Diff (vs Java)
C         8.42s            7.87s         17% slower
Java      6.73s            6.71s         –
PHP       54.21s           9.75s         45% slower
Node      43.13s           9.22s         37% slower
Python    32.50s           32.58s        485% slower

The results are quite surprising. Of course I expected C to be fast, but I did not expect Java to be 17% faster than it! You can see how both Node and PHP benefit from the manual line buffer. They have similar performance, but I still found it surprising that Node is actually faster than PHP. I was also surprised by Python: it’s so much slower than the others that it’s not even in the same league. I ran another test, just reading the data without writing it back, but the results were similar.

Conclusions

The “Java is slow” myth persists from the bygone era of Java applets. Modern Java and its JIT are incredibly fast, as can be seen from this test and from my own experience. It won’t be faster than C for all operations, but it is comparable, and when you add the ease of development, it’s very tempting on the server side. I still dislike its verbosity (just look how complicated it is to write UTF-8 output!).

Note this is not a server-side test as I was not testing concurrent requests, load and so on. You can read an interesting benchmark of PHP vs Java vs Node here.


Using Python to batch rename email files

This little tutorial is intended for those learning Python and demonstrates a number of features, including OS-independent file manipulation, email parsing, string formatting and error handling. In this post we’re using email metadata to name the files but you can apply the same principles for other operations.

The problem

Here at Media Division we’ve been archiving our email since 1997. We used several email clients over the years, starting with Netscape, then Outlook, Thunderbird and more recently Windows Live Mail. Even though we use Google Apps for Business, we do not rely on Gmail for storage, preferring our own storage for speed, privacy and backup options. We converted email from the various formats used in the past – mailbox, dbx, Outlook PST and so on – into separate EML files, which are very convenient because they are easy to read and parse, can be indexed and searched by the OS, and there’s no processing overhead.

As we back up the emails, we wanted to have as much info as possible in the filename itself, so we can find emails even without any parsing or the aid of an index.

I decided to rename all emails in a common format: yyyymmdd HHmmss [from -- to] Subject, e.g. 20140914 172000 [[email protected] -- [email protected]] Hello World.

The Solution

I’m learning Python myself, so maybe the code is not always “pythonesque” (no pun intended), but I preferred legibility over language features.

Obtaining a filtered list of files in a folder is a one-liner:

import glob
path = "/path/to/files/"
ext = ".eml"
files = glob.glob(path + "*" + ext)

I split it into 3 lines just for convenience. This returns a list of all EML files in the given path. It works with Windows paths too, even UNC paths like ‘\\server\share\folder’.

If you need to recurse directories, use os.walk() instead.
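
A recursive variant could look like this – a small sketch reusing the same path and ext variables:

import os

files = []
for root, dirs, names in os.walk(path):
    for name in names:
        if name.endswith(ext):
            files.append(os.path.join(root, name))

os.walk() yields each folder along with its subfolder and file names, so the loop simply collects the matching files.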

We can now go through each file:

index = 0
for file_path in files:    
    index += 1
    percent_done = "{:0>7.2%}".format(index/len(files))
    
    try:
        fp = open(file_path)
    except FileNotFoundError:
        print("{} File {} no longer exists".format(percent_done, file_path))
        continue
    except IOError as e:
        print("{} Error opening {}: {}".format(percent_done, file_path, e.strerror))
        continue

Note that we’re not always explicitly closing files; in CPython, a file is closed automatically when the file object is garbage-collected (for instance, when the variable is reassigned). Also note that FileNotFoundError must come before IOError, since it’s a subclass of it.
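
If you prefer deterministic cleanup, the idiomatic alternative is a with block, which closes the file even if an exception is raised – a minimal sketch:

with open(file_path) as fp:
    msg = parser.parse(fp, True)
# fp is guaranteed to be closed at this point

I kept the explicit open() here because the separate try/except blocks map nicely onto the different error messages.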

You can also see the pretty powerful string formatting in Python. A “{}” or “{0}”, “{1}” pattern will be replaced with values provided by the format() function. The bits after the colon mean: “0” – pad with zeroes, “>” – right-aligned, “7” – for a total width of 7 characters, “.2” – with two decimal places, “%” – as a percentage; so the numbers will always look like 000.00%.
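
A quick demonstration of that format spec in the interactive shell:

>>> "{:0>7.2%}".format(5/9000)
'000.06%'
>>> "{:0>7.2%}".format(8500/9000)
'094.44%'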

Moving on, we could read and parse the file with some regexes, but fortunately Python has a whole module dedicated to creating and parsing emails.

from email.parser import Parser
parser = Parser()

And now we can parse the email like this:

    try:
        msg = parser.parse(fp, True)
        fp.close()
    except UnicodeDecodeError as e:
        print("{} Error parsing {}: {}".format(percent_done, file_path, e.reason))
        continue

I’m using the second (optional) True parameter in the parse() function to parse only the headers. If you want to parse the whole email from file, omit the parameter or, even better, use email.message_from_file(fp) instead. I noticed that some seemingly valid emails fail parsing, hence the try block.

By now, we can access the headers from the msg object, which behaves like a dictionary: msg['From']. The problem is that headers can be Q-encoded, like this:

=?utf-8?q?this=20is=20some=20text?=

and to make matters worse, multiple encodings can be specified within the same header.

Using email.header.decode_header(encoded) we can convert an encoded string into a list of tuples containing the decoded text and the corresponding character set. Somewhat confusingly, the decoded text can be a string or bytes. If it’s bytes, we need to also decode it using the character set provided. Finally, we have to join all the parts together. Ugh. This is definitely something that should have been handled internally.

Let’s make a function to handle all this mess:

def parse_rfc2047_charset(encoded):
    output  = ""

    try:
        parts = email.header.decode_header(encoded)
        for text, charset in parts:
            if (isinstance(text, bytes)):
                #text = text.decode(charset or 'ascii', 'ignore')
                text = text.decode('ascii', 'ignore')
            output += text
    except Exception:
        output = "-"
        
    return output

There are two ways we can decode the bytes – based on the original encoding or always as ASCII. Since I wanted to have pure ASCII filenames, I chose to always decode as ASCII but your requirements may differ, so I provided the alternative for text.decode().

And now we can get the subject as a nice Unicode string:

    mail_subj = parse_rfc2047_charset(msg['Subject'])

Next, we can process the From and To headers a bit. A full email address looks like John Doe <[email protected]>. I didn’t want that, so I decided to keep just the email address. This can be achieved with a regex, but Python has a utility to parse the email address and return a tuple containing the name and the address:

    mail_from = parse_rfc2047_charset(email.utils.parseaddr(msg['From'])[1])
    mail_to = parse_rfc2047_charset(email.utils.parseaddr(msg['To'])[1])

Finally, the date. Email dates are represented like this: Thu, 14 Jan 2010 13:10:46 +0530. What we want is an ISO-style format (yyyy-mm-dd HH:mm:ss) that is easy to sort, understand and process. Processing the date is another multi-step process. First we use email.utils.parsedate_tz() to convert the date string into a tuple with 10 elements. But to format a date we use strftime(), which requires a tuple with 9 elements (no time zone), so we need some intermediary steps.

First we obtain a UTC timestamp (which we’ll need later anyway):

    timestamp = email.utils.mktime_tz(email.utils.parsedate_tz(msg['Date']))

Then we convert the timestamp to a 9-tuple with gmtime() and format it as desired:

    mail_date = time.strftime("%Y%m%d %H%M%S", time.gmtime(timestamp))

Now we can finally construct the elements into the new file name:

    base_name = "{} [{} -- {}] {}".format(mail_date, mail_from, mail_to, mail_subj)

Then we strip any illegal characters and truncate the name to keep it below the 255-character limit:

    for char in '<>:"/\\|?*\n':
        base_name = base_name.replace(char, '')
    base_name = base_name[:240]

Just in case, let’s check if the file has already been renamed to the desired format and skip renaming if so:

    name = os.path.basename(file_path)
    if (name == base_name + ext):
        print("{} File {} already ok".format(percent_done, file_path))
        continue

We also need to check if the new file name already exists so we don’t overwrite it. If it does, we append a number: first we try (1), then (2) if needed, and so on.

    i = 1
    new_name = base_name + ext
    while(os.path.isfile(path + new_name)):
        new_name = base_name + " (" + str(i) + ")" + ext
        i = i + 1

Finally, we can do the actual renaming. To make it nicer, we’ll also change the file modified time to match the email date. This way the emails can be processed even more efficiently:

    new_file_path = path + new_name
    try:
        os.utime(file_path, (timestamp, timestamp))
        os.rename(file_path, new_file_path)
        print("{} {} -> {}".format(percent_done, name, new_name))
    except OSError as e:
        print("{} Error renaming {} to {}: {}".format(percent_done, file_path, new_file_path, e.strerror))

And here’s the complete code for reference:

import email
import glob
import os
import time

from email.parser import Parser

def parse_rfc2047_charset(encoded):
    "Process an encoded header. Multiple encodings may exist in the same header. Returns an unicode string or '-' on error"
    output  = ""

    try:
        parts = email.header.decode_header(encoded)
        for text, charset in parts:
            if (isinstance(text, bytes)):
                #text = text.decode(charset or 'ascii', 'ignore')
                text = text.decode('ascii', 'ignore')
            output += text
    except Exception:
        output = "-"
        
    return output

path   = "/path/to/files/"
ext    = ".eml"
files  = glob.glob(path + "*" + ext)
index  = 0
parser = Parser()

for file_path in files:
    
    index += 1
    percent_done = "{:0>7.2%}".format(index/len(files))
    
    # open the file for reading
    try:
        fp = open(file_path)
    except FileNotFoundError:
        print("{} File {} no longer exists".format(percent_done, file_path))
        continue
    except IOError as e:
        print("{} Error opening {}: {}".format(percent_done, file_path, e.strerror))
        continue
    
    # parse the file as email
    try:
        msg = parser.parse(fp, True)
        fp.close()
    except UnicodeDecodeError as e:
        print("{} Error parsing {}: {}".format(percent_done, file_path, e.reason))
        continue
    
    # convert the email date from 'Thu, 14 Jan 2010 13:10:46 +0530' to '20100114 131046'
    try:
        timestamp = email.utils.mktime_tz(email.utils.parsedate_tz(msg['Date']))
        mail_date = time.strftime("%Y%m%d %H%M%S", time.gmtime(timestamp))
    except TypeError:
        # missing or unparsable Date header - use a placeholder and skip os.utime() later
        timestamp = None
        mail_date = "00000000 000000"

    # get and process encoded From, To and Subject headers
    mail_from = parse_rfc2047_charset(email.utils.parseaddr(msg['From'])[1])
    mail_to = parse_rfc2047_charset(email.utils.parseaddr(msg['To'])[1])
    mail_subj = parse_rfc2047_charset(msg['Subject'])

    # format the new name
    base_name = "{} [{} -- {}] {}".format(mail_date, mail_from, mail_to, mail_subj)
    
    # strip illegal characters
    for char in '<>:"/\\|?*\n':
        base_name = base_name.replace(char, '')
    
    # truncate name if needed
    base_name = base_name[:240]    
    
    #don't rename if already in the desired format
    name = os.path.basename(file_path)
    if (name == base_name + ext):
        print("{} File {} already ok".format(percent_done, file_path))
        continue
    
    # check if new file name already exists, if so append a number
    i = 1
    new_name = base_name + ext
    while(os.path.isfile(path + new_name)):
        new_name = base_name + " (" + str(i) + ")" + ext
        i = i+1

    #compose the full path
    new_file_path = path + new_name

    # rename the file and set its modified time to match the email date
    try:
        if timestamp is not None:
            os.utime(file_path, (timestamp, timestamp))
        os.rename(file_path, new_file_path)
        print("{} {} -> {}".format(percent_done, name, new_name))
    except OSError as e:
        print("{} Error renaming {} to {}: {}".format(percent_done, file_path, new_file_path, e.strerror))

Batch minify & gzip new files only with some date tricks

With everybody using grunt.js for linting, CSS compilation and minification nowadays, I wanted to make a simple experiment and see if I could do the same using a plain batch file – in Windows, no less. Turns out some things are very easy and some quite tricky. While this article is about batch processing of HTML files, you can apply the same principle to any sort of batch processing in Windows.

Smart generation of Gzip files for nginx

As the complexity of today’s sites increases, so do the challenges of keeping the site loading fast and bandwidth usage low. Minified scripts, concatenated CSS, image sprites and even hand-crafted static HTML are used for speedy delivery. This article discusses some lesser-known features of nginx that can lead to a significant speed increase.

Replacing Union Platform Default Derby Datasource with MySQL Server

Union Platform is a powerful and easy-to-use multiplayer server written in Java. By default it uses an embedded Derby database. In this article I’ll show you how you can replace this datasource with a MySQL one and how to migrate the database with minimum downtime.

Correct Name Capitalization in PHP

One annoying scenario is when you let users enter their names and then you need to output those names nicely, for example in a newsletter. Some users simply enter their names in all upper- or lowercase, but obviously when you address them you can’t do the same. On the other hand, PHP’s ucfirst() and ucwords() functions are too naive for proper capitalization.

PHP Download Script with Resume option

A while ago I wrote an article about the common pitfalls of handling file downloads in PHP. One thing I did not realize at that time is that in most cases developers don’t have the time to write such a script and they’ll use whatever they can find, even if it has flaws.

The free, easy way to back up your (WordPress) site

There are many ways to back up a site. There are remote storage solutions, WordPress plugins and more. They all have various advantages and disadvantages. Some of them incur a monthly fee, some of them are manual only, and even fewer solutions offer any level of versioning.

The “right way” to handle file downloads in PHP

I’ve seen many download scripts written in PHP, from simple one-liners to dedicated classes. Yet, at least half of them share common errors; in many cases programmers simply copy the code from something that works, without even attempting to understand what it really does.

Fancy Javascript slideshow – no framework required

If you thought having broadband makes page size irrelevant, you were wrong. Bandwidth costs money and a few saved kilobytes over a million impressions means real savings, not to mention that Google ranks sites based on their speed too.