Using Python to batch rename email files

This little tutorial is intended for those learning Python and demonstrates a number of features, including OS-independent file manipulation, email parsing, string formatting and error handling. In this post we’re using email metadata to name the files but you can apply the same principles for other operations.

The problem

Here at Media Division we’ve been archiving our email since 1997. We have used several email clients over the years, starting with Netscape, then Outlook, Thunderbird and, more recently, Windows Live Mail. Even though we use Google Apps for Business, we do not rely on Gmail for storage, preferring our own storage for speed, privacy and backup options. We converted email from the various formats used in the past (mailbox, dbx, Outlook PST and so on) into separate EML files, which are very convenient because they are easy to read and parse, can be indexed and searched by the OS, and need no processing overhead.

As we back up the emails, we wanted to have as much info as possible in the filename itself so we can find emails even without any parsing or the aid of an index.

I decided to rename all emails in a common format: yyyymmdd HHmmss [from -- to] Subject, e.g. 20140914 172000 [armand@gmail.com -- johndoe@server.net] Hello World.

The Solution

I’m learning Python myself so maybe the code is not always “pythonesque” (no pun intended) but I preferred legibility over language features.

Obtaining a filtered list of files in a folder is a one-liner:

import glob
path = "/path/to/files/"   # note the trailing separator; the pattern is built by plain concatenation
ext = ".eml"
files = glob.glob(path + "*" + ext)

I split it into three lines just for convenience. This returns a list of all EML files in the given path. It works with Windows paths too, even UNC paths like ‘\\server\share\folder’.

If you need to recurse directories, use os.walk() instead.
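
As a rough illustration of the recursive variant, here is a minimal sketch built on os.walk(). The helper name find_files and the case-insensitive extension match are my own assumptions, not part of the script in this post.

```python
import os

def find_files(root, ext=".eml"):
    """Return the paths of all files under root (recursively) whose name ends with ext."""
    matches = []
    # os.walk() yields one (folder, subfolders, filenames) triple per directory
    for folder, _subfolders, filenames in os.walk(root):
        for filename in filenames:
            # compare in lower case so that .EML files are picked up too
            if filename.lower().endswith(ext):
                matches.append(os.path.join(folder, filename))
    return matches
```

The returned list can be used in place of the glob.glob() result, keeping in mind that the files may then live in different folders.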

We can now go through each file:

index = 0
for file_path in files:
    index += 1
    percent_done = "{:0>7.2%}".format(index/len(files))

    try:
        fp = open(file_path)
    except FileNotFoundError:
        # must be listed before OSError, which is its parent class
        print("{} File {} no longer exists".format(percent_done, file_path))
        continue
    except OSError as e:
        print("{} Error opening {}: {}".format(percent_done, file_path, e.strerror))
        continue

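Note that we’re not closing the file at this point. CPython closes a file automatically once its file object is no longer referenced, for instance when fp is reassigned on the next iteration, but this is an implementation detail; closing the file explicitly or using a with block is safer.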
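
For comparison, here is a small sketch of the with-block approach. The helper first_line is hypothetical and only there to show the pattern; it is not used by the renaming script.

```python
def first_line(file_path):
    """Return the first line of file_path, or None if the file cannot be opened."""
    try:
        # the with block closes fp when it exits, even if reading raises
        with open(file_path) as fp:
            return fp.readline()
    except OSError:
        # FileNotFoundError is a subclass of OSError, so it is covered here as well
        return None
```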

You can also see Python’s pretty powerful string formatting. A “{}”, “{0}” or “{1}” pattern is replaced with the values passed to format(). The bits after the colon mean: “0” pads with zeroes, “>” right-aligns, “7” sets a total width of 7 characters, “.2” gives two decimal places and “%” formats the value as a percentage, so the numbers will always look like 000.00%.
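
To see the format specification in action on its own, here is a tiny sketch; the total of 8 files is just an assumed value for the demonstration.

```python
total = 8  # assumed number of files, for illustration only

# format the progress after the 1st, 4th and 8th file
labels = ["{:0>7.2%}".format(index / total) for index in (1, 4, 8)]

for label in labels:
    print(label)
# prints 012.50%, 050.00% and 100.00%, one per line
```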

Moving on, we could read and parse the file with some regexes, but fortunately Python has a whole class for creating and parsing emails.

from email.parser import Parser
parser = Parser()

And now we can parse the email like this:

    try:
        msg = parser.parse(fp, True)
        fp.close()
    except UnicodeDecodeError as e:
        print("{} Error parsing {}: {}".format(percent_done, file_path, e.reason))
        continue

I’m using the second (optional) True parameter in the parse() function to parse only the headers. If you want to parse the whole email from file, omit the parameter or, even better, use email.message_from_file(fp) instead. I noticed that some seemingly valid emails fail parsing, hence the try block.
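
If you want to try the headers-only mode without a file on disk, the following sketch parses a made-up message from a string using Parser.parsestr(). The sample message is invented for the example.

```python
from email.parser import Parser

# a made-up message, used only for this demonstration
raw = (
    "From: John Doe <john@domain.com>\n"
    "To: jane@server.net\n"
    "Subject: Hello World\n"
    "\n"
    "The body is not parsed into parts when headersonly is True.\n"
)

msg = Parser().parsestr(raw, headersonly=True)

print(msg["Subject"])    # Hello World
print(msg["from"])       # header lookup is case-insensitive
print(msg["X-Missing"])  # None, missing headers do not raise
```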

By now, we can access the headers through the msg object, which behaves like a dictionary: msg['From'], for example. The problem is that headers can be Q-encoded like this:

=?utf-8?q?this=20is=20some=20text?=

and to make matters worse, multiple encodings can be specified within the same header.

Using email.header.decode_header(encoded) we can convert an encoded string into a list of tuples, each containing the decoded text and the corresponding character set. Somewhat confusingly, the decoded text can be a string or bytes. If it’s bytes, we also need to decode it using the character set provided. Finally, we have to join all the parts together. Ugh. This is definitely something that should have been handled internally.
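
Here is a quick look at what decode_header() actually returns, using the Q-encoded sample from above and a plain header for contrast:

```python
import email.header

# an encoded header comes back as bytes plus the charset name
encoded_parts = email.header.decode_header("=?utf-8?q?this=20is=20some=20text?=")
print(encoded_parts)  # [(b'this is some text', 'utf-8')]

# a header without any encoding comes back as str, with no charset
plain_parts = email.header.decode_header("Hello World")
print(plain_parts)    # [('Hello World', None)]
```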

Let’s make a function to handle all this mess:

def parse_rfc2047_charset(encoded):
    output  = ""
 
    try:
        parts = email.header.decode_header(encoded)
        for text, charset in parts:
            if (isinstance(text, bytes)):
                #text = text.decode(charset or 'ascii', 'ignore')
                text = text.decode('ascii', 'ignore')
            output += text
    except Exception:
        output = "-"
 
    return output

There are two ways we can decode the bytes – based on the original encoding or always as ASCII. Since I wanted to have pure ASCII filenames, I chose to always decode as ASCII but your requirements may differ, so I provided the alternative for text.decode().
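
To make the difference between the two options concrete, here is a small sketch. The helper decode_part and its ascii_only flag are hypothetical names introduced only for this comparison.

```python
def decode_part(text, charset=None, ascii_only=True):
    """Turn one (text, charset) pair from decode_header() into a str."""
    if isinstance(text, bytes):
        if ascii_only:
            # drop every byte that is not plain ASCII
            return text.decode("ascii", "ignore")
        # honour the declared charset, falling back to ASCII if none was given
        return text.decode(charset or "ascii", "ignore")
    return text

raw = "Café".encode("utf-8")
print(decode_part(raw, "utf-8"))                    # Caf
print(decode_part(raw, "utf-8", ascii_only=False))  # Café
```

With the charset-based variant, an unknown charset name raises LookupError, which the except Exception clause in the function above would turn into "-".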

And now we can get the subject as a nice Unicode string:

    mail_subj = parse_rfc2047_charset(msg['Subject'])

Next, we can process the From and To headers a bit. A full email address looks like John Doe <john@domain.com>. I didn’t want that, so I decided to show just the email address. This can be achieved with a regex, but Python has a utility to parse the email address and return a tuple containing the name and email:

    mail_from = parse_rfc2047_charset(email.utils.parseaddr(msg['From'])[1])
    mail_to = parse_rfc2047_charset(email.utils.parseaddr(msg['To'])[1])
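
For reference, this is what parseaddr() returns for the sample address above and for a bare address:

```python
import email.utils

# a display name plus address is split into its two parts
name, address = email.utils.parseaddr("John Doe <john@domain.com>")
print(name)     # John Doe
print(address)  # john@domain.com

# a bare address yields an empty name
bare = email.utils.parseaddr("john@domain.com")
print(bare)     # ('', 'john@domain.com')
```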

Finally, the date. Email dates are represented like this: Thu, 14 Jan 2010 13:10:46 +0530. What we want is a compact, ISO-like format (yyyymmdd HHmmss) that is easy to sort, understand and process. Processing the date is another multi-step process. First we use email.utils.parsedate_tz() to convert the date string into a tuple with 10 elements. But to format a date we use strftime(), which requires a tuple with 9 elements (no time zone), so we need some intermediate steps.

First we obtain a UTC timestamp (which we’ll need later anyway):

    timestamp = email.utils.mktime_tz(email.utils.parsedate_tz(msg['Date']))

Then we convert the timestamp to a 9-tuple with gmtime() and format it in the desired format:

    mail_date = time.strftime("%Y%m%d %H%M%S", time.gmtime(timestamp))
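
Here are the two date steps run on the sample date from above, as a standalone sketch. Note that the result is in UTC, so the +0530 offset is subtracted from the local time in the header.

```python
import email.utils
import time

date_header = "Thu, 14 Jan 2010 13:10:46 +0530"

# 10-tuple; the last element is the UTC offset in seconds (19800 for +0530)
parsed = email.utils.parsedate_tz(date_header)

# seconds since the epoch, in UTC
timestamp = email.utils.mktime_tz(parsed)

# gmtime() gives the 9-tuple that strftime() expects
mail_date = time.strftime("%Y%m%d %H%M%S", time.gmtime(timestamp))
print(mail_date)  # 20100114 074046
```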

Now we can finally construct the elements into the new file name:

    base_name = "{} [{} -- {}] {}".format(mail_date, mail_from, mail_to, mail_subj)

Then we strip any illegal characters and truncate the name to keep it below 255 characters:

    for char in '<>:"/\\|?*\n':
        base_name = base_name.replace(char, '')
    base_name = base_name[:240]
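
The same clean-up can be wrapped in a small function, which makes it easy to try on sample input. The name sanitize_file_name and the max_length parameter are my own choices for this sketch.

```python
def sanitize_file_name(name, max_length=240):
    """Remove characters that are not allowed in file names and cap the length."""
    # same character set as in the loop above
    for char in '<>:"/\\|?*\n':
        name = name.replace(char, '')
    # leave some room below 255 characters for the extension and a numeric suffix
    return name[:max_length]

print(sanitize_file_name('Re: what? <now>'))  # Re what now
```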

Just in case, let’s check if the file has already been renamed to the desired format and skip renaming if so:

    name = os.path.basename(file_path)
    if (name == base_name + ext):
        print("{} File {} already ok".format(percent_done, file_path))
        continue

We also need to check whether the new file name already exists so we don’t overwrite it. If it does, we append a number: first we try (1), then (2) if needed, and so on.

    i = 1
    new_name = base_name + ext
    while(os.path.isfile(path + new_name)):
        new_name = base_name + " (" + str(i) + ")" + ext
        i = i + 1
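
Here is the numbering logic as a reusable function, in case you want to test it separately. The name unique_name is hypothetical, and I’m using os.path.join() here rather than string concatenation, so the folder does not need a trailing separator.

```python
import os

def unique_name(folder, base_name, ext):
    """Return base_name + ext, or a numbered variant that does not exist in folder yet."""
    new_name = base_name + ext
    i = 1
    # keep trying "name (1).eml", "name (2).eml", ... until a free name is found
    while os.path.isfile(os.path.join(folder, new_name)):
        new_name = "{} ({}){}".format(base_name, i, ext)
        i += 1
    return new_name
```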

Finally, we can do the actual renaming. To make it nicer, we’ll also change the file modified time to match the email date. This way the emails can also be sorted or filtered by date using ordinary file tools:

    new_file_path = path + new_name
    try:
        os.utime(file_path, (timestamp, timestamp))
        os.rename(file_path, new_file_path)
        print("{} {} -> {}".format(percent_done, name, new_name))
    except OSError as e:
        print("{} Error renaming {} to {}: {}".format(percent_done, file_path, new_file_path, e.strerror))
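
To check that the modified time really survives the rename, the two calls can be wrapped in a small helper and tried on a throwaway file. The function name rename_with_mtime is an assumption made for this sketch.

```python
import os

def rename_with_mtime(old_path, new_path, timestamp):
    """Set access and modified times to timestamp, then rename. Return True on success."""
    try:
        # os.utime() takes an (access time, modified time) pair in seconds since the epoch
        os.utime(old_path, (timestamp, timestamp))
        os.rename(old_path, new_path)
        return True
    except OSError as e:
        print("Error renaming {} to {}: {}".format(old_path, new_path, e.strerror))
        return False
```

After a successful call, os.path.getmtime(new_path) should report the timestamp that was passed in.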

And here’s the complete code for reference:

import email.header
import email.utils
import glob
import os
import time
 
from email.parser import Parser
 
def parse_rfc2047_charset(encoded):
    """Process an encoded header. Multiple encodings may exist in the same header. Returns a Unicode string or '-' on error."""
    output  = ""
 
    try:
        parts = email.header.decode_header(encoded)
        for text, charset in parts:
            if (isinstance(text, bytes)):
                #text = text.decode(charset or 'ascii', 'ignore')
                text = text.decode('ascii', 'ignore')
            output += text
    except Exception:
        output = "-"
 
    return output
 
path   = "/path/to/files/"
ext    = ".eml"
files  = glob.glob(path + "*" + ext)
index  = 0
parser = Parser()
 
for file_path in files:
 
    index += 1
    percent_done = "{:0>7.2%}".format(index/len(files))
 
    # open the file for reading
    try:
        fp = open(file_path)
    except FileNotFoundError:
        # must be listed before OSError, which is its parent class
        print("{} File {} no longer exists".format(percent_done, file_path))
        continue
    except OSError as e:
        print("{} Error opening {}: {}".format(percent_done, file_path, e.strerror))
        continue
 
    # parse the file as email
    try:
        msg = parser.parse(fp, True)
        fp.close()
    except UnicodeDecodeError as e:
        print("{} Error parsing {}: {}".format(percent_done, file_path, e.reason))
        continue
 
    # convert the email date from 'Thu, 14 Jan 2010 13:10:46 +0530' to '20100114 074046' (UTC)
    try:
        timestamp = email.utils.mktime_tz(email.utils.parsedate_tz(msg['Date']))
        mail_date = time.strftime("%Y%m%d %H%M%S", time.gmtime(timestamp))
    except TypeError:
        # no usable Date header: keep the file's current modified time for os.utime() below
        timestamp = os.path.getmtime(file_path)
        mail_date = "00000000 000000"
 
    # get and process encoded From, To and Subject headers
    mail_from = parse_rfc2047_charset(email.utils.parseaddr(msg['From'])[1])
    mail_to = parse_rfc2047_charset(email.utils.parseaddr(msg['To'])[1])
    mail_subj = parse_rfc2047_charset(msg['Subject'])
 
    # format the new name
    base_name = "{} [{} -- {}] {}".format(mail_date, mail_from, mail_to, mail_subj)
 
    # strip illegal characters
    for char in '<>:"/\\|?*\n':
        base_name = base_name.replace(char, '')
 
    # truncate name if needed
    base_name = base_name[:240]    
 
    #don't rename if already in the desired format
    name = os.path.basename(file_path)
    if (name == base_name + ext):
        print("{} File {} already ok".format(percent_done, file_path))
        continue
 
    # check if new file name already exists, if so append a number
    i = 1
    new_name = base_name + ext
    while(os.path.isfile(path + new_name)):
        new_name = base_name + " (" + str(i) + ")" + ext
        i = i+1
 
    #compose the full path
    new_file_path = path + new_name
 
    # rename the file
    try:
        os.utime(file_path, (timestamp, timestamp))
        os.rename(file_path, new_file_path)
        print("{} {} -> {}".format(percent_done, name, new_name))
    except OSError as e:
        print("{} Error renaming {} to {}: {}".format(percent_done, file_path, new_file_path, e.strerror))