Using Python to batch rename email files

This little tutorial is intended for those learning Python and demonstrates a number of features, including OS-independent file manipulation, email parsing, string formatting and error handling. In this post we’re using email metadata to name the files, but you can apply the same principles to other operations.

The problem

Here at Media Division we’ve been archiving our email since 1997. We have used several email clients over the years, starting with Netscape, then Outlook, Thunderbird and more recently Windows Live Mail. Even though we use Google Apps for Business, we do not rely on Gmail for storage, preferring our own storage for speed, privacy and backup options. We converted email from the various formats used in the past (mailbox, dbx, Outlook PST and so on) into separate EML files, which are very convenient because they are easy to read and parse, can be indexed and searched by the OS, and there’s no processing overhead.

As we back up the emails, we wanted to have as much information as possible in the filename itself, so we can find emails even without any parsing or the aid of an index.

I decided to rename all emails in a common format: yyyymmdd HHmmss [from -- to] Subject, e.g. 20140914 172000 [[email protected] -- [email protected]] Hello World.

The Solution

I’m learning Python myself so maybe the code is not always “pythonesque” (no pun intended) but I preferred legibility over language features.

Obtaining a filtered list of files in a folder is a one-liner:

    import glob

    path = "/path/to/files/"  # note the trailing separator
    ext = ".eml"
    files = glob.glob(path + "*" + ext)

I split it into three lines just for convenience. This returns a list of all EML files in the given path. The path must end with a separator, because it is concatenated directly with the pattern. It works with Windows paths too, even UNC paths like ‘\\server\share\folder’.

If you need to recurse directories, use os.walk() instead.
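
A minimal sketch of the recursive variant, assuming the same .eml extension. The find_files helper is a hypothetical name introduced here for illustration; it is not part of the script in this post.

```python
import os


def find_files(root, ext=".eml"):
    """Yield the full path of every file under root whose name ends with ext."""
    for dir_path, dir_names, file_names in os.walk(root):
        for file_name in file_names:
            # compare case-insensitively so that .EML files are found too
            if file_name.lower().endswith(ext):
                yield os.path.join(dir_path, file_name)
```

You would then build the list with files = list(find_files("/path/to/files")). On Python 3.5 and later, glob.glob() with a "**" pattern and recursive=True is another option.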

We can now go through each file:

    index = 0
    for file_path in files:
        index += 1
        percent_done = "{:0>7.2%}".format(index / len(files))

        try:
            fp = open(file_path)
        except FileNotFoundError:
            # must come before OSError, which is its parent class
            print("{} File {} no longer exists".format(percent_done, file_path))
            continue
        except OSError as e:
            print("{} Error opening {}: {}".format(percent_done, file_path, e.strerror))
            continue

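Note that we’re not explicitly closing the file here. In CPython a file is closed when its file object is garbage-collected, which normally happens as soon as the variable is reassigned, but the language does not guarantee this, so the parsing step further down calls fp.close() once the headers have been read.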
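
For comparison, here is a sketch of the same open-and-handle-errors pattern written with a with block, which closes the file as soon as the block ends, even if an exception is raised inside it. The read_first_line helper is hypothetical and only serves to show the pattern.

```python
def read_first_line(file_path):
    """Return the first line of a text file, or None if it cannot be opened."""
    try:
        # the file is closed automatically when the with block ends
        with open(file_path) as fp:
            return fp.readline().rstrip("\n")
    except FileNotFoundError:
        # the more specific exception is listed before its parent, OSError
        print("File {} no longer exists".format(file_path))
        return None
    except OSError as e:
        print("Error opening {}: {}".format(file_path, e.strerror))
        return None
```

The script in this post keeps the file object around a little longer because the parser reads from it in a later step, which is why it closes the file by hand instead.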

You can also see the pretty powerful string formatting in Python. A “{}” or “{0}”, “{1}” pattern will be replaced with values provided by the format() function. The bits after the colon represent: “0” – pad with zeroes, “>” – right-aligned, “7” – a total width of 7 characters, “.2” – two decimal places, “%” – format as a percentage; so the numbers will always look like 000.00%.
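
To see the format specification in isolation, here is a small sketch. The format_progress name and the sample numbers are made up for the example.

```python
def format_progress(index, total):
    """Return index/total as a zero-padded percentage, 7 characters wide up to 100%."""
    return "{:0>7.2%}".format(index / total)


for index in (1, 4, 8):
    print(format_progress(index, 8))
# 012.50%
# 050.00%
# 100.00%
```

The "%" presentation type multiplies the value by 100 and appends the percent sign, so the division result can be passed in directly.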

Moving on, we could read and parse the file with some regexes, but fortunately Python has a whole class for creating and parsing emails.

    from email.parser import Parser

    parser = Parser()

And now we can parse the email like this:

        try:
            msg = parser.parse(fp, True)  # True = parse only the headers
        except UnicodeDecodeError as e:
            print("{} Error parsing {}: {}".format(percent_done, file_path, e.reason))
            continue
        finally:
            # close the file whether or not parsing succeeded
            fp.close()

I’m using the second (optional) True parameter in the parse() function to parse only the headers. If you want to parse the whole email from file, omit the parameter or, even better, use email.message_from_file(fp) instead. I noticed that some seemingly valid emails fail parsing, hence the try block.
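
Header-only parsing is easy to try without any files on disk. The sketch below uses parsestr(), the string counterpart of parse(), on a made-up message; the addresses and subject are placeholders, not real data.

```python
from email.parser import Parser

# a tiny made-up message: headers, a blank line, then the body
raw = (
    "From: John Doe <john@example.com>\n"
    "To: jane@example.com\n"
    "Subject: Hello World\n"
    "\n"
    "Body text that header-only parsing does not analyse.\n"
)

msg = Parser().parsestr(raw, headersonly=True)

print(msg["Subject"])  # Hello World
print(msg["From"])     # John Doe <john@example.com>
print(msg["Cc"])       # None, because the header is absent
```

A missing header returns None rather than raising an error, which is worth keeping in mind for the steps that follow.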

By now, we can access the headers through the msg object, which behaves like a dictionary: msg['From']. The problem is that headers can be encoded as RFC 2047 encoded words, for instance Q-encoded like this:

=?utf-8?q?this=20is=20some=20text?=

and to make matters worse, multiple encodings can be specified within the same header.

Using email.header.decode_header(encoded) we can convert an encoded string into a list of tuples, each containing the decoded text and the corresponding character set. Somewhat confusingly, the decoded text can be a string or bytes. If it’s bytes, we need to also decode it using the character set provided. Finally we have to join together all the parts. Ugh. This is definitely something that should have been handled internally.
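
A quick look at what decode_header() returns, using the encoded sample from above plus an unencoded string. The last line shows email.header.make_header(), a standard-library helper that does the joining for you; it is an alternative to the approach taken in this post, not what the script uses.

```python
from email.header import decode_header, make_header

# an encoded word comes back as bytes plus the declared character set
encoded_parts = decode_header("=?utf-8?q?this=20is=20some=20text?=")
print(encoded_parts)  # [(b'this is some text', 'utf-8')]

# an unencoded header comes back as a plain string with no character set
plain_parts = decode_header("plain text")
print(plain_parts)  # [('plain text', None)]

# make_header() rebuilds a Header object; str() yields the decoded text
joined = str(make_header(decode_header("=?utf-8?q?caf=C3=A9?=")))
```

Here joined holds the word "café" as an ordinary Python string.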

Let’s make a function to handle all this mess:

    import email.header

    def parse_rfc2047_charset(encoded):
        """Decode an RFC 2047 encoded header; return '-' on error."""
        output = ""

        try:
            parts = email.header.decode_header(encoded)
            for text, charset in parts:
                if isinstance(text, bytes):
                    # alternative: honour the declared character set
                    # text = text.decode(charset or 'ascii', 'ignore')
                    text = text.decode('ascii', 'ignore')
                output += text
        except Exception:
            output = "-"

        return output

There are two ways we can decode the bytes – based on the original encoding or always as ASCII. Since I wanted to have pure ASCII filenames, I chose to always decode as ASCII but your requirements may differ, so I provided the alternative for text.decode().

And now we can get the subject as a nice Unicode string:

        mail_subj = parse_rfc2047_charset(msg['Subject'])

Next, we can process the From and To headers a bit. A full email address looks like John Doe <[email protected]>. I didn’t want that, so I decided to show just the email address. This can be achieved with a regex, but Python has a utility that parses the address and returns a tuple containing the name and the email address:

        mail_from = parse_rfc2047_charset(email.utils.parseaddr(msg['From'])[1])
        mail_to = parse_rfc2047_charset(email.utils.parseaddr(msg['To'])[1])
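
As a quick illustration of what parseaddr() returns, here is a sketch with made-up addresses on the example.com domain:

```python
from email.utils import parseaddr

# display name plus address
full = parseaddr("John Doe <john@example.com>")
print(full)  # ('John Doe', 'john@example.com')

# a bare address has an empty display name
bare = parseaddr("jane@example.com")
print(bare)  # ('', 'jane@example.com')

# index 1 is the address part, which is what the script keeps
address_only = full[1]
```

The script takes element [1] of the tuple, so only the address ends up in the file name.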

Finally, the date. Email dates are represented like this: Thu, 14 Jan 2010 13:10:46 +0530. What we want is a compact ISO-like format (yyyymmdd HHmmss) that is easy to sort, understand and process. Processing the date is another multi-step process. First we use email.utils.parsedate_tz() to convert the date string into a tuple with 10 elements. But to format a date we use time.strftime(), which requires a tuple with 9 elements (no time zone), so we need some intermediary steps.

First we obtain a UTC timestamp (which we’ll need later anyway):

        timestamp = email.utils.mktime_tz(email.utils.parsedate_tz(msg['Date']))

Then we convert the timestamp to a 9-tuple with time.gmtime() and format it as desired. Since gmtime() returns UTC, the example date above becomes 20100114 074046:

        mail_date = time.strftime("%Y%m%d %H%M%S", time.gmtime(timestamp))
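
The two steps can be wrapped into one small function to try them out. This is only a sketch: format_mail_date is a hypothetical helper name, and returning None for an unparsable date is a choice made for the example.

```python
import email.utils
import time


def format_mail_date(date_header):
    """Convert an email Date header to 'yyyymmdd HHmmss' in UTC, or None if unparsable."""
    parsed = email.utils.parsedate_tz(date_header)  # 10-tuple, or None on failure
    if parsed is None:
        return None
    timestamp = email.utils.mktime_tz(parsed)       # seconds since the epoch, UTC
    return time.strftime("%Y%m%d %H%M%S", time.gmtime(timestamp))


print(format_mail_date("Thu, 14 Jan 2010 13:10:46 +0530"))  # 20100114 074046
print(format_mail_date("not a date"))                        # None
```

The +0530 offset is subtracted when the timestamp is built, which is why 13:10:46 turns into 07:40:46.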

Now we can finally construct the elements into the new file name:

        base_name = "{} [{} -- {}] {}".format(mail_date, mail_from, mail_to, mail_subj)

Then we strip any illegal characters and truncate the name so that the full file name stays below 255 characters:

        for char in '<>:"/\\|?*\n':
            base_name = base_name.replace(char, '')
        base_name = base_name[:240]
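
The same clean-up can be sketched as a reusable function. Here str.translate() removes all the unwanted characters in one pass instead of looping; the sanitize_file_name name and the max_length parameter are assumptions made for this example.

```python
ILLEGAL_CHARS = '<>:"/\\|?*\n'

# a translation table that maps every illegal character to "delete"
_DELETE_ILLEGAL = str.maketrans("", "", ILLEGAL_CHARS)


def sanitize_file_name(name, max_length=240):
    """Remove characters that are not allowed in file names and truncate the result."""
    return name.translate(_DELETE_ILLEGAL)[:max_length]


print(sanitize_file_name("Re: what?\n"))  # Re what
```

Truncating to 240 characters leaves room for the extension and a numeric suffix within the usual 255-character limit.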

Just in case, let’s check if the file has already been renamed to the desired format and skip renaming if so:

        name = os.path.basename(file_path)
        if name == base_name + ext:
            print("{} File {} already ok".format(percent_done, file_path))
            continue

We also need to check whether the new file name already exists so we don’t overwrite it. If it does, we append a number: first we try (1), then (2) if needed, and so on.

        i = 1
        new_name = base_name + ext
        while os.path.isfile(path + new_name):
            new_name = base_name + " (" + str(i) + ")" + ext
            i = i + 1
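
Here is the same idea as a standalone function that is easy to test. The unique_name helper and its parameters are hypothetical; it uses os.path.join(), so the folder does not need a trailing separator.

```python
import os


def unique_name(folder, base_name, ext):
    """Return base_name + ext, or a numbered variant not yet present in folder."""
    new_name = base_name + ext
    i = 1
    while os.path.exists(os.path.join(folder, new_name)):
        # try "name (1).eml", then "name (2).eml" and so on
        new_name = "{} ({}){}".format(base_name, i, ext)
        i += 1
    return new_name
```

Keep in mind that checking first and renaming afterwards is not atomic, so another process could still create the same name in between. For a one-off batch job on a private folder that is usually acceptable.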

Finally, we can do the actual renaming. To make it nicer, we’ll also change the file’s modified time to match the email date. This way the emails can be sorted and filtered by date without opening them:

        new_file_path = path + new_name
        try:
            os.utime(file_path, (timestamp, timestamp))
            os.rename(file_path, new_file_path)
            print("{} {} -> {}".format(percent_done, name, new_name))
        except OSError as e:
            print("{} Error renaming {} to {}: {}".format(percent_done, file_path, new_file_path, e.strerror))

And here’s the complete code for reference:

  1. import email.header
  2. import email.utils
  3. import glob
  4. import os
  5. import time
  6. from email.parser import Parser
  7.  
  8. def parse_rfc2047_charset(encoded):
  9.     """Process an encoded header. Multiple encodings may exist in the same header. Returns a Unicode string or '-' on error."""
  10.     output  = ""
  11.  
  12.     try:
  13.         parts = email.header.decode_header(encoded)
  14.         for text, charset in parts:
  15.             if (isinstance(text, bytes)):
  16.                 #text = text.decode(charset or 'ascii', 'ignore')
  17.                 text = text.decode('ascii', 'ignore')
  18.             output += text
  19.     except Exception:
  20.         output = "-"
  21.  
  22.     return output
  23.  
  24. path   = "/path/to/files/"
  25. ext    = ".eml"
  26. files  = glob.glob(path + "*" + ext)
  27. index  = 0
  28. parser = Parser()
  29.  
  30. for file_path in files:
  31.  
  32.     index += 1
  33.     percent_done = "{:0>7.2%}".format(index/len(files))
  34.  
  35.     # open the file for reading
  36.     try:
  37.         fp = open(file_path)
  38.     except FileNotFoundError:
  39.         print("{} File {} no longer exists".format(percent_done, file_path))
  40.         continue
  41.     except OSError as e:
  42.         print("{} Error opening {}: {}".format(percent_done, file_path, e.strerror))
  43.         continue
  44.  
  45.     # parse the file as email
  46.     try:
  47.         msg = parser.parse(fp, True)
  48.         fp.close()
  49.     except UnicodeDecodeError as e:
  50.         print("{} Error parsing {}: {}".format(percent_done, file_path, e.reason))
  51.         continue
  52.  
  53.     # convert the email date from 'Thu, 14 Jan 2010 13:10:46 +0530' to '20100114 074046' (UTC)
  54.     try:
  55.         timestamp = email.utils.mktime_tz(email.utils.parsedate_tz(msg['Date']))
  56.         mail_date = time.strftime("%Y%m%d %H%M%S", time.gmtime(timestamp))
  57.     except TypeError:
  58.         # no usable Date header: keep the file's current modified time
  59.         timestamp, mail_date = os.path.getmtime(file_path), "00000000 000000"
  60.     # get and process encoded From, To and Subject headers
  61.     mail_from = parse_rfc2047_charset(email.utils.parseaddr(msg['From'])[1])
  62.     mail_to = parse_rfc2047_charset(email.utils.parseaddr(msg['To'])[1])
  63.     mail_subj = parse_rfc2047_charset(msg['Subject'])
  64.  
  65.     # format the new name
  66.     base_name = "{} [{} -- {}] {}".format(mail_date, mail_from, mail_to, mail_subj)
  67.  
  68.     # strip illegal characters
  69.     for char in '<>:"/\\|?*\n':
  70.         base_name = base_name.replace(char, '')
  71.  
  72.     # truncate name if needed
  73.     base_name = base_name[:240]    
  74.  
  75.     #don't rename if already in the desired format
  76.     name = os.path.basename(file_path)
  77.     if (name == base_name + ext):
  78.         print("{} File {} already ok".format(percent_done, file_path))
  79.         continue
  80.  
  81.     # check if new file name already exists, if so append a number
  82.     i = 1
  83.     new_name = base_name + ext
  84.     while(os.path.isfile(path + new_name)):
  85.         new_name = base_name + " (" + str(i) + ")" + ext
  86.         i = i+1
  87.  
  88.     #compose the full path
  89.     new_file_path = path + new_name
  90.  
  91.     # rename the file
  92.     try:
  93.         os.utime(file_path, (timestamp, timestamp))
  94.         os.rename(file_path, new_file_path)
  95.         print("{} {} -> {}".format(percent_done, name, new_name))
  96.     except OSError as e:
  97.         print("{} Error renaming {} to {}: {}".format(percent_done, file_path, new_file_path, e.strerror))
Armand Niculescu


As Senior Project Manager, Armand is one of the rare developers who can do both design and programming with equal skill. This, coupled with a solid background and many years of experience, enables him to see the big picture and plan for the small details.

3 Responses

  1. Note that there appears to be an error in the code. After line 85 you need a

    > i = i + 1

    So that the code keeps trying increasingly large numbers to append to the email’s name.

    That being said, thank you so god darn much because this saved me hours.

    1. Ah, thanks Robert, looks like I missed that line when I copied the code to make the article. I corrected it. Glad you found it useful.
