Extracting the body of an email from an mbox file, decoding it to plain text regardless of charset and Content-Transfer-Encoding

I am trying to use Python 3 to extract the body of email messages from a Thunderbird mbox file. It is an IMAP account.

I would like the text part of the email body to be available for processing as a Unicode string. It should look the way the email does in Thunderbird, and not contain escaped characters such as \r\n, =20, etc.

I think it is the Content-Transfer-Encodings that I don't know how to decode or remove. I receive emails with a variety of Content-Types and Content-Transfer-Encodings. This is my current attempt:

import mailbox
import quopri,base64

def myconvert(encoded,ContentTransferEncoding):
    if ContentTransferEncoding == 'quoted-printable':
        result = quopri.decodestring(encoded)
    elif ContentTransferEncoding == 'base64':
        result = base64.b64decode(encoded)
    return result

mboxfile = 'C:/Users/Username/Documents/Thunderbird/Data/profile/ImapMail/server.name/INBOX'

for msg in mailbox.mbox(mboxfile):
    if msg.is_multipart():    # Walk through the parts of the email to find the text body.
        for part in msg.walk():
            if part.is_multipart():    # If part is multipart, walk through the subparts.
                for subpart in part.walk():
                    if subpart.get_content_type() == 'text/plain':
                        body = subpart.get_payload()    # Get the subpart payload (i.e. the message body)
                    for k,v in subpart.items():
                        if k == 'Content-Transfer-Encoding':
                            cte = v    # Keep the Content-Transfer-Encoding
            elif part.get_content_type() == 'text/plain':
                body = part.get_payload()    # Part isn't multipart, so get the payload
                for k,v in part.items():
                    if k == 'Content-Transfer-Encoding':
                        cte = v    # Keep the Content-Transfer-Encoding

print(body)
print('Body is of type:',type(body))
body = myconvert(body,cte)
print(body)

But this fails with:

Body is of type: <class 'str'>
Traceback (most recent call last):
  File "C:/Users/David/Documents/Python/test2.py", line 31, in <module>
    body = myconvert(body,cte)
  File "C:/Users/David/Documents/Python/test2.py", line 6, in myconvert
    result = quopri.decodestring(encoded)
  File "C:\Python32\lib\quopri.py", line 164, in decodestring
    return a2b_qp(s, header=header)
TypeError: 'str' does not support the buffer interface


2011-08-23 dcb

That's strange. get_payload() should return bytes, but it returns str under Python 3 unless you pass decode=True, which you don't.

I have tried it with decode=True, and that returns bytes, so there is no error. It looks like the decoding is already done, and now all I have to do is convert the bytes to a string. I haven't yet tested it on emails with a wide variety of content encodings, though. – dcb

Huh, that looks like a bug. It should be the other way around: decode=True should return str and decode=False should return bytes. :-)
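
To make the behaviour discussed in these comments concrete, here is a minimal sketch using a hand-built quoted-printable message (the sample message is made up for illustration and is not from the original mailbox). Under Python 3, `get_payload()` returns the still-encoded text as str, while `get_payload(decode=True)` undoes the Content-Transfer-Encoding and returns bytes; `get_content_charset()` then supplies the declared charset needed to turn those bytes into a str. The fallback to "ascii" is an assumption for parts that declare no charset.

```python
import email

# Hypothetical sample message, not taken from the original mailbox.
raw = (
    b"From: someone@example.com\r\n"
    b"Subject: test\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: text/plain; charset="iso-8859-1"\r\n'
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"caf=E9 au lait=20\r\n"
)

msg = email.message_from_bytes(raw)

undecoded = msg.get_payload()             # str, still quoted-printable
payload = msg.get_payload(decode=True)    # bytes, transfer encoding removed
charset = msg.get_content_charset() or "ascii"   # assumed fallback
text = payload.decode(charset, errors="replace") # str, ready for processing

print(type(undecoded), type(payload), type(text))
print(repr(text))
```

The two steps are separate on purpose: the transfer encoding (quoted-printable, base64) and the charset (iso-8859-1, utf-8) are independent layers, and `decode=True` only removes the first one.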

Here is some code that does the job. It prints errors instead of crashing on the messages where decoding would fail. I hope it is useful. Note that if this is a bug in Python 3 and it gets fixed, the .get_payload(decode=True) lines may then return a str object instead of a bytes object. I ran this code today on 2.7.2 and on Python 3.2.1.

import mailbox

def getcharsets(msg):
    # Collect every charset declared in the message, skipping parts without one.
    charsets = set()
    for c in msg.get_charsets():
        if c is not None:
            charsets.add(c)
    return charsets

def handleerror(errmsg, emailmsg, cs):
    print()
    print(errmsg)
    print("This error occurred while decoding with ", cs, " charset.")
    print("These charsets were found in the one email.", getcharsets(emailmsg))
    print("This is the subject:", emailmsg['subject'])
    print("This is the sender:", emailmsg['From'])

def getbodyfromemail(msg):
    body = None
    # Walk through the parts of the email to find the text body.
    if msg.is_multipart():
        for part in msg.walk():

            # If part is multipart, walk through the subparts.
            if part.is_multipart():

                for subpart in part.walk():
                    if subpart.get_content_type() == 'text/plain':
                        # Get the subpart payload (i.e. the message body)
                        body = subpart.get_payload(decode=True)
                        #charset = subpart.get_charset()

            # Part isn't multipart so get the email body
            elif part.get_content_type() == 'text/plain':
                body = part.get_payload(decode=True)
                #charset = part.get_charset()

    # If this isn't a multi-part message then get the payload (i.e. the message body)
    elif msg.get_content_type() == 'text/plain':
        body = msg.get_payload(decode=True)

    # No checking done to match the charset with the correct part.
    for charset in getcharsets(msg):
        try:
            body = body.decode(charset)
        except UnicodeDecodeError:
            handleerror("UnicodeDecodeError: encountered.", msg, charset)
        except AttributeError:
            handleerror("AttributeError: encountered", msg, charset)
    return body


mboxfile = 'C:/Users/Username/Documents/Thunderbird/Data/profile/ImapMail/server.name/INBOX'
print(mboxfile)
for thisemail in mailbox.mbox(mboxfile):
    body = getbodyfromemail(thisemail)
    if body is not None:  # messages without a text/plain part yield None
        print(body[0:1000])
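
The comment "No checking done to match the charset with the correct part" in the code above suggests one possible refinement: decode each text/plain part with the charset declared on that same part. The following is only a sketch under that assumption, not something the answer itself proposes. The function name `decode_text_parts`, the utf-8 fallback, and the sample message are all introduced here for illustration.

```python
import base64
import email

def decode_text_parts(msg, fallback="utf-8"):
    """Return the decoded text of every text/plain part, in document order."""
    texts = []
    for part in msg.walk():  # walk() already recurses into nested multiparts
        if part.get_content_type() != "text/plain":
            continue
        payload = part.get_payload(decode=True)  # bytes once the CTE is undone
        if payload is None:
            continue
        charset = part.get_content_charset() or fallback
        try:
            texts.append(payload.decode(charset, errors="replace"))
        except LookupError:  # charset name unknown to Python
            texts.append(payload.decode(fallback, errors="replace"))
    return texts

# Hypothetical two-part message with a different charset and CTE on each part.
first_part = base64.b64encode("h\u00e9llo".encode("utf-8"))
raw = b"\r\n".join([
    b"MIME-Version: 1.0",
    b'Content-Type: multipart/mixed; boundary="XX"',
    b"",
    b"--XX",
    b'Content-Type: text/plain; charset="utf-8"',
    b"Content-Transfer-Encoding: base64",
    b"",
    first_part,
    b"--XX",
    b'Content-Type: text/plain; charset="iso-8859-1"',
    b"Content-Transfer-Encoding: quoted-printable",
    b"",
    b"sign=E9",
    b"--XX--",
    b"",
])

texts = decode_text_parts(email.message_from_bytes(raw))
print(texts)
```

Passing `errors="replace"` trades exactness for robustness: a mislabelled part produces replacement characters instead of an exception, which may or may not be what you want for further processing.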


2011-08-25 09:27:19 dcb

This script seems to return all messages correctly:

def getcharsets(msg):
    charsets = set()
    for c in msg.get_charsets():
        if c is not None:
            charsets.add(c)
    return charsets

def getBody(msg):
    # Descend into the first child until a non-multipart part is reached.
    while msg.is_multipart():
        msg = msg.get_payload()[0]
    t = msg.get_payload(decode=True)
    for charset in getcharsets(msg):
        t = t.decode(charset)
    return t

The previous answer by ACD often returns only a footer of the actual message (at least for the Gmane email messages I am opening with this toolbox: https://pypi.python.org/pypi/gmane).

Cheers
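
A likely explanation for the footer problem, judging from the two code listings, is that the earlier function overwrites its result each time it meets a text/plain part, so the last such part wins, whereas getBody follows the first child at every multipart level. The sketch below illustrates the difference on a made-up message (the message content and the helper `decoded` are assumptions for illustration, not taken from Gmane).

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Hypothetical mailing-list style message: real body first, footer last.
msg = MIMEMultipart("mixed")
msg.attach(MIMEText("The real message body.", "plain", "utf-8"))
msg.attach(MIMEText("-- \nMailing list footer", "plain", "utf-8"))

def decoded(part):
    # Undo the transfer encoding, then decode with the part's own charset.
    payload = part.get_payload(decode=True)
    return payload.decode(part.get_content_charset() or "ascii")

text_parts = [p for p in msg.walk() if p.get_content_type() == "text/plain"]

first_text = decoded(text_parts[0])    # what a "first part" strategy returns
last_text = decoded(text_parts[-1])    # what a "last part wins" strategy returns

print(repr(first_text))
print(repr(last_text))
```

Neither strategy is right for every message: taking the first child can land on a non-text part, so which one fits depends on how the mail in your mailbox is structured.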


2015-10-21 04:44:51
