Copying malformed gzipped S3 object to RDS or efficiently changing line terminator

I have an S3 object from a third-party data provider who has (by whatever mechanism) inconsistently written this object. The header row is terminated by a newline \n but every subsequent row is terminated by a carriage return+line feed \r\n. This object is gzipped csv. I want to copy this to our S3 bucket from theirs, and then copy that object into a PostgreSQL RDS table using the aws_s3 extensions. These extensions can usually handle gzipped csv files, but because the header row has a different line terminator than the remaining rows, a copy fails with the following error even with copy options '(format csv, delimiter '','', header true)':

psycopg2.errors.BadCopyFileFormat: unquoted carriage return found in data
HINT:  Use quoted CSV field to represent carriage return.

Ultimately, I gather from the PostgreSQL source code that this is because of the inconsistency in line terminators; the PostgreSQL engine looks to the first row to determine line terminator even with header true.

So my first question is, is my assessment here true, or is there a way to instruct PostgreSQL COPY to handle a csv file where the header has a newline character at the end but a carriage return ending all other lines?

Assuming that it cannot, I now need to normalize the line terminator before I write this object out to S3. From boto3's get_object I have a bunch of bytes:

resp = s3_client.get_object(Bucket="the-source-bucket", Key="location/of/the_object.csv.gz")
body_bytes = resp["Body"].read()

And I want to replace any instance of \r\n with just \n before I do something like

s3_client.put_object(
  Bucket="the-destination-bucket", 
  Key="/location/of/the/output.csv[.gz?]", 
  Body=<body_bytes with the correct line terminator everywhere>
)

What is the most efficient way to do this?

Search This Blog

Theprogrammersfirst | A technical portal.

Copying malformed gzipped S3 object to RDS or efficiently changing line terminator

Comments

Post a Comment