Loading data to mariaDB via Python and Int Column is being set to 0
I am loading a dataset into my MariaDB database via Python, and althought the data looks correct when I view a subset in a DataFrame in Python, when it reaches the table in mariaDB some errors have occured
- date column last 2 values missing
- price column set to 0 and potentially other issues
I am loading this data set, looping over the half-yearly files https://www.gov.uk/government/statistical-data-sets/price-paid-data-downloads
My method is to create a database table and then load the data via for loop from locally saved csv files
#define fn to run queries
def run(query):
connection = mysql.connector.connect( user=USR, password=PASSWORD, database=DBNAME, host=ENDPOINT , port =PORT, allow_local_infile = True)
cursor = connection.cursor()
cursor.execute(query, multi=True)
#create database table
query = """
DROP TABLE IF EXISTS `pp_data`;
CREATE TABLE IF NOT EXISTS `pp_data` (
`transaction_unique_identifier` tinytext COLLATE utf8_bin NOT NULL,
`price` int(10) unsigned NOT NULL,
`date_of_transfer` date NOT NULL,
`postcode` varchar(8) COLLATE utf8_bin NOT NULL,
`property_type` varchar(1) COLLATE utf8_bin NOT NULL,
`new_build_flag` varchar(1) COLLATE utf8_bin NOT NULL,
`tenure_type` varchar(1) COLLATE utf8_bin NOT NULL,
`primary_addressable_object_name` tinytext COLLATE utf8_bin NOT NULL,
`secondary_addressable_object_name` tinytext COLLATE utf8_bin NOT NULL,
`street` tinytext COLLATE utf8_bin NOT NULL,
`locality` tinytext COLLATE utf8_bin NOT NULL,
`town_city` tinytext COLLATE utf8_bin NOT NULL,
`district` tinytext COLLATE utf8_bin NOT NULL,
`county` tinytext COLLATE utf8_bin NOT NULL,
`ppd_category_type` varchar(2) COLLATE utf8_bin NOT NULL,
`record_status` varchar(2) COLLATE utf8_bin NOT NULL,
`db_id` bigint(20) unsigned NOT NULL
) DEFAULT CHARSET=utf8 COLLATE=utf8_bin AUTO_INCREMENT=1 ;
"""
cursor.execute(query, multi = True)
#add primary key
query=""" ALTER TABLE `pp_data`
ADD PRIMARY KEY (`db_id`);
MODIFY `db_id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,AUTO_INCREMENT=1;
CREATE INDEX `pp.postcode` USING HASH
ON `pp_data`
(postcode);
CREATE INDEX `pp.date` USING HASH
ON `pp_data`
(date_of_transfer);"""
run(query)
#pull data via for loop
year_list = range(1995,2000)
filename1 = ["http://prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/pp-{0}-part1.csv".format(year) for year in year_list]
filename2 = ["http://prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/pp-{0}-part2.csv".format(year) for year in year_list]
filenames= filename1 + filename2
connection = mysql.connector.connect( user=USR, password=PASSWORD, database=DBNAME, host=ENDPOINT , port =PORT, allow_local_infile = True)
cursor = connection.cursor()
query = """LOAD DATA local INFILE 'out.csv'
INTO TABLE property_prices.pp_data
FIELDS TERMINATED BY ','
ENCLOSED BY ''
LINES TERMINATED BY '\n';
"""
for url in filenames:
data = pd.read_csv(url)
df = pd.DataFrame(data)
df.to_csv('out.csv')
#print(df.head())
cursor.execute(query)
When I look at one of the data frames (df) within Python the data looks correct:
{36F3EF94-82A0-418E-8A00-19D504CFC03B} 33000 1995-11-17 00:00 PL24 2NH \
0 {4D99BC24-0AD1-454E-96BB-19D51B3AD118} 54000 1995-04-28 00:00 CB23 7XB
1 {0B64C330-4E83-427D-9C98-20F0FE8697F0} 27000 1995-02-24 00:00 DL3 9TU
2 {AC7D0429-9801-4584-8E5E-20F10CF3DD36} 56950 1995-10-20 00:00 CV7 7LY
3 {8EB62929-D245-4BB9-ADB5-20F10E847F1A} 55000 1995-01-31 00:00 MK41 7LS
4 {BCB01E86-FB6A-4031-A67D-20F11887223B} 62995 1995-04-07 00:00 E3 2UR
T N F 41 Unnamed: 8 FORE STREET ST BLAZEY \
0 T N F 19 NaN LAMBOURNE ROAD HARDWICK
1 T N F 11 NaN ATHERSTONE WAY DARLINGTON
2 T N F 10 NaN HIGHFIELD MERIDEN
3 S N F 12 NaN LELY CLOSE BEDFORD
4 F Y L PARK WEST BUILDING FLAT 38 FAIRFIELD ROAD LONDON
PAR RESTORMEL CORNWALL A A.1
0 CAMBRIDGE SOUTH CAMBRIDGESHIRE CAMBRIDGESHIRE A A
1 DARLINGTON DARLINGTON DARLINGTON A A
2 COVENTRY SOLIHULL WEST MIDLANDS A A
3 BEDFORD NORTH BEDFORDSHIRE BEDFORDSHIRE A A
4 LONDON TOWER HAMLETS GREATER LONDON A A
But when I look at the table within SQL it looks quite different, in particular, the date column is truncated and the price column values are 0
query = "select * from pp_data limit 5 ;"
cursor.execute(query)
records = cursor.fetchall()
print(records)
[(b'', 0, None, '1995-11-', 'P', 'T', 'N', b'F', b'41', b'Unnamed: 8', b'FORE STREET', b'ST BLAZEY', b'PAR', b'RESTORMEL', 'CO', 'A', 0), (b'0', 0, None, '1995-04-', 'C', 'T', 'N', b'F', b'19', b'', b'LAMBOURNE ROAD', b'HARDWICK', b'CAMBRIDGE', b'SOUTH CAMBRIDGESHIRE', 'CA', 'A', 0), (b'1', 0, None, '1995-02-', 'D', 'T', 'N', b'F', b'11', b'', b'ATHERSTONE WAY', b'DARLINGTON', b'DARLINGTON', b'DARLINGTON', 'DA', 'A', 0), (b'2', 0, None, '1995-10-', 'C', 'T', 'N', b'F', b'10', b'', b'HIGHFIELD', b'MERIDEN', b'COVENTRY', b'SOLIHULL', 'WE', 'A', 0), (b'3', 0, None, '1995-01-', 'M', 'S', 'N', b'F', b'12', b'', b'LELY CLOSE', b'BEDFORD', b'BEDFORD', b'NORTH BEDFORDSHIRE', 'BE', 'A', 0)]
I Am flummoxed as to what to do especially since the schema for the table was provided by my Uni! Thanks :)
from Recent Questions - Stack Overflow https://ift.tt/3cQ7DVJ
https://ift.tt/eA8V8J
Comments
Post a Comment