2021-11-27

Loading data to mariaDB via Python and Int Column is being set to 0

I am loading a dataset into my MariaDB database via Python, and althought the data looks correct when I view a subset in a DataFrame in Python, when it reaches the table in mariaDB some errors have occured

  • date column last 2 values missing
  • price column set to 0 and potentially other issues

I am loading this data set, looping over the half-yearly files https://www.gov.uk/government/statistical-data-sets/price-paid-data-downloads

My method is to create a database table and then load the data via for loop from locally saved csv files


    #define fn to run queries 
    def run(query):
        connection =   mysql.connector.connect( user=USR, password=PASSWORD, database=DBNAME,  host=ENDPOINT , port =PORT, allow_local_infile = True)
        cursor = connection.cursor()
        cursor.execute(query, multi=True)
    
    
    #create database table 
    query = """
    DROP TABLE IF EXISTS `pp_data`;
    CREATE TABLE IF NOT EXISTS `pp_data` (
      `transaction_unique_identifier` tinytext COLLATE utf8_bin NOT NULL,
      `price` int(10) unsigned NOT NULL,
      `date_of_transfer` date NOT NULL,
      `postcode` varchar(8) COLLATE utf8_bin NOT NULL,
      `property_type` varchar(1) COLLATE utf8_bin NOT NULL,
      `new_build_flag` varchar(1) COLLATE utf8_bin NOT NULL,
      `tenure_type` varchar(1) COLLATE utf8_bin NOT NULL,
      `primary_addressable_object_name` tinytext COLLATE utf8_bin NOT NULL,
      `secondary_addressable_object_name` tinytext COLLATE utf8_bin NOT NULL,
      `street` tinytext COLLATE utf8_bin NOT NULL,
      `locality` tinytext COLLATE utf8_bin NOT NULL,
      `town_city` tinytext COLLATE utf8_bin NOT NULL,
      `district` tinytext COLLATE utf8_bin NOT NULL,
      `county` tinytext COLLATE utf8_bin NOT NULL,
      `ppd_category_type` varchar(2) COLLATE utf8_bin NOT NULL,
      `record_status` varchar(2) COLLATE utf8_bin NOT NULL,
      `db_id` bigint(20) unsigned NOT NULL
    ) DEFAULT CHARSET=utf8 COLLATE=utf8_bin AUTO_INCREMENT=1 ;
    """
    
    cursor.execute(query, multi = True)
    
    #add primary key 
    query=""" ALTER TABLE `pp_data`
    ADD PRIMARY KEY (`db_id`);
    MODIFY `db_id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,AUTO_INCREMENT=1;
    CREATE INDEX `pp.postcode` USING HASH
      ON `pp_data`
        (postcode);
    CREATE INDEX `pp.date` USING HASH
      ON `pp_data` 
        (date_of_transfer);"""
    
    run(query)
    
    #pull data via for loop 
    year_list = range(1995,2000)
    
    filename1 = ["http://prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/pp-{0}-part1.csv".format(year) for year in year_list]
    filename2 =  ["http://prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/pp-{0}-part2.csv".format(year) for year in year_list] 
    filenames= filename1 + filename2 
    
    connection =   mysql.connector.connect( user=USR, password=PASSWORD, database=DBNAME,  host=ENDPOINT , port =PORT, allow_local_infile = True)
    cursor = connection.cursor()
    
    
    query =  """LOAD DATA local INFILE 'out.csv' 
    INTO TABLE property_prices.pp_data 
    FIELDS TERMINATED BY ',' 
    ENCLOSED BY ''
    LINES TERMINATED BY '\n';
    """
    
    for url in filenames:
        data = pd.read_csv(url)   
        df = pd.DataFrame(data)
        df.to_csv('out.csv')
        #print(df.head())
        cursor.execute(query)

When I look at one of the data frames (df) within Python the data looks correct:

   {36F3EF94-82A0-418E-8A00-19D504CFC03B}  33000  1995-11-17 00:00  PL24 2NH  \
0  {4D99BC24-0AD1-454E-96BB-19D51B3AD118}  54000  1995-04-28 00:00  CB23 7XB   
1  {0B64C330-4E83-427D-9C98-20F0FE8697F0}  27000  1995-02-24 00:00   DL3 9TU   
2  {AC7D0429-9801-4584-8E5E-20F10CF3DD36}  56950  1995-10-20 00:00   CV7 7LY   
3  {8EB62929-D245-4BB9-ADB5-20F10E847F1A}  55000  1995-01-31 00:00  MK41 7LS   
4  {BCB01E86-FB6A-4031-A67D-20F11887223B}  62995  1995-04-07 00:00    E3 2UR   

   T  N  F                  41 Unnamed: 8     FORE STREET   ST BLAZEY  \
0  T  N  F                  19        NaN  LAMBOURNE ROAD    HARDWICK   
1  T  N  F                  11        NaN  ATHERSTONE WAY  DARLINGTON   
2  T  N  F                  10        NaN       HIGHFIELD     MERIDEN   
3  S  N  F                  12        NaN      LELY CLOSE     BEDFORD   
4  F  Y  L  PARK WEST BUILDING    FLAT 38  FAIRFIELD ROAD      LONDON   

          PAR             RESTORMEL        CORNWALL  A A.1  
0   CAMBRIDGE  SOUTH CAMBRIDGESHIRE  CAMBRIDGESHIRE  A   A  
1  DARLINGTON            DARLINGTON      DARLINGTON  A   A  
2    COVENTRY              SOLIHULL   WEST MIDLANDS  A   A  
3     BEDFORD    NORTH BEDFORDSHIRE    BEDFORDSHIRE  A   A  
4      LONDON         TOWER HAMLETS  GREATER LONDON  A   A 

But when I look at the table within SQL it looks quite different, in particular, the date column is truncated and the price column values are 0


    query = "select * from pp_data limit 5  ;"
    cursor.execute(query)
    records = cursor.fetchall()
    print(records)

[(b'', 0, None, '1995-11-', 'P', 'T', 'N', b'F', b'41', b'Unnamed: 8', b'FORE STREET', b'ST BLAZEY', b'PAR', b'RESTORMEL', 'CO', 'A', 0), (b'0', 0, None, '1995-04-', 'C', 'T', 'N', b'F', b'19', b'', b'LAMBOURNE ROAD', b'HARDWICK', b'CAMBRIDGE', b'SOUTH CAMBRIDGESHIRE', 'CA', 'A', 0), (b'1', 0, None, '1995-02-', 'D', 'T', 'N', b'F', b'11', b'', b'ATHERSTONE WAY', b'DARLINGTON', b'DARLINGTON', b'DARLINGTON', 'DA', 'A', 0), (b'2', 0, None, '1995-10-', 'C', 'T', 'N', b'F', b'10', b'', b'HIGHFIELD', b'MERIDEN', b'COVENTRY', b'SOLIHULL', 'WE', 'A', 0), (b'3', 0, None, '1995-01-', 'M', 'S', 'N', b'F', b'12', b'', b'LELY CLOSE', b'BEDFORD', b'BEDFORD', b'NORTH BEDFORDSHIRE', 'BE', 'A', 0)]

I Am flummoxed as to what to do especially since the schema for the table was provided by my Uni! Thanks :)



from Recent Questions - Stack Overflow https://ift.tt/3cQ7DVJ
https://ift.tt/eA8V8J

No comments:

Post a Comment