Understanding Pandas read_csv and read_excel errors
In data science we often deal with messy, heterogeneous data in a variety of file types.
Python Pandas is a very powerful data science tool.
A simple but not infrequent mistake is using the wrong Pandas function to read data, that is, using read_excel to read CSV data or read_csv to read Excel spreadsheet data.
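Note: older Pandas versions cannot read ODS OpenDocument formats, so for those using LibreOffice/OpenOffice, convert ODS data to XLSX first. Pandas 0.25 and later can read .ods files through read_excel with engine="odf", provided the odfpy package is installed.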
Pandas wrong function format errors
The intended Pandas reader usage is:
read_csv() for .csv and .tsv files
read_excel() for .xls and .xlsx files
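The mapping above can be sketched as a small dispatch on the file suffix. This is only an illustration: the names READERS, reader_name and load_table are hypothetical helpers, not part of Pandas, and the sketch trusts the file extension to be correct.

```python
from pathlib import Path

# Hypothetical mapping from file suffix to the name of the Pandas reader.
READERS = {
    ".csv": "read_csv",
    ".tsv": "read_csv",
    ".xls": "read_excel",
    ".xlsx": "read_excel",
}


def reader_name(path):
    """Return the name of the Pandas reader that matches the file suffix."""
    suffix = Path(path).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"no Pandas reader known for {suffix!r} files")
    return READERS[suffix]


def load_table(path, **kwargs):
    """Read a file with the reader chosen from its suffix."""
    import pandas as pd  # imported here so reader_name() works without Pandas

    if Path(path).suffix.lower() == ".tsv":
        kwargs.setdefault("sep", "\t")  # TSV is tab-delimited text
    return getattr(pd, reader_name(path))(path, **kwargs)
```

With this in place, load_table("data.xlsx") would call read_excel, while load_table("data.tsv") would call read_csv with a tab separator.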
read_excel → .csv
This leads to errors including:
xlrd.biffh.XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b''
read_csv → .xlsx
This leads to errors including:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfa in position 1: invalid start byte
ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.
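Both mismatches can often be caught before calling Pandas by looking at the first bytes of the file rather than its name. The following is a minimal sketch under that assumption; sniff_format is a hypothetical helper, and the signatures it checks belong to the ZIP and OLE2 container formats rather than to Pandas.

```python
# File signatures ("magic bytes") of the container formats.
ZIP_MAGIC = b"PK\x03\x04"  # .xlsx is a ZIP archive
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # legacy .xls container


def sniff_format(path):
    """Guess 'xlsx', 'xls' or 'text' from the first bytes of a file."""
    with open(path, "rb") as f:
        head = f.read(8)
    if head.startswith(ZIP_MAGIC):
        return "xlsx"  # could also be another ZIP-based format such as .ods
    if head.startswith(OLE2_MAGIC):
        return "xls"
    return "text"  # assume delimited text such as CSV or TSV
```

A file named report.csv for which sniff_format returns "xlsx" is most likely a spreadsheet saved with the wrong extension, and read_excel is the reader to try.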
Pandas prereqs
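To help avoid excessive prerequisites, Pandas makes the xlrd install optional until read_excel is used. Note that xlrd 2.0 and later reads only .xls files; for .xlsx files, recent Pandas versions use openpyxl instead. So simply do:
pip install pandas xlrd openpyxl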
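Because the Excel engines are optional, a script can check for them up front instead of failing inside read_excel. A minimal sketch, using only the standard library; has_module is a hypothetical helper name.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


# Report which optional Excel engines are available in this environment.
for engine in ("xlrd", "openpyxl"):
    status = "installed" if has_module(engine) else "missing"
    print(f"{engine}: {status}")
```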