Creating a DataFrame from a Fixed-Width File using Pandas

Some datasets are provided in a fixed-width file format (common extension is .txt, but includes many others as well). The pd.read_fwf function provides the functionality to read fixed-width file formats.

Fixed-width files are those whose format is specified through the width of the columns, where each column will always have a certain number of characters.

Reading in a fixed-width file from a URL

With many publicly available datasets, you can access the dataset with a URL. One example of this is a dataset from the National Oceanic and Atmospheric Administration that gives the location of every station that is a part of the Global Historical Climatology Network, which collects climate data. The dataset contains a row for every weather stations in a fixed-width format:

ACW00011604  17.1167  -61.7833   10.1    ST JOHNS COOLIDGE FLD                       
ACW00011647  17.1333  -61.7833   19.2    ST JOHNS                                    
AE000041196  25.3330   55.5170   34.0    SHARJAH INTER. AIRP            GSN     41196
AEM00041194  25.2550   55.3640   10.4    DUBAI INTL                             41194
AEM00041217  24.4330   54.6510   26.8    ABU DHABI INTL                         41217
AEM00041218  24.2620   55.6090  264.9    AL AIN INTL                            41218
AF000040930  35.3170   69.0170 3366.0    NORTH-SALANG                   GSN     40930
AFM00040938  34.2100   62.2280  977.2    HERAT                                  40938
AFM00040948  34.5660   69.2120 1791.3    KABUL INTL                             40948
AFM00040990  31.5000   65.8500 1010.0    KANDAHAR AIRPORT                       40990
AG000060390  36.7167    3.2500   24.0    ALGER-DAR EL BEIDA             GSN     60390
AG000060590  30.5667    2.8667  397.0    EL-GOLEA                       GSN     60590
AG000060611  28.0500    9.6331  561.0    IN-AMENAS                      GSN     60611
AG000060680  22.8000    5.4331 1362.0    TAMANRASSET                    GSN     60680
AGE00135039  35.7297    0.6500   50.0    ORAN-HOPITAL MILITAIRE                      
AGE00147704  36.9700    7.7900  161.0    ANNABA-CAP DE GARDE                         
AGE00147705  36.7800    3.0700   59.0    ALGIERS-VILLE/UNIVERSITE                    
AGE00147706  36.8000    3.0300  344.0    ALGIERS-BOUZAREAH                           
AGE00147707  36.8000    3.0400   38.0    ALGIERS-CAP CAXINE                          
AGE00147708  36.7200    4.0500  222.0    TIZI OUZOU                             60395
AGE00147709  36.6300    4.2000  942.0    FORT NATIONAL                               
AGE00147710  36.7500    5.1000    9.0    BEJAIA-BOUGIE (PORT)                   60401
AGE00147711  36.3697    6.6200  660.0    CONSTANTINE                                 
AGE00147712  36.1700    1.3400  112.0    ORLEANSVILLE (CHLEF)                        
AGE00147713  36.1800    5.4000 1081.0    SETIF                                       
AGE00147714  35.7700    0.8000   78.0    ORAN-CAP FALCON                             
AGE00147715  35.4200    8.1197  863.0    TEBESSA                                     
AGE00147716  35.1000   -1.8500   83.0    NEMOURS (GHAZAOUET)                    60517
AGE00147717  35.2000    0.6300  476.0    SIDI-BEL-ABBES                              
AGE00147718  34.8500    5.7200  125.0    BISKRA                                 60525
AGE00147719  33.7997    2.8900  767.0    LAGHOUAT                               60545[...]
See the full dataset at https://www.ncei.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt
[...]

ghcnd-stations.txt from the Global Historical Climatology Network

To load this dataset, input the URL in quotes to the read_fwf function.

import pandas as pd
df = read_fwf("https://www.ncei.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt")

	ACW00011604	17.1167	-61.7833	10.1	ST JOHNS COOLIDGE FLD	Unnamed: 5	Unnamed: 6	Unnamed: 7
0	ACW00011647	17.1333	-61.7833	19.2	ST JOHNS	NaN	NaN	NaN
1	AE000041196	25.3330	55.5170	34.0	SHARJAH INTER. AIRP	NaN	GSN	41196.0
2	AEM00041194	25.2550	55.3640	10.4	DUBAI INTL	NaN	NaN	41194.0
3	AEM00041217	24.4330	54.6510	26.8	ABU DHABI INTL	NaN	NaN	41217.0
4	AEM00041218	24.2620	55.6090	264.9	AL AIN INTL	NaN	NaN	41218.0
...	...	...	...	...	...	...	...	...
119497	ZI000067969	21.0500	29.3670	861.0	WEST NICHOLSON	NaN	NaN	67969.0
119498	ZI000067975	20.0670	30.8670	1095.0	MASVINGO	NaN	NaN	67975.0
119499	ZI000067977	21.0170	31.5830	430.0	BUFFALO RANGE	NaN	NaN	67977.0
119500	ZI000067983	20.2000	32.6160	1132.0	CHIPINGE	NaN	GSN	67983.0
119501	ZI000067991	22.2170	30.0000	457.0	BEITBRIDGE	NaN	NaN	67991.0

Loading a fixed-width file from a URL

However, we see our first row of data has ended up as our column names. To fix this, we can use the names parameter of the read_fwf function. This parameter takes in a list of the column names.

We can find the name of each column in the guidebook in section IV.

Then we can do the same thing as before, now just with additional data.

import pandas as pd
df = read_fwf("https://www.ncei.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt",
              names=["ID", "LATITUDE", "LONGITUDE", "ELEVATION", "STATE", "NAME", "GSN FLAG", "HCN/CRN FLAG"])

	ID	LATITUDE	LONGITUDE	ELEVATION	STATE	NAME	GSN FLAG	HCN/CRN FLAG
0	ACW00011604	17.1167	-61.7833	10.1	ST JOHNS COOLIDGE FLD	NaN	NaN	NaN
1	ACW00011647	17.1333	-61.7833	19.2	ST JOHNS	NaN	NaN	NaN
2	AE000041196	25.3330	55.5170	34.0	SHARJAH INTER. AIRP	NaN	GSN	41196.0
3	AEM00041194	25.2550	55.3640	10.4	DUBAI INTL	NaN	NaN	41194.0
4	AEM00041217	24.4330	54.6510	26.8	ABU DHABI INTL	NaN	NaN	41217.0
...	...	...	...	...	...	...	...	...
119498	ZI000067969	21.0500	29.3670	861.0	WEST NICHOLSON	NaN	NaN	67969.0
119499	ZI000067975	20.0670	30.8670	1095.0	MASVINGO	NaN	NaN	67975.0
119500	ZI000067977	21.0170	31.5830	430.0	BUFFALO RANGE	NaN	NaN	67977.0
119501	ZI000067983	20.2000	32.6160	1132.0	CHIPINGE	NaN	GSN	67983.0
119502	ZI000067991	22.2170	30.0000	457.0	BEITBRIDGE	NaN	NaN	67991.0

Loading a fixed-width file using column names

Reading in a fixed-width file from your local computer

Some datasets are not accessible online directly by Python and needs to be downloaded. To read in a fixed-width file from your computer, you often want to first make sure the file is in the same folder as the python file you are working with.

One fixed-width dataset is a dataset that contains information about the human protein sequences, and specifically about the gene sequence on chromosome 1. It can be found as part of UniProt's Swiss-Prot Protein Knowledgebase. The file begins with multiple lines of information at the top and bottom of the file that are not included in the dataset that provide additional information to the user:

----------------------------------------------------------------------------
        UniProt - Swiss-Prot Protein Knowledgebase
        SIB Swiss Institute of Bioinformatics; Geneva, Switzerland
        European Bioinformatics Institute (EBI); Hinxton, United Kingdom
        Protein Information Resource (PIR); Washington DC, USA
----------------------------------------------------------------------------

Description: Human chromosome 1: entries, gene names and
             cross-references to MIM
Name:        humchr01.txt
Release:     2022_02 of 25-May-2022

----------------------------------------------------------------------------

This documents lists all the human protein sequence entries whose genes
are known to be encoded on chromosome 1 in this release of UniProtKB/Swiss-Prot.

Number of UniProtKB/Swiss-Prot entries encoded on chromosome 1: 2066

Reference for the chromosome sequence:
Nature 441:315-321(2006).
PubMed=16710414; DOI=10.1038/nature04727;

For more information on chromosome 1 see:

Sanger  : http://www.sanger.ac.uk/about/history/hgp/chr1.html
Ensembl : https://www.ensembl.org/Homo_sapiens/mapview?chr=1
NCBI    : https://www.ncbi.nlm.nih.gov/mapview/maps.cgi?taxid=9606&chr=1
OMIM    : https://www.ncbi.nlm.nih.gov/Omim/getmap.cgi?chromosome=1
DOE     : http://www.ornl.gov/sci/techresources/Human_Genome/launchpad/chrom01.shtml


______________ _______________ ______________________ ______ ______________________
Gene           Chromosomal     Swiss-Prot             MIM    Description
name           position        AC        Entry name   code
______________ _______________ ______________________ ______ ______________________
A3GALT2       1p35.1          U3KPV4     A3LT2_HUMAN        Alpha-1,3-galactosyltransferase 2 (EC 2.4.1.87) (Isoglobotriaosylceramide synthase) (iGb3 synthase) (iGb3S) [A3GALT2P] [IGBS3S]
AADACL3       1p36.21         Q5VUY0     ADCL3_HUMAN        Arylacetamide deacetylase-like 3 (EC 3.1.1.-)
AADACL4       1p36.21         Q5VUY2     ADCL4_HUMAN        Arylacetamide deacetylase-like 4 (EC 3.1.1.-)
ABCA4         1p21-p22.1      P78363     ABCA4_HUMAN 601691 Retinal-specific phospholipid-transporting ATPase ABCA4 (EC 7.6.2.1) (ATP-binding cassette sub-family A member 4) (RIM ABC transporter) (RIM proteinv) (RmP) (Retinal-specific ATP-binding cassette transporter) (Stargardt disease protein) [ABCR]
ABCB10        1q42            Q9NRK6     ABCBA_HUMAN 605454 ATP-binding cassette sub-family B member 10, mitochondrial precursor (ABC-mitochondrial erythroid protein) (ABC-me protein) (ATP-binding cassette transporter 10) (ABC transporter 10 protein) (Mitochondrial ATP-binding cassette 2) (M-ABC2)[...]
See the full dataset at https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/docs/humchr01.txt
[...]
ZSWIM5        1p34.1          Q9P217     ZSWM5_HUMAN        Zinc finger SWIM domain-containing protein 5 [KIAA1511]
ZYG11A        1p32.3          Q6WRX3     ZY11A_HUMAN 618675 Protein zyg-11 homolog A [ZYG11]
ZYG11B        1p32.3          Q9C0D3     ZY11B_HUMAN 618673 Protein zyg-11 homolog B [KIAA1730]
ZZZ3          1p31.1          Q8IYH5     ZZZ3_HUMAN         ZZ-type zinc finger-containing protein 3
-             1p32.1          Q0P140     YA037_HUMAN        Putative uncharacterized protein HSD52 [HSD52]
-             1p34.3          Q6XCG6     YA011_HUMAN        Putative uncharacterized protein PP632 [PP632]
-             1p34.3          Q7L0L9     YA043_HUMAN        Transmembrane protein LOC653160
-             1p36.13         P0DMU3     F231L_HUMAN        FAM231A/C-like protein LOC102723383
-             1q21.3          A8MUI8     YA034_HUMAN        Putative UPF0607 protein ENSP00000383783

--------------------------------------------------------------------------------
Copyrighted by the UniProt Consortium, see https://www.uniprot.org/terms
Distributed under the Creative Commons Attribution (CC BY 4.0) License
--------------------------------------------------------------------------------

https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/docs/humchr01.txt

There's 35 lines at the top we need to skip and 5 at the bottom. To do this, we will use the skiprows and skipfooter parameters in the read_fwf function. That will look like this:

import pandas as pd
df = read_fwf('humchr01.txt', skiprows=35, skipfooter=5)

	A3GALT2	1p35.1	U3KPV4	A3LT2_HUMAN	Unnamed: 4	Alpha-1,3-galactosyltransferase 2 (EC 2.4.1.87) (Isoglobotriaosylceramide synthase) (iGb3 synthase) (iGb3S) [A3GALT2P] [IGBS3S]	Unnamed: 6	Unnamed: 7
0	AADACL3	1p36.21	Q5VUY0	ADCL3_HUMAN	NaN	Arylacetamide deacetylase-like 3 (EC 3.1.1.-)	NaN	NaN
1	AADACL4	1p36.21	Q5VUY2	ADCL4_HUMAN	NaN	Arylacetamide deacetylase-like 4 (EC 3.1.1.-)	NaN	NaN
2	ABCA4	1p21-p22.1	P78363	ABCA4_HUMAN	601691.0	Retinal-specific phospholipid-transporting ATP...	NaN	NaN
3	ABCB10	1q42	Q9NRK6	ABCBA_HUMAN	605454.0	ATP-binding cassette sub-family B member 10, m...	NaN	NaN
4	ABCD3	1p21-p22	P28288	ABCD3_HUMAN	170995.0	ATP-binding cassette sub-family D member 3 (EC...	NaN	NaN
...	...	...	...	...	...	...	...	...
2065	-	1p32.1	Q0P140	YA037_HUMAN	NaN	Putative uncharacterized protein HSD52 [HSD52]	NaN	NaN
2066	-	1p34.3	Q6XCG6	YA011_HUMAN	NaN	Putative uncharacterized protein PP632 [PP632]	NaN	NaN
2067	-	1p34.3	Q7L0L9	YA043_HUMAN	NaN	Transmembrane protein LOC653160	NaN	NaN
2068	-	1p36.13	P0DMU3	F231L_HUMAN	NaN	FAM231A/C-like protein LOC102723383	NaN	NaN
2069	-	1q21.3	A8MUI8	YA034_HUMAN	NaN	Putative UPF0607 protein ENSP00000383783	NaN	NaN

Loading a fixed-width file with skipped rows

We again see the first row of data has been used as column names. Again, we will use the names parameter to provide the function with a list of column names.

However this time, when we use the names parameter, it makes the first two fields into the index column. We don't want this. So instead, we will set the index_col parameter to False to specify that we haven't included a column for the index, so the dataframe must generate one itself.

Putting it all together, our program will look like this:

import pandas as pd
df = read_fwf('humchr01.txt', skiprows=35, skipfooter=5, index_col=False,
              names=['gene_name', 'chromosomal_position', 'uniprot', 'entry_name', 'mtm_code', 'description'])

	gene_name	chromosomal_position	uniprot	entry_name	mtm_code	description
0	A3GALT2	1p35.1	U3KPV4	A3LT2_HUMAN	NaN	Alpha-1,3-galactosyltransferase 2 (EC 2.4.1.87...
1	AADACL3	1p36.21	Q5VUY0	ADCL3_HUMAN	NaN	Arylacetamide deacetylase-like 3 (EC 3.1.1.-)
2	AADACL4	1p36.21	Q5VUY2	ADCL4_HUMAN	NaN	Arylacetamide deacetylase-like 4 (EC 3.1.1.-)
3	ABCA4	1p21-p22.1	P78363	ABCA4_HUMAN	601691.0	Retinal-specific phospholipid-transporting ATP...
4	ABCB10	1q42	Q9NRK6	ABCBA_HUMAN	605454.0	ATP-binding cassette sub-family B member 10, m...
...	...	...	...	...	...	...
2066	-	1p32.1	Q0P140	YA037_HUMAN	NaN	Putative uncharacterized protein HSD52 [HSD52]
2067	-	1p34.3	Q6XCG6	YA011_HUMAN	NaN	Putative uncharacterized protein PP632 [PP632]
2068	-	1p34.3	Q7L0L9	YA043_HUMAN	NaN	Transmembrane protein LOC653160
2069	-	1p36.13	P0DMU3	F231L_HUMAN	NaN	FAM231A/C-like protein LOC102723383
2070	-	1q21.3	A8MUI8	YA034_HUMAN	NaN	Putative UPF0607 protein ENSP00000383783

Loading a fixed-width file with formatting

The dataset for this section and suggestions on formatting can be found in this article by Amy Rask.

Pandas Documentation

The full documentation for read_fwf is available in the pandas documentation.