Creating a DataFrame from a Fixed-Width File using Pandas


Some datasets are provided in a fixed-width file format (common extension is .txt, but includes many others as well). The pd.read_fwf function provides the functionality to read fixed-width file formats.

Fixed-width files are those whose format is specified through the width of the columns, where each column will always have a certain number of characters.

Reading in a fixed-width file from a URL

With many publicly available datasets, you can access the dataset with a URL. One example of this is a dataset from the National Oceanic and Atmospheric Administration that gives the location of every station that is a part of the Global Historical Climatology Network, which collects climate data. The dataset contains a row for every weather stations in a fixed-width format:

ACW00011604  17.1167  -61.7833   10.1    ST JOHNS COOLIDGE FLD                       
ACW00011647  17.1333  -61.7833   19.2    ST JOHNS                                    
AE000041196  25.3330   55.5170   34.0    SHARJAH INTER. AIRP            GSN     41196
AEM00041194  25.2550   55.3640   10.4    DUBAI INTL                             41194
AEM00041217  24.4330   54.6510   26.8    ABU DHABI INTL                         41217
AEM00041218  24.2620   55.6090  264.9    AL AIN INTL                            41218
AF000040930  35.3170   69.0170 3366.0    NORTH-SALANG                   GSN     40930
AFM00040938  34.2100   62.2280  977.2    HERAT                                  40938
AFM00040948  34.5660   69.2120 1791.3    KABUL INTL                             40948
AFM00040990  31.5000   65.8500 1010.0    KANDAHAR AIRPORT                       40990
AG000060390  36.7167    3.2500   24.0    ALGER-DAR EL BEIDA             GSN     60390
AG000060590  30.5667    2.8667  397.0    EL-GOLEA                       GSN     60590
AG000060611  28.0500    9.6331  561.0    IN-AMENAS                      GSN     60611
AG000060680  22.8000    5.4331 1362.0    TAMANRASSET                    GSN     60680
AGE00135039  35.7297    0.6500   50.0    ORAN-HOPITAL MILITAIRE                      
AGE00147704  36.9700    7.7900  161.0    ANNABA-CAP DE GARDE                         
AGE00147705  36.7800    3.0700   59.0    ALGIERS-VILLE/UNIVERSITE                    
AGE00147706  36.8000    3.0300  344.0    ALGIERS-BOUZAREAH                           
AGE00147707  36.8000    3.0400   38.0    ALGIERS-CAP CAXINE                          
AGE00147708  36.7200    4.0500  222.0    TIZI OUZOU                             60395
AGE00147709  36.6300    4.2000  942.0    FORT NATIONAL                               
AGE00147710  36.7500    5.1000    9.0    BEJAIA-BOUGIE (PORT)                   60401
AGE00147711  36.3697    6.6200  660.0    CONSTANTINE                                 
AGE00147712  36.1700    1.3400  112.0    ORLEANSVILLE (CHLEF)                        
AGE00147713  36.1800    5.4000 1081.0    SETIF                                       
AGE00147714  35.7700    0.8000   78.0    ORAN-CAP FALCON                             
AGE00147715  35.4200    8.1197  863.0    TEBESSA                                     
AGE00147716  35.1000   -1.8500   83.0    NEMOURS (GHAZAOUET)                    60517
AGE00147717  35.2000    0.6300  476.0    SIDI-BEL-ABBES                              
AGE00147718  34.8500    5.7200  125.0    BISKRA                                 60525
AGE00147719  33.7997    2.8900  767.0    LAGHOUAT                               60545
ghcnd-stations.txt from the Global Historical Climatology Network

To load this dataset, input the URL in quotes to the read_fwf function.

import pandas as pd
df = read_fwf("https://www.ncei.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt")
ACW0001160417.1167-61.783310.1ST JOHNS COOLIDGE FLDUnnamed: 5Unnamed: 6Unnamed: 7
0ACW0001164717.1333-61.783319.2ST JOHNSNaNNaNNaN
1AE00004119625.333055.517034.0SHARJAH INTER. AIRPNaNGSN41196.0
2AEM0004119425.255055.364010.4DUBAI INTLNaNNaN41194.0
3AEM0004121724.433054.651026.8ABU DHABI INTLNaNNaN41217.0
4AEM0004121824.262055.6090264.9AL AIN INTLNaNNaN41218.0
...........................
119497ZI00006796921.050029.3670861.0WEST NICHOLSONNaNNaN67969.0
119498ZI00006797520.067030.86701095.0MASVINGONaNNaN67975.0
119499ZI00006797721.017031.5830430.0BUFFALO RANGENaNNaN67977.0
119500ZI00006798320.200032.61601132.0CHIPINGENaNGSN67983.0
119501ZI00006799122.217030.0000457.0BEITBRIDGENaNNaN67991.0
Loading a fixed-width file from a URL

However, we see our first row of data has ended up as our column names. To fix this, we can use the names parameter of the read_fwf function. This parameter takes in a list of the column names.

We can find the name of each column in the guidebook in section IV.

Then we can do the same thing as before, now just with additional data.

import pandas as pd
df = read_fwf("https://www.ncei.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt",
              names=["ID", "LATITUDE", "LONGITUDE", "ELEVATION", "STATE", "NAME", "GSN FLAG", "HCN/CRN FLAG"])
IDLATITUDELONGITUDEELEVATIONSTATENAMEGSN FLAGHCN/CRN FLAG
0ACW0001160417.1167-61.783310.1ST JOHNS COOLIDGE FLDNaNNaNNaN
1ACW0001164717.1333-61.783319.2ST JOHNSNaNNaNNaN
2AE00004119625.333055.517034.0SHARJAH INTER. AIRPNaNGSN41196.0
3AEM0004119425.255055.364010.4DUBAI INTLNaNNaN41194.0
4AEM0004121724.433054.651026.8ABU DHABI INTLNaNNaN41217.0
...........................
119498ZI00006796921.050029.3670861.0WEST NICHOLSONNaNNaN67969.0
119499ZI00006797520.067030.86701095.0MASVINGONaNNaN67975.0
119500ZI00006797721.017031.5830430.0BUFFALO RANGENaNNaN67977.0
119501ZI00006798320.200032.61601132.0CHIPINGENaNGSN67983.0
119502ZI00006799122.217030.0000457.0BEITBRIDGENaNNaN67991.0
Loading a fixed-width file using column names

Reading in a fixed-width file from your local computer

Some datasets are not accessible online directly by Python and needs to be downloaded. To read in a fixed-width file from your computer, you often want to first make sure the file is in the same folder as the python file you are working with.

One fixed-width dataset is a dataset that contains information about the human protein sequences, and specifically about the gene sequence on chromosome 1. It can be found as part of UniProt's Swiss-Prot Protein Knowledgebase. The file begins with multiple lines of information at the top and bottom of the file that are not included in the dataset that provide additional information to the user:

----------------------------------------------------------------------------
        UniProt - Swiss-Prot Protein Knowledgebase
        SIB Swiss Institute of Bioinformatics; Geneva, Switzerland
        European Bioinformatics Institute (EBI); Hinxton, United Kingdom
        Protein Information Resource (PIR); Washington DC, USA
----------------------------------------------------------------------------

Description: Human chromosome 1: entries, gene names and
             cross-references to MIM
Name:        humchr01.txt
Release:     2022_02 of 25-May-2022

----------------------------------------------------------------------------

This documents lists all the human protein sequence entries whose genes
are known to be encoded on chromosome 1 in this release of UniProtKB/Swiss-Prot.

Number of UniProtKB/Swiss-Prot entries encoded on chromosome 1: 2066

Reference for the chromosome sequence:
Nature 441:315-321(2006).
PubMed=16710414; DOI=10.1038/nature04727;

For more information on chromosome 1 see:

Sanger  : http://www.sanger.ac.uk/about/history/hgp/chr1.html
Ensembl : https://www.ensembl.org/Homo_sapiens/mapview?chr=1
NCBI    : https://www.ncbi.nlm.nih.gov/mapview/maps.cgi?taxid=9606&chr=1
OMIM    : https://www.ncbi.nlm.nih.gov/Omim/getmap.cgi?chromosome=1
DOE     : http://www.ornl.gov/sci/techresources/Human_Genome/launchpad/chrom01.shtml


______________ _______________ ______________________ ______ ______________________
Gene           Chromosomal     Swiss-Prot             MIM    Description
name           position        AC        Entry name   code
______________ _______________ ______________________ ______ ______________________
A3GALT2       1p35.1          U3KPV4     A3LT2_HUMAN        Alpha-1,3-galactosyltransferase 2 (EC 2.4.1.87) (Isoglobotriaosylceramide synthase) (iGb3 synthase) (iGb3S) [A3GALT2P] [IGBS3S]
AADACL3       1p36.21         Q5VUY0     ADCL3_HUMAN        Arylacetamide deacetylase-like 3 (EC 3.1.1.-)
AADACL4       1p36.21         Q5VUY2     ADCL4_HUMAN        Arylacetamide deacetylase-like 4 (EC 3.1.1.-)
ABCA4         1p21-p22.1      P78363     ABCA4_HUMAN 601691 Retinal-specific phospholipid-transporting ATPase ABCA4 (EC 7.6.2.1) (ATP-binding cassette sub-family A member 4) (RIM ABC transporter) (RIM proteinv) (RmP) (Retinal-specific ATP-binding cassette transporter) (Stargardt disease protein) [ABCR]
ABCB10        1q42            Q9NRK6     ABCBA_HUMAN 605454 ATP-binding cassette sub-family B member 10, mitochondrial precursor (ABC-mitochondrial erythroid protein) (ABC-me protein) (ATP-binding cassette transporter 10) (ABC transporter 10 protein) (Mitochondrial ATP-binding cassette 2) (M-ABC2)ZSWIM5        1p34.1          Q9P217     ZSWM5_HUMAN        Zinc finger SWIM domain-containing protein 5 [KIAA1511]
ZYG11A        1p32.3          Q6WRX3     ZY11A_HUMAN 618675 Protein zyg-11 homolog A [ZYG11]
ZYG11B        1p32.3          Q9C0D3     ZY11B_HUMAN 618673 Protein zyg-11 homolog B [KIAA1730]
ZZZ3          1p31.1          Q8IYH5     ZZZ3_HUMAN         ZZ-type zinc finger-containing protein 3
-             1p32.1          Q0P140     YA037_HUMAN        Putative uncharacterized protein HSD52 [HSD52]
-             1p34.3          Q6XCG6     YA011_HUMAN        Putative uncharacterized protein PP632 [PP632]
-             1p34.3          Q7L0L9     YA043_HUMAN        Transmembrane protein LOC653160
-             1p36.13         P0DMU3     F231L_HUMAN        FAM231A/C-like protein LOC102723383
-             1q21.3          A8MUI8     YA034_HUMAN        Putative UPF0607 protein ENSP00000383783

--------------------------------------------------------------------------------
Copyrighted by the UniProt Consortium, see https://www.uniprot.org/terms
Distributed under the Creative Commons Attribution (CC BY 4.0) License
--------------------------------------------------------------------------------
https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/docs/humchr01.txt

There's 35 lines at the top we need to skip and 5 at the bottom. To do this, we will use the skiprows and skipfooter parameters in the read_fwf function. That will look like this:

import pandas as pd
df = read_fwf('humchr01.txt', skiprows=35, skipfooter=5)
A3GALT21p35.1U3KPV4A3LT2_HUMANUnnamed: 4Alpha-1,3-galactosyltransferase 2 (EC 2.4.1.87) (Isoglobotriaosylceramide synthase) (iGb3 synthase) (iGb3S) [A3GALT2P] [IGBS3S]Unnamed: 6Unnamed: 7
0AADACL31p36.21Q5VUY0ADCL3_HUMANNaNArylacetamide deacetylase-like 3 (EC 3.1.1.-)NaNNaN
1AADACL41p36.21Q5VUY2ADCL4_HUMANNaNArylacetamide deacetylase-like 4 (EC 3.1.1.-)NaNNaN
2ABCA41p21-p22.1P78363ABCA4_HUMAN601691.0Retinal-specific phospholipid-transporting ATP...NaNNaN
3ABCB101q42Q9NRK6ABCBA_HUMAN605454.0ATP-binding cassette sub-family B member 10, m...NaNNaN
4ABCD31p21-p22P28288ABCD3_HUMAN170995.0ATP-binding cassette sub-family D member 3 (EC...NaNNaN
...........................
2065-1p32.1Q0P140YA037_HUMANNaNPutative uncharacterized protein HSD52 [HSD52]NaNNaN
2066-1p34.3Q6XCG6YA011_HUMANNaNPutative uncharacterized protein PP632 [PP632]NaNNaN
2067-1p34.3Q7L0L9YA043_HUMANNaNTransmembrane protein LOC653160NaNNaN
2068-1p36.13P0DMU3F231L_HUMANNaNFAM231A/C-like protein LOC102723383NaNNaN
2069-1q21.3A8MUI8YA034_HUMANNaNPutative UPF0607 protein ENSP00000383783NaNNaN
Loading a fixed-width file with skipped rows

We again see the first row of data has been used as column names. Again, we will use the names parameter to provide the function with a list of column names.

However this time, when we use the names parameter, it makes the first two fields into the index column. We don't want this. So instead, we will set the index_col parameter to False to specify that we haven't included a column for the index, so the dataframe must generate one itself.

Putting it all together, our program will look like this:

import pandas as pd
df = read_fwf('humchr01.txt', skiprows=35, skipfooter=5, index_col=False,
              names=['gene_name', 'chromosomal_position', 'uniprot', 'entry_name', 'mtm_code', 'description'])
gene_namechromosomal_positionuniprotentry_namemtm_codedescription
0A3GALT21p35.1U3KPV4A3LT2_HUMANNaNAlpha-1,3-galactosyltransferase 2 (EC 2.4.1.87...
1AADACL31p36.21Q5VUY0ADCL3_HUMANNaNArylacetamide deacetylase-like 3 (EC 3.1.1.-)
2AADACL41p36.21Q5VUY2ADCL4_HUMANNaNArylacetamide deacetylase-like 4 (EC 3.1.1.-)
3ABCA41p21-p22.1P78363ABCA4_HUMAN601691.0Retinal-specific phospholipid-transporting ATP...
4ABCB101q42Q9NRK6ABCBA_HUMAN605454.0ATP-binding cassette sub-family B member 10, m...
.....................
2066-1p32.1Q0P140YA037_HUMANNaNPutative uncharacterized protein HSD52 [HSD52]
2067-1p34.3Q6XCG6YA011_HUMANNaNPutative uncharacterized protein PP632 [PP632]
2068-1p34.3Q7L0L9YA043_HUMANNaNTransmembrane protein LOC653160
2069-1p36.13P0DMU3F231L_HUMANNaNFAM231A/C-like protein LOC102723383
2070-1q21.3A8MUI8YA034_HUMANNaNPutative UPF0607 protein ENSP00000383783
Loading a fixed-width file with formatting

The dataset for this section and suggestions on formatting can be found in this article by Amy Rask.

Pandas Documentation

The full documentation for read_fwf is available in the pandas documentation.