Creating a DataFrame from a Fixed-Width File using Pandas
Some datasets are provided in a fixed-width file format (common extension is .txt
, but includes many others as well). The pd.read_fwf
function provides the functionality to read fixed-width file formats.
Fixed-width files are those whose format is specified through the width of the columns, where each column will always have a certain number of characters.
Reading in a fixed-width file from a URL
With many publicly available datasets, you can access the dataset with a URL. One example of this is a dataset from the National Oceanic and Atmospheric Administration that gives the location of every station that is a part of the Global Historical Climatology Network, which collects climate data. The dataset contains a row for every weather stations in a fixed-width format:
To load this dataset, input the URL in quotes to the read_fwf
function.
import pandas as pd
df = read_fwf("https://www.ncei.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt")
ACW00011604 | 17.1167 | -61.7833 | 10.1 | ST JOHNS COOLIDGE FLD | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | |
---|---|---|---|---|---|---|---|---|
0 | ACW00011647 | 17.1333 | -61.7833 | 19.2 | ST JOHNS | NaN | NaN | NaN |
1 | AE000041196 | 25.3330 | 55.5170 | 34.0 | SHARJAH INTER. AIRP | NaN | GSN | 41196.0 |
2 | AEM00041194 | 25.2550 | 55.3640 | 10.4 | DUBAI INTL | NaN | NaN | 41194.0 |
3 | AEM00041217 | 24.4330 | 54.6510 | 26.8 | ABU DHABI INTL | NaN | NaN | 41217.0 |
4 | AEM00041218 | 24.2620 | 55.6090 | 264.9 | AL AIN INTL | NaN | NaN | 41218.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
119497 | ZI000067969 | 21.0500 | 29.3670 | 861.0 | WEST NICHOLSON | NaN | NaN | 67969.0 |
119498 | ZI000067975 | 20.0670 | 30.8670 | 1095.0 | MASVINGO | NaN | NaN | 67975.0 |
119499 | ZI000067977 | 21.0170 | 31.5830 | 430.0 | BUFFALO RANGE | NaN | NaN | 67977.0 |
119500 | ZI000067983 | 20.2000 | 32.6160 | 1132.0 | CHIPINGE | NaN | GSN | 67983.0 |
119501 | ZI000067991 | 22.2170 | 30.0000 | 457.0 | BEITBRIDGE | NaN | NaN | 67991.0 |
However, we see our first row of data has ended up as our column names. To fix this, we can use the names
parameter of the read_fwf
function. This parameter takes in a list of the column names.
We can find the name of each column in the guidebook in section IV.
Then we can do the same thing as before, now just with additional data.
import pandas as pd
df = read_fwf("https://www.ncei.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt",
names=["ID", "LATITUDE", "LONGITUDE", "ELEVATION", "STATE", "NAME", "GSN FLAG", "HCN/CRN FLAG"])
ID | LATITUDE | LONGITUDE | ELEVATION | STATE | NAME | GSN FLAG | HCN/CRN FLAG | |
---|---|---|---|---|---|---|---|---|
0 | ACW00011604 | 17.1167 | -61.7833 | 10.1 | ST JOHNS COOLIDGE FLD | NaN | NaN | NaN |
1 | ACW00011647 | 17.1333 | -61.7833 | 19.2 | ST JOHNS | NaN | NaN | NaN |
2 | AE000041196 | 25.3330 | 55.5170 | 34.0 | SHARJAH INTER. AIRP | NaN | GSN | 41196.0 |
3 | AEM00041194 | 25.2550 | 55.3640 | 10.4 | DUBAI INTL | NaN | NaN | 41194.0 |
4 | AEM00041217 | 24.4330 | 54.6510 | 26.8 | ABU DHABI INTL | NaN | NaN | 41217.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
119498 | ZI000067969 | 21.0500 | 29.3670 | 861.0 | WEST NICHOLSON | NaN | NaN | 67969.0 |
119499 | ZI000067975 | 20.0670 | 30.8670 | 1095.0 | MASVINGO | NaN | NaN | 67975.0 |
119500 | ZI000067977 | 21.0170 | 31.5830 | 430.0 | BUFFALO RANGE | NaN | NaN | 67977.0 |
119501 | ZI000067983 | 20.2000 | 32.6160 | 1132.0 | CHIPINGE | NaN | GSN | 67983.0 |
119502 | ZI000067991 | 22.2170 | 30.0000 | 457.0 | BEITBRIDGE | NaN | NaN | 67991.0 |
Reading in a fixed-width file from your local computer
Some datasets are not accessible online directly by Python and needs to be downloaded. To read in a fixed-width file from your computer, you often want to first make sure the file is in the same folder as the python file you are working with.
One fixed-width dataset is a dataset that contains information about the human protein sequences, and specifically about the gene sequence on chromosome 1. It can be found as part of UniProt's Swiss-Prot Protein Knowledgebase. The file begins with multiple lines of information at the top and bottom of the file that are not included in the dataset that provide additional information to the user:
---------------------------------------------------------------------------- UniProt - Swiss-Prot Protein Knowledgebase SIB Swiss Institute of Bioinformatics; Geneva, Switzerland European Bioinformatics Institute (EBI); Hinxton, United Kingdom Protein Information Resource (PIR); Washington DC, USA ---------------------------------------------------------------------------- Description: Human chromosome 1: entries, gene names and cross-references to MIM Name: humchr01.txt Release: 2022_02 of 25-May-2022 ---------------------------------------------------------------------------- This documents lists all the human protein sequence entries whose genes are known to be encoded on chromosome 1 in this release of UniProtKB/Swiss-Prot. Number of UniProtKB/Swiss-Prot entries encoded on chromosome 1: 2066 Reference for the chromosome sequence: Nature 441:315-321(2006). PubMed=16710414; DOI=10.1038/nature04727; For more information on chromosome 1 see: Sanger : http://www.sanger.ac.uk/about/history/hgp/chr1.html Ensembl : https://www.ensembl.org/Homo_sapiens/mapview?chr=1 NCBI : https://www.ncbi.nlm.nih.gov/mapview/maps.cgi?taxid=9606&chr=1 OMIM : https://www.ncbi.nlm.nih.gov/Omim/getmap.cgi?chromosome=1 DOE : http://www.ornl.gov/sci/techresources/Human_Genome/launchpad/chrom01.shtml ______________ _______________ ______________________ ______ ______________________ Gene Chromosomal Swiss-Prot MIM Description name position AC Entry name code ______________ _______________ ______________________ ______ ______________________ A3GALT2 1p35.1 U3KPV4 A3LT2_HUMAN Alpha-1,3-galactosyltransferase 2 (EC 2.4.1.87) (Isoglobotriaosylceramide synthase) (iGb3 synthase) (iGb3S) [A3GALT2P] [IGBS3S] AADACL3 1p36.21 Q5VUY0 ADCL3_HUMAN Arylacetamide deacetylase-like 3 (EC 3.1.1.-) AADACL4 1p36.21 Q5VUY2 ADCL4_HUMAN Arylacetamide deacetylase-like 4 (EC 3.1.1.-) ABCA4 1p21-p22.1 P78363 ABCA4_HUMAN 601691 Retinal-specific phospholipid-transporting ATPase ABCA4 (EC 7.6.2.1) (ATP-binding cassette sub-family A member 4) (RIM ABC transporter) (RIM proteinv) (RmP) (Retinal-specific ATP-binding cassette transporter) (Stargardt disease protein) [ABCR] ABCB10 1q42 Q9NRK6 ABCBA_HUMAN 605454 ATP-binding cassette sub-family B member 10, mitochondrial precursor (ABC-mitochondrial erythroid protein) (ABC-me protein) (ATP-binding cassette transporter 10) (ABC transporter 10 protein) (Mitochondrial ATP-binding cassette 2) (M-ABC2)[...] See the full dataset at https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/docs/humchr01.txt [...]ZSWIM5 1p34.1 Q9P217 ZSWM5_HUMAN Zinc finger SWIM domain-containing protein 5 [KIAA1511] ZYG11A 1p32.3 Q6WRX3 ZY11A_HUMAN 618675 Protein zyg-11 homolog A [ZYG11] ZYG11B 1p32.3 Q9C0D3 ZY11B_HUMAN 618673 Protein zyg-11 homolog B [KIAA1730] ZZZ3 1p31.1 Q8IYH5 ZZZ3_HUMAN ZZ-type zinc finger-containing protein 3 - 1p32.1 Q0P140 YA037_HUMAN Putative uncharacterized protein HSD52 [HSD52] - 1p34.3 Q6XCG6 YA011_HUMAN Putative uncharacterized protein PP632 [PP632] - 1p34.3 Q7L0L9 YA043_HUMAN Transmembrane protein LOC653160 - 1p36.13 P0DMU3 F231L_HUMAN FAM231A/C-like protein LOC102723383 - 1q21.3 A8MUI8 YA034_HUMAN Putative UPF0607 protein ENSP00000383783 -------------------------------------------------------------------------------- Copyrighted by the UniProt Consortium, see https://www.uniprot.org/terms Distributed under the Creative Commons Attribution (CC BY 4.0) License --------------------------------------------------------------------------------
There's 35 lines at the top we need to skip and 5 at the bottom. To do this, we will use the skiprows
and skipfooter
parameters in the read_fwf
function. That will look like this:
import pandas as pd
df = read_fwf('humchr01.txt', skiprows=35, skipfooter=5)
A3GALT2 | 1p35.1 | U3KPV4 | A3LT2_HUMAN | Unnamed: 4 | Alpha-1,3-galactosyltransferase 2 (EC 2.4.1.87) (Isoglobotriaosylceramide synthase) (iGb3 synthase) (iGb3S) [A3GALT2P] [IGBS3S] | Unnamed: 6 | Unnamed: 7 | |
---|---|---|---|---|---|---|---|---|
0 | AADACL3 | 1p36.21 | Q5VUY0 | ADCL3_HUMAN | NaN | Arylacetamide deacetylase-like 3 (EC 3.1.1.-) | NaN | NaN |
1 | AADACL4 | 1p36.21 | Q5VUY2 | ADCL4_HUMAN | NaN | Arylacetamide deacetylase-like 4 (EC 3.1.1.-) | NaN | NaN |
2 | ABCA4 | 1p21-p22.1 | P78363 | ABCA4_HUMAN | 601691.0 | Retinal-specific phospholipid-transporting ATP... | NaN | NaN |
3 | ABCB10 | 1q42 | Q9NRK6 | ABCBA_HUMAN | 605454.0 | ATP-binding cassette sub-family B member 10, m... | NaN | NaN |
4 | ABCD3 | 1p21-p22 | P28288 | ABCD3_HUMAN | 170995.0 | ATP-binding cassette sub-family D member 3 (EC... | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... |
2065 | - | 1p32.1 | Q0P140 | YA037_HUMAN | NaN | Putative uncharacterized protein HSD52 [HSD52] | NaN | NaN |
2066 | - | 1p34.3 | Q6XCG6 | YA011_HUMAN | NaN | Putative uncharacterized protein PP632 [PP632] | NaN | NaN |
2067 | - | 1p34.3 | Q7L0L9 | YA043_HUMAN | NaN | Transmembrane protein LOC653160 | NaN | NaN |
2068 | - | 1p36.13 | P0DMU3 | F231L_HUMAN | NaN | FAM231A/C-like protein LOC102723383 | NaN | NaN |
2069 | - | 1q21.3 | A8MUI8 | YA034_HUMAN | NaN | Putative UPF0607 protein ENSP00000383783 | NaN | NaN |
We again see the first row of data has been used as column names. Again, we will use the names
parameter to provide the function with a list of column names.
However this time, when we use the names parameter, it makes the first two fields into the index column. We don't want this. So instead, we will set the index_col
parameter to False
to specify that we haven't included a column for the index, so the dataframe must generate one itself.
Putting it all together, our program will look like this:
import pandas as pd
df = read_fwf('humchr01.txt', skiprows=35, skipfooter=5, index_col=False,
names=['gene_name', 'chromosomal_position', 'uniprot', 'entry_name', 'mtm_code', 'description'])
gene_name | chromosomal_position | uniprot | entry_name | mtm_code | description | |
---|---|---|---|---|---|---|
0 | A3GALT2 | 1p35.1 | U3KPV4 | A3LT2_HUMAN | NaN | Alpha-1,3-galactosyltransferase 2 (EC 2.4.1.87... |
1 | AADACL3 | 1p36.21 | Q5VUY0 | ADCL3_HUMAN | NaN | Arylacetamide deacetylase-like 3 (EC 3.1.1.-) |
2 | AADACL4 | 1p36.21 | Q5VUY2 | ADCL4_HUMAN | NaN | Arylacetamide deacetylase-like 4 (EC 3.1.1.-) |
3 | ABCA4 | 1p21-p22.1 | P78363 | ABCA4_HUMAN | 601691.0 | Retinal-specific phospholipid-transporting ATP... |
4 | ABCB10 | 1q42 | Q9NRK6 | ABCBA_HUMAN | 605454.0 | ATP-binding cassette sub-family B member 10, m... |
... | ... | ... | ... | ... | ... | ... |
2066 | - | 1p32.1 | Q0P140 | YA037_HUMAN | NaN | Putative uncharacterized protein HSD52 [HSD52] |
2067 | - | 1p34.3 | Q6XCG6 | YA011_HUMAN | NaN | Putative uncharacterized protein PP632 [PP632] |
2068 | - | 1p34.3 | Q7L0L9 | YA043_HUMAN | NaN | Transmembrane protein LOC653160 |
2069 | - | 1p36.13 | P0DMU3 | F231L_HUMAN | NaN | FAM231A/C-like protein LOC102723383 |
2070 | - | 1q21.3 | A8MUI8 | YA034_HUMAN | NaN | Putative UPF0607 protein ENSP00000383783 |
The dataset for this section and suggestions on formatting can be found in this article by Amy Rask.
Pandas Documentation
The full documentation for read_fwf
is available in the pandas documentation.