Description
Code Sample, a copy-pastable example if possible
1 1 -13.120080 0.229 0.484 -0.378 -0.872
1 2 -1.902843 -0.090 0.256 1.791 0.967
1 3 -22.050698 -0.176 -0.394 0.922 -0.454
1 4 -30.349928 0.081 -0.194 -0.327 -0.981
1 5 -22.204160 -0.168 -0.197 0.984 -0.266
1 6 -28.001753 -0.065 0.597 -0.203 -0.802
1 7 -17.247524 0.108 0.194 0.474 0.774
1 8 -28.014811 0.017 0.994 0.493 0.112
1 9 -13.325491 0.259 0.189 -1.275 0.149
1 10 -10.063621 0.327 0.108 -1.784 0.061
...
115 18 5.697000 0.391 -0.027 0.252 1.000
115 19 8.324000 -0.283 0.132 0.227 -0.216
115 20 48.451000 0.070 -0.041 0.379 -0.082
115 21 0.146000 0.677 0.031 -0.561 -0.149
115 22 1.443000 -0.706 -0.033 -0.222 0.035
115 23 4.595000 0.654 -0.081 0.774 0.997
115 24 0.146000 -0.677 0.031 0.561 -0.149
115 25 4.595000 0.654 -0.081 0.774 0.997
115 26 6.769000 -0.363 -0.093 -0.298 0.996
115 27 24.157000 -0.280 -0.324 -0.142 -0.946
Problem description
I have a long fixed-width file (>100k lines) whose head and tail are shown above. I want to read this file with pandas, and I figure pd.read_fwf is the way to do this. The issue comes up because it reads the first hundred lines, which start with ' 1', and concludes "let's start reading at [2]", whereas the last hundred lines start with 115, so it skips the initial 11 and starts the line with 5, and I lose data.
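The truncation described above can be reproduced without pandas. The following is a minimal sketch of whitespace-based column inference, not pandas's actual implementation. The helper names (`infer_colspecs`, `split_row`) and the exact padding of the sample rows are assumptions made for illustration, since the whitespace in the listing above was collapsed.

```python
import re


def infer_colspecs(rows):
    """Return (start, end) spans of positions that are non-blank in any row."""
    width = max(len(row) for row in rows)
    used = [False] * width
    for row in rows:
        # Mark every character position covered by a non-whitespace run.
        for match in re.finditer(r"[^ \t]+", row):
            for i in range(match.start(), match.end()):
                used[i] = True
    specs, start = [], None
    # The trailing False closes a span that runs to the end of the line.
    for i, flag in enumerate(used + [False]):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            specs.append((start, i))
            start = None
    return specs


def split_row(row, specs):
    """Slice one row according to the given column spans."""
    return [row[a:b].strip() for a, b in specs]


# Hypothetical rows: a right-aligned 3-character first field, as in the report.
head = ["  1   1 -13.120080", "  1   2  -1.902843"]
tail = ["115  26   6.769000", "115  27  24.157000"]

specs_from_head = infer_colspecs(head)         # sees only the narrow values
specs_from_all = infer_colspecs(head + tail)   # sees the 3-digit values too

print(split_row(tail[0], specs_from_head))  # ['5', '6', '6.769000']
print(split_row(tail[0], specs_from_all))   # ['115', '26', '6.769000']
```

Because the sampled rows only ever use the last character of the first field, the inferred span starts at that character, and wider values further down are cut.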
A couple of approaches to solving this issue come to mind, though I'm sure there are others:
- Don't infer until all lines are scanned
- Take as an argument the number of lines to be scanned before concluding the format, including the option to scan all (e.g. infer_from_all)
- Take as an argument which direction to scan: top to bottom or bottom to top
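Separately from the proposals above, the inference step can be bypassed by passing the column boundaries to `pd.read_fwf` through its `colspecs` argument. The sketch below uses a hypothetical four-line sample, and the padding and column names are assumptions. It also shows the `infer_nrows` argument, which is assumed to be available only in newer pandas releases and is likely missing from the version reported in this issue.

```python
import io

import pandas as pd

# Hypothetical sample: the real file's exact padding is not known here.
sample = (
    "  1   1 -13.120080\n"
    "  1   2  -1.902843\n"
    "115  26   6.769000\n"
    "115  27  24.157000\n"
)

# Option 1: give explicit half-open (start, end) character spans,
# so nothing is inferred from the first rows.
colspecs = [(0, 3), (3, 7), (7, 18)]
df_explicit = pd.read_fwf(
    io.StringIO(sample),
    colspecs=colspecs,
    header=None,
    names=["a", "b", "value"],  # illustrative column names
)

# Option 2 (assumes a pandas version that provides infer_nrows):
# widen the inference window to cover every line of the sample.
n_lines = len(sample.splitlines())
df_inferred = pd.read_fwf(io.StringIO(sample), header=None, infer_nrows=n_lines)

print(df_explicit)
print(df_inferred)
```

Explicit `colspecs` costs nothing extra at read time, whereas inferring from every line of a >100k-line file means scanning the whole file before parsing it.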
Output of pd.show_versions()
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-107-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None