R - Bypassing "ghost" line break or end of file (EOF) in data.table::fread
I'm loading several large, tab-delimited text files, exported from a database (not accessible to me), into R using data.table::fread. fread handles most of the files with great ease and speed, but one of the files is generating a regularly reported fread error:

    Error in fread(read_problem, encoding = "UTF-8", na.strings = "", header = TRUE, :
      Expected sep (' ') but new line or EOF ends field ...

A smaller (2000 rows) version of the file containing the offending line is available here (RDS file).

Here's how I've tried to diagnose the problem up to this point:

    library(data.table) # I'm using the 1.9.7 development version (same error with 1.9.6)
    read_problem <- readRDS("read_problem.rds")
    error <- fread(read_problem, encoding = "UTF-8", na.strings = "",
                   header = TRUE, sep = "\t",
                   colClasses = rep("character", 44), # for simplicity
                   verbose = TRUE)

If I excise the offending line, the problem disappears:

    cat(read_problem, file = "temp")
    string_vec <- readLines("temp")
    clipped_vec <- string_vec[-1027] # get rid of problem line 1027
    restored <- paste(clipped_vec, collapse = "\n")
    noerror <- fread(restored, encoding = "UTF-8", na.strings = "",
                     header = TRUE, sep = "\t",
                     colClasses = rep("character", 44)) # for simplicity
    class(noerror)
    [1] "data.table" "data.frame"
    dim(noerror)
    [1] 1999 44

The error message seems clear enough: fread is looking for "\t" but finding something else in its place.
But I find nothing obvious when I look closer at the offending line relative to the lines around it.

The number of tab characters is the same:

    sapply(gregexpr("\t", string_vec[1026:1028]), length)
    [1] 43 43 43

The line break information seems identical:

    unlist(gregexpr("\n", string_vec[1026:1028]))
    [1] -1 -1 -1

Here's a look at the offending line itself as a string:

    string_vec[1027]
    [1] "urn:cornelllabofornithology:ebird:obs132960387\t29816\tspecies\tnelson's sparrow\tammodramus nelsoni\t\t\t1\t\t\tunited states\tus\tgeorgia\tus-ga\tglynn\tus-ga-127\tus-ga_3181\t\t\tjekyll island\tl140461\th\t31.0464993\t-81.4113007\t1990-11-03\t13:15:00\t\"jekyll island , causeway. partly cloudy, mild, ne wind 8-15 mph. note: did little birding in upland habitats time available rather brief.\" data entered on behalf of paul sykes alison huff (arhuff@uga.edu) on 12-15-11.\tlisted on old georgia field checklist \"sparrow, sharp-tailed.\"\tobsr289931\tpaul\tsykes\ts9336358\tebird - traveling count\tebird\t270\t8.047\t\t1\t1\t\t1\t0\t\t"

Any advice on how to get around this problem without manually extracting the offending lines?
With this commit, this is now fixed in v1.9.7, the current development version. The next stable release should therefore be able to read this file properly using quote="".

    require(data.table) # v1.9.7+
    fread('"abcd efgh." ijkl.\tmnop "qrst uvwx."\t45\n', quote="")
    #                    V1                V2 V3
    # 1: "abcd efgh." ijkl. mnop "qrst uvwx." 45

On the 1027th line, at the end of "sparrow, sharp-tailed." there's only 1 tab. In the other lines, after that field, there are 2 tabs before the "obsr[0-9]" field starts.
The number of tabs seems to match because, on line 1027, there's a tab before "listed on old georgia field" instead of a space.

Therefore line 1027 gets only 43 columns instead of 44. That seems to be the issue.

Looking at it again, it seems that listed on old georgia field checklist "sparrow, sharp-tailed." should be read as a separate column, but it is instead being read together with the previous column...
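One way to spot rows like this ahead of time is to look for a double quote that sits in the middle of a field, that is, with a non-tab character on both sides of it. The following is only a rough sketch in base R: `sample_lines` is a made-up three-line stand-in for `string_vec`, and the pattern is a heuristic I'm assuming is good enough for tab-delimited data, not something fread itself uses.

```r
# Hypothetical stand-in for string_vec; the third line has an embedded quote
sample_lines <- c(
  "id\tcomment\tcount",
  "1\tplain text\t10",
  "2\t\"quoted start.\" trailing text\t20"
)

# Count the double-quote characters on each line (0 when there are none)
n_quotes <- lengths(regmatches(sample_lines,
                               gregexpr("\"", sample_lines, fixed = TRUE)))

# Flag lines where a quote has a non-tab character on both sides,
# i.e. the quote is neither opening nor closing a whole field
embedded <- grepl("[^\t]\"[^\t]", sample_lines)

which(embedded)
```

On the real data, running the same `grepl()` call on `string_vec` should point at line 1027, assuming no other lines carry the same kind of embedded quote.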
Here's a smaller reproducible example:

    # note that there are only 2 instead of 3 columns
    fread('"abcd efgh." ijkl.\tmnop "qrst uvwx."\t45\n')
    #                                     V1 V2
    # 1: abcd efgh." ijkl.\tmnop "qrst uvwx. 45

    # add a header row and it returns the same error
    fread('a\tb\tc\n"abcd efgh." ijkl.\tmnop "qrst uvwx."\t45\n')
    # Error in fread("a\tb\tc\n\"abcd efgh.\" ijkl.\tmnop \"qrst uvwx.\"\t45\n") :
    #   Expected sep (' ') but new line, EOF (or other non printing character)
    #   ends field 1 when detecting types ( first): "abcd efgh." ijkl. mnop
    #   "qrst uvwx." 45

Filed as #1367.
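If you're stuck on a version without the fix, a possible fallback is base R's `read.delim()` with `quote = ""`, which treats double quotes as ordinary characters. This is a sketch rather than a tested recipe for the full file: it reuses the small example string from above, and I'm assuming the real files are small enough that base R's slower reader is tolerable.

```r
# Same malformed line as in the small example, with a header row
txt <- 'a\tb\tc\n"abcd efgh." ijkl.\tmnop "qrst uvwx."\t45\n'

# quote = "" disables quote handling, so only tabs split the fields
df <- read.delim(text = txt, sep = "\t", quote = "",
                 colClasses = "character", stringsAsFactors = FALSE)

dim(df)  # rows and columns actually parsed
df$a     # first field, with its quotes kept as literal characters
```

The result can be passed to `data.table::as.data.table()` if the rest of the pipeline expects a data.table.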