Originally posted by idealord
Writing parsers for inconsistent data is impossible, or at best problematic. Plus, when you add characters that are meaningless in the middle of the records - those » characters aren't ASCII, they're who knows what - it just becomes a mess. I was lucky to be able to separate city and state as often as I did.
I'm afraid at some point it'll come down to volunteers hand-fixing the data (which is why I kept the processed raw on the side).
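For what it's worth, those mystery separator characters can be identified by dumping the raw bytes. A minimal Python sketch (the record below is made up, modeled on the samples posted later in this thread):

```python
# Hypothetical raw record modeled on the thread's samples; the " » "
# separator is UTF-8 bytes 0xC2 0xBB, the "…" is 0xE2 0x80 0xA6.
raw = b"7199601|\xe2\x80\xa6 1968 \xc2\xbb July \xc2\xbb Brooklyn, New York \xc2\xbb Page 132"

# List every byte value outside the 7-bit ASCII range:
non_ascii = sorted({b for b in raw if b > 0x7F})
print([hex(b) for b in non_ascii])  # ['0x80', '0xa6', '0xbb', '0xc2', '0xe2']

# Decoded as UTF-8, the separators become visible characters:
print(raw.decode("utf-8"))
```

So the "garbage" is just UTF-8 multi-byte sequences, which is why a parser expecting plain ASCII chokes on them.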
Originally posted by idealord
All done!
parnasse.com...
Google wouldn't let me have a spreadsheet over 50,000 cells, EditGrid wouldn't let me upload a spreadsheet over 2MB, so here's an XLS file from my site zipped.
I fixed ALL the parsing errors by taking the raw data and replacing the bad binary characters with | symbols and then loading it into a spreadsheet program. The file also contains the raw data. I've lost the City/State separation, but because the original data was already non-normalized (read loooooose) we have complete consistency now. It's saved as Excel 2007 because Excel XP couldn't handle over 65,000 rows. I can save it out as pretty much anything now...
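The replacement step described above can be sketched in a few lines of Python. This is an assumption-laden illustration (the byte patterns and the sample record are inferred from the samples posted in this thread), not idealord's actual code:

```python
# Hypothetical raw record; the multi-byte separators get collapsed to a
# single "|" delimiter so a spreadsheet can split the columns cleanly.
raw = b"7199601|\xe2\x80\xa6 1968 \xc2\xbb July \xc2\xbb Brooklyn, New York \xc2\xbb Page 132"

cleaned = raw.replace(b" \xc2\xbb ", b"|")          # " » "  -> "|"
cleaned = cleaned.replace(b"|\xe2\x80\xa6 ", b"|")  # "|… " -> "|"
print(cleaned.decode("ascii"))  # 7199601|1968|July|Brooklyn, New York|Page 132
```

Using "|" as the delimiter also sidesteps the embedded commas in place names like "Brooklyn, New York".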
Originally posted by Xtraeme
Just want to give a quick update. I've got some bugs that I'm still trying to work out of the system, but progress is good.
Originally posted by idealord
As far as the sorting stuff, that's a typical sorting error because it's an alphabetical sort. Make sure you're sorting numerically - of course you'll have to get rid of the Page word. I can get rid of that easily with a search and replace and it should sort numerically.
As far as the dupes, I can run it again - heh - and produce new redundancies... sometime.
Do you guys want the 1024x1024 images?
Issac, you can credit the database to my real name (which many people know already) - Jeff Harrington.
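The numeric-sort fix described above can be sketched in Python (the sample page values here are hypothetical):

```python
# An alphabetical sort misorders page references because it compares
# character by character, so "Page 132" sorts before "Page 21".
pages = ["Page 132", "Page 7453", "Page 21", "Page 7"]
print(sorted(pages))  # ['Page 132', 'Page 21', 'Page 7', 'Page 7453']

# Stripping the word "Page " and converting to int restores the
# intended numeric order:
nums = sorted(int(p.replace("Page ", "")) for p in pages)
print(nums)  # [7, 21, 132, 7453]
```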
Originally posted by kkrattiger
Here is the link.
Project Blue Book Smithsonian Astrophysical Observatory report from LC 1958
Originally posted by idealord
All done!
parnasse.com...
...
It's saved as Excel 2007 because Excel XP couldn't handle over 65,000 rows. I can save it out as pretty much anything now...
Originally posted by idealord
I can split it into pieces if you want, also...
Originally posted by idealord
Problem is you won't be able to globally sort them if you have 2 different spreadsheets and that'll limit your understanding of the pagination... what are you going to combine them into?
Originally posted by idealord
All done!
parnasse.com...
...
It's saved as Excel 2007 because Excel XP couldn't handle over 65,000 rows. I can save it out as pretty much anything now...
Sample from txt file:
"11443600|… [BLANK] » [BLANK] » [BLANK] » Page 7453
7199601|… 1968 » July » Brooklyn, New York » Page 132
11446642|"
If the file is an "xls" then there's no way he can have more than 65,536 rows - that's the hard worksheet limit of the old Excel format.
Originally posted by IsaacKoi
I've watched the video below on converting files from the old xls format (used in your zip file) to the new Excel 2007 xlsx format, but the data remains at about 65,000 rows.
#!/usr/bin/perl -w
# Converts the raw dump to comma-separated lines: real commas are
# protected first, then the multi-byte separators become commas.
if (exists $ARGV[0]) {
    open(INF, '<', $ARGV[0]) or die "Cannot open $!";
    my @contents = <INF>;   # read the input file (this step was missing)
    close(INF);
    # The snippet as posted printed to OUF without ever opening it;
    # write to a second argument, or "out.csv" if none is given.
    my $out = $ARGV[1] || 'out.csv';
    open(OUF, '>', $out) or die "Cannot open $out: $!";
    foreach my $line (@contents) {
        my $x = $line;
        $x =~ s/,/_/g;                     # protect embedded commas ("Brooklyn, New York")
        $x =~ s/\x20\xC2\xBB\x20/,/g;      # " » " (UTF-8 C2 BB) -> comma
        $x =~ s/\x7C\xE2\x80\xA6\x20/,/g;  # "|… " (UTF-8 E2 80 A6) -> comma
        if ($x =~ m/Page 1\s.*$/i) {       # same line filter as the original snippet
            print OUF $x;
        }
    }
    close(OUF);
}