
Huge UFO Database in DOS - Someone's labour of love, can it be saved?


posted on Mar, 18 2017 @ 04:34 PM
First of all, thank you to Isaac for the screenshots. There are still some bytes that are defying explanation, which is upsetting my OCD.


That said, with Isaac's permission, here's a summary of the work to date:

The Data

(i) Data extraction

I have extracted the entire database (and the accompanying source document) into XML format, though I still consider this "beta" due to some unexplained bytes. The file is about 45 MB (compared to the original .rnd at a mere 2 MB!) but has the data split into (mostly) human-understandable fields with human-readable data.

I'm happy to make this available once I've finished tweaking it. The file itself is of limited use on its own due to the size (Notepad will hate you for trying to open it; Notepad++ is a bit more forgiving) but is ideal for loading into another program - for which a solution will be available, if you keep reading!
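For anyone planning to roll their own tools in the meantime, here's a minimal sketch using PHP's streaming XMLReader, which avoids pulling the whole 45 MB into memory at once. The file name and the record element name are placeholders, since the final layout is still being tweaked:

<?php
// Stream the large XML export record by record rather than
// loading the whole file into memory at once.
$reader = new XMLReader();
$reader->open('u_database.xml'); // placeholder file name

while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'record') {
        $record = new SimpleXMLElement($reader->readOuterXML());
        // ... process $record here ...
    }
}

$reader->close();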

(ii) Latitude/Longitude

Some issues arose from the peculiarities of computer arithmetic. Coordinates extracted directly from the original db and converted to decimal degrees/DMS showed larger-than-acceptable deviations; some were as much as 100 miles off from what you would see if you viewed the data in the original program.

In the end I extracted the text DMS format (i.e. "10 0 0 N 9 9 9 E") from Isaac's PDF printout, on the basis that this shows the coordinates that the original author intended to display. I then inserted this data into the XML file I had created. I included a conversion to decimal degrees as well, but this way I knew I could return the definitive DMS version that you would see if you had used the original program.

(iii) Encoding

Some encoding issues may still be present in the text. By encoding, I refer to the use of special symbols in the text summary field that are not in regular use or are encoded differently from more common systems such as UTF-8 (which is the encoding used for the XML file). The way this presents is that some characters may have been converted to something that appears... a bit random. Without manually checking all 18,122 records it's difficult to say whether this is a major issue or a minor sacrifice!

For now I am using preg_replace to strip out the "command data" that is causing XML to choke, and using PHP's utf8_encode() function to convert the other special characters to something more appropriate for UTF-8. When displayed in HTML, these characters appear as they do in the original database.
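In rough outline, that sanitising step looks something like the sketch below. The control-character range is an approximation of the "command data" rather than the definitive list, and the codepage note is an assumption:

<?php
// Rough sketch of the pre-export sanitising step.
function sanitise_for_xml($raw)
{
    // Strip control bytes that make the XML parser choke
    // (everything below 0x20 except tab, LF and CR).
    $clean = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $raw);

    // Re-encode remaining special characters for UTF-8 output.
    // utf8_encode() assumes ISO-8859-1 input; if the DOS database
    // uses CP437, iconv('CP437', 'UTF-8//TRANSLIT', $clean) would
    // be the more faithful conversion.
    return utf8_encode($clean);
}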

(iv) Mystery data

There remain five bytes and two nibbles (partial bytes) that are unidentified - though I have my suspicions. The full database includes altitude and elevation data that is not present in the demo version that I have. I expect to find this buried in the data somewhere.

(v) The Great Reference Mystery

The program does something quite interesting with sources. It displays the reference, and then a "pinpoint" or specific part of the reference (i.e. page number). We have that data. However, the program also identifies whether it is a page number, or issue, or case, or one of a few other kinds of pinpoint. So far I have not been able to work out quite how the program knows this. It is most likely connected to one of the mystery bytes or, more annoyingly, part of one of the mystery bytes, which makes it more difficult to identify.

(vi) Additional "look up" data

The lookup data falls into two categories: the terrain/continent/etc codes, and the sources.

So far I've been manually inputting the terrain codes etc., but I will turn this into XML at some stage for ease of import.

The sources file has already been exported as XML.

(vii) The Location Debate

This is a bit of a pet peeve, possibly, but it's really a function of trying to design a very compact database with limits on the space available. The original db contains a rather... "rough"... collection of locations, breaking it down by continent (though they're actually regions, not continents), country, and then a region within the country. It uses custom reference codes for this.

As I am not restrained by those same limitations, I am considering using the lat/long data and the miracle of reverse geocoding to create location fields that align with the modern format set out in ISO 3166-1 alpha-3:

en.wikipedia.org...

I feel this would be more intuitive and that aligning with a widely used and understood format would mean that more could potentially be done with the data.

The question would then become whether it is worth overhauling the existing custom continent/country/region system to a standards-compliant [continent?]/ISO 3166-1 Country/ISO 3166-2 Region.

The LH database is more granular than the standard two-letter continent codes, which may be useful when refining searches, so this granularity may be worth preserving; however, it would present an issue in correctly assigning country codes to this non-standard continent breakdown. Many would translate directly (such as North America/South America), but LH goes further, breaking Europe down into west and east, Africa into north and south, etc.

Any thoughts in relation to this are welcome. There is nothing to say that the original data has to be replaced - a modernised dataset could live alongside it - but the question is: is it worth it?

Database

Yes, there is a working MySQL database now living on my test server. It's currently in a "bare" version, essentially mimicking the record structure of the original. This is far from optimal, so I am working on a schema that will give the best flexibility and power in terms of searches.

There is also a site being built so that the entire database will be searchable online. This will also allow me to pass the location data to Google Maps rather than messing around with rebuilding the custom maps from the original program.

To Do

(i) Identify the mystery data!

(ii) Tweak the pre-export parsing to ensure that only data that will break the XML is removed. There may be some characters being caught that could be handled better by UTF-8 encoding.

(iii) Restructure the XML to something more satisfactory. It's functional, but I think the layout could be optimised.

(iv) Finish building the website with a nice front end to make the data easy to search and pretty to look at!

(v) Design some decent SQL queries rather than the rough-and-ready test ones I've been throwing into MySQL Workbench.


(vi) Some other stuff which I've probably forgotten to mention

If you're wondering about the data structure, there's another post coming up...
edit on Sat, 18 Mar 2017 16:34:57 -0500 by EvillerBob because: (no reason given)



posted on Mar, 18 2017 @ 04:57 PM
The Record Structure

This is the structure as currently understood, though I haven't assessed Isaac's screenshots yet:

Bytes [2] Year
Bytes [1] Terrain (integer, refers to a terrain lookup table)
Bytes [1] Month (actually split into two 4-bit nibbles; one is the month, the other is a mystery)
Bytes [1] Day (split into two parts: a 5-bit day and a 3-bit mystery)
Bytes [1] Time
Bytes [1] YMDT (the accuracy codes [0-3] for Year, Month, Day, Time)
Bytes [1] Duration (in minutes)
Bytes [1] Unknown
Bytes [2] Longitude (1-bit flag and 15-bit integer; convert to get longitude)
Bytes [2] Latitude (1-bit flag and 15-bit integer; convert to get latitude)
Bytes [5] Unknown (likely to include altitude/elevation)
Bytes [1] Continent/Country (two 4-bit nibbles, referring to the continent/country lookup tables)
Bytes [4] Location (string; 3 characters acting as a state/province code)
Bytes [8] Attributes (binary; 64 bits acting as flags for various attributes)
Bytes [78] Text (string; the text summary included with each record)
Bytes [1] Reference (integer, refers to the resource file)
Bytes [1] Pinpoint (integer acting as page number/issue/etc. of reference)
Bytes [1] Strangeness/Credibility (two 4-bit nibbles, each holding a hex value 0-F)

When I say "look up tables" I think this is really a multi-dimensional look-up array, but that would just be nitpicking



posted on Mar, 18 2017 @ 05:28 PM

originally posted by: EvillerBob
here's a summary of the work to date:


Cheers EvillerBob. You've obviously been busy on this during the last few days!



I know one other person has been making progress as well, so I've emailed him to let him know about your post in this thread. I hope your work complements his (and vice versa).



posted on Mar, 18 2017 @ 08:33 PM

originally posted by: IsaacKoi

originally posted by: EvillerBob
here's a summary of the work to date:


Cheers EvillerBob. You've obviously been busy on this during the last few days!



I know one other person has been making progress as well, so I've emailed him to let him know about your post in this thread. I hope your work complements his (and vice versa).


I sat on it for a few days when real life work cropped up, or I would be much further ahead at this stage. I can only apologise for the delay.

Great to hear that others are working on it too. More brains crunching the numbers is always a good thing. Having looked at the link that you sent me, I wanted to clarify the following:



4 (0x04) : Sighting day (1 byte): 32 if unknown


This is true for record #2 (the first record in the set with a 0 day) but not for others. Records #3, #4, and #6 all return b00000000. This is consistent with the LHD manual, which states that, for the day field, "00 = unknown" (p. 19).

Instead, this byte is actually split into two separate pieces of information - in this case, the number "1" and then the day. This extra bit of info was actually my prime suspect for the missing "resource type" flag, but that suspicion was quickly proved wrong!

However, I agree with that person that using modulus to get the minutes out of the time base is better, and I will probably update that for the next XML extraction.



posted on Mar, 18 2017 @ 09:05 PM
Just to confirm:

Four of the five bytes marked as "unknown" do indeed correlate to "Elevation in meters" (2 bytes) and "Relative altitude" (2 bytes), as shown in the screenshots provided. This also confirms that the final unknown is indeed a single byte, which makes it a bit easier to assess what it is doing.

There appears to be something else happening with those fields, however, so a bit of further tinkering is in order. It may be that only a certain number of the bits are set aside - unfortunately the manual doesn't seem to give a maximum range for this field, which would tell us how many bits it needs.

As it's 2am, I'm not terribly excited about the idea of investigating it further tonight!
edit on Sat, 18 Mar 2017 21:06:18 -0500 by EvillerBob because: (no reason given)



posted on Mar, 19 2017 @ 09:58 AM
a reply to: EvillerBob

Hi EvillerBob,

Isaac notified me about your work and, thanks to your file structure post, I was able to locate and decode the lat/long records (and credited you for this). As you said, the values read do not match those in the ASCII export Isaac provided us, so I applied a factor of 1.11111111 to them. The result is almost correct (see the example output at github.com...), though unfortunately with a possible delta of +/- 20 seconds. Maybe you (nablator, or somebody else) can help solve this riddle?

Thanks also for correcting my 32=unknown error and deciphering the elevation/relative altitude bytes. I wonder why the "no value" values differ between them (-99 and 999). Maybe this suggests that some higher bits again serve different purposes.
edit on 19-3-2017 by javarome because: (no reason given)



posted on Mar, 19 2017 @ 12:53 PM

originally posted by: javarome
a reply to: EvillerBob

Hi EvillerBob,

Isaac notified me about your work and, thanks to your file structure post, I was able to locate and decode the lat/long records (and credited you for this). As you said, the values read do not match those in the ASCII export Isaac provided us, so I applied a factor of 1.11111111 to them. The result is almost correct (see the example output at github.com...), though unfortunately with a possible delta of +/- 20 seconds. Maybe you (nablator, or somebody else) can help solve this riddle?

Thanks also for correcting my 32=unknown error and deciphering the elevation/relative altitude bytes. I wonder why the "no value" values differ between them (-99 and 999). Maybe this suggests that some higher bits again serve different purposes.


Welcome aboard!

The credit should go to others on the board - nablator and harpysounds identified the majority of the record structure, for example, including the position and structure of the lat/long bytes. The structure I posted is essentially the same as was posted much earlier in the thread, but adjusted for the extra data found in the full version.

The lat/long issue is, to my mind, essentially a rounding error. It was accurate for many records, a third of a minute off for some, and a minute or more off for a small handful of the test samples. The closest we have to a definitive answer is the DMS as it appears in the database printout, so I've just captured that data instead and left the database values alone, as they are now redundant.

For reference, the decimal degrees I'm including are calculated as follows:

decimaldegrees = dmsdegrees + (dmsminutes / 60) + (dmsseconds / 3600)

I'm storing the decimal degrees as "DECIMAL(15,12)" in the new database. The original DMS values are stored as three TINYINT fields, and these are used when displaying a record, rather than playing the conversion game needlessly. In fact, I could probably drop the decimal degrees completely, as Google Maps appears to support DMS.
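For completeness, here's a minimal PHP sketch of the DMS-to-decimal conversion above, with hemisphere handling added (S and W giving negative values, per the usual convention):

<?php
// Convert DMS values, as captured from the printout
// (e.g. "10 0 0 N 9 9 9 E"), to signed decimal degrees.
function dms_to_decimal($deg, $min, $sec, $hemi)
{
    $decimal = $deg + ($min / 60) + ($sec / 3600);
    // South and West are negative by convention.
    return ($hemi === 'S' || $hemi === 'W') ? -$decimal : $decimal;
}

list($d1, $m1, $s1, $h1, $d2, $m2, $s2, $h2) = explode(' ', '10 0 0 N 9 9 9 E');
$lat = dms_to_decimal((int)$d1, (int)$m1, (int)$s1, $h1); // 10.0
$lon = dms_to_decimal((int)$d2, (int)$m2, (int)$s2, $h2); // 9.1525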

It's an unfortunate limitation that I'm not able to cross-reference the .rnd file with the output as it appears in the original program. This might give a few clues as to what extra data the records are holding. Isaac has been very kind in supplying screenshots to date, but he might get a bit worn down if I send him requests for hundreds of records.



posted on Mar, 19 2017 @ 01:38 PM

originally posted by: EvillerBob
Isaac has been very kind in supplying screenshots to date, but he might get a bit worn down if I send him requests for hundreds of records.


Well, if that's what you need...

I could certainly start by sending you another one or two dozen screenshots (by, say, posting a Wetransfer link here), if that would help iron out the last few wrinkles.



posted on Mar, 19 2017 @ 02:50 PM
If the data is now in XML format then we can remake the application pretty easily, either as a web app for universal access, or even a mobile app if there is enough demand.



posted on Mar, 19 2017 @ 03:14 PM

originally posted by: IsaacKoi

originally posted by: EvillerBob
Isaac has been very kind in supplying screenshots to date, but he might get a bit worn down if I send him requests for hundreds of records.


Well, if that's what you need...

I could certainly start by sending you another one or two dozen screenshots (by, say, posting a Wetransfer link here), if that would help iron out the last few wrinkles.


It might be more useful for me to identify some specific records first, based on which part of the data I'm trying to investigate. If it's not taking liberties with your time, I'll draw up a list. It might not be until tomorrow, however.



posted on Mar, 19 2017 @ 03:21 PM

originally posted by: avery51
If the data is now in XML format then we can remake the application pretty easily, either as a web app for universal access, or even a mobile app if there is enough demand.


That's essentially what I am working towards, though via an online portal rather than a standalone application.

The original application also contains a lot more than just the records. It's rather impressive!

On a slight tangent, I believe that the database file we have is not the most recent. From the link to the LH website, we find that "As of June 2006, *U* holds 18,552" records, while our copy goes up to 18,122. The last record we have is dated 19th May 2003.



posted on Mar, 19 2017 @ 04:54 PM
Hey, just to keep you updated: I've just integrated the decoding of the flags (covering location, UFO/occupants, etc.).

Those 3-letter flags are arranged 8 to a row, so, basically, each flag is set when its corresponding bit is set in bytes 23 to 30 of a record. However, some flags are described as having sub-possibilities, such as "PSH: 1) Pseudo-Human: Possible clone, robot or worse. 2) "Human" seen working with or for alien figures", and I wonder how one can choose between them using a single bit of information.
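In code terms, the decoding amounts to testing each of the 64 bits against an ordered list of flag names, something like the sketch below ($flagNames must be supplied in matrix order, and whether the low or the high bit of each byte comes first is itself an assumption):

<?php
// Sketch of the flag decoding: bytes 23-30 of a record hold 64
// bits, each mapping in order to one 3-letter attribute code.
// LSB-first within each byte is an assumption.
function decode_flags($rec, array $flagNames)
{
    $set = [];
    for ($i = 0; $i < 8; $i++) {
        $byte = ord($rec[23 + $i]);
        for ($bit = 0; $bit < 8; $bit++) {
            if ($byte & (1 << $bit)) {
                $set[] = $flagNames[$i * 8 + $bit];
            }
        }
    }
    return $set; // e.g. ["PSH", "CIV", ...]
}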

Feel free to check the details (and output result) on github.com...

Thanks again to Isaac, who provided the screenshots that made it possible to decode this.
edit on 19-3-2017 by javarome because: (no reason given)



posted on Mar, 19 2017 @ 05:05 PM
a reply to: javarome

Incredible work, guys!

Good to see that the number of witnesses/observers is also included in the database via the CIV, MIL and GND flags. Could be useful for distinguishing "mass sightings" from single-observer sightings at a glance (later on, if the frontend includes a map).

Keep it up!



posted on Mar, 19 2017 @ 05:10 PM

originally posted by: javarome
Hey, just to keep you updated: I've just integrated the decoding of the flags (covering location, UFO/occupants, etc.).

Those 3-letter flags are arranged 8 to a row, so, basically, each flag is set when its corresponding bit is set in bytes 23 to 30 of a record. However, some flags are described as having sub-possibilities, such as "PSH: 1) Pseudo-Human: Possible clone, robot or worse. 2) "Human" seen working with or for alien figures", and I wonder how one can choose between them using a single bit of information.

Feel free to check the details (and output result) on github.com...

Thanks again to Isaac, who provided the screenshots that made it possible to decode this.


The 64-bit attribute list is set out in the manual at p. 27. The attributes appear to map directly, and in order, to the matrix displayed on screen - though note that I believe the full version made some slight changes to the last row compared to the demo version.

I believe the "sub possibilities" are not separate options but rather to show that both possibilities fall under the same heading.

Also, rather annoyingly, at least one of the three-letter codes is reused, which caused me a bit of a headache last week. I was using an array to generate the individual attribute elements for the XML file. XML is perfectly happy with duplicate elements, so I didn't notice any problem... until I tried to autogenerate the matching SQL fields and started hitting a duplicate field error.



posted on Mar, 19 2017 @ 05:23 PM
Also, be careful with the country codes. The byte is actually two 4-bit nibbles. The first is the "continent" code, the second is the "country" code.

For example - the UK is "GBI" in this system. It is in WEU ("Western Europe"), which is continent code 3. GBI is the first in the list, so the country code is 0.

In binary - remembering that the byte is actually two 4-bit nibbles - this becomes:

0011 0000

Converting the entire byte to a decimal brings you to 48 as you say, but it is actually 3 / 0.

Think of it as a multidimensional array.
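In code, the split is just a shift and a mask:

<?php
// Split the continent/country byte into its two 4-bit nibbles.
$byte      = 0x30;                // 0011 0000 - the GBI example above
$continent = ($byte >> 4) & 0x0F; // high nibble: 3 (WEU)
$country   = $byte & 0x0F;        // low nibble:  0 (GBI)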


edit on Sun, 19 Mar 2017 17:24:07 -0500 by EvillerBob because: (no reason given)



posted on Mar, 19 2017 @ 05:29 PM

originally posted by: EvillerBob
Also, be careful with the country codes. The byte is actually two 4-bit nibbles. The first is the "continent" code, the second is the "country" code.

For example - the UK is "GBI" in this system. It is in WEU ("Western Europe"), which is continent code 3. GBI is the first in the list, so the country code is 0.

In binary - remembering that the byte is actually two 4-bit nibbles - this becomes:

0011 0000

Converting the entire byte to a decimal brings you to 48 as you say, but it is actually 3 / 0.

Think of it as a multidimensional array.



Okay, thanks for the tip. Until now I had assumed some hardcoded mapping in the code, not in the db (i.e. GBI -> WEU, as there are no other possibilities / storing the data as a "variable" part doesn't really make sense here).



posted on Mar, 19 2017 @ 05:40 PM

originally posted by: EvillerBob
If it's not taking liberties with your time, I'll draw up a list.


No problem. Just let me know what you need.

I don't expect other people to do all the work.




posted on Mar, 19 2017 @ 06:07 PM
Just to chime in with a word of respect for this collaborative effort to preserve this database so the rest of us can some day benefit from it.

I was a forensics C programmer years ago and helped to reverse-engineer "proprietary" databases that big companies were actually using, without standards or documentation. Once those companies went belly-up, it took months to figure out what they did, and the companies using their products suffered severely.

I really appreciate what you have all done, and you should be commended. This is a side of coding that many do not get to see, and even the chronology of postings in this thread, though only the tip of the iceberg of all the behind-the-scenes work you have done, is a testament to the synergy required to do such things. Congrats!



posted on Mar, 19 2017 @ 06:36 PM

originally posted by: javarome...I assumed some hardcoded mapping in the code, not in the db...


It actually appears to be a bit of both. Yes, the db is holding relational data rather than the data itself, but the relational data is almost certainly a direct reference to a hard-coded array. I don't recall there being any way to add continent/country information, though local regions can be added - or, more specifically, "screwed up" - as the program relies on the user consistently using the same (unique) three-letter local region codes for some of the other statistical functions. The very thought makes my eye twitch.



posted on Mar, 19 2017 @ 06:38 PM

originally posted by: charlyv
I was a forensics C programmer years ago and helped to reverse-engineer "proprietary" databases that big companies were actually using, without standards or documentation.


I've always strongly recommended that people never comment their code; it takes all the fun out of things for future programmers...


