The Great Parsing #27

Open
CosmicHorrorDev opened this issue Nov 7, 2021 · 7 comments

@CosmicHorrorDev
Owner

Now that #26 worked out how to extract the contents of .vpk files, we have over 60k VDF files to test parsing against, just from the contents of a few Valve games.

The full corpus is much too large, and probably a no-no to include here, but I'll hack together a program that tries to parse each file and dumps any that fail to a separate location. Once I get that running I'll post any failures here.
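A rough sketch of what such a triage program could look like, written in Python for brevity. The `parse` callable stands in for whatever parser is under test, and `collect_failures`, the `*.vdf` glob pattern, and the flattened output naming are all assumptions made for illustration, not the tool actually used here.

```python
import shutil
from pathlib import Path
from typing import Callable


def collect_failures(
    corpus: Path,
    failures: Path,
    parse: Callable[[str], object],
    pattern: str = "*.vdf",  # assumed pattern; the real corpus may use other extensions
) -> list[Path]:
    """Try to parse every matching file under `corpus`; copy failures to `failures`."""
    failures.mkdir(parents=True, exist_ok=True)
    failed = []
    for path in sorted(corpus.rglob(pattern)):
        try:
            # Files that are not valid UTF-8 raise here and count as failures too.
            parse(path.read_text(encoding="utf-8"))
        except Exception:
            # Flatten the relative path so same-named files do not collide.
            rel = path.relative_to(corpus)
            shutil.copy2(path, failures / "__".join(rel.parts))
            failed.append(path)
    return failed
```

Keeping the failing files in a flat directory makes it easy to grep them for a common cause afterwards.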

@CosmicHorrorDev
Owner Author

Exactly two files (with identical contents) use some weird platform tag identifier, like so:

"Foo"
{
    "Bar" [$WIN32]
    {
    }
}

Handling this would probably be a pain, especially since I have no clue what the possible values are and I don't know all the places it can be applied. I'm assuming the above makes "Bar" and its value Windows 32-bit exclusive. Can it also be applied to a value that is a string? Where else could it be used?

@CosmicHorrorDev
Owner Author

It seems common to use \ as a plain path separator rather than to escape a character. I suppose the easiest way to handle this would be to not parse escape sequences by default and add an option to parse them, since they seem incredibly rare.
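As a sketch of the opt-in behaviour described above: the function name `read_string`, the `parse_escapes` flag, and the particular set of escape sequences are all assumptions for illustration. Which escapes VDF really supports is not established in this thread.

```python
# Assumed escape set, for illustration only.
ESCAPES = {"n": "\n", "t": "\t", "\\": "\\", '"': '"'}


def read_string(raw: str, parse_escapes: bool = False) -> str:
    """Return `raw` untouched by default; only interpret escapes when asked to."""
    if not parse_escapes:
        return raw
    out = []
    i = 0
    while i < len(raw):
        if raw[i] == "\\" and i + 1 < len(raw) and raw[i + 1] in ESCAPES:
            out.append(ESCAPES[raw[i + 1]])
            i += 2
        else:
            # Unknown sequences (and lone backslashes) are kept literally.
            out.append(raw[i])
            i += 1
    return "".join(out)
```

With the default, a path such as `materials\new\tile` survives intact. With `parse_escapes=True` the `\n` and `\t` in it would be turned into a newline and a tab, which is the reason to keep the option off unless a file is known to use escapes.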

@CosmicHorrorDev
Owner Author

It seems somewhat common to include a null byte at the end of the file. Not sure if this is specific to packed files and just isn't handled right, or if it is normally present (hopefully it's the former, for consistency).
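Whichever it turns out to be, a caller can work around it before parsing. A minimal sketch, assuming the raw bytes are available and that trailing NULs carry no meaning (`strip_trailing_nul` is a made-up helper name):

```python
def strip_trailing_nul(data: bytes) -> bytes:
    """Drop any NUL bytes at the very end of the file; leave everything else alone."""
    return data.rstrip(b"\x00")
```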

@CosmicHorrorDev
Owner Author

Some files failed to read because they're not UTF-8 encoded. I need to dig into the different encodings used. It may be reasonable to expect users to handle the encoding and convert to UTF-8 for us.
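If conversion is left to the user, it might look something like the sketch below. The encodings actually present in the corpus have not been identified yet, so the UTF-16 BOM check and the `cp1252` fallback are guesses, and `decode_vdf_bytes` is a hypothetical helper.

```python
import codecs


def decode_vdf_bytes(data: bytes, fallback: str = "cp1252") -> str:
    """Best-effort decode to str so the parser only ever sees Unicode text."""
    # Guess: files that start with a UTF-16 byte order mark are UTF-16.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return data.decode("utf-16")  # the codec consumes the BOM
    try:
        return data.decode("utf-8-sig")  # also drops a UTF-8 BOM if present
    except UnicodeDecodeError:
        # Assumed fallback; bytes that do not map are replaced rather than raising.
        return data.decode(fallback, errors="replace")
```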

@CosmicHorrorDev
Owner Author

It looks like the platform-specific tags may be more common after all, and they do seem to indicate the platform that a value is used for. Here's a snippet from another file:

"xpos"	"r223" [$WIN32]
"xpos"	"r223" [$X360HIDEF]
"xpos"	"r220" [$X360LODEF]

This also shows that tags can be applied to string values as well. The tags I've seen so far include WIN32, WIN32WIDE, X360, X360HIDEF, X360LODEF, X360WIDE, DEMO, ENGLISH, JAPANESE, KOREAN, etc. Beyond that, there appears to be some conditional logic as well, like [$WIN32 && $ENGLISH] or [$WIN32 && !$ENGLISH].

The parsing position is a bit awkward as well: the tag appears at the end of the pair for Key-String, but between the two tokens for Key-Obj. With so many possible values it doesn't seem worth trying to parse the specifics; we could just return the string inside the brackets.

These tags appear in 345 of the 16,353 failures.
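To illustrate the "just return the string" idea, here is a line-level sketch that handles both positions: a tag after a Key-String pair, and a tag after the key of a Key-Obj pair (where the `{` follows on a later line). It assumes quoted tokens with no escaped quotes, and `split_conditional` is a name invented for this example.

```python
import re
from typing import Optional, Tuple

PAIR_LINE = re.compile(
    r'^\s*"(?P<key>[^"]*)"'          # quoted key
    r'(?:\s+"(?P<value>[^"]*)")?'    # optional quoted string value
    r'(?:\s*\[(?P<cond>[^\]]*)\])?'  # optional [conditional], kept as raw text
    r'\s*$'
)


def split_conditional(line: str) -> Tuple[str, Optional[str], Optional[str]]:
    """Return (key, value or None, raw conditional text or None) for one line."""
    m = PAIR_LINE.match(line)
    if m is None:
        raise ValueError(f"unrecognised line: {line!r}")
    return m.group("key"), m.group("value"), m.group("cond")
```

The conditional is not interpreted at all, so expressions such as `$WIN32 && !$ENGLISH` come back verbatim and the caller decides what to do with them.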

@CosmicHorrorDev
Owner Author

292 of the 16,353 failures use #base.

Of those files it appears that #base always appears on the top value. I'll have to dig in more to see if #base was ever used with a file that also has a #base

@CosmicHorrorDev
Owner Author

Finally, 15,079 files use \ without trying to represent an escaped character, which makes this a very prevalent issue.

This was referenced Nov 11, 2021