Project file office document embed

Thank you for taking the time to create this wonderful code. I've been able to download your code and build it using Visual Basics. The viewer works when I extract the content of the .doc files I have. Is there a way to run the code from a command prompt?

I have multiple .doc files in separate folders, each containing embedded objects such as .doc, .pdf, and .xls files. I want to extract the contents of each one and place them in the same location as the source file. However, doing this manually is very tedious and time-consuming. If I could run the code from a command prompt, I might be able to write a PowerShell script to automate the process.

Wondering if this is possible? Sorry for the noob question.

Any help or guidance would be appreciated.
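
The batch step described above (find every .doc file under a root folder and extract into the folder the source file lives in) could be sketched roughly as follows. This is a minimal sketch in Python rather than the project's C#, purely to show the folder walk. The `extract` callback is a hypothetical stand-in for whatever actually performs the extraction, and `find_doc_files` and `extract_all` are illustrative names, not part of the library.

```python
import os
from typing import Callable, Iterator, List, Tuple


def find_doc_files(root: str) -> Iterator[str]:
    """Yield the path of every .doc file found under root, recursively."""
    for folder, _subfolders, filenames in os.walk(root):
        for name in sorted(filenames):
            # Case-insensitive match so that FILE.DOC is picked up too.
            if name.lower().endswith(".doc"):
                yield os.path.join(folder, name)


def extract_all(
    root: str, extract: Callable[[str, str], None]
) -> List[Tuple[str, str]]:
    """Call extract(source_file, output_folder) for each .doc under root.

    The output folder is the folder that contains the source file.
    `extract` is an assumption: plug in the real extraction call here.
    """
    processed = []
    for doc_path in find_doc_files(root):
        output_folder = os.path.dirname(doc_path)
        extract(doc_path, output_folder)
        processed.append((doc_path, output_folder))
    return processed
```

Passing the extraction step in as a callback keeps the folder walk independent of how the extractor is invoked, whether that is a library call or a separate command-line program.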

Dmitry Reznikov 2-Aug-17 12:30

Is it possible to use this method to extract macros/macro code from Office files?

Member 11358539 15-Jan-15 5:36

I was wondering if it would be simple enough to extract the IconLabel value and use it as the file name when extracting from Word?

Basically, I need to relate the document I have extracted back to the position I found it in the report.

Kees van Spelde 15-Jan-15 7:40

It should be possible; you probably need to read the WordDocument stream. I have never looked into that. Can you send me one of the files you are trying to read? Please send them to sicos2002@hotmail.com. I'll look at them this weekend.

Kees van Spelde 15-Jan-15 7:42

Member 11358539 16-Jan-15 4:05

Hmm, I tested this when I first set it up, and I found that moving the embedding position and saving did seem to reorder the documents in the pool.

Although, this is clearly not happening in the current document I am working on.

I am specifically trying to extract PDF documents, and I need to be able to relate the extracted document to the position it is in the word document.

I have spent all of this morning reading the Compound File format spec and I cannot find any reference to the icon label. (In the Word API, you can get the icon label of a selected PDF by using ?Selection.InlineShapes(1).OLEFormat.IconLabel.) I also can't seem to find how the EMBED field relates to the object in the ObjectPool.

OK, I think I have got my head round this now: the objects seem to be sorted by OleObject name as text rather than by number. The order goes wrong because I have 11 embedded PDFs in my document, so the object names go up to OleObject11.

The number after the OleObject does relate to the position that those objects appear in the document text.

The sort is ending up like this:
OleObject1 --> EmbeddedObject
OleObject10 --> EmbeddedObject1
OleObject11 --> EmbeddedObject2
OleObject2 --> EmbeddedObject3
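
A minimal sketch of why this happens and one way around it, written in Python for brevity (the code in this thread is C#). A plain string sort compares character by character, so "OleObject10" sorts before "OleObject2". Sorting on a key that treats the digits as a number restores document order. `natural_key` is a hypothetical helper, not something the library provides.

```python
import re


def natural_key(name: str) -> list:
    """Split a name into text and number parts so digits compare as integers.

    "OleObject10" becomes ["oleobject", 10, ""], which sorts after
    ["oleobject", 2, ""].
    """
    parts = re.split(r"(\d+)", name)
    return [int(part) if part.isdigit() else part.lower() for part in parts]


names = ["OleObject1", "OleObject10", "OleObject11", "OleObject2", "OleObject3"]

# Plain string sort: reproduces the out-of-order result shown above.
print(sorted(names))
# ['OleObject1', 'OleObject10', 'OleObject11', 'OleObject2', 'OleObject3']

# Natural sort: the trailing number now decides the order.
print(sorted(names, key=natural_key))
# ['OleObject1', 'OleObject2', 'OleObject3', 'OleObject10', 'OleObject11']
```

The same idea applies in C#: parse the trailing number out of the name and order by that integer instead of by the raw string.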

They do seem to be reordered in the document if you move the position and save.

So I am using something like

result.Add(ExtractFromStorageNode(compoundFile.RootStorage, outputFolder, Path.GetFileNameWithoutExtension(packagePart.Uri.OriginalString) ));

Kees van Spelde 16-Jan-15 5:09

You are not going to find anything about the icon in the compound file format because it has nothing to do with it. Compound file storage is just a storage format that Word 2003 and older used to store information. Excel, PowerPoint, Visio, Outlook, etc. also use the same file structure.
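
Because the compound file is only a container, a file can be recognised as one from its first eight bytes, whatever application wrote it. The sketch below, in Python for brevity, compares a file's header against the compound file signature and against the ZIP local-file signature used by the newer OOXML formats. `sniff_container` is a hypothetical helper name, and the check is deliberately simple: it does not validate the rest of the file.

```python
# Signature at the start of every compound file (binary .doc, .xls, .ppt, .msg).
COMPOUND_FILE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
# Local file header signature at the start of a non-empty ZIP archive,
# which is what .docx, .xlsx, and .pptx files are.
ZIP_MAGIC = b"PK\x03\x04"


def sniff_container(path: str) -> str:
    """Return "compound-file", "zip", or "unknown" based on the header bytes."""
    with open(path, "rb") as handle:
        header = handle.read(8)
    if header.startswith(COMPOUND_FILE_MAGIC):
        return "compound-file"
    if header.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"
```

A check like this only tells you which container to open; finding the embedded objects still means reading the storages and streams inside it.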

Kees van Spelde 16-Jan-15 5:12

And if you want to speed things up.

Member 11358539 16-Jan-15 6:02

Thank you, Kees. For my purposes, I think extracting the files based on their 'oleObject' name is going to satisfy my requirements for now. I needed to get a fix out by the end of today, so that has solved my current issue.

Although, as I said, I think this probably only works for OOXML documents. I don't necessarily need the icon text; I just need to be able to relate the position where the icon is found in the document to the EmbeddedObject that is extracted.

My software basically converts the Word document to PDF, but it expands all embedded PDFs into the document and scales them down to fit on the page.

Kees van Spelde 16-Jan-15 7:16

Good luck. Please consider sharing some code with the open source community if you make something that could be useful to other people.

T Jenniges 10-Jun-14 13:55

I was working with PowerPoint embedded document extraction last week for use in a massively parallel document ingestion prototype. I used OpenMCDF for the compound document parsing.

I see you embedded the OpenMCDF library, which has a Mozilla license (http://www.mozilla.org/MPL/), along with your extractor class in the same project.

This might make your code MPL-licensed as well, rather than CPOL. You might want to separate concerns and just link to the OpenMcdf assembly.

Kees van Spelde 11-Jun-14 3:08

You had a good point about the license. I removed the OpenMCDF code and put it in a new NuGet package called CompoundFileStorage. I didn't use the original OpenMCDF package because it was written for C# .NET 2.0 and the code was a little messy (but still very good). I upgraded the code to more modern standards and used features that are only available in .NET 4.0.

