From: Michal H. <ms...@gm...> - 2008-09-08 22:15:44
Hi,

this is the second attempt at correct text encoding handling in our code. See the raw changelog below and the separate patches for more details. Jozo, could you have a look so that I can continue?

Currently we display the raw string from text operators (the one stored in the text operator's operands). This is not correct, because the operands don't contain strings of characters but rather codes which the font maps to glyphs. Simple fonts may have a 1:1 mapping where the code is the same as the character, but a PDF creator is free to choose an arbitrary mapping (e.g. OO numbers the codes from 1 for all characters present).

The first patch provides access to the content stream resources. The second one adds a factory method for PdfOperator creation in place of hard-coded instantiation. The third one adds TextSimpleOperator, which inherits from SimpleGenericOperator and handles the text from its operands in a special way (rather than in generic code). It also moves raw string extraction from the scripting to TextSimpleOperator. The last patch uses StateUpdater for post-initialization according to the current GfxState. In short, it stores the current font data in the operator (the name and tag, which can be used for retrieving the GfxFont from the resources).

This is in no way a complete solution. I was able to get the correct text from the document associated with the bug report (http://pdfedit.petricek.net/bt/view.php?id=253), but I had other problems with an example text created by OO which contains Slovak, Czech and Russian text (in CVS as multi_enc.pdf). The text is not displayed and even the search engine doesn't work 100% correctly.

The current implementation provides only the op -> text conversion. The other direction is not implemented yet, but it has a solid base to be done in the next version (we have the font data, but I don't have a good idea about the conversion yet; this will probably require some changes to the GfxFont code).
Changelog v1 -> v2
==================
* operator_factory.patch added, which adds a factory function for operators as discussed.
* resources.patch added, which adds a getResources method to CContentStream so that the resources are available from text operators, which hold a reference to their content stream. This patch comes as a separate change, because it can be committed as it is even without all the other patches. It can't produce any regression.
* additional data stored in the text operator object for its font identification. This will help to get to the font instance anytime later.

--
Michal Hocko
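To illustrate why the raw operand string cannot be displayed directly, here is a minimal sketch of the code-to-character step described in the cover letter. It is not PDFedit or xpdf code: `CodeMap`, `appendUtf8` and `decodeCodes` are hypothetical names, the font is reduced to a plain lookup table with one-byte codes, and replacing unmapped codes with U+FFFD is an assumption made only for this example.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for a font's code-to-Unicode table
// (one-byte codes only; real fonts can use multi-byte codes).
using CodeMap = std::map<unsigned char, char32_t>;

// Append one Unicode code point to a UTF-8 encoded string.
void appendUtf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Translate the codes stored in a text operand into displayable text.
// Codes missing from the table become U+FFFD (assumption of this sketch).
std::string decodeCodes(const std::string& raw, const CodeMap& font) {
    std::string out;
    for (unsigned char code : raw) {
        CodeMap::const_iterator it = font.find(code);
        appendUtf8(out, it != font.end() ? it->second : char32_t(0xFFFD));
    }
    return out;
}
```

With a table such as `{1 -> 'H', 2 -> 'i'}`, the operand bytes `\x01\x02` decode to "Hi", although the raw string itself contains no readable characters. This is the situation described above for documents whose creator numbers the codes from 1.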
From: Michal H. <ms...@gm...> - 2008-09-08 22:15:44
This patch introduces the specialized TextSimpleOperator class for text operators which hold text, and replaces the script-based implementation of the getTextFromTextOperator function with the native TextSimpleOperator::getRawText function. We need this new operator for proper handling of encoding for text objects.

QSPdfOperator::getEncodedText is the definitive scripting interface for providing correctly encoded text; it is used by the getTextFromTextOperator scripting function. Currently we simply provide the raw text without any consideration of the font encoding. This patch prepares the ground for a proper implementation inside the TextSimpleOperator class.

I also think that it is cleaner to have getTextFromTextOperator implemented in the core rather than in scripting, because this functionality may be required by code which is not based on scripting (e.g. users of the devel package) and it is too core to be scriptable (but this is arguable).

* gui/pdfoperator.qs
  - getTextFromTextOperator implementation removed and replaced by a simple call to getEncodedText
* gui/qspdfoperator.{cc,h}
  - QSPdfOperator::getEncodedText added to enable the TextSimpleOperator::getEncodedText method to be exported to the scripting. I am not sure whether this is the proper place, but there doesn't seem to be a better one.
* kernel/pdfoperators.{cc,h}
  - TextSimpleOperator class added
    - inherits from SimpleGenericOperator
    - adds getRawText, which is based on the previous getTextFromTextOperator implementation and doesn't implement the encoding conversion yet
  - createOperator factory function updated to create TextSimpleOperator for text operators
* kernel/stateupdater.h
  - isTextOp added

Changelog v1 -> v2
==================
* getEncodedText renamed to getRawText because we don't implement any encoding related stuff in this patch, so this way it should be clearer. Note that the QSPdfOperator::getEncodedText name is kept, because it really involves the encoding set for the application.
* createOperator factory updated

Index: pdfedit-patches/src/gui/pdfoperator.qs
===================================================================
--- pdfedit-patches.orig/src/gui/pdfoperator.qs 2008-09-08 14:42:43.000000000 -0600
+++ pdfedit-patches/src/gui/pdfoperator.qs 2008-09-08 14:46:35.000000000 -0600
@@ -971,37 +971,7 @@ function getTextFromTextOperator( op ) {
     if (! isTextOp( op ))
         return "";

-    switch (op.getName()) {
-        case "'":
-        case "Tj":
-            if (op.paramCount() == 1) {
-                return op.params().property(0).value();
-            }
-            break;
-        case "\"":
-            if (op.paramCount() == 3) {
-                return op.params().property(2).value();
-            }
-            break;
-        case "TJ":
-            if ((op.paramCount() != 1) ||
-                (op.params().property(0).type() != "Array")) {
-                break;
-            }
-            var p = op.params().property(0); // Array of String and Numbers
-            var c = p.count();
-            var i = 0;
-            var text = "";
-            for ( ; i < c ; ++i ) {
-                if (p.property(i).getType() == "String") {
-                    text = text + p.property(i).value();
-                }
-            }
-            return text;
-            break;
-    }
-
-    return "";
+    return op.getEncodedText();
 }

 /** Move text operator relative [dx,dy] */
Index: pdfedit-patches/src/gui/qspdfoperator.cc
===================================================================
--- pdfedit-patches.orig/src/gui/qspdfoperator.cc 2008-09-08 14:42:43.000000000 -0600
+++ pdfedit-patches/src/gui/qspdfoperator.cc 2008-09-08 14:46:35.000000000 -0600
@@ -112,6 +112,16 @@ QString QSPdfOperator::getText() {
     return util::convertToUnicode(text,util::PDF);
 }

+QString QSPdfOperator::getEncodedText() {
+    std::string text;
+    TextSimpleOperator * textOp = dynamic_cast<TextSimpleOperator*>(obj.get());
+    if (!textOp)
+        return QString::null;
+    // TODO change to return Font encoded text
+    textOp->getRawText(text);
+    return util::convertToUnicode(text, util::PDF);
+}
+
 /** Create new operator iterator from this PDF operator.
  The iterator will be initially positioned at this item
Index: pdfedit-patches/src/gui/qspdfoperator.h
===================================================================
--- pdfedit-patches.orig/src/gui/qspdfoperator.h 2008-09-08 14:42:43.000000000 -0600
+++ pdfedit-patches/src/gui/qspdfoperator.h 2008-09-08 14:46:35.000000000 -0600
@@ -126,6 +126,14 @@ public slots:
     int childCount();
     /*- Return text representation of this pdf operator */
     QString getText();
+
+    // TODO place to somehow better place or leave it here?
+    /*-
+     Returns encoded text from text operator. Returns an empty string
+     for all other operators.
+    */
+    QString getEncodedText();
+
     /*- Return name of this pdf operator */
     QString getName();
     /*- Returns parameters of this operator in array */
Index: pdfedit-patches/src/kernel/pdfoperators.cc
===================================================================
--- pdfedit-patches.orig/src/kernel/pdfoperators.cc 2008-09-08 14:44:53.000000000 -0600
+++ pdfedit-patches/src/kernel/pdfoperators.cc 2008-09-08 14:47:30.000000000 -0600
@@ -171,6 +171,69 @@ SimpleGenericOperator::init_operands (sh
 }


+void TextSimpleOperator::getRawText(std::string& str)const
+{
+using namespace utils;
+    utilsPrintDbg(debug::DBG_DBG, "");
+    std::string name, rawStr;
+    getOperatorName(name);
+    Operands ops;
+    getParameters(ops);
+    if(name == "'" || name == "Tj")
+    {
+        if(ops.size() != 1 || !isString(ops[0]))
+        {
+            utilsPrintDbg(debug::DBG_WARN, "Bad operands for operator "
+                    <<name<<" count="<<ops.size()
+                    <<" ops[0] type="<< ops[0]->getType());
+            return;
+        }
+        rawStr = getStringFromIProperty(ops[0]);
+    }
+    else if (name == "\"")
+    {
+        if(ops.size() != 3 || !isArray(ops[2]))
+        {
+            utilsPrintDbg(debug::DBG_WARN, "Bad operands for operator "
+                    <<name<<" count="<<ops.size()
+                    <<" ops[2] type="<< ops[2]->getType());
+            return;
+        }
+        rawStr = getStringFromIProperty(ops[2]);
+    }
+    else if (name == "TJ")
+    {
+        shared_ptr<IProperty> op = ops[0];
+        if (!isArray(op) || ops.size() != 1)
+        {
+            utilsPrintDbg(debug::DBG_WARN, "Bad operands for TJ operator: ops[type="
+                    << op->getType() <<" size="<<ops.size()<<"]");
+            return;
+        }
+        shared_ptr<CArray> opArray = IProperty::getSmartCObjectPtr<CArray>(op);
+        std::vector<shared_ptr<IProperty> > props;
+        opArray->_getAllChildObjects(props);
+        std::vector<shared_ptr<IProperty> >::iterator i;
+        for(i=props.begin(); i!=props.end(); ++i)
+        {
+            shared_ptr<IProperty> p = *i;
+
+            // TODO consider spacing coming from values
+            if(!(isString(p)))
+                continue;
+            rawStr += getStringFromIProperty(p);
+        }
+
+    }else
+    {
+        utilsPrintDbg(debug::DBG_WARN, "Bad operator name="<<name);
+        return;
+    }
+
+    str = rawStr;
+}
+
+
 //==========================================================
 // Concrete implementations of CompositePdfOperator
 //==========================================================
@@ -306,6 +369,12 @@ boost::shared_ptr<PdfOperator> createOpe
     // Get operands count
     size_t argNum = static_cast<size_t> ((chcktp->argNum > 0) ? chcktp->argNum : -chcktp->argNum);

+    //
+    // If endTag is "" it is a simple operator, composite otherwise
+    //
+    if (isTextOp(*chcktp))
+        return shared_ptr<PdfOperator> (new TextSimpleOperator(chcktp->name, argNum, operands));
+
     if (isSimpleOp(*chcktp))
         return shared_ptr<PdfOperator> (new SimpleGenericOperator (chcktp->name, argNum, operands));

Index: pdfedit-patches/src/kernel/pdfoperators.h
===================================================================
--- pdfedit-patches.orig/src/kernel/pdfoperators.h 2008-09-08 14:43:39.000000000 -0600
+++ pdfedit-patches/src/kernel/pdfoperators.h 2008-09-08 14:46:35.000000000 -0600
@@ -116,6 +116,31 @@ public:

 }; // class SimpleGenericOperator

+/** Text dedicated operator class.
+ * This class represents those text operators which contains text to be
+ * displayed. This is necessary, because text string stored in operator's
+ * operands is not the same as the displayed one in general and may be
+ * affected by font encoding.
+ */
+class TextSimpleOperator: public SimpleGenericOperator
+{
+public:
+    TextSimpleOperator (const char* opTxt, const size_t numOper, Operands& opers)
+        :SimpleGenericOperator(opTxt, numOper, opers) {}
+    TextSimpleOperator(const std::string& opTxt, Operands& opers)
+        :SimpleGenericOperator(opTxt, opers) {}
+
+    virtual ~TextSimpleOperator() {}
+
+    /** Returns string represented by this text operator in raw format.
+     * Raw format doesn't take care about font used for this operator.
+     * @param str String to be set.
+     */
+    virtual void getRawText(std::string& str)const;
+
+}; // class TextSimpleOperator
+
+
 //==========================================================
 // Concrete implementations of CompositePdfOperator
Index: pdfedit-patches/src/kernel/stateupdater.h
===================================================================
--- pdfedit-patches.orig/src/kernel/stateupdater.h 2008-09-08 14:46:49.000000000 -0600
+++ pdfedit-patches/src/kernel/stateupdater.h 2008-09-08 14:47:03.000000000 -0600
@@ -226,6 +226,22 @@ isSimpleOp (const StateUpdater::CheckTyp
     { return ('\0' == chck.endTag[0]); }

 /**
+ * Is it a text operator (one which holds text to be displayed).
+ * @param chck Check type structure.
+ * @return True if chck is a text operator, false otherwise.
+ */
+inline bool isTextOp(const StateUpdater::CheckTypes& chck)
+{
+    if (!strcmp(chck.name, "TJ") ||
+        !strcmp(chck.name, "Tj") ||
+        !strcmp(chck.name, "\"") ||
+        !strcmp(chck.name, "'")
+       )
+        return true;
+    return false;
+}
+
+/**
  * Check if the operands match the specification and replace operand with
  * its stronger equivalent.
  *
--
Michal Hocko
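The per-operator extraction rules that getRawText takes over from the removed script can be sketched independently of the PDFedit kernel. The snippet below is only an illustration under stated assumptions: `Operand`, `num`, `str`, `arr` and `rawText` are hypothetical names, not the kernel's IProperty/CArray types, and the operand layout (one string for `Tj` and `'`, the string as the third operand of `"`, an array of strings and numbers for `TJ`) is taken from the script code removed by this patch.

```cpp
#include <string>
#include <vector>

// Hypothetical operand model: a number, a string, or an array of operands.
struct Operand {
    enum Kind { Number, String, Array };
    Kind kind = Number;
    double number = 0.0;         // meaningful when kind == Number
    std::string text;            // meaningful when kind == String
    std::vector<Operand> items;  // meaningful when kind == Array
};

Operand num(double v) { Operand o; o.kind = Operand::Number; o.number = v; return o; }
Operand str(const std::string& s) { Operand o; o.kind = Operand::String; o.text = s; return o; }
Operand arr(const std::vector<Operand>& v) { Operand o; o.kind = Operand::Array; o.items = v; return o; }

// Collect the raw (not font-decoded) string of a text-showing operator.
// Returns false and leaves `out` untouched when the operands do not match.
bool rawText(const std::string& name, const std::vector<Operand>& ops, std::string& out) {
    if (name == "Tj" || name == "'") {
        // Check the operand count before touching ops[0].
        if (ops.size() != 1 || ops[0].kind != Operand::String)
            return false;
        out = ops[0].text;
        return true;
    }
    if (name == "\"") {
        // Two spacing numbers followed by the string.
        if (ops.size() != 3 || ops[2].kind != Operand::String)
            return false;
        out = ops[2].text;
        return true;
    }
    if (name == "TJ") {
        if (ops.size() != 1 || ops[0].kind != Operand::Array)
            return false;
        std::string text;
        for (const Operand& item : ops[0].items) {
            // Numbers only adjust glyph positions; they are skipped here.
            if (item.kind == Operand::String)
                text += item.text;
        }
        out = text;
        return true;
    }
    return false;  // not a text-showing operator
}
```

In this sketch the size check always runs before any operand is dereferenced, and the third operand of `"` is required to be a string, which is the operand the removed script returned.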
From: Michal H. <ms...@gm...> - 2008-09-08 22:15:44
All operators contain a reference back to their stream, but they are not able to dig out the resources even though CContentStream contains them. Later we will need the resources for proper font identification, so this is preparation for that.

Index: pdfedit-patches/src/kernel/ccontentstream.h
===================================================================
--- pdfedit-patches.orig/src/kernel/ccontentstream.h 2008-08-28 11:13:40.000000000 -0600
+++ pdfedit-patches/src/kernel/ccontentstream.h 2008-08-28 12:00:37.000000000 -0600
@@ -457,6 +457,13 @@ public:
         gfxres = res;
     }

+    /** Returns resources used by this content stream.
+     * @return Resources instance wrapped by shared pointer.
+     */
+    boost::shared_ptr<GfxResources> getResources()const
+    {
+        return gfxres;
+    }

     /**
      * Save content stream to underlying cstream(s) and notify all observers.
--
Michal Hocko
From: Michal H. <ms...@gm...> - 2008-09-08 22:15:44
TextSimpleOperator can't be self-contained in terms of proper initialization, because we don't have enough information during instance creation (in the constructor). We therefore have to wait for StateUpdater, which knows the current GfxState for each pdf operator, and misuse it for the post-initialization of TextSimpleOperator.
[ I don't like that very much, because it is error prone (we can forget to call initEncodedText for some text operators - this is basically copied code spread around the op*Update functions). But this seems to be the only way to implement it. ]

All data required for later retrieval of the GfxFont instance are now stored in the FontData structure (initialized by setFontData from the state updater). This way getFontData can get the font instance anytime and use it for rawText decoding (as well as for the opposite direction, the encoded -> raw text transformation).

* kernel/pdfoperators.{cc,h}
  - FontData added - defined in the cc file because we don't need to have this interface public
  - TextSimpleOperator
    - fontData added - keeps all necessary data for font identification
    - getCurrentFont added - returns a valid GfxFont instance for the associated font from the resources associated with the content stream
    - setFontData, getFontData added
* kernel/stateupdater.cc
  - opTjUpdate, opSlashUpdate, opTJUpdate call setFontData
* gui/qspdfoperator.cc
  - getEncodedText uses getFontText rather than getRawText

Changelog v1 -> v2
==================
* getFontText, getFontName, setFontData added
* get rid of the cached rawString and encodedString - they may be useful and can be added later, but there is no real point in having them now
* initEncodedString removed (calls replaced by setFontData) and the font encoding transformation code moved to getFontText
* don't initialize the encoded text directly, but provide GfxFont and initialize all the data necessary to get GfxFont later when needed - initEncodedtext removed and replaced by setFontData

Index: pdfedit-patches/src/kernel/stateupdater.cc
===================================================================
--- pdfedit-patches.orig/src/kernel/stateupdater.cc 2008-09-08 14:55:31.000000000 -0600
+++ pdfedit-patches/src/kernel/stateupdater.cc 2008-09-08 15:18:10.000000000 -0600
@@ -434,13 +434,19 @@ namespace {
     }
     // "Tj"
     GfxState *
-    opTjUpdate (GfxState* state, boost::shared_ptr<GfxResources>, const boost::shared_ptr<PdfOperator>, const PdfOperator::Operands& args, BBox* rc)
+    opTjUpdate (GfxState* state, boost::shared_ptr<GfxResources>, const boost::shared_ptr<PdfOperator> op, const PdfOperator::Operands& args, BBox* rc)
     {
         assert (1 <= args.size ());

         // This can happen in really damaged pdfs
-        if (state->getFont())
-            StateUpdater::printTextUpdate (state, getStringFromIProperty (args[0]), rc);
+        if (state->getFont()) {
+            const TextSimpleOperator *txtOp = dynamic_cast<const TextSimpleOperator*>(op.get());
+            assert(txtOp);
+            txtOp->setFontData(state->getFont());
+            std::string rawStr;
+            txtOp->getRawText(rawStr);
+            StateUpdater::printTextUpdate (state, rawStr, rc);
+        }

         // return changed state
         return state;
@@ -481,7 +487,7 @@ namespace {
     }
     // "\"
     GfxState *
-    opSlashUpdate (GfxState* state, boost::shared_ptr<GfxResources>, const boost::shared_ptr<PdfOperator>, const PdfOperator::Operands& args, BBox* rc)
+    opSlashUpdate (GfxState* state, boost::shared_ptr<GfxResources>, const boost::shared_ptr<PdfOperator> op, const PdfOperator::Operands& args, BBox* rc)
     {
         assert (3 <= args.size ());

@@ -498,7 +504,13 @@ namespace {

         double ty = state->getLineY() - state->getLeading();
         state->textMoveTo(tx, ty);
-        StateUpdater::printTextUpdate (state, getStringFromIProperty (args[2]), rc); // to 'rc' save only text bbox
+
+        const TextSimpleOperator *txtOp = dynamic_cast<const TextSimpleOperator*>(op.get());
+        assert(txtOp);
+        txtOp->setFontData(state->getFont());
+        std::string rawStr;
+        txtOp->getRawText(rawStr);
+        StateUpdater::printTextUpdate (state, rawStr, rc);
         // Set edge of rectangle from actual position on output devices
         //state->transform(state->getCurX (), state->getCurY(), & rc->xright, & rc->yright);

@@ -508,7 +520,7 @@ namespace {
     }
     // "TJ"
     GfxState *
-    opTJUpdate (GfxState* state, boost::shared_ptr<GfxResources>, const boost::shared_ptr<PdfOperator>, const PdfOperator::Operands& args, BBox* rc)
+    opTJUpdate (GfxState* state, boost::shared_ptr<GfxResources>, const boost::shared_ptr<PdfOperator> op, const PdfOperator::Operands& args, BBox* rc)
     {
         assert (1 <= args.size ());

@@ -570,6 +582,9 @@ namespace {
             rc->yright = max( rc->yright, max( h_rc.yleft, h_rc.yright ) );
         }// for

+        const TextSimpleOperator *txtOp = dynamic_cast<const TextSimpleOperator*>(op.get());
+        assert(txtOp);
+        txtOp->setFontData(state->getFont());
         // return changed state
         return state;
     }
Index: pdfedit-patches/src/kernel/pdfoperators.h
===================================================================
--- pdfedit-patches.orig/src/kernel/pdfoperators.h 2008-09-08 15:11:58.000000000 -0600
+++ pdfedit-patches/src/kernel/pdfoperators.h 2008-09-08 15:18:10.000000000 -0600
@@ -121,16 +121,32 @@ public:
  * displayed. This is necessary, because text string stored in operator's
  * operands is not the same as the displayed one in general and may be
  * affected by font encoding.
+ * <br>
+ * Use getFontText method to retreive text from text operator filtered
+ * through its font code maps.
  */
 class TextSimpleOperator: public SimpleGenericOperator
 {
+    // forward declaration
+    struct FontData;
+
+    /** Font data for later identification of associated font
+     */
+    mutable FontData* fontData;
+protected:
+    /** Finds current font for operator from fontName.
+     * Uses resources from content stream to retriev font by name.
+     * Returned instance must not be deallocated by caller.
+     * @return Font instance for this operator.
+     */
+    GfxFont* getCurrentFont()const;
 public:
     TextSimpleOperator (const char* opTxt, const size_t numOper, Operands& opers)
-        :SimpleGenericOperator(opTxt, numOper, opers) {}
+        :SimpleGenericOperator(opTxt, numOper, opers), fontData(NULL) {}
     TextSimpleOperator(const std::string& opTxt, Operands& opers)
-        :SimpleGenericOperator(opTxt, opers) {}
+        :SimpleGenericOperator(opTxt, opers), fontData(NULL) {}

-    virtual ~TextSimpleOperator() {}
+    virtual ~TextSimpleOperator();

     /** Returns string represented by this text operator in raw format.
      * Raw format doesn't take care about font used for this operator.
@@ -138,6 +154,27 @@ public:
      */
     virtual void getRawText(std::string& str)const;

+    /** Returns string represented by this text operator converted
+     * according the font encoding.
+     * @param str String to be set.
+     */
+    virtual void getFontText(std::string& str)const;
+
+    /** Sets font specific stuff.
+     * This method should be called from StateUpdater when we do know the
+     * current font for this operator.
+     * <br>
+     * This method doesn't influence operator itself (or its operands).
+     * @param gfxFont Xpdf GfxFont instance.
+     */
+    void setFontData(GfxFont* gfxFont)const;
+
+    /** Returns font name for this operator.
+     * May return null if setFontData hasn't been called yet.
+     * @return Font name or NULL if not initialized yet.
+     */
+    const char* getFontName()const;
+
 }; // class TextSimpleOperator


Index: pdfedit-patches/src/kernel/pdfoperators.cc
===================================================================
--- pdfedit-patches.orig/src/kernel/pdfoperators.cc 2008-09-08 15:11:58.000000000 -0600
+++ pdfedit-patches/src/kernel/pdfoperators.cc 2008-09-08 15:18:10.000000000 -0600
@@ -233,6 +233,104 @@ using namespace utils;
     str = rawStr;
 }

+/** Simple class for font data encapsulation.
+ */
+class TextSimpleOperator::FontData
+{
+    char * fontName;
+    char * fontTag;
+public:
+    FontData(GfxFont* font)
+    {
+        fontName = strdup(font->getName()->getCString());
+        fontTag = strdup(font->getTag()->getCString());
+    }
+    ~FontData()
+    {
+        if(fontName)
+            free(fontName);
+        if(fontTag)
+            free(fontTag);
+    }
+
+    const char * getFontName()const
+    {
+        return fontName;
+    }
+
+    const char * getFontTag()const
+    {
+        return fontTag;
+    }
+};
+
+GfxFont* TextSimpleOperator::getCurrentFont()const
+{
+    assert(fontData);
+    const char* tag = fontData->getFontTag();
+    shared_ptr<GfxResources> res = getContentStream()->getResources();
+    GfxFont* font = res->lookupFont(tag);
+    if(!font)
+        utilsPrintDbg(debug::DBG_ERR, "Unable to get font(name="
+                <<fontData->getFontName()
+                <<", tag="<<fontData->getFontTag()
+                <<") for operator");
+    return font;
+}
+
+void TextSimpleOperator::getFontText(std::string& str)const
+{
+    std::string rawStr;
+    getRawText(rawStr);
+
+    int len = rawStr.size();
+    GString raw(rawStr.c_str(), len);
+    GfxFont* font = getCurrentFont();
+    if(!font)
+        return;
+    utilsPrintDbg(debug::DBG_INFO, "Textoperator uses font="<<fontData->getFontName());
+    CharCode code;
+    Unicode u;
+    int uSize, uLen;
+    double dx, dy, originX, originY;
+    char * p=raw.getCString();
+    while(len>0)
+    {
+        int n = font->getNextChar(p, len, &code, &u, (int)(sizeof(u) / sizeof(Unicode)), &uLen,
+                &dx, &dy, &originX, &originY);
+        for (int i=0; i<uLen; ++i)
+            str += (&u)[i];
+        p += n;
+        len -= n;
+    }
+}
+
+TextSimpleOperator::~TextSimpleOperator()
+{
+    if(fontData)
+        delete fontData;
+}
+
+const char* TextSimpleOperator::getFontName()const
+{
+    assert(fontData);
+    return fontData->getFontName();
+}
+
+void TextSimpleOperator::setFontData(GfxFont* gfxFont)const
+{
+    assert(gfxFont);
+    if (!gfxFont)
+    {
+        utilsPrintDbg(debug::DBG_ERR, "Null font encountered");
+        return;
+    }
+    if(fontData)
+        delete fontData;
+    fontData = new FontData(gfxFont);
+}
+
+
 //==========================================================
 // Concrete implementations of CompositePdfOperator
Index: pdfedit-patches/src/gui/qspdfoperator.cc
===================================================================
--- pdfedit-patches.orig/src/gui/qspdfoperator.cc 2008-09-08 15:11:58.000000000 -0600
+++ pdfedit-patches/src/gui/qspdfoperator.cc 2008-09-08 15:18:10.000000000 -0600
@@ -117,8 +117,7 @@ QString QSPdfOperator::getEncodedText()
     TextSimpleOperator * textOp = dynamic_cast<TextSimpleOperator*>(obj.get());
     if (!textOp)
         return QString::null;
-    // TODO change to return Font encoded text
-    textOp->getRawText(text);
+    textOp->getFontText(text);
     return util::convertToUnicode(text, util::PDF);
 }

--
Michal Hocko
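The two-phase initialization described in this patch (the state updater hands over font identification after construction, and the font is looked up in the resources only when needed) can be sketched in isolation as follows. This is a simplified illustration, not the patch's implementation: `Font`, `Resources`, `TextOp` and `currentFont` are hypothetical stand-ins for GfxFont, GfxResources, TextSimpleOperator and getCurrentFont, and the sketch stores only the tag in a `std::string` instead of a heap-allocated FontData.

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for a font object owned by the resources.
struct Font {
    std::string name;
};

// Hypothetical stand-in for the resource dictionary: fonts keyed by tag.
struct Resources {
    std::map<std::string, Font> fontsByTag;

    const Font* lookupFont(const std::string& tag) const {
        std::map<std::string, Font>::const_iterator it = fontsByTag.find(tag);
        return it != fontsByTag.end() ? &it->second : nullptr;
    }
};

// Text operator which learns its font only after construction.
class TextOp {
    std::shared_ptr<const Resources> resources_;
    std::string fontTag_;
    bool hasFontData_ = false;

public:
    explicit TextOp(std::shared_ptr<const Resources> res)
        : resources_(std::move(res)) {}

    // Called by the state updater once the current font is known.
    void setFontData(const std::string& tag) {
        fontTag_ = tag;
        hasFontData_ = true;
    }

    // Resolves the font lazily; returns nullptr if setFontData was never
    // called or the tag is not present in the resources.
    const Font* currentFont() const {
        if (!hasFontData_ || !resources_)
            return nullptr;
        return resources_->lookupFont(fontTag_);
    }
};
```

Storing only the tag keeps the operator independent of the lifetime of the font object. Returning a null pointer before `setFontData` has been called (instead of asserting) is a design choice of this sketch, made so that a forgotten post-initialization can be detected by the caller.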
From: Michal H. <ms...@gm...> - 2008-09-08 22:15:44
Currently we have only the SimpleGenericOperator and UnknownCompositePdfOperator operator classes and we are using simple instantiation. This is, however, far from the best solution, because we run into problems when we introduce new specialized operators (like the one in the following patch). Therefore the createOperator factory function was introduced. I wasn't sure whether this function can be placed as a static member of the PdfOperator class, so it is standalone at the moment. I can move it there if you want, Jozo.

I have run kernel_tests TEST_PDFOPERS TEST_CCONTENTSTREAM for zadani.pdf and the pdf specification without any problems, so this change should be safe.

* gui/base.cc
  - Base::createOperator uses the factory function for operators rather than a hardcoded SimpleGenericOperator
* kernel/ccontentstream.cc
  - createOperator renamed to createOperatorFromStream (to distinguish it from the factory method) and updated to use the pdfoperator factory function rather than hard-coded instance creation (with the exception of inline images, which need special treatment). Most of the original code moved to pdfoperators.cc
  - createOperands renamed to createOperandsFromStream (to be in sync with createOperatorFromStream)
  - checkAndFix renamed to checkAndFixOperator and moved to the pdfoperator module
* kernel/pdfoperators.{cc,h}
  - createOperator added
  - checkAndFixOperator added

Index: pdfedit-patches/src/kernel/ccontentstream.cc
===================================================================
--- pdfedit-patches.orig/src/kernel/ccontentstream.cc 2008-09-08 14:48:05.000000000 -0600
+++ pdfedit-patches/src/kernel/ccontentstream.cc 2008-09-08 14:56:59.000000000 -0600
@@ -91,68 +91,6 @@ namespace {
         }
     }

-    /**
-     * Check if the operands match the specification and replace operand with
-     * its stronger equivalent.
-     *
-     * (e.g. When xpdf returns an object with integer type, but the operand can be a real, we have to
-     * convert it to real.)
-     *
-     * @param ops Operator specification
-     * @param operands Operand stack.
-     *
-     * @return True if type and count match, false otherwise.
-     */
-    bool checkAndFix (const StateUpdater::CheckTypes& ops, PdfOperator::Operands& operands)
-    {
-        size_t argNum = static_cast<size_t> ((ops.argNum > 0) ? ops.argNum : -ops.argNum);
-
-        //
-        // Check operator size if > 0 than it is the exact size, maximum
-        // otherwise
-        //
-        if (((ops.argNum >= 0) && (operands.size() != argNum))
-            || ((ops.argNum < 0) && (operands.size() > argNum)) )
-        {
-            utilsPrintDbg (DBG_ERR, "Number of operands mismatch.. expected " << ops.argNum << " got: " << operands.size());
-            return false;
-        }
-
-        //
-        // Check arguments
-        //
-        PdfOperator::Operands::reverse_iterator rit = operands.rbegin ();
-        // Be careful -- buffer overflow
-        argNum = std::min (argNum, operands.size());
-        advance (rit, argNum);
-        PdfOperator::Operands::iterator it = rit.base ();
-        // Loop from the first operator to the end
-        for (int pos = 0; it != operands.end (); ++it, ++pos)
-        {
-            if (!isBitSet(ops.types[pos], (*it)->getType()))
-            {
-                utilsPrintDbg (DBG_ERR, "Bad " << pos << "-th operand type [" << (*it)->getType() << "] " << hex << " 0x" << ops.types[pos]);
-                return false;
-            }
-
-            //
-            // If xpdf returned an Int, but the operand can be a real convert it
-            //
-            if (isInt(*it))
-            {
-                if (isBitSet(ops.types[pos], pReal))
-                { // Convert it to real
-                    double dval = 0.0;
-                    dval = IProperty::getSmartCObjectPtr<CInt>(*it)->getValue();
-                    shared_ptr<IProperty> pIp (new CReal (dval));
-                    std::replace (operands.begin(), operands.end(), *it, pIp);
-                }
-            }
-        }
-
-        return true;
-    }
-

     /**
      * Parse inline image.
@@ -248,7 +186,7 @@ namespace {
      * @return True if everything ok, false if end of stream reached.
      */
     bool
-    createOperands (CStreamsXpdfReader<CContentStream::CStreams>& streamreader,
+    createOperandsFromStream (CStreamsXpdfReader<CContentStream::CStreams>& streamreader,
             PdfOperator::Operands& operands,
             xpdf::XpdfObject& o)
     {
@@ -287,54 +225,35 @@ namespace {
      * @param operands Operands of operator. They are shared through subcalls.
      */
     shared_ptr<PdfOperator>
-    createOperator (CStreamsXpdfReader<CContentStream::CStreams>& streamreader,
+    createOperatorFromStream (CStreamsXpdfReader<CContentStream::CStreams>& streamreader,
             PdfOperator::Operands& operands)
     {
         // Get operands
         xpdf::XpdfObject o;
-        if (!createOperands (streamreader, operands, o))
+        if (!createOperandsFromStream (streamreader, operands, o))
             return shared_ptr<PdfOperator> ();

-        // Try to find the op by its name
-        const StateUpdater::CheckTypes* chcktp = StateUpdater::findOp (o->getCmd());
-        // Operator not found, create unknown operator
-        if (NULL == chcktp)
-            return shared_ptr<PdfOperator> (new SimpleGenericOperator (string (o->getCmd()),operands));
-
-        assert (chcktp);
-        utilsPrintDbg (DBG_DBG, "Operator found. " << chcktp->name);
-
-        //
-        // Check the type against specification
-        //
-        if (!checkAndFix (*chcktp, operands))
-        {
-            //assert (!"Content stream bad operator type.");
-            throw ElementBadTypeException ("Content stream operator has incorrect operand type.");
-        }
-
         //
         // SPECIAL CASE for inline image (stream within a text stream)
         //
-        if ( 0 == strncmp (chcktp->name, "BI", 2))
+        if ( 0 == strncmp (o->getCmd(), "BI", 2))
         {
             utilsPrintDbg (debug::DBG_DBG, "");
+            const StateUpdater::CheckTypes* chcktp = StateUpdater::findOp (o->getCmd());
+            assert(chcktp);
+            if (!checkAndFixOperator (*chcktp, operands))
+            {
+                //assert (!"Content stream bad operator type.");
+                throw ElementBadTypeException ("Content stream operator has incorrect operand type.");
+            }
             shared_ptr<CInlineImage> inimg (getInlineImage (streamreader));
             return shared_ptr<PdfOperator> (new InlineImageCompositePdfOperator (chcktp->name, chcktp->endTag, inimg));
         }
-
-        // Get operands count
-        size_t argNum = static_cast<size_t> ((chcktp->argNum > 0) ? chcktp->argNum : -chcktp->argNum);

-        //
-        // If endTag is "" it is a simple operator, composite otherwise
-        //
-        if (isSimpleOp(*chcktp))
-            return shared_ptr<PdfOperator> (new SimpleGenericOperator (chcktp->name, argNum, operands));
-
-        else // Composite operator
-            return shared_ptr<PdfOperator> (new UnknownCompositePdfOperator (chcktp->name, chcktp->endTag));
+        // factory function for all other operators
+        std::string name = o->getCmd();
+        return createOperator(name, operands);
     }

     /**
@@ -355,7 +274,7 @@ namespace {
     parseOp (CStreamsXpdfReader<CContentStream::CStreams>& streamreader, PdfOperator::Operands& operands)
     {
         // Create operator with its operands
-        shared_ptr<PdfOperator> result = createOperator (streamreader, operands);
+        shared_ptr<PdfOperator> result = createOperatorFromStream (streamreader, operands);

         if (result && isCompositeOp (result) && !isInlineImageOp (result))
         {
Index: pdfedit-patches/src/kernel/pdfoperators.cc
===================================================================
--- pdfedit-patches.orig/src/kernel/pdfoperators.cc 2008-09-08 14:48:05.000000000 -0600
+++ pdfedit-patches/src/kernel/pdfoperators.cc 2008-09-08 14:55:30.000000000 -0600
@@ -138,7 +138,7 @@ SimpleGenericOperator::clone ()
     assert (ops.size () == _operands.size());

     // Create clone
-    return shared_ptr<PdfOperator> (new SimpleGenericOperator (_opText,ops));
+    return createOperator (_opText,ops);
 }


@@ -282,6 +282,44 @@ InlineImageCompositePdfOperator::clone (
 // Helper funcions
 //==========================================================

+boost::shared_ptr<PdfOperator> createOperator(const std::string& name, PdfOperator::Operands& operands)
+{
+    if (name == "BI")
+        throw NotImplementedException("Inline images not implemented here");
+
+    // Try to find the op by its name
+    const StateUpdater::CheckTypes* chcktp = StateUpdater::findOp (name.c_str());
+    // Operator not found, create unknown operator
+    if (NULL == chcktp)
+        return shared_ptr<PdfOperator> (new SimpleGenericOperator (name ,operands));
+
+    assert (chcktp);
+    utilsPrintDbg (DBG_DBG, "Operator found. " << chcktp->name);
+    // Check the type against specification
+    //
+    if (!checkAndFixOperator (*chcktp, operands))
+    {
+        //assert (!"Content stream bad operator type.");
+        throw ElementBadTypeException ("Content stream operator has incorrect operand type.");
+    }
+
+    // Get operands count
+    size_t argNum = static_cast<size_t> ((chcktp->argNum > 0) ? chcktp->argNum : -chcktp->argNum);
+
+    if (isSimpleOp(*chcktp))
+        return shared_ptr<PdfOperator> (new SimpleGenericOperator (chcktp->name, argNum, operands));
+
+    // Composite operator
+    return shared_ptr<PdfOperator> (new UnknownCompositePdfOperator (chcktp->name, chcktp->endTag));
+
+}
+
+boost::shared_ptr<PdfOperator> createOperator(const char *name, PdfOperator::Operands& operands)
+{
+    std::string n = name;
+    return createOperator(n, operands);
+}
+
 //
 //\todo improve performance
 //
Index: pdfedit-patches/src/kernel/pdfoperators.h
===================================================================
--- pdfedit-patches.orig/src/kernel/pdfoperators.h 2008-09-08 14:48:05.000000000 -0600
+++ pdfedit-patches/src/kernel/pdfoperators.h 2008-09-08 14:55:30.000000000 -0600
@@ -28,6 +28,7 @@

 // static includes
 #include "kernel/pdfoperatorsbase.h"
+#include "kernel/stateupdater.h"

 //==========================================================
 namespace pdfobjects {
@@ -224,6 +225,29 @@ protected:
 // Helper funcions - general
 //==========================================================

+/** Factory function for operators creation.
+ * Creates instance depending on type of the operator.
+ * <br>
+ * Note that this function doesn't cover inline images (BI operator).
+ * @param name Opertor name.
+ * @param operands Operands for operator.
+ * @return Valid pdfoperator object.
+ * @throw ElementBadTypeException if operator or its operands are not valid.
+ * @throw NotImplementedException if given operator is inline image (BI).
+ */
+boost::shared_ptr<PdfOperator> createOperator(const std::string& name, PdfOperator::Operands& operands);
+
+/** Factory function for operators creation.
+ * Transforms const char parameter to the string and delegates to
+ * createOperator(std::string&, PdfOperator::Operands&)
+ * @param name Opertor name.
+ * @param operands Operands for operator.
+ * @return Valid pdfoperator object.
+ * @throw ElementBadTypeException if operator or its operands are not valid.
+ * @throw NotImplementedException if given operator is inline image (BI).
+ */
+boost::shared_ptr<PdfOperator> createOperator(const char *name, PdfOperator::Operands& operands);
+
 /** Is an operator a composite. */
 inline bool
 isCompositeOp (const PdfOperator* oper)
Index: pdfedit-patches/src/kernel/stateupdater.h
===================================================================
--- pdfedit-patches.orig/src/kernel/stateupdater.h 2008-09-08 14:48:05.000000000 -0600
+++ pdfedit-patches/src/kernel/stateupdater.h 2008-09-08 14:55:30.000000000 -0600
@@ -225,6 +225,19 @@ inline bool
 isSimpleOp (const StateUpdater::CheckTypes& chck)
     { return ('\0' == chck.endTag[0]); }

+/**
+ * Check if the operands match the specification and replace operand with
+ * its stronger equivalent.
+ *
+ * (e.g. When xpdf returns an object with integer type, but the operand can be a real, we have to
+ * convert it to real.)
+ *
+ * @param ops Operator specification
+ * @param operands Operand stack.
+ *
+ * @return True if type and count match, false otherwise.
+ */
+bool checkAndFixOperator (const pdfobjects::StateUpdater::CheckTypes& ops, PdfOperator::Operands& operands);

 //==========================================================

Index: pdfedit-patches/src/gui/base.cc
===================================================================
--- pdfedit-patches.orig/src/gui/base.cc 2008-09-08 14:48:05.000000000 -0600
+++ pdfedit-patches/src/gui/base.cc 2008-09-08 14:55:30.000000000 -0600
@@ -293,8 +293,7 @@ QSPdfOperator* Base::createOperator(cons
     std::string opTxt=util::convertFromUnicode(text,util::PDF);
     PdfOperator::Operands param;
     parameters->copyTo(param);
-    boost::shared_ptr<SimpleGenericOperator> op(new SimpleGenericOperator(opTxt,param));
-    return new QSPdfOperator(op,this);
+    return new QSPdfOperator(pdfobjects::createOperator(opTxt, param),this);
 }

 /**
Index: pdfedit-patches/src/kernel/contentschangetag.cc
===================================================================
--- pdfedit-patches.orig/src/kernel/contentschangetag.cc 2008-09-08 14:48:05.000000000 -0600
+++ pdfedit-patches/src/kernel/contentschangetag.cc 2008-09-08 14:55:30.000000000 -0600
@@ -63,7 +63,7 @@ ContentsChangeTag::create ()
     opers.push_back (dict);

     // Operator
-    return shared_ptr<SimpleGenericOperator> (new SimpleGenericOperator (CHANGE_TAG_NAME, opers));
+    return createOperator (CHANGE_TAG_NAME, opers);
 }


Index: pdfedit-patches/src/kernel/cpagedisplay.cc
===================================================================
--- pdfedit-patches.orig/src/kernel/cpagedisplay.cc 2008-09-08 14:48:05.000000000 -0600
+++ pdfedit-patches/src/kernel/cpagedisplay.cc 2008-09-08 14:55:30.000000000 -0600
@@ -263,7 +263,7 @@ CPageDisplay::setTransformMatrix (double
     operands.push_back (shared_ptr<IProperty> (new CReal (tm[3])));
     operands.push_back (shared_ptr<IProperty> (new CReal (tm[4])));
     operands.push_back (shared_ptr<IProperty> (new CReal (tm[5])));
-    shared_ptr<PdfOperator> cmop (new SimpleGenericOperator ("cm", 6, operands));
+    shared_ptr<PdfOperator>
cmop = createOperator("cm", operands); // Insert at the beginning _page->contents()->getContentStream((size_t)0)->frontInsertOperator (cmop); Index: pdfedit-patches/src/kernel/textoutputengines.cc =================================================================== --- pdfedit-patches.orig/src/kernel/textoutputengines.cc 2008-09-08 14:48:05.000000000 -0600 +++ pdfedit-patches/src/kernel/textoutputengines.cc 2008-09-08 14:55:30.000000000 -0600 @@ -254,7 +254,7 @@ namespace { string txt = getStringFromIProperty (ip); PdfOperator::Operands opers; opers.push_back (shared_ptr<CString> (new CString (txt))); - shared_ptr<SimpleGenericOperator> newop (new SimpleGenericOperator ("Tj", opers)); + shared_ptr<PdfOperator> newop = createOperator ("Tj", opers); // Set bbox BBox bbox; StateUpdater::printTextUpdate (s.get(), txt, &bbox); Index: pdfedit-patches/src/tests/kernel/testccontentstream.cc =================================================================== --- pdfedit-patches.orig/src/tests/kernel/testccontentstream.cc 2008-09-08 14:48:05.000000000 -0600 +++ pdfedit-patches/src/tests/kernel/testccontentstream.cc 2008-09-08 14:55:30.000000000 -0600 @@ -449,7 +449,7 @@ addcc (UNUSED_PARAM ostream& oss, const PdfOperator::Operands operands; Opers ops; - ops.push_back (boost::shared_ptr<PdfOperator> (new SimpleGenericOperator("lala",0,operands))); + ops.push_back (createOperator("lala",operands)); page->addContentStreamToFront (ops); vector<boost::shared_ptr<CContentStream> > cccs; Index: pdfedit-patches/src/tests/kernel/testpdfoperators.cc =================================================================== --- pdfedit-patches.orig/src/tests/kernel/testpdfoperators.cc 2008-09-08 14:48:05.000000000 -0600 +++ pdfedit-patches/src/tests/kernel/testpdfoperators.cc 2008-09-08 14:55:30.000000000 -0600 @@ -287,7 +287,7 @@ delAllOper (UNUSED_PARAM ostream& oss, c shared_ptr<PdfOperator> oper; string strr; - oper = shared_ptr<PdfOperator> (new SimpleGenericOperator ("BT", 0, operands)); 
+ oper = createOperator ("BT", operands); cs->insertOperator (PdfOperator::Iterator (), oper, false); cs->getPdfOperators (opers); @@ -303,7 +303,7 @@ delAllOper (UNUSED_PARAM ostream& oss, c operands.clear (); operands.push_back (shared_ptr<IProperty> (new CInt (200))); operands.push_back (shared_ptr<IProperty> (new CInt (400))); - oper = shared_ptr<PdfOperator> (new SimpleGenericOperator ("Td", 2, operands)); + oper = createOperator ("Td", operands); cs->insertOperator (PdfOperator::getIterator(opers.back()), oper, false); cs->getPdfOperators (opers); cs->getStringRepresentation (strr); @@ -320,7 +320,7 @@ delAllOper (UNUSED_PARAM ostream& oss, c operands.push_back (shared_ptr<IProperty> (new CReal (1.0))); operands.push_back (shared_ptr<IProperty> (new CReal (0.0))); operands.push_back (shared_ptr<IProperty> (new CReal (0.0))); - oper = shared_ptr<PdfOperator> (new SimpleGenericOperator ("rg", 3, operands)); + oper = createOperator ("rg", operands); cs->insertOperator (PdfOperator::getIterator(opers.back()), oper, false); cs->getPdfOperators (opers); cs->getStringRepresentation (strr); @@ -335,13 +335,13 @@ delAllOper (UNUSED_PARAM ostream& oss, c */ operands.clear (); operands.push_back (shared_ptr<IProperty> (new CString ("halooooooooooo"))); - oper = shared_ptr<PdfOperator> (new SimpleGenericOperator ("Tj", 1, operands)); + oper = createOperator ("Tj", operands); cs->insertOperator (PdfOperator::getIterator(opers.back()), oper, false); cs->getPdfOperators (opers); cs->getStringRepresentation (strr); _working (oss); - oper = shared_ptr<PdfOperator> (new SimpleGenericOperator ("ET", 0, operands)); + oper = createOperator ("ET", operands); cs->insertOperator (PdfOperator::getIterator(opers.back()), oper, false); cs->getPdfOperators (opers); cs->getStringRepresentation (strr); @@ -383,7 +383,7 @@ insertOper (UNUSED_PARAM ostream& oss, c // PdfOperator::Operands operands; operands.push_back (shared_ptr<IProperty> (new CString ("halooooooooooo"))); - 
shared_ptr<PdfOperator> oper (new SimpleGenericOperator ("Tj", 1, operands)); + shared_ptr<PdfOperator> oper = createOperator ("Tj", operands); string str; //cs->getStringRepresentation (str); Index: pdfedit-patches/src/kernel/stateupdater.cc =================================================================== --- pdfedit-patches.orig/src/kernel/stateupdater.cc 2008-09-08 14:48:05.000000000 -0600 +++ pdfedit-patches/src/kernel/stateupdater.cc 2008-09-08 14:55:31.000000000 -0600 @@ -1169,6 +1169,55 @@ StateUpdater::getEndTag (const string& n return string (chcktp->endTag); } +bool checkAndFixOperator (const StateUpdater::CheckTypes& ops, PdfOperator::Operands& operands) +{ + size_t argNum = static_cast<size_t> ((ops.argNum > 0) ? ops.argNum : -ops.argNum); + + // + // Check operator size if > 0 than it is the exact size, maximum + // otherwise + // + if (((ops.argNum >= 0) && (operands.size() != argNum)) + || ((ops.argNum < 0) && (operands.size() > argNum)) ) + { + utilsPrintDbg (DBG_ERR, "Number of operands mismatch.. 
expected " << ops.argNum << " got: " << operands.size()); + return false; + } + + // + // Check arguments + // + PdfOperator::Operands::reverse_iterator rit = operands.rbegin (); + // Be careful -- buffer overflow + argNum = std::min (argNum, operands.size()); + advance (rit, argNum); + PdfOperator::Operands::iterator it = rit.base (); + // Loop from the first operator to the end + for (int pos = 0; it != operands.end (); ++it, ++pos) + { + if (!isBitSet(ops.types[pos], (*it)->getType())) + { + utilsPrintDbg (DBG_ERR, "Bad " << pos << "-th operand type [" << (*it)->getType() << "] " << hex << " 0x" << ops.types[pos]); + return false; + } + + // + // If xpdf returned an Int, but the operand can be a real convert it + // + if (isInt(*it)) + { + if (isBitSet(ops.types[pos], pReal)) + { // Convert it to real + double dval = 0.0; + dval = IProperty::getSmartCObjectPtr<CInt>(*it)->getValue(); + shared_ptr<IProperty> pIp (new CReal (dval)); + std::replace (operands.begin(), operands.end(), *it, pIp); + } + } + } + + return true; +} //========================================================== } // namespace pdfobjects //========================================================== -- Michal Hocko |
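For readers following the patch, the operand-count rule and the Int to Real promotion that checkAndFixOperator applies can be shown in isolation. The following is only a minimal standalone sketch, not the pdfedit code: `Operand`, `OperatorSpec`, `OperandType` and `checkAndFix` are hypothetical stand-ins for IProperty, StateUpdater::CheckTypes and the real function, and it assumes the spec carries at least as many type masks as the operand limit.

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical stand-ins for pdfedit's property types; not the real classes.
enum OperandType { kInt = 1 << 0, kReal = 1 << 1, kString = 1 << 2 };

struct Operand {
    OperandType type;
    double number;   // numeric payload, unused for strings
};

// Hypothetical, simplified counterpart of StateUpdater::CheckTypes.
struct OperatorSpec {
    int argNum;              // >= 0: exact count, < 0: at most -argNum operands
    std::vector<int> types;  // bitmask of accepted OperandType per position
};

// Returns true if the operands fit the spec; promotes Int to Real where
// the spec accepts a Real at that position.
bool checkAndFix(const OperatorSpec& spec, std::vector<Operand>& operands)
{
    const std::size_t limit = static_cast<std::size_t>(std::abs(spec.argNum));
    const bool countOk = (spec.argNum >= 0) ? (operands.size() == limit)
                                            : (operands.size() <= limit);
    if (!countOk)
        return false;

    for (std::size_t pos = 0; pos < operands.size(); ++pos) {
        // The operand's type must be one of the types allowed at this position.
        if ((spec.types[pos] & operands[pos].type) == 0)
            return false;
        // Promote in place, so no search through the container is needed.
        if (operands[pos].type == kInt && (spec.types[pos] & kReal) != 0)
            operands[pos].type = kReal;
    }
    return true;
}
```

Promoting the element at its own index is a design choice of this sketch; the patch instead swaps the shared pointer via std::replace over the whole operand container.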
From: Michal H. <ms...@gm...> - 2008-09-08 22:58:34
Attachments:
encoding_support_v2.tar.gz
|
Here is the tarball version of the patches.
--
Michal Hocko
|
From: Jozef M. <mis...@ho...> - 2008-09-09 07:24:48
|
ACK

----------------------------------------
> Date: Mon, 8 Sep 2008 16:58:31 -0600
> From: ms...@gm...
> To: pdf...@li...
> Subject: Re: [patch 0/4] [RFC] encoding support v2
>
> Here we go with the tar-ball patches version
> --
> Michal Hocko
|
From: Michal H. <ms...@gm...> - 2008-10-29 15:16:22
|
On Tue, Sep 09, 2008 at 07:24:52AM +0000, Jozef Misutka wrote:
>
> ACK

Originally I planned to continue working on this patchset and commit it once txt->op was also ready, but it seems better to commit op->txt now, as I don't know when I will have time to continue on this. So the current version of the patchset is in CVS, and the next version (v3) will implement the txt->op part.

Please note that this can mangle text with special characters, even though they are displayed (more or less) correctly in the text edit tool, because back-translation is not done!

> > ----------------------------------------
> > Date: Mon, 8 Sep 2008 16:58:31 -0600
> > From: ms...@gm...
> > To: pdf...@li...
> > Subject: Re: [patch 0/4] [RFC] encoding support v2
> >
> > Here we go with the tar-ball patches version
> > --
> > Michal Hocko

--
Michal Hocko
|
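To illustrate why editing can mangle text while only op->txt exists, here is a rough sketch of both directions. It is an assumption-laden illustration, not the pdfedit or GfxFont API: `CodeToUnicode`, `decodeOperand` and `encodeText` are hypothetical names, and a single-byte code per character is assumed. Decoding goes through the font's table; writing text back needs the inverse table, and storing the edited characters as raw operand bytes would skip that step.

```cpp
#include <map>
#include <string>

// Hypothetical per-font table: character code in the content stream -> Unicode.
typedef std::map<unsigned char, char32_t> CodeToUnicode;

// op -> txt: translate the operand bytes through the font's table.
// Codes the font does not map become U+FFFD (replacement character).
std::u32string decodeOperand(const std::string& codes, const CodeToUnicode& font)
{
    std::u32string text;
    for (unsigned char code : codes) {
        CodeToUnicode::const_iterator hit = font.find(code);
        text.push_back(hit != font.end() ? hit->second : U'\uFFFD');
    }
    return text;
}

// txt -> op: requires the inverse table. Returns false when the font has
// no code for some character, in which case the text cannot be stored as is.
bool encodeText(const std::u32string& text, const CodeToUnicode& font, std::string& codes)
{
    std::map<char32_t, unsigned char> inverse;
    for (const auto& entry : font)
        inverse[entry.second] = entry.first;

    codes.clear();
    for (char32_t ch : text) {
        std::map<char32_t, unsigned char>::const_iterator hit = inverse.find(ch);
        if (hit == inverse.end())
            return false;
        codes.push_back(static_cast<char>(hit->second));
    }
    return true;
}
```

With a font that numbers its codes from 1, as described for the OO-generated documents, the operand bytes bear no resemblance to the displayed characters, so a round trip is only lossless when something like `encodeText` runs before the operand is written.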