Search Index (German Umlaute)

  • Search Index (German Umlaute)

    It seems the Search Index does not handle German umlauts such as ü, ä, ö, or German-specific characters such as "ß". Am I right? Or is there a way to enable this?

    A search for "hausübung" returns no results. "haus?bung" also returns no results, while "haus*bung" returns N results, all containing the word.

    best regards
    Last edited by Forensik; 08-13-2012, 03:30 PM.

  • #2
    It should work.
    What type of document was this from?
    E.g. a text file, a PDF, an e-mail, a Word file, etc.


    • #3
      I just ran some tests with correctly encoded HTML files, and searching for words with umlauts worked in every test case here.

      So yes, we would need more details, such as the type of file you are searching, to verify this further. It could be a text file with a wrong or unexpected encoding, or a PDF file with an unusual text layer that does not match the OCR image, etc. If you can send us a copy of the file, even better.
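
As an illustration of the "wrong/unexpected encoding" case, the following Python sketch shows what happens when UTF-8 bytes are read as Latin-1. It is only one example of how an encoding mismatch can hide a word from an exact search, not a diagnosis of the files in question, and Python's `fnmatch` is used here as an assumed stand-in for the search wildcards.

```python
import fnmatch

word = "hausübung"
raw = word.encode("utf-8")        # 'ü' becomes the two bytes C3 BC
misread = raw.decode("latin-1")   # each byte is read as one character

print(misread)                    # hausÃ¼bung
print(misread == word)            # False: an exact search would miss it

# '?' matches exactly one character, '*' matches any run of characters.
print(fnmatch.fnmatchcase(misread, "haus?bung"))  # False
print(fnmatch.fnmatchcase(misread, "haus*bung"))  # True
```

The misread word contains two characters where the original had one, so a single-character wildcard no longer fits while a multi-character wildcard still does.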

      Another thing to confirm is whether you have changed the "Advanced" settings (under "Step 2" in the Create Index process) and checked the "Enable accent/diacritic insensitivity" option.

      This would cause a word like "heiße" to be indexed as "heisse", so that it is searchable as both "heiße" and "heisse". It also affects wildcard searches: "hei*e" would return the word but "hei?e" would not, because the ß is internally treated as two letters (ss).
      Ray
      PassMark Software
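
The wildcard behaviour described in this post can be reproduced with a small Python sketch. The `fold` function is a hypothetical stand-in for the accent/diacritic-insensitive indexing step (the actual OSForensics implementation is not shown in this thread), and Python's `fnmatch` wildcards are assumed to behave like the search wildcards: `*` matches any run of characters, `?` exactly one.

```python
import fnmatch
import unicodedata

def fold(word: str) -> str:
    """Hypothetical folding step: ß -> ss, then drop combining accents."""
    word = word.replace("ß", "ss")
    decomposed = unicodedata.normalize("NFKD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def matches(pattern: str, indexed_term: str) -> bool:
    # Case-sensitive wildcard match against the term as stored in the index.
    return fnmatch.fnmatchcase(indexed_term, pattern)

indexed = fold("heiße")               # stored in the index as "heisse"

print(indexed == fold("heisse"))      # True: both spellings find the word
print(matches("hei*e", indexed))      # True
print(matches("hei?e", indexed))      # False: "ss" is two characters
print(matches("hei??e", indexed))     # True
```

Folding the query with the same function as the indexed text is what makes "heiße" and "heisse" interchangeable; the cost is that single-character wildcards count the folded letters, not the original ones.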


      • #4
        Hi there,

        I indexed and searched Office documents. There were several documents in the case containing the umlaut word, but none was found by the "hausübung" search, although it worked with "haus*bung". I am quite sure that I did not change the "Enable accent/diacritic insensitivity" option, but I might have checked the stemming option for German.

        best regards


        • #5
          What exact type of document was this from?
          Word, PDF, Excel, PowerPoint, OpenOffice, etc.
          Also, if it was a Microsoft Office format, was it the new Office format (e.g. .DOCX) or the old format (e.g. .DOC)?


          • #6
            All sorts of documents, mainly PDF and DOC (old format), but also XLS and DOCX.



            • #7
              We did some testing with various documents and various indexing options.

              It seems there is a bug in the German stemmer. A stemmer is an algorithm that lets you search for the word "run" and also find results for "running" and "runs".

              In some cases the German ß character is not handled correctly when a word is stemmed.

              A quick workaround is to turn off stemming. See the screenshot below for how to do this.


              We are working on a full fix so that the stemmer can be used again. The fix should be in the next V1.2 patch release.
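
To illustrate what a stemmer does in a search index, here is a minimal Python sketch. It is a toy English suffix-stripping stemmer written for this example, not the German stemmer used by OSForensics; `toy_stem` and `build_index` are hypothetical names.

```python
from collections import defaultdict

def toy_stem(word):
    """Toy English stemmer (illustration only): strip 'ing' or 's'."""
    word = word.lower()
    for suffix in ("ing", "s"):
        # Only strip when at least three letters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # Collapse a doubled final consonant: "runn" -> "run".
    if len(word) >= 3 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]
    return word

def build_index(docs):
    """Map each stem to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in text.split():
            index[toy_stem(word)].add(name)
    return index

docs = {"a.txt": "she runs daily", "b.txt": "running late", "c.txt": "walk home"}
index = build_index(docs)

# The query is stemmed with the same function before the lookup.
hits = sorted(index[toy_stem("run")])
print(hits)  # ['a.txt', 'b.txt']
```

Because lookups compare stems rather than the original words, a word is only found if the stem stored at index time matches the stem produced for the query. The thread does not say exactly how the ß handling went wrong, so this sketch shows only the general mechanism.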


              • #8
                This should be fixed in the V1.2 Alpha released today.
                http://www.passmark.com/forum/showth...a-Beta-release

                If you can check this out and confirm this is the case, that would be good.


                • #9
                  Hi Mark,

                  I ran several searches containing the umlauts ä, ö, ü and the character ß, and documents containing these characters were found in e-mails, attachments, and DOC, DOCX, PDF and PPT files.

                  Thank you very much for the fast fix.



                  • #10
                    Is there any way to export or print the search index result list?

                    On the screen it is presented as a nice table, but the only options offered are to delete single items or the whole list.

                    best regards




                    • #11
                      I am guessing you are referring to the Search History tab. There is no way to export the contents of the tab from the user interface.

                      But there is still a way to get all the data. There is a series of XML files that contain the search history for each case and each set of index files.

                      The default location of the files is shown below, but it will vary depending on the folder you selected for your case and the name you gave your index.

                      C:\Users\<User Name>\Documents\PassMark\OSForensics\Cases\<Case Name>\Index\<Index name>\History\<history temp file name>.xml

                      There is one XML file per search performed. Each XML file contains the search results as well as the search terms used, so it may not be exactly what you are after.

                      We'll also look at adding a better export to CSV function to the UI.
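
Until such an export exists, the history XML files could be flattened to CSV with a short script. The following Python sketch assumes nothing about the XML schema, which is not described in this thread: it simply writes every non-empty leaf element of every `.xml` file in a History folder as one CSV row. The function names are hypothetical, and element attributes are ignored.

```python
import csv
import io
import xml.etree.ElementTree as ET
from pathlib import Path

def flatten_history(history_dir):
    """Yield (file name, element path, text) for every non-empty leaf element."""
    for xml_path in sorted(Path(history_dir).glob("*.xml")):
        root = ET.parse(str(xml_path)).getroot()
        stack = [(root, root.tag)]
        while stack:
            elem, path = stack.pop()
            children = list(elem)
            if not children:
                text = (elem.text or "").strip()
                if text:
                    yield xml_path.name, path, text
            # reversed() keeps document order when popping from the stack.
            for child in reversed(children):
                stack.append((child, f"{path}/{child.tag}"))

def history_to_csv(history_dir, out_file):
    """Write the flattened history of one History folder as CSV."""
    writer = csv.writer(out_file)
    writer.writerow(["file", "element", "text"])
    writer.writerows(flatten_history(history_dir))
```

To use it, open an output file with `open("history.csv", "w", newline="", encoding="utf-8")` and pass it to `history_to_csv` together with the path of the History folder for your case and index. Since the real element names are unknown here, the "element" column will need interpreting once you see the actual files.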
