Bug #528

Task #490: RCP: 0.7.5 Fix 0.7.5 beta bugs

RCP: 0.7.5, xml/w import module, some xml tags are indexed as words

Added by Alexey Lavrentev over 5 years ago. Updated over 3 years ago.

Status: Closed
Start date: 01/17/2014
Priority: Normal
Due date:
Assignee: Matthieu Decorde
% Done: 100%
Category: Import
Spent time: -
Target version: TXM 0.7.5

Description

Some XML tags from the source documents appear as words in lexical indexes, e.g. tokens matching

</?ab.*>
in the Schiller corpus (check the source documents and binary corpus at /SpUV/Schiller).

The same sources were correctly imported with TXM 0.7.2 with the same parameters...

In the BVHEPISTEMON2014 corpus, such misinterpreted tags are very numerous.
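
To get a rough count of how many index entries are affected, the pattern quoted above can be applied to an exported word list. The following is only a sketch: it assumes the lexical index has been exported as a plain list of word forms, and the function name find_tag_like_tokens is hypothetical, not part of TXM.

```python
import re

# Pattern quoted in the description: an opening or closing tag whose name starts with "ab".
TAG_LIKE = re.compile(r"</?ab.*>")


def find_tag_like_tokens(tokens):
    """Return the tokens that match the tag pattern in full (assumed input: word forms)."""
    return [token for token in tokens if TAG_LIKE.fullmatch(token)]


if __name__ == "__main__":
    # Made-up sample, only to show the call; not taken from the Schiller corpus.
    sample = ["der", "<ab>", "und", "</ab>"]
    for token in find_tag_like_tokens(sample):
        print(token)
```

Note that the pattern, as written in the report, also matches any other tag whose name begins with "ab".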

History

#1 Updated by Alexey Lavrentev over 5 years ago

  • Description updated (diff)

#2 Updated by Matthieu Decorde over 5 years ago

  • % Done changed from 0 to 70

Fixed bugs in the SattributeListener class:
- structure depth
- missing properties

#3 Updated by Matthieu Decorde over 5 years ago

  • Parent task set to #490

#4 Updated by Matthieu Decorde over 5 years ago

I've added a test after the cwb-encode call to check whether the registry file was created. This should help people spot the bug.
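
The kind of check described in this note could look roughly like the sketch below. It is not the TXM implementation: the function and exception names are hypothetical, and the caller is assumed to know the path of the registry file that cwb-encode was asked to write.

```python
from pathlib import Path


class RegistryMissingError(RuntimeError):
    """Raised when the expected registry file is absent or empty (hypothetical name)."""


def check_registry_created(registry_file):
    """Return the registry path if it exists and is non-empty, otherwise raise."""
    path = Path(registry_file)
    # Treating an empty file as a failed encode is an assumption of this sketch.
    if not path.is_file() or path.stat().st_size == 0:
        raise RegistryMissingError(
            f"cwb-encode did not create the registry file: {path}"
        )
    return path
```

Calling such a check right after the encoding step turns a silent failure into an explicit error at import time, instead of a corpus that only misbehaves later.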

#5 Updated by Matthieu Decorde over 5 years ago

  • % Done changed from 70 to 100

#6 Updated by Matthieu Decorde over 5 years ago

  • Status changed from New to Closed

#7 Updated by Matthieu Decorde over 3 years ago

  • Category set to Import
