Bug #1328

RCP: 0.7.7, Description displays nothing for corpus CORPUSCONTES

Added by Benedicte Pincemin about 4 years ago. Updated about 4 years ago.

Status: New
Start date: 05/11/2015
Priority: Normal
Due date:
Assignee: -
% Done: 0%
Category: Import
Spent time: -
Target version: TXM X.X

Description

cf. mail BP, May 11th 2015 https://listes.ens-lyon.fr/sympa/arc/textometrie/2015-05/msg00040.html

Bug observed on the following corpus:
/SpUV/DeniseMalrieu/150430/CORPUSCONTES..txm
(the sources are in the neighboring directory).

When I request a Description of the corpus, no result is displayed (no tab opens).

In 0.7.7, with the log level set to maximum, the console shows:

Requête sur CORPUSCONTES : Q1 <- <body_n>[]
Requête sur CORPUSCONTES : Q2 <- <cit_n>[]
Requête sur CORPUSCONTES : Q3 <- <cit_rend>[]
Requête sur CORPUSCONTES : Q4 <- <div_n>[]
Requête sur CORPUSCONTES : Q5 <- <div_type>[]
Requête sur CORPUSCONTES : Q6 <- <dn_n1>[]
Requête sur CORPUSCONTES : Q7 <- <dn_denom>[]
Requête sur CORPUSCONTES : Q8 <- <dn_n>[]
Requête sur CORPUSCONTES : Q9 <- <dn_denom_n>[]
Requête sur CORPUSCONTES : Q10 <- <glose_n>[]
Requête sur CORPUSCONTES : Q11 <- <head_n>[]
Requête sur CORPUSCONTES : Q12 <- <morale_n>[]
Requête sur CORPUSCONTES : Q13 <- <nd_n>[]
Requête sur CORPUSCONTES : Q14 <- <pb_n>[]
Requête sur CORPUSCONTES : Q15 <- <pr_n1>[]
Requête sur CORPUSCONTES : Q16 <- <pr_n>[]
Requête sur CORPUSCONTES : Q17 <- <q_n>[]
Requête sur CORPUSCONTES : Q18 <- <q_type>[]
Requête sur CORPUSCONTES : Q19 <- <q_rend1>[]
Requête sur CORPUSCONTES : Q20 <- <q_rend2>[]
Requête sur CORPUSCONTES : Q21 <- <q_type2>[]
Requête sur CORPUSCONTES : Q22 <- <q_type1>[]
Requête sur CORPUSCONTES : Q23 <- <q_n1>[]
Requête sur CORPUSCONTES : Q24 <- <q_n2>[]
Requête sur CORPUSCONTES : Q25 <- <q_rend>[]
Requête sur CORPUSCONTES : Q26 <- <seg_n>[]
Requête sur CORPUSCONTES : Q27 <- <seg_rend1>[]
Requête sur CORPUSCONTES : Q28 <- <seg_n1>[]
Requête sur CORPUSCONTES : Q29 <- <seg_ana>[]
Requête sur CORPUSCONTES : Q30 <- <seg_ana1>[]
Requête sur CORPUSCONTES : Q31 <- <seg_rend>[]
Requête sur CORPUSCONTES : Q32 <- <text_id>[]
Requête sur CORPUSCONTES : Q33 <- <text_base>[]
Requête sur CORPUSCONTES : Q34 <- <text_project>[]
Requête sur CORPUSCONTES : Q35 <- <title_n>[]
Requête sur CORPUSCONTES : Q36 <- <title_rend>[]
Requête sur CORPUSCONTES : Q37 <- <txmcorpus_lang>[]
Requête sur CORPUSCONTES : Q38 <- <vd_n>[]

In 0.7.6:

Requête sur CORPUSCONTES : Qf28ae566-2aae-4c46-860d-daed39bc8a15 <- <body_n>[]
Requête sur CORPUSCONTES : Q17b3e08c-c68c-40ec-bbc7-79e4059ea9f5 <- <cit_n>[]
Requête sur CORPUSCONTES : Q0305a56c-68c8-4445-a589-6ebcd2fba350 <- <cit_rend>[]
Requête sur CORPUSCONTES : Q95a53bee-8439-495c-8e6a-08bc299aa8c2 <- <div_n>[]
Requête sur CORPUSCONTES : Qaedc6ae9-2bbd-48e0-9ed7-9302c06dadc5 <- <div_type>[]
Requête sur CORPUSCONTES : Qf276e4eb-7231-40ff-8c0c-ef6be4ad9369 <- <dn_n1>[]
Requête sur CORPUSCONTES : Q31009cfd-22c3-42a7-bd81-ea816f03c144 <- <dn_denom>[]
Requête sur CORPUSCONTES : Qe018c77f-9f84-4af6-b3fa-ef77c09eff3e <- <dn_n>[]
Requête sur CORPUSCONTES : Q1a528221-6eb2-4c12-bd15-033f210ca4d4 <- <dn_denom_n>[]
Requête sur CORPUSCONTES : Q49287604-b9ef-4520-a3ce-49030c828f60 <- <glose_n>[]
Requête sur CORPUSCONTES : Qe8160574-5da3-4a6d-b381-c42b0be8559c <- <head_n>[]
Requête sur CORPUSCONTES : Qd54d1ebc-c73e-4266-9612-569a18174abd <- <morale_n>[]
Requête sur CORPUSCONTES : Q2e204c9e-3519-45ab-8c87-3ae8b8394e34 <- <nd_n>[]
Requête sur CORPUSCONTES : Q305babd6-14a5-4184-a0f2-c4fb01dc5f04 <- <pb_n>[]
Requête sur CORPUSCONTES : Q9068d756-ca80-493d-84a8-519821044d22 <- <pr_n1>[]
Requête sur CORPUSCONTES : Q86abebfa-395c-41d9-a71d-72f082808787 <- <pr_n>[]
Requête sur CORPUSCONTES : Q08b05a08-7428-4a4e-b1b8-b6328014a29e <- <q_n>[]
Requête sur CORPUSCONTES : Q1e52de72-b354-492b-a3fe-d6b3b43aea78 <- <q_type>[]
Requête sur CORPUSCONTES : Qca2deccf-fc3f-4e2a-aebe-c908bda38d4c <- <q_rend1>[]
Requête sur CORPUSCONTES : Qa886f2b0-96a4-49bd-a1db-fb3b784490a4 <- <q_rend2>[]
Requête sur CORPUSCONTES : Qdf2766fb-1fa2-41f3-bb49-5ff7e3f843b7 <- <q_type2>[]
Requête sur CORPUSCONTES : Q68c9fec2-7312-4664-bb0e-95cae9bc3369 <- <q_type1>[]
Requête sur CORPUSCONTES : Q7380809a-3644-46be-9112-559997171b99 <- <q_n1>[]
Requête sur CORPUSCONTES : Q9873f962-d978-457a-b94a-2254299e5e8f <- <q_n2>[]
Requête sur CORPUSCONTES : Qb89147e4-54bc-42b7-ae3d-13fa529ef4aa <- <q_rend>[]
Requête sur CORPUSCONTES : Q07ea1f1f-8da4-4069-a462-45cceb0f9cd9 <- <seg_n>[]
Requête sur CORPUSCONTES : Qb92d9336-8ed4-4a3c-996b-744f8f1f9401 <- <seg_rend1>[]
Requête sur CORPUSCONTES : Q77f6fdf4-6b57-4b00-b3e4-c33da40742d4 <- <seg_n1>[]
Requête sur CORPUSCONTES : Q00a1930e-0305-4aa0-b9b7-a10312e54624 <- <seg_ana>[]
Requête sur CORPUSCONTES : Q5d37e1be-df5d-4300-a8ff-b4de4c0a5977 <- <seg_ana1>[]
Requête sur CORPUSCONTES : Qc2ab7876-f218-4461-b9a4-6c8a60a56f99 <- <seg_rend>[]
Requête sur CORPUSCONTES : Q1d21230f-3700-479a-9640-5bb67d7e4c8f <- <text_id>[]
Requête sur CORPUSCONTES : Qc606efb0-c7a4-4501-b979-c88f51adbcdb <- <text_base>[]
Requête sur CORPUSCONTES : Q08012218-2515-4490-be8b-974bad680db1 <- <text_project>[]
Requête sur CORPUSCONTES : Q3518d2e8-048c-42b1-8bb8-c9b9c065b9ca <- <title_n>[]
Requête sur CORPUSCONTES : Q38c88053-8b0c-4be7-9dc2-83529c31c6f8 <- <title_rend>[]
Requête sur CORPUSCONTES : Q970e7241-8e18-45f7-8d15-39b765deb650 <- <txmcorpus_lang>[]
Requête sur CORPUSCONTES : Q524e6734-a30d-472a-96c3-d6da0c624907 <- <vd_n>[]
java.lang.NullPointerException
    at org.txm.functions.diagnostic.Diagnostic.htmldump(Diagnostic.java:344)
    at org.txm.functions.diagnostic.Diagnostic.toHTML(Diagnostic.java:574)
    at org.txm.rcpapplication.commands.base.Diagnostique$1.run(Diagnostique.java:162)
    at org.eclipse.core.internal.jobs.Worker.run(Worker.java:53)
Effectué en 1.1 sec.

Other observations:

1) This corpus is made up of three texts that were also imported separately (1 corpus = 1 text). On those corpora, Description runs fine.

2) I wondered whether this could be linked to possible heterogeneity in the markup, in the sense that one text may be slightly less annotated than another, so that a given structure is not necessarily represented in every text. But after discussing with Alexei, this situation already occurs in the BFM and does not cause an error.

Diagnostic

Diagnostic 1

  • Hypothesis: the bug could be linked to a lack of memory, because the TXM instance in question holds many corpora and partitions, and the CORPUSCONTES corpus has many structures.
  • Observation: about half of the corpora and partitions present in that TXM were deleted, but the malfunction persists (same logs).
  • Conclusion
    • the bug is perhaps not due to a simple lack of memory, but this would need to be tested more decisively.

History

#1 Updated by Matthieu Decorde about 4 years ago

  • Category set to Commands
  • Target version set to TXM X.X

#2 Updated by Benedicte Pincemin about 4 years ago

  • Description updated (diff)

#3 Updated by Sebastien Jacquot about 4 years ago

  • Category changed from Commands to Import

The problem seems linked to the "pb" tag files. CQP log:

END OF STARTmmapfile()<storage.c>: Can't mmap() file C:\Users\s\TXM\corpora\corpuscontes\data\CORPUSCONTES\pb_n.rng ...
    You have probably run out of memory / address space!
    Error Message: attributes:load_component(): Warning:
  Data of STRUC component of attribute pb_n can't be loaded

After checking, the files "pb.rng", "pb_n.avs", "pb_n.avx" and "pb_n.rng" in the "data" directory are 0 KB in size, which leads to a null pointer exception in org.txm.functions.diagnostic.Diagnostic.htmldump().
Is it normal for these files to be empty? I can see some other empty files in the Graal corpora.
Otherwise, this issue seems to be related to import/loading.
A fix could be to test whether the value is null in org.txm.functions.diagnostic.Diagnostic.htmldump(). But this issue leads to other strange behaviors: for example, the user can select pb_n in the partition creation dialog, even though pb tags do not contain any text nodes.
The internal view on "pb" also says "n=null".
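
To illustrate the null test suggested above, here is a minimal sketch. It is not the actual Diagnostic.htmldump() code: the NullSafeDump class, its dump() method and the map of property values are hypothetical names introduced only for this example. The assumption is that a property whose index files are empty ends up with a null value list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper; NOT the actual org.txm Diagnostic code. */
final class NullSafeDump {

    /**
     * Renders one line per structural property. A property whose value list
     * is null is reported explicitly instead of being dereferenced.
     */
    static List<String> dump(Map<String, List<String>> valuesByProperty) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : valuesByProperty.entrySet()) {
            List<String> values = entry.getValue();
            if (values == null) {
                // The case that would otherwise raise a NullPointerException.
                lines.add(entry.getKey() + ": no values could be loaded");
            } else {
                lines.add(entry.getKey() + " (" + values.size() + ") = "
                        + String.join(", ", values));
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        Map<String, List<String>> props = new LinkedHashMap<>();
        props.put("pb_n", null); // assumed effect of the empty pb_n files
        props.put("text_id", Arrays.asList("t1", "t2"));
        dump(props).forEach(System.out::println);
    }
}
```

Reporting the property by name, rather than skipping it silently, keeps the empty structure visible to the user.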

#4 Updated by Sebastien Jacquot about 4 years ago

Well, importing from source without the Sorciere.xml file, which is the only one containing a "<pb/>" tag, does not remove this issue: the same null pointer exception occurs. The second thing I noticed in this file is that the "pb" tag already contains an "n" attribute. Could it conflict with the "n" attribute added during import?

#5 Updated by Serge Heiden about 4 years ago

Even if the problem comes from an import module, the Description command should give a diagnostic message about the corpus inaccessibility - not just display nothing.
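
As a rough sketch of that behavior, a command could be wrapped so that a failure or an empty result always produces a message. The DiagnosticRunner class and the runOrExplain() method are hypothetical names, not part of TXM, and the command is modeled as a plain Supplier for the sake of the example.

```java
import java.util.function.Supplier;

/** Hypothetical wrapper; not an existing TXM class. */
final class DiagnosticRunner {

    /**
     * Runs a command and always returns something displayable:
     * the result itself, or a message saying why there is none.
     */
    static String runOrExplain(String commandName, Supplier<String> command) {
        try {
            String result = command.get();
            if (result == null) {
                return commandName + ": no result produced";
            }
            return result;
        } catch (RuntimeException e) {
            // Surface the failure instead of leaving the user with no tab.
            return commandName + " failed: "
                    + e.getClass().getSimpleName() + ": " + e.getMessage();
        }
    }
}
```

For example, runOrExplain("Description", () -> { throw new IllegalStateException("index missing"); }) returns "Description failed: IllegalStateException: index missing".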

#6 Updated by Alexey Lavrentev about 4 years ago

An empty pb_n index file should clearly not be a problem, as this actually happens in all corpora (and with all empty tags). Ideally, empty tags should not be indexed by the search engine, to avoid confusion.
A conflict between a supplied and an automatically added @n should not be a problem either. We should investigate further.

#7 Updated by Sebastien Jacquot about 4 years ago

Thanks for the clarifications, I understand now.
@Serge: it is because I think the case should never appear that I said the problem should be handled in the import layer rather than in each command, so as to resolve all the other issues as well (partition creation, internal view, etc.). But I may not know the import and CQP processes well enough at this point.

#8 Updated by Serge Heiden about 4 years ago

@Seb: "the case should never appear": a bug is a situation in which what should never appear appears.
My point concerns in fact our general diagnostic message policy: in a diagnostic mode (for example for a specific log level), components should not trust data coming from other components and in case of problems should diagnose what is problematic in the data from their point of view. The idea is to help the user to trace back to the original component of a problem.

#9 Updated by Sebastien Jacquot about 4 years ago

@Serge: yes, thanks, you're right.

About the original issue, it seems to come from the tag "<DN_DENOM>" in the XML source files. Maybe the "_" character is the problem. Replacing "DN_DENOM" with "DNDENOM" removes the null pointer exception on my TXM version.

Another possibility is a conflict with the "<DN>" tag. The Diagnostic class or the importer may have a problem with this kind of nesting ("recursive naming": DN, DN_). This still needs to be verified:


<DN>
<DN_DENOM></DN_DENOM>
</DN>

Regarding the Java/CQP corpus hierarchy, we should try to keep the Java structural unit/property hierarchy more consistent with the CQP structural unit/property hierarchy. Here is the CQP log for the Graal corpus when calling the Description command:

mmapfile()<storage.c>: Can't mmap() file C:\Users\s\TXM\corpora\graal\data\GRAAL\front_n.rng ...
    You have probably run out of memory / address space!
    Error Message: attributes:load_component(): Warning:
  Data of STRUC component of attribute front_n can't be loaded
mmapfile()<storage.c>: Can't mmap() file C:\Users\s\TXM\corpora\graal\data\GRAAL\hi_rend.rng ...
    You have probably run out of memory / address space!
    Error Message: attributes:load_component(): Warning:
  Data of STRUC component of attribute hi_rend can't be loaded
mmapfile()<storage.c>: Can't mmap() file C:\Users\s\TXM\corpora\graal\data\GRAAL\hi_n.rng ...
    You have probably run out of memory / address space!
    Error Message: attributes:load_component(): Warning:
  Data of STRUC component of attribute hi_n can't be loaded
mmapfile()<storage.c>: Can't mmap() file C:\Users\s\TXM\corpora\graal\data\GRAAL\lb_ed.rng ...
    You have probably run out of memory / address space!
    Error Message: attributes:load_component(): Warning:
  Data of STRUC component of attribute lb_ed can't be loaded
mmapfile()<storage.c>: Can't mmap() file C:\Users\s\TXM\corpora\graal\data\GRAAL\lb_type.rng ...
    You have probably run out of memory / address space!
    Error Message: attributes:load_component(): Warning:
  Data of STRUC component of attribute lb_type can't be loaded
mmapfile()<storage.c>: Can't mmap() file C:\Users\s\TXM\corpora\graal\data\GRAAL\lb_n.rng ...
    You have probably run out of memory / address space!
    Error Message: attributes:load_component(): Warning:
  Data of STRUC component of attribute lb_n can't be loaded
mmapfile()<storage.c>: Can't mmap() file C:\Users\s\TXM\corpora\graal\data\GRAAL\milestone_n.rng ...
    You have probably run out of memory / address space!
    Error Message: attributes:load_component(): Warning:
  Data of STRUC component of attribute milestone_n can't be loaded
mmapfile()<storage.c>: Can't mmap() file C:\Users\s\TXM\corpora\graal\data\GRAAL\milestone_ed.rng ...
    You have probably run out of memory / address space!
    Error Message: attributes:load_component(): Warning:
  Data of STRUC component of attribute milestone_ed can't be loaded
mmapfile()<storage.c>: Can't mmap() file C:\Users\s\TXM\corpora\graal\data\GRAAL\milestone_unit.rng ...
    You have probably run out of memory / address space!
    Error Message: attributes:load_component(): Warning:
  Data of STRUC component of attribute milestone_unit can't be loaded
mmapfile()<storage.c>: Can't mmap() file C:\Users\s\TXM\corpora\graal\data\GRAAL\pb_n.rng ...
    You have probably run out of memory / address space!
    Error Message: attributes:load_component(): Warning:
  Data of STRUC component of attribute pb_n can't be loaded

In addition, this makes it possible to create subcorpora or partitions that contain no tokens.

#10 Updated by Serge Heiden about 4 years ago

'it seems to come from the tag "<DN_DENOM>" in the XML source files'

XML source tag names MUST NOT contain '_' characters, because the CQP indexing system uses that character to manage pseudo-recursion of XML elements (recursion is flattened by creating new element names that concatenate the recursion information, separated by '_').

We must describe exhaustively in the XML import specifications the constraints for the names of:
  • file paths (no accents or space, etc?)
  • file names (.xml extension?)
  • xml element names (no underscore?)
  • xml attribute names
  • xml entities constraints (no entities?)

Then we must enforce that policy BEFORE the CQP indexing process, because CQP indexing does not enforce a strict policy with respect to its own constraints.

Then we must document that policy in the TXM manual.

#11 Updated by Benedicte Pincemin about 4 years ago

Some other remarks:

1) I can get a Description for each of the 3 corpora built from a single text. In particular, it works for the corpus Perrault7, which contains <DN_DENOM> tags.

2) During the import of the corpus, with logs set to max, I get the following lines:
[...]
Démarrage avec la ligne de commande : /usr/lib/TXM/TXM/../cwb/bin/cwb-encode -d /home/bpincemi/TXM/corpora/CORPUSCONTESFROMSRC/data/CORPUSCONTESFROMSRC -f /home/bpincemi/TXM/corpora/CORPUSCONTESFROMSRC/wtc/CORPUSCONTESFROMSRC.wtc -R /home/bpincemi/TXM/corpora/CORPUSCONTESFROMSRC/registry/corpuscontesfromsrc -c utf8 -xsB -P id -P frpos -P frlemma -P n -P type -S body:0+n -S cit:0+rend+n -S div:0+n+type -S dn:1+n -S dn_denom:0+n -S glose:0+n -S head:0+n -S morale:0+n -S nd:0+n -S pb:0+n -S pr:1+n -S q:2+rend+type+n -S seg:1+rend+ana+n -S text:0+id+base+project -S title:0+rend+n -S txmcorpus:0+lang -S vd:0+n
Malformed tag </tei.2>, inserted literally (file /home/bpincemi/TXM/corpora/CORPUSCONTESFROMSRC/wtc/CORPUSCONTESFROMSRC.wtc, line #23192).
Malformed tag </tei.2>, inserted literally (file /home/bpincemi/TXM/corpora/CORPUSCONTESFROMSRC/wtc/CORPUSCONTESFROMSRC.wtc, line #46302).
Malformed tag </tei.2>, inserted literally (file /home/bpincemi/TXM/corpora/CORPUSCONTESFROMSRC/wtc/CORPUSCONTESFROMSRC.wtc, line #57557).
  • Échec de récupération de la dimension du corpus CORPUSCONTES
    [...]

#12 Updated by Serge Heiden about 4 years ago

The previous diagnostic seems to show that:
  • a) it may be possible to use the '_' character in tag names in XML sources (though it is difficult to assess the impact on CQL queries that express constraints on structural property values and use the special '_' character, e.g. [_.text_loc="dg"])
  • b) it seems impossible to use the '.' character in tag names in XML sources (the error message relates to the closing tag, but we can try to extrapolate. The same remark applies to constraints on structural property values in CQL queries, because the '.' character has a special role there, e.g. [_.text_loc="dg"])

Conclusion: I once sent an email summarizing what I understood of the constraints related to tag names in CQP source code. We should find that, put it in a spec and enforce an input policy as detailed above in the TXM XML import modules.
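
To make the idea of an input policy concrete, here is a minimal sketch of a check that an import module could run before CQP indexing. The rule set (flagging '_' and '.' in element names) is only an assumption drawn from this discussion, since the real constraints still have to be specified. ElementNamePolicy and violations() are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical pre-indexing check; the rule set is an assumption. */
final class ElementNamePolicy {

    /** Returns one message per problem found in the given element names. */
    static List<String> violations(List<String> elementNames) {
        List<String> problems = new ArrayList<>();
        for (String name : elementNames) {
            if (name.indexOf('_') >= 0) {
                problems.add(name + ": contains '_'");
            }
            if (name.indexOf('.') >= 0) {
                problems.add(name + ": contains '.'");
            }
        }
        return problems;
    }
}
```

With the element names seen in this issue (DN, DN_DENOM, TEI.2, pb), the check would flag DN_DENOM and TEI.2 and accept the other two.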

#13 Updated by Sebastien Jacquot about 4 years ago

This bug seems to have been fixed by the last commit, which tests whether the value is null in Diagnostic.htmldump().
For information, with the original unfixed code:
- I see the same behavior as Bénédicte: each text imported separately works, but not all of them together.
- here is the description, which shows a possible conflict around the <DN_DENOM> tag, maybe because of the "_" character or the <DN><DN_DENOM> hierarchy.

• dn
◦ denom (1) = "".
◦ n()...
◦ n1()...
• dn_denom
◦ n (6) = 0, 1, 2, 3, 4, 5.

(There are no <DENOM> tags in source files, only <DN_DENOM> tags)

Other notes on the current import process that we may want to write down somewhere:

  • the "_" and "." characters in an XML tag name seem valid, if I understand the W3C recommendation correctly: http://www.w3.org/TR/xml/#sec-common-syn (a bit tricky to read)
  • when the "." character is present in a source tag name, the structure seems to be ignored during import into TXM, e.g. here the "TEI.2" tag is simply removed

#14 Updated by Sebastien Jacquot about 4 years ago

Sebastien Jacquot wrote:

• dn
◦ denom (1) = "".
◦ n()...
◦ n1()...
• dn_denom
◦ n (6) = 0, 1, 2, 3, 4, 5.

(There are no <DENOM> tags in source files, only <DN_DENOM> tags)

Actually, this behavior is still present in the latest SVN revision. The internal structure hierarchy stores a dn => denom entry, which seems to be an error because there is no denom tag in the corpora. This leads to some invalid options, e.g. a partition or subcorpus on DN => DENOM.
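
One way such a dn => denom entry could arise is if CQP attribute names are split into structure and property at the last underscore. This is only a hypothesis, and the sketch below is not TXM's actual code: AttributeNameSplit and group() are hypothetical names. It is applied to the four dn attribute names from the query log in the description.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration; not the actual TXM hierarchy code. */
final class AttributeNameSplit {

    /** Groups attribute names by structure, splitting at the LAST '_'. */
    static Map<String, List<String>> group(List<String> attributeNames) {
        Map<String, List<String>> byStructure = new LinkedHashMap<>();
        for (String attr : attributeNames) {
            int cut = attr.lastIndexOf('_');
            if (cut < 0) {
                continue; // no property part in this name
            }
            String structure = attr.substring(0, cut);
            String property = attr.substring(cut + 1);
            byStructure.computeIfAbsent(structure, k -> new ArrayList<>())
                       .add(property);
        }
        return byStructure;
    }
}
```

Given dn_n1, dn_denom, dn_n and dn_denom_n, this yields dn with the properties n1, denom and n, and dn_denom with the property n. That is the same grouping as in the Description output quoted above, where dn_denom, the structure's own name, is read as a "denom" property of dn.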
