====== XML Extraccion del POS dins lo fichièr TEI del Congrès ====== La tòca es de traire l'informacion necessari dempuèi un fichièr TEI:XML gigantàs (18,4 Mo) que conten tota l'informacion del basicòt Francés/Lengadocian-Gascon. Lo diccionari presenta l'interès d'aver sos tèrmes amb d'etiquetas POS.\\ Lo basicòt compta 13610 tèrmes. L'attribut "XML:lang" foncionava pas dins l'analisi sintaxica amb la librarià lxml. Doncas l'attribut foguèt remplaçada per "language" dins tot lo fichièr. ===== Extrach del fichièr basicunif_complet_language.xml===== ... ...
abaisser v.
abaishar
baishar
abaissar
baissar
diminuer
amermar
...
===== Programa basic.py ===== D'unes còps lo còde TEI es pas forçadament plan format quand i a una error, lo programa se deu pas arrestar, l'error es marcada dins la resulta per èsser escartat. #!/usr/bin/env python3 # coding: utf8 import json from lxml import etree tree = etree.parse("basicunif_complet_language.xml") f = open("basicunif_complet_extract.txt",'w') f.write("id,fr,posfr,sens,dialect,oc,posoc\n".encode('utf-8')) for entry in tree.xpath("/TEI/text/body/entry"): entryId = entry.get('n') orth = entry.find("form/orth").text gramNorm = entry.find("form/gramGrp/gram").get('norm') for sense in entry.xpath("sense"): for cit in sense.xpath("cit"): try: langcit = cit.find("form").get('dialect') except AttributeError: langcit = "langcit error" # orthcit orthcit = cit.find("form/orth") try: orthcitText = orthcit.text except AttributeError: orthcitText = "orthcit error" #gramNormcit try: gramNormCit = cit.find("form/gramGrp/gram").get('norm') except AttributeError: gramNormCit = "gramNormCit error" s = "\"%s\", \"%s\",\"%s\",\"%s\",\"%s\",\"%s\",\"%s\"\n" % (entryId, orth, gramNorm, sense.get('n'), langcit, orthcitText, gramNormCit) #print(s) f.write(s.encode('utf-8')) f.close() ===== Resulta ===== La sortida es en format csv per èsser integrat dins una basa de donadas SQL e NoSQL. id,fr,posfr,sens,dialect,oc,posoc "1", "à","AP1","I-A-1-a","oc","a","AP1" "1", "à","AP1","I-A-2-a","oc","per","AP1" "1", "à","AP1","I-A-2-a","oc-gascon","entà","AP1" "1", "à","AP1","I-A-3-a","oc","a","AP1" "1", "à","AP1","I-A-4-a","oc","de","AP1" "1", "à","AP1","I-A-5-a","oc","de","AP1" "1", "à","AP1","I-A-6-a","oc-gascon","entà","AP1" "1", "à","AP1","I-A-6-a","oc-lengadoc","a","AP1" "2", "à cloche-pied","AV0000","I-A-1-a","oc-gascon","a sautapè","AV0000" "2", "à cloche-pied","AV0000","I-A-1-a","oc-lengadoc","a pè-ranquet","AV0000" "3", "à contrecoeur","AV0000","I-A-1-a","oc","a contracòr","AV0000" "3", "à contrecoeur","AV0000","I-A-1-a","oc-gascon","d’arrèrcòr","AV0000" "3", "à contrecoeur","AV0000","I-A-1-a","oc-lengadoc","de rèirecòr","AV0000" "4", "a fortiori","AV0000","I-A-1-a","oc","a fortiori","AV0000" "5", "à jeun","AV0000","I-A-1-a","oc-gascon","dejun","AJ0110" "5", "à jeun","AV0000","I-A-1-a","oc-lengadoc","dejun","AJ0110" "6", "à peu près","AP1","I-A-1-a","oc-gascon","haut o baish","AP1" "6", "à peu près","AP1","I-A-1-a","oc-gascon","mei o mensh","AP1" "6", "à peu près","AP1","I-A-1-a","oc-lengadoc","a pauc près","AP1" "6", "à peu près","AP1","I-A-1-a","oc-lengadoc","pauc se'n manca","AP1" "7", "à rebours","AV0000","I-A-1-a","oc-gascon","au reboish","AV0000" "7", "à rebours","AV0000","I-A-1-a","oc-lengadoc","a revèrs","AV0000" "8", "à reculons","AV0000","I-A-1-a","oc-gascon","d'arreculas","AV0000" "8", "à reculons","AV0000","I-A-1-a","oc-gascon","de reculas","AV0000" "8", "à reculons","AV0000","I-A-1-a","oc-lengadoc","de reculons","AV0000" "9", "à tâtons","AV0000","I-A-1-a","oc-gascon","a paupas","AV0000" "9", "à tâtons","AV0000","I-A-1-a","oc-lengadoc","a palpas","AV0000" "10", "à-coup","N1110","I-A-1-a","oc-gascon","sacat","N1110" "10", "à-coup","N1110","I-A-1-a","oc-lengadoc","bassacada","N1210" "11", "à-peu-près","N1110","I-A-1-a","oc","a-pauc-près","N1110" "11", "à-peu-près","N1110","I-A-2-a","oc","aproximacion","N1110" "12", "abaisser","V000050000000","I-A-1-a","oc-gascon","abaishar","V000050000000" "12", "abaisser","V000050000000","I-A-1-a","oc-gascon","baishar","V000050000000" "12", "abaisser","V000050000000","I-A-1-a","oc-lengadoc","abaissar","V000050000000" "12", "abaisser","V000050000000","I-A-1-a","oc-lengadoc","baissar","V000050000000" "12", "abaisser","V000050000000","I-A-2-a","oc","amermar","V000050000000" ... ===== Tèrmes en error ===== |3034 | de | AT3000|I-A-1-a|langcit error | orthcit error|gramNormCit error| |3720 | du |AT3100 |I-A-1-a |langcit error |orthcit error |gramNormCit error| |13591 | je |PD312011500002 |I-A-1-a |langcit error |orthcit error |gramNormCit error| |13593 | tu |PD211011500022 |I-A-1-a |langcit error |orthcit error |gramNormCit error| |13595 |il |PD311011500002 |I-A-1-a |langcit error |orthcit error |gramNormCit error| |13596 | je |PD312011500002 |I-A-1-a |langcit error |orthcit error |gramNormCit error| ===== CSV botat dins una taula de MySQL ===== lo nom de la basa de donadas 'cplo_basic_unificat'. ===== Comolacion dels POS per un sens donat===== SELECT idterm, fr, posfr, group_concat( sens ) AS sens, group_concat( dialect ) AS dialectes, group_concat( oc ) AS occitans, group_concat( posoc ) AS posoccitans FROM `cplo_basic_unificat` WHERE dialect != 'langcit error' GROUP BY idterm, sens ===== Extracion en CSV =====